Platform Architecture

The Qualiz platform is a cloud-agnostic, Kubernetes-based, microservices-driven ETL and Data Quality solution.
It is designed to orchestrate data ingestion, transformation, quality checks, deduplication, enrichment, and AI-assisted data operations at scale.
The platform is extensible and highly available, and it supports multi-tenancy through project-level isolation and role-based access control.

Architectural Goals

  • Scalability – Handle multiple large-scale pipelines concurrently.
  • Extensibility – Add new operators and processing engines without service downtime.
  • Observability – Full visibility into pipeline execution, audit logs, and system health.
  • Security – Strong identity management, RBAC, and secure secrets handling.
  • Portability – Deployable on any major cloud provider or on-prem Kubernetes cluster.
  • AI Integration – Native AI support for cleansing rule generation and job monitoring.

High-Level Architecture

The platform consists of four main layers:

1. Presentation Layer

  • Webapp (React) – Provides the UI for pipeline creation, data source configuration, monitoring, and audit viewing.
  • Authentication (Keycloak) – Central identity provider using OIDC for both UI and service-to-service authentication (a token-request sketch follows this list).
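
For service-to-service calls, a client obtains an access token from Keycloak's standard OIDC token endpoint. A minimal Python sketch, assuming a hypothetical realm, client ID, and secret:

```python
import requests

# Hypothetical values; the realm, client ID, and secret depend on the Keycloak setup.
KEYCLOAK_URL = "https://keycloak.example.com"
REALM = "qualiz"
CLIENT_ID = "backend-api"
CLIENT_SECRET = "change-me"

def get_service_token() -> str:
    """Fetch an access token via the OIDC client-credentials grant."""
    resp = requests.post(
        f"{KEYCLOAK_URL}/realms/{REALM}/protocol/openid-connect/token",
        data={
            "grant_type": "client_credentials",
            "client_id": CLIENT_ID,
            "client_secret": CLIENT_SECRET,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]
```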

2. Application Layer (Microservices)

  • Backend API – Core orchestration API for pipelines, DAG generation, Airbyte integrations, and audit management.
  • AI API – Interfaces with Ollama for AI-assisted cleansing rules and anomaly detection (a sample call appears after this list).
  • Notification Service – Sends email alerts; webhook and Slack alerts are planned.
  • Custom Operator Logic – Encapsulated in microservices or container images for specific ETL tasks.
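
To illustrate the AI API's role, the sketch below calls Ollama's REST generate endpoint to propose cleansing rules. The in-cluster hostname, model name, and prompt format are assumptions, not the platform's actual contract:

```python
import requests

# Assumed in-cluster service address; Ollama listens on port 11434 by default.
OLLAMA_URL = "http://ollama.platform.svc.cluster.local:11434"

def suggest_cleansing_rules(column_sample: list[str]) -> str:
    """Ask a local Ollama model to propose cleansing rules for sample column values."""
    prompt = (
        "Given these sample column values, propose data cleansing rules:\n"
        + "\n".join(column_sample)
    )
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": "llama3", "prompt": prompt, "stream": False},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```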

3. Processing & Orchestration Layer

  • Airflow – Primary workflow orchestrator; dynamically executes DAGs generated from the UI (a generated-DAG sketch follows this list).
  • Airbyte – Manages data ingestion from various sources to destinations (triggered via Backend API).
  • Apache Beam – Cluster-based execution for distributed data processing.
  • Custom Operators – Python task runner, SQL task runner, cleansing, deduplication, notification, sub-job invoker.
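
To make the orchestration concrete, here is a minimal sketch of the kind of DAG file the Backend API could generate (assuming Airflow 2.x); the DAG ID, schedule, and task callables are hypothetical stand-ins for the platform's custom operators:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical callables standing in for the cleansing and deduplication operators.
def run_cleansing(**context):
    print("cleansing staged data")

def run_deduplication(**context):
    print("deduplicating cleansed data")

with DAG(
    dag_id="pipeline_42",            # generated from the pipeline ID stored in PostgreSQL
    start_date=datetime(2024, 1, 1),
    schedule=None,                   # run on demand when triggered from the UI (Airflow 2.4+)
    catchup=False,
) as dag:
    cleanse = PythonOperator(task_id="cleanse", python_callable=run_cleansing)
    dedupe = PythonOperator(task_id="deduplicate", python_callable=run_deduplication)
    cleanse >> dedupe  # task ordering mirrors the edges drawn in the UI
```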

4. Data & Storage Layer

  • PostgreSQL – Metadata store for pipelines, configurations, audit logs, and lineage data.
  • MinIO – S3-compatible object store for staging data, artifacts, and logs (a client sketch follows this list).
  • ELK Stack – Centralized logging and search capabilities.
  • Prometheus/Grafana – Metrics collection and dashboarding.
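
Because MinIO is S3-compatible, pipeline tasks can stage artifacts with any S3 client. A minimal boto3 sketch; the endpoint, credentials, bucket, and object keys below are placeholders:

```python
import boto3

# Placeholder endpoint and credentials; MinIO speaks the S3 API, so boto3 works unchanged.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.infra.svc.cluster.local:9000",
    aws_access_key_id="minio-access-key",
    aws_secret_access_key="minio-secret-key",
)

# Stage an intermediate artifact produced by a pipeline task.
s3.upload_file("/tmp/cleansed.parquet", "staging", "pipeline_42/cleansed.parquet")

# A downstream task reads it back from the same bucket.
s3.download_file("staging", "pipeline_42/cleansed.parquet", "/tmp/input.parquet")
```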

Key Platform Components

Component        | Responsibility
-----------------|-----------------------------------------------------------------------------
Webapp           | Pipeline builder UI, monitoring, audit dashboards.
Backend API      | Pipeline orchestration, DAG generation, Airbyte integration, audit logging.
AI API           | AI inference for cleansing rules & job monitoring (Ollama models).
Airbyte          | Connector management for data ingestion.
Airflow          | DAG scheduling, execution, and task orchestration.
Custom Operators | Encapsulate business-specific ETL & quality checks.
PostgreSQL       | Metadata, pipeline definitions, audit logs.
MinIO            | Object storage for intermediate and final artifacts.
ELK              | Log ingestion, search, and visualization.
Keycloak         | Authentication & authorization provider.

Deployment & Infrastructure

  • Runtime: Kubernetes cluster (cloud or on-prem).
  • Namespace Segregation:
    • platform – Core microservices (backend, AI, Airflow, Airbyte, Keycloak).
    • infra – Storage, ingress, logging, monitoring.
    • tenant-* – Optional per-tenant connector runtime.
  • Ingress: NGINX/Traefik with TLS termination (cert-manager).
  • Storage Classes: Block storage for DBs, object storage for MinIO.

Data Flow

  1. Pipeline Creation – User designs pipeline in UI → backend stores config in PostgreSQL → generates DAG file.
  2. Pipeline Execution – Airflow picks DAG → executes operators (Airbyte, cleansing, deduplication, SQL, Python, Beam).
  3. Ingestion – Airbyte connectors run in Kubernetes pods → write output to MinIO or target DB.
  4. Processing – Custom tasks transform, enrich, deduplicate data.
  5. AI Features – Backend calls AI API for suggestions/monitoring → results stored in DB.
  6. Audit & Monitoring – Logs and execution metadata stored in PostgreSQL and ELK → UI displays results.
  7. Notification – Email, Slack, or webhook alerts on pipeline events (a webhook sketch follows this list).
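
A minimal sketch of step 7 using a Slack incoming webhook; the webhook URL and message format are illustrative:

```python
import requests

# Placeholder URL; Slack incoming webhooks accept a simple JSON payload.
SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"

def notify_pipeline_event(pipeline: str, status: str) -> None:
    """Send a pipeline status alert to a Slack channel."""
    resp = requests.post(
        SLACK_WEBHOOK,
        json={"text": f"Pipeline `{pipeline}` finished with status: {status}"},
        timeout=10,
    )
    resp.raise_for_status()
```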

Security & Access Control

  • Identity – Managed via Keycloak OIDC.
  • Authorization – Role-based access control (RBAC) with project-level scopes (a token-verification sketch follows this list).
  • Secrets – Stored in Kubernetes Secrets or Vault (encrypted at rest).
  • Network Policies – Restrict inter-service communication.
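
Services can verify Keycloak-issued tokens and enforce roles locally. A sketch using PyJWT against Keycloak's standard JWKS endpoint; the realm, audience, and role names are assumptions:

```python
import jwt  # PyJWT (>= 2.0, which provides PyJWKClient)

# Standard Keycloak JWKS path; the realm name here is hypothetical.
JWKS_URL = "https://keycloak.example.com/realms/qualiz/protocol/openid-connect/certs"
jwks_client = jwt.PyJWKClient(JWKS_URL)

def require_role(token: str, role: str) -> dict:
    """Verify a Keycloak-issued JWT signature and enforce a realm role."""
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    claims = jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience="backend-api",  # assumed audience configured in Keycloak
    )
    # Keycloak places realm roles under the realm_access claim.
    if role not in claims.get("realm_access", {}).get("roles", []):
        raise PermissionError(f"missing required role: {role}")
    return claims
```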

Observability

  • Logs – Collected via ELK stack (Elasticsearch, Logstash/Fluentd, Kibana).
  • Metrics – Prometheus exporters for Airflow, Airbyte, and custom services (an exporter sketch follows this list).
  • Audit – Job & task-level immutable records.
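
Custom services can expose metrics with the Prometheus Python client. A minimal sketch; the metric names and labels are illustrative, not the platform's actual metric schema:

```python
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metrics for a custom operator service.
TASKS_TOTAL = Counter("qualiz_tasks_total", "Tasks executed", ["operator", "status"])
TASK_DURATION = Histogram("qualiz_task_duration_seconds", "Task runtime", ["operator"])

def record_task(operator: str, status: str, seconds: float) -> None:
    """Record one task execution for Prometheus to scrape."""
    TASKS_TOTAL.labels(operator=operator, status=status).inc()
    TASK_DURATION.labels(operator=operator).observe(seconds)

if __name__ == "__main__":
    start_http_server(8000)              # serves /metrics on port 8000
    record_task("cleansing", "success", 1.2)
    time.sleep(300)                      # keep the demo alive so Prometheus can scrape
```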

Scalability & Resilience

Task Execution Scalability

The platform handles heavy data processing workloads by scaling ETL task execution (a Beam pipeline sketch follows this list):

  • Apache Beam Integration – Provides a distributed, cluster-based execution environment for Python tasks and complex data transformations.
    • Supports multiple runners, including Apache Flink and Apache Spark clusters, so execution can be matched to the workload and environment.
    • Runs pipelines with the parallelism, windowing, and fault tolerance required for large streaming and batch workloads.
    • Scales horizontally by leveraging the underlying cluster’s autoscaling and resource management features.
  • Airflow KubernetesExecutor – Each ETL task runs in its own Kubernetes pod, allowing tasks to scale out in parallel based on available cluster resources.
  • Airbyte Connectors – Data ingestion connectors run as independent Kubernetes jobs or pods and can be scaled out for parallel data pulls and pushes.
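
A minimal Beam sketch showing runner selection; the runner, paths, and transforms are illustrative (reading s3:// paths from MinIO would additionally require Beam's S3 filesystem extras and endpoint configuration):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Runner choice is per-deployment: "FlinkRunner", "SparkRunner", or "DirectRunner" locally.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("/tmp/pipeline_42/input.csv")
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "DropEmpty" >> beam.Filter(lambda row: all(field.strip() for field in row))
        | "Format" >> beam.Map(",".join)
        | "Write" >> beam.io.WriteToText("/tmp/pipeline_42/cleansed")
    )
```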

Fault Tolerance & Resilience

  • Apache Beam runners provide exactly-once or at-least-once processing guarantees, depending on the runner and pipeline configuration.
  • Airflow task retries and DAG-level error handling keep pipelines resilient (a retry-policy sketch follows this list).
  • MinIO and PostgreSQL provide highly available, durable storage for intermediate data and metadata.
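
A sketch of the retry policy a generated DAG might apply through Airflow's default_args; the values are illustrative:

```python
from datetime import timedelta

# Applied to every task in a generated DAG via DAG(default_args=default_args, ...).
default_args = {
    "retries": 3,                          # re-run a failed task up to three times
    "retry_delay": timedelta(minutes=5),   # wait between attempts
    "retry_exponential_backoff": True,     # back off progressively on repeated failures
}
```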

Resource Efficiency

  • The platform supports workload-specific resource requests and limits for pods to optimize cluster resource usage (a per-task example follows this list).
  • Apache Beam pipelines leverage the autoscaling capabilities of the underlying runner to dynamically adjust compute resources during execution.
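
With the KubernetesExecutor, per-task resources can be set through Airflow's executor_config pod override. A sketch with illustrative values:

```python
from kubernetes.client import models as k8s

# Per-task resource requests and limits; pass executor_config=... to the task's operator.
executor_config = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # Airflow's task container is named "base"
                    resources=k8s.V1ResourceRequirements(
                        requests={"cpu": "500m", "memory": "1Gi"},
                        limits={"cpu": "2", "memory": "4Gi"},
                    ),
                )
            ]
        )
    )
}
```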