Building Blocks of the ETL Platform
The platform’s ETL pipelines are built from modular task types, each addressing a specific stage of the data processing lifecycle. These tasks can be combined in any order to form complex workflows, with execution managed by Apache Airflow and scalable processing supported by Apache Beam.
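To make the composition model concrete, the sketch below chains three placeholder tasks in an Airflow DAG (assuming Airflow 2.4+). The callables and task IDs (`ingest`, `cleanse`, `deduplicate`) are hypothetical stand-ins for the platform's task types, not its actual operator classes.

```python
# Minimal sketch, assuming Airflow 2.4+; the callables are hypothetical
# placeholders for the platform's task types.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest(**_):
    """Placeholder for an ingestion task."""


def cleanse(**_):
    """Placeholder for a cleansing task."""


def deduplicate(**_):
    """Placeholder for a deduplication task."""


with DAG(
    dag_id="example_etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    cleanse_task = PythonOperator(task_id="cleanse", python_callable=cleanse)
    dedup_task = PythonOperator(task_id="deduplicate", python_callable=deduplicate)

    # Tasks can be chained in whatever order the workflow requires.
    ingest_task >> cleanse_task >> dedup_task
```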
Data Ingestion
- Purpose: Bring data into the platform from diverse sources and deliver it to the desired destinations.
- Features:
    - Custom UI for creating connections to data sources and destinations.
    - Mapping definitions to align incoming data structures with target schemas.
    - Support for both batch and incremental ingestion.
    - Built-in connectors for popular databases, file systems, and APIs.
- Execution Modes:
    - Runs as independent ingestion tasks inside Airflow DAGs.
    - Can execute in parallel for multiple sources and destinations, as sketched below.
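A minimal sketch of the parallel execution mode, assuming Airflow 2.4+. The `run_ingestion` callable, the connection IDs, and the destination name are hypothetical placeholders for the platform's ingestion tasks.

```python
# Minimal sketch: one independent ingestion task per source, run in parallel
# by Airflow. All names below are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_ingestion(source_conn_id: str, destination: str, **_):
    """Placeholder: pull from the source connection and load into the destination."""
    print(f"Ingesting {source_conn_id} -> {destination}")


with DAG(
    dag_id="multi_source_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Each source gets its own task; Airflow schedules them concurrently.
    for source in ["postgres_orders", "s3_click_logs", "crm_api"]:
        PythonOperator(
            task_id=f"ingest_{source}",
            python_callable=run_ingestion,
            op_kwargs={"source_conn_id": source, "destination": "warehouse.raw"},
        )
```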
Cleansing
- Purpose: Standardize, validate, and enrich raw data before further processing or analysis.
- Features:
    - AI-assisted cleansing rule generation based on column profiling.
    - Built-in transformations: format normalization, missing value handling, type casting.
    - Enrichment from reference datasets or APIs.
- Execution Modes:
    - Runs as a Java task in Airflow.
    - For large datasets, executes on Apache Beam for distributed processing (see the sketch after this list).
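A minimal sketch of the distributed execution mode. The platform's cleansing task is Java-based; Beam's Python SDK is used here only to keep the examples in one language. The sample records and cleansing rules are invented.

```python
# Minimal sketch of cleansing on Apache Beam (Python SDK); records and rules
# are illustrative assumptions.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def cleanse(record: dict) -> dict:
    """Apply simple cleansing rules: type casting, normalization, missing-value handling."""
    return {
        "id": int(record["id"]),                           # type casting
        "email": record.get("email", "").strip().lower(),  # format normalization
        "country": record.get("country") or "UNKNOWN",     # missing value handling
    }


if __name__ == "__main__":
    # Pass e.g. --runner=FlinkRunner or --runner=SparkRunner for distributed execution.
    options = PipelineOptions()
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadRaw" >> beam.Create([
                {"id": "1", "email": " Alice@Example.COM ", "country": None},
                {"id": "2", "email": "bob@example.com", "country": "DE"},
            ])
            | "Cleanse" >> beam.Map(cleanse)
            | "Print" >> beam.Map(print)
        )
```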
Deduplication
- Purpose: Identify and remove duplicate records to ensure data accuracy and consistency.
- Features:
    - Multiple algorithms for duplicate detection:
        - Exact match
        - Fuzzy matching
        - Composite key matching
    - Configurable thresholds for match confidence.
- Execution Modes:
    - Small datasets: Java task on an Airflow worker.
    - Large datasets: distributed deduplication using Apache Beam on Flink or Spark runners (see the sketch after this list).
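A minimal sketch of composite-key deduplication with Beam's Python SDK (the platform's task is Java-based; the pattern is the same). The key columns, sample records, and the keep-the-first policy are illustrative assumptions.

```python
# Minimal sketch of composite-key deduplication on Apache Beam (Python SDK).
import apache_beam as beam


def composite_key(record: dict) -> tuple:
    # Composite key: records sharing the same (email, order_id) pair are duplicates.
    return (record["email"].lower(), record["order_id"])


def keep_first(keyed):
    # Keep one record per key; a real pipeline might instead keep the newest.
    _, records = keyed
    return next(iter(records))


if __name__ == "__main__":
    # Add --runner=FlinkRunner or --runner=SparkRunner to distribute the work.
    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Read" >> beam.Create([
                {"email": "a@x.com", "order_id": 1, "amount": 10},
                {"email": "A@X.com", "order_id": 1, "amount": 10},  # duplicate
                {"email": "b@x.com", "order_id": 2, "amount": 20},
            ])
            | "KeyByCompositeKey" >> beam.Map(lambda r: (composite_key(r), r))
            | "GroupDuplicates" >> beam.GroupByKey()
            | "KeepOnePerKey" >> beam.Map(keep_first)
            | "Print" >> beam.Map(print)
        )
```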
Custom Tasks
- Purpose: Enable flexible transformations and logic beyond the predefined tasks.
- Features:
    - Run custom SQL queries against supported databases.
    - Execute arbitrary Python scripts for data transformation, ML scoring, or external API calls.
    - Integrate with external services or tools as needed.
- Execution Modes:
    - Local execution in Airflow worker pods.
    - Cluster execution with Apache Beam for heavy computation workloads (see the sketch after this list).
    - Beam runner flexibility: both Flink and Spark clusters are supported.
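A minimal sketch of a custom SQL task followed by a custom Python task in an Airflow DAG, assuming the Common SQL provider is installed. The connection ID `warehouse_db`, the table names, and the `score_records` function are hypothetical.

```python
# Minimal sketch of custom tasks: a SQL step feeding a Python step.
# Connection ID, tables, and scoring logic are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator


def score_records(**_):
    """Placeholder for arbitrary Python logic: ML scoring, external API calls, etc."""


with DAG(
    dag_id="custom_tasks_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    build_summary = SQLExecuteQueryOperator(
        task_id="build_summary",
        conn_id="warehouse_db",
        sql="""
            CREATE TABLE IF NOT EXISTS daily_summary AS
            SELECT customer_id, SUM(amount) AS total
            FROM clean_orders
            GROUP BY customer_id;
        """,
    )
    score = PythonOperator(task_id="score_records", python_callable=score_records)

    build_summary >> score
```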