
Data Pipeline Reference



About Data Pipeline Reference

The Data Pipeline Reference is a concise, searchable guide to the core patterns, architectures, and concepts in modern data engineering. It covers 4 categories: Patterns (ETL, ELT, Batch, Streaming, CDC), Architecture (Lambda, Kappa, Data Lake, Data Warehouse, Lakehouse, Medallion), Concepts (Schema-on-Read vs Schema-on-Write), and Quality (Exactly-once semantics, Idempotency). Each entry explains the concept, names the real tools associated with it, and provides a quick example of when and how it applies.

This reference is designed for data engineers, analytics engineers, and data architects who need to quickly distinguish between architectural approaches and recall the tradeoffs. The Patterns section covers the two dominant integration styles: ETL (transform before loading, used with tools like Informatica, Talend, SSIS) and ELT (transform inside the data warehouse using dbt, Fivetran, Airbyte). It also covers Batch processing (periodic execution via Spark, Airflow), Streaming (continuous low-latency processing with Kafka, Flink, Kinesis), and CDC (Change Data Capture using Debezium to stream database change logs in real time).

The Architecture section covers the major data storage and processing paradigms. Lambda Architecture combines a Batch Layer for accurate recomputation with a Speed Layer for real-time approximation, merged by a Serving Layer. Kappa Architecture simplifies this by treating all data as streams. Data Lake stores raw data in any format (S3, ADLS, GCS) with schema applied at read time. Data Warehouse (Snowflake, BigQuery, Redshift) enforces schema at write time for structured analytics. Lakehouse (Delta Lake, Apache Iceberg, Apache Hudi) unifies both with ACID transactions and open formats. The Medallion Architecture organizes data into Bronze (raw), Silver (cleaned), and Gold (business-aggregated) layers.
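The Lambda merge step described above can be sketched in a few lines of plain Python. This is an illustrative toy, not a real implementation: the function and variable names are invented, and in practice the batch layer would be something like Spark and the speed layer a stream processor.

```python
# Toy sketch of Lambda Architecture's serving layer: the batch layer
# recomputes accurate totals over all history, the speed layer keeps
# counts for events since the last batch run, and the serving layer
# merges the two views at query time.

def batch_view(historical_events):
    """Batch layer: full, accurate recomputation over all history."""
    totals = {}
    for user, amount in historical_events:
        totals[user] = totals.get(user, 0) + amount
    return totals

def merge_views(batch, speed):
    """Serving layer: batch totals plus real-time increments."""
    merged = dict(batch)
    for user, amount in speed.items():
        merged[user] = merged.get(user, 0) + amount
    return merged

history = [("alice", 10), ("bob", 5), ("alice", 7)]  # processed nightly
recent = {"alice": 3}                                # arrived since last batch run

print(merge_views(batch_view(history), recent))      # {'alice': 20, 'bob': 5}
```

Kappa removes the merge entirely: there is only one (streaming) code path, and "recomputation" means replaying the stream from the beginning.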

Key Features

  • Covers 4 categories (Patterns, Architecture, Concepts, Quality), with real tool names for each
  • ETL vs ELT comparison with specific tool ecosystems (Informatica/Talend vs dbt/Fivetran/Airbyte)
  • Batch vs Streaming tradeoffs: MapReduce/Spark Batch/Airflow vs Kafka/Kinesis/Flink/Spark Streaming
  • CDC (Change Data Capture) with Debezium and Kafka for real-time database change streaming
  • Lambda Architecture (Batch + Speed + Serving layers) and Kappa Architecture (stream-only) compared
  • Data Lake vs Data Warehouse vs Lakehouse — schema-on-read/write, columnar storage, ACID transactions
  • Medallion Architecture: Bronze/Silver/Gold data quality layers explained with purpose of each
  • Data quality concepts: Exactly-once processing guarantee, idempotency, MERGE/UPSERT patterns

Frequently Asked Questions

What is the difference between ETL and ELT?

ETL (Extract-Transform-Load) transforms data before loading it into the destination. The transformation happens in a separate processing layer, using tools like Informatica, Talend, or SSIS. ELT (Extract-Load-Transform) loads raw data first into the data warehouse, then transforms it inside the warehouse using its native compute power. Tools like dbt, Fivetran, and Airbyte follow the ELT pattern. ELT is preferred in modern cloud data warehouses because warehouse compute is cheap and transformations can be rerun without re-extracting data.
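The ordering difference can be made concrete with a small sketch, using sqlite3 as a stand-in for the warehouse. All table and column names here are invented for illustration; the point is only where the transformation happens.

```python
import sqlite3

# Raw extract: amounts arrive as messy strings.
raw = [("alice", " 10 "), ("bob", "5")]

# --- ETL: transform in a separate processing layer, then load the result ---
clean = [(name, int(amount.strip())) for name, amount in raw]
etl_db = sqlite3.connect(":memory:")
etl_db.execute("CREATE TABLE orders (name TEXT, amount INTEGER)")
etl_db.executemany("INSERT INTO orders VALUES (?, ?)", clean)

# --- ELT: load the raw data first, then transform inside the warehouse ---
elt_db = sqlite3.connect(":memory:")
elt_db.execute("CREATE TABLE raw_orders (name TEXT, amount TEXT)")
elt_db.executemany("INSERT INTO raw_orders VALUES (?, ?)", raw)
elt_db.execute(
    "CREATE TABLE orders AS "
    "SELECT name, CAST(TRIM(amount) AS INTEGER) AS amount FROM raw_orders"
)

# Both end with the same clean table, but ELT keeps the raw copy around,
# so the transformation can be rerun without re-extracting from the source.
print(elt_db.execute("SELECT * FROM orders").fetchall())  # [('alice', 10), ('bob', 5)]
```

In the ELT case, the `CREATE TABLE ... AS SELECT` statement is exactly the kind of in-warehouse transformation a tool like dbt would manage as a model.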

When should I use batch processing vs streaming?

Batch processing runs periodically (hourly, daily, weekly) and processes large volumes of accumulated data at once. It is simpler to implement, tolerates higher latency, and suits use cases like nightly reports and daily aggregations. Streaming processes data continuously with low latency (seconds or milliseconds) as it arrives. Use streaming for real-time dashboards, fraud detection, live recommendations, or any use case where acting on fresh data immediately has business value. Tools: Spark Batch and Airflow for batch; Kafka, Flink, and Kinesis for streaming.
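The two execution models can be contrasted in miniature. This is illustrative only; real batch jobs would run under Spark or Airflow, and real streams through Kafka or Flink.

```python
# Illustrative-only contrast of the two execution models in plain Python.

events = [("09:01", 5), ("09:02", 3), ("09:59", 2)]

# Batch: let events accumulate, then process them all at once on a schedule.
def nightly_batch(accumulated):
    return sum(amount for _, amount in accumulated)

# Streaming: handle each event as it arrives, keeping a running result,
# so the answer is always fresh (at the cost of more operational complexity).
running_total = 0
for _, amount in events:
    running_total += amount  # would update a dashboard or fire an alert here

print(nightly_batch(events), running_total)  # 10 10
```

Both arrive at the same total; the difference is *when* the answer exists. The batch total is only available after the scheduled run, while the streaming total is current after every event.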

What is CDC (Change Data Capture) and why is it important?

CDC captures every insert, update, and delete operation from a database transaction log and streams those changes to downstream systems in real time. Instead of querying entire tables periodically (which is slow and misses deletes), CDC reads the database binlog (MySQL) or WAL (PostgreSQL) to produce a stream of precise change events. Debezium is the most popular open-source CDC connector, typically paired with Kafka. CDC is essential for keeping data warehouses, caches, and microservices in sync with low latency.
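Applying a CDC stream downstream can be sketched as replaying change events against a replica. The event shape below loosely mimics Debezium's envelope (where "op" is "c"reate, "u"pdate, or "d"elete), but the field names are simplified for illustration.

```python
# Hedged sketch: applying a stream of CDC change events to a downstream
# replica (a warehouse table, cache, or microservice's local copy).

changes = [
    {"op": "c", "key": 1, "after": {"id": 1, "email": "a@x.com"}},
    {"op": "u", "key": 1, "after": {"id": 1, "email": "a@y.com"}},
    {"op": "c", "key": 2, "after": {"id": 2, "email": "b@x.com"}},
    {"op": "d", "key": 2, "after": None},
]

replica = {}
for event in changes:
    if event["op"] == "d":
        replica.pop(event["key"], None)  # deletes are captured too,
    else:                                # unlike periodic full-table polls
        replica[event["key"]] = event["after"]

print(replica)  # {1: {'id': 1, 'email': 'a@y.com'}}
```

Note that the delete of key 2 is reflected in the replica; a periodic `SELECT *` poll of the source table would never observe that row had existed and been removed.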

What is the Medallion Architecture (Bronze/Silver/Gold)?

The Medallion Architecture organizes data in a lakehouse into three quality tiers. Bronze receives raw, unmodified data as ingested from sources — preserving full history, including duplicates and errors. Silver applies cleaning, deduplication, type casting, and basic transformations to create a reliable, normalized dataset. Gold contains business-level aggregations and domain-specific views ready for BI tools and machine learning. This layered approach makes data lineage traceable and allows re-processing any layer without losing the original data.
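The three tiers can be shown as a toy flow in plain Python. In a real lakehouse these would be Delta or Iceberg tables, and the field names here are invented for illustration.

```python
# Toy Bronze -> Silver -> Gold flow.

bronze = [  # raw, as ingested: duplicates and string-typed amounts preserved
    {"order_id": "1", "region": "EU", "amount": "10"},
    {"order_id": "1", "region": "EU", "amount": "10"},  # duplicate
    {"order_id": "2", "region": "US", "amount": "7"},
]

# Silver: dedupe on the business key and cast types.
silver = list({
    row["order_id"]: {"order_id": int(row["order_id"]),
                      "region": row["region"],
                      "amount": int(row["amount"])}
    for row in bronze
}.values())

# Gold: business-level aggregation, ready for BI.
gold = {}
for row in silver:
    gold[row["region"]] = gold.get(row["region"], 0) + row["amount"]

print(gold)  # {'EU': 10, 'US': 7}
```

Because Bronze is never mutated, a bug in the Silver or Gold logic can be fixed and those layers rebuilt from scratch, which is exactly the re-processing guarantee the paragraph above describes.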

What is the difference between a Data Lake and a Data Warehouse?

A Data Lake stores raw data in its original format (CSV, JSON, Parquet, images, logs) without enforcing a schema upfront — this is schema-on-read. You apply the schema when querying. Tools: S3, ADLS, GCS. A Data Warehouse stores structured, pre-modeled data with schema enforced at write time — this is schema-on-write. It uses columnar storage for fast analytical queries. Tools: Snowflake, BigQuery, Redshift. A Data Lake is flexible but harder to maintain quality; a Data Warehouse is highly performant for structured analytics but less flexible for raw data exploration.
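The schema-on-read vs schema-on-write distinction can be sketched in a few lines. The record shape below is made up; the point is only *when* types get applied.

```python
import json

# Minimal sketch of the two schema strategies.

raw_records = ['{"id": "1", "price": "9.5"}', '{"id": "2", "price": "3"}']

# Schema-on-read (lake style): store the raw strings as-is,
# apply types only at query time.
def query_lake(records):
    return [{"id": int(r["id"]), "price": float(r["price"])}
            for r in map(json.loads, records)]

# Schema-on-write (warehouse style): validate and cast before storing,
# so a malformed record is rejected at load time, not at query time.
warehouse = []
def load_warehouse(record):
    r = json.loads(record)
    warehouse.append({"id": int(r["id"]), "price": float(r["price"])})

for rec in raw_records:
    load_warehouse(rec)

print(query_lake(raw_records) == warehouse)  # True
```

The tradeoff is visible here: the lake accepts anything and defers failures to query time, while the warehouse pays the validation cost upfront and every query afterward sees clean, typed data.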

What is a Lakehouse and how does it relate to Delta Lake and Apache Iceberg?

A Lakehouse is an architecture that combines the low-cost, flexible storage of a Data Lake with the ACID transaction semantics, schema enforcement, and query performance of a Data Warehouse. Delta Lake (Databricks), Apache Iceberg, and Apache Hudi are open table formats that add transactional capabilities (ACID transactions, time travel, schema evolution) on top of cloud object storage like S3. They enable concurrent reads and writes, rollback, and incremental processing while keeping data in open file formats like Parquet.
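The "time travel" idea these table formats provide can be illustrated in miniature: each commit produces a new immutable snapshot, and old versions remain readable. Real formats implement this with transaction logs and manifests over Parquet files; this sketch shows only the concept.

```python
# Toy versioned table illustrating time travel (concept only).

snapshots = []  # version history: snapshots[v] is the table at version v

def commit(new_table):
    snapshots.append(list(new_table))
    return len(snapshots) - 1  # version number of this commit

def read(version=None):
    return snapshots[-1 if version is None else version]

v0 = commit([{"id": 1, "status": "new"}])
v1 = commit([{"id": 1, "status": "shipped"}])

print(read())    # latest:      [{'id': 1, 'status': 'shipped'}]
print(read(v0))  # time travel: [{'id': 1, 'status': 'new'}]
```

Because commits are append-only snapshots rather than in-place mutations, rollback is just reading (or re-pointing to) an earlier version, which is the same mechanism Delta Lake and Iceberg expose.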

What does exactly-once processing mean in streaming pipelines?

Exactly-once processing guarantees that each event in a stream is processed exactly one time — not zero times (at-most-once) and not more than once (at-least-once). Achieving exactly-once requires two things: idempotent producers (so retried sends do not create duplicates) and transactional consumers (so partial processing can be rolled back). Kafka supports exactly-once semantics through transactional APIs and idempotent producers. This is critical for financial transactions, billing, and any use case where duplicate processing has real consequences.
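One common way to get effectively-once results is at-least-once delivery combined with a deduplicating (idempotent) consumer. Kafka's transactional APIs handle this far more robustly; the sketch below, with invented event shapes, only illustrates the idea.

```python
# Sketch: at-least-once delivery plus consumer-side dedup by event id.

deliveries = [  # the broker retried event "e1", so it arrives twice
    {"id": "e1", "amount": 100},
    {"id": "e1", "amount": 100},  # duplicate from a retry
    {"id": "e2", "amount": 50},
]

processed_ids = set()  # in reality a durable store, committed atomically
balance = 0

for event in deliveries:
    if event["id"] in processed_ids:
        continue                    # duplicate: skip, don't double-charge
    balance += event["amount"]      # the side effect
    processed_ids.add(event["id"])  # must be atomic with the side effect

print(balance)  # 150, not 250
```

The fragile part is the comment on the last line of the loop: if the side effect succeeds but recording the id fails (or vice versa), the guarantee breaks, which is why real systems wrap both in a single transaction.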

What is idempotency in data pipelines and why does it matter?

An idempotent pipeline produces the same result when run multiple times as it does when run once. This is essential for fault-tolerant pipelines because failures and retries are inevitable. Achieving idempotency typically means using UPSERT (MERGE) statements instead of INSERT, using a unique key to detect and ignore duplicates, and designing transformations so re-running them on already-processed data does not create errors. Idempotent pipelines are safe to retry on failure, safe to backfill, and safe to re-run after a schema change.
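The UPSERT approach can be demonstrated with SQLite's `INSERT ... ON CONFLICT`, standing in for a warehouse `MERGE`. The table and rows are invented; the point is that running the same load twice yields the same result as running it once.

```python
import sqlite3

# Idempotent load via UPSERT: retrying the load creates no duplicates.

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

def load(rows):
    db.executemany(
        "INSERT INTO users (id, email) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET email = excluded.email",
        rows,
    )

batch = [(1, "a@x.com"), (2, "b@x.com")]
load(batch)
load(batch)  # safe to retry or backfill: no duplicates, no errors

print(db.execute("SELECT * FROM users ORDER BY id").fetchall())
# [(1, 'a@x.com'), (2, 'b@x.com')]
```

Had `load` used a plain `INSERT`, the second call would either duplicate rows or raise a primary-key violation; the conflict clause is what makes the pipeline safe to re-run.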