Data Modeling: Ingestion Design
Choose the right ingestion tools, loading methods, table formats, and managed vs external tables — the architectural decisions that shape your entire lakehouse.
Designing your ingestion architecture
Ingestion design is like planning a kitchen’s supply chain before opening a restaurant.
Where do ingredients come from (sources)? How often do deliveries arrive — daily truck (batch) or continuous conveyor belt (streaming)? What containers do they come in (file formats)? Which dock do they arrive at (ingestion tool)? Do you own the storage room or rent it (managed vs external)?
Get these decisions wrong and you spend months refactoring. Get them right and your data pipeline practically builds itself.
Extraction types
| Extraction Type | How It Works | Best For |
|---|---|---|
| Full extraction | Copy ALL data from source every time | Small reference tables, initial loads |
| Incremental extraction | Copy only NEW or CHANGED records since last run | Large transactional tables, frequent updates |
| CDC (Change Data Capture) | Capture individual row-level changes (insert, update, delete) | Real-time sync, audit trail |
Mei Lin uses incremental extraction for Freshmart’s daily POS data (millions of transactions). For the product catalogue (5,000 items), she uses full extraction because it’s small and simpler.
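The watermark pattern behind incremental extraction can be sketched in plain Python. This is a minimal sketch of the idea, not Databricks API code; the row shape and `updated_at` column are hypothetical:

```python
from datetime import datetime

def incremental_extract(rows, last_watermark):
    """Return only rows modified after the last successful run,
    plus the new watermark to persist for the next run."""
    new_rows = [r for r in rows if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=last_watermark)
    return new_rows, new_watermark

# Simulated POS source table (hypothetical data).
pos = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 2)},
    {"id": 3, "updated_at": datetime(2024, 1, 3)},
]

# Only rows newer than the stored watermark are copied on this run.
batch, wm = incremental_extract(pos, last_watermark=datetime(2024, 1, 1))
```

A full extraction is the degenerate case: reset the watermark to the beginning of time and every row qualifies, which is why it is only sensible for small tables like the product catalogue.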
Choosing an ingestion tool
| Feature | Lakeflow Connect | Notebooks | Azure Data Factory |
|---|---|---|---|
| Best for | SaaS/database connectors | Custom logic, complex transforms | Enterprise orchestration |
| Code required? | Low-code configuration | Python/SQL/Scala | Low-code + custom activities |
| Streaming support | Yes | Yes (Structured Streaming) | Limited |
| Built into Databricks | Yes (native) | Yes (native) | No (Azure service) |
| Exam scenario | Ingest from Salesforce, SAP | Custom ETL with business logic | Orchestrate multi-service pipeline |
Exam decision tree:
- Ingesting from a SaaS app or standard database? → Lakeflow Connect
- Need custom transformation logic during ingestion? → Notebooks
- Orchestrating across Azure services (not just Databricks)? → Azure Data Factory
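The decision tree above can be written down as a small study-aid function. This mirrors the exam heuristics in this section only; it is not an official selection rule:

```python
def choose_ingestion_tool(source_is_saas_or_db, needs_custom_logic, spans_azure_services):
    """Exam-style heuristic for picking an ingestion tool (study aid only)."""
    if spans_azure_services:
        return "Azure Data Factory"   # orchestration beyond Databricks
    if needs_custom_logic:
        return "Notebooks"            # custom transforms in Python/SQL/Scala
    if source_is_saas_or_db:
        return "Lakeflow Connect"     # low-code SaaS/database connectors
    return "Notebooks"                # fallback: most flexible option
```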
Azure Data Factory integration pattern
ADF doesn’t replace Databricks — it orchestrates it:
- ADF pipeline triggers on schedule
- Copy Activity moves data from source to ADLS bronze layer
- Databricks Notebook Activity transforms bronze to silver to gold
- ADF monitors the entire flow
Exam tip: If the question mentions orchestrating across multiple Azure services, ADF is the answer. If ingestion is purely within Databricks, use Lakeflow Connect or notebooks.
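The division of labour in this pattern can be sketched with plain-Python stand-ins for the two activities. The cleansing rules in the notebook step are hypothetical, chosen only to show that transformation logic lives in Databricks while ADF just moves and triggers:

```python
def copy_activity(source_rows):
    """ADF Copy Activity: land raw data in the bronze layer unchanged."""
    return list(source_rows)

def notebook_activity(bronze_rows):
    """Databricks Notebook Activity: cleanse bronze into silver
    (illustrative rules: drop records missing an id, deduplicate)."""
    seen, silver = set(), []
    for r in bronze_rows:
        if r.get("id") is not None and r["id"] not in seen:
            seen.add(r["id"])
            silver.append(r)
    return silver

source = [{"id": 1}, {"id": None}, {"id": 1}, {"id": 2}]
bronze = copy_activity(source)      # raw, warts and all
silver = notebook_activity(bronze)  # cleaned by Databricks
```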
Batch vs streaming
| Aspect | Batch | Streaming |
|---|---|---|
| Data arrives | In chunks (hourly, daily) | Continuously |
| Latency | Minutes to hours | Seconds to minutes |
| Cost | Lower (compute runs only during batch) | Higher (continuous compute) |
| Complexity | Simpler | More complex (state, checkpoints) |
| Use case | Nightly ETL, reporting | Fraud detection, IoT, real-time dashboards |
Tomás at NovaPay uses streaming for fraud detection. Ravi at DataPulse uses batch for nightly client reporting.
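The "state and checkpoints" complexity that the table attributes to streaming comes down to remembering how far you have read, so a restart neither reprocesses nor skips events. A minimal sketch of checkpointed offset tracking (the event list and checkpoint dict are hypothetical stand-ins for a real source and checkpoint store):

```python
def process_stream(events, checkpoint):
    """Streaming-style micro-batch: resume from the last checkpointed
    offset, process only what is new, then advance the checkpoint."""
    offset = checkpoint.get("offset", 0)
    processed = events[offset:]
    checkpoint["offset"] = len(events)
    return processed

events = ["txn1", "txn2", "txn3"]
ckpt = {}

first = process_stream(events, ckpt)   # first run: processes everything
events += ["txn4"]                     # a new transaction arrives
second = process_stream(events, ckpt)  # later run: only the new arrival
```

Batch is the same loop run on a schedule with a coarse watermark; streaming runs it continuously, which is where the extra cost and operational complexity come from.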
Table formats
| Format | ACID | Schema Enforcement | Time Travel | Default Choice? |
|---|---|---|---|---|
| Delta Lake | Yes | Yes | Yes | Yes — always use this |
| Iceberg | Yes | Yes | Yes | Multi-engine interoperability |
| Parquet | No | No | No | Read-only analytics, archival |
| CSV | No | No | No | Data exchange |
| JSON | No | No | No | Semi-structured, API responses |
Delta vs Iceberg
- Delta Lake = native Databricks format. Best performance, all features.
- Iceberg = multi-engine format. Choose when data must be read by Trino, Flink, or other engines outside Databricks.
Exam tip: If “interoperability with non-Databricks engines” is mentioned → Iceberg. Otherwise → Delta.
Managed vs external tables (design decision)
| When to Choose | Managed | External |
|---|---|---|
| Standard lakehouse tables | Yes | |
| DROP should delete data files | Yes | |
| Data shared across systems | | Yes |
| Pre-existing data in ADLS | | Yes |
| DROP should keep data files | | Yes |
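In SQL, the difference comes down to one clause: an external table declares an explicit `LOCATION`, while a managed table lets the catalog own its storage. A sketch of the two DDL shapes (schema, table names, and the storage path are hypothetical):

```python
# Managed table: the catalog owns the storage location; DROP deletes the files.
managed_ddl = """
CREATE TABLE sales.orders (id BIGINT, amount DECIMAL(10,2))
"""

# External table: you supply LOCATION; DROP removes only the metadata.
external_ddl = """
CREATE TABLE sales.orders_ext (id BIGINT, amount DECIMAL(10,2))
LOCATION 'abfss://bronze@mystorageacct.dfs.core.windows.net/orders'
"""

def is_external(ddl):
    """Heuristic: an explicit LOCATION clause marks the table as external."""
    return "LOCATION" in ddl.upper()
```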
Knowledge check
Ravi needs to ingest customer data from Salesforce into DataPulse's lakehouse. The data should arrive daily with minimal custom code. Which tool is most appropriate?
Tomás needs fraud detection within 5 seconds of a transaction. Historical reporting refreshes every 6 hours. Which loading methods should he use?
Next up: SCD, Granularity & Temporal Tables — slowly changing dimensions, granularity decisions, and history tables.