Data Modeling: Ingestion Design
Choose the right ingestion tools, loading methods, table formats, and managed vs external tables — the architectural decisions that shape your entire lakehouse.
Designing your ingestion architecture
Ingestion design is like planning a kitchen’s supply chain before opening a restaurant.
Where do ingredients come from (sources)? How often do deliveries arrive — daily truck (batch) or continuous conveyor belt (streaming)? What containers do they come in (file formats)? Which dock do they arrive at (ingestion tool)? Do you own the storage room or rent it (managed vs external)?
Get these decisions wrong and you spend months refactoring. Get them right and your data pipeline practically builds itself.
Extraction types
| Extraction Type | How It Works | Best For |
|---|---|---|
| Full extraction | Copy ALL data from source every time | Small reference tables, initial loads |
| Incremental extraction | Copy only NEW or CHANGED records since last run | Large transactional tables, frequent updates |
| CDC (Change Data Capture) | Capture individual row-level changes (insert, update, delete) | Real-time sync, audit trail |
Mei Lin uses incremental extraction for Freshmart’s daily POS data (millions of transactions). For the product catalogue (5,000 items), she uses full extraction because it’s small and simpler.
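The watermark pattern behind incremental extraction can be sketched in plain Python. This is a minimal sketch of the idea, not Databricks API code; the row shape and `updated_at` column are hypothetical:

```python
from datetime import datetime

def incremental_extract(rows, last_watermark):
    """Return only rows modified after the last successful run,
    plus the new watermark to persist for the next run."""
    new_rows = [r for r in rows if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=last_watermark)
    return new_rows, new_watermark

# Simulated POS source table (hypothetical data).
pos = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 2)},
    {"id": 3, "updated_at": datetime(2024, 1, 3)},
]

# Only rows newer than the stored watermark are copied on this run.
batch, wm = incremental_extract(pos, last_watermark=datetime(2024, 1, 1))
```

A full extraction is the degenerate case: reset the watermark to the beginning of time and every row qualifies, which is why it is only sensible for small tables like the product catalogue.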
Choosing an ingestion tool
| Feature | Lakeflow Connect | Notebooks | Azure Data Factory |
|---|---|---|---|
| Best for | SaaS/database connectors | Custom logic, complex transforms | Enterprise orchestration |
| Code required? | Low-code configuration | Python/SQL/Scala | Low-code + custom activities |
| Streaming support | Yes | Yes (Structured Streaming) | Limited |
| Built into Databricks | Yes (native) | Yes (native) | No (Azure service) |
| Exam scenario | Ingest from Salesforce, SAP | Custom ETL with business logic | Orchestrate multi-service pipeline |
Exam decision tree:
- Ingesting from a SaaS app or standard database? → Lakeflow Connect
- Need custom transformation logic during ingestion? → Notebooks
- Orchestrating across Azure services (not just Databricks)? → Azure Data Factory
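The decision tree above can be written down as a small study-aid function. This mirrors the exam heuristics in this section only; it is not an official selection rule:

```python
def choose_ingestion_tool(source_is_saas_or_db, needs_custom_logic, spans_azure_services):
    """Exam-style heuristic for picking an ingestion tool (study aid only)."""
    if spans_azure_services:
        return "Azure Data Factory"   # orchestration beyond Databricks
    if needs_custom_logic:
        return "Notebooks"            # custom transforms in Python/SQL/Scala
    if source_is_saas_or_db:
        return "Lakeflow Connect"     # low-code SaaS/database connectors
    return "Notebooks"                # fallback: most flexible option
```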
Azure Data Factory integration pattern
ADF doesn’t replace Databricks — it orchestrates it:
- ADF pipeline triggers on schedule
- Copy Activity moves data from source to ADLS bronze layer
- Databricks Notebook Activity transforms bronze to silver to gold
- ADF monitors the entire flow
Exam tip: If the question mentions orchestrating across multiple Azure services, ADF is the answer. If ingestion is purely within Databricks, use Lakeflow Connect or notebooks.
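The division of labour in this pattern can be sketched with plain-Python stand-ins for the two activities. The cleansing rules in the notebook step are hypothetical, chosen only to show that transformation logic lives in Databricks while ADF just moves and triggers:

```python
def copy_activity(source_rows):
    """ADF Copy Activity: land raw data in the bronze layer unchanged."""
    return list(source_rows)

def notebook_activity(bronze_rows):
    """Databricks Notebook Activity: cleanse bronze into silver
    (illustrative rules: drop records missing an id, deduplicate)."""
    seen, silver = set(), []
    for r in bronze_rows:
        if r.get("id") is not None and r["id"] not in seen:
            seen.add(r["id"])
            silver.append(r)
    return silver

source = [{"id": 1}, {"id": None}, {"id": 1}, {"id": 2}]
bronze = copy_activity(source)      # raw, warts and all
silver = notebook_activity(bronze)  # cleaned by Databricks
```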
Batch vs streaming
| Aspect | Batch | Streaming |
|---|---|---|
| Data arrives | In chunks (hourly, daily) | Continuously |
| Latency | Minutes to hours | Seconds to minutes |
| Cost | Lower (compute runs only during batch) | Higher (continuous compute) |
| Complexity | Simpler | More complex (state, checkpoints) |
| Use case | Nightly ETL, reporting | Fraud detection, IoT, real-time dashboards |
Tomás at NovaPay uses streaming for fraud detection. Ravi at DataPulse uses batch for nightly client reporting.
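The "state and checkpoints" complexity that the table attributes to streaming comes down to remembering how far you have read, so a restart neither reprocesses nor skips events. A minimal sketch of checkpointed offset tracking (the event list and checkpoint dict are hypothetical stand-ins for a real source and checkpoint store):

```python
def process_stream(events, checkpoint):
    """Streaming-style micro-batch: resume from the last checkpointed
    offset, process only what is new, then advance the checkpoint."""
    offset = checkpoint.get("offset", 0)
    processed = events[offset:]
    checkpoint["offset"] = len(events)
    return processed

events = ["txn1", "txn2", "txn3"]
ckpt = {}

first = process_stream(events, ckpt)   # first run: processes everything
events += ["txn4"]                     # a new transaction arrives
second = process_stream(events, ckpt)  # later run: only the new arrival
```

Batch is the same loop run on a schedule with a coarse watermark; streaming runs it continuously, which is where the extra cost and operational complexity come from.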
Table formats
| Format | ACID | Schema Enforcement | Time Travel | Default Choice? |
|---|---|---|---|---|
| Delta Lake | Yes | Yes | Yes | Yes — always use this |
| Iceberg | Yes | Yes | Yes | Multi-engine interoperability |
| Parquet | No | No | No | Read-only analytics, archival |
| CSV | No | No | No | Data exchange |
| JSON | No | No | No | Semi-structured, API responses |
Delta vs Iceberg
- Delta Lake = native Databricks format. Best performance, all features.
- Iceberg = multi-engine format. Choose when data must be read by Trino, Flink, or other engines outside Databricks.
Exam tip: If “interoperability with non-Databricks engines” is mentioned → Iceberg. Otherwise → Delta.
Managed vs external tables (design decision)
| When to Choose | Managed | External |
|---|---|---|
| Standard lakehouse tables | Yes | |
| DROP should delete data files | Yes | |
| Data shared across systems | | Yes |
| Pre-existing data in ADLS | | Yes |
| DROP should keep data files | | Yes |
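In SQL, the difference comes down to one clause: an external table declares an explicit `LOCATION`, while a managed table lets the catalog own its storage. A sketch of the two DDL shapes (schema, table names, and the storage path are hypothetical):

```python
# Managed table: the catalog owns the storage location; DROP deletes the files.
managed_ddl = """
CREATE TABLE sales.orders (id BIGINT, amount DECIMAL(10,2))
"""

# External table: you supply LOCATION; DROP removes only the metadata.
external_ddl = """
CREATE TABLE sales.orders_ext (id BIGINT, amount DECIMAL(10,2))
LOCATION 'abfss://bronze@mystorageacct.dfs.core.windows.net/orders'
"""

def is_external(ddl):
    """Heuristic: an explicit LOCATION clause marks the table as external."""
    return "LOCATION" in ddl.upper()
```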
Knowledge check
Ravi needs to ingest customer data from Salesforce into DataPulse's lakehouse. The data should arrive daily with minimal custom code. Which tool is most appropriate?
Tomás needs fraud detection within 5 seconds of a transaction. Historical reporting refreshes every 6 hours. Which loading methods should he use?
Next up: SCD, Granularity & Temporal Tables — slowly changing dimensions, granularity decisions, and history tables.