Ingesting Data: Lakeflow Connect & Notebooks
Ingest data using Lakeflow Connect's pre-built connectors and custom notebook code: batch and streaming patterns for getting data into your lakehouse.
Two paths to ingestion
Lakeflow Connect is like a pre-built plumbing kit. Notebooks are like custom plumbing you build yourself.
Lakeflow Connect: pick your source (Salesforce, SAP, databases), configure connection details, and data flows automatically. No coding, just configuration.
Notebooks: write Python or SQL code to read data, transform it, and write it to your lakehouse. Full control, but you build and maintain everything.
Lakeflow Connect
Batch ingestion with Lakeflow Connect
Ravi uses Lakeflow Connect to ingest CRM data from Salesforce:
- Create a connection: specify the source system and credentials
- Configure ingestion: select tables, choose full or incremental sync
- Schedule: set the refresh cadence (hourly, daily)
- Monitor: track ingestion status in the Lakeflow dashboard
Lakeflow Connect automatically handles schema mapping, type conversion, and incremental extraction using watermark columns or change tracking.
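To make the watermark idea concrete, here is a minimal pure-Python sketch of incremental extraction. The `updated_at` column name and the in-memory row list are illustrative assumptions; a real connector tracks the watermark against the source database.

```python
from datetime import datetime

# Sketch of watermark-based incremental extraction: each sync pulls only
# rows modified after the last recorded watermark, then advances it.
def incremental_extract(rows, last_watermark):
    """Return rows newer than last_watermark, plus the new watermark."""
    new_rows = [r for r in rows if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=last_watermark)
    return new_rows, new_watermark

source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 3)},
    {"id": 3, "updated_at": datetime(2024, 1, 5)},
]

# A sync with watermark Jan 2 picks up only ids 2 and 3
batch, wm = incremental_extract(source, datetime(2024, 1, 2))
```

Running the same extraction again with the advanced watermark returns nothing new, which is exactly why incremental sync is cheap compared with a full reload.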
Streaming ingestion with Lakeflow Connect
For sources that support change streams (databases with CDC enabled), Lakeflow Connect can stream changes continuously:
- Source sends changes: Lakeflow Connect reads the change stream
- Writes to Delta table in near-real-time
- Handles schema evolution: new columns in the source are automatically added
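The core of CDC consumption is applying a stream of insert/update/delete events to a keyed target. This pure-Python sketch (the `op`/`row` event shape and dict-as-table are illustrative assumptions, not the actual wire format) shows the mechanics:

```python
# Sketch of a CDC consumer: replay change events against a target table
# keyed by primary key, so the target converges to the source's state.
def apply_changes(target, changes):
    for change in changes:
        op, row = change["op"], change["row"]
        if op in ("insert", "update"):
            target[row["id"]] = row      # upsert by primary key
        elif op == "delete":
            target.pop(row["id"], None)  # idempotent delete
    return target

table = {1: {"id": 1, "amount": 100}}
stream = [
    {"op": "insert", "row": {"id": 2, "amount": 50}},
    {"op": "update", "row": {"id": 1, "amount": 120}},
    {"op": "delete", "row": {"id": 2}},
]
apply_changes(table, stream)
# table now holds only id 1, with its updated amount
```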
Notebook-based ingestion
Batch ingestion with notebooks
# Read CSV files from ADLS landing zone
raw_df = (spark.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("abfss://landing@storage.dfs.core.windows.net/sales/*.csv"))
# Write to Delta table
(raw_df.write
.format("delta")
.mode("append")
.saveAsTable("bronze.raw_sales"))
Streaming ingestion with notebooks
# Read streaming data from a source
stream_df = (spark.readStream
.format("delta")
.table("bronze.raw_transactions"))
# Transform and write as a streaming query
(stream_df
.filter("amount > 0")
.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "/checkpoints/silver_txn")
.toTable("silver.valid_transactions"))
Key concept: Streaming uses readStream and writeStream instead of read and write. The checkpoint location tracks what data has been processed.
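A checkpoint is essentially a durable "how far did I get" marker. This simplified sketch (JSON offset file and list-of-records source are illustrative assumptions; real Spark checkpoints store offsets and state per source) shows why a restarted query does not reprocess old data:

```python
import json
import os
import tempfile

# Sketch of checkpointing: persist the last processed offset so a
# restarted query resumes from where it left off.
def load_offset(path):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["offset"]
    return 0

def process_batch(records, checkpoint_path):
    offset = load_offset(checkpoint_path)
    new = records[offset:]  # only records not yet processed
    with open(checkpoint_path, "w") as f:
        json.dump({"offset": len(records)}, f)
    return new

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
data = ["t1", "t2", "t3"]
first = process_batch(data, ckpt)   # processes all three records
second = process_batch(data, ckpt)  # nothing new to process
```

This is also why the checkpoint location must be unique per streaming query: two queries sharing one checkpoint would corrupt each other's progress tracking.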
| Feature | Lakeflow Connect | Notebooks |
|---|---|---|
| Setup effort | Low (configuration) | High (code + testing) |
| Custom logic | Limited | Unlimited |
| Error handling | Built-in retries | You implement |
| Schema evolution | Automatic | Manual or with mergeSchema |
| Monitoring | Lakeflow dashboard | Spark UI + custom logging |
| Best for | Standard sources, quick setup | Complex transforms, custom sources |
Exam tip: Schema evolution in notebook ingestion
When source schemas change (new columns added), notebook ingestion can fail. Enable schema evolution:
# Allow new columns to be added automatically
df.write.option("mergeSchema", "true").mode("append").saveAsTable("my_table")Or set it at the table level:
ALTER TABLE my_table SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true');Exam tip: If the question mentions βsource schema changesβ or βnew columns addedβ β mergeSchema is the answer.
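The semantics of schema merging are simple to state: the target schema becomes the union of existing and incoming columns, and old rows read as null for columns they never had. A minimal sketch (column lists only, an illustrative simplification of Delta's behavior):

```python
# Sketch of mergeSchema semantics: the table's schema grows to the union
# of old and new columns; column order of existing columns is preserved.
def merge_schema(table_columns, incoming_columns):
    return table_columns + [c for c in incoming_columns if c not in table_columns]

old = ["id", "amount"]
new_batch = ["id", "amount", "discount"]
merge_schema(old, new_batch)
# -> ['id', 'amount', 'discount']
```

Note the merge is additive only: a batch that is missing a column does not drop it from the table.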
🎬 Video coming soon
Knowledge check
Mei Lin is ingesting data from 15 different Freshmart suppliers. Each supplier sends daily CSV files to an ADLS landing zone. Some suppliers occasionally add new columns. What ingestion approach handles this best?
Next up: Ingesting Data: SQL Methods & CDC, covering CTAS, CREATE OR REPLACE, COPY INTO, and change data capture feeds.