Testing & Databricks Asset Bundles
Implement unit tests, integration tests, and end-to-end testing strategies. Package and deploy with Databricks Asset Bundles via CLI and REST APIs.
Testing strategy
Testing is taste-testing your food at every stage of cooking.
Unit test: taste each ingredient individually. Integration test: taste the combined sauce. End-to-end test: taste the full dish. UAT: have a customer taste it before putting it on the menu.
Without testing, you serve bad data and only find out when the CEO’s dashboard is wrong.
Testing layers
| Test Type | What It Tests | How | When |
|---|---|---|---|
| Unit test | Individual functions/transforms | pytest with mock data | Every commit |
| Integration test | Components working together | Test tables in dev workspace | Every PR merge |
| End-to-end test | Full pipeline bronze → gold | Run pipeline on test data in staging | Before production deploy |
| UAT | Business rules and output quality | Stakeholders validate sample output | Before production release |
Unit testing example
```python
# test_transforms.py
from transforms import clean_amount, validate_date

def test_clean_amount_removes_negatives():
    assert clean_amount(-50) is None
    assert clean_amount(100) == 100.0

def test_validate_date_rejects_future():
    assert validate_date("2099-01-01") is False
    assert validate_date("2026-04-01") is True
```
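The test file imports `clean_amount` and `validate_date` from a `transforms` module that isn't shown. A minimal sketch of what those functions might look like — the exact rules and the fixed cutoff date are assumptions for illustration:

```python
# transforms.py — hypothetical implementations matching the tests above
from datetime import date, datetime

def clean_amount(amount):
    """Return the amount as a float, or None for missing/negative values."""
    if amount is None or amount < 0:
        return None
    return float(amount)

def validate_date(date_str, today=date(2026, 4, 15)):
    """Reject dates in the future; `today` is pinned here only for illustration."""
    parsed = datetime.strptime(date_str, "%Y-%m-%d").date()
    return parsed <= today
```

Keeping these transforms as plain Python functions (no Spark dependency) is what makes them unit-testable with pytest on every commit.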
Integration testing
```python
# Run in a dev workspace with test data
test_df = spark.createDataFrame([
    (1, "Alice", 100.0, "2026-04-01"),
    (2, None, -50.0, "2099-01-01"),  # should be filtered out
], ["id", "name", "amount", "date"])

result = run_silver_pipeline(test_df)
assert result.count() == 1  # only the valid row survives
assert result.filter("name = 'Alice'").count() == 1
```
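`run_silver_pipeline` itself isn't shown above. The row-level rules it might apply can be sketched on plain tuples — shown without Spark so the logic is easy to run and inspect; the field order and the pinned "today" are assumptions:

```python
# Hypothetical validation rules behind run_silver_pipeline,
# applied to plain tuples instead of a Spark DataFrame.
from datetime import date, datetime

def is_valid_row(row, today=date(2026, 4, 15)):  # "today" pinned for illustration
    _id, name, amount, date_str = row
    if name is None:                  # drop rows missing required fields
        return False
    if amount is None or amount < 0:  # drop negative amounts
        return False
    parsed = datetime.strptime(date_str, "%Y-%m-%d").date()
    return parsed <= today            # drop future-dated rows

rows = [
    (1, "Alice", 100.0, "2026-04-01"),
    (2, None, -50.0, "2099-01-01"),
]
valid = [r for r in rows if is_valid_row(r)]  # only Alice's row survives
```

In the real pipeline the same conditions would be expressed as DataFrame filters, but keeping the rules testable in isolation mirrors the unit-test layer above.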
Databricks Asset Bundles (DABs)
Asset Bundles package your entire project into a deployable unit:
```yaml
# databricks.yml — bundle configuration
bundle:
  name: freshmart-etl

workspace:
  host: https://adb-1234567890.1.azuredatabricks.net

resources:
  jobs:
    nightly_etl:
      name: "Freshmart Nightly ETL"
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ./notebooks/01_ingest.py
          job_cluster_key: etl_cluster
        - task_key: transform
          depends_on:
            - task_key: ingest
          notebook_task:
            notebook_path: ./notebooks/02_transform.py
          job_cluster_key: etl_cluster
  pipelines:
    quality_pipeline:
      name: "Freshmart Quality Pipeline"
      target: freshmart_silver
      libraries:
        - notebook:
            path: ./pipelines/quality_checks.sql

targets:
  dev:
    workspace:
      host: https://adb-dev.azuredatabricks.net
  prod:
    workspace:
      host: https://adb-prod.azuredatabricks.net
```
Deploy via CLI
```shell
# Validate the bundle configuration
databricks bundle validate

# Deploy to the dev target
databricks bundle deploy --target dev

# Run a specific job
databricks bundle run nightly_etl --target dev

# Deploy to production
databricks bundle deploy --target prod
```
Deploy via REST API
```python
import os
import requests

# Create a job using the Databricks Jobs REST API (2.1)
workspace_url = os.environ["DATABRICKS_HOST"]  # e.g. https://adb-....azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]         # personal access token
job_config = {...}                             # job settings dict: name, tasks, clusters

response = requests.post(
    f"{workspace_url}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_config,
)
response.raise_for_status()
```
CI/CD with Asset Bundles
A typical CI/CD pipeline:
- Developer pushes code to a feature branch
- CI pipeline (GitHub Actions/Azure DevOps) runs:
  - `databricks bundle validate` — check config syntax
  - `pytest` — run unit tests
  - `databricks bundle deploy --target dev` — deploy to dev
  - Integration tests in the dev workspace
- PR merged → deploy to staging, run E2E tests
- Release → `databricks bundle deploy --target prod`
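The CI stage of that flow could be wired up in GitHub Actions roughly as follows — the workflow name, trigger, secret name, and Python version are all assumptions, not prescribed by the bundle itself:

```yaml
# .github/workflows/ci.yml — hypothetical CI workflow for the bundle
name: freshmart-etl-ci
on:
  pull_request:

jobs:
  validate-test-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main       # installs the Databricks CLI
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install pytest
      - run: databricks bundle validate       # check config syntax
      - run: pytest                           # run unit tests
      - run: databricks bundle deploy --target dev
        env:
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
```

Because the bundle carries its own target definitions, the staging and production stages reuse the exact same deploy command with a different `--target` value.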
Knowledge check
Dr. Sarah Okafor needs to deploy Athena Group's ETL pipeline to three environments (dev, staging, prod) with the same code but different workspace URLs. Which tool should she use?
Next up: Monitoring Clusters & Troubleshooting — cluster monitoring, job repair, and Spark troubleshooting.