Databricks Asset Bundles Integration¶
This page covers Lakehouse Plumber’s integration with Databricks Asset Bundles (DAB), enabling seamless deployment and management of generated DLT pipelines as bundle resources.
Overview¶
What are Databricks Asset Bundles?
Databricks Asset Bundles (DAB) provide a unified way to deploy and manage Databricks resources like jobs, pipelines, and notebooks using declarative YAML configuration. Bundles enable version control, environment management, and CI/CD integration for your entire Databricks workspace.
If you are not familiar with DABs, please refer to the Databricks Asset Bundles documentation.
What LHP Does with Bundles
Lakehouse Plumber does NOT replace Databricks Asset Bundles or Databricks CLI. It generates the pipeline resource YAML files that DABs use for deployment.
Capabilities at a Glance¶
The following table summarizes what you can do with LHP’s bundle integration:
| Capability | Description |
|---|---|
| Pipeline Resource Generation | Auto-generate DAB pipeline YAML files from flowgroups |
| Pipeline Configuration | Customize compute, runtime, notifications per pipeline |
| Job Configuration | Configure orchestration jobs with schedules and alerts |
| Multi-Job Orchestration | Split pipelines into separate jobs by layer or domain |
| Dependency Analysis | Auto-detect pipeline dependencies and execution order |
| Orchestration Job Generation | Generate DAB jobs with proper task dependencies |
Visual Overview
flowchart LR
A["📁 pipelines/<br/>YAML Configs"] --> B["🔧 lhp generate"]
B --> C["🐍 generated/<br/>Python Files"]
B --> D["📋 resources/lhp/<br/>Pipeline YAMLs"]
P["⚙️ pipeline_config.yaml<br/>(optional)"] -.-> B
J["⚙️ job_config.yaml<br/>(optional)"] -.-> E
E["📊 lhp deps"] --> F["📋 resources/<br/>Job YAMLs"]
D --> G["🚀 databricks bundle deploy"]
F --> G
style A fill:#e1f5fe
style C fill:#f3e5f5
style D fill:#e8f5e8
style F fill:#fff3e0
style G fill:#ffebee
style P fill:#fffde7
style J fill:#fffde7
Prerequisites & Setup¶
Requirements
Python 3.11+ (3.12 recommended)
Databricks workspace with Unity Catalog enabled
Databricks CLI v0.200+ installed and configured
LakehousePlumber installed:
pip install lakehouse-plumber
Databricks CLI Setup
Install and configure the Databricks CLI:
Note
Follow the steps here to install the Databricks CLI.
# Configure authentication
databricks configure --token
# Verify connection
databricks workspace list
How LHP Integrates with DABs¶
LHP integrates with Databricks Asset Bundles by generating resource files that DABs use for deployment.
What LHP Does
Generates resource YAML files for each pipeline in the resources/lhp/ directory
Synchronizes resource files with generated Python notebooks automatically
Maintains resource file consistency by cleaning up obsolete resources
Supports customization through pipeline and job configuration files
What LHP Does NOT Do
Replace or modify your databricks.yml file
Deploy resources to Databricks (use databricks bundle deploy)
Manage non-LHP resources in the resources/ directory
Benefits of Using Bundles with LHP
Unified Deployment: Deploy pipelines, jobs, and configurations together
Environment Management: Separate dev/staging/prod configurations
Version Control: Track resource changes alongside pipeline code
CI/CD Integration: Automated deployments through Databricks CLI
Resource Cleanup: Automatic cleanup of deleted pipelines
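Put together, the division of labor comes down to two commands (both covered in detail below):

lhp generate -e dev                      # LHP: writes generated/ code and resources/lhp/ YAML
databricks bundle deploy --target dev    # Databricks CLI: deploys the bundle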
Project Structure¶
Your project should have this structure when using bundles:
my-data-platform/
├── databricks.yml # Bundle configuration (you manage this)
├── lhp.yaml # LHP project config
├── pipelines/ # LHP pipeline definitions (you create these)
│ ├── 01_raw_ingestion/
│ ├── 02_bronze/
│ └── 03_silver/
├── substitutions/ # Environment configs (you create these)
│ ├── dev.yaml
│ └── prod.yaml
├── config/ # Optional configuration files
│ ├── pipeline_config.yaml
│ └── job_config.yaml
├── resources/ # Bundle resources
│ ├── lhp/ # LHP-managed (auto-generated, do NOT modify)
│ │ ├── raw_ingestion.pipeline.yml
│ │ └── bronze_layer.pipeline.yml
│ └── custom.job.yml # Your custom DAB files (LHP won't touch)
└── generated/ # Generated Python files (auto-generated, do NOT modify)
    ├── raw_ingestion/
    └── bronze_layer/
Note
Coexistence with Your DAB Files
LHP manages its resource files ONLY in the resources/lhp/ subdirectory. You can
safely place your own Databricks Asset Bundle files directly in resources/.
LHP will never modify or delete files outside resources/lhp/.
Warning
Files in resources/lhp/ with the "Generated by LakehousePlumber" header will be automatically overwritten by LHP during generation. Do not manually edit files in resources/lhp/ - your changes will be lost.
Getting Started¶
Follow these steps to set up bundle integration in your LHP project.
Step 1: Initialize Project with Bundle Support
lhp init --bundle my-data-platform
cd my-data-platform
Note
The --bundle flag creates a databricks.yml file and the resources/lhp directory.
Step 2: Configure databricks.yml
Edit databricks.yml to add your Databricks workspace details:
bundle:
  name: my-data-platform

targets:
  dev:
    workspace:
      host: https://your-workspace.cloud.databricks.com
    default: true
See also
Refer to Databricks official documentation for more configuration options: Databricks Bundle Configuration
Step 3: Create Your First Pipeline
Create a pipeline configuration in the pipelines/ folder. See Getting Started for detailed examples.
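As a minimal sketch of what such a file contains (the fields mirror the multi-job example later on this page; a real flowgroup needs a complete action configuration):

# pipelines/01_raw_ingestion/customer.yaml
pipeline: raw_ingestion
flowgroup: customer_ingestion
actions:
  - name: load_customer
    type: load
    # ... source configuration (see Getting Started)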
Step 4: Generate Code and Resources
lhp generate -e dev
You should see output like:
🔄 Syncing bundle resources with generated files...
✅ Updated 1 bundle resource file(s)
Step 5: Verify Generated Resources
Check the generated resource file:
cat resources/lhp/your_pipeline.pipeline.yml
Step 6: Deploy to Databricks
# Validate bundle configuration
databricks bundle validate --target dev
# Deploy bundle to Databricks
databricks bundle deploy --target dev
# Verify deployment
databricks bundle summary --target dev
Bundle Resource Synchronization¶
How Resource Sync Works
When bundle support is enabled (databricks.yml exists), LHP automatically:
Generates resource YAML files using Jinja2 templates for each pipeline
Uses glob patterns to automatically discover all files in pipeline directories
Removes obsolete resource files for deleted pipelines
Maintains environment-specific configurations
Important
LHP will NOT edit your databricks.yml file
It only creates/updates pipeline YAML files in resources/lhp/
You can add custom bundle resources directly in resources/
Generated Resource File Example
# Generated by LakehousePlumber - Bundle Resource for bronze_load
resources:
  pipelines:
    bronze_load_pipeline:
      name: bronze_load_pipeline
      catalog: main
      schema: lhp_${bundle.target}
      libraries:
        - glob:
            include: ../../generated/bronze_load/**
      root_path: ${workspace.file_path}/generated/bronze_load/
      configuration:
        bundle.sourcePath: ${workspace.file_path}/generated
Why Glob Patterns Instead of Notebooks?
Lakeflow pipelines now use Python files as their source (notebooks are legacy)
Glob patterns automatically discover all Python files in pipeline directories
New files are included automatically without resource file updates (see the comparison sketch below)
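For comparison, the legacy notebook form would have to enumerate every file (paths here are illustrative):

libraries:
  - notebook:
      path: ../../generated/bronze_load/customer_ingestion.py
  - notebook:
      path: ../../generated/bronze_load/orders_ingestion.py

With the glob form shown in the generated example above, new files under generated/bronze_load/ are picked up on the next deploy without editing the resource file.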
Configuration Management¶
LakehousePlumber provides two configuration files to customize how your pipelines and orchestration jobs are deployed to Databricks:
Pipeline Configuration - Controls DLT pipeline settings (compute, runtime, notifications)
Job Configuration - Controls orchestration job settings (schedules, concurrency, alerts)
Pipeline Configuration¶
Overview
Pipeline Configuration controls Delta Live Tables (DLT) pipeline-level settings such as compute resources, runtime environment, processing mode, and monitoring.
Configuration File Format
Create a multi-document YAML file with project-level defaults and per-pipeline overrides:
# Project-level defaults (applied to all pipelines)
project_defaults:
  serverless: true
  edition: ADVANCED
  channel: CURRENT
  continuous: false

---
# Pipeline-specific configuration
pipeline:
  - bronze_load
serverless: false
continuous: true
clusters:
  - label: default
    node_type_id: Standard_D16ds_v5
    autoscale:
      min_workers: 2
      max_workers: 10

---
pipeline:
  - silver_load
serverless: true
notifications:
  - email_recipients:
      - team@company.com
    alerts:
      - on-update-failure
Configuration Options
| Option | Type | Description |
|---|---|---|
| `catalog` | string | Unity Catalog name (supports LHP tokens) |
| `schema` | string | Schema/database name (supports LHP tokens) |
| `serverless` | boolean | Use serverless compute (default: `true`) |
| `edition` | string | DLT edition: `CORE`, `PRO`, or `ADVANCED` (default: `ADVANCED`) |
| `channel` | string | Runtime channel: `CURRENT` or `PREVIEW` |
| `continuous` | boolean | Enable continuous processing (streaming) |
| `photon` | boolean | Enable Photon engine (non-serverless only) |
| `clusters` | list | Cluster configurations for non-serverless pipelines |
| `notifications` | list | Email notifications and alert settings |
| `tags` | dict | Custom tags for the pipeline |
| `event_log` | dict | Event logging configuration. Can also be set project-wide in `lhp.yaml` |
| `environment` | dict | Runtime environment config (dependencies, etc.). Passed through as-is to Databricks. |
| `configuration` | dict | Pipeline-level Spark/DLT configuration key-value pairs. All values must be strings. |
Usage
# Specify config file when generating
lhp generate -e dev --pipeline-config config/pipeline_config.yaml
# Short flag version
lhp generate -e dev -pc config/pipeline_config.yaml
Configuration Precedence
Configurations are merged in order (later overrides earlier):
1. Default values - Built-in LHP defaults (serverless: true, edition: ADVANCED)
2. Project defaults - Values from the project_defaults section
3. Pipeline-specific - Values from pipeline-specific sections (highest priority; see the worked example below)
Note
Lists (like notifications and clusters) are replaced entirely, not appended.
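A small worked example of the three layers:

# Built-in defaults: serverless: true, edition: ADVANCED
project_defaults:
  channel: PREVIEW
---
pipeline:
  - bronze_load
serverless: false

# Effective config for bronze_load:
#   serverless: false  (pipeline-specific wins)
#   channel: PREVIEW   (project default)
#   edition: ADVANCED  (built-in default)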
Catalog and Schema Configuration
You can define catalog and schema in pipeline config to control where each pipeline writes data:
---
pipeline:
  - bronze_load
catalog: "${catalog}"        # Token from substitutions/dev.yaml
schema: "${bronze_schema}"   # Token from substitutions/dev.yaml

---
pipeline:
  - gold_analytics
catalog: "analytics_prod"    # Literal value (same across environments)
schema: "${gold_schema}"     # Token (varies by environment)
Important
Both catalog AND schema must be defined together (partial definition raises an error).
Why Catalog and Schema Are Required¶
Every Databricks Lakeflow Declarative Pipeline requires a default catalog and default schema. These set the Unity Catalog location where unqualified table references resolve, and are used by the pipeline UI for table discovery, event log storage, and schema browsing.
While LHP generates fully-qualified table names (e.g., catalog.schema.table) in the pipeline
code — meaning the default catalog/schema do not affect where data is written — Databricks still
requires these fields on the pipeline resource definition.
The simplest approach is to define catalog and schema in project_defaults, using
substitution tokens so values resolve per-environment from your substitutions/{env}.yaml files:
project_defaults:
  catalog: "${catalog}"
  schema: "${schema}"
This covers all pipelines. Pipelines that need a different schema can override with a per-pipeline section:
---
pipeline: my_special_pipeline
catalog: "${catalog}"
schema: "${special_schema}"
Deprecated since version 0.7.8: In previous versions, LHP auto-detected catalog/schema values from generated Python files
and populated databricks.yml variables (default_pipeline_catalog,
default_pipeline_schema). This auto-detection is deprecated and will be removed in
version 1.0.0. Starting in v1.0.0, pipeline_config.yaml (--pipeline-config / -pc)
will be required for bundle projects.
Full Configuration Substitution
All fields in pipeline_config.yaml support LHP token substitution, not just catalog/schema:
---
pipeline:
  - production_ingestion
clusters:
  - label: default
    node_type_id: "${pipeline_node_type}"   # Token for sizing
    policy_id: "${pipeline_policy_id}"      # Token for policy
notifications:
  - email_recipients:
      - "${ops_team_email}"                 # Token for email
tags:
  environment: "${environment_name}"        # Token for env tag
This enables complete environment-specific configuration from your substitutions/{env}.yaml files.
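For reference, the tokens above resolve from your substitution files. A sketch of substitutions/dev.yaml, assuming a flat key-to-value layout (check your project's existing substitution files for the exact schema; all values are illustrative):

# substitutions/dev.yaml
pipeline_node_type: Standard_D8ds_v5
pipeline_policy_id: "0123456789ABCDEF"
ops_team_email: ops-dev@company.com
environment_name: dev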
Environment Dependencies¶
Databricks DLT pipelines support an environment section for specifying pip package
dependencies that are installed at pipeline startup. LHP passes this section through
as-is to the generated bundle resource.
Input Configuration
---
pipeline: my_pipeline
catalog: "${catalog}"
schema: "${schema}"
serverless: true
environment:
  dependencies:
    - "msal==1.31.0"
    - "requests>=2.28.0"
Generated Output
environment:
  dependencies:
    - msal==1.31.0
    - requests>=2.28.0
Note
The environment section supports LHP token substitution just like all other
pipeline config fields. For example, you can use "msal==${msal_version}" and
define msal_version in your substitutions/{env}.yaml files.
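Rendered as a sketch (msal_version is the hypothetical token from the note above):

# substitutions/dev.yaml
msal_version: "1.31.0"

# pipeline_config.yaml
environment:
  dependencies:
    - "msal==${msal_version}"   # resolves to msal==1.31.0 when generating with -e dev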
Pipeline Configuration Entries¶
Databricks DLT pipelines support a configuration block for setting pipeline-level
Spark and DLT configuration properties (e.g., pipelines.incompatibleViewCheck.enabled).
LHP renders user-defined configuration entries alongside the mandatory bundle.sourcePath
entry in the generated bundle resource.
Input Configuration
---
pipeline: my_pipeline
catalog: "${catalog}"
schema: "${schema}"
serverless: true
configuration:
  "pipelines.incompatibleViewCheck.enabled": "false"
  "spark.databricks.delta.minFileSize": "134217728"
Generated Output
configuration:
  bundle.sourcePath: ${workspace.file_path}/generated/${bundle.target}
  pipelines.incompatibleViewCheck.enabled: "false"
  spark.databricks.delta.minFileSize: "134217728"
Note
The configuration section supports LHP token substitution just like all other
pipeline config fields. For example, you can use "${min_file_size}" and
define min_file_size in your substitutions/{env}.yaml files.
Warning
The bundle.sourcePath entry is managed by LHP and cannot be overridden. If included in user configuration, it will be silently ignored.
All configuration values must be quoted strings in the YAML input. Unquoted booleans (false) or numbers (134217728) will be rejected during validation.
Monitoring Pipeline Alias¶
When using event log monitoring (monitoring: in lhp.yaml), use the
__eventlog_monitoring reserved keyword in pipeline_config.yaml to configure the
monitoring pipeline without hardcoding its dynamic name. At generation time, the alias
resolves to the actual monitoring pipeline name.
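A minimal sketch, assuming the alias stands in for the pipeline name in a per-pipeline section (see Pipeline Monitoring for the authoritative rules):

---
pipeline: __eventlog_monitoring   # resolves to the actual monitoring pipeline name
serverless: true
tags:
  purpose: monitoring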
See also
For complete details on the monitoring pipeline alias, behavior rules, and examples, see Pipeline Monitoring.
Event Log Configuration¶
Databricks DLT pipelines support an event_log section that configures where pipeline
event logs are stored. LHP supports project-level event logging (in lhp.yaml) that
automatically applies to all pipelines, and pipeline-level overrides or opt-outs through
pipeline_config.yaml.
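As a sketch of a pipeline-level override (field names follow the Databricks event_log specification; the monitoring_schema token is hypothetical):

---
pipeline: my_pipeline
event_log:
  catalog: "${catalog}"
  schema: "${monitoring_schema}"
  name: my_pipeline_event_log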
See also
For complete event log configuration reference, table naming rules, pipeline-level overrides, and monitoring pipeline setup, see Pipeline Monitoring.
Job Configuration¶
Overview
Job Configuration controls Databricks orchestration job settings for dependency-based pipeline execution.
Configuration File Format
# Project-level defaults (applied to all jobs)
project_defaults:
  max_concurrent_runs: 1
  performance_target: STANDARD
  queue:
    enabled: true
  tags:
    managed_by: lakehouse_plumber

---
# Job-specific configuration
job_name:
  - bronze_ingestion_job
max_concurrent_runs: 2
performance_target: PERFORMANCE_OPTIMIZED
timeout_seconds: 7200
email_notifications:
  on_failure:
    - data-engineering@company.com
schedule:
  quartz_cron_expression: "0 0 2 * * ?"
  timezone_id: "America/New_York"
Configuration Options
| Option | Default | Description |
|---|---|---|
| `max_concurrent_runs` | `1` | Maximum number of concurrent job runs |
| `performance_target` | `STANDARD` | Performance mode: `STANDARD` or `PERFORMANCE_OPTIMIZED` |
| `queue` | `enabled: true` | Enable job queueing |
| `timeout_seconds` | None | Job-level timeout in seconds |
| `tags` | None | Key-value pairs for job tags |
| `email_notifications` | None | Email alerts (on_start, on_success, on_failure) |
| `webhook_notifications` | None | Webhook alerts (on_start, on_success, on_failure) |
| `permissions` | None | Job access permissions |
| `schedule` | None | Cron schedule configuration |
Usage
# Generate orchestration job with config
lhp deps --job-config config/job_config.yaml --bundle-output
# Short flag version
lhp deps -jc config/job_config.yaml --bundle-output
Pass-through Fields (Unknown Keys)
Any top-level job_config key that is not in the table above is rendered verbatim into the generated orchestration job YAML. This lets you use any Databricks Jobs API field — including fields added after your LHP release — without waiting for LHP to add explicit support.
Common pass-through examples:
project_defaults:
  trigger:
    file_arrival:
      url: "s3://my-bucket/landing-zone/"
      min_time_between_triggers_seconds: 60
      wait_after_last_change_seconds: 30
    pause_status: UNPAUSED

project_defaults:
  continuous:
    pause_status: UNPAUSED

project_defaults:
  run_as:
    service_principal_name: "<sp-application-id>"

project_defaults:
  # git_source, health, parameters, environments, edit_mode,
  # budget_policy_id, … all pass through in the same way.
  budget_policy_id: "your-policy-id"
  edit_mode: EDITABLE
Note
Author-specified key order is preserved. LHP does not validate pass-through keys against the Databricks API — if you misspell a field, Databricks will reject it at deploy time, not LHP.
Merge Behavior
Configs are deep-merged: DEFAULT → project_defaults → job-specific
# Example: Tags are deep-merged
project_defaults.tags: {managed_by: "lhp", environment: "dev"}
job-specific.tags: {layer: "bronze", environment: "prod"}
# Result: {managed_by: "lhp", environment: "prod", layer: "bronze"}
Note
Nested dicts are deep-merged, but lists are REPLACED (not appended).
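For example, a job-specific on_failure list replaces the project-level list wholesale (addresses are illustrative):

project_defaults:
  email_notifications:
    on_failure:
      - ops@company.com
---
job_name:
  - bronze_ingestion_job
email_notifications:
  on_failure:
    - data-eng@company.com   # replaces ops@company.com; the lists are not merged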
Multi-Job Orchestration¶
Overview
LakehousePlumber supports generating multiple orchestration jobs instead of a single job, enabling better organization for large projects.
When to Use Multi-Job Mode
Project Segregation: Separate jobs by data layer (bronze, silver, gold)
Resource Optimization: Different compute requirements per job
Team Ownership: Assign jobs to different teams
SLA Management: Run critical jobs with higher priority/resources
Cost Control: Apply different schedules and timeout policies
Enabling Multi-Job Mode
Add the job_name property to your flowgroup YAMLs:
pipeline: data_bronze
flowgroup: customer_ingestion
job_name:
- bronze_ingestion_job # Assigns this flowgroup to a specific job
actions:
- name: load_customer
type: load
# ... rest of configuration
Validation Rules
All-or-nothing: If ANY flowgroup has job_name, ALL must have it
Format: Alphanumeric, underscore, and hyphen only (^[a-zA-Z0-9_-]+$)
Pipeline filter restriction: Cannot use the --pipeline flag with multi-job mode
Generated Files
When using multi-job mode, LHP generates:
resources/
├── bronze_ingestion_job.job.yml # Individual job
├── silver_transform_job.job.yml # Individual job
├── gold_analytics_job.job.yml # Individual job
└── my_project_master.job.yml # Master orchestration job
The master job coordinates all individual jobs using job_task references with proper dependencies.
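The master job looks roughly like the sketch below, using DAB resource references and the Jobs API run_job_task type (the exact generated output may differ):

resources:
  jobs:
    my_project_master:
      name: my_project_master
      tasks:
        - task_key: bronze_ingestion_job
          run_job_task:
            job_id: ${resources.jobs.bronze_ingestion_job.id}
        - task_key: silver_transform_job
          depends_on:
            - task_key: bronze_ingestion_job
          run_job_task:
            job_id: ${resources.jobs.silver_transform_job.id}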
Configuration Templates¶
When you initialize a project, configuration templates are created in config/:
config/job_config.yaml.tmpl - Job configuration template with examples
config/pipeline_config.yaml.tmpl - Pipeline configuration template with examples
Getting Started with Templates
# Copy and customize templates
cp config/job_config.yaml.tmpl config/job_config.yaml
cp config/pipeline_config.yaml.tmpl config/pipeline_config.yaml
# Edit with your settings, then use
lhp generate -e dev -pc config/pipeline_config.yaml
lhp deps -jc config/job_config.yaml --bundle-output
Best Practices¶
Environment-Specific Configuration¶
Different environments typically need different settings. We recommend maintaining separate configuration files for each environment.
Recommended file structure:
config/
├── pipeline_config-dev.yaml
├── pipeline_config-prod.yaml
├── job_config-dev.yaml
└── job_config-prod.yaml
Common differences by environment:
| Setting | Development | Production |
|---|---|---|
| Cluster size | Smaller nodes (cost efficiency) | Larger nodes (performance) |
| Concurrency | Lower (1-2 concurrent runs) | Higher (3+ concurrent runs) |
| Notifications | Minimal or none | Full alerting to ops teams |
| Timeouts | Relaxed (for debugging) | Strict (SLA enforcement) |
| Performance target | `STANDARD` | `PERFORMANCE_OPTIMIZED` |
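As a sketch, the dev and prod pipeline configs might differ like this (node type and addresses are illustrative):

# config/pipeline_config-dev.yaml
project_defaults:
  serverless: true
  notifications: []

# config/pipeline_config-prod.yaml
project_defaults:
  serverless: false
  clusters:
    - label: default
      node_type_id: Standard_D16ds_v5
  notifications:
    - email_recipients:
        - ops@company.com
      alerts:
        - on-update-failure

Then pass the matching file per environment, e.g. lhp generate -e prod -pc config/pipeline_config-prod.yaml.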
See also
For complete CI/CD integration patterns including environment-specific deployment workflows, see CI/CD Reference.
CLI Quick Reference¶
Initialize Project with Bundles
lhp init --bundle my-project
Generate Code and Resources
# Basic generation
lhp generate -e dev
# With pipeline config
lhp generate -e dev -pc config/pipeline_config.yaml
# Force regeneration
lhp generate -e dev --force
# Disable bundle sync
lhp generate -e dev --no-bundle
Generate Orchestration Jobs
# Generate job with dependency analysis
lhp deps --format job --job-name my_etl --bundle-output
# With custom config
lhp deps -jc config/job_config.yaml --bundle-output
Deploy to Databricks
# Validate bundle
databricks bundle validate --target dev
# Deploy
databricks bundle deploy --target dev
# Run a specific pipeline
databricks bundle run my_pipeline --target dev
See also
For dependency analysis commands and output formats, see Dependency Analysis & Job Generation.
Advanced Topics¶
Bundle Sync Behavior
Bundle synchronization happens automatically after successful generation:
Triggers: After lhp generate completes successfully
Scope: Processes all generated Python files in output directory
Cleanup: Removes resource files for deleted/excluded pipelines
Idempotent: Safe to run multiple times
Multi-Environment Setup
Configure multiple environments with different settings in databricks.yml:
bundle:
  name: acmi-data-platform

targets:
  dev:
    workspace:
      host: https://dev-workspace.cloud.databricks.com
      root_path: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target}
  prod:
    workspace:
      host: https://prod-workspace.cloud.databricks.com
      root_path: /Shared/.bundle/${bundle.name}/${bundle.target}
Troubleshooting
| Issue | Solution |
|---|---|
| Bundle sync not triggered | Ensure `databricks.yml` exists in the project root |
| Resource files not generated | Check generated Python files exist and are valid |
| Bundle validation fails | Verify YAML syntax in generated resource files |
| Deployment permission errors | Check workspace permissions and bundle target paths |
| Obsolete resources not cleaned up | Run `lhp generate` again to trigger resource cleanup |