Databricks Asset Bundles Integration

This page covers Lakehouse Plumber’s integration with Databricks Asset Bundles (DAB), enabling seamless deployment and management of generated DLT pipelines as bundle resources.

Overview

What are Databricks Asset Bundles?

Databricks Asset Bundles (DAB) provide a unified way to deploy and manage Databricks resources like jobs, pipelines, and notebooks using declarative YAML configuration. Bundles enable version control, environment management, and CI/CD integration for your entire Databricks workspace.

If you are not familiar with DABs, please refer to the Databricks Asset Bundles documentation.

What LHP Does with Bundles

Lakehouse Plumber does NOT replace Databricks Asset Bundles or the Databricks CLI. Instead, it generates the pipeline resource YAML files that DABs use for deployment.

Capabilities at a Glance

The following table summarizes what you can do with LHP’s bundle integration:

Capability                    Description                                              Learn More
----------------------------  -------------------------------------------------------  ------------------------------------
Pipeline Resource Generation  Auto-generate DAB pipeline YAML files from flowgroups     Bundle Resource Synchronization
Pipeline Configuration        Customize compute, runtime, notifications per pipeline    Pipeline Configuration
Job Configuration             Configure orchestration jobs with schedules and alerts    Job Configuration
Multi-Job Orchestration       Split pipelines into separate jobs by layer or domain     Multi-Job Orchestration
Dependency Analysis           Auto-detect pipeline dependencies and execution order     Dependency Analysis & Job Generation
Orchestration Job Generation  Generate DAB jobs with proper task dependencies           Dependency Analysis & Job Generation

Visual Overview

flowchart LR
    A["📁 pipelines/<br/>YAML Configs"] --> B["🔧 lhp generate"]
    B --> C["🐍 generated/<br/>Python Files"]
    B --> D["📋 resources/lhp/<br/>Pipeline YAMLs"]

    P["⚙️ pipeline_config.yaml<br/>(optional)"] -.-> B
    J["⚙️ job_config.yaml<br/>(optional)"] -.-> E

    E["📊 lhp deps"] --> F["📋 resources/<br/>Job YAMLs"]
    D --> G["🚀 databricks bundle deploy"]
    F --> G

    style A fill:#e1f5fe
    style C fill:#f3e5f5
    style D fill:#e8f5e8
    style F fill:#fff3e0
    style G fill:#ffebee
    style P fill:#fffde7
    style J fill:#fffde7

Prerequisites & Setup

Requirements

  • Python 3.11+ (3.12 recommended)

  • Databricks workspace with Unity Catalog enabled

  • Databricks CLI v0.200+ installed and configured

  • LakehousePlumber installed: pip install lakehouse-plumber

Databricks CLI Setup

Install and configure the Databricks CLI:

Note

Follow the steps here to install the Databricks CLI.

# Configure authentication
databricks configure --token

# Verify connection
databricks workspace list

How LHP Integrates with DABs

LHP integrates with Databricks Asset Bundles by generating resource files that DABs use for deployment.

What LHP Does

  • Generates resource YAML files for each pipeline in the resources/lhp/ directory

  • Synchronizes resource files with generated Python notebooks automatically

  • Maintains resource file consistency by cleaning up obsolete resources

  • Supports customization through pipeline and job configuration files

What LHP Does NOT Do

  • Replace or modify your databricks.yml file

  • Deploy resources to Databricks (use databricks bundle deploy)

  • Manage non-LHP resources in the resources/ directory

Benefits of Using Bundles with LHP

  • Unified Deployment: Deploy pipelines, jobs, and configurations together

  • Environment Management: Separate dev/staging/prod configurations

  • Version Control: Track resource changes alongside pipeline code

  • CI/CD Integration: Automated deployments through Databricks CLI

  • Resource Cleanup: Automatic cleanup of deleted pipelines

Project Structure

Your project should have this structure when using bundles:

my-data-platform/
├── databricks.yml          # Bundle configuration (you manage this)
├── lhp.yaml                # LHP project config
├── pipelines/              # LHP pipeline definitions (you create these)
│   ├── 01_raw_ingestion/
│   ├── 02_bronze/
│   └── 03_silver/
├── substitutions/          # Environment configs (you create these)
│   ├── dev.yaml
│   └── prod.yaml
├── config/                 # Optional configuration files
│   ├── pipeline_config.yaml
│   └── job_config.yaml
├── resources/              # Bundle resources
│   ├── lhp/                # LHP-managed (auto-generated, do NOT modify)
│   │   ├── raw_ingestion.pipeline.yml
│   │   └── bronze_layer.pipeline.yml
│   └── custom.job.yml      # Your custom DAB files (LHP won't touch)
└── generated/              # Generated Python files (auto-generated, do NOT modify)
    ├── raw_ingestion/
    └── bronze_layer/

Note

Coexistence with Your DAB Files

LHP manages its resource files ONLY in the resources/lhp/ subdirectory. You can safely place your own Databricks Asset Bundle files directly in resources/. LHP will never modify or delete files outside resources/lhp/.
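
For example, a custom job resource kept directly in resources/ might look like the sketch below (the job name, task, and notebook path are purely illustrative and not part of any LHP output):

# resources/custom.job.yml (maintained by you; LHP never touches it)
resources:
  jobs:
    nightly_maintenance_job:            # hypothetical job name
      name: nightly_maintenance_job
      tasks:
        - task_key: vacuum_tables
          notebook_task:
            notebook_path: ../notebooks/vacuum_tables.py   # hypothetical path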

Warning

  • Files in resources/lhp/ with the "Generated by LakehousePlumber" header will be automatically overwritten by LHP during generation.

  • Do not manually edit files in resources/lhp/ - your changes will be lost.

Getting Started

Follow these steps to set up bundle integration in your LHP project.

Step 1: Initialize Project with Bundle Support

lhp init --bundle my-data-platform
cd my-data-platform

Note

The --bundle flag creates a databricks.yml file and the resources/lhp directory.

Step 2: Configure databricks.yml

Edit databricks.yml to add your Databricks workspace details:

databricks.yml
bundle:
  name: my-data-platform

targets:
  dev:
    workspace:
      host: https://your-workspace.cloud.databricks.com
    default: true

See also

Refer to Databricks official documentation for more configuration options: Databricks Bundle Configuration

Step 3: Create Your First Pipeline

Create a pipeline configuration in the pipelines/ folder. See Getting Started for detailed examples.

Step 4: Generate Code and Resources

lhp generate -e dev

You should see output like:

🔄 Syncing bundle resources with generated files...
✅ Updated 1 bundle resource file(s)

Step 5: Verify Generated Resources

Check the generated resource file:

cat resources/lhp/your_pipeline.pipeline.yml

Step 6: Deploy to Databricks

# Validate bundle configuration
databricks bundle validate --target dev

# Deploy bundle to Databricks
databricks bundle deploy --target dev

# Verify deployment
databricks bundle status --target dev

Bundle Resource Synchronization

How Resource Sync Works

When bundle support is enabled (databricks.yml exists), LHP automatically:

  1. Generates resource YAML files using Jinja2 templates for each pipeline

  2. Uses glob patterns to automatically discover all files in pipeline directories

  3. Removes obsolete resource files for deleted pipelines

  4. Maintains environment-specific configurations

Important

  • LHP will NOT edit your databricks.yml file

  • It only creates/updates pipeline YAML files in resources/lhp/

  • You can add custom bundle resources directly in resources/

Generated Resource File Example

resources/lhp/bronze_load.pipeline.yml
# Generated by LakehousePlumber - Bundle Resource for bronze_load
resources:
  pipelines:
    bronze_load_pipeline:
      name: bronze_load_pipeline
      catalog: main
      schema: lhp_${bundle.target}

      libraries:
        - glob:
            include: ../../generated/bronze_load/**

      root_path: ${workspace.file_path}/generated/bronze_load/

      configuration:
        bundle.sourcePath: ${workspace.file_path}/generated

Why Glob Patterns Instead of Notebooks?

  • Lakeflow pipelines now use Python files as their source (notebooks are legacy)

  • Glob patterns automatically discover all Python files in pipeline directories

  • New files are included automatically without resource file updates

Configuration Management

LakehousePlumber provides two configuration files to customize how your pipelines and orchestration jobs are deployed to Databricks:

  • Pipeline Configuration - Controls DLT pipeline settings (compute, runtime, notifications)

  • Job Configuration - Controls orchestration job settings (schedules, concurrency, alerts)

Pipeline Configuration

Overview

Pipeline Configuration controls Delta Live Tables (DLT) pipeline-level settings such as compute resources, runtime environment, processing mode, and monitoring.

Configuration File Format

Create a multi-document YAML file with project-level defaults and per-pipeline overrides:

config/pipeline_config.yaml
# Project-level defaults (applied to all pipelines)
project_defaults:
  serverless: true
  edition: ADVANCED
  channel: CURRENT
  continuous: false

---
# Pipeline-specific configuration
pipeline:
  - bronze_load
serverless: false
continuous: true
clusters:
  - label: default
    node_type_id: Standard_D16ds_v5
    autoscale:
      min_workers: 2
      max_workers: 10

---
pipeline:
  - silver_load
serverless: true
notifications:
  - email_recipients:
      - team@company.com
    alerts:
      - on-update-failure

Configuration Options

Option         Type     Description
-------------  -------  --------------------------------------------------
catalog        string   Unity Catalog name (supports LHP tokens)
schema         string   Schema/database name (supports LHP tokens)
serverless     boolean  Use serverless compute (default: true)
edition        string   DLT edition: CORE, PRO, or ADVANCED
channel        string   Runtime channel: CURRENT or PREVIEW
continuous     boolean  Enable continuous processing (streaming)
photon         boolean  Enable Photon engine (non-serverless only)
clusters       list     Cluster configurations for non-serverless pipelines
notifications  list     Email notifications and alert settings
tags           dict     Custom tags for the pipeline
event_log      dict     Event logging configuration. Can also be set project-wide in lhp.yaml. See Event Log Configuration.
environment    dict     Runtime environment config (dependencies, etc.). Passed through as-is to Databricks.
configuration  dict     Pipeline-level Spark/DLT configuration key-value pairs. All values must be strings.

Usage

# Specify config file when generating
lhp generate -e dev --pipeline-config config/pipeline_config.yaml

# Short flag version
lhp generate -e dev -pc config/pipeline_config.yaml

Configuration Precedence

Configurations are merged in order (later overrides earlier):

  1. Default values - Built-in LHP defaults (serverless: true, edition: ADVANCED)

  2. Project defaults - Values from the project_defaults section

  3. Pipeline-specific - Values from pipeline-specific sections (highest priority)

Note

Lists (like notifications and clusters) are replaced entirely, not appended.
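
As a minimal sketch of how list replacement plays out (the pipeline name and email addresses are illustrative):

# Project default defines one notification list...
project_defaults:
  notifications:
    - email_recipients:
        - platform-team@company.com
      alerts:
        - on-update-failure

---
# ...and the pipeline-specific document defines its own.
pipeline:
  - silver_load
notifications:
  - email_recipients:
      - silver-team@company.com
    alerts:
      - on-update-failure

# Effective notifications for silver_load: only silver-team@company.com.
# The project-level notification list is replaced entirely, not merged.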

Catalog and Schema Configuration

You can define catalog and schema in pipeline config to control where each pipeline writes data:

---
pipeline:
  - bronze_load
catalog: "${catalog}"          # Token from substitutions/dev.yaml
schema: "${bronze_schema}"     # Token from substitutions/dev.yaml

---
pipeline:
  - gold_analytics
catalog: "analytics_prod"     # Literal value (same across environments)
schema: "${gold_schema}"       # Token (varies by environment)

Important

Both catalog AND schema must be defined together (partial definition raises an error).

Why Catalog and Schema Are Required

Every Databricks Lakeflow Declarative Pipeline requires a default catalog and default schema. These set the Unity Catalog location where unqualified table references resolve, and are used by the pipeline UI for table discovery, event log storage, and schema browsing.

While LHP generates fully-qualified table names (e.g., catalog.schema.table) in the pipeline code — meaning the default catalog/schema do not affect where data is written — Databricks still requires these fields on the pipeline resource definition.

The simplest approach is to define catalog and schema in project_defaults, using substitution tokens so values resolve per-environment from your substitutions/{env}.yaml files:

config/pipeline_config.yaml
project_defaults:
  catalog: "${catalog}"
  schema: "${schema}"

This covers all pipelines. Pipelines that need a different schema can override with a per-pipeline section:

---
pipeline: my_special_pipeline
catalog: "${catalog}"
schema: "${special_schema}"

Deprecated since version 0.7.8: In previous versions, LHP auto-detected catalog/schema values from generated Python files and populated databricks.yml variables (default_pipeline_catalog, default_pipeline_schema). This auto-detection is deprecated and will be removed in version 1.0.0. Starting in v1.0.0, pipeline_config.yaml (--pipeline-config / -pc) will be required for bundle projects.

Full Configuration Substitution

All fields in pipeline_config.yaml support LHP token substitution, not just catalog/schema:

---
pipeline:
  - production_ingestion
clusters:
  - label: default
    node_type_id: "${pipeline_node_type}"    # Token for sizing
    policy_id: "${pipeline_policy_id}"       # Token for policy
notifications:
  - email_recipients:
      - "${ops_team_email}"                  # Token for email
tags:
  environment: "${environment_name}"         # Token for env tag

This enables complete environment-specific configuration from your substitutions/{env}.yaml files.
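
A matching sketch of those tokens might live in substitutions/dev.yaml. The values below are purely illustrative, and the exact layout should follow whatever structure your existing substitutions files already use:

# substitutions/dev.yaml (illustrative values)
pipeline_node_type: Standard_D4ds_v5
pipeline_policy_id: "0000-000000-abcdef00"
ops_team_email: data-ops-dev@company.com
environment_name: dev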

Environment Dependencies

Databricks DLT pipelines support an environment section for specifying pip package dependencies that are installed at pipeline startup. LHP passes this section through as-is to the generated bundle resource.

Input Configuration

config/pipeline_config.yaml
---
pipeline: my_pipeline
catalog: "${catalog}"
schema: "${schema}"
serverless: true
environment:
  dependencies:
    - "msal==1.31.0"
    - "requests>=2.28.0"

Generated Output

resources/lhp/my_pipeline.pipeline.yml (excerpt)
environment:
  dependencies:
    - msal==1.31.0
    - requests>=2.28.0

Note

The environment section supports LHP token substitution just like all other pipeline config fields. For example, you can use "msal==${msal_version}" and define msal_version in your substitutions/{env}.yaml files.

Pipeline Configuration Entries

Databricks DLT pipelines support a configuration block for setting pipeline-level Spark and DLT configuration properties (e.g., pipelines.incompatibleViewCheck.enabled). LHP renders user-defined configuration entries alongside the mandatory bundle.sourcePath entry in the generated bundle resource.

Input Configuration

config/pipeline_config.yaml
---
pipeline: my_pipeline
catalog: "${catalog}"
schema: "${schema}"
serverless: true
configuration:
  "pipelines.incompatibleViewCheck.enabled": "false"
  "spark.databricks.delta.minFileSize": "134217728"

Generated Output

resources/lhp/my_pipeline.pipeline.yml (excerpt)
configuration:
  bundle.sourcePath: ${workspace.file_path}/generated/${bundle.target}
  pipelines.incompatibleViewCheck.enabled: "false"
  spark.databricks.delta.minFileSize: "134217728"

Note

The configuration section supports LHP token substitution just like all other pipeline config fields. For example, you can use "${min_file_size}" and define min_file_size in your substitutions/{env}.yaml files.

Warning

  • The bundle.sourcePath entry is managed by LHP and cannot be overridden. If included in user configuration, it will be silently ignored.

  • All configuration values must be quoted strings in the YAML input. Unquoted booleans (false) or numbers (134217728) will be rejected during validation.

Monitoring Pipeline Alias

When using event log monitoring (monitoring: in lhp.yaml), use the __eventlog_monitoring reserved keyword in pipeline_config.yaml to configure the monitoring pipeline without hardcoding its dynamic name. At generation time, the alias resolves to the actual monitoring pipeline name.

See also

For complete details on the monitoring pipeline alias, behavior rules, and examples, see Pipeline Monitoring.

Event Log Configuration

Databricks DLT pipelines support an event_log section that configures where pipeline event logs are stored. LHP supports project-level event logging (in lhp.yaml) that automatically applies to all pipelines, and pipeline-level overrides or opt-outs through pipeline_config.yaml.

See also

For complete event log configuration reference, table naming rules, pipeline-level overrides, and monitoring pipeline setup, see Pipeline Monitoring.

Job Configuration

Overview

Job Configuration controls Databricks orchestration job settings for dependency-based pipeline execution.

Configuration File Format

config/job_config.yaml
# Project-level defaults (applied to all jobs)
project_defaults:
  max_concurrent_runs: 1
  performance_target: STANDARD
  queue:
    enabled: true
  tags:
    managed_by: lakehouse_plumber

---
# Job-specific configuration
job_name:
  - bronze_ingestion_job
max_concurrent_runs: 2
performance_target: PERFORMANCE_OPTIMIZED
timeout_seconds: 7200
email_notifications:
  on_failure:
    - data-engineering@company.com
schedule:
  quartz_cron_expression: "0 0 2 * * ?"
  timezone_id: "America/New_York"

Configuration Options

Option                 Default   Description
---------------------  --------  ------------------------------------------------
max_concurrent_runs    1         Maximum number of concurrent job runs
performance_target     STANDARD  STANDARD or PERFORMANCE_OPTIMIZED
queue.enabled          true      Enable job queueing
timeout_seconds        None      Job-level timeout in seconds
tags                   None      Key-value pairs for job tags
email_notifications    None      Email alerts (on_start, on_success, on_failure)
webhook_notifications  None      Webhook alerts (on_start, on_success, on_failure)
permissions            None      Job access permissions
schedule               None      Cron schedule configuration

Usage

# Generate orchestration job with config
lhp deps --job-config config/job_config.yaml --bundle-output

# Short flag version
lhp deps -jc config/job_config.yaml --bundle-output

Pass-through Fields (Unknown Keys)

Any top-level job_config key that is not in the table above is rendered verbatim into the generated orchestration job YAML. This lets you use any Databricks Jobs API field — including fields added after your LHP release — without waiting for LHP to add explicit support.

Common pass-through examples:

File-arrival trigger

project_defaults:
  trigger:
    file_arrival:
      url: "s3://my-bucket/landing-zone/"
      min_time_between_triggers_seconds: 60
      wait_after_last_change_seconds: 30
    pause_status: UNPAUSED

Continuous job

project_defaults:
  continuous:
    pause_status: UNPAUSED

run_as service principal

project_defaults:
  run_as:
    service_principal_name: "<sp-application-id>"

Any other Databricks Jobs API field

project_defaults:
  # git_source, health, parameters, environments, edit_mode,
  # budget_policy_id, … all pass through in the same way.
  budget_policy_id: "your-policy-id"
  edit_mode: EDITABLE

Note

Author-specified key order is preserved. LHP does not validate pass-through keys against the Databricks API — if you misspell a field, Databricks will reject it at deploy time, not LHP.

Merge Behavior

Configs are deep-merged in order: DEFAULT → project_defaults → job-specific

# Example: Tags are deep-merged
project_defaults.tags:  {managed_by: "lhp", environment: "dev"}
job-specific.tags:      {layer: "bronze", environment: "prod"}
# Result:               {managed_by: "lhp", environment: "prod", layer: "bronze"}

Note

Nested dicts are deep-merged, but lists are REPLACED (not appended).

Multi-Job Orchestration

Overview

LakehousePlumber supports generating multiple orchestration jobs instead of a single job, enabling better organization for large projects.

When to Use Multi-Job Mode

  • Project Segregation: Separate jobs by data layer (bronze, silver, gold)

  • Resource Optimization: Different compute requirements per job

  • Team Ownership: Assign jobs to different teams

  • SLA Management: Run critical jobs with higher priority/resources

  • Cost Control: Apply different schedules and timeout policies

Enabling Multi-Job Mode

Add the job_name property to your flowgroup YAMLs:

pipelines/bronze/customer_ingestion.yaml
pipeline: data_bronze
flowgroup: customer_ingestion
job_name:
  - bronze_ingestion_job   # Assigns this flowgroup to a specific job

actions:
  - name: load_customer
    type: load
    # ... rest of configuration

Validation Rules

  • All-or-nothing: If ANY flowgroup has job_name, ALL must have it

  • Format: Alphanumeric, underscore, and hyphen only (^[a-zA-Z0-9_-]+$)

  • Pipeline filter restriction: Cannot use --pipeline flag with multi-job mode

Generated Files

When using multi-job mode, LHP generates:

resources/
├── bronze_ingestion_job.job.yml      # Individual job
├── silver_transform_job.job.yml      # Individual job
├── gold_analytics_job.job.yml        # Individual job
└── my_project_master.job.yml         # Master orchestration job

The master job coordinates all individual jobs using job_task references with proper dependencies.
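
As a rough illustration of the pattern (not LHP's exact output), a master job can reference the individual bundle jobs with run_job_task entries and chain them with depends_on:

# Illustrative sketch only; the actual generated file contents may differ
resources:
  jobs:
    my_project_master:
      name: my_project_master
      tasks:
        - task_key: bronze_ingestion
          run_job_task:
            job_id: ${resources.jobs.bronze_ingestion_job.id}
        - task_key: silver_transform
          depends_on:
            - task_key: bronze_ingestion
          run_job_task:
            job_id: ${resources.jobs.silver_transform_job.id}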

Configuration Templates

When you initialize a project, configuration templates are created in config/:

  • config/job_config.yaml.tmpl - Job configuration template with examples

  • config/pipeline_config.yaml.tmpl - Pipeline configuration template with examples

Getting Started with Templates

# Copy and customize templates
cp config/job_config.yaml.tmpl config/job_config.yaml
cp config/pipeline_config.yaml.tmpl config/pipeline_config.yaml

# Edit with your settings, then use
lhp generate -e dev -pc config/pipeline_config.yaml
lhp deps -jc config/job_config.yaml --bundle-output

Best Practices

Environment-Specific Configuration

Different environments typically need different settings. We recommend maintaining separate configuration files for each environment.

Recommended file structure:

config/
├── pipeline_config-dev.yaml
├── pipeline_config-prod.yaml
├── job_config-dev.yaml
└── job_config-prod.yaml
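
With this layout, point the CLI at the matching file for each environment (file names follow the structure above):

# Development
lhp generate -e dev -pc config/pipeline_config-dev.yaml
lhp deps -jc config/job_config-dev.yaml --bundle-output

# Production
lhp generate -e prod -pc config/pipeline_config-prod.yaml
lhp deps -jc config/job_config-prod.yaml --bundle-output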

Common differences by environment:

Setting             Development                        Production
------------------  ---------------------------------  --------------------------
Cluster size        Smaller nodes (cost efficiency)    Larger nodes (performance)
Concurrency         Lower (1-2 concurrent runs)        Higher (3+ concurrent runs)
Notifications       Minimal or none                    Full alerting to ops teams
Timeouts            Relaxed (for debugging)            Strict (SLA enforcement)
Performance target  STANDARD                           PERFORMANCE_OPTIMIZED

See also

For complete CI/CD integration patterns including environment-specific deployment workflows, see CI/CD Reference.

CLI Quick Reference

Initialize Project with Bundles

lhp init --bundle my-project

Generate Code and Resources

# Basic generation
lhp generate -e dev

# With pipeline config
lhp generate -e dev -pc config/pipeline_config.yaml

# Force regeneration
lhp generate -e dev --force

# Disable bundle sync
lhp generate -e dev --no-bundle

Generate Orchestration Jobs

# Generate job with dependency analysis
lhp deps --format job --job-name my_etl --bundle-output

# With custom config
lhp deps -jc config/job_config.yaml --bundle-output

Deploy to Databricks

# Validate bundle
databricks bundle validate --target dev

# Deploy
databricks bundle deploy --target dev

# Run a specific pipeline
databricks bundle run my_pipeline --target dev

See also

For dependency analysis commands and output formats, see Dependency Analysis & Job Generation.

Advanced Topics

Bundle Sync Behavior

Bundle synchronization happens automatically after successful generation:

  • Triggers: After lhp generate completes successfully

  • Scope: Processes all generated Python files in output directory

  • Cleanup: Removes resource files for deleted/excluded pipelines

  • Idempotent: Safe to run multiple times

Multi-Environment Setup

Configure multiple environments with different settings in databricks.yml:

databricks.yml
bundle:
  name: my-data-platform

targets:
  dev:
    workspace:
      host: https://dev-workspace.cloud.databricks.com
      root_path: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target}

  prod:
    workspace:
      host: https://prod-workspace.cloud.databricks.com
      root_path: /Shared/.bundle/${bundle.name}/${bundle.target}
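
Each target can then be deployed independently with the same bundle commands shown earlier, for example:

# Deploy to development
databricks bundle deploy --target dev

# Deploy to production
databricks bundle deploy --target prod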

Troubleshooting

Issue                              Solution
---------------------------------  ----------------------------------------------------
Bundle sync not triggered          Ensure databricks.yml exists in project root
Resource files not generated       Check generated Python files exist and are valid
Bundle validation fails            Verify YAML syntax in generated resource files
Deployment permission errors       Check workspace permissions and bundle target paths
Obsolete resources not cleaned up  Run lhp generate --force to trigger full sync