Concepts & Architecture

At its core, Lakehouse Plumber converts declarative YAML into regular Databricks Lakeflow Declarative Pipelines (ETL) Python code. The YAML files are intentionally simple; the heavy lifting happens inside the Plumber engine at generation time. This page explains the key building blocks you will interact with.

FlowGroups

A FlowGroup represents a logical slice of your pipeline, often a single source table or business entity. YAML files can contain one or multiple FlowGroups (see Multi-Flowgroup YAML Files for details on multi-flowgroup files).

Required keys in a FlowGroup YAML file

pipeline:  bronze_raw                 # pipeline name (logical)
flowgroup: customer_bronze_ingestion  # unique name for the flowgroup (logical)
actions:                              # list of steps in the flowgroup

Optional keys in a FlowGroup YAML file

job_name:
  - NCR            # Optional: Assign flowgroup to a specific orchestration job
variables:         # Optional: Define local variables for this flowgroup
  entity: customer
  table: customer_raw

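Local variables can then be referenced by actions in the same flowgroup. A minimal sketch, assuming the {{ }} reference syntax used for template parameters also applies to local variables and that a sql load source takes a sql key (see Substitutions & Secrets and the Actions Reference for the exact forms):

actions:
  - name: load_{{ entity }}_raw        # uses the local variables defined above
    type: load
    source:
      type: sql
      sql: "SELECT * FROM {{ table }}"  # assumed key; illustrative only
    target: v_{{ entity }}_raw
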
The job_name property enables multi-job orchestration, allowing you to split your flowgroups into separate Databricks jobs rather than a single monolithic orchestration job. This is useful for:

  • Separate scheduling - Different jobs can run on different schedules (e.g., hourly POS data, daily ERP data)

  • Isolated execution - Jobs run independently with separate concurrency and resource settings

  • Modular organization - Group related flowgroups by source system, business domain, or data criticality

  • Flexible configuration - Each job can have its own tags, notifications, timeouts, and performance targets

Important

All-or-Nothing Rule: If job_name is defined for any flowgroup in your project, it must be defined for all flowgroups. This ensures consistent orchestration behavior and prevents configuration errors.

Example with multi-job orchestration:

pipelines/ncr/pos_transactions.yaml
pipeline: bronze_ncr
flowgroup: pos_transaction_bronze
job_name:
  - NCR  # Assigns this flowgroup to the "NCR" orchestration job

actions:
  - name: load_pos_data
    type: load
    source:
      type: cloudfiles
      path: "/mnt/landing/ncr/pos/*.parquet"
    target: v_pos_raw

When job_name is used:

  • Each unique job_name generates a separate Databricks job file (e.g., NCR.job.yml, SAP_SFCC.job.yml)

  • A master orchestration job is generated that coordinates execution across all jobs

  • Dependencies between jobs are automatically detected and handled in the master job

  • Per-job configuration is managed through multi-document job_config.yaml files

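To illustrate the last point, a multi-document job_config.yaml might look like the sketch below, with one YAML document per job separated by ---. The key names shown are illustrative assumptions; the authoritative schema is in Databricks Asset Bundles Integration.

# job_config.yaml — one document per job (illustrative keys only)
job_name: NCR
tags:
  source_system: ncr
---
job_name: SAP_SFCC
tags:
  source_system: sap
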
See also

For complete details on multi-job orchestration, job configuration, and the master orchestration job, see Databricks Asset Bundles Integration.

Note

FlowGroup vs Pipeline:

  • A FlowGroup represents a logical slice of your pipeline, often a single source table or business entity.

  • A Pipeline is a logical grouping of FlowGroups. It is used to group the generated Python files in the same folder.

  • Lakeflow Declarative Pipelines are declarative (as the name suggests), hence the order of the actions is determined at runtime by the Lakeflow engine based on the dependencies between the tables/views.

  • YAML files can contain one flowgroup (traditional) or multiple flowgroups (see Multi-Flowgroup YAML Files).

Actions

Every FlowGroup lists one or more Actions. Actions come in three top-level types:

  • Load: bring data into a temporary view (e.g. CloudFiles, Delta, JDBC, SQL, Python, custom_datasource).

  • Transform: manipulate data in one or more steps (SQL, Python, schema adjustments, data-quality checks, temp tables…).

  • Write: persist the final dataset to a streaming_table, materialized_view, or external sink (Kafka, Delta, custom API).

Note

  • You may chain zero or more Transform actions between a Load and a Write.

Important

  • The order of the actions is determined at runtime by the Lakeflow engine based on the dependencies between the tables/views, not by the order in the YAML file or the generated Python file.

For a complete catalogue of Action sub-types and their options see Actions Reference.

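To make the chaining concrete, here is a hedged sketch of a Load → Transform → Write sequence. The transform keys shown (transform_type, sql) are illustrative assumptions; the Actions Reference documents the exact sub-type options.

actions:
  - name: load_orders
    type: load
    source:
      type: cloudfiles
      path: "/mnt/landing/orders/*.json"
    target: v_orders_raw

  - name: clean_orders               # reads the load target via its source field
    type: transform
    transform_type: sql              # assumed key; see Actions Reference
    source: v_orders_raw
    sql: "SELECT * FROM v_orders_raw WHERE order_id IS NOT NULL"
    target: v_orders_clean

  - name: write_orders
    type: write
    source: v_orders_clean
    write_target:
      type: streaming_table
      database: "${catalog}.${raw_schema}"
      table: orders
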
Presets

A Preset is a YAML file that provides default configuration snippets you can reuse across FlowGroups. Presets inject default values that are merged with explicit configurations in templates and flowgroups.

Common use cases:

  • Standardised table properties for all Bronze streaming tables

  • CloudFiles ingestion options (error handling, schema evolution)

  • Spark configuration tuning

Example preset file:

presets/cloudfiles_defaults.yaml
name: cloudfiles_defaults
version: "1.0"
description: "Standard CloudFiles options"

defaults:
  load_actions:
    cloudfiles:
      options:
        cloudFiles.rescuedDataColumn: "_rescued_data"
        ignoreCorruptFiles: "true"
        ignoreMissingFiles: "true"
        cloudFiles.maxFilesPerTrigger: 200

Usage in a FlowGroup:

presets:
  - cloudfiles_defaults

actions:
  - name: load_data
    type: load
    source:
      type: cloudfiles
      options:
        cloudFiles.format: csv  # Merged with preset options

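After the merge, the effective options for the action are the preset defaults plus the explicit values (assuming explicit flowgroup values win on conflict; see Presets Reference for the exact precedence rules):

options:
  cloudFiles.format: csv                           # from the flowgroup
  cloudFiles.rescuedDataColumn: "_rescued_data"    # from the preset
  ignoreCorruptFiles: "true"                       # from the preset
  ignoreMissingFiles: "true"                       # from the preset
  cloudFiles.maxFilesPerTrigger: 200               # from the preset
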
For complete preset documentation see Presets Reference.

Templates

While presets inject reusable values, Templates inject reusable action patterns; think of them as parametrised macros.

In a template file you define parameters and a list of actions that reference those parameters. Inside a FlowGroup you apply the template and provide actual arguments.

Example of a template file:

templates/csv_ingestion_template.yaml
# This is a template for ingesting CSV files with schema enforcement.
# It is used to generate the actions for the pipeline.
# Within the pipeline, all that needs to be defined are the parameters
# for the table name and landing folder; the template generates the actions.

name: csv_ingestion_template
version: "1.0"
description: "Standard template for ingesting CSV files with schema enforcement"

presets:
  - bronze_layer

parameters:
  - name: table_name
    required: true
    description: "Name of the table to ingest"
  - name: landing_folder
    required: true
    description: "Name of the landing folder"

actions:
  - name: load_{{ table_name }}_csv
    type: load
    readMode: "stream"
    operational_metadata: ["_source_file_path", "_source_file_size", "_source_file_modification_time", "_record_hash"]
    source:
      type: cloudfiles
      path: "${landing_volume}/{{ landing_folder }}/*.csv"
      format: csv
      options:
        cloudFiles.format: csv
        header: true
        delimiter: "|"
        cloudFiles.maxFilesPerTrigger: 11
        cloudFiles.inferColumnTypes: false
        cloudFiles.schemaEvolutionMode: "addNewColumns"
        cloudFiles.rescuedDataColumn: "_rescued_data"
        cloudFiles.schemaHints: "schemas/{{ table_name }}_schema.yaml"
    target: v_{{ table_name }}_cloudfiles
    description: "Load {{ table_name }} CSV files from landing volume"

  - name: write_{{ table_name }}_cloudfiles
    type: write
    source: v_{{ table_name }}_cloudfiles
    write_target:
      type: streaming_table
      database: "${catalog}.${raw_schema}"
      table: "{{ table_name }}"
      description: "Write {{ table_name }} to raw layer"
Example of a flowgroup using the template:

pipelines/01_raw_ingestion/csv_ingestions/customer_ingestion.yaml
# This pipeline ingests the customer table from the CSV files into the raw schema.
# The pipeline key puts the generated files in the same folder for the pipeline to pick up.
pipeline: raw_ingestions
# Flowgroups are conceptual artifacts with no functional purpose;
# they are used to group actions together in the generated files.
flowgroup: customer_ingestion

# Use the template to generate the actions for the pipeline.
# Template parameters pass in the table name and landing folder;
# the template generates the actions for the pipeline.
use_template: csv_ingestion_template
template_parameters:
  table_name: customer
  landing_folder: customer

Configuration Management

LakehousePlumber provides two configuration files to customize how your pipelines and orchestration jobs are deployed to Databricks:

  • Pipeline Configuration (pipeline_config.yaml) - Controls SDP pipeline settings like compute, runtime, notifications

  • Job Configuration (job_config.yaml) - Controls orchestration job settings like concurrency, schedules, permissions

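For flavour, a minimal pipeline_config.yaml sketch is shown below. The key names are illustrative assumptions only; the authoritative schema and examples live in Databricks Asset Bundles Integration.

# pipeline_config.yaml — illustrative keys, not the definitive schema
serverless: true                  # compute (assumed key)
channel: "CURRENT"                # runtime channel (assumed key)
notifications:
  on_failure:
    - "data-team@example.com"     # hypothetical recipient
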
See also

For complete configuration options, examples, and best practices, see the Configuration Management section in Databricks Asset Bundles Integration.

Substitutions & Secrets

LakehousePlumber supports environment-aware tokens, local variables, secret references, and file substitutions that make your pipeline definitions portable across environments.

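For example, the ${catalog}, ${raw_schema}, and ${landing_volume} tokens used in the template above would typically resolve from an environment-specific substitution file. A hedged sketch of one possible layout (the real structure is defined in Substitutions & Secrets):

substitutions/dev.yaml (illustrative layout)
dev:
  catalog: dev_catalog
  raw_schema: raw
  landing_volume: "/Volumes/dev_catalog/landing"
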
See also

For the full reference on all substitution syntaxes, processing order, secret management, and file substitution support, see Substitutions & Secrets.

Operational Metadata

Operational metadata columns provide lineage, data provenance, and processing context. They are defined at the project level in lhp.yaml and can be selectively enabled at the preset, flowgroup, or action level with additive behavior.

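For instance, the template earlier on this page enables columns per action; assuming the additive behaviour described above, a flowgroup-level sketch might look like:

# Flowgroup-level enablement (illustrative; the columns must be defined in lhp.yaml)
operational_metadata: ["_source_file_path", "_record_hash"]
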
See also

For the full reference on column definitions, target type compatibility, usage patterns, version requirements, and event log configuration, see Operational Metadata.

State Management & Smart Generation

Lakehouse Plumber keeps a small state file, .lhp_state.json, that maps generated Python files to their source YAML. It records checksums and dependency links so that future lhp generate runs can:

  • re-process only new or stale FlowGroups.

  • skip files whose inputs did not change.

  • optionally clean up orphaned files when you delete YAML.

This behaviour is similar to Gradle’s incremental build or Terraform’s state management.

How state management works:

.lhp_state.json example
{
  "version": "1.0",
  "generated_files": {
    "customer_ingestion.py": {
      "source_yaml": "pipelines/bronze/customer_ingestion.yaml",
      "checksum": "a1b2c3d4e5f6",
      "environment": "dev",
      "dependencies": ["presets/bronze_layer.yaml"]
    }
  }
}

Benefits:

  • Faster regeneration - Only changed files are processed

  • Dependency tracking - Upstream changes trigger downstream regeneration

  • Cleanup support - Detect and remove orphaned generated files

  • CI/CD optimization - Skip unchanged pipeline generation in builds

Dependency Resolver

Transforms may reference earlier views or tables via the source field. LHP builds a directed acyclic graph (DAG) of these references, detects cycles, and ensures downstream FlowGroups regenerate when upstream definitions change.

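As an illustration, a downstream transform whose source points at a table produced by another FlowGroup creates an edge in that DAG. The transform keys below (transform_type, sql) are the same illustrative assumptions as in the Actions sketch above:

# silver flowgroup; its source references a table written by the bronze flowgroup
actions:
  - name: enrich_customer
    type: transform
    transform_type: sql                          # assumed key; see Actions Reference
    source: ${catalog}.${raw_schema}.customer    # upstream table -> DAG edge
    sql: "SELECT *, current_timestamp() AS enriched_at FROM ${catalog}.${raw_schema}.customer"
    target: v_customer_enriched
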
See also

Dependency Analysis & Job Generation for the full 5-step resolution process, dependency chain examples, and the lhp deps command.

Pipeline Generation Workflow

The complete pipeline generation process follows this workflow:

graph TD
    subgraph "Discovery Phase"
        A[Scan YAML Files] --> B[Apply Include Patterns]
        B --> C[Parse FlowGroups]
    end

    subgraph "Resolution Phase"
        C --> D[Apply Presets]
        D --> E[Expand Templates]
        E --> F[Apply Substitutions]
        F --> G[Validate Configuration]
    end

    subgraph "Generation Phase"
        G --> H[Resolve Dependencies]
        H --> I[Check State]
        I --> J{Changed?}
        J -->|Yes| K[Generate Code]
        J -->|No| L[Skip Generation]
        K --> M[Update State]
        L --> M
    end

    subgraph "Output"
        M --> N[Python DLT Files]
    end

Key optimization points:

  • Smart discovery - Include patterns reduce files to process

  • Incremental generation - State tracking skips unchanged files

  • Dependency awareness - Changes propagate to affected downstream files

  • Early validation - Catch errors before code generation

  • Parallel processing - Independent FlowGroups can be processed simultaneously

Troubleshooting

Common issues include state management problems (stale `.lhp_state.json`), dependency resolution failures, and slow generation on large projects.

See also

Error Reference for error codes, resolution steps, and general troubleshooting tips (state debugging, dependency debugging, performance optimization).

What’s Next?

Now that you understand the core building blocks of Lakehouse Plumber, explore the reference pages linked throughout this section.

For hands-on examples and complete workflows, check out Getting Started.