Concepts & Architecture¶
At its core, Lakehouse Plumber converts declarative YAML into regular Databricks Lakeflow Declarative Pipelines (ETL) Python code. The YAML files are intentionally simple; the heavy lifting happens inside the Plumber engine at generation time. This page explains the key building blocks you will interact with.
FlowGroups¶
A FlowGroup represents a logical slice of your pipeline, often a single source table or business entity. A YAML file can contain one or more FlowGroups (see Multi-Flowgroup YAML Files for details on multi-flowgroup files).
Required keys in a FlowGroup YAML file
pipeline: bronze_raw # pipeline name (logical)
flowgroup: customer_bronze_ingestion # unique name for the flowgroup (logical)
actions: # list of steps in the flowgroup
Optional keys in a FlowGroup YAML file
job_name:
- NCR                        # Optional: Assign flowgroup to a specific orchestration job
variables:                   # Optional: Define local variables for this flowgroup
  entity: customer
  table: customer_raw
The job_name property enables multi-job orchestration, allowing you to split your flowgroups into separate Databricks jobs rather than a single monolithic orchestration job. This is useful for:
Separate scheduling - Different jobs can run on different schedules (e.g., hourly POS data, daily ERP data)
Isolated execution - Jobs run independently with separate concurrency and resource settings
Modular organization - Group related flowgroups by source system, business domain, or data criticality
Flexible configuration - Each job can have its own tags, notifications, timeouts, and performance targets
Important
All-or-Nothing Rule: If job_name is defined for any flowgroup in your project, it must be defined for all flowgroups. This ensures consistent orchestration behavior and prevents configuration errors.
Example with multi-job orchestration:
pipeline: bronze_ncr
flowgroup: pos_transaction_bronze
job_name:
- NCR                        # Assigns this flowgroup to the "NCR" orchestration job
actions:
- name: load_pos_data
  type: load
  source:
    type: cloudfiles
    path: "/mnt/landing/ncr/pos/*.parquet"
  target: v_pos_raw
When job_name is used:
Each unique job_name generates a separate Databricks job file (e.g., NCR.job.yml, SAP_SFCC.job.yml), as sketched below
A master orchestration job is generated that coordinates execution across all jobs
Dependencies between jobs are automatically detected and handled in the master job
Per-job configuration is managed through multi-document job_config.yaml files
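For example, a second flowgroup assigned to a different job is generated into its own job file. The sketch below reuses the SAP_SFCC job name mentioned above; the pipeline, flowgroup, and path values are illustrative only:
pipeline: bronze_sap
flowgroup: sap_orders_bronze
job_name:
- SAP_SFCC                   # Generated into SAP_SFCC.job.yml, separate from NCR.job.yml
actions:
- name: load_sap_orders
  type: load
  source:
    type: cloudfiles
    path: "/mnt/landing/sap/orders/*.parquet"   # illustrative path
  target: v_sap_orders_raw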
See also
For complete details on multi-job orchestration, job configuration, and the master orchestration job, see Databricks Asset Bundles Integration.
Note
FlowGroup vs Pipeline:
A FlowGroup represents a logical slice of your pipeline, often a single source table or business entity.
A Pipeline is a logical grouping of FlowGroups; it is used to group the generated Python files in the same folder.
Lakeflow Declarative Pipelines are declarative (as the name suggests), so the order of the actions is determined at runtime by the Lakeflow engine based on the dependencies between the tables/views.
YAML files can contain one flowgroup (traditional) or multiple flowgroups (see Multi-Flowgroup YAML Files).
Actions¶
Every FlowGroup lists one or more Actions. Actions come in three top-level types:
| Type | Purpose |
|---|---|
| Load | Bring data into a temporary view (e.g. CloudFiles, Delta, JDBC, SQL, Python, custom_datasource). |
| Transform | Manipulate data in one or more steps (SQL, Python, schema adjustments, data-quality checks, temp tables…). |
| Write | Persist the final dataset to a streaming_table, materialized_view, or external sink (Kafka, Delta, custom API). |
Note
You may chain zero or many Transform actions between a Load and a Write.
Important
The order of the actions is determined at runtime by the Lakeflow engine based on the dependencies between the tables/views, not by the order in the YAML file or the generated Python file.
For a complete catalogue of Action sub-types and their options see Actions Reference.
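As an illustration, a minimal Load → Transform → Write chain could look like the sketch below. The load and write shapes mirror the examples elsewhere on this page; the transform fields (type: transform, sql, target) are assumptions for illustration, so check the Actions Reference for the actual sub-types and options:
actions:
- name: load_orders
  type: load
  source:
    type: cloudfiles
    path: "/mnt/landing/orders/*.parquet"       # illustrative path
  target: v_orders_raw

- name: clean_orders                            # illustrative transform; see Actions Reference
  type: transform
  source: v_orders_raw
  sql: "SELECT * FROM v_orders_raw WHERE order_id IS NOT NULL"
  target: v_orders_clean

- name: write_orders
  type: write
  source: v_orders_clean
  write_target:
    type: streaming_table
    database: "${catalog}.${bronze_schema}"     # substitution tokens are illustrative
    table: "orders"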
Presets¶
A Preset is a YAML file that provides default configuration snippets you can reuse across FlowGroups. Presets inject default values that are merged with explicit configurations in templates and flowgroups.
Common use cases:
Standardised table properties for all Bronze streaming tables
CloudFiles ingestion options (error handling, schema evolution)
Spark configuration tuning
Example preset file:
name: cloudfiles_defaults
version: "1.0"
description: "Standard CloudFiles options"
defaults:
  load_actions:
    cloudfiles:
      options:
        cloudFiles.rescuedDataColumn: "_rescued_data"
        ignoreCorruptFiles: "true"
        ignoreMissingFiles: "true"
        cloudFiles.maxFilesPerTrigger: 200
Usage in a FlowGroup:
presets:
- cloudfiles_defaults

actions:
- name: load_data
  type: load
  source:
    type: cloudfiles
    options:
      cloudFiles.format: csv            # Merged with preset options
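After the preset is merged, the effective CloudFiles options for load_data would look roughly like this (assuming no other presets or flowgroup settings override them):
options:
  cloudFiles.format: csv                        # from the flowgroup
  cloudFiles.rescuedDataColumn: "_rescued_data" # from cloudfiles_defaults
  ignoreCorruptFiles: "true"                    # from cloudfiles_defaults
  ignoreMissingFiles: "true"                    # from cloudfiles_defaults
  cloudFiles.maxFilesPerTrigger: 200            # from cloudfiles_defaults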
For complete preset documentation see Presets Reference.
Templates¶
While presets inject reusable values, Templates inject reusable action patterns; think of them as parametrised macros.
In a template file you define parameters and a list of actions that reference those parameters. Inside a FlowGroup you apply the template and provide actual arguments.
Example of a template file:
# This is a template for ingesting CSV files with schema enforcement.
# It is used to generate the actions for the pipeline.
# Within the pipeline, all that needs to be defined are the parameters for the
# table name and the landing folder; the template generates the actions.

name: csv_ingestion_template
version: "1.0"
description: "Standard template for ingesting CSV files with schema enforcement"

presets:
- bronze_layer

parameters:
- name: table_name
  required: true
  description: "Name of the table to ingest"
- name: landing_folder
  required: true
  description: "Name of the landing folder"

actions:
- name: load_{{ table_name }}_csv
  type: load
  readMode: "stream"
  operational_metadata: ["_source_file_path", "_source_file_size", "_source_file_modification_time", "_record_hash"]
  source:
    type: cloudfiles
    path: "${landing_volume}/{{ landing_folder }}/*.csv"
    format: csv
    options:
      cloudFiles.format: csv
      header: True
      delimiter: "|"
      cloudFiles.maxFilesPerTrigger: 11
      cloudFiles.inferColumnTypes: False
      cloudFiles.schemaEvolutionMode: "addNewColumns"
      cloudFiles.rescuedDataColumn: "_rescued_data"
      cloudFiles.schemaHints: "schemas/{{ table_name }}_schema.yaml"

  target: v_{{ table_name }}_cloudfiles
  description: "Load {{ table_name }} CSV files from landing volume"

- name: write_{{ table_name }}_cloudfiles
  type: write
  source: v_{{ table_name }}_cloudfiles
  write_target:
    type: streaming_table
    database: "${catalog}.${raw_schema}"
    table: "{{ table_name }}"
  description: "Write {{ table_name }} to raw layer"
Example of a flowgroup using the template:
# This pipeline ingests the customer table from CSV files into the raw schema.
# The pipeline value puts the generated files in the same folder for the pipeline to pick up.
pipeline: raw_ingestions
# Flowgroups are conceptual artifacts with no functional purpose;
# they are used to group actions together in the generated files.
flowgroup: customer_ingestion

# Use the template to generate the actions for the pipeline.
# Template parameters pass in the table name and the landing folder.
use_template: csv_ingestion_template
template_parameters:
  table_name: customer
  landing_folder: customer
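With table_name: customer and landing_folder: customer, the template expands to actions equivalent to the following abridged sketch (the full option set comes from the template above; only the parametrised parts are shown):
actions:
- name: load_customer_csv
  type: load
  source:
    type: cloudfiles
    path: "${landing_volume}/customer/*.csv"
  target: v_customer_cloudfiles

- name: write_customer_cloudfiles
  type: write
  source: v_customer_cloudfiles
  write_target:
    type: streaming_table
    database: "${catalog}.${raw_schema}"
    table: "customer"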
Configuration Management¶
LakehousePlumber provides two configuration files to customize how your pipelines and orchestration jobs are deployed to Databricks:
Pipeline Configuration (pipeline_config.yaml) - Controls SDP pipeline settings like compute, runtime, notifications
Job Configuration (job_config.yaml) - Controls orchestration job settings like concurrency, schedules, permissions
See also
For complete configuration options, examples, and best practices, see the Configuration Management section in Databricks Asset Bundles Integration.
Substitutions & Secrets¶
LakehousePlumber supports environment-aware tokens, local variables, secret references, and file substitutions that make your pipeline definitions portable across environments.
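For example, the ${...} tokens already used in the template above are replaced per environment at generation time; the resolved values shown in the comments below are purely illustrative:
path: "${landing_volume}/customer/*.csv"        # e.g. /Volumes/dev/landing/customer/*.csv in dev
database: "${catalog}.${raw_schema}"            # e.g. dev_catalog.raw in dev, prod_catalog.raw in prod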
See also
For the full reference on all substitution syntaxes, processing order, secret management, and file substitution support, see Substitutions & Secrets.
Operational Metadata¶
Operational metadata columns provide lineage, data provenance, and processing context.
They are defined at the project level in lhp.yaml and can be selectively enabled
at the preset, flowgroup, or action level with additive behavior.
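For instance, the CSV ingestion template above enables a selection of metadata columns on its load action. A minimal action-level sketch reusing those column names (the columns actually available depend on the definitions in your lhp.yaml):
- name: load_customer_csv
  type: load
  operational_metadata: ["_source_file_path", "_record_hash"]   # added to columns enabled at preset/flowgroup level
  source:
    type: cloudfiles
    path: "${landing_volume}/customer/*.csv"
  target: v_customer_cloudfiles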
See also
For the full reference on column definitions, target type compatibility, usage patterns, version requirements, and event log configuration, see Operational Metadata.
State Management & Smart Generation¶
Lakehouse Plumber keeps a small state file, .lhp_state.json, that
maps generated Python files to their source YAML. It records checksums and
dependency links so that future lhp generate runs can:
re-process only new or stale FlowGroups.
skip files whose inputs did not change.
optionally clean up orphaned files when you delete YAML.
This behaviour is similar to Gradle’s incremental build or Terraform’s state management.
How state management works:
{
  "version": "1.0",
  "generated_files": {
    "customer_ingestion.py": {
      "source_yaml": "pipelines/bronze/customer_ingestion.yaml",
      "checksum": "a1b2c3d4e5f6",
      "environment": "dev",
      "dependencies": ["presets/bronze_layer.yaml"]
    }
  }
}
Benefits:
Faster regeneration - Only changed files are processed
Dependency tracking - Upstream changes trigger downstream regeneration
Cleanup support - Detect and remove orphaned generated files
CI/CD optimization - Skip unchanged pipeline generation in builds
Dependency Resolver¶
Transforms may reference earlier views or tables via the source field. LHP builds a
directed acyclic graph (DAG) of these references, detects cycles, and ensures downstream
FlowGroups regenerate when upstream definitions change.
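For example, if a downstream FlowGroup reads the bronze table written by an upstream FlowGroup, LHP records that edge and regenerates the downstream FlowGroup whenever the upstream definition changes. A minimal sketch (the downstream action's field names are assumptions for illustration; see the Actions Reference):
# Upstream flowgroup writes ${catalog}.${raw_schema}.customer
- name: write_customer_cloudfiles
  type: write
  source: v_customer_cloudfiles
  write_target:
    type: streaming_table
    database: "${catalog}.${raw_schema}"
    table: "customer"

# Downstream flowgroup reads that table, creating a dependency edge
- name: load_customer_bronze
  type: load
  source:
    type: delta                                 # delta load; field names here are illustrative
    table: "${catalog}.${raw_schema}.customer"
  target: v_customer_bronze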
See also
Dependency Analysis & Job Generation for the full 5-step resolution process, dependency chain
examples, and the lhp deps command.
Pipeline Generation Workflow¶
The complete pipeline generation process follows this workflow:
graph TD
subgraph "Discovery Phase"
A[Scan YAML Files] --> B[Apply Include Patterns]
B --> C[Parse FlowGroups]
end
subgraph "Resolution Phase"
C --> D[Apply Presets]
D --> E[Expand Templates]
E --> F[Apply Substitutions]
F --> G[Validate Configuration]
end
subgraph "Generation Phase"
G --> H[Resolve Dependencies]
H --> I[Check State]
I --> J{Changed?}
J -->|Yes| K[Generate Code]
J -->|No| L[Skip Generation]
K --> M[Update State]
L --> M
end
subgraph "Output"
M --> N[Python DLT Files]
end
Key optimization points:
Smart discovery - Include patterns reduce files to process
Incremental generation - State tracking skips unchanged files
Dependency awareness - Changes propagate to affected downstream files
Validation early - Catch errors before code generation
Parallel processing - Independent FlowGroups can be processed simultaneously
Troubleshooting¶
Common issues include state management problems (a stale `.lhp_state.json`), dependency resolution failures, and performance issues in large projects.
See also
Error Reference for error codes, resolution steps, and general troubleshooting tips (state debugging, dependency debugging, performance optimization).
What’s Next?¶
Now that you understand the core building blocks of Lakehouse Plumber, explore these topics:
Substitutions & Secrets - Environment tokens, local variables, secrets, and file substitution support.
Operational Metadata - Audit columns, version requirements, and event log configuration.
Templates Reference - Reuse common patterns across your pipelines.
Databricks Asset Bundles Integration - Deploy and manage your pipelines as code.
Dependency Analysis & Job Generation - Pipeline dependency analysis and orchestration job generation.
For hands-on examples and complete workflows, check out Getting Started.