Concepts & Architecture¶
At its core, Lakehouse Plumber converts declarative YAML into regular Databricks Lakeflow Declarative Pipelines (ETL) Python code. The YAML files are intentionally simple; the heavy lifting happens inside the Plumber engine at generation time. This page explains the key building blocks you will interact with.
FlowGroups¶
A FlowGroup represents a logical slice of your pipeline, often a single source table or business entity. A YAML file can contain one or more FlowGroups (see Multi-Flowgroup YAML Files for details on multi-flowgroup files).
Required keys in a FlowGroup YAML file
pipeline: bronze_raw # pipeline name (logical)
flowgroup: customer_bronze_ingestion # unique name for the flowgroup (logical)
actions: # list of steps in the flowgroup
Optional keys in a FlowGroup YAML file
job_name:
- NCR                        # Optional: Assign flowgroup to a specific orchestration job
variables:                   # Optional: Define local variables for this flowgroup
  entity: customer
  table: customer_raw
The job_name property enables multi-job orchestration, allowing you to split your flowgroups into separate Databricks jobs rather than a single monolithic orchestration job. This is useful for:
Separate scheduling - Different jobs can run on different schedules (e.g., hourly POS data, daily ERP data)
Isolated execution - Jobs run independently with separate concurrency and resource settings
Modular organization - Group related flowgroups by source system, business domain, or data criticality
Flexible configuration - Each job can have its own tags, notifications, timeouts, and performance targets
Important
All-or-Nothing Rule: If job_name is defined for any flowgroup in your project, it must be defined for all flowgroups. This ensures consistent orchestration behavior and prevents configuration errors.
Example with multi-job orchestration:
pipeline: bronze_ncr
flowgroup: pos_transaction_bronze
job_name:
- NCR                        # Assigns this flowgroup to the "NCR" orchestration job
actions:
- name: load_pos_data
  type: load
  source:
    type: cloudfiles
    path: "/mnt/landing/ncr/pos/*.parquet"
  target: v_pos_raw
When job_name is used:
Each unique job_name generates a separate Databricks job file (e.g., NCR.job.yml, SAP_SFCC.job.yml), as sketched below
A master orchestration job is generated that coordinates execution across all jobs
Dependencies between jobs are automatically detected and handled in the master job
Per-job configuration is managed through multi-document job_config.yaml files
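For example, a second flowgroup assigned to a different job is generated into its own job file. The sketch below reuses the SAP_SFCC job name mentioned above; the pipeline, flowgroup, and path values are illustrative only:
pipeline: bronze_sap
flowgroup: sap_orders_bronze
job_name:
- SAP_SFCC                   # Generated into SAP_SFCC.job.yml, separate from NCR.job.yml
actions:
- name: load_sap_orders
  type: load
  source:
    type: cloudfiles
    path: "/mnt/landing/sap/orders/*.parquet"   # illustrative path
  target: v_sap_orders_raw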
See also
For complete details on multi-job orchestration, job configuration, and the master orchestration job, see Databricks Asset Bundles Integration.
Note
FlowGroup vs Pipeline:
A FlowGroup represents a logical slice of your pipeline, often a single source table or business entity.
A Pipeline is a logical grouping of FlowGroups; it is used to group the generated Python files in the same folder.
Lakeflow Declarative Pipelines are declarative (as the name suggests), so the order of the actions is determined at runtime by the Lakeflow engine based on the dependencies between the tables/views.
YAML files can contain one flowgroup (traditional) or multiple flowgroups (see Multi-Flowgroup YAML Files).
Actions¶
Every FlowGroup lists one or more Actions. Actions come in three top-level types:
| Type | Purpose |
|---|---|
| Load | Bring data into a temporary view (e.g. CloudFiles, Delta, JDBC, SQL, Python, custom_datasource). |
| Transform | Manipulate data in one or more steps (SQL, Python, schema adjustments, data-quality checks, temp tables…). |
| Write | Persist the final dataset to a streaming_table, materialized_view, or external sink (Kafka, Delta, custom API). |
Note
You may chain zero or many Transform actions between a Load and a Write.
Important
The order of the actions is determined at runtime by the Lakeflow engine based on the dependencies between the tables/views, not by the order in the YAML file or the generated Python file.
For a complete catalogue of Action sub-types and their options see Actions Reference.
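As an illustration, a minimal Load → Transform → Write chain could look like the sketch below. The load and write shapes mirror the examples elsewhere on this page; the transform fields (type: transform, sql, target) are assumptions for illustration, so check the Actions Reference for the actual sub-types and options:
actions:
- name: load_orders
  type: load
  source:
    type: cloudfiles
    path: "/mnt/landing/orders/*.parquet"       # illustrative path
  target: v_orders_raw

- name: clean_orders                            # illustrative transform; see Actions Reference
  type: transform
  source: v_orders_raw
  sql: "SELECT * FROM v_orders_raw WHERE order_id IS NOT NULL"
  target: v_orders_clean

- name: write_orders
  type: write
  source: v_orders_clean
  write_target:
    type: streaming_table
    database: "${catalog}.${bronze_schema}"     # substitution tokens are illustrative
    table: "orders"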
Presets¶
A Preset is a YAML file that provides default configuration snippets you can reuse across FlowGroups. Presets inject default values that are merged with explicit configurations in templates and flowgroups.
Common use cases:
Standardised table properties for all Bronze streaming tables
CloudFiles ingestion options (error handling, schema evolution)
Spark configuration tuning
Example preset file:
name: cloudfiles_defaults
version: "1.0"
description: "Standard CloudFiles options"
defaults:
  load_actions:
    cloudfiles:
      options:
        cloudFiles.rescuedDataColumn: "_rescued_data"
        ignoreCorruptFiles: "true"
        ignoreMissingFiles: "true"
        cloudFiles.maxFilesPerTrigger: 200
Usage in a FlowGroup:
presets:
- cloudfiles_defaults

actions:
- name: load_data
  type: load
  source:
    type: cloudfiles
    options:
      cloudFiles.format: csv            # Merged with preset options
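After the preset is merged, the effective CloudFiles options for load_data would look roughly like this (assuming no other presets or flowgroup settings override them):
options:
  cloudFiles.format: csv                        # from the flowgroup
  cloudFiles.rescuedDataColumn: "_rescued_data" # from cloudfiles_defaults
  ignoreCorruptFiles: "true"                    # from cloudfiles_defaults
  ignoreMissingFiles: "true"                    # from cloudfiles_defaults
  cloudFiles.maxFilesPerTrigger: 200            # from cloudfiles_defaults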
For complete preset documentation see Presets Reference.
Templates¶
While presets inject reusable values, Templates inject reusable action patterns; think of them as parametrised macros.
In a template file you define parameters and a list of actions that reference those parameters. Inside a FlowGroup you apply the template and provide actual arguments.
Example of a template file:
# This is a template for ingesting CSV files with schema enforcement.
# It is used to generate the actions for the pipeline.
# Within the pipeline, all that needs to be defined are the parameters for the
# table name and the landing folder; the template generates the actions.

name: csv_ingestion_template
version: "1.0"
description: "Standard template for ingesting CSV files with schema enforcement"

presets:
- bronze_layer

parameters:
- name: table_name
  required: true
  description: "Name of the table to ingest"
- name: landing_folder
  required: true
  description: "Name of the landing folder"

actions:
- name: load_{{ table_name }}_csv
  type: load
  readMode: "stream"
  operational_metadata: ["_source_file_path", "_source_file_size", "_source_file_modification_time", "_record_hash"]
  source:
    type: cloudfiles
    path: "${landing_volume}/{{ landing_folder }}/*.csv"
    format: csv
    options:
      cloudFiles.format: csv
      header: True
      delimiter: "|"
      cloudFiles.maxFilesPerTrigger: 11
      cloudFiles.inferColumnTypes: False
      cloudFiles.schemaEvolutionMode: "addNewColumns"
      cloudFiles.rescuedDataColumn: "_rescued_data"
      cloudFiles.schemaHints: "schemas/{{ table_name }}_schema.yaml"

  target: v_{{ table_name }}_cloudfiles
  description: "Load {{ table_name }} CSV files from landing volume"

- name: write_{{ table_name }}_cloudfiles
  type: write
  source: v_{{ table_name }}_cloudfiles
  write_target:
    type: streaming_table
    database: "${catalog}.${raw_schema}"
    table: "{{ table_name }}"
  description: "Write {{ table_name }} to raw layer"
Example of a flowgroup using the template:
# This pipeline ingests the customer table from CSV files into the raw schema.
# The pipeline value puts the generated files in the same folder for the pipeline to pick up.
pipeline: raw_ingestions
# Flowgroups are conceptual artifacts with no functional purpose;
# they are used to group actions together in the generated files.
flowgroup: customer_ingestion

# Use the template to generate the actions for the pipeline.
# Template parameters pass in the table name and the landing folder.
use_template: csv_ingestion_template
template_parameters:
  table_name: customer
  landing_folder: customer
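With table_name: customer and landing_folder: customer, the template expands to actions equivalent to the following abridged sketch (the full option set comes from the template above; only the parametrised parts are shown):
actions:
- name: load_customer_csv
  type: load
  source:
    type: cloudfiles
    path: "${landing_volume}/customer/*.csv"
  target: v_customer_cloudfiles

- name: write_customer_cloudfiles
  type: write
  source: v_customer_cloudfiles
  write_target:
    type: streaming_table
    database: "${catalog}.${raw_schema}"
    table: "customer"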
Configuration Management¶
LakehousePlumber provides two configuration files to customize how your pipelines and orchestration jobs are deployed to Databricks:
Pipeline Configuration (pipeline_config.yaml) - Controls SDP pipeline settings like compute, runtime, notifications
Job Configuration (job_config.yaml) - Controls orchestration job settings like concurrency, schedules, permissions
See also
For complete configuration options, examples, and best practices, see the Configuration Management section in Databricks Asset Bundles Integration.
Substitutions & Secrets¶
LakehousePlumber supports environment-aware tokens, local variables, secret references, and file substitutions that make your pipeline definitions portable across environments.
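For example, the ${...} tokens already used in the template above are replaced per environment at generation time; the resolved values shown in the comments below are purely illustrative:
path: "${landing_volume}/customer/*.csv"        # e.g. /Volumes/dev/landing/customer/*.csv in dev
database: "${catalog}.${raw_schema}"            # e.g. dev_catalog.raw in dev, prod_catalog.raw in prod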
See also
For the full reference on all substitution syntaxes, processing order, secret management, and file substitution support, see Substitutions & Secrets.
Operational Metadata¶
Operational metadata columns provide lineage, data provenance, and processing context.
They are defined at the project level in lhp.yaml and can be selectively enabled
at the preset, flowgroup, or action level with additive behavior.
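For instance, the CSV ingestion template above enables a selection of metadata columns on its load action. A minimal action-level sketch reusing those column names (the columns actually available depend on the definitions in your lhp.yaml):
- name: load_customer_csv
  type: load
  operational_metadata: ["_source_file_path", "_record_hash"]   # added to columns enabled at preset/flowgroup level
  source:
    type: cloudfiles
    path: "${landing_volume}/customer/*.csv"
  target: v_customer_cloudfiles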
See also
For the full reference on column definitions, target type compatibility, usage patterns, version requirements, and event log configuration, see Operational Metadata.
State Management & Smart Generation¶
Lakehouse Plumber keeps a small state file, .lhp_state.json, that
maps generated Python files to their source YAML. It records checksums and
dependency links so that future lhp generate runs can:
re-process only new or stale FlowGroups.
skip files whose inputs did not change.
optionally clean up orphaned files when you delete YAML.
This behaviour is similar to Gradle’s incremental build or Terraform’s state management.
How state management works:
{
  "version": "1.0",
  "generated_files": {
    "customer_ingestion.py": {
      "source_yaml": "pipelines/bronze/customer_ingestion.yaml",
      "checksum": "a1b2c3d4e5f6",
      "environment": "dev",
      "dependencies": ["presets/bronze_layer.yaml"]
    }
  }
}
Benefits:
Faster regeneration - Only changed files are processed
Dependency tracking - Upstream changes trigger downstream regeneration
Cleanup support - Detect and remove orphaned generated files
CI/CD optimization - Skip unchanged pipeline generation in builds
Dependency Resolver¶
Transforms may reference earlier views or tables via the source field. LHP builds a
directed acyclic graph (DAG) of these references, detects cycles, and ensures downstream
FlowGroups regenerate when upstream definitions change.
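For example, if a downstream FlowGroup reads the bronze table written by an upstream FlowGroup, LHP records that edge and regenerates the downstream FlowGroup whenever the upstream definition changes. A minimal sketch (the downstream action's field names are assumptions for illustration; see the Actions Reference):
# Upstream flowgroup writes ${catalog}.${raw_schema}.customer
- name: write_customer_cloudfiles
  type: write
  source: v_customer_cloudfiles
  write_target:
    type: streaming_table
    database: "${catalog}.${raw_schema}"
    table: "customer"

# Downstream flowgroup reads that table, creating a dependency edge
- name: load_customer_bronze
  type: load
  source:
    type: delta                                 # delta load; field names here are illustrative
    table: "${catalog}.${raw_schema}.customer"
  target: v_customer_bronze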
See also
Dependency Analysis & Job Generation for the full 5-step resolution process, dependency chain
examples, and the lhp deps command.
Pipeline Generation Workflow¶
The complete pipeline generation process follows this workflow:
graph TD
subgraph "Discovery Phase"
A[Scan YAML Files] --> B[Apply Include Patterns]
B --> C[Parse FlowGroups]
end
subgraph "Resolution Phase"
C --> D[Apply Presets]
D --> E[Expand Templates]
E --> F[Apply Substitutions]
F --> G[Validate Configuration]
end
subgraph "Generation Phase"
G --> H[Resolve Dependencies]
H --> I[Check State]
I --> J{Changed?}
J -->|Yes| K[Generate Code]
J -->|No| L[Skip Generation]
K --> M[Update State]
L --> M
end
subgraph "Output"
M --> N[Python DLT Files]
end
Key optimization points:
Smart discovery - Include patterns reduce files to process
Incremental generation - State tracking skips unchanged files
Dependency awareness - Changes propagate to affected downstream files
Validation early - Catch errors before code generation
Parallel processing - Independent FlowGroups can be processed simultaneously
Troubleshooting¶
Common issues include state management problems (a stale `.lhp_state.json`), dependency resolution failures, and performance issues in large projects.
See also
Error Reference for error codes, resolution steps, and general troubleshooting tips (state debugging, dependency debugging, performance optimization).
What’s Next?¶
Now that you understand the core building blocks of Lakehouse Plumber, explore these topics:
Substitutions & Secrets - Environment tokens, local variables, secrets, and file substitution support.
Operational Metadata - Audit columns, version requirements, and event log configuration.
Templates Reference - Reuse common patterns across your pipelines.
Databricks Asset Bundles Integration - Deploy and manage your pipelines as code.
Dependency Analysis & Job Generation - Pipeline dependency analysis and orchestration job generation.
For hands-on examples and complete workflows, check out Getting Started.