Lakehouse Plumber¶
Managing dozens of Lakeflow/DLT pipelines means thousands of lines of repetitive Python — inconsistent patterns, boilerplate sprawl, and painful maintenance across environments.
Lakehouse Plumber turns concise YAML actions into fully-featured Databricks Lakeflow Declarative Pipelines (formerly Delta Live Tables) — without hiding the Databricks platform you already know and love.
How LHP Solves It¶
Eliminates boilerplate — a template + 5-line config replaces 86 lines of Python per table.
Zero runtime overhead — pure code generation, not a runtime framework.
Transparent output — readable Python files, version-controlled and debuggable in the Databricks IDE.
Fits DataOps workflows — CI/CD, automated testing, multi-environment substitutions.
No lock-in — the output is plain Python & SQL you own and control.
Data democratization — power users can author pipeline artifacts while staying within platform standards.
Real-World Example
Instead of repeating 86 lines of Python per table, write a 5-line configuration:
pipeline: raw_ingestions
flowgroup: customer_ingestion
use_template: csv_ingestion_template
template_parameters:
  table_name: customer
  landing_folder: customer
Result: 4,300 lines of repetitive Python → 250 lines total (1 template + 50 simple configs). See Getting Started for the full template and generated output.
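For orientation, the template referenced above is itself YAML, parameterized with Jinja2 placeholders such as {{ table_name }}. Below is a minimal sketch only: the parameter and action fields shown are illustrative assumptions, not the exact LHP template schema (the Templates Reference has the real syntax).
name: csv_ingestion_template
# Illustrative sketch only: parameter and action fields are assumed,
# not the exact LHP template schema.
parameters:
  - table_name
  - landing_folder
actions:
  - name: "load_{{ table_name }}_raw"
    type: load
    source: cloudfiles
    path: "/landing/{{ landing_folder }}/*.csv"
  - name: "write_{{ table_name }}"
    type: write
    target: streaming_table
    table: "{{ table_name }}"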
Quick Start¶
Get started in minutes:
pip install lakehouse-plumber
lhp init my_project --bundle
cd my_project
# Edit your YAML flowgroups (IntelliSense auto-configured)
lhp validate --env dev
lhp generate --env dev
# Inspect the generated/ directory — readable Python ready for Databricks
Note
New to LHP? Follow the Getting Started tutorial to build your first pipeline in 10 minutes.
Core Workflow¶
The execution model is deliberately simple:
graph LR
A[Load] --> B{0..N Transform}
B --> C[Write]
Load — Ingest raw data from CloudFiles, Delta, JDBC, SQL, or custom Python.
Transform — Apply zero or many transforms (SQL, Python, schema, data-quality, temp tables…).
Write — Persist results as Streaming Tables, Materialized Views, or Snapshots.
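Concretely, a flowgroup wires these stages together as a list of actions. The sketch below is indicative only: the pipeline and flowgroup keys match the real example above, but the action fields are assumptions; the Actions Reference documents the exact schema.
pipeline: raw_ingestions
flowgroup: orders_bronze
actions:
  # Load: ingest raw files with CloudFiles (Auto Loader)
  - name: load_orders
    type: load
    source: cloudfiles
    path: "/landing/orders/"
  # Transform: zero or many steps; a SQL clean-up here
  - name: clean_orders
    type: transform
    sql: "SELECT * FROM load_orders WHERE order_id IS NOT NULL"
  # Write: persist as a Streaming Table
  - name: write_orders
    type: write
    target: streaming_table
    table: orders_bronze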
Features at a Glance¶
Pipeline Definition
Actions — Load | Transform | Write with many sub-types (see Actions Reference).
Sinks — Stream to external destinations: Delta tables, Kafka, Event Hubs, custom APIs.
CDC & SCD — change data capture with SCD Type 1 and Type 2, plus snapshot ingestion.
Append Flows — multi-source writes to a single streaming table.
Data-Quality — declarative expectations integrated into transforms, with optional quarantine mode for DLQ recycling (sketched after this group).
Seeding — seed data from existing tables using Lakeflow native features.
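A taste of the Data-Quality item above: an expectation with quarantine might be declared roughly like this. Field names here are illustrative assumptions, not LHP's exact schema; see Quarantine (Dead Letter Queue) for the real syntax.
# Illustrative only: expectation fields below are assumptions.
- name: validate_orders
  type: transform
  expectations:
    - name: valid_order_id
      expression: "order_id IS NOT NULL"
      on_violation: quarantine   # failed rows land in a DLQ table for recycling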
Reusability
Presets & Templates — reuse patterns without copy-paste.
Local Variables — flowgroup-scoped variables (%{var}) reduce repetition.
Substitutions — environment-aware tokens & secret references (sketched after this group).
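To illustrate the two items above: the %{var} syntax for local variables comes from this page, but the substitution-file layout and token values below are assumptions, so check the Substitutions & Secrets guide for the real format.
# In a flowgroup: a local variable reused across actions (illustrative)
variables:
  folder: customer
# ...actions can then reference it, e.g. path: "/landing/%{folder}/"

# In substitutions/dev.yaml: environment-aware tokens (file layout assumed)
catalog: dev_catalog
landing_root: "abfss://landing@devstorage.dfs.core.windows.net"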
Operations
Operational Metadata — custom audit columns and event log configuration.
Pipeline Monitoring — centralized event log aggregation and analysis (see Pipeline Monitoring).
Test Result Reporting — publish DQ expectation results to Azure DevOps, Delta tables, or custom systems (see Test Result Reporting (Publishing)).
Dependency Analysis — automatic dependency detection and orchestration job generation (see Dependency Analysis & Job Generation).
Smart State Management — regenerate only what changed; clean up orphaned code.
Developer Experience
IntelliSense — VS Code schema hints & YAML completion (automatically configured).
Next Steps¶
Getting Started
Getting Started – a hands-on walk-through using the ACMI demo project.
Examples – real-world examples and sink configurations.
Configuration Guides
Concepts & Architecture – deep-dive into FlowGroups, Actions, presets, templates and more.
Substitutions & Secrets – environment tokens, local variables, and secret management.
Operational Metadata – audit columns, version requirements, and event log configuration.
Multi-Flowgroup YAML Files – reduce file proliferation with multiple flowgroups per YAML file.
Actions Reference – complete reference for all action types and sub-types.
Templates Reference – comprehensive guide to creating and using templates.
Dynamic Templates Guide – conditionals, loops, and advanced Jinja2 features.
Presets Reference – reusable default configurations.
Enterprise Best Practices – enterprise patterns for naming, structure, presets, and production readiness.
Pipeline Patterns – practical patterns for multi-source ingestion, path filtering, and fan-in architectures.
Quarantine (Dead Letter Queue) – quarantine mode with DLQ recycling for data quality transforms.
Deployment & Operations
Databricks Asset Bundles Integration – integrate with Databricks Asset Bundles for production deployments.
Pipeline Monitoring – centralized event log monitoring and analysis across all pipelines.
Test Result Reporting (Publishing) – publish test results to external systems.
Dependency Analysis & Job Generation – pipeline dependency analysis and orchestration job generation.
CI/CD Reference – CI/CD patterns, deployment strategies, and DataOps best practices.
Reference
CLI Reference – command-line reference.
Error Reference – error codes, causes, and resolution steps.
API Reference – REST API reference.