Getting Started¶
This short tutorial walks you through creating your first Lakehouse Plumber project and generating a Lakeflow Pipelines (DLT) pipeline based on the ACME demo configuration that ships with the repository.
Prerequisites¶
Python 3.11+ (3.12 recommended)
Access to a Databricks workspace with DLT enabled (for actual deployment)
Git installed (optional but recommended)
Installation¶
# Create and activate a virtual environment (optional)
python -m venv .venv
source .venv/bin/activate
# Install the Lakehouse Plumber CLI
pip install lakehouse-plumber
Option 1: Clone the ACME Example Pipeline¶
There is a companion repository that includes a fully working example (TPC-H retail dataset). Clone it, then validate the configuration and generate code for the dev environment:
git clone https://github.com/Mmodarre/acme_edw.git
cd acme_edw
lhp validate --env dev
lhp generate --env dev
Option 2: Create a new pipeline configuration¶
Step 1: Project Initialisation¶
Use the `lhp init` command to scaffold a new repo-ready directory structure:
lhp init <my_spd_project>
cd <my_spd_project>
The command creates folders such as pipelines/, templates/,
substitutions/ and a starter lhp.yaml project file. It also includes
example template files with .tmpl extensions that you can use as starting points.
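The resulting layout looks roughly like this (a sketch; exact contents vary by version):
my_spd_project/
├── lhp.yaml
├── pipelines/
├── presets/
├── substitutions/
└── templates/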
Note
VS Code IntelliSense: If you use VS Code, IntelliSense with autocomplete, validation, and documentation is automatically configured! Open any YAML file to see real-time validation and smart suggestions.
Step 2: Edit the project configuration file¶
Edit the lhp.yaml file to configure your project.
Note
The lhp.yaml file is the main entry point for configuring your project.
For full documentation on the project configuration file, see Concepts & Architecture.
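As a rough sketch of its shape (the field names below are hypothetical; the file scaffolded by lhp init documents the real ones):
name: my_spd_project      # hypothetical field: project name
version: "0.1"            # hypothetical field: project version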
Step 3: Create your environment configuration file¶
Create a substitutions/dev.yaml file to match your workspace catalog & storage paths.
You can either:
Rename the example template:
Rename the example template:
mv substitutions/dev.yaml.tmpl substitutions/dev.yaml
Or create a new file following the same structure as the example template
Edit the file to configure tokens such as ${catalog} or ${secret:scope/key}
that will be replaced during code generation.
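For example, a dev.yaml consistent with the pipeline built later in this tutorial could define (a minimal sketch assuming a flat key layout; the shipped dev.yaml.tmpl shows the exact structure your version expects):
catalog: acmi_edw_dev       # substituted wherever ${catalog} appears
bronze_schema: edw_bronze   # substituted wherever ${bronze_schema} appears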
Tip
The .tmpl files created by lhp init contain working examples that you can
use as starting points. Simply rename them or copy their content to create your
working configuration files.
Step 4: Create your first pipeline configuration¶
Create a new pipeline configuration in the pipelines/ folder.
Tip
Understanding Pipeline Configuration Structure:
Pipeline: (line 1) specifies the pipeline name that contains this flowgroup. All YAML files sharing the same pipeline name will be organized together in the same directory during code generation.
Flowgroup: (line 2) represents a logical grouping of related actions within the pipeline and serves as an organizational construct without impacting runtime behavior.
Actions: (line 4) define the individual operations in the pipeline. They serve as the fundamental components that execute the data processing workflow:
Loads (lines 5-13) customer data from the Databricks samples catalog using Delta streaming
Transforms (lines 15-31) the raw data by renaming columns and standardizing field names
Writes (lines 33-40) the processed data to a bronze layer streaming table
Leverages substitutions like ${catalog} and ${bronze_schema} (defined in the dev.yaml file) for environment flexibility
Implements medallion architecture by writing to the bronze schema for downstream processing
Enables streaming with readMode: stream for incremental reads from the Delta Change Data Feed (CDF)
Tip
Multi-Flowgroup Files:
You can define multiple flowgroups in a single YAML file to reduce file proliferation. This is useful when you have many similar flowgroups (e.g., SAP master data tables).
See Multi-Flowgroup YAML Files for detailed examples and syntax options.
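As an illustrative sketch only (the pipeline and flowgroup names are hypothetical, and it assumes the standard YAML document-separator style; see that section for the supported syntax options):
pipeline: sap_master_data
flowgroup: mara_ingestion
actions: []                 # load/transform/write actions for MARA go here
---
pipeline: sap_master_data
flowgroup: makt_ingestion
actions: []                 # load/transform/write actions for MAKT go here
The single-flowgroup configuration used in this tutorial follows: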
 1  pipeline: tpch_sample_ingestion            # Grouping of generated python files in the same folder
 2  flowgroup: customer_ingestion              # Logical grouping for generated Python file
 3
 4  actions:
 5    - name: customer_sample_load             # Unique action identifier
 6      type: load                             # Action type: Load
 7      readMode: stream                       # Read using streaming CDF
 8      source:
 9        type: delta                          # Source format: Delta Lake table
10        database: "samples.tpch"             # Source database and schema in Unity Catalog
11        table: customer_sample               # Source table name
12      target: v_customer_sample_raw          # Target view name (temporary in-memory)
13      description: "Load customer sample table from Databricks samples catalog"
14
15    - name: transform_customer_sample        # Unique action identifier
16      type: transform                        # Action type: Transform
17      transform_type: sql                    # Transform using SQL query
18      source: v_customer_sample_raw          # Input view from previous action
19      target: v_customer_sample_cleaned      # Output view name
20      sql: |
21        SELECT
22          c_custkey as customer_id,
23          c_name as name,
24          c_address as address,
25          c_nationkey as nation_id,
26          c_phone as phone,
27          c_acctbal as account_balance,
28          c_mktsegment as market_segment,
29          c_comment as comment
30        FROM stream(v_customer_sample_raw)
31      description: "Transform customer sample table"
32
33    - name: write_customer_sample_bronze     # Unique action identifier
34      type: write                            # Action type: Write
35      source: v_customer_sample_cleaned      # Input view from previous action
36      write_target:
37        type: streaming_table                # Output as streaming table
38        database: "${catalog}.${bronze_schema}"  # Target database.schema with substitutions
39        table: "tpch_sample_customer"        # Final table name
40      description: "Write customer sample table to bronze schema"
Validate the Configuration¶
# Check for schema errors, missing secrets, circular dependencies …
lhp validate --env dev
If everything is green, you will see: ✅ All configurations are valid.
Generate DLT Code¶
# Create Python files in ./generated/ (default output dir)
lhp generate --env dev
# Include data quality tests (optional - for development/testing)
lhp generate --env dev --include-tests
Inspect the Output¶
Navigate to generated/tpch_sample_ingestion — each FlowGroup becomes a Python
file, formatted with black. These are standard
Lakeflow Declarative Pipeline scripts that you can run in
Databricks or commit to your repository. See Databricks Asset Bundles Integration for deployment with Asset Bundles.
This is the generated Python file from the YAML configuration above:
# Generated by LakehousePlumber
# Pipeline: tpch_sample_ingestion
# FlowGroup: customer_ingestion

from pyspark import pipelines as dp

# Pipeline Configuration
PIPELINE_ID = "tpch_sample_ingestion"
FLOWGROUP_ID = "customer_ingestion"

# ============================================================================
# SOURCE VIEWS
# ============================================================================

@dp.temporary_view()
def v_customer_sample_raw():
    """Load customer sample table from Databricks samples catalog"""
    df = spark.readStream \
        .table("samples.tpch.customer_sample")

    return df


# ============================================================================
# TRANSFORMATION VIEWS
# ============================================================================

@dp.temporary_view(comment="Transform customer sample table")
def v_customer_sample_cleaned():
    """Transform customer sample table"""
    return spark.sql("""SELECT
        c_custkey as customer_id,
        c_name as name,
        c_address as address,
        c_nationkey as nation_id,
        c_phone as phone,
        c_acctbal as account_balance,
        c_mktsegment as market_segment,
        c_comment as comment
    FROM stream(v_customer_sample_raw)""")


# ============================================================================
# TARGET TABLES
# ============================================================================

# Create the streaming table
dp.create_streaming_table(
    name="acmi_edw_dev.edw_bronze.tpch_sample_customer",
    comment="Streaming table: tpch_sample_customer",
    table_properties={"delta.autoOptimize.optimizeWrite": "true", "delta.enableChangeDataFeed": "true"})


# Define append flow(s)
@dp.append_flow(
    target="acmi_edw_dev.edw_bronze.tpch_sample_customer",
    name="f_customer_sample_bronze",
    comment="Write customer sample table to bronze schema"
)
def f_customer_sample_bronze():
    """Write customer sample table to bronze schema"""
    # Streaming flow
    df = spark.readStream.table("v_customer_sample_cleaned")

    return df
Deploy on Databricks¶
Option 1: Manually create a Lakeflow Declarative Pipeline (ETL)
Create a Lakeflow Declarative Pipeline (ETL) in the Databricks UI.
Point the Notebook/Directory field to your generated/ folder in the workspace (or sync the files via Repos), or create new Python files in the workspace and paste the generated code into them.
Configure clusters & permissions, then click Validate.
Option 2: Use Asset Bundles (see Databricks Asset Bundles Integration).
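As a rough sketch, a bundle definition might register a generated file as a pipeline library (the resource, file, and bundle names here are illustrative; field names follow the public Databricks Asset Bundles schema, and the linked guide covers the full workflow):
bundle:
  name: acme_edw            # hypothetical bundle name

resources:
  pipelines:
    tpch_sample_ingestion:
      name: tpch_sample_ingestion
      libraries:
        - notebook:
            path: ./generated/tpch_sample_ingestion/customer_ingestion.py  # illustrative path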
Working with Example Templates¶
When you run lhp init, several example template files are created to help you get started:
- Configuration Examples:
substitutions/dev.yaml.tmpl - Example environment configuration with common substitution variables
substitutions/prod.yaml.tmpl - Production environment example
substitutions/tst.yaml.tmpl - Test environment example
- Pipeline Examples:
pipelines/01_raw_ingestion/ - Complete ingestion pipeline examples for various data formats
pipelines/02_bronze/ - Bronze layer transformation examples
pipelines/03_silver/ - Silver layer examples with data quality
- Preset Examples:
presets/bronze_layer.yaml.tmpl - Reusable bronze layer configuration template
- Template Examples:
templates/standard_ingestion.yaml.tmpl - Standard ingestion pattern template
To use these examples:
Copy and rename template files:
cp substitutions/dev.yaml.tmpl substitutions/dev.yaml
Edit the copied files to match your environment and requirements
Use them as references when creating your own configurations
Explore the comprehensive examples in the pipelines/ directory for different data ingestion patterns
Note
The .tmpl files are static examples containing LHP template syntax. They are not Jinja2 templates rendered by the init command, but rather complete working examples that you can use as starting points for your own configurations.
Next Steps¶
Explore Presets and Templates to reduce duplication.
Add data-quality expectations to your transforms.
Add operational metadata to your actions.
Add Schema Hints to your Load actions.
Enable Change Data Feed (CDF) in bronze ingestions.
Continue reading the Concepts & Architecture section for deeper architectural details.