Getting Started

This short tutorial walks you through creating your first Lakehouse Plumber project and generating a Lakeflow Pipelines (DLT) pipeline based on the ACME demo configuration that ships with the repository.

Prerequisites

  • Python 3.11+ (3.12 recommended)

  • Access to a Databricks workspace with DLT enabled (for actual deployment)

  • Git installed (optional but recommended)

Installation

# Create and activate a virtual environment (optional)
python -m venv .venv
source .venv/bin/activate

# Install the Lakehouse Plumber CLI
pip install lakehouse-plumber

Option 1: Clone the ACME Example Pipeline

There is a companion repository that includes a fully working example (TPC-H retail dataset). Clone it, then validate the configuration and generate the pipeline code:

git clone https://github.com/Mmodarre/acme_edw.git
cd acme_edw
lhp validate --env dev
lhp generate --env dev

Option 2: Create a new pipeline configuration

Step 1: Project Initialisation

Use the `lhp init` command to scaffold a new repo-ready directory structure:

lhp init <my_spd_project>
cd <my_spd_project>

The command creates folders such as pipelines/, templates/, substitutions/ and a starter lhp.yaml project file. It also includes example template files with .tmpl extensions that you can use as starting points.

Note

VS Code IntelliSense: If you use VS Code, IntelliSense with autocomplete, validation, and documentation is automatically configured! Open any YAML file to see real-time validation and smart suggestions.

Step 2: Edit the project configuration file

Edit the lhp.yaml file to configure your project.
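As a rough illustration only — the field names below are assumptions, so treat the starter lhp.yaml that lhp init generates as the authoritative reference for the real schema:

```yaml
# lhp.yaml -- illustrative sketch; the keys shown here are assumptions.
# Consult the starter file created by `lhp init` for the actual schema.
name: my_spd_project
version: "0.1"
```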

Note

The lhp.yaml file is the main entry point for configuring your project. For full documentation on the project configuration file, see Concepts & Architecture.

Step 3: Create your environment configuration file

Create a substitutions/dev.yaml file to match your workspace catalog & storage paths. You can either:

  1. Rename the example template: mv substitutions/dev.yaml.tmpl substitutions/dev.yaml

  2. Create a new file following the same structure as the example template

Edit the file to configure tokens such as ${catalog} or ${secret:scope/key} that will be replaced during code generation.
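For example, the generated output shown later in this tutorial resolves ${catalog} to acmi_edw_dev and ${bronze_schema} to edw_bronze, so a minimal dev.yaml consistent with that output might look like this (the exact keys depend on which tokens your pipeline configurations reference):

```yaml
# substitutions/dev.yaml -- values here match the example output later in
# this tutorial; replace them with your own workspace catalog and schemas.
catalog: acmi_edw_dev
bronze_schema: edw_bronze
```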

Tip

The .tmpl files created by lhp init contain working examples that you can use as starting points. Simply rename them or copy their content to create your working configuration files.
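To make the token mechanics concrete, here is a rough Python sketch of how simple ${token} replacement behaves during code generation. This is illustrative only — it is not LHP's actual implementation, and it deliberately ignores special forms such as ${secret:scope/key}:

```python
import re


def apply_substitutions(text: str, tokens: dict) -> str:
    """Replace ${token} placeholders using values from an environment file.

    Illustrative sketch only -- not LHP's real implementation. Special
    forms such as ${secret:scope/key} are intentionally not handled here.
    """
    def replace(match: re.Match) -> str:
        key = match.group(1)
        if key not in tokens:
            raise KeyError(f"no substitution defined for ${{{key}}}")
        return tokens[key]

    # Match simple word-character tokens such as ${catalog}
    return re.sub(r"\$\{(\w+)\}", replace, text)


# Tokens as they might appear in substitutions/dev.yaml
dev_tokens = {"catalog": "acmi_edw_dev", "bronze_schema": "edw_bronze"}
print(apply_substitutions("${catalog}.${bronze_schema}", dev_tokens))
```

Running this prints the fully qualified schema name, which is exactly the shape of the database value you will see in the generated code later in this tutorial.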

Step 4: Create your first pipeline configuration

Create a new pipeline configuration in the pipelines/ folder.

Tip

Understanding Pipeline Configuration Structure:

Pipeline: (line 1) specifies the pipeline name that contains this flowgroup. All YAML files sharing the same pipeline name will be organized together in the same directory during code generation.

Flowgroup: (line 2) represents a logical grouping of related actions within the pipeline and serves as an organizational construct without impacting runtime behavior.

Actions: (line 4) define the individual operations in the pipeline. They serve as the fundamental components that execute the data processing workflow:

  • Loads (lines 5-13) customer data from the Databricks samples catalog using Delta streaming

  • Transforms (lines 15-31) the raw data by renaming columns and standardizing field names

  • Writes (lines 33-40) the processed data to a bronze layer streaming table

  • Leverages substitutions like ${catalog} and ${bronze_schema} for environment flexibility from dev.yaml file

  • Implements medallion architecture by writing to the bronze schema for downstream processing

  • Enables streaming with readMode: stream for incremental read from Delta Change Data Feed (CDF)

Tip

Multi-Flowgroup Files:

You can define multiple flowgroups in a single YAML file to reduce file proliferation. This is useful when you have many similar flowgroups (e.g., SAP master data tables).

See Multi-Flowgroup YAML Files for detailed examples and syntax options.

pipelines/customer_sample.yaml
 1  pipeline: tpch_sample_ingestion  # Grouping of generated python files in the same folder
 2  flowgroup: customer_ingestion   # Logical grouping for generated Python file
 3
 4  actions:
 5     - name: customer_sample_load     # Unique action identifier
 6       type: load                     # Action type: Load
 7       readMode: stream              # Read using streaming CDF
 8       source:
 9          type: delta                # Source format: Delta Lake table
10          database: "samples.tpch"   # Source database and schema in Unity Catalog
11          table: customer_sample     # Source table name
12       target: v_customer_sample_raw # Target view name (temporary in-memory)
13       description: "Load customer sample table from Databricks samples catalog"
14
15     - name: transform_customer_sample  # Unique action identifier
16       type: transform                  # Action type: Transform
17       transform_type: sql             # Transform using SQL query
18       source: v_customer_sample_raw   # Input view from previous action
19       target: v_customer_sample_cleaned  # Output view name
20       sql: |
21          SELECT
22          c_custkey as customer_id,
23          c_name as name,
24          c_address as address,
25          c_nationkey as nation_id,
26          c_phone as phone,
27          c_acctbal as account_balance,
28          c_mktsegment as market_segment,
29          c_comment as comment
30          FROM stream(v_customer_sample_raw)
31       description: "Transform customer sample table"
32
33     - name: write_customer_sample_bronze  # Unique action identifier
34       type: write                         # Action type: Write
35       source: v_customer_sample_cleaned   # Input view from previous action
36       write_target:
37          type: streaming_table            # Output as streaming table
38          database: "${catalog}.${bronze_schema}"  # Target database.schema with substitutions
39          table: "tpch_sample_customer"    # Final table name
40       description: "Write customer sample table to bronze schema"

Validate the Configuration

# Check for schema errors, missing secrets, circular dependencies …
lhp validate --env dev

If everything is green you will see ✅ All configurations are valid.

Generate DLT Code

# Create Python files in ./generated/ (default output dir)
lhp generate --env dev

# Include data quality tests (optional - for development/testing)
lhp generate --env dev --include-tests

Inspect the Output

Navigate to generated/tpch_sample_ingestion — each FlowGroup becomes a Python file, formatted with black. These are standard Lakeflow Declarative Pipeline scripts that you can run in Databricks or commit to your repository. For deployment via Asset Bundles, see Databricks Asset Bundles Integration.

This is the Python file generated from the YAML configuration above:

generated/tpch_sample_ingestion/customer_ingestion.py
# Generated by LakehousePlumber
# Pipeline: tpch_sample_ingestion
# FlowGroup: customer_ingestion

from pyspark import pipelines as dp

# Pipeline Configuration
PIPELINE_ID = "tpch_sample_ingestion"
FLOWGROUP_ID = "customer_ingestion"

# ============================================================================
# SOURCE VIEWS
# ============================================================================

@dp.temporary_view()
def v_customer_sample_raw():
   """Load customer sample table from Databricks samples catalog"""
   df = spark.readStream \
      .table("samples.tpch.customer_sample")

   return df


# ============================================================================
# TRANSFORMATION VIEWS
# ============================================================================

@dp.temporary_view(comment="Transform customer sample table")
def v_customer_sample_cleaned():
   """Transform customer sample table"""
   return spark.sql("""SELECT
c_custkey as customer_id,
c_name as name,
c_address as address,
c_nationkey as nation_id,
c_phone as phone,
c_acctbal as account_balance,
c_mktsegment as market_segment,
c_comment as comment
FROM stream(v_customer_sample_raw)""")


# ============================================================================
# TARGET TABLES
# ============================================================================

# Create the streaming table
dp.create_streaming_table(
   name="acmi_edw_dev.edw_bronze.tpch_sample_customer",
   comment="Streaming table: tpch_sample_customer",
   table_properties={"delta.autoOptimize.optimizeWrite": "true", "delta.enableChangeDataFeed": "true"})


# Define append flow(s)
@dp.append_flow(
   target="acmi_edw_dev.edw_bronze.tpch_sample_customer",
   name="f_customer_sample_bronze",
   comment="Write customer sample table to bronze schema"
)
def f_customer_sample_bronze():
   """Write customer sample table to bronze schema"""
   # Streaming flow
   df = spark.readStream.table("v_customer_sample_cleaned")

   return df

Deploy on Databricks

Option 1: Manually create a Lakeflow Declarative Pipeline (ETL)

  1. Create a Lakeflow Declarative Pipeline (ETL) in the Databricks UI.

  2. Point the Notebook/Directory field to your generated/ folder in the workspace (or sync the files via Repos). Alternatively, create new Python files in the workspace and paste the generated code into them.

  3. Configure clusters & permissions, then click Validate.

Option 2: Use Asset Bundles

Databricks Asset Bundles Integration

Working with Example Templates

When you run lhp init, several example template files are created to help you get started:

Configuration Examples:
  • substitutions/dev.yaml.tmpl - Example environment configuration with common substitution variables

  • substitutions/prod.yaml.tmpl - Production environment example

  • substitutions/tst.yaml.tmpl - Test environment example

Pipeline Examples:
  • pipelines/01_raw_ingestion/ - Complete ingestion pipeline examples for various data formats

  • pipelines/02_bronze/ - Bronze layer transformation examples

  • pipelines/03_silver/ - Silver layer examples with data quality

Preset Examples:
  • presets/bronze_layer.yaml.tmpl - Reusable bronze layer configuration template

Template Examples:
  • templates/standard_ingestion.yaml.tmpl - Standard ingestion pattern template

To use these examples:

  1. Copy and rename template files: cp substitutions/dev.yaml.tmpl substitutions/dev.yaml

  2. Edit the copied files to match your environment and requirements

  3. Use them as references when creating your own configurations

  4. Explore the comprehensive examples in the pipelines/ directory for different data ingestion patterns

Note

The .tmpl files are static examples containing LHP template syntax. They are not Jinja2 templates for the init command, but rather complete working examples that you can use as starting points for your own configurations.

Next Steps

  • Explore Presets and Templates to reduce duplication.

  • Add data-quality expectations to your transforms.

  • Add operational metadata to your actions.

  • Add Schema Hints to your Load actions.

  • Enable Change-Data-Feed (CDC) in bronze ingestions.

  • Continue reading the Concepts & Architecture section for deeper architectural details.