Getting Started¶
This short tutorial walks you through creating your first Lakehouse Plumber project and generating a Lakeflow Pipelines (DLT) pipeline based on the ACME demo configuration that ships with the repository.
Prerequisites¶
Python 3.11+ (3.12 recommended)
Access to a Databricks workspace with DLT enabled (for actual deployment)
Git installed (optional but recommended)
Installation¶
# Create and activate a virtual environment (optional)
python -m venv .venv
source .venv/bin/activate
# Install the Lakehouse Plumber CLI
pip install lakehouse-plumber
Option 1: Clone the ACME Example Pipeline¶
There is a companion repository that includes a fully working example (TPC-H retail dataset). Clone it, then validate the configuration and generate code for the dev environment:
git clone https://github.com/Mmodarre/acme_edw.git
cd acme_edw
lhp validate --env dev
lhp generate --env dev
Option 2: Create a new pipeline configuration¶
Step 1: Project Initialisation¶
Use the `lhp init` command to scaffold a new repo-ready directory structure:
lhp init <my_spd_project>
cd <my_spd_project>
The command creates folders such as pipelines/, templates/,
substitutions/ and a starter lhp.yaml project file. It also includes
example template files with .tmpl extensions that you can use as starting points.
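The resulting layout looks roughly like this (a sketch; exact contents vary by version):
my_spd_project/
├── lhp.yaml
├── pipelines/
├── presets/
├── substitutions/
└── templates/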
Note
VS Code IntelliSense: If you use VS Code, IntelliSense with autocomplete, validation, and documentation is automatically configured! Open any YAML file to see real-time validation and smart suggestions.
Step 2: Edit the project configuration file¶
Edit the lhp.yaml file to configure your project.
Note
The lhp.yaml file is the main entry point for configuring your project.
For full documentation on the project configuration file, see Concepts & Architecture.
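As a rough sketch of its shape (the field names below are hypothetical; the file scaffolded by lhp init documents the real ones):
name: my_spd_project      # hypothetical field: project name
version: "0.1"            # hypothetical field: project version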
Step 3: Create your environment configuration file¶
Create a substitutions/dev.yaml file to match your workspace catalog & storage paths.
You can either:
Rename the example template:
Rename the example template:
mv substitutions/dev.yaml.tmpl substitutions/dev.yaml
Or create a new file following the same structure as the example template
Edit the file to configure tokens such as ${catalog} or ${secret:scope/key}
that will be replaced during code generation.
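For example, a dev.yaml consistent with the pipeline built later in this tutorial could define (a minimal sketch assuming a flat key layout; the shipped dev.yaml.tmpl shows the exact structure your version expects):
catalog: acmi_edw_dev       # substituted wherever ${catalog} appears
bronze_schema: edw_bronze   # substituted wherever ${bronze_schema} appears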
Tip
The .tmpl files created by lhp init contain working examples that you can
use as starting points. Simply rename them or copy their content to create your
working configuration files.
Step 4: Create your first pipeline configuration¶
Create a new pipeline configuration in the pipelines/ folder.
Tip
Understanding Pipeline Configuration Structure:
Pipeline: (line 1) specifies the pipeline name that contains this flowgroup. All YAML files sharing the same pipeline name will be organized together in the same directory during code generation.
Flowgroup: (line 2) represents a logical grouping of related actions within the pipeline and serves as an organizational construct without impacting runtime behavior.
Actions: (line 4) define the individual operations in the pipeline. They serve as the fundamental components that execute the data processing workflow:
Loads (lines 5-13) customer data from the Databricks samples catalog using Delta streaming
Transforms (lines 15-31) the raw data by renaming columns and standardizing field names
Writes (lines 33-40) the processed data to a bronze layer streaming table
Leverages substitutions like ${catalog} and ${bronze_schema} (defined in the dev.yaml file) for environment flexibility
Implements medallion architecture by writing to the bronze schema for downstream processing
Enables streaming with readMode: stream for incremental reads from the Delta Change Data Feed (CDF)
Tip
Multi-Flowgroup Files:
You can define multiple flowgroups in a single YAML file to reduce file proliferation. This is useful when you have many similar flowgroups (e.g., SAP master data tables).
See Multi-Flowgroup YAML Files for detailed examples and syntax options.
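As an illustrative sketch only (the pipeline and flowgroup names are hypothetical, and it assumes the standard YAML document-separator style; see that section for the supported syntax options):
pipeline: sap_master_data
flowgroup: mara_ingestion
actions: []                 # load/transform/write actions for MARA go here
---
pipeline: sap_master_data
flowgroup: makt_ingestion
actions: []                 # load/transform/write actions for MAKT go here
The single-flowgroup configuration used in this tutorial follows: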
 1  pipeline: tpch_sample_ingestion            # Grouping of generated python files in the same folder
 2  flowgroup: customer_ingestion              # Logical grouping for generated Python file
 3
 4  actions:
 5    - name: customer_sample_load             # Unique action identifier
 6      type: load                             # Action type: Load
 7      readMode: stream                       # Read using streaming CDF
 8      source:
 9        type: delta                          # Source format: Delta Lake table
10        database: "samples.tpch"             # Source database and schema in Unity Catalog
11        table: customer_sample               # Source table name
12      target: v_customer_sample_raw          # Target view name (temporary in-memory)
13      description: "Load customer sample table from Databricks samples catalog"
14
15    - name: transform_customer_sample        # Unique action identifier
16      type: transform                        # Action type: Transform
17      transform_type: sql                    # Transform using SQL query
18      source: v_customer_sample_raw          # Input view from previous action
19      target: v_customer_sample_cleaned      # Output view name
20      sql: |
21        SELECT
22          c_custkey as customer_id,
23          c_name as name,
24          c_address as address,
25          c_nationkey as nation_id,
26          c_phone as phone,
27          c_acctbal as account_balance,
28          c_mktsegment as market_segment,
29          c_comment as comment
30        FROM stream(v_customer_sample_raw)
31      description: "Transform customer sample table"
32
33    - name: write_customer_sample_bronze     # Unique action identifier
34      type: write                            # Action type: Write
35      source: v_customer_sample_cleaned      # Input view from previous action
36      write_target:
37        type: streaming_table                # Output as streaming table
38        database: "${catalog}.${bronze_schema}"  # Target database.schema with substitutions
39        table: "tpch_sample_customer"        # Final table name
40      description: "Write customer sample table to bronze schema"
Validate the Configuration¶
# Check for schema errors, missing secrets, circular dependencies …
lhp validate --env dev
If everything is green, you will see: ✅ All configurations are valid.
Generate DLT Code¶
# Create Python files in ./generated/ (default output dir)
lhp generate --env dev
# Include data quality tests (optional - for development/testing)
lhp generate --env dev --include-tests
Inspect the Output¶
Navigate to generated/tpch_sample_ingestion — each FlowGroup becomes a Python
file, formatted with black. These are standard
Lakeflow Declarative Pipeline scripts that you can run in
Databricks or commit to your repository. See Databricks Asset Bundles Integration for deployment with Asset Bundles.
This is the generated Python file from the YAML configuration above:
# Generated by LakehousePlumber
# Pipeline: tpch_sample_ingestion
# FlowGroup: customer_ingestion

from pyspark import pipelines as dp

# Pipeline Configuration
PIPELINE_ID = "tpch_sample_ingestion"
FLOWGROUP_ID = "customer_ingestion"

# ============================================================================
# SOURCE VIEWS
# ============================================================================

@dp.temporary_view()
def v_customer_sample_raw():
    """Load customer sample table from Databricks samples catalog"""
    df = spark.readStream \
        .table("samples.tpch.customer_sample")

    return df


# ============================================================================
# TRANSFORMATION VIEWS
# ============================================================================

@dp.temporary_view(comment="Transform customer sample table")
def v_customer_sample_cleaned():
    """Transform customer sample table"""
    return spark.sql("""SELECT
        c_custkey as customer_id,
        c_name as name,
        c_address as address,
        c_nationkey as nation_id,
        c_phone as phone,
        c_acctbal as account_balance,
        c_mktsegment as market_segment,
        c_comment as comment
    FROM stream(v_customer_sample_raw)""")


# ============================================================================
# TARGET TABLES
# ============================================================================

# Create the streaming table
dp.create_streaming_table(
    name="acmi_edw_dev.edw_bronze.tpch_sample_customer",
    comment="Streaming table: tpch_sample_customer",
    table_properties={"delta.autoOptimize.optimizeWrite": "true", "delta.enableChangeDataFeed": "true"})


# Define append flow(s)
@dp.append_flow(
    target="acmi_edw_dev.edw_bronze.tpch_sample_customer",
    name="f_customer_sample_bronze",
    comment="Write customer sample table to bronze schema"
)
def f_customer_sample_bronze():
    """Write customer sample table to bronze schema"""
    # Streaming flow
    df = spark.readStream.table("v_customer_sample_cleaned")

    return df
Deploy on Databricks¶
Option 1: Manually create a Lakeflow Declarative Pipeline (ETL)
Create a Lakeflow Declarative Pipeline (ETL) in the Databricks UI.
Point the Notebook/Directory field to your generated/ folder in the workspace (or sync the files via Repos), or create new Python files in the workspace and paste the generated code into them.
Configure clusters & permissions, then click Validate.
Option 2: Use Asset Bundles (see Databricks Asset Bundles Integration).
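As a rough sketch, a bundle definition might register a generated file as a pipeline library (the resource, file, and bundle names here are illustrative; field names follow the public Databricks Asset Bundles schema, and the linked guide covers the full workflow):
bundle:
  name: acme_edw            # hypothetical bundle name

resources:
  pipelines:
    tpch_sample_ingestion:
      name: tpch_sample_ingestion
      libraries:
        - notebook:
            path: ./generated/tpch_sample_ingestion/customer_ingestion.py  # illustrative path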
Working with Example Templates¶
When you run lhp init, several example template files are created to help you get started:
- Configuration Examples:
substitutions/dev.yaml.tmpl - Example environment configuration with common substitution variables
substitutions/prod.yaml.tmpl - Production environment example
substitutions/tst.yaml.tmpl - Test environment example
- Pipeline Examples:
pipelines/01_raw_ingestion/ - Complete ingestion pipeline examples for various data formats
pipelines/02_bronze/ - Bronze layer transformation examples
pipelines/03_silver/ - Silver layer examples with data quality
- Preset Examples:
presets/bronze_layer.yaml.tmpl - Reusable bronze layer configuration template
- Template Examples:
templates/standard_ingestion.yaml.tmpl - Standard ingestion pattern template
To use these examples:
Copy and rename template files:
cp substitutions/dev.yaml.tmpl substitutions/dev.yaml
Edit the copied files to match your environment and requirements
Use them as references when creating your own configurations
Explore the comprehensive examples in the pipelines/ directory for different data ingestion patterns
Note
The .tmpl files are static examples containing LHP template syntax. They are not Jinja2 templates rendered by the init command, but rather complete working examples that you can use as starting points for your own configurations.
Next Steps¶
Explore Presets and Templates to reduce duplication.
Add data-quality expectations to your transforms.
Add operational metadata to your actions.
Add Schema Hints to your Load actions.
Enable Change Data Feed (CDF) in bronze ingestions.
Continue reading the Concepts & Architecture section for deeper architectural details.