Enterprise Best Practices

A comprehensive guide for data engineers using Lakehouse Plumber (LHP) in enterprise environments. These best practices bring together Databricks Lakeflow Declarative Pipelines conventions, enterprise configuration-framework patterns, and LHP-specific capabilities.

1. Project Structure & Organisation

BP-1.1: Organize pipeline YAML files by data domain

Group by business domain (orders/, customers/, inventory/) rather than by action type (loads/, transforms/). LHP discovers flowgroups from the pipelines/ directory and supports subdirectories, so pipelines/orders/bronze_ingest.yaml works natively.

BP-1.2: Keep each YAML file small and single-purpose

Target 50–200 lines. Use LHP’s multi-document (---) or array syntax only for tightly related flowgroups that share a pipeline. Monolithic files with 15+ flowgroups become unreadable and unreviewable.

See also

Multi-Flowgroup YAML Files for details on multi-document and array syntax.

BP-1.3: Use include patterns to filter pipeline discovery

For large repos, use the include glob patterns in lhp.yaml to control which pipeline files are processed per environment or team. This enables a mono-repo structure where each team’s files coexist without interfering.

BP-1.4: Separate presets, templates, and substitutions into dedicated directories

Follow the standard LHP project layout. See Section 2 for detailed subdirectory guidance within each top-level directory.

Standard LHP project layout
presets/           # Reusable defaults (flat — no subdirectory discovery)
templates/         # Reusable action patterns (flat — use prefix-based grouping)
substitutions/     # Environment-specific tokens (dev.yaml, prod.yaml)
pipelines/         # Flowgroup definitions (supports deep subdirectories)
sql/               # External SQL files (supports deep subdirectories)
schemas/           # External schema files (supports deep subdirectories)
expectations/      # External DQE files (supports deep subdirectories)
python_modules/    # External Python modules (supports deep subdirectories)

BP-1.5: Use a CODEOWNERS file to gate shared resource changes

CODEOWNERS is a GitHub/GitLab feature (a file at the repo root) that enforces who must review pull requests that touch specific files or directories. When a PR modifies files matching a pattern in CODEOWNERS, the listed team or person is automatically added as a required reviewer.

In an enterprise LHP project, shared resources such as presets, substitutions, and templates affect every pipeline, so changes to them should require platform-team approval. Domain-specific pipelines, by contrast, should be reviewed by the owning team.

Example CODEOWNERS file
# Platform team must review shared configs
/presets/               @platform-team
/substitutions/         @platform-team
/templates/             @platform-team

# Domain teams own their pipeline definitions
/pipelines/system_a/    @team-a
/pipelines/system_b/    @team-b

Tip

Without CODEOWNERS, a change to a preset (e.g., default table properties) could silently affect every pipeline that uses it and merge without review from someone who understands the blast radius.

2. File Organisation & Subdirectory Structure

LHP file types have different subdirectory support. Understanding this is critical for organizing an enterprise project with hundreds of files.

Subdirectory Support Matrix

File Type                                Base Directory   Subdirectory Support   Extensions           Notes
---------------------------------------  ---------------  ---------------------  -------------------  ------------------------------------------------
Pipeline YAMLs                           pipelines/       Full recursive         .yaml + .yml         Discovered via rglob("*.yaml") — any depth works
SQL files (sql_path)                     project root     Full recursive         .sql                 Referenced by relative path from project root
Schema files (schema_file)               project root     Full recursive         .yaml, .json, .ddl   Referenced by relative path from project root
Expectations files (expectations_file)   project root     Full recursive         .yaml, .json         Referenced by relative path from project root
Python modules (module_path)             project root     Full recursive         .py                  Referenced by relative path from project root
Templates                                templates/       Flat only              .yaml only ¹         Discovery uses glob("*.yaml") — not recursive
Presets                                  presets/         Flat only              .yaml only ¹         Discovery uses glob("*.yaml") — not recursive
Substitutions                            substitutions/   Flat only              .yaml only           One file per environment

¹ The .yml extension is also accepted, but .yaml is recommended for consistency.

BP-2.1: Organize pipeline YAMLs by source system, then by medallion layer

LHP recursively discovers all .yaml/.yml files under pipelines/. Use a two-level hierarchy — source system first, layer second — so that each team owns a clear subtree:

Pipeline directory structure
pipelines/
  system_a/                          # Source system / data domain
    bronze/
      system_a_bronze_ingest.yaml    # CloudFiles ingestion
    silver/
      system_a_silver_cleanse.yaml   # Validation and enrichment
    gold/
      system_a_gold_reporting.yaml   # Aggregations
  system_b/
    bronze/
      system_b_bronze_ingest.yaml
    silver/
      system_b_silver_merge.yaml
  shared/
    gold/
      cross_domain_metrics.yaml      # Cross-system gold tables

This structure maps cleanly to CODEOWNERS (pipelines/system_a/ owned by Team A) and to include patterns when you need to generate a subset.

BP-2.2: Organize SQL files mirroring the pipeline structure

All sql_path references resolve relative to the project root, so sql_path: sql/system_a/bronze/cleanse_raw.sql works natively. Mirror the pipeline directory hierarchy:

SQL directory structure
sql/
  system_a/
    bronze/
      parse_json_payload.sql
    silver/
      enrich_orders.sql
      validate_customers.sql
    gold/
      daily_revenue_summary.sql
  system_b/
    silver/
      merge_inventory.sql
  shared/
    lookups/
      currency_conversion.sql

When referencing from YAML:

Referencing external SQL files
actions:
  - name: transform_enrich_orders
    type: transform
    transform_type: sql
    sql_path: sql/system_a/silver/enrich_orders.sql
    source: load_raw_orders
    target: enriched_orders_view

BP-2.3: Organize schema files by source system and layer

Schema files (DDL, YAML, or JSON) also resolve relative to the project root:

Schema directory structure
schemas/
  system_a/
    bronze/
      raw_orders_schema.yaml        # CloudFiles schema hints
      raw_customers_schema.ddl      # DDL format
    silver/
      orders_strict_schema.yaml     # Schema transform definitions
  system_b/
    bronze/
      raw_inventory_schema.json     # JSON format

When referencing:

Referencing external schema files
actions:
  - name: transform_enforce_schema
    type: transform
    transform_type: schema
    schema_file: schemas/system_a/silver/orders_strict_schema.yaml
    enforcement: strict

BP-2.4: Organize expectations files by domain and quality tier

Store DQE expectation files in a dedicated expectations/ directory, grouped by domain and quality tier:

Expectations directory structure
expectations/
  system_a/
    bronze/
      raw_orders_warn.yaml          # Bronze: warn-only rules
    silver/
      orders_drop_rules.yaml        # Silver: drop invalid rows
      orders_quarantine_rules.yaml  # Silver: quarantine criteria
    gold/
      revenue_fail_rules.yaml       # Gold: fail on critical invariants
  shared/
    common_not_null_rules.yaml      # Reusable cross-domain rules

When referencing:

Referencing external expectations files
actions:
  - name: transform_dqe_orders
    type: transform
    transform_type: data_quality
    expectations_file: expectations/system_a/silver/orders_drop_rules.yaml
    source: enriched_orders_view

BP-2.5: Organize Python modules by function type

For Python-based loads, transforms, and sinks, group modules by their role:

Python modules directory structure
python_modules/
  transforms/
    system_a/
      ml_scoring.py
      custom_dedup.py
    shared/
      phone_normalizer.py
  datasources/
    erp_connector.py                # Custom DataSource V2
  sinks/
    webhook_sink.py                 # Custom DataSink
    foreachbatch/
      notify_downstream.py          # ForEachBatch handlers

BP-2.6: Use prefix-based grouping for templates

Templates are discovered only at the top level of templates/ — subdirectories are not discovered by lhp list_templates. Instead, use a structured prefix convention to categorize templates:

Template naming with prefixes
templates/
  TMPL001_brz_load_cloudfiles_standard.yaml        # Bronze / Load / CloudFiles
  TMPL002_brz_load_kafka_events.yaml               # Bronze / Load / Kafka
  TMPL003_brz_load_delta_snapshot.yaml             # Bronze / Load / Delta snapshot
  TMPL004_slv_transform_sql_enrichment.yaml        # Silver / Transform / SQL
  TMPL005_slv_transform_cdc_merge.yaml             # Silver / Transform / CDC
  TMPL006_slv_write_streaming_table_std.yaml       # Silver / Write / Streaming Table
  TMPL007_gld_write_materialized_view_agg.yaml     # Gold / Write / Materialized View
  TMPL008_full_bronze_to_silver_pipeline.yaml      # Full pipeline template (multi-action)

The prefix pattern <layer>_<action_type>_<detail> makes templates scannable in lhp list_templates output and in file explorers. When you have 30+ templates, this prefix is the primary way to find the right one.

See also

Templates Reference for details on creating and using templates.

BP-2.7: Use prefix-based grouping for presets

Like templates, presets are discovered only at the top level of presets/. Use prefixes to encode scope and layer:

Preset naming with prefixes
presets/
  global_defaults.yaml                             # Organization-wide
  brz_standard.yaml                                # Bronze layer defaults
  brz_cloudfiles_json.yaml                         # Bronze / CloudFiles / JSON specific
  brz_cloudfiles_csv.yaml                          # Bronze / CloudFiles / CSV specific
  slv_standard.yaml                                # Silver layer defaults
  slv_cdc_scd2.yaml                                # Silver / CDC / SCD Type 2
  gld_standard.yaml                                # Gold layer defaults
  ord_custom_overrides.yaml                        # Orders domain custom

See also

Presets Reference for details on preset inheritance and merging.

BP-2.8: Use include patterns for team-scoped generation

When multiple teams share a mono-repo, use include patterns in lhp.yaml to generate only relevant pipelines. Patterns are matched against paths relative to pipelines/:

Include only system_a pipelines
# lhp.yaml — generate only system_a pipelines
include:
  - "system_a/**/*.yaml"

Or selectively include specific layers:

Include only bronze pipelines
# Only bronze pipelines across all systems
include:
  - "**/bronze/*.yaml"

BP-2.9: Full enterprise project layout example

Complete enterprise project structure
my_lhp_project/
  lhp.yaml                                # Project config
  substitutions/
    dev.yaml
    staging.yaml
    prod.yaml
  presets/
    global_defaults.yaml
    brz_standard.yaml
    brz_cloudfiles_json.yaml
    slv_standard.yaml
    slv_cdc_scd2.yaml
    gld_standard.yaml
  templates/
    TMPL001_brz_load_cloudfiles_standard.yaml
    TMPL002_slv_transform_sql_enrichment.yaml
    TMPL003_gld_write_mv_aggregation.yaml
  pipelines/
    system_a/
      bronze/
        system_a_bronze_ingest_TMPL001.yaml
      silver/
        system_a_silver_cleanse_TMPL002.yaml
      gold/
        system_a_gold_reporting_TMPL003.yaml
    system_b/
      bronze/
        system_b_bronze_ingest_TMPL001.yaml
      silver/
        system_b_silver_merge_TMPL002.yaml
  sql/
    system_a/
      silver/
        enrich_orders.sql
      gold/
        daily_revenue.sql
    system_b/
      silver/
        merge_inventory.sql
  schemas/
    system_a/
      bronze/
        raw_orders_schema.yaml
      silver/
        orders_strict_schema.yaml
    system_b/
      bronze/
        raw_inventory_schema.yaml
  expectations/
    system_a/
      bronze/
        raw_orders_warn.yaml
      silver/
        orders_drop_rules.yaml
    shared/
      common_not_null_rules.yaml
  python_modules/
    transforms/
      system_a/
        ml_scoring.py
    datasources/
      erp_connector.py
  generated/                               # Output (per environment)
    dev/
      system_a_bronze_pipeline/
        raw_orders.py
      system_a_silver_pipeline/
        orders_cleanse.py

3. Naming Conventions

BP-3.1: Use snake_case consistently across all identifiers

Pipelines, flowgroups, action names, templates, presets, variables, table names — all snake_case. LHP generates Python function names from action names, so this ensures valid Python identifiers.

BP-3.2: Prefix pipeline names with the source system and layer

erp_bronze_pipeline, crm_silver_pipeline — not bronze_pipeline or pipeline_v2. At 200+ pipelines, generic names become meaningless. LHP uses the pipeline field in flowgroups to group actions into output files. See BP-3.9 for the full enterprise naming pattern.

BP-3.3: Name flowgroups to describe the data flow

erp_brz_raw_orders, erp_slv_orders_enriched — not cloudfiles_load_1 or flowgroup_v2. The flowgroup name appears in generated file names and log output. Embed the source system and layer for visibility. See BP-3.8 for the full enterprise naming pattern.

BP-3.4: Name actions descriptively with the pattern <verb>_<entity>_<modifier>

load_raw_orders, transform_validate_orders, write_orders_silver, test_orders_row_count. Action names become Python function names in generated code, so clarity matters.

BP-3.5: Use SCREAMING_SNAKE_CASE for environment tokens

Environment tokens (${SOURCE_CATALOG}, ${LANDING_PATH}) are resolved from substitution files. Local variables (%{table_name}, %{source_schema}) are flowgroup-scoped. The case distinction makes it immediately clear which resolution mechanism applies.

See also

Substitutions & Secrets for the full substitution processing order and syntax.

BP-3.6: Never abbreviate in identifiers

customer_silver_merge not cust_slvr_mrg. Config files live in version control forever; clarity beats brevity.

Structured Naming for Enterprise Visibility

At enterprise scale (100+ templates, 500+ flowgroups), flat alphabetical lists become unmanageable. Templates use a TMPLxxx_ ID prefix to embed a unique sequence number, making them instantly scannable and sortable. Flowgroup config files reference the template ID as a _TMPLxxx suffix, creating a visible link between a config and its template. All other artifacts — pipelines, presets, SQL files, schemas, and expectations — use descriptive prefixes and directory structure for organisation.

BP-3.7: Use TMPLxxx ID prefixes for templates

Since templates live in a flat directory (see Section 2), the filename is the only organisational mechanism. Use a TMPLxxx_ prefix with a sequential number, followed by a structured name that encodes layer and action type:

Template naming pattern
Pattern: TMPLxxx_<layer>_<action_type>_<source_or_target_type>_<descriptive_name>

Examples:
  TMPL001_brz_load_cloudfiles_standard        # Bronze / Load / CloudFiles / standard pattern
  TMPL002_brz_load_cloudfiles_with_schema     # Bronze / Load / CloudFiles / with schema hints
  TMPL003_brz_load_kafka_events               # Bronze / Load / Kafka / event stream
  TMPL004_slv_transform_sql_enrichment        # Silver / Transform / SQL / enrichment pattern
  TMPL005_slv_transform_cdc_merge             # Silver / Transform / CDC / merge pattern
  TMPL006_slv_write_st_with_dqe               # Silver / Write / Streaming Table / with DQE
  TMPL007_gld_write_mv_aggregation            # Gold / Write / Materialized View / aggregation
  TMPL008_e2e_full_bronze_to_silver           # End-to-end / multi-action pipeline template

Layer prefixes: brz_ (bronze), slv_ (silver), gld_ (gold), e2e_ (end-to-end multi-action).

The TMPLxxx prefix sorts templates by creation order in lhp list_templates output, while the layer prefix groups them logically. The ID also appears as a suffix in flowgroup config filenames (see BP-3.8), creating a visible link between configs and their templates.

BP-3.8: Use descriptive flowgroup names with a _TMPLxxx config file suffix

Flowgroup names become Python file names and function names in generated code. Embed the source system and layer for visibility across large projects:

Flowgroup naming pattern
Pattern: <system>_<layer>_<descriptive_name>

Examples:
  erp_brz_raw_orders                  # ERP system / Bronze / raw orders
  erp_brz_raw_customers               # ERP system / Bronze / raw customers
  erp_slv_orders_enriched             # ERP system / Silver / enriched orders
  erp_slv_customers_merged            # ERP system / Silver / merged customers
  erp_gld_daily_revenue               # ERP system / Gold / daily revenue
  crm_brz_raw_contacts                # CRM system / Bronze / raw contacts
  crm_slv_contacts_deduped            # CRM system / Silver / deduped contacts

When naming the flowgroup config file, append the template ID as a suffix so the template relationship is visible at a glance without opening the file:

Config file naming pattern
Pattern: <system>_<layer>_<description>_<TMPLxxx>.yaml

Examples:
  erp_bronze_ingest_TMPL001.yaml      # Uses TMPL001 (CloudFiles standard)
  erp_silver_cleanse_TMPL004.yaml     # Uses TMPL004 (SQL enrichment)
  erp_gold_reporting_TMPL007.yaml     # Uses TMPL007 (MV aggregation)
  crm_bronze_contacts_TMPL001.yaml    # Uses TMPL001 (CloudFiles standard)

This naming ensures that when you see a generated file erp_brz_raw_orders.py or a DLT log entry for erp_slv_orders_enriched, you immediately know the source system and layer without looking up the config. The _TMPLxxx suffix in the config filename lets you identify the template at the file system level — useful when browsing directories, reviewing PRs, or triaging issues.

BP-3.9: Use structured prefixes for pipeline names

Pipeline names determine the output directory structure under generated/{env}/ and appear in Databricks UI. Use <system>_<layer>_pipeline for clear identification:

Pipeline naming pattern
Pattern: <system>_<layer>_pipeline

Examples:
  erp_bronze_pipeline                 # All ERP bronze ingestion
  erp_silver_pipeline                 # All ERP silver transforms
  erp_gold_pipeline                   # All ERP gold aggregations
  crm_bronze_pipeline                 # All CRM bronze ingestion
  shared_gold_pipeline                # Cross-system gold tables

This gives you clean, predictable output directories:

Generated output with structured names
generated/dev/
  erp_bronze_pipeline/
    erp_brz_raw_orders.py
    erp_brz_raw_customers.py
  erp_silver_pipeline/
    erp_slv_orders_enriched.py
  crm_bronze_pipeline/
    crm_brz_raw_contacts.py

BP-3.10: Use consistent prefixes for presets

Since presets are also flat (no subdirectory discovery), the naming prefix is essential for organisation:

Preset naming pattern
Pattern: <scope>_<layer>_<purpose>

Examples:
  global_defaults                     # Organisation-wide standards
  brz_standard                        # Bronze layer standard preset
  brz_cloudfiles_json                 # Bronze / CloudFiles / JSON format
  brz_cloudfiles_csv                  # Bronze / CloudFiles / CSV format
  brz_kafka_events                    # Bronze / Kafka event preset
  slv_standard                        # Silver layer standard preset
  slv_cdc_scd2                        # Silver / CDC / SCD Type 2
  gld_standard                        # Gold layer standard preset
  erp_custom                          # ERP domain custom overrides

Quick Reference Table

Artifact            Convention                                       Example
------------------  -----------------------------------------------  -----------------------------------------
Pipeline names      <system>_<layer>_pipeline                        erp_bronze_pipeline
Flowgroup names     <system>_<layer>_<description>                   erp_brz_raw_orders
Action names        <verb>_<entity>_<modifier>                       load_raw_orders
Config files        <system>_<layer>_<description>_<TMPLxxx>.yaml    erp_bronze_ingest_TMPL001.yaml
Template files      TMPLxxx_<layer>_<action>_<type>_<name>.yaml      TMPL001_brz_load_cloudfiles_standard.yaml
Preset files        <scope>_<layer>_<purpose>.yaml                   brz_standard.yaml
SQL files           <domain>/<layer>/<description>.sql               erp/silver/enrich_orders.sql
Schema files        <domain>/<layer>/<description>.yaml              erp/bronze/raw_orders_schema.yaml
Expectations files  <domain>/<layer>/<description>.yaml              erp/silver/orders_drop_rules.yaml
Generated files     <flowgroup_name>.py                              erp_brz_raw_orders.py
Env tokens          ${SCREAMING_SNAKE_CASE}                          ${SOURCE_CATALOG}
Local variables     %{lower_snake_case}                              %{table_suffix}
Template params     {{ lower_snake_case }}                           {{ partition_column }}

4. Template Design

BP-4.1: Extract a template only after 3+ flowgroups share the same pattern

Building templates for one-off use cases leads to over-generalisation. Write three explicit flowgroups first, identify the common pattern, then extract the template. LHP templates support parameters with required, default, and description fields.

BP-4.2: Keep template parameters minimal and well-documented

Every parameter should have a description and either be required: true or have a sensible default. LHP validates required parameters at generation time and reports clear errors for missing ones. Avoid templates with 15+ parameters — they add complexity without reducing it.
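A parameter block following this guidance might look like the sketch below. The template and parameter names are illustrative; only the name, required, default, and description fields come from the LHP parameter specification.

```yaml
# Illustrative template header: parameter names are examples only
name: brz_load_cloudfiles_standard
parameters:
  - name: entity
    required: true
    description: "Entity name, used in action names and target tables"
  - name: landing_subpath
    required: true
    description: "Path segment under ${LANDING_PATH} to ingest from"
  - name: file_format
    default: json
    description: "Source file format (json, csv, avro)"
```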

BP-4.3: Establish “golden templates” for each common pipeline pattern

Maintain platform-team-owned templates for standard patterns, using the ID-based naming from Section 3:

  • TMPL001_brz_load_cloudfiles_standard — standard CloudFiles ingestion with operational metadata

  • TMPL002_brz_load_delta_snapshot — Delta table reads with standard options

  • TMPL003_slv_write_st_with_dqe — streaming table with DQE expectations

  • TMPL004_slv_transform_sql_enrichment — SQL-based silver enrichment

  • TMPL005_gld_write_mv_aggregation — materialized view for gold aggregations

These golden templates embed organisational standards (default expectations, metadata columns, table properties) so domain teams can’t accidentally skip them.

BP-4.4: Templates live in a flat directory — organise by naming convention

LHP discovers templates only from the top level of templates/ (using glob("*.yaml"), not recursive). Subdirectories under templates/ are not discovered by lhp list_templates. Instead, use the structured prefix convention from BP-3.7 to group templates logically.

Note

Subdirectories under templates/ are not discovered. Referencing templates via subfolder paths (e.g., use_template: "subfolder/name") is not supported. Stick to the flat directory with prefix-based naming.

BP-4.5: Templates can reference presets — use this to layer defaults

A template can declare presets: [brz_standard] to inherit default options. Flowgroups using the template can add additional presets that override. This creates a clean defaults hierarchy: template presets -> flowgroup presets -> explicit action config.
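A sketch of this layering, using preset names from Section 2:

```yaml
# templates/TMPL001_brz_load_cloudfiles_standard.yaml (excerpt)
name: brz_load_cloudfiles_standard
presets: [brz_standard]   # layer defaults inherited by every flowgroup using the template

# A flowgroup using this template can then add, for example:
#   presets: [brz_cloudfiles_json]
# which overrides brz_standard where the two overlap; explicit action
# config in the flowgroup wins over both.
```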

BP-4.6: Use template parameters for what varies; presets for what is standard

Template parameters should capture the unique aspects of each use case (source path, target table, specific columns). Standard aspects (table properties, operational metadata, reader options) belong in presets. This keeps template usage concise.

BP-4.7: Reference external files from templates using parameterised paths

Templates can reference external files via sql_path, schema_file, or expectations_file. Use template parameters for the variable part of the path, combined with a fixed subdirectory convention:

Template with parameterised SQL path
# Template: slv_transform_sql_enrichment.yaml
name: slv_transform_sql_enrichment
parameters:
  - name: system
    required: true
    description: "Source system name (used in file paths)"
  - name: entity
    required: true
    description: "Entity name"
actions:
  - name: transform_enrich_{{ entity }}
    type: transform
    transform_type: sql
    sql_path: "sql/{{ system }}/silver/enrich_{{ entity }}.sql"
    source: "load_raw_{{ entity }}"
    target: "enriched_{{ entity }}_view"

This way, the directory structure convention (sql/<system>/silver/) is baked into the template, ensuring all teams follow the same file organisation.

See also

Templates Reference for the full template specification and Dynamic Templates Guide for conditionals, loops, and advanced Jinja2 features.

5. Preset Strategy

BP-5.1: Design a preset hierarchy — global, domain, pipeline-specific

LHP supports preset inheritance via extends and preset chaining (multiple presets in a list, merged left-to-right). Use this to build layers:

  • global_defaults — organisation-wide standards (table properties, metadata)

  • bronze_standard extends global_defaults — bronze-layer conventions

  • orders_bronze extends bronze_standard — domain-specific overrides
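The third tier of that hierarchy might look like this sketch (preset names taken from the bullets above; the override value is illustrative):

```yaml
# presets/orders_bronze.yaml: domain tier of the hierarchy
name: orders_bronze
extends: bronze_standard          # which itself extends global_defaults
defaults:
  load_actions:
    cloudfiles:
      options:
        cloudFiles.maxFilesPerTrigger: 200   # domain-specific override
```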

BP-5.2: Encode organisational standards in presets, not just values

A high-value preset sets multiple related properties together:

Bronze standard preset example
name: bronze_standard
extends: global_defaults
defaults:
  load_actions:
    cloudfiles:
      options:
        cloudFiles.schemaEvolutionMode: rescue
        cloudFiles.rescuedDataColumn: _rescued_data
        cloudFiles.maxFilesPerTrigger: 1000
  write_actions:
    streaming_table:
      table_properties:
        pipelines.reset.allowed: "false"
  operational_metadata:
    - ingest_timestamp
    - source_file

BP-5.3: Limit the total number of presets

More than 15–20 distinct presets leads to confusion and misuse. Consolidate overlapping presets. LHP’s lhp list_presets command helps audit the current set.

BP-5.4: Use lhp show to verify effective configuration

After preset merging, template expansion, and substitution, the effective config can differ from what the YAML file suggests. Always verify with lhp show <flowgroup> --env <env> before deploying changes to shared presets. This is LHP’s equivalent of “fully resolved config.”

BP-5.5: Treat preset changes as high-blast-radius events

A change to a global preset affects every pipeline using it. Version presets (add a version field), document changes, and run lhp validate --env <env> across the entire project before merging preset changes.

See also

Presets Reference for complete details on preset inheritance and merging.

6. Substitution & Environment Management

BP-6.1: Use directory-based environment separation

Maintain substitutions/dev.yaml, substitutions/staging.yaml, substitutions/prod.yaml. All environments are visible on the same branch. LHP resolves ${token} patterns from these files.

BP-6.2: Put all environment-varying values in substitution tokens

Catalog names, schema names, storage paths, cluster policies, alert emails — all should be tokens. LHP supports recursive token expansion (tokens referencing other tokens, up to 10 iterations), so you can compose:

Recursive token expansion
global:
  catalog_prefix: main

dev:
  catalog: "${catalog_prefix}_dev"

prod:
  catalog: "${catalog_prefix}_prod"

BP-6.3: Use the global section for shared values

LHP’s substitution files support a global section whose values are inherited by all environments. Environment-specific sections override global values. This eliminates duplication.

BP-6.4: Never put secret values in substitution files

Use LHP’s ${secret:scope/key} syntax. LHP converts these to dbutils.secrets.get(scope="scope", key="key") calls in generated code. Configure secrets.default_scope and scopes aliases in the substitution file for clean references.
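A sketch of both halves, with illustrative scope names; the nested layout of the scope aliases is an assumption, while secrets.default_scope and the ${secret:scope/key} syntax come from the documentation:

```yaml
# substitutions/prod.yaml (excerpt): scope names are illustrative
secrets:
  default_scope: lhp_prod
  scopes:
    db: prod_database_secrets    # alias "db" for use in ${secret:db/...}

# In a flowgroup, reference the secret rather than the value:
#   password: "${secret:db/jdbc_password}"
# LHP converts this to dbutils.secrets.get(...) in generated code.
```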

Important

Secrets in substitution files will be committed to version control and leaked. Use the ${secret:scope/key} syntax exclusively.

BP-6.5: Use lhp substitutions to audit available tokens

Before writing flowgroups, run lhp substitutions --env <env> to check what tokens are available. This prevents unresolved token errors at generation time.

BP-6.6: Design substitution tokens for the medallion pattern

Standard token set for a medallion project:

Medallion substitution tokens
global:
  bronze_catalog: "${catalog_prefix}_bronze"
  silver_catalog: "${catalog_prefix}_silver"
  gold_catalog: "${catalog_prefix}_gold"
  landing_path_base: "abfss://landing@${storage_account}.dfs.core.windows.net"

See also

Substitutions & Secrets for the full substitution processing order and syntax.

7. Local Variables

BP-7.1: Use local variables for flowgroup-scoped repetition

When the same value (table name, schema, path segment) appears multiple times within a single flowgroup, define it as a local variable rather than repeating it. LHP resolves %{var} first, before template expansion.

BP-7.2: Prefer local variables over hardcoded values

Using local variables
variables:
  entity: orders
  source_schema: raw
actions:
  - name: load_%{entity}
    source:
      table: "${BRONZE_CATALOG}.%{source_schema}.%{entity}"

BP-7.3: Do not use local variables for environment-specific values

%{var} is scoped to a single flowgroup and resolved at parse time. Environment-specific values belong in substitution tokens (${TOKEN}) which are resolved per environment.

See also

Substitutions & Secrets for details on local variables and environment tokens.

8. FlowGroup Design

BP-8.1: Use array syntax with field inheritance for multi-flowgroup pipelines

When multiple flowgroups share the same pipeline, presets, or template, use LHP’s array syntax to inherit:

Array syntax with inheritance
pipeline: orders_bronze
presets: [bronze_standard]
operational_metadata: true
flowgroups:
  - flowgroup: raw_orders
    actions: [...]
  - flowgroup: raw_returns
    actions: [...]

Inherited fields: pipeline, use_template, presets, operational_metadata, job_name.

See also

Multi-Flowgroup YAML Files for the full multi-flowgroup reference.

BP-8.2: Scope one pipeline per data domain

Pipeline orders_bronze contains flowgroups raw_orders, raw_returns, raw_refunds. Each flowgroup generates its own Python function set but runs in the same DLT pipeline, enabling dependency resolution across them.

BP-8.3: Use job_name to group flowgroups into Databricks jobs

LHP’s lhp deps --format job generates job resource definitions. Use job_name to control which flowgroups are orchestrated together in a Databricks Workflow.
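The field sits alongside pipeline in the flowgroup header; the job name here is illustrative:

```yaml
pipeline: erp_bronze_pipeline
flowgroup: erp_brz_raw_orders
job_name: erp_ingest_job    # flowgroups sharing this job_name are orchestrated together
```

Then generate the job resource definitions with lhp deps --format job.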

See also

Concepts & Architecture for details on job_name and multi-job orchestration.

BP-8.4: Order actions as Load, Transform, Write, Test

This matches the data flow direction and makes YAML files scannable. LHP resolves dependencies automatically, but consistent ordering improves readability.
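A skeleton flowgroup in this order (action bodies elided; the write and test action shapes are sketched from the naming conventions in Section 3, not from a full specification):

```yaml
pipeline: erp_bronze_pipeline
flowgroup: erp_brz_raw_orders
actions:
  - name: load_raw_orders            # 1. Load
    type: load
    # source config
  - name: transform_validate_orders  # 2. Transform
    type: transform
    # transform config
  - name: write_orders_bronze        # 3. Write
    type: write
    # target config
  - name: test_orders_row_count      # 4. Test
    type: test
    # test config
```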

9. Load Actions

BP-9.1: Always set schemaEvolutionMode and rescuedDataColumn for CloudFiles

LHP’s CloudFiles generator supports all Auto Loader options. In production, always use:

CloudFiles with schema rescue
source:
  type: cloudfiles
  path: "${LANDING_PATH}/orders/"
  format: json
  options:
    cloudFiles.schemaEvolutionMode: rescue
    cloudFiles.rescuedDataColumn: _rescued_data

Tip

Put these options in a bronze_standard preset so they apply everywhere without repetition.

BP-9.2: Use readMode: stream for bronze, readMode: batch for lookups

LHP’s readMode field controls whether spark.readStream or spark.read is generated. Bronze sources should stream; dimension/lookup tables should batch-read.
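As a sketch, with the placement of readMode within the action assumed:

```yaml
actions:
  - name: load_raw_orders
    type: load
    readMode: stream              # generates spark.readStream
    source:
      type: cloudfiles
      path: "${LANDING_PATH}/orders/"
      format: json

  - name: load_currency_lookup
    type: load
    readMode: batch               # generates spark.read
    source:
      type: delta
      catalog: "${SILVER_CATALOG}"
      database: "reference"
      table: "currency_rates"
```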

BP-9.3: Use full three-part names via substitution tokens for Delta loads

Delta source with substitution tokens
source:
  type: delta
  catalog: "${SILVER_CATALOG}"
  database: "orders"
  table: "validated_orders"

LHP constructs catalog.database.table references. Never hardcode catalog or database names.

BP-9.4: Rate-limit Auto Loader in production

Use cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigger options (via presets) to prevent bronze ingestion from overwhelming downstream tables. Set this in your bronze_standard preset.

BP-9.5: Use schema_hints for critical columns

LHP supports cloudFiles.schemaHints option strings. For columns where wrong type inference would cause downstream failures (amounts, IDs, timestamps), provide explicit hints.
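Auto Loader accepts hints as a DDL-style string; a sketch with illustrative columns:

```yaml
source:
  type: cloudfiles
  path: "${LANDING_PATH}/orders/"
  format: json
  options:
    cloudFiles.schemaEvolutionMode: rescue
    cloudFiles.schemaHints: "order_id BIGINT, amount DECIMAL(18,2), order_ts TIMESTAMP"
```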

See also

Load Actions for the full load action specification.

10. Transform Actions

BP-10.1: Default to SQL transforms for silver/gold layer logic

LHP’s SQL transform generator supports inline SQL or external SQL files via sql_path. SQL is more readable, more widely understood, and easier to review than Python transforms for standard operations. Use external SQL files for anything over ~5 lines.

BP-10.2: Use external SQL files for complex transformations

LHP resolves sql_path relative to the project root. Store SQL in sql/<system>/<layer>/<transform_name>.sql (see Section 2). This keeps YAML files concise and enables SQL-specific linting.
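As a sketch (the surrounding action fields are illustrative; see the Transform Actions reference for the exact schema):

External SQL transform
```yaml
- name: enrich_orders
  type: transform
  transform_type: sql
  source: v_orders_clean
  sql_path: "sql/erp/silver/enrich_orders.sql"   # resolved relative to project root
  target: v_orders_enriched
```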

BP-10.3: Use Python transforms only when SQL cannot express the logic

LHP’s Python transform generator copies external modules and calls your function. The signature depends on the number of sources:

  • Single source: function(df, spark, parameters) — receives the source DataFrame directly

  • Multiple sources: function(dataframes, spark, parameters) — receives a list of DataFrames

  • No sources: function(spark, parameters) — function generates data from scratch

Reserve Python transforms for UDFs, ML scoring, or complex procedural logic.
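To illustrate the single-source signature, a minimal transform module might look like this (a sketch; the function name, filtering logic, and parameter key are assumptions, not LHP requirements):

Single-source Python transform module
```python
# python_modules/orders/flag_large_orders.py (path illustrative)
def flag_large_orders(df, spark, parameters):
    """Single-source signature: LHP passes the source DataFrame directly."""
    from pyspark.sql import functions as F  # imported lazily; requires pyspark at runtime

    # Threshold key is hypothetical; supplied via the action's parameters
    threshold = parameters.get("large_order_threshold", 10_000)
    return df.withColumn("is_large_order", F.col("amount") >= F.lit(threshold))
```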

BP-10.4: Use schema transforms for explicit column control

LHP’s schema transform type supports column renaming (arrow syntax: old_name -> new_name), type casting, and strict/permissive enforcement. Use enforcement: strict at silver to reject unexpected columns from bronze.
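A sketch of the rename/cast/enforce combination (the exact mapping layout may differ from the real schema; the arrow syntax and enforcement: strict are as documented above):

Schema transform at the silver boundary
```yaml
transform_type: schema
enforcement: strict                  # reject columns not declared below
columns:
  ord_id -> order_id: bigint         # rename (arrow syntax) + cast
  amt -> amount: "decimal(18,2)"
  order_ts: timestamp                # cast only
```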

BP-10.5: Use data_quality transforms for DQE expectations

LHP’s data_quality transform type reads expectations from YAML/JSON files or inline definitions, generating the appropriate @dp.expect_all(), @dp.expect_all_or_drop(), or @dp.expect_all_or_fail() decorators.

BP-10.6: Use temp_table transforms for intermediate calculations

LHP generates @dp.table(temporary=True) for temp tables. Use these for intermediate steps that should not be published to Unity Catalog.

See also

Transform Actions for the full transform action specification.

11. Write Actions

BP-11.1: Default to materialized views for silver/gold layers

LHP’s materialized_view write target generates @dp.materialized_view(). Materialized views always produce correct results — they reprocess when source data changes. Use them for all joins, aggregations, and enrichment.

BP-11.2: Use streaming tables for bronze ingestion and CDC targets

LHP’s streaming_table write target generates dp.create_streaming_table() + @dp.append_flow(). Streaming tables are optimal for append-only ingestion.

Important

Joins in streaming tables do not recompute when dimensions change — use materialized views for enrichment.

BP-11.3: Set pipelines.reset.allowed: "false" on history tables

LHP supports table_properties in write targets. This prevents accidental full refresh from destroying historical data:

Protecting history tables from reset
write_target:
  type: streaming_table
  table_properties:
    pipelines.reset.allowed: "false"

Tip

Put this in your silver_standard and gold_standard presets.

BP-11.4: Use cluster_columns (liquid clustering) instead of partition_columns

LHP supports both, but liquid clustering is the modern recommendation. It’s incremental, allows redefining keys without rewriting data, and works well with high-cardinality columns:

Liquid clustering
write_target:
  type: streaming_table
  cluster_columns: [customer_id, order_date]

BP-11.5: Use comment on every write target

LHP passes the comment field to the generated table/view definition. This appears in Unity Catalog UI and is queryable.

BP-11.6: Use spark_conf for per-table performance tuning

LHP supports spark_conf on write targets. Use it for adaptive shuffle or per-table optimisations rather than global pipeline settings.

BP-11.7: For CDC, use the cdc mode with explicit cdc_config

LHP generates dp.create_auto_cdc_flow() with full support for keys, sequence_by (including STRUCT for tie-breaking), scd_type (1 or 2), apply_as_deletes, ignore_null_updates, track_history_column_list, and track_history_except_column_list options. Always specify sequence_by explicitly.
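A sketch mirroring the snapshot_cdc example later in this section (the delete predicate and column names are illustrative):

CDC configuration with explicit sequencing
```yaml
write_target:
  type: streaming_table
  streaming_table_config:
    mode: "cdc"
    cdc_config:
      keys: [order_id]
      sequence_by: event_ts             # always explicit; STRUCT supported for tie-breaking
      scd_type: 2
      apply_as_deletes: "op = 'DELETE'"
      ignore_null_updates: true
```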

BP-11.8: Use once: true for backfill flows

LHP supports the once flag on individual actions, generating one-time flows for historical data backfill without affecting the ongoing streaming ingestion.

BP-11.9: Multiple write actions targeting the same table are automatically grouped

LHP consolidates multiple sources writing to the same streaming table into one create_streaming_table with multiple append_flow functions. Use this for multi-source ingestion patterns.

BP-11.10: Use snapshot_cdc mode for full-snapshot change data capture

LHP also supports mode: "snapshot_cdc" on streaming tables, generating dp.create_auto_cdc_from_snapshot_flow(). Use this when your source provides full snapshots (not a change feed) and you want LHP to detect changes automatically.

Configuration uses snapshot_cdc_config (not cdc_config):

Snapshot CDC configuration
write_target:
  type: streaming_table
  streaming_table_config:
    mode: "snapshot_cdc"
    snapshot_cdc_config:
      source_function:
        file: "functions/my_snapshots.py"
        function: "my_snapshot_function"
      keys: [id]
      stored_as_scd_type: 2

Key differences from cdc mode:

  • Config key is snapshot_cdc_config (not cdc_config)

  • SCD type field is stored_as_scd_type (not scd_type)

  • Requires a source_function with file and function fields

  • Does not use sequence_by — ordering is implicit from snapshot timing

BP-11.11: Use sink write targets for streaming to external destinations

LHP supports a sink write target type for writing to external systems. Four sink subtypes are available:

  • delta — write to external Delta tables outside Unity Catalog (e.g., cross-workspace or external storage)

  • kafka — write to Kafka or Azure Event Hubs for event-driven architectures

  • custom — use a custom DataSink V2 class via the custom_sink_class config field

  • foreachbatch — ForEachBatch handlers for custom per-batch processing (API calls, notifications, etc.)

Kafka sink example
write_target:
  type: sink
  sink_type: kafka
  sink_config:
    kafka.bootstrap.servers: "${KAFKA_BROKERS}"
    topic: "enriched_orders"

Use sinks when data must leave the lakehouse — for downstream consumers, event buses, or external APIs. Pair with streaming tables for the primary lakehouse copy.

See also

Write Actions for the full write action specification.

12. Data Quality (Expectations)

BP-12.1: Tier expectations by medallion layer

  • Bronze: warn only — never drop or fail at bronze. Every raw record is precious.

  • Silver: drop for structural quality rules. Route violations to a quarantine table.

  • Gold/Critical: fail for reference table integrity and business-critical invariants.

LHP’s DQE parser supports failureAction: fail|drop|warn in expectation files and generates the appropriate decorators.
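As a sketch, an external expectations file might look like this (the constraint field name and rules are illustrative; see the DQE reference for the exact file schema):

Tiered expectations file
```yaml
# expectations/orders/silver_orders.yaml (path illustrative)
- name: valid_order_id_not_null
  constraint: "order_id IS NOT NULL"
  failureAction: drop
- name: valid_amount_positive
  constraint: "amount > 0"
  failureAction: drop
- name: known_currency_code
  constraint: "currency IN ('USD', 'EUR', 'GBP')"
  failureAction: warn
```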

See also

For configuring quarantine mode in LHP, see Quarantine (Dead Letter Queue).

BP-12.2: Centralise expectation definitions in external DQE files

LHP supports expectations_file pointing to YAML/JSON files. Store these in expectations/<domain>/ and reference them from multiple actions. This enables reuse and independent review of quality rules.

BP-12.3: Name expectations descriptively

Convention: valid_<column>_<constraint_type> (e.g., valid_order_id_not_null, valid_amount_positive). These names appear in the DLT Data Quality tab and event log.

BP-12.5: Use test actions for cross-table validation

LHP’s 9 test action types (row_count, uniqueness, referential_integrity, completeness, range, schema_match, all_lookups_found, custom_sql, custom_expectations) generate SQL-based validation views. Use the --include-tests flag to generate them. Always run these in staging before production deployment.

To publish test results to external systems like Azure DevOps or a Delta audit table, see Test Result Reporting (Publishing).

See also

Test Actions (Data Quality Unit Tests) for the full test action specification.

13. Operational Metadata

BP-13.1: Define operational metadata columns in lhp.yaml

LHP supports project-level operational_metadata with column definitions, presets, and defaults. Define standard columns once:

Operational metadata configuration in lhp.yaml
operational_metadata:
  columns:
    ingest_timestamp:
      expression: "F.current_timestamp()"
      description: "When the record was ingested"
      applies_to: [streaming_table, materialized_view]
    source_file:
      expression: "F.input_file_name()"
      description: "Source file path"
      applies_to: [streaming_table]
      enabled: true
    pipeline_id:
      expression: "F.lit(spark.conf.get('pipelines.id'))"
      description: "Pipeline identifier"
      additional_imports:
        - "from pyspark.sql import functions as F"

Each column config supports these fields:

  • expression (required) — PySpark expression string

  • description — Human-readable description

  • applies_to — List of target types (default: [streaming_table, materialized_view])

  • enabled — Boolean to enable/disable the column (default: true)

  • additional_imports — List of extra Python import statements needed by the expression

BP-13.2: Create metadata presets for different layers

LHP supports operational_metadata.presets for named groups in lhp.yaml:

Metadata presets
operational_metadata:
  presets:
    bronze_standard: [ingest_timestamp, source_file, pipeline_id]
    silver_standard: [updated_at, pipeline_run_id]

Note

Metadata presets are defined at the project level for documentation and organisational purposes. At the flowgroup or action level, operational_metadata accepts either true (to enable all columns) or an explicit list of column name strings — not preset names. Reference the preset definitions as a guide when writing the column name lists in your flowgroups.

BP-13.3: Metadata is additive across preset, flowgroup, and action levels

LHP deep-merges operational metadata with deduplication. This means you can set a baseline in a preset and add columns at the flowgroup or action level without losing the preset columns.

BP-13.4: Use applies_to to control which target types get each column

input_file_name() is only valid in streaming/batch reads — set applies_to: [streaming_table]. current_timestamp() works everywhere — set applies_to: [streaming_table, materialized_view].

See also

Operational Metadata for the full operational metadata reference.

14. Schema Management

BP-14.1: Use schema files for bronze layer schema definition

LHP’s schema_file field in load actions points to external DDL, YAML, or JSON schema files. This makes schema definitions reviewable independently of pipeline config.

BP-14.2: Use schema transforms at the bronze-to-silver boundary

LHP’s schema transform type provides explicit column control:

  • Arrow syntax for renaming: old_col -> new_col

  • Type casting: amount: decimal(18,2)

  • Strict enforcement to reject unexpected columns

BP-14.3: Use enforcement: strict at silver to prevent schema drift

LHP’s schema transform with enforcement: strict generates code that only keeps declared columns. Combined with silver-layer DQE expectations, this creates a clean schema contract between bronze and silver.

15. Validation & CI Integration

BP-15.1: Run lhp validate as a blocking CI check on every PR

LHP’s validation stack catches: missing required fields, unknown fields (with fuzzy-match suggestions), circular dependencies, invalid references, template parameter mismatches, and type-specific validation for all 7 load types, 5 transform types, and all write target types.

BP-15.2: Run lhp generate --dry-run to verify code generation

Dry-run generates code without writing files. Use this in CI to catch generation errors early.

BP-15.3: Maintain dry-run baselines for regression detection

Commit expected generated output to the repo. In CI, run lhp generate --dry-run and diff against baselines. Unexpected changes (especially from preset modifications) are flagged for review. This is the config-equivalent of snapshot testing.

BP-15.4: Layer your CI validation pipeline

| Layer | What it checks | Tool |
|---|---|---|
| Syntax | Valid YAML, correct indentation | yamllint |
| Schema | Required fields, correct types | JSON Schema (LHP provides schemas in src/lhp/schemas/) |
| Semantic | References resolve, no circular deps | lhp validate --env <env> |
| Generation | Config generates valid Python | lhp generate --dry-run --env <env> |
| Regression | No unintended diff in output | Baseline comparison |
| Functional | Test actions pass | pytest with --include-tests |
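These layers can be wired into CI as sequential blocking steps. A minimal sketch using GitHub Actions (the package name, directory layout, and baseline path are assumptions; adapt to your CI system):

Layered CI pipeline
```yaml
name: lhp-ci
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install lakehouse-plumber yamllint   # package name assumed
      - run: yamllint pipelines/ presets/ templates/  # syntax layer
      - run: lhp validate --env dev                   # semantic layer
      - run: lhp generate --dry-run --env dev         # generation layer
      - run: git diff --exit-code generated/          # regression vs committed baseline (path assumed)
```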

See also

CI/CD Reference for comprehensive CI/CD patterns and deployment strategies.

16. State Management & Incremental Generation

BP-16.1: Do not commit .lhp_state.json to version control

LHP’s state tracking enables smart regeneration — only files whose source YAML, dependencies, or generation context changed are regenerated. This significantly speeds up lhp generate for large projects, but the state file is local working data and must not be committed to source control.

BP-16.2: Use lhp state to audit orphaned and stale files

After refactoring (renaming flowgroups, deleting pipelines), use the available flags to audit and manage state:

| Flag | Purpose |
|---|---|
| --orphaned | Show generated files with no corresponding source YAML |
| --stale | Show files where the source YAML has changed since last generation |
| --new | Show new/untracked YAML files that haven’t been generated yet |
| --cleanup | Remove orphaned files |
| --regen | Regenerate stale files |
| --dry-run | Preview cleanup or regen without actually modifying files |

Combine filters: lhp state --env dev --orphaned --cleanup --dry-run previews which orphaned files would be deleted.

BP-16.3: Use --force only when necessary

LHP’s ForceGenerationStrategy regenerates everything. Use it only after framework upgrades or preset changes where you want to verify all output. Normal development should rely on smart generation.

See also

CLI Reference for the full lhp state command reference.

17. Bundle Integration (Databricks Asset Bundles)

BP-17.1: Use lhp deps --format job to generate DAB job resource definitions

LHP analyses dependencies and generates pipeline and job resource YAML for Databricks Asset Bundles. Use --bundle-output to specify where bundle files are written.

BP-17.2: Bundle scaffolding is included by default

LHP scaffolds the full DAB structure by default with lhp init, including databricks.yml, resource definitions, and standard folder layout. Use lhp init <name> --no-bundle to skip DAB setup if you manage bundle configuration separately.

BP-17.3: Keep generated bundle resources separate from hand-written ones

LHP generates bundle resources from dependency analysis. Store them in a dedicated directory (e.g., bundle/generated/) so they can be regenerated without conflicting with manually defined resources.

See also

Databricks Asset Bundles Integration for the full bundle integration guide.

18. Architectural Pattern Support

BP-18.1: Medallion architecture — use LHP’s layered approach

| Layer | Write Target | DQE Tier | Metadata | Key Characteristics |
|---|---|---|---|---|
| Bronze | Streaming table | warn only | ingest_timestamp, source_file | Raw ingestion, CloudFiles/Kafka, schema rescue |
| Silver | Materialized view | drop bad rows | updated_at, pipeline_run_id | Validated, deduplicated, schema-enforced |
| Gold | Materialized view | fail on critical | (inherited) | Aggregations, denormalised reporting |

LHP supports all these natively through its action types, write targets, and DQE integration.

BP-18.2: Environment promotion — use substitution files per environment

Same YAML configs, different --env flags. LHP resolves all tokens per environment. Generated code is environment-specific but source configs are environment-agnostic.
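For example, paired substitution files might differ only in their token values (a sketch assuming flat key/value substitution files; paths and values are illustrative):

Per-environment substitution files
```yaml
# substitutions/dev.yaml
SILVER_CATALOG: dev_silver
LANDING_PATH: "abfss://landing@devstorageacct.dfs.core.windows.net"
---
# substitutions/prod.yaml
SILVER_CATALOG: prod_silver
LANDING_PATH: "abfss://landing@prodstorageacct.dfs.core.windows.net"
```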

BP-18.3: Multi-pipeline orchestration — use job_name and lhp deps

LHP’s dependency analysis produces pipeline-level and job-level dependency graphs. Use these to build Databricks Workflow orchestration that respects data dependencies across pipelines.

See also

Dependency Analysis & Job Generation for pipeline dependency analysis and orchestration job generation.

BP-18.4: Multi-source ingestion — use multiple load/write actions targeting the same table

LHP consolidates multiple write actions to the same streaming table into multiple append_flow functions. This supports fan-in patterns (multiple sources -> one table) natively.

19. Documentation & Discoverability

BP-19.1: Use description fields on every action and write target

LHP passes descriptions through to generated code comments and table metadata. Fill these in consistently.

BP-19.2: Use comment on write targets for Unity Catalog table descriptions

These appear in the Data Explorer and are queryable. Make them meaningful: “Silver layer orders — deduped, validated, enriched with customer data.”

BP-19.3: Use YAML comments for “why” decisions

Comments explaining decisions
# Using batch mode because source schema changes frequently and CDC is not supported
readMode: batch

The YAML declares what; comments explain why.

BP-19.4: Use lhp info and lhp stats for project documentation

These commands produce summaries of project structure, pipeline counts, and action distributions. Use them in onboarding documentation.

See also

CLI Reference for the full CLI command reference.

20. Anti-Patterns to Avoid

Warning

The following are common mistakes that undermine the value of using LHP. Each anti-pattern lists the impact and the recommended fix.

| ID | Anti-Pattern | Why It’s Harmful | Fix |
|---|---|---|---|
| AP-1 | Hardcoding catalog/schema names in YAML | Makes environment promotion impossible | Always use substitution tokens |
| AP-2 | Using expect_or_fail at bronze | One bad record stops the entire pipeline | Use warn at bronze; reserve fail for critical tables |
| AP-3 | Skipping lhp validate before lhp generate | Generation errors from invalid config are harder to diagnose | Always validate first |
| AP-4 | Using streaming tables for join-based enrichment | Streaming tables don’t recompute when dimensions change | Use materialized views for any join with updating dimensions |
| AP-5 | Building templates before understanding the pattern | Leads to over-generalised, hard-to-use templates | Write 3+ concrete flowgroups first, then extract |
| AP-6 | Treating preset changes as low-risk | A global preset change affects every pipeline using it | Validate the full project after any preset change |
| AP-7 | Not using operational metadata | Debugging production issues without audit columns is very hard | Use LHP’s operational metadata system consistently |
| AP-8 | Monolithic YAML files | Unreadable, unreviewable, untestable | One pipeline per file |
| AP-9 | Secrets in substitution files | Secrets in version control will be leaked | Use ${secret:scope/key} syntax exclusively |
| AP-10 | Ignoring _rescued_data column | Schema mismatches without rescue silently drop data | Always enable cloudFiles.rescuedDataColumn at bronze |
| AP-11 | Dumping all SQL files in a flat sql/ directory | At 100+ SQL files, finding the right one is painful | Use sql/<system>/<layer>/ subdirectories |
| AP-12 | Using subdirectories for templates or presets | LHP only discovers flat *.yaml in these directories | Use prefix-based naming instead (see Section 2) |
| AP-13 | Generic names without system/layer context | pipeline_1, ingest.yaml, transform.sql are meaningless at scale | Use ID-based naming: erp_brz_raw_orders (see Section 3) |