Enterprise Best Practices¶
A comprehensive guide for data engineers using Lakehouse Plumber (LHP) in enterprise environments. These best practices bring together Databricks Lakeflow Declarative Pipelines conventions, enterprise configuration-framework patterns, and LHP-specific capabilities.
1. Project Structure & Organisation¶
BP-1.1: Organize pipeline YAML files by data domain¶
Group by business domain (orders/, customers/, inventory/) rather than by
action type (loads/, transforms/). LHP discovers flowgroups from the pipelines/
directory and supports subdirectories, so pipelines/orders/bronze_ingest.yaml works
natively.
BP-1.2: Keep each YAML file small and single-purpose¶
Target 50–200 lines. Use LHP’s multi-document (---) or array syntax only for tightly
related flowgroups that share a pipeline. Monolithic files with 15+ flowgroups become
unreadable and unreviewable.
See also
Multi-Flowgroup YAML Files for details on multi-document and array syntax.
BP-1.3: Use include patterns to filter pipeline discovery¶
For large repos, use the include glob patterns in lhp.yaml to control which pipeline
files are processed per environment or team. This enables a mono-repo structure where each
team’s files coexist without interfering.
BP-1.4: Separate presets, templates, and substitutions into dedicated directories¶
Follow the standard LHP project layout. See Section 2 for detailed subdirectory guidance within each top-level directory.
presets/ # Reusable defaults (flat — no subdirectory discovery)
templates/ # Reusable action patterns (flat — use prefix-based grouping)
substitutions/ # Environment-specific tokens (dev.yaml, prod.yaml)
pipelines/ # Flowgroup definitions (supports deep subdirectories)
sql/ # External SQL files (supports deep subdirectories)
schemas/ # External schema files (supports deep subdirectories)
expectations/ # External DQE files (supports deep subdirectories)
python_modules/ # External Python modules (supports deep subdirectories)
2. File Organisation & Subdirectory Structure¶
LHP file types have different subdirectory support. Understanding this is critical for organizing an enterprise project with hundreds of files.
Subdirectory Support Matrix¶
| File Type | Base Directory | Subdirectory Support | Extensions | Notes |
|---|---|---|---|---|
| Pipeline YAMLs | pipelines/ | Full recursive | .yaml ¹ | Discovered recursively; filtered by include patterns in lhp.yaml |
| SQL files (.sql) | project root | Full recursive | .sql | Referenced by relative path from project root |
| Schema files (.ddl, .yaml, .json) | project root | Full recursive | .ddl, .yaml, .json | Referenced by relative path from project root |
| Expectations files (.yaml, .json) | project root | Full recursive | .yaml, .json | Referenced by relative path from project root |
| Python modules (.py) | project root | Full recursive | .py | Referenced by relative path from project root |
| Templates | templates/ | Flat only | .yaml ¹ | Discovery uses glob("*.yaml") |
| Presets | presets/ | Flat only | .yaml ¹ | Discovery uses glob("*.yaml") |
| Substitutions | substitutions/ | Flat only | .yaml ¹ | One file per environment |
¹ The .yml extension is also accepted, but .yaml is recommended for consistency.
BP-2.1: Organize pipeline YAMLs by source system, then by medallion layer¶
LHP recursively discovers all .yaml/.yml files under pipelines/. Use a
two-level hierarchy — source system first, layer second — so that each team owns a clear
subtree:
pipelines/
system_a/ # Source system / data domain
bronze/
system_a_bronze_ingest.yaml # CloudFiles ingestion
silver/
system_a_silver_cleanse.yaml # Validation and enrichment
gold/
system_a_gold_reporting.yaml # Aggregations
system_b/
bronze/
system_b_bronze_ingest.yaml
silver/
system_b_silver_merge.yaml
shared/
gold/
cross_domain_metrics.yaml # Cross-system gold tables
This structure maps cleanly to CODEOWNERS (pipelines/system_a/ owned by Team A) and
to include patterns when you need to generate a subset.
BP-2.2: Organize SQL files mirroring the pipeline structure¶
All sql_path references resolve relative to the project root, so
sql_path: sql/system_a/bronze/cleanse_raw.sql works natively. Mirror the pipeline
directory hierarchy:
sql/
system_a/
bronze/
parse_json_payload.sql
silver/
enrich_orders.sql
validate_customers.sql
gold/
daily_revenue_summary.sql
system_b/
silver/
merge_inventory.sql
shared/
lookups/
currency_conversion.sql
When referencing from YAML:
actions:
- name: transform_enrich_orders
type: transform
transform_type: sql
sql_path: sql/system_a/silver/enrich_orders.sql
source: load_raw_orders
target: enriched_orders_view
BP-2.3: Organize schema files by source system and layer¶
Schema files (DDL, YAML, or JSON) also resolve relative to the project root:
schemas/
system_a/
bronze/
raw_orders_schema.yaml # CloudFiles schema hints
raw_customers_schema.ddl # DDL format
silver/
orders_strict_schema.yaml # Schema transform definitions
system_b/
bronze/
raw_inventory_schema.json # JSON format
When referencing:
actions:
- name: transform_enforce_schema
type: transform
transform_type: schema
schema_file: schemas/system_a/silver/orders_strict_schema.yaml
enforcement: strict
BP-2.4: Organize expectations files by domain and quality tier¶
Store DQE expectation files in a dedicated expectations/ directory, grouped by domain
and quality tier:
expectations/
system_a/
bronze/
raw_orders_warn.yaml # Bronze: warn-only rules
silver/
orders_drop_rules.yaml # Silver: drop invalid rows
orders_quarantine_rules.yaml # Silver: quarantine criteria
gold/
revenue_fail_rules.yaml # Gold: fail on critical invariants
shared/
common_not_null_rules.yaml # Reusable cross-domain rules
When referencing:
actions:
- name: transform_dqe_orders
type: transform
transform_type: data_quality
expectations_file: expectations/system_a/silver/orders_drop_rules.yaml
source: enriched_orders_view
BP-2.5: Organize Python modules by function type¶
For Python-based loads, transforms, and sinks, group modules by their role:
python_modules/
transforms/
system_a/
ml_scoring.py
custom_dedup.py
shared/
phone_normalizer.py
datasources/
erp_connector.py # Custom DataSource V2
sinks/
webhook_sink.py # Custom DataSink
foreachbatch/
notify_downstream.py # ForEachBatch handlers
BP-2.6: Use prefix-based grouping for templates¶
Templates are discovered only at the top level of templates/ — subdirectories are
not discovered by lhp list_templates. Instead, use a structured prefix convention
to categorize templates:
templates/
TMPL001_brz_load_cloudfiles_standard.yaml # Bronze / Load / CloudFiles
TMPL002_brz_load_kafka_events.yaml # Bronze / Load / Kafka
TMPL003_brz_load_delta_snapshot.yaml # Bronze / Load / Delta snapshot
TMPL004_slv_transform_sql_enrichment.yaml # Silver / Transform / SQL
TMPL005_slv_transform_cdc_merge.yaml # Silver / Transform / CDC
TMPL006_slv_write_streaming_table_std.yaml # Silver / Write / Streaming Table
TMPL007_gld_write_materialized_view_agg.yaml # Gold / Write / Materialized View
TMPL008_full_bronze_to_silver_pipeline.yaml # Full pipeline template (multi-action)
The prefix pattern <layer>_<action_type>_<detail> makes templates scannable in
lhp list_templates output and in file explorers. When you have 30+ templates, this
prefix is the primary way to find the right one.
See also
Templates Reference for details on creating and using templates.
BP-2.7: Use prefix-based grouping for presets¶
Like templates, presets are discovered only at the top level of presets/. Use prefixes
to encode scope and layer:
presets/
global_defaults.yaml # Organization-wide
brz_standard.yaml # Bronze layer defaults
brz_cloudfiles_json.yaml # Bronze / CloudFiles / JSON specific
brz_cloudfiles_csv.yaml # Bronze / CloudFiles / CSV specific
slv_standard.yaml # Silver layer defaults
slv_cdc_scd2.yaml # Silver / CDC / SCD Type 2
gld_standard.yaml # Gold layer defaults
ord_custom_overrides.yaml # Orders domain custom
See also
Presets Reference for details on preset inheritance and merging.
BP-2.8: Use include patterns for team-scoped generation¶
When multiple teams share a mono-repo, use include patterns in lhp.yaml to generate
only relevant pipelines. Patterns are matched against paths relative to pipelines/:
# lhp.yaml — generate only system_a pipelines
include:
- "system_a/**/*.yaml"
Or selectively include specific layers:
# Only bronze pipelines across all systems
include:
- "**/bronze/*.yaml"
BP-2.9: Full enterprise project layout example¶
my_lhp_project/
lhp.yaml # Project config
substitutions/
dev.yaml
staging.yaml
prod.yaml
presets/
global_defaults.yaml
brz_standard.yaml
brz_cloudfiles_json.yaml
slv_standard.yaml
slv_cdc_scd2.yaml
gld_standard.yaml
templates/
TMPL001_brz_load_cloudfiles_standard.yaml
TMPL002_slv_transform_sql_enrichment.yaml
TMPL003_gld_write_mv_aggregation.yaml
pipelines/
system_a/
bronze/
system_a_bronze_ingest_TMPL001.yaml
silver/
system_a_silver_cleanse_TMPL002.yaml
gold/
system_a_gold_reporting_TMPL003.yaml
system_b/
bronze/
system_b_bronze_ingest_TMPL001.yaml
silver/
system_b_silver_merge_TMPL002.yaml
sql/
system_a/
silver/
enrich_orders.sql
gold/
daily_revenue.sql
system_b/
silver/
merge_inventory.sql
schemas/
system_a/
bronze/
raw_orders_schema.yaml
silver/
orders_strict_schema.yaml
system_b/
bronze/
raw_inventory_schema.yaml
expectations/
system_a/
bronze/
raw_orders_warn.yaml
silver/
orders_drop_rules.yaml
shared/
common_not_null_rules.yaml
python_modules/
transforms/
system_a/
ml_scoring.py
datasources/
erp_connector.py
generated/ # Output (per environment)
dev/
system_a_bronze_pipeline/
raw_orders.py
system_a_silver_pipeline/
orders_cleanse.py
3. Naming Conventions¶
BP-3.1: Use snake_case consistently across all identifiers¶
Pipelines, flowgroups, action names, templates, presets, variables, table names — all
snake_case. LHP generates Python function names from action names, so this ensures
valid Python identifiers.
BP-3.2: Prefix pipeline names with the source system and layer¶
erp_bronze_pipeline, crm_silver_pipeline — not bronze_pipeline or
pipeline_v2. At 200+ pipelines, generic names become meaningless. LHP uses the
pipeline field in flowgroups to group actions into output files.
See BP-3.9 for the full enterprise naming pattern.
BP-3.3: Name flowgroups to describe the data flow¶
erp_brz_raw_orders, erp_slv_orders_enriched — not cloudfiles_load_1 or
flowgroup_v2. The flowgroup name appears in generated file names and log output.
Embed the source system and layer for visibility. See BP-3.8 for the
full enterprise naming pattern.
BP-3.4: Name actions descriptively with the pattern <verb>_<entity>_<modifier>¶
load_raw_orders, transform_validate_orders, write_orders_silver,
test_orders_row_count. Action names become Python function names in generated code,
so clarity matters.
BP-3.5: Use SCREAMING_SNAKE_CASE for environment tokens¶
Environment tokens (${SOURCE_CATALOG}, ${LANDING_PATH}) are resolved from
substitution files. Local variables (%{table_name}, %{source_schema}) are
flowgroup-scoped. The case distinction makes it immediately clear which resolution
mechanism applies.
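The distinction looks like this inside a flowgroup (a minimal sketch; table and token names are illustrative):
variables:
  table_name: orders                      # local, flowgroup-scoped: %{...}
actions:
  - name: load_%{table_name}
    source:
      type: delta
      catalog: "${SOURCE_CATALOG}"        # environment token from substitutions/<env>.yaml
      database: "raw"
      table: "%{table_name}"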
See also
Substitutions & Secrets for the full substitution processing order and syntax.
BP-3.6: Never abbreviate in identifiers¶
customer_silver_merge not cust_slvr_mrg. Config files live in version control
forever; clarity beats brevity.
Structured Naming for Enterprise Visibility¶
At enterprise scale (100+ templates, 500+ flowgroups), flat alphabetical lists become
unmanageable. Templates use a TMPLxxx_ ID prefix to embed a unique sequence
number, making them instantly scannable and sortable. Flowgroup config files reference
the template ID as a _TMPLxxx suffix, creating a visible link between a config and
its template. All other artifacts — pipelines, presets, SQL files, schemas, and
expectations — use descriptive prefixes and directory structure for organisation.
BP-3.7: Use TMPLxxx ID prefixes for templates¶
Since templates live in a flat directory (see Section 2), the filename is
the only organisational mechanism. Use a TMPLxxx_ prefix with a sequential number,
followed by a structured name that encodes layer and action type:
Pattern: TMPLxxx_<layer>_<action_type>_<source_or_target_type>_<descriptive_name>
Examples:
TMPL001_brz_load_cloudfiles_standard # Bronze / Load / CloudFiles / standard pattern
TMPL002_brz_load_cloudfiles_with_schema # Bronze / Load / CloudFiles / with schema hints
TMPL003_brz_load_kafka_events # Bronze / Load / Kafka / event stream
TMPL004_slv_transform_sql_enrichment # Silver / Transform / SQL / enrichment pattern
TMPL005_slv_transform_cdc_merge # Silver / Transform / CDC / merge pattern
TMPL006_slv_write_st_with_dqe # Silver / Write / Streaming Table / with DQE
TMPL007_gld_write_mv_aggregation # Gold / Write / Materialized View / aggregation
TMPL008_e2e_full_bronze_to_silver # End-to-end / multi-action pipeline template
Layer prefixes: brz_ (bronze), slv_ (silver), gld_ (gold), e2e_
(end-to-end multi-action).
The TMPLxxx prefix sorts templates by creation order in lhp list_templates
output, while the layer prefix groups them logically. The ID also appears as a suffix
in flowgroup config filenames (see BP-3.8), creating a visible link
between configs and their templates.
BP-3.8: Use descriptive flowgroup names with a _TMPLxxx config file suffix¶
Flowgroup names become Python file names and function names in generated code. Embed the source system and layer for visibility across large projects:
Pattern: <system>_<layer>_<descriptive_name>
Examples:
erp_brz_raw_orders # ERP system / Bronze / raw orders
erp_brz_raw_customers # ERP system / Bronze / raw customers
erp_slv_orders_enriched # ERP system / Silver / enriched orders
erp_slv_customers_merged # ERP system / Silver / merged customers
erp_gld_daily_revenue # ERP system / Gold / daily revenue
crm_brz_raw_contacts # CRM system / Bronze / raw contacts
crm_slv_contacts_deduped # CRM system / Silver / deduped contacts
When naming the flowgroup config file, append the template ID as a suffix so the template relationship is visible at a glance without opening the file:
Pattern: <system>_<layer>_<description>_<TMPLxxx>.yaml
Examples:
erp_bronze_ingest_TMPL001.yaml # Uses TMPL001 (CloudFiles standard)
erp_silver_cleanse_TMPL004.yaml # Uses TMPL004 (SQL enrichment)
erp_gold_reporting_TMPL007.yaml # Uses TMPL007 (MV aggregation)
crm_bronze_contacts_TMPL001.yaml # Uses TMPL001 (CloudFiles standard)
This naming ensures that when you see a generated file erp_brz_raw_orders.py or a DLT
log entry for erp_slv_orders_enriched, you immediately know the source system and layer
without looking up the config. The _TMPLxxx suffix in the config filename lets you
identify the template at the file system level — useful when browsing directories, reviewing
PRs, or triaging issues.
BP-3.9: Use structured prefixes for pipeline names¶
Pipeline names determine the output directory structure under generated/{env}/ and
appear in Databricks UI. Use <system>_<layer>_pipeline for clear identification:
Pattern: <system>_<layer>_pipeline
Examples:
erp_bronze_pipeline # All ERP bronze ingestion
erp_silver_pipeline # All ERP silver transforms
erp_gold_pipeline # All ERP gold aggregations
crm_bronze_pipeline # All CRM bronze ingestion
shared_gold_pipeline # Cross-system gold tables
This gives you clean, predictable output directories:
generated/dev/
erp_bronze_pipeline/
erp_brz_raw_orders.py
erp_brz_raw_customers.py
erp_silver_pipeline/
erp_slv_orders_enriched.py
crm_bronze_pipeline/
crm_brz_raw_contacts.py
BP-3.10: Use consistent prefixes for presets¶
Since presets are also flat (no subdirectory discovery), the naming prefix is essential for organisation:
Pattern: <scope>_<layer>_<purpose>
Examples:
global_defaults # Organisation-wide standards
brz_standard # Bronze layer standard preset
brz_cloudfiles_json # Bronze / CloudFiles / JSON format
brz_cloudfiles_csv # Bronze / CloudFiles / CSV format
brz_kafka_events # Bronze / Kafka event preset
slv_standard # Silver layer standard preset
slv_cdc_scd2 # Silver / CDC / SCD Type 2
gld_standard # Gold layer standard preset
erp_custom # ERP domain custom overrides
Quick Reference Table¶
| Artifact | Convention | Example |
|---|---|---|
| Pipeline names | <system>_<layer>_pipeline | erp_bronze_pipeline |
| Flowgroup names | <system>_<layer>_<descriptive_name> | erp_brz_raw_orders |
| Action names | <verb>_<entity>_<modifier> | load_raw_orders |
| Config files | <system>_<layer>_<description>_<TMPLxxx>.yaml | erp_bronze_ingest_TMPL001.yaml |
| Template files | TMPLxxx_<layer>_<action_type>_<detail>.yaml | TMPL001_brz_load_cloudfiles_standard.yaml |
| Preset files | <scope>_<layer>_<purpose>.yaml | brz_cloudfiles_json.yaml |
| SQL files | sql/<system>/<layer>/<transform_name>.sql | sql/system_a/silver/enrich_orders.sql |
| Schema files | schemas/<system>/<layer>/<name>.<ext> | schemas/system_a/silver/orders_strict_schema.yaml |
| Expectations files | expectations/<system>/<layer>/<name>.yaml | expectations/system_a/silver/orders_drop_rules.yaml |
| Generated files | generated/<env>/<pipeline>/<flowgroup>.py | generated/dev/erp_bronze_pipeline/erp_brz_raw_orders.py |
| Env tokens | SCREAMING_SNAKE_CASE, ${TOKEN} | ${SOURCE_CATALOG} |
| Local variables | snake_case, %{var} | %{table_name} |
| Template params | snake_case, {{ param }} | {{ entity }} |
4. Template Design¶
BP-4.2: Keep template parameters minimal and well-documented¶
Every parameter should have a description and either be required: true or have a
sensible default. LHP validates required parameters at generation time and reports clear
errors for missing ones. Avoid templates with 15+ parameters — they add complexity without
reducing it.
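For illustration, a well-documented parameter block might look like the sketch below; the default field is an assumption here, since the examples elsewhere in this guide show only name, required, and description:
parameters:
  - name: source_path
    required: true
    description: "Landing path for the raw files"
  - name: file_format
    default: json                         # assumed field: provides a sensible default
    description: "CloudFiles format (json, csv, parquet)"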
BP-4.3: Establish “golden templates” for each common pipeline pattern¶
Maintain platform-team-owned templates for standard patterns, using the ID-based naming from Section 3:
TMPL001_brz_load_cloudfiles_standard — standard CloudFiles ingestion with operational metadata
TMPL002_brz_load_delta_snapshot — Delta table reads with standard options
TMPL003_slv_write_st_with_dqe — streaming table with DQE expectations
TMPL004_slv_transform_sql_enrichment — SQL-based silver enrichment
TMPL005_gld_write_mv_aggregation — materialized view for gold aggregations
These golden templates embed organisational standards (default expectations, metadata columns, table properties) so domain teams can’t accidentally skip them.
BP-4.4: Templates live in a flat directory — organise by naming convention¶
LHP discovers templates only from the top level of templates/ (using
glob("*.yaml"), not recursive). Subdirectories under templates/ are not
discovered by lhp list_templates. Instead, use the structured prefix convention from
BP-3.7 to group templates logically.
Note
Subdirectories under templates/ are not discovered. Referencing templates via
subfolder paths (e.g., use_template: "subfolder/name") is not supported. Stick to
the flat directory with prefix-based naming.
BP-4.5: Templates can reference presets — use this to layer defaults¶
A template can declare presets: [brz_standard] to inherit default options. Flowgroups
using the template can add additional presets that override. This creates a clean defaults
hierarchy: template presets -> flowgroup presets -> explicit action config.
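A minimal sketch of this layering (template and preset names are illustrative):
# Template declares baseline presets
name: slv_transform_sql_enrichment
presets: [slv_standard]

# Flowgroup using the template adds an overriding preset
pipeline: erp_silver_pipeline
flowgroup: erp_slv_orders_enriched
use_template: slv_transform_sql_enrichment
presets: [slv_standard, erp_custom]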
BP-4.6: Use template parameters for what varies; presets for what is standard¶
Template parameters should capture the unique aspects of each use case (source path, target table, specific columns). Standard aspects (table properties, operational metadata, reader options) belong in presets. This keeps template usage concise.
BP-4.7: Reference external files from templates using parameterised paths¶
Templates can reference external files via sql_path, schema_file, or
expectations_file. Use template parameters for the variable part of the path, combined
with a fixed subdirectory convention:
# Template: slv_transform_sql_enrichment.yaml
name: slv_transform_sql_enrichment
parameters:
- name: system
required: true
description: "Source system name (used in file paths)"
- name: entity
required: true
description: "Entity name"
actions:
- name: transform_enrich_{{ entity }}
type: transform
transform_type: sql
sql_path: "sql/{{ system }}/silver/enrich_{{ entity }}.sql"
source: "load_raw_{{ entity }}"
target: "enriched_{{ entity }}_view"
This way, the directory structure convention (sql/<system>/silver/) is baked into
the template, ensuring all teams follow the same file organisation.
See also
Templates Reference for the full template specification and Dynamic Templates Guide for conditionals, loops, and advanced Jinja2 features.
5. Preset Strategy¶
BP-5.1: Design a preset hierarchy — global, domain, pipeline-specific¶
LHP supports preset inheritance via extends and preset chaining (multiple presets in a
list, merged left-to-right). Use this to build layers:
global_defaults — organisation-wide standards (table properties, metadata)
bronze_standard extends global_defaults — bronze-layer conventions
orders_bronze extends bronze_standard — domain-specific overrides
BP-5.2: Encode organisational standards in presets, not just values¶
A high-value preset sets multiple related properties together:
name: bronze_standard
extends: global_defaults
defaults:
load_actions:
cloudfiles:
options:
cloudFiles.schemaEvolutionMode: rescue
cloudFiles.rescuedDataColumn: _rescued_data
cloudFiles.maxFilesPerTrigger: 1000
write_actions:
streaming_table:
table_properties:
pipelines.reset.allowed: "false"
operational_metadata:
- ingest_timestamp
- source_file
BP-5.3: Limit the total number of presets¶
Having more than 15–20 distinct presets leads to confusion and misuse. Consolidate overlapping
presets. The lhp list_presets command helps audit the current set.
BP-5.4: Use lhp show to verify effective configuration¶
After preset merging, template expansion, and substitution, the effective config can differ
from what the YAML file suggests. Always verify with lhp show <flowgroup> --env <env>
before deploying changes to shared presets. This is LHP’s equivalent of “fully resolved
config.”
BP-5.5: Treat preset changes as high-blast-radius events¶
A change to a global preset affects every pipeline using it. Version presets (add a version
field), document changes, and run lhp validate --env <env> across the entire project
before merging preset changes.
See also
Presets Reference for complete details on preset inheritance and merging.
6. Substitution & Environment Management¶
BP-6.1: Use directory-based environment separation¶
Maintain substitutions/dev.yaml, substitutions/staging.yaml,
substitutions/prod.yaml. All environments are visible on the same branch. LHP resolves
${token} patterns from these files.
BP-6.2: Put all environment-varying values in substitution tokens¶
Catalog names, schema names, storage paths, cluster policies, alert emails — all should be tokens. LHP supports recursive token expansion (tokens referencing other tokens, up to 10 iterations), so you can compose:
global:
catalog_prefix: main
dev:
catalog: "${catalog_prefix}_dev"
prod:
catalog: "${catalog_prefix}_prod"
BP-6.4: Never put secret values in substitution files¶
Use LHP’s ${secret:scope/key} syntax. LHP converts these to
dbutils.secrets.get(scope="scope", key="key") calls in generated code. Configure
secrets.default_scope and scopes aliases in the substitution file for clean
references.
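A hedged sketch of both pieces; the exact layout of the secrets block in the substitution file is an assumption based on the description above, and the scope and key names are illustrative:
# substitutions/prod.yaml (secrets block layout assumed)
secrets:
  default_scope: lakehouse_prod

# In a flowgroup, reference the secret instead of a literal value
options:
  cloudFiles.connectionString: "${secret:storage_scope/landing_connection_string}"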
Important
Secrets in substitution files will be committed to version control and leaked. Always
use the ${secret:scope/key} syntax exclusively.
BP-6.5: Use lhp substitutions to audit available tokens¶
Before writing flowgroups, run lhp substitutions --env <env> to check what tokens are
available. This prevents unresolved token errors at generation time.
BP-6.6: Design substitution tokens for the medallion pattern¶
Standard token set for a medallion project:
global:
bronze_catalog: "${catalog_prefix}_bronze"
silver_catalog: "${catalog_prefix}_silver"
gold_catalog: "${catalog_prefix}_gold"
landing_path_base: "abfss://landing@${storage_account}.dfs.core.windows.net"
See also
Substitutions & Secrets for the full substitution processing order and syntax.
7. Local Variables¶
BP-7.1: Use local variables for flowgroup-scoped repetition¶
When the same value (table name, schema, path segment) appears multiple times within a
single flowgroup, define it as a local variable rather than repeating it. LHP resolves
%{var} first, before template expansion.
BP-7.2: Prefer local variables over hardcoded values¶
variables:
entity: orders
source_schema: raw
actions:
- name: load_%{entity}
source:
table: "${BRONZE_CATALOG}.%{source_schema}.%{entity}"
BP-7.3: Do not use local variables for environment-specific values¶
%{var} is scoped to a single flowgroup and resolved at parse time. Environment-specific
values belong in substitution tokens (${TOKEN}) which are resolved per environment.
See also
Substitutions & Secrets for details on local variables and environment tokens.
8. FlowGroup Design¶
BP-8.1: Use array syntax with field inheritance for multi-flowgroup pipelines¶
When multiple flowgroups share the same pipeline, presets, or template, use LHP’s array syntax to inherit:
pipeline: orders_bronze
presets: [bronze_standard]
operational_metadata: true
flowgroups:
- flowgroup: raw_orders
actions: [...]
- flowgroup: raw_returns
actions: [...]
Inherited fields: pipeline, use_template, presets, operational_metadata,
job_name.
See also
Multi-Flowgroup YAML Files for the full multi-flowgroup reference.
BP-8.2: Scope one pipeline per data domain¶
Pipeline orders_bronze contains flowgroups raw_orders, raw_returns,
raw_refunds. Each flowgroup generates its own Python function set but runs in the same
DLT pipeline, enabling dependency resolution across them.
BP-8.3: Use job_name to group flowgroups into Databricks jobs¶
LHP’s lhp deps --format job generates job resource definitions. Use job_name to
control which flowgroups are orchestrated together in a Databricks Workflow.
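For example, flowgroups that should be orchestrated together simply share a job_name (names illustrative):
pipeline: erp_bronze_pipeline
flowgroup: erp_brz_raw_orders
job_name: erp_bronze_ingest_job
actions: [...]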
See also
Concepts & Architecture for details on job_name and multi-job orchestration.
BP-8.4: Order actions as Load, Transform, Write, Test¶
This matches the data flow direction and makes YAML files scannable. LHP resolves dependencies automatically, but consistent ordering improves readability.
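A skeleton flowgroup in this order (action bodies elided, names illustrative):
actions:
  - name: load_raw_orders             # 1. Load
    ...
  - name: transform_validate_orders   # 2. Transform
    ...
  - name: write_orders_silver         # 3. Write
    ...
  - name: test_orders_row_count       # 4. Test
    ...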
9. Load Actions¶
BP-9.1: Always set schemaEvolutionMode and rescuedDataColumn for CloudFiles¶
LHP’s CloudFiles generator supports all Auto Loader options. In production, always use:
source:
type: cloudfiles
path: "${LANDING_PATH}/orders/"
format: json
options:
cloudFiles.schemaEvolutionMode: rescue
cloudFiles.rescuedDataColumn: _rescued_data
Tip
Put these options in a bronze_standard preset so they apply everywhere without
repetition.
BP-9.2: Use readMode: stream for bronze, readMode: batch for lookups¶
LHP’s readMode field controls whether spark.readStream or spark.read is
generated. Bronze sources should stream; dimension/lookup tables should batch-read.
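A sketch of both modes; placing readMode alongside the other source fields is an assumption, and table names are illustrative:
# Bronze fact source: streaming read
source:
  type: delta
  catalog: "${BRONZE_CATALOG}"
  database: "raw"
  table: "orders"
  readMode: stream

# Dimension lookup: batch read
source:
  type: delta
  catalog: "${SILVER_CATALOG}"
  database: "reference"
  table: "currency_rates"
  readMode: batch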
BP-9.3: Use full three-part names via substitution tokens for Delta loads¶
source:
type: delta
catalog: "${SILVER_CATALOG}"
database: "orders"
table: "validated_orders"
LHP constructs catalog.database.table references. Never hardcode catalog or database
names.
BP-9.4: Rate-limit Auto Loader in production¶
Use cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigger options (via
presets) to prevent bronze ingestion from overwhelming downstream tables. Set this in your
bronze_standard preset.
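In a bronze_standard preset this might look like the following (the limits are illustrative):
name: bronze_standard
defaults:
  load_actions:
    cloudfiles:
      options:
        cloudFiles.maxFilesPerTrigger: 1000
        cloudFiles.maxBytesPerTrigger: "10g"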
BP-9.5: Use schema_hints for critical columns¶
LHP supports cloudFiles.schemaHints option strings. For columns where wrong type
inference would cause downstream failures (amounts, IDs, timestamps), provide explicit
hints.
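For example (column names and types illustrative):
source:
  type: cloudfiles
  path: "${LANDING_PATH}/orders/"
  format: json
  options:
    cloudFiles.schemaEvolutionMode: rescue
    cloudFiles.schemaHints: "order_id BIGINT, amount DECIMAL(18,2), order_ts TIMESTAMP"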
See also
Load Actions for the full load action specification.
10. Transform Actions¶
BP-10.1: Default to SQL transforms for silver/gold layer logic¶
LHP’s SQL transform generator supports inline SQL or external SQL files via sql_path.
SQL is more readable, more widely understood, and easier to review than Python transforms
for standard operations. Use external SQL files for anything over ~5 lines.
BP-10.2: Use external SQL files for complex transformations¶
LHP resolves sql_path relative to the project root. Store SQL in
sql/<system>/<layer>/<transform_name>.sql (see Section 2). This keeps
YAML files concise and enables SQL-specific linting.
BP-10.3: Use Python transforms only when SQL cannot express the logic¶
LHP’s Python transform generator copies external modules and calls your function. The signature depends on the number of sources:
Single source: function(df, spark, parameters) — receives the source DataFrame directly
Multiple sources: function(dataframes, spark, parameters) — receives a list of DataFrames
No sources: function(spark, parameters) — the function generates data from scratch
Reserve Python transforms for UDFs, ML scoring, or complex procedural logic.
BP-10.4: Use schema transforms for explicit column control¶
LHP’s schema transform type supports column renaming (arrow syntax:
old_name -> new_name), type casting, and strict/permissive enforcement. Use
enforcement: strict at silver to reject unexpected columns from bronze.
BP-10.5: Use data_quality transforms for DQE expectations¶
LHP’s data_quality transform type reads expectations from YAML/JSON files or inline
definitions, generating the appropriate @dp.expect_all(),
@dp.expect_all_or_drop(), or @dp.expect_all_or_fail() decorators.
BP-10.6: Use temp_table transforms for intermediate calculations¶
LHP generates @dp.table(temporary=True) for temp tables. Use these for intermediate
steps that should not be published to Unity Catalog.
See also
Transform Actions for the full transform action specification.
11. Write Actions¶
BP-11.1: Default to materialized views for silver/gold layers¶
LHP’s materialized_view write target generates @dp.materialized_view(). Materialized
views always produce correct results — they reprocess when source data changes. Use them for
all joins, aggregations, and enrichment.
BP-11.2: Use streaming tables for bronze ingestion and CDC targets¶
LHP’s streaming_table write target generates dp.create_streaming_table() +
@dp.append_flow(). Streaming tables are optimal for append-only ingestion.
Important
Joins in streaming tables do not recompute when dimensions change — use materialized views for enrichment.
BP-11.3: Set pipelines.reset.allowed: "false" on history tables¶
LHP supports table_properties in write targets. This prevents accidental full refresh
from destroying historical data:
write_target:
type: streaming_table
table_properties:
pipelines.reset.allowed: "false"
Tip
Put this in your silver_standard and gold_standard presets.
BP-11.4: Use cluster_columns (liquid clustering) instead of partition_columns¶
LHP supports both, but liquid clustering is the modern recommendation. It’s incremental, allows redefining keys without rewriting data, and works well with high-cardinality columns:
write_target:
type: streaming_table
cluster_columns: [customer_id, order_date]
BP-11.5: Use comment on every write target¶
LHP passes the comment field to the generated table/view definition. This appears in
Unity Catalog UI and is queryable.
BP-11.6: Use spark_conf for per-table performance tuning¶
LHP supports spark_conf on write targets. Use it for adaptive shuffle or per-table
optimisations rather than global pipeline settings.
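A short sketch (the chosen conf and value are illustrative):
write_target:
  type: materialized_view
  spark_conf:
    spark.sql.shuffle.partitions: "64"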
BP-11.7: For CDC, use the cdc mode with explicit cdc_config¶
LHP generates dp.create_auto_cdc_flow() with full support for keys,
sequence_by (including STRUCT for tie-breaking), scd_type (1 or 2),
apply_as_deletes, ignore_null_updates, track_history_column_list, and
track_history_except_column_list options. Always specify sequence_by explicitly.
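A sketch assembled from the fields listed above; nesting cdc_config under streaming_table_config with mode: "cdc" mirrors the snapshot_cdc example in BP-11.10 and is an assumption for this mode:
write_target:
  type: streaming_table
  streaming_table_config:
    mode: "cdc"                       # assumed by analogy with snapshot_cdc
    cdc_config:
      keys: [order_id]
      sequence_by: _commit_timestamp
      scd_type: 2
      apply_as_deletes: "operation = 'DELETE'"
      ignore_null_updates: true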
BP-11.8: Use once: true for backfill flows¶
LHP supports the once flag on individual actions, generating one-time flows for
historical data backfill without affecting the ongoing streaming ingestion.
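An abridged sketch of a backfill action carrying the flag (names illustrative):
- name: load_orders_history_backfill
  once: true                          # one-time flow for historical backfill
  source:
    type: delta
    catalog: "${BRONZE_CATALOG}"
    database: "archive"
    table: "orders_history"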
BP-11.9: Multiple write actions targeting the same table are automatically grouped¶
LHP consolidates multiple sources writing to the same streaming table into one
create_streaming_table with multiple append_flow functions. Use this for
multi-source ingestion patterns.
BP-11.10: Use snapshot_cdc mode for full-snapshot change data capture¶
LHP also supports mode: "snapshot_cdc" on streaming tables, generating
dp.create_auto_cdc_from_snapshot_flow(). Use this when your source provides full
snapshots (not a change feed) and you want LHP to detect changes automatically.
Configuration uses snapshot_cdc_config (not cdc_config):
write_target:
type: streaming_table
streaming_table_config:
mode: "snapshot_cdc"
snapshot_cdc_config:
source_function:
file: "functions/my_snapshots.py"
function: "my_snapshot_function"
keys: [id]
stored_as_scd_type: 2
Key differences from cdc mode:
Config key is snapshot_cdc_config (not cdc_config)
SCD type field is stored_as_scd_type (not scd_type)
Requires a source_function with file and function fields
Does not use sequence_by — ordering is implicit from snapshot timing
BP-11.11: Use sink write targets for streaming to external destinations¶
LHP supports a sink write target type for writing to external systems. Four sink
subtypes are available:
delta — write to external Delta tables outside Unity Catalog (e.g., cross-workspace or external storage)
kafka — write to Kafka or Azure Event Hubs for event-driven architectures
custom — use a custom DataSink V2 class via the custom_sink_class config field
foreachbatch — ForEachBatch handlers for custom per-batch processing (API calls, notifications, etc.)
write_target:
type: sink
sink_type: kafka
sink_config:
kafka.bootstrap.servers: "${KAFKA_BROKERS}"
topic: "enriched_orders"
Use sinks when data must leave the lakehouse — for downstream consumers, event buses, or external APIs. Pair with streaming tables for the primary lakehouse copy.
See also
Write Actions for the full write action specification.
12. Data Quality (Expectations)¶
BP-12.1: Tier expectations by medallion layer¶
Bronze: warn only — never drop or fail at bronze. Every raw record is precious.
Silver: drop for structural quality rules. Route violations to a quarantine table.
Gold/Critical: fail for reference table integrity and business-critical invariants.
LHP’s DQE parser supports failureAction: fail|drop|warn in expectation files and
generates the appropriate decorators.
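An illustrative expectations file using failureAction; the surrounding file layout (the expectations list and constraint key) is an assumption, so check the DQE reference for the authoritative format:
# expectations/system_a/silver/orders_drop_rules.yaml (illustrative layout)
expectations:
  - name: valid_order_id_not_null
    constraint: "order_id IS NOT NULL"
    failureAction: drop
  - name: valid_amount_positive
    constraint: "amount > 0"
    failureAction: drop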
See also
For configuring quarantine mode in LHP, see Quarantine (Dead Letter Queue).
BP-12.2: Centralise expectation definitions in external DQE files¶
LHP supports expectations_file pointing to YAML/JSON files. Store these in
expectations/<domain>/ and reference them from multiple actions. This enables reuse and
independent review of quality rules.
BP-12.3: Name expectations descriptively¶
Convention: valid_<column>_<constraint_type> (e.g., valid_order_id_not_null,
valid_amount_positive). These names appear in the DLT Data Quality tab and event log.
BP-12.5: Use test actions for cross-table validation¶
LHP’s 9 test action types (row_count, uniqueness, referential_integrity,
completeness, range, schema_match, all_lookups_found, custom_sql,
custom_expectations) generate SQL-based validation views. Use the --include-tests flag
to generate them. Always run these in staging before production deployment.
To publish test results to external systems like Azure DevOps or a Delta audit table, see Test Result Reporting (Publishing).
See also
Test Actions (Data Quality Unit Tests) for the full test action specification.
13. Operational Metadata¶
BP-13.1: Define operational metadata columns in lhp.yaml¶
LHP supports project-level operational_metadata with column definitions, presets, and
defaults. Define standard columns once:
operational_metadata:
columns:
ingest_timestamp:
expression: "F.current_timestamp()"
description: "When the record was ingested"
applies_to: [streaming_table, materialized_view]
source_file:
expression: "F.input_file_name()"
description: "Source file path"
applies_to: [streaming_table]
enabled: true
pipeline_id:
expression: "F.lit(spark.conf.get('pipelines.id'))"
description: "Pipeline identifier"
additional_imports:
- "from pyspark.sql import functions as F"
Each column config supports these fields:
expression (required) — PySpark expression string
description — human-readable description
applies_to — list of target types (default: [streaming_table, materialized_view])
enabled — boolean to enable/disable the column (default: true)
additional_imports — list of extra Python import statements needed by the expression
BP-13.2: Create metadata presets for different layers¶
LHP supports operational_metadata.presets for named groups in lhp.yaml:
operational_metadata:
presets:
bronze_standard: [ingest_timestamp, source_file, pipeline_id]
silver_standard: [updated_at, pipeline_run_id]
Note
Metadata presets are defined at the project level for documentation and organisational
purposes. At the flowgroup or action level, operational_metadata accepts either
true (to enable all columns) or an explicit list of column name strings — not preset
names. Reference the preset definitions as a guide when writing the column name lists in
your flowgroups.
BP-13.3: Metadata is additive across preset, flowgroup, and action levels¶
LHP deep-merges operational metadata with deduplication. This means you can set a baseline in a preset and add columns at the flowgroup or action level without losing the preset columns.
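For example, a preset can set the baseline while a flowgroup adds a column without losing it (column names taken from the lhp.yaml example in BP-13.1):
# In a preset: baseline metadata columns
operational_metadata:
  - ingest_timestamp
  - source_file

# In a flowgroup: adds pipeline_id; the preset columns survive the merge
operational_metadata:
  - pipeline_id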
BP-13.4: Use applies_to to control which target types get each column¶
input_file_name() is only valid in streaming/batch reads — set
applies_to: [streaming_table]. current_timestamp() works everywhere — set
applies_to: [streaming_table, materialized_view].
See also
Operational Metadata for the full operational metadata reference.
14. Schema Management¶
BP-14.1: Use schema files for bronze layer schema definition¶
LHP’s schema_file field in load actions points to external DDL, YAML, or JSON schema
files. This makes schema definitions reviewable independently of pipeline config.
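A sketch of a CloudFiles load referencing an external schema file; placing schema_file directly inside the source block is an assumption:
source:
  type: cloudfiles
  path: "${LANDING_PATH}/orders/"
  format: json
  schema_file: schemas/system_a/bronze/raw_orders_schema.yaml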
BP-14.2: Use schema transforms at the bronze-to-silver boundary¶
LHP’s schema transform type provides explicit column control:
Arrow syntax for renaming: old_col -> new_col
Type casting: amount: decimal(18,2)
Strict enforcement to reject unexpected columns
BP-14.3: Use enforcement: strict at silver to prevent schema drift¶
LHP’s schema transform with enforcement: strict generates code that only keeps declared
columns. Combined with silver-layer DQE expectations, this creates a clean schema contract
between bronze and silver.
15. Validation & CI Integration¶
BP-15.1: Run lhp validate as a blocking CI check on every PR¶
LHP’s validation stack catches: missing required fields, unknown fields (with fuzzy-match suggestions), circular dependencies, invalid references, template parameter mismatches, and type-specific validation for all 7 load types, 5 transform types, and all write target types.
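A minimal CI sketch using GitHub Actions as an example; the install step and package name are assumptions, so adapt them to how your organisation distributes LHP:
# .github/workflows/lhp-validate.yml (illustrative)
name: lhp-validate
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install lakehouse-plumber      # assumed package name
      - run: lhp validate --env dev             # blocking validation check
      - run: lhp generate --env dev --dry-run   # catch generation errors early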
BP-15.2: Run lhp generate --dry-run to verify code generation¶
Dry-run generates code without writing files. Use this in CI to catch generation errors early.
BP-15.3: Maintain dry-run baselines for regression detection¶
Commit expected generated output to the repo. In CI, run lhp generate --dry-run and
diff against baselines. Unexpected changes (especially from preset modifications) are
flagged for review. This is the config-equivalent of snapshot testing.
BP-15.4: Layer your CI validation pipeline¶
| Layer | What it checks | Tool |
|---|---|---|
| Syntax | Valid YAML, correct indentation | YAML linter |
| Schema | Required fields, correct types | JSON Schema (LHP-provided schemas) |
| Semantic | References resolve, no circular deps | lhp validate |
| Generation | Config generates valid Python | lhp generate --dry-run |
| Regression | No unintended diff in output | Baseline comparison |
| Functional | Test actions pass | Test actions (generated with --include-tests) |
See also
CI/CD Reference for comprehensive CI/CD patterns and deployment strategies.
16. State Management & Incremental Generation¶
BP-16.1: Do not commit .lhp_state.json to version control¶
LHP’s state tracking enables smart regeneration — only files whose source YAML,
dependencies, or generation context changed are regenerated. This significantly speeds up
lhp generate for large projects, but the state file itself must not be committed to version control.
BP-16.2: Use lhp state to audit orphaned and stale files¶
After refactoring (renaming flowgroups, deleting pipelines), use the available flags to audit and manage state:
| Flag | Purpose |
|---|---|
| --orphaned | Show generated files with no corresponding source YAML |
| (see CLI Reference) | Show files where the source YAML has changed since last generation |
| (see CLI Reference) | Show new/untracked YAML files that haven’t been generated yet |
| --cleanup | Remove orphaned files |
| (see CLI Reference) | Regenerate stale files |
| --dry-run | Preview cleanup or regen without actually modifying files |
Combine filters: lhp state --env dev --orphaned --cleanup --dry-run previews which
orphaned files would be deleted.
BP-16.3: Use --force only when necessary¶
LHP’s ForceGenerationStrategy regenerates everything. Use it only after framework
upgrades or preset changes where you want to verify all output. Normal development should
rely on smart generation.
See also
CLI Reference for the full lhp state command reference.
17. Bundle Integration (Databricks Asset Bundles)¶
BP-17.1: Use lhp deps --format job to generate DAB job resource definitions¶
LHP analyses dependencies and generates pipeline and job resource YAML for Databricks
Asset Bundles. Use --bundle-output to specify where bundle files are written.
BP-17.2: Bundle scaffolding is included by default¶
LHP scaffolds the full DAB structure by default with lhp init, including
databricks.yml, resource definitions, and standard folder layout. Use
lhp init <name> --no-bundle to skip DAB setup if you manage bundle configuration
separately.
BP-17.3: Keep generated bundle resources separate from hand-written ones¶
LHP generates bundle resources from dependency analysis. Store them in a dedicated
directory (e.g., bundle/generated/) so they can be regenerated without conflicting
with manually defined resources.
See also
Databricks Asset Bundles Integration for the full bundle integration guide.
18. Architectural Pattern Support¶
BP-18.1: Medallion architecture — use LHP’s layered approach¶
| Layer | Write Target | DQE Tier | Metadata | Key Characteristics |
|---|---|---|---|---|
| Bronze | Streaming table | warn | ingest_timestamp, source_file | Raw ingestion, CloudFiles/Kafka, schema rescue |
| Silver | Materialized view | drop | updated_at, pipeline_run_id | Validated, deduplicated, schema-enforced |
| Gold | Materialized view | fail | (inherited) | Aggregations, denormalised reporting |
LHP supports all these natively through its action types, write targets, and DQE integration.
BP-18.2: Environment promotion — use substitution files per environment¶
Same YAML configs, different --env flags. LHP resolves all tokens per environment.
Generated code is environment-specific but source configs are environment-agnostic.
BP-18.3: Multi-pipeline orchestration — use job_name and lhp deps¶
LHP’s dependency analysis produces pipeline-level and job-level dependency graphs. Use these to build Databricks Workflow orchestration that respects data dependencies across pipelines.
See also
Dependency Analysis & Job Generation for pipeline dependency analysis and orchestration job generation.
BP-18.4: Multi-source ingestion — use multiple load/write actions targeting the same table¶
LHP consolidates multiple write actions to the same streaming table into multiple
append_flow functions. This supports fan-in patterns (multiple sources -> one table)
natively.
19. Documentation & Discoverability¶
BP-19.1: Use description fields on every action and write target¶
LHP passes descriptions through to generated code comments and table metadata. Fill these in consistently.
BP-19.2: Use comment on write targets for Unity Catalog table descriptions¶
These appear in the Data Explorer and are queryable. Make them meaningful: “Silver layer orders — deduped, validated, enriched with customer data.”
BP-19.3: Use YAML comments for “why” decisions¶
# Using batch mode because source schema changes frequently and CDC is not supported
readMode: batch
The YAML declares what; comments explain why.
BP-19.4: Use lhp info and lhp stats for project documentation¶
These commands produce summaries of project structure, pipeline counts, and action distributions. Use them in onboarding documentation.
See also
CLI Reference for the full CLI command reference.
20. Anti-Patterns to Avoid¶
Warning
The following are common mistakes that undermine the value of using LHP. Each anti-pattern lists the impact and the recommended fix.
| ID | Anti-Pattern | Why It’s Harmful | Fix |
|---|---|---|---|
| AP-1 | Hardcoding catalog/schema names in YAML | Makes environment promotion impossible | Always use substitution tokens |
| AP-2 | Using fail expectations at bronze | One bad record stops the entire pipeline | Use warn at bronze; reserve fail for gold-critical invariants |
| AP-3 | Skipping lhp validate before generating | Generation errors from invalid config are harder to diagnose | Always validate first |
| AP-4 | Using streaming tables for join-based enrichment | Streaming tables don’t recompute when dimensions change | Use materialized views for any join with updating dimensions |
| AP-5 | Building templates before understanding the pattern | Leads to over-generalised, hard-to-use templates | Write 3+ concrete flowgroups first, then extract |
| AP-6 | Treating preset changes as low-risk | A global preset change affects every pipeline using it | Validate the full project after any preset change |
| AP-7 | Not using operational metadata | Debugging production issues without audit columns is very hard | Use LHP’s operational metadata system consistently |
| AP-8 | Monolithic YAML files | Unreadable, unreviewable, untestable | One pipeline per file |
| AP-9 | Secrets in substitution files | Secrets in version control will be leaked | Use the ${secret:scope/key} syntax |
| AP-10 | Ignoring schema rescue for CloudFiles | Schema mismatches without rescue silently drop data | Always enable cloudFiles.schemaEvolutionMode: rescue with a rescued data column |
| AP-11 | Dumping all SQL files in a flat sql/ directory | At 100+ SQL files, finding the right one is painful | Use sql/<system>/<layer>/ subdirectories |
| AP-12 | Using subdirectories for templates or presets | LHP only discovers flat *.yaml files at the top level | Use prefix-based naming instead (see Section 2) |
| AP-13 | Generic names without system/layer context | At 200+ pipelines, names like bronze_pipeline become meaningless | Use ID-based naming: TMPLxxx prefixes and <system>_<layer> patterns (see Section 3) |