Enterprise Best Practices¶
A comprehensive guide for data engineers using Lakehouse Plumber (LHP) in enterprise environments. These best practices bring together Databricks Lakeflow Declarative Pipelines conventions, enterprise configuration-framework patterns, and LHP-specific capabilities.
1. Project Structure & Organisation¶
BP-1.1: Organize pipeline YAML files by data domain¶
Group by business domain (orders/, customers/, inventory/) rather than by
action type (loads/, transforms/). LHP discovers flowgroups from the pipelines/
directory and supports subdirectories, so pipelines/orders/bronze_ingest.yaml works
natively.
BP-1.2: Keep each YAML file small and single-purpose¶
Target 50–200 lines. Use LHP’s multi-document (---) or array syntax only for tightly
related flowgroups that share a pipeline. Monolithic files with 15+ flowgroups become
unreadable and unreviewable.
See also
Multi-Flowgroup YAML Files for details on multi-document and array syntax.
BP-1.3: Use include patterns to filter pipeline discovery¶
For large repos, use the include glob patterns in lhp.yaml to control which pipeline
files are processed per environment or team. This enables a mono-repo structure where each
team’s files coexist without interfering.
BP-1.4: Separate presets, templates, and substitutions into dedicated directories¶
Follow the standard LHP project layout. See Section 2 for detailed subdirectory guidance within each top-level directory.
presets/ # Reusable defaults (flat — no subdirectory discovery)
templates/ # Reusable action patterns (flat — use prefix-based grouping)
substitutions/ # Environment-specific tokens (dev.yaml, prod.yaml)
pipelines/ # Flowgroup definitions (supports deep subdirectories)
sql/ # External SQL files (supports deep subdirectories)
schemas/ # External schema files (supports deep subdirectories)
expectations/ # External DQE files (supports deep subdirectories)
python_modules/ # External Python modules (supports deep subdirectories)
2. File Organisation & Subdirectory Structure¶
LHP file types have different subdirectory support. Understanding this is critical for organizing an enterprise project with hundreds of files.
Subdirectory Support Matrix¶
| File Type | Base Directory | Subdirectory Support | Extensions | Notes |
|---|---|---|---|---|
| Pipeline YAMLs | pipelines/ | Full recursive | .yaml ¹ | Discovered recursively; filtered by include patterns in lhp.yaml |
| SQL files (.sql) | project root | Full recursive | .sql | Referenced by relative path from project root |
| Schema files (.ddl, .yaml, .json) | project root | Full recursive | .ddl, .yaml, .json | Referenced by relative path from project root |
| Expectations files (.yaml, .json) | project root | Full recursive | .yaml, .json | Referenced by relative path from project root |
| Python modules (.py) | project root | Full recursive | .py | Referenced by relative path from project root |
| Templates | templates/ | Flat only | .yaml ¹ | Discovery uses glob("*.yaml") |
| Presets | presets/ | Flat only | .yaml ¹ | Discovery uses glob("*.yaml") |
| Substitutions | substitutions/ | Flat only | .yaml ¹ | One file per environment |
¹ The .yml extension is also accepted, but .yaml is recommended for consistency.
BP-2.1: Organize pipeline YAMLs by source system, then by medallion layer¶
LHP recursively discovers all .yaml/.yml files under pipelines/. Use a
two-level hierarchy — source system first, layer second — so that each team owns a clear
subtree:
pipelines/
system_a/ # Source system / data domain
bronze/
system_a_bronze_ingest.yaml # CloudFiles ingestion
silver/
system_a_silver_cleanse.yaml # Validation and enrichment
gold/
system_a_gold_reporting.yaml # Aggregations
system_b/
bronze/
system_b_bronze_ingest.yaml
silver/
system_b_silver_merge.yaml
shared/
gold/
cross_domain_metrics.yaml # Cross-system gold tables
This structure maps cleanly to CODEOWNERS (pipelines/system_a/ owned by Team A) and
to include patterns when you need to generate a subset.
BP-2.2: Organize SQL files mirroring the pipeline structure¶
All sql_path references resolve relative to the project root, so
sql_path: sql/system_a/bronze/cleanse_raw.sql works natively. Mirror the pipeline
directory hierarchy:
sql/
system_a/
bronze/
parse_json_payload.sql
silver/
enrich_orders.sql
validate_customers.sql
gold/
daily_revenue_summary.sql
system_b/
silver/
merge_inventory.sql
shared/
lookups/
currency_conversion.sql
When referencing from YAML:
actions:
- name: transform_enrich_orders
type: transform
transform_type: sql
sql_path: sql/system_a/silver/enrich_orders.sql
source: load_raw_orders
target: enriched_orders_view
BP-2.3: Organize schema files by source system and layer¶
Schema files (DDL, YAML, or JSON) also resolve relative to the project root:
schemas/
system_a/
bronze/
raw_orders_schema.yaml # CloudFiles schema hints
raw_customers_schema.ddl # DDL format
silver/
orders_strict_schema.yaml # Schema transform definitions
system_b/
bronze/
raw_inventory_schema.json # JSON format
When referencing:
actions:
- name: transform_enforce_schema
type: transform
transform_type: schema
schema_file: schemas/system_a/silver/orders_strict_schema.yaml
enforcement: strict
BP-2.4: Organize expectations files by domain and quality tier¶
Store DQE expectation files in a dedicated expectations/ directory, grouped by domain
and quality tier:
expectations/
system_a/
bronze/
raw_orders_warn.yaml # Bronze: warn-only rules
silver/
orders_drop_rules.yaml # Silver: drop invalid rows
orders_quarantine_rules.yaml # Silver: quarantine criteria
gold/
revenue_fail_rules.yaml # Gold: fail on critical invariants
shared/
common_not_null_rules.yaml # Reusable cross-domain rules
When referencing:
actions:
- name: transform_dqe_orders
type: transform
transform_type: data_quality
expectations_file: expectations/system_a/silver/orders_drop_rules.yaml
source: enriched_orders_view
BP-2.5: Organize Python modules by function type¶
For Python-based loads, transforms, and sinks, group modules by their role:
python_modules/
transforms/
system_a/
ml_scoring.py
custom_dedup.py
shared/
phone_normalizer.py
datasources/
erp_connector.py # Custom DataSource V2
sinks/
webhook_sink.py # Custom DataSink
foreachbatch/
notify_downstream.py # ForEachBatch handlers
BP-2.6: Use prefix-based grouping for templates¶
Templates are discovered only at the top level of templates/ — subdirectories are
not discovered by lhp list_templates. Instead, use a structured prefix convention
to categorize templates:
templates/
TMPL001_brz_load_cloudfiles_standard.yaml # Bronze / Load / CloudFiles
TMPL002_brz_load_kafka_events.yaml # Bronze / Load / Kafka
TMPL003_brz_load_delta_snapshot.yaml # Bronze / Load / Delta snapshot
TMPL004_slv_transform_sql_enrichment.yaml # Silver / Transform / SQL
TMPL005_slv_transform_cdc_merge.yaml # Silver / Transform / CDC
TMPL006_slv_write_streaming_table_std.yaml # Silver / Write / Streaming Table
TMPL007_gld_write_materialized_view_agg.yaml # Gold / Write / Materialized View
TMPL008_full_bronze_to_silver_pipeline.yaml # Full pipeline template (multi-action)
The prefix pattern <layer>_<action_type>_<detail> makes templates scannable in
lhp list_templates output and in file explorers. When you have 30+ templates, this
prefix is the primary way to find the right one.
See also
Templates Reference for details on creating and using templates.
BP-2.7: Use prefix-based grouping for presets¶
Like templates, presets are discovered only at the top level of presets/. Use prefixes
to encode scope and layer:
presets/
global_defaults.yaml # Organization-wide
brz_standard.yaml # Bronze layer defaults
brz_cloudfiles_json.yaml # Bronze / CloudFiles / JSON specific
brz_cloudfiles_csv.yaml # Bronze / CloudFiles / CSV specific
slv_standard.yaml # Silver layer defaults
slv_cdc_scd2.yaml # Silver / CDC / SCD Type 2
gld_standard.yaml # Gold layer defaults
ord_custom_overrides.yaml # Orders domain custom
See also
Presets Reference for details on preset inheritance and merging.
BP-2.8: Use include patterns for team-scoped generation¶
When multiple teams share a mono-repo, use include patterns in lhp.yaml to generate
only relevant pipelines. Patterns are matched against paths relative to pipelines/:
# lhp.yaml — generate only system_a pipelines
include:
- "system_a/**/*.yaml"
Or selectively include specific layers:
# Only bronze pipelines across all systems
include:
- "**/bronze/*.yaml"
BP-2.9: Full enterprise project layout example¶
my_lhp_project/
lhp.yaml # Project config
substitutions/
dev.yaml
staging.yaml
prod.yaml
presets/
global_defaults.yaml
brz_standard.yaml
brz_cloudfiles_json.yaml
slv_standard.yaml
slv_cdc_scd2.yaml
gld_standard.yaml
templates/
TMPL001_brz_load_cloudfiles_standard.yaml
TMPL002_slv_transform_sql_enrichment.yaml
TMPL003_gld_write_mv_aggregation.yaml
pipelines/
system_a/
bronze/
system_a_bronze_ingest_TMPL001.yaml
silver/
system_a_silver_cleanse_TMPL002.yaml
gold/
system_a_gold_reporting_TMPL003.yaml
system_b/
bronze/
system_b_bronze_ingest_TMPL001.yaml
silver/
system_b_silver_merge_TMPL002.yaml
sql/
system_a/
silver/
enrich_orders.sql
gold/
daily_revenue.sql
system_b/
silver/
merge_inventory.sql
schemas/
system_a/
bronze/
raw_orders_schema.yaml
silver/
orders_strict_schema.yaml
system_b/
bronze/
raw_inventory_schema.yaml
expectations/
system_a/
bronze/
raw_orders_warn.yaml
silver/
orders_drop_rules.yaml
shared/
common_not_null_rules.yaml
python_modules/
transforms/
system_a/
ml_scoring.py
datasources/
erp_connector.py
generated/ # Output (per environment)
dev/
system_a_bronze_pipeline/
raw_orders.py
system_a_silver_pipeline/
orders_cleanse.py
3. Naming Conventions¶
BP-3.1: Use snake_case consistently across all identifiers¶
Pipelines, flowgroups, action names, templates, presets, variables, table names — all
snake_case. LHP generates Python function names from action names, so this ensures
valid Python identifiers.
BP-3.2: Prefix pipeline names with the source system and layer¶
erp_bronze_pipeline, crm_silver_pipeline — not bronze_pipeline or
pipeline_v2. At 200+ pipelines, generic names become meaningless. LHP uses the
pipeline field in flowgroups to group actions into output files.
See BP-3.9 for the full enterprise naming pattern.
BP-3.3: Name flowgroups to describe the data flow¶
erp_brz_raw_orders, erp_slv_orders_enriched — not cloudfiles_load_1 or
flowgroup_v2. The flowgroup name appears in generated file names and log output.
Embed the source system and layer for visibility. See BP-3.8 for the
full enterprise naming pattern.
BP-3.4: Name actions descriptively with the pattern <verb>_<entity>_<modifier>¶
load_raw_orders, transform_validate_orders, write_orders_silver,
test_orders_row_count. Action names become Python function names in generated code,
so clarity matters.
BP-3.5: Use SCREAMING_SNAKE_CASE for environment tokens¶
Environment tokens (${SOURCE_CATALOG}, ${LANDING_PATH}) are resolved from
substitution files. Local variables (%{table_name}, %{source_schema}) are
flowgroup-scoped. The case distinction makes it immediately clear which resolution
mechanism applies.
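The distinction looks like this inside a flowgroup (a minimal sketch; table and token names are illustrative):
variables:
  table_name: orders                      # local, flowgroup-scoped: %{...}
actions:
  - name: load_%{table_name}
    source:
      type: delta
      catalog: "${SOURCE_CATALOG}"        # environment token from substitutions/<env>.yaml
      database: "raw"
      table: "%{table_name}"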
See also
Substitutions & Secrets for the full substitution processing order and syntax.
BP-3.6: Never abbreviate in identifiers¶
customer_silver_merge not cust_slvr_mrg. Config files live in version control
forever; clarity beats brevity.
Structured Naming for Enterprise Visibility¶
At enterprise scale (100+ templates, 500+ flowgroups), flat alphabetical lists become
unmanageable. Templates use a TMPLxxx_ ID prefix to embed a unique sequence
number, making them instantly scannable and sortable. Flowgroup config files reference
the template ID as a _TMPLxxx suffix, creating a visible link between a config and
its template. All other artifacts — pipelines, presets, SQL files, schemas, and
expectations — use descriptive prefixes and directory structure for organisation.
BP-3.7: Use TMPLxxx ID prefixes for templates¶
Since templates live in a flat directory (see Section 2), the filename is
the only organisational mechanism. Use a TMPLxxx_ prefix with a sequential number,
followed by a structured name that encodes layer and action type:
Pattern: TMPLxxx_<layer>_<action_type>_<source_or_target_type>_<descriptive_name>
Examples:
TMPL001_brz_load_cloudfiles_standard # Bronze / Load / CloudFiles / standard pattern
TMPL002_brz_load_cloudfiles_with_schema # Bronze / Load / CloudFiles / with schema hints
TMPL003_brz_load_kafka_events # Bronze / Load / Kafka / event stream
TMPL004_slv_transform_sql_enrichment # Silver / Transform / SQL / enrichment pattern
TMPL005_slv_transform_cdc_merge # Silver / Transform / CDC / merge pattern
TMPL006_slv_write_st_with_dqe # Silver / Write / Streaming Table / with DQE
TMPL007_gld_write_mv_aggregation # Gold / Write / Materialized View / aggregation
TMPL008_e2e_full_bronze_to_silver # End-to-end / multi-action pipeline template
Layer prefixes: brz_ (bronze), slv_ (silver), gld_ (gold), e2e_
(end-to-end multi-action).
The TMPLxxx prefix sorts templates by creation order in lhp list_templates
output, while the layer prefix groups them logically. The ID also appears as a suffix
in flowgroup config filenames (see BP-3.8), creating a visible link
between configs and their templates.
BP-3.8: Use descriptive flowgroup names with a _TMPLxxx config file suffix¶
Flowgroup names become Python file names and function names in generated code. Embed the source system and layer for visibility across large projects:
Pattern: <system>_<layer>_<descriptive_name>
Examples:
erp_brz_raw_orders # ERP system / Bronze / raw orders
erp_brz_raw_customers # ERP system / Bronze / raw customers
erp_slv_orders_enriched # ERP system / Silver / enriched orders
erp_slv_customers_merged # ERP system / Silver / merged customers
erp_gld_daily_revenue # ERP system / Gold / daily revenue
crm_brz_raw_contacts # CRM system / Bronze / raw contacts
crm_slv_contacts_deduped # CRM system / Silver / deduped contacts
When naming the flowgroup config file, append the template ID as a suffix so the template relationship is visible at a glance without opening the file:
Pattern: <system>_<layer>_<description>_<TMPLxxx>.yaml
Examples:
erp_bronze_ingest_TMPL001.yaml # Uses TMPL001 (CloudFiles standard)
erp_silver_cleanse_TMPL004.yaml # Uses TMPL004 (SQL enrichment)
erp_gold_reporting_TMPL007.yaml # Uses TMPL007 (MV aggregation)
crm_bronze_contacts_TMPL001.yaml # Uses TMPL001 (CloudFiles standard)
This naming ensures that when you see a generated file erp_brz_raw_orders.py or a DLT
log entry for erp_slv_orders_enriched, you immediately know the source system and layer
without looking up the config. The _TMPLxxx suffix in the config filename lets you
identify the template at the file system level — useful when browsing directories, reviewing
PRs, or triaging issues.
BP-3.9: Use structured prefixes for pipeline names¶
Pipeline names determine the output directory structure under generated/{env}/ and
appear in Databricks UI. Use <system>_<layer>_pipeline for clear identification:
Pattern: <system>_<layer>_pipeline
Examples:
erp_bronze_pipeline # All ERP bronze ingestion
erp_silver_pipeline # All ERP silver transforms
erp_gold_pipeline # All ERP gold aggregations
crm_bronze_pipeline # All CRM bronze ingestion
shared_gold_pipeline # Cross-system gold tables
This gives you clean, predictable output directories:
generated/dev/
erp_bronze_pipeline/
erp_brz_raw_orders.py
erp_brz_raw_customers.py
erp_silver_pipeline/
erp_slv_orders_enriched.py
crm_bronze_pipeline/
crm_brz_raw_contacts.py
BP-3.10: Use consistent prefixes for presets¶
Since presets are also flat (no subdirectory discovery), the naming prefix is essential for organisation:
Pattern: <scope>_<layer>_<purpose>
Examples:
global_defaults # Organisation-wide standards
brz_standard # Bronze layer standard preset
brz_cloudfiles_json # Bronze / CloudFiles / JSON format
brz_cloudfiles_csv # Bronze / CloudFiles / CSV format
brz_kafka_events # Bronze / Kafka event preset
slv_standard # Silver layer standard preset
slv_cdc_scd2 # Silver / CDC / SCD Type 2
gld_standard # Gold layer standard preset
erp_custom # ERP domain custom overrides
Quick Reference Table¶
| Artifact | Convention | Example |
|---|---|---|
| Pipeline names | <system>_<layer>_pipeline | erp_bronze_pipeline |
| Flowgroup names | <system>_<layer>_<descriptive_name> | erp_brz_raw_orders |
| Action names | <verb>_<entity>_<modifier> | load_raw_orders |
| Config files | <system>_<layer>_<description>_<TMPLxxx>.yaml | erp_bronze_ingest_TMPL001.yaml |
| Template files | TMPLxxx_<layer>_<action_type>_<detail>.yaml | TMPL001_brz_load_cloudfiles_standard.yaml |
| Preset files | <scope>_<layer>_<purpose>.yaml | brz_cloudfiles_json.yaml |
| SQL files | sql/<system>/<layer>/<transform_name>.sql | sql/system_a/silver/enrich_orders.sql |
| Schema files | schemas/<system>/<layer>/<name>.<ext> | schemas/system_a/silver/orders_strict_schema.yaml |
| Expectations files | expectations/<system>/<layer>/<name>.yaml | expectations/system_a/silver/orders_drop_rules.yaml |
| Generated files | generated/<env>/<pipeline>/<flowgroup>.py | generated/dev/erp_bronze_pipeline/erp_brz_raw_orders.py |
| Env tokens | SCREAMING_SNAKE_CASE, ${TOKEN} | ${SOURCE_CATALOG} |
| Local variables | snake_case, %{var} | %{table_name} |
| Template params | snake_case, {{ param }} | {{ entity }} |
4. Template Design¶
BP-4.2: Keep template parameters minimal and well-documented¶
Every parameter should have a description and either be required: true or have a
sensible default. LHP validates required parameters at generation time and reports clear
errors for missing ones. Avoid templates with 15+ parameters — they add complexity without
reducing it.
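For illustration, a well-documented parameter block might look like the sketch below; the default field is an assumption here, since the examples elsewhere in this guide show only name, required, and description:
parameters:
  - name: source_path
    required: true
    description: "Landing path for the raw files"
  - name: file_format
    default: json                         # assumed field: provides a sensible default
    description: "CloudFiles format (json, csv, parquet)"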
BP-4.3: Establish “golden templates” for each common pipeline pattern¶
Maintain platform-team-owned templates for standard patterns, using the ID-based naming from Section 3:
TMPL001_brz_load_cloudfiles_standard — standard CloudFiles ingestion with operational metadata
TMPL002_brz_load_delta_snapshot — Delta table reads with standard options
TMPL003_slv_write_st_with_dqe — streaming table with DQE expectations
TMPL004_slv_transform_sql_enrichment — SQL-based silver enrichment
TMPL005_gld_write_mv_aggregation — materialized view for gold aggregations
These golden templates embed organisational standards (default expectations, metadata columns, table properties) so domain teams can’t accidentally skip them.
BP-4.4: Templates live in a flat directory — organise by naming convention¶
LHP discovers templates only from the top level of templates/ (using
glob("*.yaml"), not recursive). Subdirectories under templates/ are not
discovered by lhp list_templates. Instead, use the structured prefix convention from
BP-3.7 to group templates logically.
Note
Subdirectories under templates/ are not discovered. Referencing templates via
subfolder paths (e.g., use_template: "subfolder/name") is not supported. Stick to
the flat directory with prefix-based naming.
BP-4.5: Templates can reference presets — use this to layer defaults¶
A template can declare presets: [brz_standard] to inherit default options. Flowgroups
using the template can add additional presets that override. This creates a clean defaults
hierarchy: template presets -> flowgroup presets -> explicit action config.
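A minimal sketch of this layering (template and preset names are illustrative):
# Template declares baseline presets
name: slv_transform_sql_enrichment
presets: [slv_standard]

# Flowgroup using the template adds an overriding preset
pipeline: erp_silver_pipeline
flowgroup: erp_slv_orders_enriched
use_template: slv_transform_sql_enrichment
presets: [slv_standard, erp_custom]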
BP-4.6: Use template parameters for what varies; presets for what is standard¶
Template parameters should capture the unique aspects of each use case (source path, target table, specific columns). Standard aspects (table properties, operational metadata, reader options) belong in presets. This keeps template usage concise.
BP-4.7: Reference external files from templates using parameterised paths¶
Templates can reference external files via sql_path, schema_file, or
expectations_file. Use template parameters for the variable part of the path, combined
with a fixed subdirectory convention:
# Template: slv_transform_sql_enrichment.yaml
name: slv_transform_sql_enrichment
parameters:
- name: system
required: true
description: "Source system name (used in file paths)"
- name: entity
required: true
description: "Entity name"
actions:
- name: transform_enrich_{{ entity }}
type: transform
transform_type: sql
sql_path: "sql/{{ system }}/silver/enrich_{{ entity }}.sql"
source: "load_raw_{{ entity }}"
target: "enriched_{{ entity }}_view"
This way, the directory structure convention (sql/<system>/silver/) is baked into
the template, ensuring all teams follow the same file organisation.
See also
Templates Reference for the full template specification and Dynamic Templates Guide for conditionals, loops, and advanced Jinja2 features.
5. Preset Strategy¶
BP-5.1: Design a preset hierarchy — global, domain, pipeline-specific¶
LHP supports preset inheritance via extends and preset chaining (multiple presets in a
list, merged left-to-right). Use this to build layers:
global_defaults — organisation-wide standards (table properties, metadata)
bronze_standard extends global_defaults — bronze-layer conventions
orders_bronze extends bronze_standard — domain-specific overrides
BP-5.2: Encode organisational standards in presets, not just values¶
A high-value preset sets multiple related properties together:
name: bronze_standard
extends: global_defaults
defaults:
load_actions:
cloudfiles:
options:
cloudFiles.schemaEvolutionMode: rescue
cloudFiles.rescuedDataColumn: _rescued_data
cloudFiles.maxFilesPerTrigger: 1000
write_actions:
streaming_table:
table_properties:
pipelines.reset.allowed: "false"
operational_metadata:
- ingest_timestamp
- source_file
BP-5.3: Limit the total number of presets¶
Having more than 15–20 distinct presets leads to confusion and misuse. Consolidate overlapping
presets. The lhp list_presets command helps audit the current set.
BP-5.4: Use lhp show to verify effective configuration¶
After preset merging, template expansion, and substitution, the effective config can differ
from what the YAML file suggests. Always verify with lhp show <flowgroup> --env <env>
before deploying changes to shared presets. This is LHP’s equivalent of “fully resolved
config.”
BP-5.5: Treat preset changes as high-blast-radius events¶
A change to a global preset affects every pipeline using it. Version presets (add a version
field), document changes, and run lhp validate --env <env> across the entire project
before merging preset changes.
See also
Presets Reference for complete details on preset inheritance and merging.
6. Substitution & Environment Management¶
BP-6.1: Use directory-based environment separation¶
Maintain substitutions/dev.yaml, substitutions/staging.yaml,
substitutions/prod.yaml. All environments are visible on the same branch. LHP resolves
${token} patterns from these files.
BP-6.2: Put all environment-varying values in substitution tokens¶
Catalog names, schema names, storage paths, cluster policies, alert emails — all should be tokens. LHP supports recursive token expansion (tokens referencing other tokens, up to 10 iterations), so you can compose:
global:
catalog_prefix: main
dev:
catalog: "${catalog_prefix}_dev"
prod:
catalog: "${catalog_prefix}_prod"
BP-6.4: Never put secret values in substitution files¶
Use LHP’s ${secret:scope/key} syntax. LHP converts these to
dbutils.secrets.get(scope="scope", key="key") calls in generated code. Configure
secrets.default_scope and scopes aliases in the substitution file for clean
references.
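A hedged sketch of both pieces; the exact layout of the secrets block in the substitution file is an assumption based on the description above, and the scope and key names are illustrative:
# substitutions/prod.yaml (secrets block layout assumed)
secrets:
  default_scope: lakehouse_prod

# In a flowgroup, reference the secret instead of a literal value
options:
  cloudFiles.connectionString: "${secret:storage_scope/landing_connection_string}"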
Important
Secrets in substitution files will be committed to version control and leaked. Always
use the ${secret:scope/key} syntax exclusively.
BP-6.5: Use lhp substitutions to audit available tokens¶
Before writing flowgroups, run lhp substitutions --env <env> to check what tokens are
available. This prevents unresolved token errors at generation time.
BP-6.6: Design substitution tokens for the medallion pattern¶
Standard token set for a medallion project:
global:
bronze_catalog: "${catalog_prefix}_bronze"
silver_catalog: "${catalog_prefix}_silver"
gold_catalog: "${catalog_prefix}_gold"
landing_path_base: "abfss://landing@${storage_account}.dfs.core.windows.net"
See also
Substitutions & Secrets for the full substitution processing order and syntax.
7. Local Variables¶
BP-7.1: Use local variables for flowgroup-scoped repetition¶
When the same value (table name, schema, path segment) appears multiple times within a
single flowgroup, define it as a local variable rather than repeating it. LHP resolves
%{var} first, before template expansion.
BP-7.2: Prefer local variables over hardcoded values¶
variables:
entity: orders
source_schema: raw
actions:
- name: load_%{entity}
source:
table: "${BRONZE_CATALOG}.%{source_schema}.%{entity}"
BP-7.3: Do not use local variables for environment-specific values¶
%{var} is scoped to a single flowgroup and resolved at parse time. Environment-specific
values belong in substitution tokens (${TOKEN}) which are resolved per environment.
See also
Substitutions & Secrets for details on local variables and environment tokens.
8. FlowGroup Design¶
BP-8.1: Use array syntax with field inheritance for multi-flowgroup pipelines¶
When multiple flowgroups share the same pipeline, presets, or template, use LHP’s array syntax to inherit:
pipeline: orders_bronze
presets: [bronze_standard]
operational_metadata: true
flowgroups:
- flowgroup: raw_orders
actions: [...]
- flowgroup: raw_returns
actions: [...]
Inherited fields: pipeline, use_template, presets, operational_metadata,
job_name.
See also
Multi-Flowgroup YAML Files for the full multi-flowgroup reference.
BP-8.2: Scope one pipeline per data domain¶
Pipeline orders_bronze contains flowgroups raw_orders, raw_returns,
raw_refunds. Each flowgroup generates its own Python function set but runs in the same
DLT pipeline, enabling dependency resolution across them.
BP-8.3: Use job_name to group flowgroups into Databricks jobs¶
LHP’s lhp deps --format job generates job resource definitions. Use job_name to
control which flowgroups are orchestrated together in a Databricks Workflow.
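For example, flowgroups that should be orchestrated together simply share a job_name (names illustrative):
pipeline: erp_bronze_pipeline
flowgroup: erp_brz_raw_orders
job_name: erp_bronze_ingest_job
actions: [...]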
See also
Concepts & Architecture for details on job_name and multi-job orchestration.
BP-8.4: Order actions as Load, Transform, Write, Test¶
This matches the data flow direction and makes YAML files scannable. LHP resolves dependencies automatically, but consistent ordering improves readability.
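A skeleton flowgroup in this order (action bodies elided, names illustrative):
actions:
  - name: load_raw_orders             # 1. Load
    ...
  - name: transform_validate_orders   # 2. Transform
    ...
  - name: write_orders_silver         # 3. Write
    ...
  - name: test_orders_row_count       # 4. Test
    ...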
9. Load Actions¶
BP-9.1: Always set schemaEvolutionMode and rescuedDataColumn for CloudFiles¶
LHP’s CloudFiles generator supports all Auto Loader options. In production, always use:
source:
type: cloudfiles
path: "${LANDING_PATH}/orders/"
format: json
options:
cloudFiles.schemaEvolutionMode: rescue
cloudFiles.rescuedDataColumn: _rescued_data
Tip
Put these options in a bronze_standard preset so they apply everywhere without
repetition.
BP-9.2: Use readMode: stream for bronze, readMode: batch for lookups¶
LHP’s readMode field controls whether spark.readStream or spark.read is
generated. Bronze sources should stream; dimension/lookup tables should batch-read.
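A sketch of both modes; placing readMode alongside the other source fields is an assumption, and table names are illustrative:
# Bronze fact source: streaming read
source:
  type: delta
  catalog: "${BRONZE_CATALOG}"
  database: "raw"
  table: "orders"
  readMode: stream

# Dimension lookup: batch read
source:
  type: delta
  catalog: "${SILVER_CATALOG}"
  database: "reference"
  table: "currency_rates"
  readMode: batch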
BP-9.3: Use full three-part names via substitution tokens for Delta loads¶
source:
type: delta
catalog: "${SILVER_CATALOG}"
database: "orders"
table: "validated_orders"
LHP constructs catalog.database.table references. Never hardcode catalog or database
names.
BP-9.4: Rate-limit Auto Loader in production¶
Use cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigger options (via
presets) to prevent bronze ingestion from overwhelming downstream tables. Set this in your
bronze_standard preset.
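In a bronze_standard preset this might look like the following (the limits are illustrative):
name: bronze_standard
defaults:
  load_actions:
    cloudfiles:
      options:
        cloudFiles.maxFilesPerTrigger: 1000
        cloudFiles.maxBytesPerTrigger: "10g"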
BP-9.5: Use schema_hints for critical columns¶
LHP supports cloudFiles.schemaHints option strings. For columns where wrong type
inference would cause downstream failures (amounts, IDs, timestamps), provide explicit
hints.
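For example (column names and types illustrative):
source:
  type: cloudfiles
  path: "${LANDING_PATH}/orders/"
  format: json
  options:
    cloudFiles.schemaEvolutionMode: rescue
    cloudFiles.schemaHints: "order_id BIGINT, amount DECIMAL(18,2), order_ts TIMESTAMP"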
See also
Load Actions for the full load action specification.
10. Transform Actions¶
BP-10.1: Default to SQL transforms for silver/gold layer logic¶
LHP’s SQL transform generator supports inline SQL or external SQL files via sql_path.
SQL is more readable, more widely understood, and easier to review than Python transforms
for standard operations. Use external SQL files for anything over ~5 lines.
BP-10.2: Use external SQL files for complex transformations¶
LHP resolves sql_path relative to the project root. Store SQL in
sql/<system>/<layer>/<transform_name>.sql (see Section 2). This keeps
YAML files concise and enables SQL-specific linting.
BP-10.3: Use Python transforms only when SQL cannot express the logic¶
LHP’s Python transform generator copies external modules and calls your function. The signature depends on the number of sources:
Single source: function(df, spark, parameters) — receives the source DataFrame directly
Multiple sources: function(dataframes, spark, parameters) — receives a list of DataFrames
No sources: function(spark, parameters) — the function generates data from scratch
Reserve Python transforms for UDFs, ML scoring, or complex procedural logic.
BP-10.4: Use schema transforms for explicit column control¶
LHP’s schema transform type supports column renaming (arrow syntax:
old_name -> new_name), type casting, and strict/permissive enforcement. Use
enforcement: strict at silver to reject unexpected columns from bronze.
BP-10.5: Use data_quality transforms for DQE expectations¶
LHP’s data_quality transform type reads expectations from YAML/JSON files or inline
definitions, generating the appropriate @dp.expect_all(),
@dp.expect_all_or_drop(), or @dp.expect_all_or_fail() decorators.
BP-10.6: Use temp_table transforms for intermediate calculations¶
LHP generates @dp.table(temporary=True) for temp tables. Use these for intermediate
steps that should not be published to Unity Catalog.
See also
Transform Actions for the full transform action specification.
11. Write Actions¶
BP-11.1: Default to materialized views for silver/gold layers¶
LHP’s materialized_view write target generates @dp.materialized_view(). Materialized
views always produce correct results — they reprocess when source data changes. Use them for
all joins, aggregations, and enrichment.
BP-11.2: Use streaming tables for bronze ingestion and CDC targets¶
LHP’s streaming_table write target generates dp.create_streaming_table() +
@dp.append_flow(). Streaming tables are optimal for append-only ingestion.
Important
Joins in streaming tables do not recompute when dimensions change — use materialized views for enrichment.
BP-11.3: Set pipelines.reset.allowed: "false" on history tables¶
LHP supports table_properties in write targets. This prevents accidental full refresh
from destroying historical data:
write_target:
type: streaming_table
table_properties:
pipelines.reset.allowed: "false"
Tip
Put this in your silver_standard and gold_standard presets.
BP-11.4: Use cluster_columns (liquid clustering) instead of partition_columns¶
LHP supports both, but liquid clustering is the modern recommendation. It’s incremental, allows redefining keys without rewriting data, and works well with high-cardinality columns:
write_target:
type: streaming_table
cluster_columns: [customer_id, order_date]
BP-11.5: Use comment on every write target¶
LHP passes the comment field to the generated table/view definition. This appears in
Unity Catalog UI and is queryable.
BP-11.6: Use spark_conf for per-table performance tuning¶
LHP supports spark_conf on write targets. Use it for adaptive shuffle or per-table
optimisations rather than global pipeline settings.
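A short sketch (the chosen conf and value are illustrative):
write_target:
  type: materialized_view
  spark_conf:
    spark.sql.shuffle.partitions: "64"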
BP-11.7: For CDC, use the cdc mode with explicit cdc_config¶
LHP generates dp.create_auto_cdc_flow() with full support for keys,
sequence_by (including STRUCT for tie-breaking), scd_type (1 or 2),
apply_as_deletes, ignore_null_updates, track_history_column_list, and
track_history_except_column_list options. Always specify sequence_by explicitly.
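A sketch assembled from the fields listed above; nesting cdc_config under streaming_table_config with mode: "cdc" mirrors the snapshot_cdc example in BP-11.10 and is an assumption for this mode:
write_target:
  type: streaming_table
  streaming_table_config:
    mode: "cdc"                       # assumed by analogy with snapshot_cdc
    cdc_config:
      keys: [order_id]
      sequence_by: _commit_timestamp
      scd_type: 2
      apply_as_deletes: "operation = 'DELETE'"
      ignore_null_updates: true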
BP-11.8: Use once: true for backfill flows¶
LHP supports the once flag on individual actions, generating one-time flows for
historical data backfill without affecting the ongoing streaming ingestion.
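An abridged sketch of a backfill action carrying the flag (names illustrative):
- name: load_orders_history_backfill
  once: true                          # one-time flow for historical backfill
  source:
    type: delta
    catalog: "${BRONZE_CATALOG}"
    database: "archive"
    table: "orders_history"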
BP-11.9: Multiple write actions targeting the same table are automatically grouped¶
LHP consolidates multiple sources writing to the same streaming table into one
create_streaming_table with multiple append_flow functions. Use this for
multi-source ingestion patterns.
BP-11.10: Use snapshot_cdc mode for full-snapshot change data capture¶
LHP also supports mode: "snapshot_cdc" on streaming tables, generating
dp.create_auto_cdc_from_snapshot_flow(). Use this when your source provides full
snapshots (not a change feed) and you want LHP to detect changes automatically.
Configuration uses snapshot_cdc_config (not cdc_config):
write_target:
type: streaming_table
streaming_table_config:
mode: "snapshot_cdc"
snapshot_cdc_config:
source_function:
file: "functions/my_snapshots.py"
function: "my_snapshot_function"
keys: [id]
stored_as_scd_type: 2
Key differences from cdc mode:
Config key is snapshot_cdc_config (not cdc_config)
SCD type field is stored_as_scd_type (not scd_type)
Requires a source_function with file and function fields
Does not use sequence_by — ordering is implicit from snapshot timing
BP-11.11: Use sink write targets for streaming to external destinations¶
LHP supports a sink write target type for writing to external systems. Four sink
subtypes are available:
delta — write to external Delta tables outside Unity Catalog (e.g., cross-workspace or external storage)
kafka — write to Kafka or Azure Event Hubs for event-driven architectures
custom — use a custom DataSink V2 class via the custom_sink_class config field
foreachbatch — ForEachBatch handlers for custom per-batch processing (API calls, notifications, etc.)
write_target:
type: sink
sink_type: kafka
sink_config:
kafka.bootstrap.servers: "${KAFKA_BROKERS}"
topic: "enriched_orders"
Use sinks when data must leave the lakehouse — for downstream consumers, event buses, or external APIs. Pair with streaming tables for the primary lakehouse copy.
See also
Write Actions for the full write action specification.
12. Data Quality (Expectations)¶
BP-12.1: Tier expectations by medallion layer¶
Bronze: warn only — never drop or fail at bronze. Every raw record is precious.
Silver: drop for structural quality rules. Route violations to a quarantine table.
Gold/Critical: fail for reference table integrity and business-critical invariants.
LHP’s DQE parser supports failureAction: fail|drop|warn in expectation files and
generates the appropriate decorators.
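An illustrative expectations file using failureAction; the surrounding file layout (the expectations list and constraint key) is an assumption, so check the DQE reference for the authoritative format:
# expectations/system_a/silver/orders_drop_rules.yaml (illustrative layout)
expectations:
  - name: valid_order_id_not_null
    constraint: "order_id IS NOT NULL"
    failureAction: drop
  - name: valid_amount_positive
    constraint: "amount > 0"
    failureAction: drop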
See also
For configuring quarantine mode in LHP, see Quarantine (Dead Letter Queue).
BP-12.2: Centralise expectation definitions in external DQE files¶
LHP supports expectations_file pointing to YAML/JSON files. Store these in
expectations/<domain>/ and reference them from multiple actions. This enables reuse and
independent review of quality rules.
BP-12.3: Name expectations descriptively¶
Convention: valid_<column>_<constraint_type> (e.g., valid_order_id_not_null,
valid_amount_positive). These names appear in the DLT Data Quality tab and event log.
BP-12.5: Use test actions for cross-table validation¶
LHP’s 9 test action types (row_count, uniqueness, referential_integrity,
completeness, range, schema_match, all_lookups_found, custom_sql,
custom_expectations) generate SQL-based validation views. Use the --include-tests flag
to generate them. Always run these in staging before production deployment.
To publish test results to external systems like Azure DevOps or a Delta audit table, see Test Result Reporting (Publishing).
See also
Test Actions (Data Quality Unit Tests) for the full test action specification.
13. Operational Metadata¶
BP-13.1: Define operational metadata columns in lhp.yaml¶
LHP supports project-level operational_metadata with column definitions, presets, and
defaults. Define standard columns once:
operational_metadata:
columns:
ingest_timestamp:
expression: "F.current_timestamp()"
description: "When the record was ingested"
applies_to: [streaming_table, materialized_view]
source_file:
expression: "F.input_file_name()"
description: "Source file path"
applies_to: [streaming_table]
enabled: true
pipeline_id:
expression: "F.lit(spark.conf.get('pipelines.id'))"
description: "Pipeline identifier"
additional_imports:
- "from pyspark.sql import functions as F"
Each column config supports these fields:
expression (required) — PySpark expression string
description — human-readable description
applies_to — list of target types (default: [streaming_table, materialized_view])
enabled — boolean to enable/disable the column (default: true)
additional_imports — list of extra Python import statements needed by the expression
BP-13.2: Create metadata presets for different layers¶
LHP supports operational_metadata.presets for named groups in lhp.yaml:
operational_metadata:
presets:
bronze_standard: [ingest_timestamp, source_file, pipeline_id]
silver_standard: [updated_at, pipeline_run_id]
Note
Metadata presets are defined at the project level for documentation and organisational
purposes. At the flowgroup or action level, operational_metadata accepts either
true (to enable all columns) or an explicit list of column name strings — not preset
names. Reference the preset definitions as a guide when writing the column name lists in
your flowgroups.
BP-13.3: Metadata is additive across preset, flowgroup, and action levels¶
LHP deep-merges operational metadata with deduplication. This means you can set a baseline in a preset and add columns at the flowgroup or action level without losing the preset columns.
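For example, a preset can set the baseline while a flowgroup adds a column without losing it (column names taken from the lhp.yaml example in BP-13.1):
# In a preset: baseline metadata columns
operational_metadata:
  - ingest_timestamp
  - source_file

# In a flowgroup: adds pipeline_id; the preset columns survive the merge
operational_metadata:
  - pipeline_id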
BP-13.4: Use applies_to to control which target types get each column¶
input_file_name() is only valid in streaming/batch reads — set
applies_to: [streaming_table]. current_timestamp() works everywhere — set
applies_to: [streaming_table, materialized_view].
See also
Operational Metadata for the full operational metadata reference.
14. Schema Management¶
BP-14.1: Use schema files for bronze layer schema definition¶
LHP’s schema_file field in load actions points to external DDL, YAML, or JSON schema
files. This makes schema definitions reviewable independently of pipeline config.
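A sketch of a CloudFiles load referencing an external schema file; placing schema_file directly inside the source block is an assumption:
source:
  type: cloudfiles
  path: "${LANDING_PATH}/orders/"
  format: json
  schema_file: schemas/system_a/bronze/raw_orders_schema.yaml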
BP-14.2: Use schema transforms at the bronze-to-silver boundary¶
LHP’s schema transform type provides explicit column control:
Arrow syntax for renaming: old_col -> new_col
Type casting: amount: decimal(18,2)
Strict enforcement to reject unexpected columns
BP-14.3: Use enforcement: strict at silver to prevent schema drift¶
LHP’s schema transform with enforcement: strict generates code that only keeps declared
columns. Combined with silver-layer DQE expectations, this creates a clean schema contract
between bronze and silver.
15. Validation & CI Integration¶
BP-15.1: Run lhp validate as a blocking CI check on every PR¶
LHP’s validation stack catches: missing required fields, unknown fields (with fuzzy-match suggestions), circular dependencies, invalid references, template parameter mismatches, and type-specific validation for all 7 load types, 5 transform types, and all write target types.
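A minimal CI sketch using GitHub Actions as an example; the install step and package name are assumptions, so adapt them to how your organisation distributes LHP:
# .github/workflows/lhp-validate.yml (illustrative)
name: lhp-validate
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install lakehouse-plumber      # assumed package name
      - run: lhp validate --env dev             # blocking validation check
      - run: lhp generate --env dev --dry-run   # catch generation errors early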
BP-15.2: Run lhp generate --dry-run to verify code generation¶
Dry-run generates code without writing files. Use this in CI to catch generation errors early.
BP-15.3: Maintain dry-run baselines for regression detection¶
Commit expected generated output to the repo. In CI, run lhp generate --dry-run and
diff against baselines. Unexpected changes (especially from preset modifications) are
flagged for review. This is the config-equivalent of snapshot testing.
BP-15.4: Layer your CI validation pipeline¶
| Layer | What it checks | Tool |
|---|---|---|
| Syntax | Valid YAML, correct indentation | YAML linter |
| Schema | Required fields, correct types | JSON Schema (LHP-provided schemas) |
| Semantic | References resolve, no circular deps | lhp validate |
| Generation | Config generates valid Python | lhp generate --dry-run |
| Regression | No unintended diff in output | Baseline comparison |
| Functional | Test actions pass | Test actions (generated with --include-tests) |
See also
CI/CD Reference for comprehensive CI/CD patterns and deployment strategies.
16. State Management & Incremental Generation¶
BP-16.1: Do not commit .lhp_state.json to version control¶
LHP’s state tracking enables smart regeneration — only files whose source YAML,
dependencies, or generation context changed are regenerated. This significantly speeds up
lhp generate for large projects, but the state file itself must not be committed to version control.
BP-16.2: Use lhp state to audit orphaned and stale files¶
After refactoring (renaming flowgroups, deleting pipelines), use the available flags to audit and manage state:
| Flag | Purpose |
|---|---|
| --orphaned | Show generated files with no corresponding source YAML |
| (see CLI Reference) | Show files where the source YAML has changed since last generation |
| (see CLI Reference) | Show new/untracked YAML files that haven’t been generated yet |
| --cleanup | Remove orphaned files |
| (see CLI Reference) | Regenerate stale files |
| --dry-run | Preview cleanup or regen without actually modifying files |
Combine filters: lhp state --env dev --orphaned --cleanup --dry-run previews which
orphaned files would be deleted.
BP-16.3: Use --force only when necessary¶
LHP’s ForceGenerationStrategy regenerates everything. Use it only after framework
upgrades or preset changes where you want to verify all output. Normal development should
rely on smart generation.
See also
CLI Reference for the full lhp state command reference.
17. Bundle Integration (Databricks Asset Bundles)¶
BP-17.1: Use lhp deps --format job to generate DAB job resource definitions¶
LHP analyses dependencies and generates pipeline and job resource YAML for Databricks
Asset Bundles. Use --bundle-output to specify where bundle files are written.
BP-17.2: Bundle scaffolding is included by default¶
LHP scaffolds the full DAB structure by default with lhp init, including
databricks.yml, resource definitions, and standard folder layout. Use
lhp init <name> --no-bundle to skip DAB setup if you manage bundle configuration
separately.
BP-17.3: Keep generated bundle resources separate from hand-written ones¶
LHP generates bundle resources from dependency analysis. Store them in a dedicated
directory (e.g., bundle/generated/) so they can be regenerated without conflicting
with manually defined resources.
See also
Databricks Asset Bundles Integration for the full bundle integration guide.
18. Architectural Pattern Support¶
BP-18.1: Medallion architecture — use LHP’s layered approach¶
| Layer | Write Target | DQE Tier | Metadata | Key Characteristics |
|---|---|---|---|---|
| Bronze | Streaming table | warn | ingest_timestamp, source_file | Raw ingestion, CloudFiles/Kafka, schema rescue |
| Silver | Materialized view | drop | updated_at, pipeline_run_id | Validated, deduplicated, schema-enforced |
| Gold | Materialized view | fail | (inherited) | Aggregations, denormalised reporting |
LHP supports all these natively through its action types, write targets, and DQE integration.
BP-18.2: Environment promotion — use substitution files per environment¶
Same YAML configs, different --env flags. LHP resolves all tokens per environment.
Generated code is environment-specific but source configs are environment-agnostic.
BP-18.3: Multi-pipeline orchestration — use job_name and lhp deps¶
LHP’s dependency analysis produces pipeline-level and job-level dependency graphs. Use these to build Databricks Workflow orchestration that respects data dependencies across pipelines.
See also
Dependency Analysis & Job Generation for pipeline dependency analysis and orchestration job generation.
BP-18.4: Multi-source ingestion — use multiple load/write actions targeting the same table¶
LHP consolidates multiple write actions to the same streaming table into multiple
append_flow functions. This supports fan-in patterns (multiple sources -> one table)
natively.
19. Documentation & Discoverability¶
BP-19.1: Use description fields on every action and write target¶
LHP passes descriptions through to generated code comments and table metadata. Fill these in consistently.
BP-19.2: Use comment on write targets for Unity Catalog table descriptions¶
These appear in the Data Explorer and are queryable. Make them meaningful: “Silver layer orders — deduped, validated, enriched with customer data.”
BP-19.3: Use YAML comments for “why” decisions¶
# Using batch mode because source schema changes frequently and CDC is not supported
readMode: batch
The YAML declares what; comments explain why.
BP-19.4: Use lhp info and lhp stats for project documentation¶
These commands produce summaries of project structure, pipeline counts, and action distributions. Use them in onboarding documentation.
See also
CLI Reference for the full CLI command reference.
20. Anti-Patterns to Avoid¶
Warning
The following are common mistakes that undermine the value of using LHP. Each anti-pattern lists the impact and the recommended fix.
| ID | Anti-Pattern | Why It’s Harmful | Fix |
|---|---|---|---|
| AP-1 | Hardcoding catalog/schema names in YAML | Makes environment promotion impossible | Always use substitution tokens |
| AP-2 | Using fail expectations at bronze | One bad record stops the entire pipeline | Use warn at bronze; reserve fail for gold-critical invariants |
| AP-3 | Skipping lhp validate before generating | Generation errors from invalid config are harder to diagnose | Always validate first |
| AP-4 | Using streaming tables for join-based enrichment | Streaming tables don’t recompute when dimensions change | Use materialized views for any join with updating dimensions |
| AP-5 | Building templates before understanding the pattern | Leads to over-generalised, hard-to-use templates | Write 3+ concrete flowgroups first, then extract |
| AP-6 | Treating preset changes as low-risk | A global preset change affects every pipeline using it | Validate the full project after any preset change |
| AP-7 | Not using operational metadata | Debugging production issues without audit columns is very hard | Use LHP’s operational metadata system consistently |
| AP-8 | Monolithic YAML files | Unreadable, unreviewable, untestable | One pipeline per file |
| AP-9 | Secrets in substitution files | Secrets in version control will be leaked | Use the ${secret:scope/key} syntax |
| AP-10 | Ignoring schema rescue for CloudFiles | Schema mismatches without rescue silently drop data | Always enable cloudFiles.schemaEvolutionMode: rescue with a rescued data column |
| AP-11 | Dumping all SQL files in a flat sql/ directory | At 100+ SQL files, finding the right one is painful | Use sql/<system>/<layer>/ subdirectories |
| AP-12 | Using subdirectories for templates or presets | LHP only discovers flat *.yaml files at the top level | Use prefix-based naming instead (see Section 2) |
| AP-13 | Generic names without system/layer context | At 200+ pipelines, names like bronze_pipeline become meaningless | Use ID-based naming: TMPLxxx prefixes and <system>_<layer> patterns (see Section 3) |