Multi-Flowgroup YAML Files
Overview
LakehousePlumber supports defining multiple flowgroups in a single YAML file, reducing file proliferation and improving project organization. This is particularly useful when you have many similar flowgroups that share common configuration.
Key Benefits:
Reduce file count (e.g., 93 files → 20-30 files)
Group related flowgroups logically (e.g., all SAP master data tables)
Share common configuration across flowgroups
Maintain cleaner project structure
Easier to maintain and review
Two Syntaxes Available
LakehousePlumber supports two syntaxes for multi-flowgroup files:
Multi-Document Syntax using --- separators (traditional YAML)
Array Syntax with flowgroups: list (with inheritance)
Both syntaxes work identically for generation, validation, and all LHP commands.
Multi-Document Syntax
Use YAML’s multi-document feature with --- separators to define multiple flowgroups in one file.
Example: SAP Master Data Ingestions
# Brand master data ingestion from SAP
pipeline: raw_ingestions_sap
flowgroup: sap_brand_ingestion
use_template: TMPL003_parquet_ingestion_template
template_parameters:
  table_name: raw_sap_brand
  landing_folder: brand
---
# Category master data ingestion from SAP
pipeline: raw_ingestions_sap
flowgroup: sap_cat_ingestion
use_template: TMPL003_parquet_ingestion_template
template_parameters:
  table_name: raw_sap_cat
  landing_folder: category
---
# Carrier master data ingestion from SAP
pipeline: raw_ingestions_sap
flowgroup: sap_carrier_ingestion
use_template: TMPL003_parquet_ingestion_template
template_parameters:
  table_name: raw_sap_carrier
  landing_folder: carrier
Characteristics:
Each flowgroup is a complete, self-contained document
Use --- to separate documents
No field inheritance between documents
Best for flowgroups with different structures
Array Syntax with Inheritance
Use the flowgroups: array syntax to define multiple flowgroups with shared configuration at the document level.
Example: SAP Master Data Ingestions (with inheritance)
# SAP Master Data Ingestions
pipeline: raw_ingestions_sap
use_template: TMPL003_parquet_ingestion_template
flowgroups:
  - flowgroup: sap_brand_ingestion
    template_parameters:
      table_name: raw_sap_brand
      landing_folder: brand
  - flowgroup: sap_cat_ingestion
    template_parameters:
      table_name: raw_sap_cat
      landing_folder: category
  - flowgroup: sap_carrier_ingestion
    template_parameters:
      table_name: raw_sap_carrier
      landing_folder: carrier
Characteristics:
Shared fields at document level are inherited by all flowgroups
Individual flowgroups can override inherited values
More concise when many flowgroups share configuration
Best for similar flowgroups with common settings
Inheritance Rules
When using array syntax, flowgroups inherit document-level fields:
Inheritable Fields:
pipeline
use_template
presets
operational_metadata
job_name (for multi-job orchestration)
Inheritance Behavior:
Key Not Present → Inherits document-level value
Key Present (any value) → Uses flowgroup-level value (overrides)
Empty List [] → Explicit override (no inheritance)
Example: Inheritance with Override
pipeline: bronze_layer
presets:
  - bronze_preset
  - data_quality
operational_metadata: true
flowgroups:
  - flowgroup: table1_bronze
    # Inherits: pipeline, presets, operational_metadata
  - flowgroup: table2_bronze
    presets:
      - silver_preset  # Override presets
    # Inherits: pipeline, operational_metadata
  - flowgroup: table3_bronze
    presets: []  # Explicit: no presets
    # Inherits: pipeline, operational_metadata
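The key-presence rule above can be modelled in a few lines of Python. This is only an illustrative sketch, not LakehousePlumber's actual implementation: the function name `resolve_flowgroups` and the `INHERITABLE_FIELDS` constant are hypothetical, and the document is assumed to be already parsed into a plain dict.

```python
from copy import deepcopy

# Fields the documentation lists as inheritable (constant name is hypothetical).
INHERITABLE_FIELDS = (
    "pipeline", "use_template", "presets", "operational_metadata", "job_name",
)


def resolve_flowgroups(document: dict) -> list[dict]:
    """Expand an array-syntax document into standalone flowgroup dicts."""
    resolved = []
    for entry in document.get("flowgroups", []):
        flowgroup = deepcopy(entry)
        for field in INHERITABLE_FIELDS:
            # Inherit only when the key is absent. A present key always wins,
            # even if its value is empty ([]) or has a different type.
            if field not in flowgroup and field in document:
                flowgroup[field] = deepcopy(document[field])
        resolved.append(flowgroup)
    return resolved


document = {
    "pipeline": "bronze_layer",
    "presets": ["bronze_preset", "data_quality"],
    "operational_metadata": True,
    "flowgroups": [
        {"flowgroup": "table1_bronze"},
        {"flowgroup": "table2_bronze", "presets": ["silver_preset"]},
        {"flowgroup": "table3_bronze", "presets": []},
    ],
}

for fg in resolve_flowgroups(document):
    print(fg["flowgroup"], fg["presets"])
# table1_bronze ['bronze_preset', 'data_quality']
# table2_bronze ['silver_preset']
# table3_bronze []
```

Testing for key presence rather than truthiness is what lets an empty list act as an explicit "no presets" override instead of falling back to the document-level value.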
Type-Agnostic Override
An override does not have to match the type of the inherited value (e.g., a list can replace a boolean):
operational_metadata: true
flowgroups:
  - flowgroup: table1
    # Inherits: operational_metadata: true
  - flowgroup: table2
    operational_metadata: ["col1", "col2", "col3"]
    # Override with list
Job Name Inheritance
The job_name property (used for multi-job orchestration) is also inherited by all flowgroups in the array:
pipeline: raw_ingestions_sap
job_name:
  - SAP_SFCC  # Inherited by all flowgroups below
use_template: TMPL003_parquet_ingestion_template
flowgroups:
  - flowgroup: sap_supplier_ingestion
    # Inherits: pipeline, job_name, use_template
    template_parameters:
      table_name: raw_sap_supplier
      landing_folder: supplier
  - flowgroup: sap_product_ingestion
    # Inherits: pipeline, job_name, use_template
    template_parameters:
      table_name: raw_sap_product
      landing_folder: product
Result: All flowgroups are automatically assigned to the SAP_SFCC job.
Note: If any flowgroup defines job_name, every flowgroup in your project must define it (validated at runtime). See Databricks Asset Bundles Integration for multi-job orchestration details.
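The all-or-nothing rule for job_name can be pictured as a check over the resolved flowgroups. The sketch below is an assumption about how such a check might look; the function name and the error wording are hypothetical and are not taken from LakehousePlumber.

```python
def check_job_name_consistency(flowgroups: list[dict]) -> None:
    """Raise if only some flowgroups define job_name (all-or-nothing rule)."""
    missing = [fg["flowgroup"] for fg in flowgroups if "job_name" not in fg]
    # Valid states: nobody is missing job_name, or everybody is.
    if missing and len(missing) != len(flowgroups):
        raise ValueError(
            "job_name must be set on every flowgroup or on none; missing on: "
            + ", ".join(sorted(missing))
        )


flowgroups = [
    {"flowgroup": "sap_supplier_ingestion", "job_name": ["SAP_SFCC"]},
    {"flowgroup": "sap_product_ingestion"},  # job_name forgotten here
]

try:
    check_job_name_consistency(flowgroups)
except ValueError as exc:
    print(exc)
```

Setting job_name once at the document level of an array-syntax file is one way to avoid this class of error, since every flowgroup in the array then inherits it.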
Migration Guide
Converting Multiple Files to One
Before: 3 Separate Files
pipelines/bronze/brand.yaml:
pipeline: bronze_sap
flowgroup: bronze_sap_brand
use_template: standard_bronze
template_parameters:
  table_name: brand
pipelines/bronze/category.yaml:
pipeline: bronze_sap
flowgroup: bronze_sap_category
use_template: standard_bronze
template_parameters:
  table_name: category
pipelines/bronze/carrier.yaml:
pipeline: bronze_sap
flowgroup: bronze_sap_carrier
use_template: standard_bronze
template_parameters:
  table_name: carrier
After: 1 Combined File
pipelines/bronze/sap_master_data.yaml:
pipeline: bronze_sap
use_template: standard_bronze
flowgroups:
  - flowgroup: bronze_sap_brand
    template_parameters:
      table_name: brand
  - flowgroup: bronze_sap_category
    template_parameters:
      table_name: category
  - flowgroup: bronze_sap_carrier
    template_parameters:
      table_name: carrier
Result: 3 files → 1 file (67% reduction)
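The manual conversion above follows a mechanical pattern: fields that every file defines with the same value move to the document level, and everything else stays on the individual flowgroup. A minimal sketch of that pattern, assuming the files are already parsed into dicts (`combine_flowgroups` is a hypothetical helper, not an LHP command):

```python
# Fields the documentation lists as inheritable (constant name is hypothetical).
INHERITABLE_FIELDS = (
    "pipeline", "use_template", "presets", "operational_metadata", "job_name",
)


def combine_flowgroups(documents: list[dict]) -> dict:
    """Merge single-flowgroup documents into one array-syntax document."""
    if not documents:
        return {"flowgroups": []}
    first = documents[0]
    # Hoist a field only when every document defines it with the same value.
    shared = {
        field: first[field]
        for field in INHERITABLE_FIELDS
        if all(field in doc and doc[field] == first[field] for doc in documents)
    }
    combined = dict(shared)
    combined["flowgroups"] = [
        {key: value for key, value in doc.items() if key not in shared}
        for doc in documents
    ]
    return combined


files = [
    {"pipeline": "bronze_sap", "flowgroup": f"bronze_sap_{name}",
     "use_template": "standard_bronze",
     "template_parameters": {"table_name": name}}
    for name in ("brand", "category", "carrier")
]

combined = combine_flowgroups(files)
print(combined["pipeline"], combined["use_template"], len(combined["flowgroups"]))
# bronze_sap standard_bronze 3
```

A field that differs in even one file is left on every flowgroup, so the combined document resolves to the same per-flowgroup configuration as the separate files did.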
When to Use Multi-Flowgroup Files
Use Multi-Flowgroup Files When:
Multiple flowgroups share common configuration
Tables belong to same logical group (e.g., all SAP master data)
Flowgroups use same template with different parameters
Managing many similar, small flowgroups
Use Separate Files When:
Flowgroups have significantly different structures
Flowgroups belong to different logical domains
Independent ownership/maintenance is important
File would become too large (>500 lines as guideline)
Use Multi-Document Syntax When:
Flowgroups have different base configurations
You prefer traditional YAML multi-document approach
No shared fields to inherit
Use Array Syntax When:
Many flowgroups share same configuration
You want to minimize repetition
Flowgroups are variations on a theme
Real-World Example: ACME Supermarkets
The ACME Supermarkets project originally had 93 flowgroup files across raw, bronze, and silver layers. Multi-flowgroup files consolidated its SAP raw ingestion folder as follows:
Before:
pipelines/01_raw_ingestion/SAP/
├── brand_ingestion.yaml
├── category_ingestion.yaml
├── carrier_ingestion.yaml
├── product_ingestion.yaml
├── supplier_ingestion.yaml
├── warehouse_ingestion.yaml
├── uom_ingestion.yaml
├── user_ingestion.yaml
└── ... (15+ more files)
After:
pipelines/01_raw_ingestion/SAP/
├── master_data_ingestions.yaml # 8 flowgroups
├── transactional_data_ingestions.yaml # 6 flowgroups
└── inventory_ingestions.yaml # 4 flowgroups
Result: 27 files → 3 files (89% reduction)
Complete Example
The following example combines document-level inheritance, per-flowgroup overrides, and job assignment:
# Bronze Layer - NCR Transaction Data
pipeline: bronze_ncr
job_name:
  - NCR  # Multi-job orchestration (optional)
use_template: TMPL004_raw_to_bronze_standard
presets:
  - bronze_layer
  - data_quality_checks
operational_metadata: true
flowgroups:
  - flowgroup: pos_transaction_bronze
    template_parameters:
      raw_table_name: raw_ncr_pos_transaction
      bronze_table_name: bronze_ncr_pos_transaction
      generate_surrogate_key: true
      surrogate_key_name: transaction_key
  - flowgroup: payment_method_bronze
    template_parameters:
      raw_table_name: raw_ncr_payment_method
      bronze_table_name: bronze_ncr_payment_method
      generate_surrogate_key: true
      surrogate_key_name: payment_method_key
  - flowgroup: location_bronze
    template_parameters:
      raw_table_name: raw_ncr_location
      bronze_table_name: bronze_ncr_location
      generate_surrogate_key: true
      surrogate_key_name: location_key
    presets:
      - bronze_layer
      - data_quality_checks
      - scd_type2  # Additional preset for this flowgroup
  - flowgroup: customer_bronze
    template_parameters:
      raw_table_name: raw_ncr_customer
      bronze_table_name: bronze_ncr_customer
      generate_surrogate_key: true
      surrogate_key_name: customer_key
    operational_metadata: ["col1", "col2", "col3"]  # Override with specific columns
Validation and Error Handling
Duplicate Flowgroup Names
LakehousePlumber detects duplicate flowgroup names within the same file:
flowgroups:
  - flowgroup: my_flowgroup
  - flowgroup: my_flowgroup  # ERROR: Duplicate name
Error:
ValueError: Duplicate flowgroup name 'my_flowgroup' in file
pipelines/bronze/data.yaml
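A duplicate check of this kind amounts to counting names within one file. The sketch below is illustrative only: both function names are hypothetical, and the error text merely imitates the message shown above.

```python
from collections import Counter


def find_duplicate_flowgroups(flowgroups: list[dict]) -> list[str]:
    """Return flowgroup names that appear more than once, sorted by name."""
    counts = Counter(fg["flowgroup"] for fg in flowgroups)
    return sorted(name for name, count in counts.items() if count > 1)


def validate_unique(flowgroups: list[dict], source_path: str) -> None:
    """Raise ValueError for the first duplicated name found in one file."""
    duplicates = find_duplicate_flowgroups(flowgroups)
    if duplicates:
        raise ValueError(
            f"Duplicate flowgroup name '{duplicates[0]}' in file {source_path}"
        )


flowgroups = [{"flowgroup": "my_flowgroup"}, {"flowgroup": "my_flowgroup"}]
try:
    validate_unique(flowgroups, "pipelines/bronze/data.yaml")
except ValueError as exc:
    print(exc)
# Duplicate flowgroup name 'my_flowgroup' in file pipelines/bronze/data.yaml
```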
Mixed Syntax Not Allowed
You cannot mix multi-document and array syntax in the same file:
# This is NOT allowed
flowgroups:
  - flowgroup: fg1
---
flowgroup: fg2
Error:
ValueError: Mixed syntax detected in pipelines/bronze/data.yaml:
cannot use both multi-document (---) and flowgroups array syntax
in the same file
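One plausible way to detect mixed syntax is to look at how many YAML documents a file contains and whether any of them carries a flowgroups key. The sketch assumes the file has already been loaded into a list of dicts, one per YAML document (for example with PyYAML's `yaml.safe_load_all`); `classify_syntax` is a hypothetical name and the real detection logic may differ.

```python
def classify_syntax(documents: list[dict], source_path: str) -> str:
    """Classify a parsed file as 'single', 'multi-document', or 'array'."""
    has_array = any("flowgroups" in doc for doc in documents)
    if has_array and len(documents) > 1:
        # An array document alongside other documents is the mixed case.
        raise ValueError(
            f"Mixed syntax detected in {source_path}: cannot use both "
            "multi-document (---) and flowgroups array syntax in the same file"
        )
    if has_array:
        return "array"
    return "multi-document" if len(documents) > 1 else "single"


path = "pipelines/bronze/data.yaml"
print(classify_syntax([{"flowgroup": "fg1"}], path))                        # single
print(classify_syntax([{"flowgroup": "fg1"}, {"flowgroup": "fg2"}], path))  # multi-document
print(classify_syntax([{"flowgroups": [{"flowgroup": "fg1"}]}], path))      # array
```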
Backward Compatibility
All existing single-flowgroup files continue to work without changes.
LakehousePlumber automatically detects:
Single-flowgroup files (original behavior)
Multi-document files (new)
Array syntax files (new)
All commands (generate, validate, show, etc.) work identically regardless of syntax.
CLI Commands
All LakehousePlumber commands work seamlessly with multi-flowgroup files:
Generate
# Generate from multi-flowgroup file
lhp generate bronze_sap --env dev
# Discovers all flowgroups automatically
# Output: Generates one Python file per flowgroup
Validate
# Validate multi-flowgroup files
lhp validate bronze_sap --env dev
# Validates all flowgroups in all files
Show
# Show specific flowgroup (works regardless of file structure)
lhp show sap_brand_ingestion --env dev
# Finds flowgroup even in multi-flowgroup files
State Tracking
State tracking works identically with multi-flowgroup files:
Each flowgroup tracked individually by (pipeline, flowgroup) tuple
Source YAML file path tracked for each flowgroup
Multiple flowgroups from same file tracked with same source path
Smart regeneration based on individual flowgroup changes
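To make the tracking model concrete, the sketch below keys state by (pipeline, flowgroup) and records the source path next to a content hash. It is a simplified illustration under stated assumptions: hashing the flowgroup definition is a guess at how change detection could work, and `fingerprint` and `flowgroups_to_regenerate` are hypothetical names.

```python
import hashlib
import json


def fingerprint(flowgroup: dict) -> str:
    """Stable hash of one resolved flowgroup definition."""
    canonical = json.dumps(flowgroup, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def flowgroups_to_regenerate(
    state: dict, flowgroups: list[dict], source_path: str
) -> list[tuple]:
    """Return (pipeline, flowgroup) keys whose definition changed; update state."""
    changed = []
    for fg in flowgroups:
        key = (fg["pipeline"], fg["flowgroup"])
        digest = fingerprint(fg)
        previous = state.get(key)
        if previous is None or previous["hash"] != digest:
            changed.append(key)
        # Flowgroups from the same file share the same source path.
        state[key] = {"source": source_path, "hash": digest}
    return changed


state = {}
path = "pipelines/bronze/sap_master_data.yaml"
flowgroups = [
    {"pipeline": "bronze_sap", "flowgroup": "bronze_sap_brand",
     "template_parameters": {"table_name": "brand"}},
    {"pipeline": "bronze_sap", "flowgroup": "bronze_sap_category",
     "template_parameters": {"table_name": "category"}},
]

flowgroups_to_regenerate(state, flowgroups, path)  # first run: both are new
flowgroups[1]["template_parameters"]["table_name"] = "category_v2"
print(flowgroups_to_regenerate(state, flowgroups, path))
# [('bronze_sap', 'bronze_sap_category')]
```

Because the hash covers one flowgroup rather than the whole file, editing a single entry in a shared file marks only that flowgroup for regeneration.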
Best Practices
Group Logically: Combine flowgroups that belong together conceptually
Keep Manageable: Aim for 5-15 flowgroups per file
Use Clear Names: Name files to reflect their contents (e.g., sap_master_data.yaml)
Add Comments: Document the purpose of each section
Consistent Style: Choose one syntax per project for consistency
Test First: Validate with lhp validate after combining files
Troubleshooting
Flowgroup Not Found
If lhp show cannot find your flowgroup:
Check the flowgroup name is spelled correctly
Verify the file is in the pipelines/ directory
Check lhp.yaml include patterns (if used)
Ensure no YAML syntax errors
Generation Issues
If generation fails with multi-flowgroup files:
Validate syntax: lhp validate <pipeline> --env <env>
Check for duplicate flowgroup names
Verify template parameters are correct for each flowgroup
Test with --dry-run first
Performance
Multi-flowgroup files are parsed once per file, not once per flowgroup, making discovery more efficient for projects with many flowgroups.
Summary
Multi-flowgroup YAML support in LakehousePlumber:
✅ Reduces file proliferation
✅ Improves project organization
✅ Maintains backward compatibility
✅ Works with all LHP commands
✅ Two flexible syntaxes available
✅ Smart inheritance in array syntax
✅ Validated and tested
For questions or issues, refer to the main documentation or open an issue on GitHub.