Concepts & Architecture
=======================

.. meta::
   :description: Core concepts: Pipelines, FlowGroups, Actions, substitutions, presets, and templates. Understand the LHP architecture.

At its core, Lakehouse Plumber converts **declarative YAML** into regular
Databricks Lakeflow Declarative Pipelines (ETL) Python code. The YAML files
are intentionally simple, and the heavy lifting happens inside the Plumber
engine at generation time. This page explains the key building blocks you
will interact with.

FlowGroups
----------

A **FlowGroup** represents a logical slice of your pipeline, often a single
source table or business entity. YAML files can contain one or multiple
FlowGroups (see :doc:`multi_flowgroup_guide` for details on multi-flowgroup
files).

Required keys in a FlowGroup YAML file:

.. code-block:: yaml

   pipeline: bronze_raw                  # pipeline name (logical)
   flowgroup: customer_bronze_ingestion  # unique name for the flowgroup (logical)
   actions:                              # list of steps in the flowgroup

Optional keys in a FlowGroup YAML file:

.. code-block:: yaml

   job_name:
     - NCR          # Optional: assign flowgroup to a specific orchestration job
   variables:       # Optional: define local variables for this flowgroup
     entity: customer
     table: customer_raw

The ``job_name`` property enables **multi-job orchestration**, allowing you to
split your flowgroups into separate Databricks jobs rather than a single
monolithic orchestration job. This is useful for:

* **Separate scheduling** - Different jobs can run on different schedules (e.g., hourly POS data, daily ERP data)
* **Isolated execution** - Jobs run independently with separate concurrency and resource settings
* **Modular organization** - Group related flowgroups by source system, business domain, or data criticality
* **Flexible configuration** - Each job can have its own tags, notifications, timeouts, and performance targets

.. important::

   **All-or-Nothing Rule**: If ``job_name`` is defined for **any** flowgroup in
   your project, it **must** be defined for **all** flowgroups. This ensures
   consistent orchestration behavior and prevents configuration errors.

**Example with multi-job orchestration:**

.. code-block:: yaml
   :caption: pipelines/ncr/pos_transactions.yaml

   pipeline: bronze_ncr
   flowgroup: pos_transaction_bronze
   job_name:
     - NCR          # Assigns this flowgroup to the "NCR" orchestration job

   actions:
     - name: load_pos_data
       type: load
       source:
         type: cloudfiles
         path: "/mnt/landing/ncr/pos/*.parquet"
       target: v_pos_raw

When ``job_name`` is used:

* Each unique ``job_name`` generates a separate Databricks job file (e.g., ``NCR.job.yml``, ``SAP_SFCC.job.yml``)
* A **master orchestration job** is generated that coordinates execution across all jobs
* Dependencies between jobs are automatically detected and handled in the master job
* Per-job configuration is managed through multi-document ``job_config.yaml`` files, sketched below
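As a rough illustration, a multi-document ``job_config.yaml`` holds one YAML
document per job. The field names below are hypothetical placeholders, not the
authoritative schema; see :doc:`databricks_bundles` for the real configuration
options.

.. code-block:: yaml
   :caption: job_config.yaml (illustrative sketch only)

   # One YAML document per job; the keys shown here are illustrative,
   # not the actual schema (see the Databricks Bundles page).
   job_name: NCR
   schedule: "0 0 * * * ?"   # e.g. hourly POS data
   tags:
     source_system: ncr
   ---
   job_name: SAP_SFCC
   schedule: "0 0 6 * * ?"   # e.g. daily ERP data
   tags:
     source_system: sap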
.. seealso::

   For complete details on multi-job orchestration, job configuration, and the
   master orchestration job, see :doc:`databricks_bundles`.

.. note::

   **FlowGroup vs Pipeline:**

   - A **FlowGroup** represents a logical slice of your pipeline, often a single source table or business entity.
   - A **Pipeline** is a logical grouping of FlowGroups. It is used to group the generated Python files in the same folder.
   - Lakeflow Declarative Pipelines are **declarative** (as the name suggests), hence the order of the actions is determined at runtime by the Lakeflow engine based on the dependencies between the tables/views.
   - **YAML files** can contain one flowgroup (traditional) or multiple flowgroups (see :doc:`multi_flowgroup_guide`).

Actions
-------

Every FlowGroup lists one or more **Actions**. Actions come in three top-level
types:

+----------------+----------------------------------------------------------+
| Type           | Purpose                                                  |
+================+==========================================================+
| **Load**       | Bring data into a temporary **view** (e.g. CloudFiles,  |
|                | Delta, JDBC, SQL, Python, custom_datasource).            |
+----------------+----------------------------------------------------------+
| **Transform**  | Manipulate data in one or more steps (SQL, Python,       |
|                | schema adjustments, data-quality checks, temp tables…).  |
+----------------+----------------------------------------------------------+
| **Write**      | Persist the final dataset to a *streaming_table*,        |
|                | *materialized_view*, or external *sink* (Kafka,          |
|                | Delta, custom API).                                      |
+----------------+----------------------------------------------------------+

.. note::

   You may chain **zero or many Transform actions** between a Load and a Write.

.. important::

   The order of the actions is determined at runtime by the Lakeflow engine
   based on the dependencies between the tables/views, **not** by the order in
   the YAML file or the generated Python file.

For a complete catalogue of Action sub-types and their options see
:doc:`actions/index`.

Presets
-------

A **Preset** is a YAML file that provides default configuration snippets you
can reuse across FlowGroups. Presets inject default values that are merged
with the explicit configuration in templates and flowgroups.

Common use cases:

* Standardised table properties for all Bronze streaming tables
* CloudFiles ingestion options (error handling, schema evolution)
* Spark configuration tuning

Example preset file:

.. code-block:: yaml
   :caption: presets/cloudfiles_defaults.yaml

   name: cloudfiles_defaults
   version: "1.0"
   description: "Standard CloudFiles options"

   defaults:
     load_actions:
       cloudfiles:
         options:
           cloudFiles.rescuedDataColumn: "_rescued_data"
           ignoreCorruptFiles: "true"
           ignoreMissingFiles: "true"
           cloudFiles.maxFilesPerTrigger: 200

Usage in a FlowGroup:

.. code-block:: yaml

   presets:
     - cloudfiles_defaults

   actions:
     - name: load_data
       type: load
       source:
         type: cloudfiles
         options:
           cloudFiles.format: csv   # Merged with preset options
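To make the merge concrete, the effective options for ``load_data`` combine the
preset defaults with the keys written in the flowgroup. The result sketched
below assumes explicit flowgroup values take precedence over preset defaults;
the exact precedence rules are covered in :doc:`presets_reference`.

.. code-block:: yaml

   # Effective options after merging (illustrative):
   options:
     cloudFiles.format: csv                        # from the flowgroup
     cloudFiles.rescuedDataColumn: "_rescued_data" # from the preset
     ignoreCorruptFiles: "true"                    # from the preset
     ignoreMissingFiles: "true"                    # from the preset
     cloudFiles.maxFilesPerTrigger: 200            # from the preset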
For complete preset documentation see :doc:`presets_reference`.

Templates
---------

While presets inject reusable **values**, **Templates** inject reusable
**action patterns**; think of them as parametrised macros. In a template file
you define parameters and a list of actions that reference those parameters.
Inside a FlowGroup you apply the template and provide the actual arguments.

**Example of a template file:**

.. code-block:: yaml
   :caption: templates/csv_ingestion_template.yaml
   :linenos:

   # This template ingests CSV files with schema enforcement.
   # A flowgroup only needs to supply the table name and landing folder;
   # the template generates the load and write actions for the pipeline.
   name: csv_ingestion_template
   version: "1.0"
   description: "Standard template for ingesting CSV files with schema enforcement"

   presets:
     - bronze_layer

   parameters:
     - name: table_name
       required: true
       description: "Name of the table to ingest"
     - name: landing_folder
       required: true
       description: "Name of the landing folder"

   actions:
     - name: load_{{ table_name }}_csv
       type: load
       readMode: "stream"
       operational_metadata: ["_source_file_path", "_source_file_size", "_source_file_modification_time", "_record_hash"]
       source:
         type: cloudfiles
         path: "${landing_volume}/{{ landing_folder }}/*.csv"
         format: csv
         options:
           cloudFiles.format: csv
           header: True
           delimiter: "|"
           cloudFiles.maxFilesPerTrigger: 11
           cloudFiles.inferColumnTypes: False
           cloudFiles.schemaEvolutionMode: "addNewColumns"
           cloudFiles.rescuedDataColumn: "_rescued_data"
           cloudFiles.schemaHints: "schemas/{{ table_name }}_schema.yaml"
       target: v_{{ table_name }}_cloudfiles
       description: "Load {{ table_name }} CSV files from landing volume"

     - name: write_{{ table_name }}_cloudfiles
       type: write
       source: v_{{ table_name }}_cloudfiles
       write_target:
         type: streaming_table
         database: "${catalog}.${raw_schema}"
         table: "{{ table_name }}"
       description: "Write {{ table_name }} to raw layer"

**Example of a flowgroup using the template:**

.. code-block:: yaml
   :caption: pipelines/01_raw_ingestion/csv_ingestions/customer_ingestion.yaml
   :linenos:
   :emphasize-lines: 11-14

   # This pipeline ingests the customer table from the CSV files into the raw schema.
   # The pipeline property puts the generated files in the same folder for the pipeline to pick up.
   pipeline: raw_ingestions

   # Flowgroups are conceptual artifacts with no functional purpose;
   # they are only used to group actions together in the generated files.
   flowgroup: customer_ingestion

   # Apply the template and pass in the table name and landing folder;
   # the template expands into the load and write actions for this pipeline.
   use_template: csv_ingestion_template
   template_parameters:
     table_name: customer
     landing_folder: customer

Configuration Management
------------------------

LakehousePlumber provides two configuration files to customize how your
pipelines and orchestration jobs are deployed to Databricks:

- **Pipeline Configuration** (``pipeline_config.yaml``) - Controls SDP pipeline settings like compute, runtime, notifications
- **Job Configuration** (``job_config.yaml``) - Controls orchestration job settings like concurrency, schedules, permissions

.. seealso::

   For complete configuration options, examples, and best practices, see the
   Configuration Management section in :doc:`databricks_bundles`.

Substitutions & Secrets
-----------------------

LakehousePlumber supports environment-aware tokens, local variables, secret
references, and file substitutions that make your pipeline definitions
portable across environments.
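For instance, the ``${catalog}``, ``${raw_schema}``, and ``${landing_volume}``
tokens used in the template above could be resolved from a per-environment
file. The file name and layout below are an illustrative sketch only; the
exact syntax, including secret references, is documented in :doc:`substitutions`.

.. code-block:: yaml
   :caption: substitutions/dev.yaml (illustrative sketch)

   # Hypothetical token values for a dev environment; see the
   # substitutions page for the real file format and secret syntax.
   catalog: dev_catalog
   raw_schema: raw
   landing_volume: /Volumes/dev_catalog/landing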
.. seealso::

   For the full reference on all substitution syntaxes, processing order,
   secret management, and file substitution support, see :doc:`substitutions`.

Operational Metadata
--------------------

Operational metadata columns provide lineage, data provenance, and processing
context. They are defined at the project level in ``lhp.yaml`` and can be
selectively enabled at the preset, flowgroup, or action level with additive
behavior.

.. seealso::

   For the full reference on column definitions, target type compatibility,
   usage patterns, version requirements, and event log configuration, see
   :doc:`operational_metadata`.

State Management & Smart Generation
-----------------------------------

Lakehouse Plumber keeps a small **state file**, ``.lhp_state.json``, that maps
generated Python files to their source YAML. It records checksums and
dependency links so that future ``lhp generate`` runs can:

* re-process only *new* or *stale* FlowGroups.
* skip files whose inputs did not change.
* optionally clean up orphaned files when you delete YAML.

This behaviour is similar to Gradle's incremental builds or Terraform's state
management.

**How state management works:**

.. code-block:: json
   :caption: .lhp_state.json example
   :linenos:

   {
     "version": "1.0",
     "generated_files": {
       "customer_ingestion.py": {
         "source_yaml": "pipelines/bronze/customer_ingestion.yaml",
         "checksum": "a1b2c3d4e5f6",
         "environment": "dev",
         "dependencies": ["presets/bronze_layer.yaml"]
       }
     }
   }

**Benefits:**

- **Faster regeneration** - Only changed files are processed
- **Dependency tracking** - Upstream changes trigger downstream regeneration
- **Cleanup support** - Detect and remove orphaned generated files
- **CI/CD optimization** - Skip unchanged pipeline generation in builds

Dependency Resolver
-------------------

Transforms may reference earlier views or tables via the ``source`` field. LHP
builds a directed acyclic graph (DAG) of these references, detects cycles, and
ensures downstream FlowGroups regenerate when upstream definitions change.
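For example, each ``source`` reference below contributes an edge to that DAG:
``load_orders → clean_orders → write_orders``. The action and view names are
hypothetical, and the transform's sub-type and options are elided; see
:doc:`actions/index` for the full schema.

.. code-block:: yaml

   actions:
     - name: load_orders
       type: load
       source:
         type: cloudfiles
         path: "/mnt/landing/orders/*.parquet"
       target: v_orders_raw

     - name: clean_orders
       type: transform
       # transform sub-type and options elided for brevity
       source: v_orders_raw     # edge: load_orders -> clean_orders
       target: v_orders_clean

     - name: write_orders
       type: write
       source: v_orders_clean   # edge: clean_orders -> write_orders
       write_target:
         type: streaming_table
         database: "${catalog}.${raw_schema}"
         table: orders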
.. seealso::

   :doc:`dependency_analysis` for the full 5-step resolution process,
   dependency chain examples, and the ``lhp deps`` command.

Pipeline Generation Workflow
----------------------------

The complete pipeline generation process follows this workflow:

.. mermaid::

   graph TD
       subgraph "Discovery Phase"
           A[Scan YAML Files] --> B[Apply Include Patterns]
           B --> C[Parse FlowGroups]
       end

       subgraph "Resolution Phase"
           C --> D[Apply Presets]
           D --> E[Expand Templates]
           E --> F[Apply Substitutions]
           F --> G[Validate Configuration]
       end

       subgraph "Generation Phase"
           G --> H[Resolve Dependencies]
           H --> I[Check State]
           I --> J{Changed?}
           J -->|Yes| K[Generate Code]
           J -->|No| L[Skip Generation]
           K --> M[Update State]
           L --> M
       end

       subgraph "Output"
           M --> N[Python DLT Files]
       end

**Key optimization points:**

- **Smart discovery** - Include patterns reduce the set of files to process
- **Incremental generation** - State tracking skips unchanged files
- **Dependency awareness** - Changes propagate to affected downstream files
- **Early validation** - Errors are caught before code generation
- **Parallel processing** - Independent FlowGroups can be processed simultaneously

Troubleshooting
---------------

Common issues include state management problems (a stale ``.lhp_state.json``),
dependency resolution failures, and performance with large projects.

.. seealso::

   :doc:`errors_reference` for error codes, resolution steps, and general
   troubleshooting tips (state debugging, dependency debugging, performance
   optimization).

What's Next?
------------

Now that you understand the core building blocks of Lakehouse Plumber, explore
these topics:

* :doc:`substitutions` - Environment tokens, local variables, secrets, and file substitution support.
* :doc:`operational_metadata` - Audit columns, version requirements, and event log configuration.
* :doc:`templates_reference` - Reuse common patterns across your pipelines.
* :doc:`databricks_bundles` - Deploy and manage your pipelines as code.
* :doc:`dependency_analysis` - Pipeline dependency analysis and orchestration job generation.

For hands-on examples and complete workflows, check out :doc:`getting_started`.