Getting Started
===============

.. meta::
   :description: Install Lakehouse Plumber and create your first Databricks DLT pipeline from YAML in minutes. Step-by-step quickstart guide.

This short tutorial walks you through creating your first **Lakehouse Plumber**
project and generating a Lakeflow Pipeliness (DLT) pipeline based on the ACME demo
configuration that ships with the repository.

Prerequisites
-------------

* Python 3.11+ (3.12 recommended)
* Access to a Databricks workspace with DLT enabled (for actual deployment)
* Git installed (optional but recommended)

Installation
------------

.. code-block:: bash

   # Create and activate a virtual environment (optional)
   python -m venv .venv
   source .venv/bin/activate

   # Install Lakehouse Plumber CLI and its extras for docs & examples
   pip install lakehouse-plumber


Option 1: Clone the ACME Example Pipeline
-----------------------------------------
There is a companion repository that includes a fully working example (TPC-H retail dataset).
Copy its raw-ingestion flow into your new project:

.. code-block:: bash

   git clone https://github.com/Mmodarre/acme_edw.git
   cd acme_edw
   lhp validate --env dev
   lhp generate --env dev

Option 2: Create a new pipeline configuration
---------------------------------------------

Step 1: Project Initialisation
------------------------------

Use the **`lhp init`** command to scaffold a new repo-ready directory structure:

.. code-block:: bash

   lhp init <my_spd_project>
   cd <my_spd_project>

The command creates folders such as ``pipelines/``, ``templates/``,
``substitutions/`` and a starter ``lhp.yaml`` project file. It also includes
example template files with ``.tmpl`` extensions that you can use as starting points.

.. note::
   **VS Code IntelliSense**: If you use VS Code, IntelliSense with autocomplete, 
   validation, and documentation is automatically configured! Open any YAML file 
   to see real-time validation and smart suggestions.

Step 2: Edit the project configuration file
-------------------------------------------

Edit the ``lhp.yaml`` file to configure your project.


.. note::

   The ``lhp.yaml`` file is the main entry point for configuring your project.
   for full documentation on the project configuration file, see :doc:`concepts`.


Step 3: Create your environment configuration file
--------------------------------------------------

Create a ``substitutions/dev.yaml`` file to match your workspace catalog & storage paths.
You can either:

1. **Rename the example template**: ``mv substitutions/dev.yaml.tmpl substitutions/dev.yaml``
2. **Create a new file** following the same structure as the example template

Edit the file to configure tokens such as ``${catalog}`` or ``${secret:scope/key}`` 
that will be replaced during code generation.

.. tip::
   The ``.tmpl`` files created by ``lhp init`` contain working examples that you can
   use as starting points. Simply rename them or copy their content to create your
   working configuration files.

Step 4: Create your first pipeline configuration
------------------------------------------------

Create a new pipeline configuration in the ``pipelines/`` folder.

.. tip::
   **Understanding Pipeline Configuration Structure:**
   
   **Pipeline:** (line 1) specifies the pipeline name that contains this flowgroup. All YAML files sharing the same pipeline name will be organized together in the same directory during code generation.
   
   **Flowgroup:** (line 2) represents a logical grouping of related actions within the pipeline and serves as an organizational construct without impacting runtime behavior.

   **Actions:** (line 4) define the individual operations in the pipeline. They serve as the fundamental components that execute the data processing workflow:
   
      • **Loads** (lines 5-11) customer data from the Databricks samples catalog using Delta streaming
      • **Transforms** (lines 13-27) the raw data by renaming columns and standardizing field names  
      • **Writes** (lines 29-35) the processed data to a bronze layer streaming table
      • **Leverages substitutions** like ``${catalog}`` and ``${bronze_schema}`` for environment flexibility from ``dev.yaml`` file
      • **Implements medallion architecture** by writing to the bronze schema for downstream processing
      • **Enables streaming** with ``readMode: stream`` for incremental read from Delta Change Data Feed (CDF)

.. tip::
   **Multi-Flowgroup Files:**
   
   You can define multiple flowgroups in a single YAML file to reduce file proliferation.
   This is useful when you have many similar flowgroups (e.g., SAP master data tables).
   
   See :doc:`multi_flowgroup_guide` for detailed examples and syntax options.

.. code-block:: yaml
   :caption: pipelines/customer_sample.yaml
   :linenos:

   pipeline: tpch_sample_ingestion  # Grouping of generated python files in the same folder
   flowgroup: customer_ingestion   # Logical grouping for generated Python file

   actions:
      - name: customer_sample_load     # Unique action identifier
        type: load                     # Action type: Load
        readMode: stream              # Read using streaming CDF
        source:
           type: delta                # Source format: Delta Lake table
           database: "samples.tpch"   # Source database and schema in Unity Catalog
           table: customer_sample     # Source table name
        target: v_customer_sample_raw # Target view name (temporary in-memory)
        description: "Load customer sample table from Databricks samples catalog"

      - name: transform_customer_sample  # Unique action identifier
        type: transform                  # Action type: Transform
        transform_type: sql             # Transform using SQL query
        source: v_customer_sample_raw   # Input view from previous action
        target: v_customer_sample_cleaned  # Output view name
        sql: |
           SELECT
           c_custkey as customer_id,
           c_name as name,
           c_address as address,
           c_nationkey as nation_id,
           c_phone as phone,
           c_acctbal as account_balance,
           c_mktsegment as market_segment,
           c_comment as comment
           FROM stream(v_customer_sample_raw)
        description: "Transform customer sample table"

      - name: write_customer_sample_bronze  # Unique action identifier
        type: write                         # Action type: Write
        source: v_customer_sample_cleaned   # Input view from previous action
        write_target:
           type: streaming_table            # Output as streaming table
           database: "${catalog}.${bronze_schema}"  # Target database.schema with substitutions
           table: "tpch_sample_customer"    # Final table name
        description: "Write customer sample table to bronze schema"


Validate the Configuration
--------------------------

.. code-block:: shell

   # Check for schema errors, missing secrets, circular dependencies …
   lhp validate --env dev

If everything is green you will see **✅ All configurations are valid**.

Generate DLT Code
-----------------

.. code-block:: shell

   # Create Python files in ./generated/ (default output dir)
   lhp generate --env dev
   
   # Include data quality tests (optional - for development/testing)
   lhp generate --env dev --include-tests

Inspect the Output
------------------

Navigate to ``generated/tpch_sample_ingestion`` — each FlowGroup becomes a Python
file, formatted with `black <https://black.readthedocs.io>`_. These are standard
Lakeflow Declarative Pipeline scripts that you can run in
Databricks or commit to your repository. See :doc:`databricks_bundles` for Asset Bundle integration.

**This is the generated python file from the above YAML configuration:**

.. code-block:: python
   :caption: generated/tpch_sample_ingestion/customer_ingestion.py
   :linenos:

   # Generated by LakehousePlumber
   # Pipeline: tpch_sample_ingestion
   # FlowGroup: customer_ingestion

   from pyspark import pipelines as dp

   # Pipeline Configuration
   PIPELINE_ID = "tpch_sample_ingestion"
   FLOWGROUP_ID = "customer_ingestion"

   # ============================================================================
   # SOURCE VIEWS
   # ============================================================================

   @dp.temporary_view()
   def v_customer_sample_raw():
      """Load customer sample table from Databricks samples catalog"""
      df = spark.readStream \
         .table("samples.tpch.customer_sample")

      return df


   # ============================================================================
   # TRANSFORMATION VIEWS
   # ============================================================================

   @dp.temporary_view(comment="Transform customer sample table")
   def v_customer_sample_cleaned():
      """Transform customer sample table"""
      return spark.sql("""SELECT
   c_custkey as customer_id,
   c_name as name,
   c_address as address,
   c_nationkey as nation_id,
   c_phone as phone,
   c_acctbal as account_balance,
   c_mktsegment as market_segment,
   c_comment as comment
   FROM stream(v_customer_sample_raw)""")


   # ============================================================================
   # TARGET TABLES
   # ============================================================================

   # Create the streaming table
   dp.create_streaming_table(
      name="acmi_edw_dev.edw_bronze.tpch_sample_customer",
      comment="Streaming table: tpch_sample_customer",
      table_properties={"delta.autoOptimize.optimizeWrite": "true", "delta.enableChangeDataFeed": "true"})


   # Define append flow(s)
   @dp.append_flow(
      target="acmi_edw_dev.edw_bronze.tpch_sample_customer",
      name="f_customer_sample_bronze",
      comment="Write customer sample table to bronze schema"
   )
   def f_customer_sample_bronze():
      """Write customer sample table to bronze schema"""
      # Streaming flow
      df = spark.readStream.table("v_customer_sample_cleaned")

      return df


Deploy on Databricks
--------------------
**Option 1: Manually create a Lakeflow Declarative Pipeline(ETL)**

1. Create a Lakeflow Declarative Pipeline(ETL) in the Databricks UI.

2. Point the *Notebook/Directory* field to your ``generated/`` folder in the
   workspace (or sync the files via Repos).

**OR** (create new python files and paste the generated code into them.)

3. Configure clusters & permissions, then click **Validate**.

**Option 2: Use Asset Bundles**

:doc:`databricks_bundles`


Working with Example Templates
------------------------------

When you run ``lhp init``, several example template files are created to help you get started:

**Configuration Examples:**
   - ``substitutions/dev.yaml.tmpl`` - Example environment configuration with common substitution variables
   - ``substitutions/prod.yaml.tmpl`` - Production environment example
   - ``substitutions/tst.yaml.tmpl`` - Test environment example

**Pipeline Examples:**
   - ``pipelines/01_raw_ingestion/`` - Complete ingestion pipeline examples for various data formats
   - ``pipelines/02_bronze/`` - Bronze layer transformation examples
   - ``pipelines/03_silver/`` - Silver layer examples with data quality

**Preset Examples:**
   - ``presets/bronze_layer.yaml.tmpl`` - Reusable bronze layer configuration template

**Template Examples:**
   - ``templates/standard_ingestion.yaml.tmpl`` - Standard ingestion pattern template

To use these examples:

1. **Copy and rename** template files: ``cp substitutions/dev.yaml.tmpl substitutions/dev.yaml``
2. **Edit the copied files** to match your environment and requirements
3. **Use them as references** when creating your own configurations
4. **Explore the comprehensive examples** in the ``pipelines/`` directory for different data ingestion patterns

.. note::
   The ``.tmpl`` files are static examples containing LHP template syntax. They are not 
   Jinja2 templates for the init command, but rather complete working examples that you 
   can use as starting points for your own configurations.

Next Steps
----------

* Explore **Presets** and **Templates** to reduce duplication.
* Add **data-quality expectations** to your transforms.
* Add **operational metadata** to your actions.
* Add **Schema Hints** to your Load actions.
* Enable **Change-Data-Feed (CDC)** in bronze ingestions.
* Continue reading the :doc:`concepts` section for deeper architectural details.