CI/CD Reference¶
This comprehensive guide covers enterprise CI/CD patterns for deploying Lakehouse Plumber pipelines with Databricks Asset Bundles across development, testing, and production environments. It includes modern DataOps workflows and practical examples for GitHub Actions, Azure DevOps, and Bitbucket.
Prerequisites¶
Enterprise deployment of LHP requires:

- Source control (Git)
- CI/CD platform (GitHub Actions, Azure DevOps, or Bitbucket Pipelines)
- Databricks Asset Bundles (DABs)
CI/CD Overview¶
Lakehouse Plumber supports enterprise-grade CI/CD workflows that follow DataOps best practices for data pipeline deployment. The framework enables multiple deployment strategies while maintaining version consistency, audit trails, and robust state management.
Core CI/CD Principles:
| Principle | Implementation |
|---|---|
| Single Source of Truth | YAML configurations are the authoritative source; Python files are ephemeral build artifacts |
| Version Consistency | The same commit SHA deployed across all environments ensures identical business logic |
| Environment Isolation | Different substitution files (dev.yaml, test.yaml, prod.yaml) provide environment-specific configurations |
| Approval Gates | Automated dev/test deployment with manual production approval requirements |
| Rollback Capability | Complete rollback to any previous version |
Important
Generated Python files should never be committed to source control. They are treated as build artifacts and regenerated deterministically from YAML configurations.
This is to:

- Prevent manual changes to the Python files
- Ensure the Python files are always in sync with the YAML configurations
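A minimal .gitignore sketch covering these build artifacts and local generation state (paths match the repository structure below):
# LHP build artifacts - regenerated deterministically in CI/CD
generated/
resources/lhp/

# Local generation state (CI runs from a clean environment)
.lhp_state.json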
Repository Structure¶
Organize your repository structure to support clean CI/CD workflows and team collaboration.
lakehouse-project/
├── .github/workflows/ # CI/CD pipeline definitions
│ ├── ci-validation.yml # PR validation workflow
│ ├── dev-deployment.yml # Automatic dev deployment
│ ├── test-promotion.yml # Test environment promotion
│ ├── prod-deployment.yml # Production deployment
│ └── monitoring.yml # Health and deployment monitoring
├── .gitignore # Exclude generated files and state
├── databricks.yml # Databricks Asset Bundle configuration
├── lhp.yaml # LHP project configuration (with version pinning)
├── pipelines/ # Source pipeline definitions
│ ├── 01_raw_ingestion/
│ ├── 02_bronze/
│ ├── 03_silver/
│ └── 04_gold/
├── substitutions/ # Environment-specific configurations
│ ├── dev.yaml
│ ├── test.yaml
│ └── prod.yaml
├── presets/ # Reusable configuration patterns
├── templates/ # Reusable action patterns
├── expectations/ # Data quality definitions
├── schemas/ # Schema definitions
├── generated/ # Generated Python code (gitignored)
│ ├── dev/ # Development environment code
│ ├── test/ # Test environment code
│ └── prod/ # Production environment code
├── resources/ # Generated resource YAMLs (gitignored)
│ └── lhp/
│ ├── dev/ # Development environment resources
│ ├── test/ # Test environment resources
│ └── prod/ # Production environment resources
├── scripts/ # Deployment and monitoring scripts
│ ├── integration-tests.sh
│ ├── health-check.py
│ └── deployment-notify.sh
└── docs/ # Project documentation
See also
For more information on the repository structure, see Concepts & Architecture.
Version Management¶
Lakehouse Plumber supports semantic version (semver) pinning in lhp.yaml for reproducible builds across environments.
Version Pinning in lhp.yaml:
name: acme_edw
version: "1.0"
description: "acme Delta Lakehouse Project - TPC-H"
author: "Joe Bloggs"
created_date: "2025-07-11"
required_lhp_version: ">=0.5.0,<0.6.0"

include:
Benefits of Version Pinning:
Reproducible Builds: Same LHP version across all environments
Controlled Upgrades: Test new versions in dev before production
Dependency Management: Lock to compatible versions with your pipelines
CI/CD Stability: Prevent unexpected changes from automatic updates
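To apply the pin in CI, install a release inside the required range; a minimal sketch:
# Matches required_lhp_version in lhp.yaml
pip install "lakehouse-plumber>=0.5.0,<0.6.0"
lhp --version  # record the resolved version in the build log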
Environment-Specific Generation:
Starting with LHP 0.5.0, generated code and resource YAMLs are organized by environment:
generated/
├── dev/
│ └── pipeline_code.py
├── test/
│ └── pipeline_code.py
└── prod/
└── pipeline_code.py
resources/
└── lhp/
├── dev/
│ └── pipeline.yml
├── test/
│ └── pipeline.yml
└── prod/
└── pipeline.yml
This structure provides:
Clear Separation: No accidental cross-environment deployments
Environment-Specific Configuration: For instance, different cluster configurations in the DABs pipeline.yml for each environment (see the databricks.yml sketch below)
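A minimal databricks.yml sketch with one target per environment (hostnames are placeholders, and the include pattern for the generated resources depends on your project layout):
bundle:
  name: acme_edw

include:
  - resources/lhp/*/*.yml   # LHP-generated resource YAMLs

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://dev-workspace.cloud.databricks.com
  test:
    workspace:
      host: https://test-workspace.cloud.databricks.com
  prod:
    mode: production
    workspace:
      host: https://prod-workspace.cloud.databricks.com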
Deployment Strategies¶
Lakehouse Plumber supports multiple CI/CD deployment strategies to fit different organizational needs and maturity levels.
Trunk-Based Development and Tag-Based Promotion (Recommended)¶
Strategy Overview:
Trunk-based development is a version control strategy where all team members commit changes directly to a single main branch (the “trunk”) rather than working on long-lived feature branches. This approach aligns perfectly with DataOps principles by promoting frequent integration and continuous collaboration.
Key Principles:
Single Source of Truth: All development occurs on the main branch, ensuring the codebase represents the current state of data pipelines and transformations. The trunk must remain deployment-ready at all times, meaning every commit should be production-quality.
Small, Frequent Commits: Developers make small, incremental changes multiple times per day rather than large, monolithic updates. This reduces merge conflicts and makes code reviews more manageable, particularly important for complex data transformation logic.
Automated Testing Integration: Comprehensive automated testing runs on every commit, including data quality checks, pipeline validation, and integration tests. This ensures that changes don’t break existing data flows or introduce quality issues.
Feature Flags for Data: Teams use feature flags to control the visibility of new data transformations or pipeline changes. This allows deploying code to production while keeping features inactive until fully tested, enabling safe experimentation with data models.
Tag-Based Promotion Workflows
Tag-based promotion uses Git tags to control when and how data pipeline changes move through the different environments (development, test, production). This approach provides better control over deployments compared to automatic branch-based triggers.
Promotion Strategy
Environment-Specific Tags: Create tags with specific naming conventions for each environment:
Development: no tag required; every push to main deploys automatically
Testing: v*-test tags (e.g., v1.2.3-test) for pre-production validation
Production: v*-prod tags (e.g., v1.2.3-prod) for approval-gated production releases
| Environment | Trigger Mechanism |
|---|---|
| Development | Automatic deployment on main branch push |
| Testing | Developer-created tags (v1.2.3-test) |
| Production | Approval-gated tags (v1.2.3-prod) |
Principles:
Commit Once, Deploy Many: Generate artifacts (Python code) once per commit and promote the same artifacts through environments using tags. This ensures consistency and traceability across the deployment pipeline.
Immutable Deployments: Each tag represents an immutable snapshot of the data pipeline configuration. Tags cannot be moved or modified, providing a clear audit trail of what was deployed when.
Tag-Based Promotion Workflow:
flowchart TD
A[Developer commits to feature branch] --> B[Create Pull Request]
B --> C[PR validation & review]
C --> D[Merge to main]
D --> E["🚀 Auto deploy to DEV<br/>Commit: abc123"]
E --> F[Developer testing in DEV]
F --> G{Ready for TEST?}
G -->|Yes| H["🏷️ Create tag: v1.2.3-test<br/>Points to commit: abc123"]
G -->|No| I[Continue development]
I --> A
H --> J["🔄 Auto deploy to TEST<br/>Same commit: abc123"]
J --> K[Comprehensive testing]
K --> L{Ready for PROD?}
L -->|Yes| M["🏷️ Create tag: v1.2.3-prod<br/>Points to commit: abc123"]
L -->|No| N[Fix issues]
N --> A
M --> O["⚠️ Approval required"]
O --> P["✅ Deploy to PROD<br/>Same commit: abc123"]
style E fill:#e8f5e8
style J fill:#fff3e0
style P fill:#ffebee
style O fill:#f3e5f5
Tag-based promotion notes:
Automatic Dev Deployment: Every main branch push triggers dev environment deployment
Self-Service Test Deployment: Developers create test tags to promote to test environment
Gated Production Deployment: Production tags require approval before deployment
Version Consistency: Same commit SHA promoted through all environments
Audit Trail: Complete deployment history through Git tags and CI/CD logs
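For example, promoting the commit currently running in dev (abc123 in the flowchart) is just two tag pushes:
# Promote to TEST: tag the exact commit that was validated in dev
git tag v1.2.3-test abc123
git push origin v1.2.3-test    # → triggers the test deployment workflow

# After testing passes, promote the SAME commit to PROD
git tag v1.2.3-prod abc123
git push origin v1.2.3-prod    # → triggers the approval-gated prod workflow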
Branch-Based Promotion¶
Branch-based promotion uses separate branches for environment targeting.
When Branch-Based Promotion Might Be Appropriate:
Large, Distributed Data Teams: Organizations with multiple data engineering teams working on independent data domains might benefit from GitFlow approaches. Each team can maintain their own feature branches while coordinating releases through structured merge processes.
Regulated Industries: Financial services, healthcare, or other highly regulated industries may require the formal approval processes and audit trails that branch-based promotion provides. The structured release workflow can satisfy compliance requirements.
Complex Release Coordination: Organizations deploying large data platform updates quarterly or annually might prefer the predictable release cycles that GitFlow supports. This allows coordinating multiple team contributions into scheduled releases.
Branch Strategy:
on:
  push:
    branches:
      - main          # Triggers dev deployment
      - release/test  # Triggers test deployment
      - release/prod  # Triggers prod deployment
Promotion Process:
# Develop on feature branches
git checkout -b feature/customer-pipeline
git commit -m "Add customer segmentation pipeline"
git push origin feature/customer-pipeline
# Merge to main triggers dev deployment
git checkout main
git merge feature/customer-pipeline
git push origin main # → Dev deployment
# Promote to test environment
git checkout release/test
git merge main
git push origin release/test # → Test deployment
# Promote to production (with approval)
git checkout release/prod
git merge release/test
git push origin release/prod # → Prod deployment (after approval)
Anatomy of branch-based promotion:
Branch Protection: Each environment branch has protection rules and required reviewers
Linear Progression: Changes flow through main → release/test → release/prod
Approval Gates: Production branch requires pull request approval before merge
Environment Isolation: Each branch represents a deployment environment
Rollback Strategy: Revert commits on environment branches for rollbacks (see the sketch below)
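For example, rolling back the test environment is a revert on its branch (<merge-sha> is the promotion merge commit to undo):
git checkout release/test
git revert -m 1 <merge-sha>    # Create a commit that undoes the promotion merge
git push origin release/test   # → re-deploys the previous state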
Continuous Deployment (Not Recommended)¶
Continuous deployment automatically promotes changes through all environments based on automated quality gates. From a DataOps perspective, this is not recommended: testing data pipelines usually requires far more involvement from test teams and the business for integration and user acceptance testing.
Deployment Strategy Summary¶
Regardless of which CI/CD strategy you choose, the key is that YAML configurations are the single source of truth for data pipelines and generated Python code should be treated as build artifacts.
Environment Management¶
For environment management, Lakehouse Plumber uses substitution files and Databricks Asset Bundle targets to maintain consistent pipeline logic while adapting to environment-specific configurations.
Environment Architecture:
graph TB
subgraph "Source Control"
A[YAML Pipelines<br/>Single Source of Truth]
B[substitutions/dev.yaml]
C[substitutions/test.yaml]
D[substitutions/prod.yaml]
end
subgraph "Generation Process"
E[lhp generate -e dev]
F[lhp generate -e test]
G[lhp generate -e prod]
end
subgraph "Environments"
H[DEV Environment<br/>dev_catalog.bronze<br/>Fast iteration]
I[TEST Environment<br/>test_catalog.bronze<br/>Quality validation]
J[PROD Environment<br/>prod_catalog.bronze<br/>Business operations]
end
A --> E
A --> F
A --> G
B --> E
C --> F
D --> G
E --> H
F --> I
G --> J
style A fill:#e1f5fe
style H fill:#e8f5e8
style I fill:#fff3e0
style J fill:#ffebee
See also
For more information on substitution files, see Substitutions & Secrets.
See also
For more information on Databricks Asset Bundles, see Databricks Asset Bundles Integration.
Environment-Specific Configuration Files¶
In addition to substitution files, LHP supports environment-specific pipeline and job configuration files for fine-grained control over compute resources, notifications, and scheduling per environment.
Recommended file structure:
config/
├── pipeline_config-dev.yaml # Dev: smaller clusters, no notifications
├── pipeline_config-prod.yaml # Prod: larger clusters, full alerting
├── job_config-dev.yaml # Dev: relaxed timeouts
└── job_config-prod.yaml # Prod: strict SLAs, schedules
Common environment-specific differences:
| Setting | Development | Production |
|---|---|---|
| Cluster size | Smaller nodes (cost efficiency) | Larger nodes (performance) |
| Concurrency | 1-2 concurrent runs | 3+ concurrent runs |
| Notifications | Minimal or none | Full alerting to ops teams |
| Timeouts | Relaxed (for debugging) | Strict (SLA enforcement) |
Usage in CI/CD:
# Development deployment
lhp generate -e dev -pc config/pipeline_config-dev.yaml
lhp deps -jc config/job_config-dev.yaml --bundle-output
# Production deployment
lhp generate -e prod -pc config/pipeline_config-prod.yaml
lhp deps -jc config/job_config-prod.yaml --bundle-output
See also
For complete configuration options and examples, see the Configuration Management section in Databricks Asset Bundles Integration.
Deployment Overview using Databricks Asset Bundles¶
The following CI/CD workflow ensures consistency without storing generated artifacts in source control.
State Management Flow:
flowchart TB
subgraph "Local Development"
A[YAML Changes] --> B[lhp generate --env dev]
B --> C[.lhp_state.json<br/>Updated]
end
subgraph "CI/CD Pipeline"
D[Clean Environment<br/>No state file] --> E[lhp generate --env prod]
E --> F[Complete Regeneration<br/>Deterministic]
F --> G[databricks bundle deploy --target prod]
G --> H[Record Deployment<br/>Success/Failure]
end
C -.-> D
style C fill:#e8f5e8
style F fill:#fff3e0
style G fill:#e1f5fe
CI/CD Deployment Workflows¶
Deployment workflows orchestrate the complete process from source changes to production deployment with appropriate validation and approval gates.
Complete Deployment Pipeline:
flowchart TB
subgraph "Pull Request Validation"
A[PR Created] --> B[YAML Lint Check]
B --> C[LHP Validate]
C --> D[Security Scan]
D --> E[Dry-run Generation]
E --> F[Schema Validation]
F --> G{All Checks Pass?}
G -->|No| H[❌ Block Merge]
G -->|Yes| I[✅ Allow Merge]
end
subgraph "Deployment Pipeline"
I --> J[Merge to Main]
J --> K[🚀 Deploy DEV]
K --> L[Integration Tests]
L --> M{Dev Tests Pass?}
M -->|No| N[🔄 Rollback DEV]
M -->|Yes| O[📊 Record Success]
O --> P[Developer Creates<br/>v1.2.3-test Tag]
P --> Q[🔄 Deploy TEST]
Q --> R[Comprehensive Tests]
R --> S{Test Validation?}
S -->|No| T[🔄 Rollback TEST]
S -->|Yes| U[Developer Creates<br/>v1.2.3-prod Tag]
U --> V[⚠️ Approval Gate]
V --> W[🚀 Deploy PROD]
W --> X[Health Checks]
X --> Y[📊 Success Metrics]
end
style K fill:#e8f5e8
style Q fill:#fff3e0
style W fill:#ffebee
style V fill:#f3e5f5
Pull Request Validation¶
Comprehensive validation ensures code quality before changes reach deployment pipelines.
name: PR Validation

on:
  pull_request:
    branches: [main]

concurrency:
  group: pr-${{ github.event.pull_request.number }}
  cancel-in-progress: true

permissions:
  contents: read
  id-token: write
  pull-requests: write  # For PR comments

jobs:
  validate:
    runs-on: ubuntu-latest
    timeout-minutes: 15

    steps:
      - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
        with:
          fetch-depth: 0

      - name: Setup Python
        uses: actions/setup-python@0a5c61591373683505ea898e09a3ea4f39ef2b9c # v5.0.0
        with:
          python-version: '3.10'
          cache: 'pip'

      - name: Install Dependencies
        run: |
          pip install --upgrade pip
          pip install lakehouse-plumber==0.5.0

      - name: LHP Configuration Validation
        run: |
          lhp validate --env dev --verbose
          lhp validate --env test --verbose
          lhp validate --env prod --verbose

      - name: Dry-Run Generation Test
        run: |
          lhp generate --env dev --dry-run --verbose

      - name: Security Scan
        uses: gitleaks/gitleaks-action@cb7149a9b57195b609c63e8518d2c6056677d2d0 # v2.3.3

      - name: Comment PR Status
        if: always()
        uses: actions/github-script@60a0d83039c74a4aee543508d2ffcb1c3799cdea # v7.0.1
        with:
          script: |
            const status = '${{ job.status }}';
            const icon = status === 'success' ? '✅' : '❌';
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `${icon} Validation ${status}`
            })
Development Environment Deployment:
As indicated in the flowchart above, the development environment deployment is triggered by a push to the main branch.
dev-deployment:
  runs-on: ubuntu-latest
  if: github.ref == 'refs/heads/main' && github.event_name == 'push'

  steps:
    - uses: actions/checkout@v4

    - name: Generate Pipeline Code
      run: |
        lhp generate --env dev
        # Output: generated/dev/ and resources/lhp/dev/

    - name: Deploy to Databricks
      run: databricks bundle deploy --target dev
      env:
        DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_DEV_TOKEN }}
        DATABRICKS_HOST: ${{ secrets.DATABRICKS_DEV_HOST }}

    - name: Run Integration Tests
      run: ./scripts/integration-tests.sh dev

    - name: Record Deployment
      run: |
        echo '{
          "timestamp": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'",
          "commit_hash": "'$GITHUB_SHA'",
          "environment": "dev",
          "lhp_version": "'$(lhp --version)'",
          "pipeline_files": '$(find generated/dev/ -name "*.py" | jq -R . | jq -s .)',
          "resource_files": '$(find resources/lhp/dev/ -name "*.yml" | jq -R . | jq -s .)'
        }' > deployment-manifest-dev.json
Anatomy of deployment workflows:
Validation Gates: Multiple validation steps before any deployment
Environment Isolation: Separate credentials and configurations per environment
Test Integration: Automated testing after deployment (see the integration-tests.sh sketch below)
Audit Logging: Complete record of deployment activities
Failure Handling: Clear error messages and rollback procedures
Important
The above example code is not complete and is only for demonstration purposes.
Warning
Databricks recommends using OAuth for authentication to Databricks rather than personal access tokens stored as secrets.
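The dev workflow above calls ./scripts/integration-tests.sh. A minimal sketch of such a script, assuming a bundle pipeline resource named raw_ingestion (hypothetical):
#!/usr/bin/env bash
# scripts/integration-tests.sh - run a deployed pipeline and fail the build on error
set -euo pipefail

ENV="${1:?usage: integration-tests.sh <env>}"

# Trigger the deployed pipeline via the bundle and wait for completion;
# a non-zero exit code fails the CI job
databricks bundle run raw_ingestion --target "$ENV"

echo "✅ Integration tests passed for $ENV"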
Test Environment Promotion¶
Test environment promotion is triggered by developer-created tags and includes comprehensive testing.
test-promotion:
  runs-on: ubuntu-latest
  if: startsWith(github.ref, 'refs/tags/v') && endsWith(github.ref, '-test')

  steps:
    - uses: actions/checkout@v4
      with:
        ref: ${{ github.ref }}  # Checkout the tagged commit

    - name: Validate Tag Format
      run: |
        if [[ ! "${{ github.ref_name }}" =~ ^v[0-9]+\.[0-9]+\.[0-9]+-test$ ]]; then
          echo "❌ Invalid tag format. Use: v1.2.3-test"
          exit 1
        fi

    - name: Generate for Test Environment
      run: |
        lhp generate --env test
        # Output: generated/test/ and resources/lhp/test/

    - name: Deploy to Test Environment
      run: databricks bundle deploy --target test
      env:
        DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TEST_TOKEN }}
        DATABRICKS_HOST: ${{ secrets.DATABRICKS_TEST_HOST }}

    - name: Run Comprehensive Tests
      run: |
        ./scripts/smoke-tests.sh test
        ./scripts/data-quality-tests.sh test
        ./scripts/performance-tests.sh test
Selective Test Execution (Changed Pipelines Only) - COMING SOON¶
Note
This feature is coming soon and will integrate with LHP “Test” actions
Note
Future roadmap: an lhp impacted-pipelines command will accept changed paths or refs and output impacted pipeline names (and bundle resource names) for use with databricks bundle run <pipeline_name> -t <env>.
Production Deployment with Approval¶
Production deployment requires explicit approval and includes comprehensive validation and monitoring setup.
prod-deployment:
  runs-on: ubuntu-latest
  if: startsWith(github.ref, 'refs/tags/v') && endsWith(github.ref, '-prod')
  environment:
    name: production
    url: https://prod-workspace.databricks.com

  steps:
    - uses: actions/checkout@v4
      with:
        ref: ${{ github.ref }}

    # Pre-deployment validation handled by tag-based promotion and required approvals

    - name: Generate Production Configuration
      run: |
        lhp generate --env prod
        # Output: generated/prod/ and resources/lhp/prod/

    - name: Production Deployment (manual approval gate)
      run: databricks bundle deploy --target prod  # prod target sets mode: production
      env:
        DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_PROD_TOKEN }}
        DATABRICKS_HOST: ${{ secrets.DATABRICKS_PROD_HOST }}

    - name: Post-deployment Verification
      run: |
        ./scripts/production-health-check.sh
        ./scripts/validate-deployment.sh prod

    - name: Setup Monitoring
      run: ./scripts/setup-production-monitoring.sh

    - name: Notify Stakeholders
      run: |
        ./scripts/notify-deployment-success.sh prod ${{ github.ref_name }}
Important
The above example code is not complete and is only for demonstration purposes.
Warning
Databricks recommends using OAuth for authentication to Databricks rather than personal access tokens stored as secrets.
Anatomy of production deployment:
Environment Protection: GitHub environment with required reviewers
Pre-deployment Validation: Ensures proper progression from test environment
Production Mode: The bundle's prod target sets mode: production, enabling production-level validation
Health Checks: Comprehensive post-deployment verification (see the sketch below)
Monitoring Setup: Automated monitoring and alerting configuration
Stakeholder Communication: Automated notifications to relevant teams
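A sketch of what ./scripts/production-health-check.sh might check, assuming the production pipeline ID is supplied via PIPELINE_ID (hypothetical):
#!/usr/bin/env bash
# scripts/production-health-check.sh - verify the latest pipeline update succeeded
set -euo pipefail

PIPELINE_ID="${PIPELINE_ID:?set PIPELINE_ID to the production pipeline ID}"

# Inspect the most recent update for the pipeline
STATE=$(databricks pipelines get "$PIPELINE_ID" --output json | jq -r '.latest_updates[0].state')

if [ "$STATE" = "FAILED" ]; then
  echo "❌ Latest pipeline update failed"
  exit 1
fi
echo "✅ Pipeline healthy (latest update: $STATE)"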
Rollback Procedures¶
Rollback procedures provide rapid recovery from deployment issues while maintaining data consistency and audit trails.
Emergency Rollback Flow:
flowchart TD
A[🚨 Production Issue Detected] --> B{Issue Severity?}
B -->|Critical| C[Emergency Rollback<br/>Sub-10 minutes]
B -->|Minor| D[Planned Rollback<br/>Scheduled maintenance]
C --> E[Identify Last Good Commit]
E --> F[Create Rollback Tag<br/>v1.2.1-prod-rollback]
F --> G[Auto-trigger Rollback Pipeline]
G --> H[Deploy Previous Version<br/>Same commit SHA]
H --> I[Critical Path Tests]
I --> J{Tests Pass?}
J -->|Yes| K[✅ Rollback Complete<br/>Issue Resolved]
J -->|No| L[🆘 Escalate to Team<br/>Manual Intervention]
D --> M[Schedule Maintenance Window]
M --> N[Create Maintenance Tag<br/>v1.2.1-prod-maintenance]
N --> O[Controlled Rollback]
O --> P[Full Validation Suite]
P --> Q[📊 Success Metrics]
K --> R[📝 Incident Report<br/>Auto-generated]
Q --> R
L --> S[🚨 Page On-call Engineer]
style C fill:#ffebee
style H fill:#fff3e0
style K fill:#e8f5e8
style L fill:#ff5722
Immediate Rollback¶
Fast rollback for critical production issues using previous deployment artifacts.
emergency-rollback:
  runs-on: ubuntu-latest
  if: startsWith(github.ref, 'refs/tags/v') && endsWith(github.ref, '-rollback')
  environment:
    name: production-emergency

  steps:
    - uses: actions/checkout@v4
      with:
        fetch-depth: 0  # Full history so previous release tags can be resolved

    - name: Parse Rollback Target
      id: rollback-target
      run: |
        # Extract target version from tag (e.g., v1.2.1-rollback)
        ROLLBACK_VERSION=$(echo "${{ github.ref_name }}" | sed 's/-rollback$//')
        echo "rollback_version=$ROLLBACK_VERSION" >> $GITHUB_OUTPUT

        # Find the commit SHA for the target version
        ROLLBACK_COMMIT=$(git rev-list -n 1 ${ROLLBACK_VERSION}-prod)
        echo "rollback_commit=$ROLLBACK_COMMIT" >> $GITHUB_OUTPUT

    - name: Checkout Rollback Commit
      run: git checkout ${{ steps.rollback-target.outputs.rollback_commit }}

    - name: Generate Rollback Configuration
      run: |
        lhp generate --env prod
        # Regenerates from the rollback commit's YAML configurations

    - name: Deploy Rollback
      run: databricks bundle deploy --target prod  # prod target sets mode: production
      env:
        DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_PROD_TOKEN }}
        DATABRICKS_HOST: ${{ secrets.DATABRICKS_PROD_HOST }}

    - name: Verify Rollback Success
      run: |
        ./scripts/critical-path-tests.sh prod
        ./scripts/verify-rollback-success.sh

    - name: Create Incident Report
      run: |
        ./scripts/create-incident-report.sh \
          --rollback-from "$GITHUB_SHA" \
          --rollback-to "${{ steps.rollback-target.outputs.rollback_commit }}" \
          --environment "prod"
Anatomy of rollback procedures:
Fast Response: Sub-10-minute rollback capability for critical issues
Automated Discovery: Automatic identification of rollback targets (example below)
Data Consistency: Streaming checkpoints prevent data loss during rollback
Verification: Automated testing to confirm rollback success
Incident Tracking: Automatic creation of incident reports and documentation
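For example, identifying a rollback target from the most recent production tags (a sketch):
# Newest production tags: the current release and the rollback candidate
git tag --sort=-creatordate --list 'v*-prod' | head -n 2

# Tag the chosen version to trigger the emergency-rollback workflow
git tag v1.2.1-rollback
git push origin v1.2.1-rollback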
Security and Compliance¶
Security and compliance considerations for CI/CD workflows ensure data protection, access control, and regulatory compliance throughout the deployment pipeline.
OIDC Authentication (Recommended)¶
Eliminate long-lived Databricks tokens using GitHub OIDC (OpenID Connect) for enhanced security.
Configure Databricks Federation Policies:
# Replace placeholders:
# <SP_ID>: Service Principal numeric ID
# <org>/<repo>: Your GitHub organization and repository

# Development environment
databricks account service-principal-federation-policy create <SP_ID> --json '{
  "oidc_policy": {
    "issuer": "https://token.actions.githubusercontent.com",
    "audiences": ["https://github.com/<org>"],
    "subject": "repo:<org>/<repo>:environment:development"
  }
}'

# Test environment
databricks account service-principal-federation-policy create <SP_ID> --json '{
  "oidc_policy": {
    "issuer": "https://token.actions.githubusercontent.com",
    "audiences": ["https://github.com/<org>"],
    "subject": "repo:<org>/<repo>:environment:test"
  }
}'

# Production environment
databricks account service-principal-federation-policy create <SP_ID> --json '{
  "oidc_policy": {
    "issuer": "https://token.actions.githubusercontent.com",
    "audiences": ["https://github.com/<org>"],
    "subject": "repo:<org>/<repo>:environment:production"
  }
}'
GitHub Actions OIDC Configuration:
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production  # Must match federation policy subject
    permissions:
      contents: read
      id-token: write  # Required for OIDC token generation
    env:
      DATABRICKS_AUTH_TYPE: github-oidc
      DATABRICKS_HOST: https://workspace.cloud.databricks.com
      DATABRICKS_CLIENT_ID: <service-principal-application-id>
    steps:
      - uses: actions/checkout@<commit-sha>
      - uses: databricks/setup-cli@<commit-sha>
      - name: Deploy with OIDC
        run: |
          lhp generate --env prod
          databricks bundle deploy --target prod
Benefits of OIDC:
No Stored Secrets: Eliminates long-lived tokens in GitHub secrets
Short-lived Tokens: Automatic token rotation reduces security risk
Centralized Management: Federation policies control access centrally
Audit Trail: All authentication tracked through identity provider
Multi-Layer Security Architecture:
graph TB
subgraph "Source Control Security"
A[Branch Protection Rules]
B[Required PR Reviews]
C[Signed Commits]
D[Secret Scanning]
end
subgraph "CI/CD Security"
E["Environment Secrets<br/>Platform Secret Stores"]
F["Approval Gates<br/>Production Protection"]
G["Audit Logging<br/>All Actions Tracked"]
H["Access Control<br/>Role-based Permissions"]
end
subgraph "Databricks Security"
I["Secret Scopes<br/>dbutils.secrets.get()"]
J["Unity Catalog Permissions<br/>Row/Column Level"]
K["Workspace Isolation<br/>Dev/Test/Prod"]
L["Network Security<br/>VPC/Private Links"]
end
subgraph "Compliance & Governance"
M["Complete Audit Trail<br/>SOX/GDPR/HIPAA"]
N["Data Lineage Tracking<br/>End-to-end Visibility"]
O["Retention Policies<br/>Automated Cleanup"]
P["Compliance Reporting<br/>Automated Generation"]
end
A --> E
B --> F
C --> G
D --> H
E --> I
F --> J
G --> K
H --> L
I --> M
J --> N
K --> O
L --> P
style A fill:#ffebee
style E fill:#fff3e0
style I fill:#e8f5e8
style M fill:#e1f5fe
GitHub Environment Protection:
# .github/workflows/production-deploy.yml
prod-deployment:
  environment:
    name: production
    url: https://prod-workspace.databricks.com
# Reviewer and branch rules are configured on the 'production' environment in
# repository settings (Settings → Environments), not in the workflow file:
#   - Required reviewers: devops-team, senior-data-engineers
#   - Deployment branches: protected branches only
Anatomy of access control:
Multi-layer Security: GitHub + Databricks access controls
Principle of Least Privilege: Minimal required permissions per environment
Role-based Access: Group-based permissions for scalable management
Audit Integration: All access changes logged and tracked
Environment Protection: Production requires additional approval gates
Best Practices¶
Proven best practices for implementing robust CI/CD pipelines with Lakehouse Plumber.
Workflow Security Hardening¶
Apply these security measures to all CI/CD workflows:
Concurrency Control:
concurrency:
  group: ${{ github.workflow }}-${{ github.ref_type }}-${{ github.ref_name }}
  cancel-in-progress: true
Least Privilege Permissions:
permissions:
  contents: read
  id-token: write  # Only if using OIDC
Pin Action Versions:
steps:
  - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
  - uses: actions/setup-python@0a5c61591373683505ea898e09a3ea4f39ef2b9c # v5.0.0
  - uses: databricks/setup-cli@6071bbc2e5a862e896c755360cbc7a6a970c4e37 # v0.212.2
Version Pinning:
- uses: actions/setup-python@<sha>
  with:
    python-version: '3.10'
    cache: 'pip'

- run: |
    pip install --upgrade pip
    pip install lakehouse-plumber==0.5.0  # Pin to the project's required version
Platform-Specific Implementations¶
While the concepts above apply to all CI/CD platforms, this section provides specific implementation details for different platforms.
GitHub Actions Implementation¶
GitHub Actions is covered extensively in the examples above. Key features:
OIDC Auth Type: github-oidc
Environment Protection: Native GitHub environments
Secret Management: GitHub Secrets and Variables
Workflow Syntax: YAML with on:, jobs:, and steps:
Azure DevOps Implementation¶
Azure DevOps Pipelines support OIDC authentication and provide enterprise features for Lakehouse Plumber deployments.
OIDC Federation Policy for Azure DevOps:
databricks account service-principal-federation-policy create <SP_ID> --json '{
"oidc_policy": {
"issuer": "https://vstoken.dev.azure.com/<org_guid>",
"audiences": ["api://AzureADTokenExchange"],
"subject": "sc://<org>/<project>/<service_connection_name>"
}
}'
Azure DevOps Pipeline Example:
trigger:
  branches:
    include:
      - main
  tags:
    include:
      - v*-test
      - v*-prod

pool:
  vmImage: ubuntu-latest

variables:
  DATABRICKS_HOST: $(DATABRICKS_HOST)
  DATABRICKS_AUTH_TYPE: azure-client-secret

stages:
- stage: Validate
  condition: eq(variables['Build.Reason'], 'PullRequest')
  jobs:
  - job: ValidatePR
    steps:
    - task: UsePythonVersion@0
      inputs:
        versionSpec: '3.10'

    - script: |
        pip install --upgrade pip
        pip install lakehouse-plumber==0.5.0
      displayName: Install Dependencies

    - script: |
        lhp validate --env dev --verbose
        lhp generate --env dev --dry-run
      displayName: Validate Configuration

- stage: DeployDev
  condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
  jobs:
  - deployment: DeployToDev
    environment: development
    strategy:
      runOnce:
        deploy:
          steps:
          - checkout: self

          - task: UsePythonVersion@0
            inputs:
              versionSpec: '3.10'

          - task: AzureCLI@2
            inputs:
              azureSubscription: 'databricks-service-connection'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              addSpnToEnvironment: true  # Exposes the service principal credentials to the script
              inlineScript: |
                # Pass the service connection credentials to the Databricks CLI
                export ARM_CLIENT_ID=$servicePrincipalId
                export ARM_TENANT_ID=$tenantId
                export ARM_CLIENT_SECRET=$servicePrincipalKey

                pip install lakehouse-plumber==0.5.0
                curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

                lhp generate --env dev
                databricks bundle deploy --target dev

- stage: DeployTest
  condition: and(succeeded(), startsWith(variables['Build.SourceBranch'], 'refs/tags/v'), endsWith(variables['Build.SourceBranch'], '-test'))
  jobs:
  - deployment: DeployToTest
    environment: test
    strategy:
      runOnce:
        deploy:
          steps:
          - checkout: self

          - task: UsePythonVersion@0
            inputs:
              versionSpec: '3.10'

          - task: AzureCLI@2
            inputs:
              azureSubscription: 'databricks-service-connection'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              addSpnToEnvironment: true
              inlineScript: |
                export ARM_CLIENT_ID=$servicePrincipalId
                export ARM_TENANT_ID=$tenantId
                export ARM_CLIENT_SECRET=$servicePrincipalKey

                pip install lakehouse-plumber==0.5.0
                curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

                lhp generate --env test
                databricks bundle deploy --target test

- stage: DeployProd
  condition: and(succeeded(), startsWith(variables['Build.SourceBranch'], 'refs/tags/v'), endsWith(variables['Build.SourceBranch'], '-prod'))
  jobs:
  - deployment: DeployToProd
    environment: production
    strategy:
      runOnce:
        deploy:
          steps:
          - checkout: self

          - task: UsePythonVersion@0
            inputs:
              versionSpec: '3.10'

          - task: AzureCLI@2
            inputs:
              azureSubscription: 'databricks-service-connection-prod'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              addSpnToEnvironment: true
              inlineScript: |
                export ARM_CLIENT_ID=$servicePrincipalId
                export ARM_TENANT_ID=$tenantId
                export ARM_CLIENT_SECRET=$servicePrincipalKey

                pip install lakehouse-plumber==0.5.0
                curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

                lhp generate --env prod
                databricks bundle deploy --target prod  # prod target sets mode: production
Bitbucket Pipelines Implementation¶
Bitbucket Pipelines support OIDC authentication and provide cloud-native CI/CD for Databricks deployments.
OIDC Federation Policy for Bitbucket:
databricks account service-principal-federation-policy create <SP_ID> --json '{
"oidc_policy": {
"issuer": "https://api.bitbucket.org/2.0/workspaces/<workspace>/pipelines-config/identity/oidc",
"audiences": ["ari:cloud:bitbucket::workspace/<workspace_uuid>"],
"subject": "{<workspace_uuid>}/{<repo_uuid>}:{<environment>}:<branch_or_tag>"
}
}'
Bitbucket Pipeline Example:
image: python:3.10

definitions:
  steps:
    - step: &validate
        name: Validate Configuration
        script:
          - pip install --upgrade pip
          - pip install lakehouse-plumber==0.5.0
          - lhp validate --env dev --verbose
          - lhp generate --env dev --dry-run

    - step: &deploy-dev
        name: Deploy to Development
        deployment: development
        oidc: true
        script:
          - export DATABRICKS_CLIENT_ID=$DATABRICKS_CLIENT_ID
          - export DATABRICKS_HOST=$DATABRICKS_DEV_HOST
          - export DATABRICKS_AUTH_TYPE=env-oidc  # Token is read from DATABRICKS_OIDC_TOKEN
          - export DATABRICKS_OIDC_TOKEN=$BITBUCKET_STEP_OIDC_TOKEN

          - pip install --upgrade pip
          - pip install lakehouse-plumber==0.5.0
          - curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

          - lhp generate --env dev
          - databricks bundle deploy --target dev

    - step: &deploy-test
        name: Deploy to Test
        deployment: test
        oidc: true
        script:
          - export DATABRICKS_CLIENT_ID=$DATABRICKS_CLIENT_ID
          - export DATABRICKS_HOST=$DATABRICKS_TEST_HOST
          - export DATABRICKS_AUTH_TYPE=env-oidc
          - export DATABRICKS_OIDC_TOKEN=$BITBUCKET_STEP_OIDC_TOKEN

          - pip install --upgrade pip
          - pip install lakehouse-plumber==0.5.0
          - curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

          - lhp generate --env test
          - databricks bundle deploy --target test

    - step: &deploy-prod
        name: Deploy to Production
        deployment: production
        oidc: true
        script:
          - export DATABRICKS_CLIENT_ID=$DATABRICKS_CLIENT_ID
          - export DATABRICKS_HOST=$DATABRICKS_PROD_HOST
          - export DATABRICKS_AUTH_TYPE=env-oidc
          - export DATABRICKS_OIDC_TOKEN=$BITBUCKET_STEP_OIDC_TOKEN

          - pip install --upgrade pip
          - pip install lakehouse-plumber==0.5.0
          - curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

          - lhp generate --env prod
          - databricks bundle deploy --target prod  # prod target sets mode: production

pipelines:
  pull-requests:
    '**':
      - step: *validate

  branches:
    main:
      - step: *deploy-dev

  tags:
    'v*-test':
      - step: *deploy-test

    'v*-prod':
      - step: *deploy-prod

  custom:
    rollback-prod:
      - variables:
          - name: ROLLBACK_VERSION
      - step:
          name: Rollback Production
          deployment: production
          oidc: true
          script:
            - export DATABRICKS_CLIENT_ID=$DATABRICKS_CLIENT_ID
            - export DATABRICKS_HOST=$DATABRICKS_PROD_HOST
            - export DATABRICKS_AUTH_TYPE=env-oidc
            - export DATABRICKS_OIDC_TOKEN=$BITBUCKET_STEP_OIDC_TOKEN

            - git fetch --tags origin  # Ensure the release tag is available locally
            - git checkout tags/${ROLLBACK_VERSION}-prod
            - pip install lakehouse-plumber==0.5.0
            - curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

            - lhp generate --env prod
            - databricks bundle deploy --target prod  # prod target sets mode: production