• Challenge

    The customer faced several challenges:

    • Lack of automated deployments for Composer/Dataflow and DBT.

    • Absence of unit tests and data quality checks.

    • Manual setup of development environments leading to inefficiencies.

  • Solution

    CI/CD Implementation:

    • Set up automated pipelines using Bitbucket Pipelines.
    • Automated deployments for Cloud Composer, Dataflow, and DBT projects.
    • Enabled unit tests and data quality checks to be integrated into the CI pipelines.
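
    As an illustration of the kind of deployment step such a pipeline can run, the sketch below uploads Airflow DAG files to the Composer environment's GCS bucket. The script, the COMPOSER_DAG_BUCKET variable, and the local dags/ directory are assumptions for the example, not the project's actual setup.

        # deploy_dags.py -- illustrative sketch, not the project's actual script.
        import os
        import pathlib

        from google.cloud import storage  # pip install google-cloud-storage

        def deploy_dags(local_dir: str = "dags") -> None:
            # Composer watches the dags/ prefix of its environment bucket;
            # copying a file there is enough to (re)deploy the DAG.
            bucket_name = os.environ["COMPOSER_DAG_BUCKET"]  # assumed variable
            bucket = storage.Client().bucket(bucket_name)
            for path in pathlib.Path(local_dir).glob("*.py"):
                bucket.blob(f"dags/{path.name}").upload_from_filename(str(path))
                print(f"deployed {path.name}")

        if __name__ == "__main__":
            deploy_dags()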

    Development Environment Setup:

    • Configured automated deployments to the dev and prod environments from feature branches (see the sketch below).
    • Established processes requiring minimal manual setup during testing.
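
    A minimal sketch of the branch-to-environment routing, assuming Bitbucket's built-in BITBUCKET_BRANCH variable; the project names are placeholders:

        # select_env.py -- hypothetical helper; project names are placeholders.
        import os

        def target_project(branch: str) -> str:
            # main deploys to prod; feature branches deploy to dev.
            return "customer-prod" if branch == "main" else "customer-dev"

        if __name__ == "__main__":
            print(target_project(os.environ.get("BITBUCKET_BRANCH", "")))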

    Documentation and Training:

    • Provided comprehensive documentation of the implemented solutions.
    • Conducted training sessions for the Softonic team on the new processes and tools.

    Project completion: Q4 2024
  • Outcome

    The implemented solution resulted in:

    • Fully automated deployments, significantly reducing manual efforts.

    • Enhanced data quality through integrated unit tests and validation.

    • Streamlined development environment setup, reducing errors and improving efficiency.

    • Improved agility, enabling faster iteration and deployment of data workflows.

    This CI/CD pipeline transformation has equipped the customer’s data engineering team with an efficient, scalable, automated workflow that enables seamless deployments and reliable data delivery.

  • Technology
    • Orchestration: Airflow running in Cloud Composer

    • Data Warehouse: BigQuery

    • Transformations: Migrating to DBT

    • Ingestion: Dataflow jobs deployed via GCS templates

    • Processing: Python jobs running on GCE VMs

    • Reporting: QlikSense

    • Version Control & CI/CD: Bitbucket Pipelines

How does it work?

1. Data Sources
  • Cloud databases
  • On-premise database
  • Excel files with "pretty" formatting
  • CSV files

2. Python Script
  • Processing of Excel files with formatting
  • Conversion to *.csv

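  A minimal sketch of this step, assuming pandas and openpyxl; the header offset and file names are illustrative:

      # excel_to_csv.py -- illustrative sketch; header_row and file names
      # are assumptions about the "pretty" report layout.
      import pandas as pd  # pip install pandas openpyxl

      def excel_to_csv(src: str, dst: str, header_row: int = 3) -> None:
          # Skip decorative title rows, then drop the fully empty rows and
          # columns left behind by merged cells and blank padding.
          df = pd.read_excel(src, header=header_row, engine="openpyxl")
          df = df.dropna(how="all").dropna(axis=1, how="all")
          df.to_csv(dst, index=False)

      if __name__ == "__main__":
          excel_to_csv("report.xlsx", "report.csv")
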
3. Linux Pipeline
  • Data filtering

4. Staging
  • Staging schema data load

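  The flow does not name the database engine; given MDS and Power BI downstream, the sketch below assumes a SQL Server staging schema reached through SQLAlchemy, with placeholder connection details and table names:

      # load_staging.py -- minimal sketch; connection string, schema and
      # table names are placeholders, not the project's real ones.
      import pandas as pd
      from sqlalchemy import create_engine  # pip install sqlalchemy pyodbc

      ENGINE = create_engine(
          "mssql+pyodbc://user:password@dwh-server/dwh"
          "?driver=ODBC+Driver+17+for+SQL+Server"
      )

      def load_to_staging(csv_path: str, table: str) -> None:
          df = pd.read_csv(csv_path)
          # Full reload of the staging table; incremental logic is omitted.
          df.to_sql(table, ENGINE, schema="staging",
                    if_exists="replace", index=False)
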
5. Aggregation / MDS
  • Data aggregation at the month level
  • Populating intermediate fact tables
  • Loading MD data marts
  • Data transfer to MDS

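  A sketch of the month-level rollup in pandas; the column names (date, product, amount) are assumptions:

      # aggregate_monthly.py -- illustrative; column names are assumed.
      import pandas as pd

      def aggregate_monthly(df: pd.DataFrame) -> pd.DataFrame:
          # Truncate each date to the first day of its month, then sum
          # the measure per product and month.
          df = df.copy()
          df["month"] = (
              pd.to_datetime(df["date"]).dt.to_period("M").dt.to_timestamp()
          )
          return df.groupby(["month", "product"], as_index=False)["amount"].sum()
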
6. MDS
  • MD enrichment by user
  • Entry of MD required for calculations: courses, units, conversion rates
  • Triggering the continuation of the data flow

7. DWH Loading
  • Calculation and loading of data marts from fact tables and MDS user data
  • Recording the load log and any errors that occurred, with their reasons

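  A minimal sketch of the load-and-log pattern this step describes; the etl.load_log table, the dm schema, and the connection details are hypothetical:

      # load_datamart.py -- sketch of "load the mart, record the outcome";
      # all object names here are hypothetical.
      import datetime

      import pandas as pd
      from sqlalchemy import create_engine, text

      ENGINE = create_engine(
          "mssql+pyodbc://user:password@dwh-server/dwh"
          "?driver=ODBC+Driver+17+for+SQL+Server"
      )

      def log_load(mart: str, status: str, reason: str = "") -> None:
          # One row per load attempt: what ran, how it ended, and why.
          with ENGINE.begin() as conn:
              conn.execute(
                  text(
                      "INSERT INTO etl.load_log (mart, status, reason, logged_at) "
                      "VALUES (:mart, :status, :reason, :ts)"
                  ),
                  {"mart": mart, "status": status, "reason": reason,
                   "ts": datetime.datetime.now()},
              )

      def load_mart(mart: str, query: str) -> None:
          try:
              df = pd.read_sql(query, ENGINE)
              df.to_sql(mart, ENGINE, schema="dm",
                        if_exists="replace", index=False)
              log_load(mart, "ok")
          except Exception as exc:
              log_load(mart, "error", str(exc))  # keep the reason in the log
              raise
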
8. Power BI
  • Power BI dataset refresh
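
  The refresh can be triggered through the Power BI REST API; the sketch below assumes an Azure AD access token is already available, and the workspace and dataset IDs are placeholders:

      # refresh_dataset.py -- sketch; token acquisition is out of scope here.
      import requests

      def refresh_dataset(token: str, group_id: str, dataset_id: str) -> None:
          # POST .../refreshes queues an asynchronous dataset refresh.
          url = (
              f"https://api.powerbi.com/v1.0/myorg/groups/{group_id}"
              f"/datasets/{dataset_id}/refreshes"
          )
          resp = requests.post(url, headers={"Authorization": f"Bearer {token}"})
          resp.raise_for_status()  # 202 Accepted means the refresh was queued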