• Challenge

    GfK’s valuable market data was only accessible through a manual web-based login process, making sourcing and integration inefficient. The challenge was to automate data retrieval while navigating the complexities of SAML-based authentication and ensuring security.

  • Solution
    • Authentication Automation

      • The system automates SAML authentication, securely logging into GfK Connect using credentials stored in Azure Key Vault.

      • The Mechanize library handles the login flow programmatically, removing the need for manual interaction.
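
The two steps above can be sketched as follows. This is a minimal sketch, not the production code: it assumes the `azure-identity` / `azure-keyvault-secrets` SDKs, and the form-field names (`username`, `password`) and form position are assumptions about the GfK Connect login page. Third-party imports are kept inside the functions so the sketch stays importable on its own.

```python
def vault_url(vault_name: str) -> str:
    """Standard Azure Key Vault endpoint for a given vault name."""
    return f"https://{vault_name}.vault.azure.net"

def get_secret(vault_name: str, secret_name: str) -> str:
    """Read one secret from Azure Key Vault (azure-identity / azure-keyvault-secrets)."""
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient
    client = SecretClient(vault_url=vault_url(vault_name),
                          credential=DefaultAzureCredential())
    return client.get_secret(secret_name).value

def saml_login(login_url: str, username: str, password: str):
    """Drive the SAML login form with mechanize; returns an authenticated browser."""
    import mechanize
    br = mechanize.Browser()
    br.set_handle_robots(False)   # IdP login pages do not serve robots.txt
    br.open(login_url)
    br.select_form(nr=0)          # assumption: the first form is the credential form
    br["username"] = username     # field names are assumptions; inspect the real page
    br["password"] = password
    br.submit()                   # mechanize follows the SAML redirect chain
    return br
```
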

    • Automated File Retrieval

      • BeautifulSoup is used to scan and extract available data files, eliminating the need for manual discovery.
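
File discovery with BeautifulSoup amounts to collecting download links from the portal's listing page. A minimal sketch; the set of file extensions worth downloading is an assumption, not a confirmed detail of GfK Connect.

```python
from bs4 import BeautifulSoup

# File extensions treated as data files; the exact set is an assumption.
DATA_SUFFIXES = (".csv", ".xlsx", ".zip")

def list_data_files(html: str, base_url: str = "") -> list[str]:
    """Extract links to downloadable data files from a portal listing page."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        base_url + a["href"]
        for a in soup.find_all("a", href=True)
        if a["href"].lower().endswith(DATA_SUFFIXES)
    ]
```
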

    • Data Processing & Integration

      • Databricks Compute Cluster orchestrates ingestion, filtering redundant files and transferring only new data to the Azure Data Lake.

      • The pipeline is optimized for efficiency and scalability, so it can handle growing data volumes.
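
The "transfer only new data" step reduces to comparing the discovered files against what already sits in the lake. A sketch of that dedup logic; on Databricks the lake listing would typically come from `dbutils.fs.ls`, which is assumed here rather than shown.

```python
def new_files(available: list[str], lake_listing: list[str]) -> list[str]:
    """Return only the source files whose names are not yet in the Data Lake.

    `lake_listing` stands in for the result of listing the lake's landing
    folder (e.g. via dbutils.fs.ls on Databricks); deduplication is by
    file name, which assumes source file names are stable and unique.
    """
    ingested = {path.rsplit("/", 1)[-1] for path in lake_listing}
    return [f for f in available if f.rsplit("/", 1)[-1] not in ingested]
```
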

  • Outcome
    • Fully automated and secure data retrieval, significantly reducing manual effort.

    • Seamless integration with cloud storage and data processing environments.

    • Enhanced security with strict credential management via Azure Key Vault.

    • A scalable solution that can adapt to growing data demands.

  • Technology
    • SAML Authentication – Secure access validation via GfK Federation.

    • Azure Key Vault – Secure storage of credentials and secrets.

    • Mechanize Library – Automated form submission for authentication.

    • BeautifulSoup – Automated discovery of data files.

    • Databricks Compute Cluster – Orchestration and processing of ingested data.

    • Azure Data Lake – Storage and integration of processed files.

How does it work?

1
Data Sources
  • Cloud databases
  • On-premises databases
  • Excel files with "pretty" formatting
  • CSV files
2
Python Script
  • Processing Excel files with "pretty" formatting
  • Conversion to *.csv
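
The core of this step is undoing the "pretty" formatting before writing CSV. A hedged pandas sketch: the cleaning rules (dropping fully blank rows/columns, trimming padded headers) are typical examples, not the exact rules used here, and `read_excel` assumes an engine such as `openpyxl` is installed.

```python
import pandas as pd

def tidy(df: pd.DataFrame) -> pd.DataFrame:
    """Strip typical 'pretty'-formatting artefacts: blank rows/columns, padded headers."""
    df = df.dropna(how="all").dropna(axis=1, how="all")
    df.columns = [str(c).strip() for c in df.columns]
    return df

def excel_to_csv(xlsx_path: str, csv_path: str) -> None:
    """Flatten one formatted Excel sheet into a plain CSV file."""
    tidy(pd.read_excel(xlsx_path)).to_csv(csv_path, index=False)
```
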
3
Linux Pipeline
  • Data filtering
4
Staging
  • Staging schema data load
5
Aggregation / MDS
  • Data aggregation at the month level
  • Populating Intermediate Fact Tables
  • Loading MD datamarts
  • Data transfer to MDS
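
The month-level roll-up can be sketched with pandas. The column names (`date`, `product`, `amount`) are illustrative stand-ins for the real fact-table schema, which is not shown in this document.

```python
import pandas as pd

def aggregate_monthly(facts: pd.DataFrame) -> pd.DataFrame:
    """Roll daily facts up to month level before populating intermediate fact tables.

    Column names are assumptions made for illustration, not the real schema.
    """
    facts = facts.assign(month=pd.to_datetime(facts["date"]).dt.to_period("M"))
    return facts.groupby(["month", "product"], as_index=False)["amount"].sum()
```
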
6
MDS
  • MD enrichment by user
  • Entering MD required for calculations: courses, units, conversion rates
  • Launching the continuation of the dataflow
7
DWH Loading
  • Calculation and loading of data marts from fact tables and MDS user data
  • Recording the load log and any errors that occurred, with their causes
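
The load-plus-error-logging pattern can be sketched as a small wrapper around each datamart load. `load_fn` is a hypothetical stand-in for the real load routine, assumed to return the number of rows written; the real pipeline presumably writes its log to a table rather than Python's logging module.

```python
import logging

log = logging.getLogger("dwh_load")

def load_datamart(name: str, load_fn) -> bool:
    """Run one datamart load and record the outcome in the load log.

    `load_fn` is a hypothetical callable standing in for the real load
    routine; it is assumed to return the number of rows written.
    """
    try:
        rows = load_fn()
        log.info("datamart %s loaded: %d rows", name, rows)
        return True
    except Exception as exc:
        # Record the failure and its reason, then let the next mart proceed.
        log.error("datamart %s failed: %s", name, exc)
        return False
```
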
8
PowerBI
  • PowerBI dataset refresh
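
The final refresh can be triggered through the Power BI REST API, which exposes a per-dataset refresh endpoint. A sketch using only the standard library; acquiring the Azure AD `access_token` is out of scope here and is assumed to be handled elsewhere.

```python
import urllib.request

PBI_API = "https://api.powerbi.com/v1.0/myorg"

def refresh_url(dataset_id: str) -> str:
    """Power BI REST endpoint that starts a dataset refresh."""
    return f"{PBI_API}/datasets/{dataset_id}/refreshes"

def trigger_refresh(dataset_id: str, access_token: str) -> None:
    """POST an empty body to start an asynchronous refresh (HTTP 202 on success)."""
    req = urllib.request.Request(
        refresh_url(dataset_id),
        data=b"",
        headers={"Authorization": f"Bearer {access_token}"},
        method="POST",
    )
    urllib.request.urlopen(req)  # raises urllib.error.HTTPError on failure
```
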