• Challenge

    GfK’s valuable market data was only accessible through a manual web-based login process, making sourcing and integration inefficient. The challenge was to automate data retrieval while navigating the complexities of SAML-based authentication and ensuring security.

  • Solution
    • Authentication Automation

      • The system automates SAML authentication, securely logging into GfK Connect using credentials stored in Azure Key Vault.

      • The Mechanize library handles the login flow programmatically, removing the need for manual interaction.
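
The two steps above can be sketched as follows. This is a minimal sketch, not the production code: it assumes the `azure-identity` / `azure-keyvault-secrets` SDKs, and the form-field names (`username`, `password`) and form position are assumptions about the GfK Connect login page. Third-party imports are kept inside the functions so the sketch stays importable on its own.

```python
def vault_url(vault_name: str) -> str:
    """Standard Azure Key Vault endpoint for a given vault name."""
    return f"https://{vault_name}.vault.azure.net"

def get_secret(vault_name: str, secret_name: str) -> str:
    """Read one secret from Azure Key Vault (azure-identity / azure-keyvault-secrets)."""
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient
    client = SecretClient(vault_url=vault_url(vault_name),
                          credential=DefaultAzureCredential())
    return client.get_secret(secret_name).value

def saml_login(login_url: str, username: str, password: str):
    """Drive the SAML login form with mechanize; returns an authenticated browser."""
    import mechanize
    br = mechanize.Browser()
    br.set_handle_robots(False)   # IdP login pages do not serve robots.txt
    br.open(login_url)
    br.select_form(nr=0)          # assumption: the first form is the credential form
    br["username"] = username     # field names are assumptions; inspect the real page
    br["password"] = password
    br.submit()                   # mechanize follows the SAML redirect chain
    return br
```
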

    • Automated File Retrieval

      • BeautifulSoup is used to scan and extract available data files, eliminating the need for manual discovery.
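
File discovery with BeautifulSoup amounts to collecting download links from the portal's listing page. A minimal sketch; the set of file extensions worth downloading is an assumption, not a confirmed detail of GfK Connect.

```python
from bs4 import BeautifulSoup

# File extensions treated as data files; the exact set is an assumption.
DATA_SUFFIXES = (".csv", ".xlsx", ".zip")

def list_data_files(html: str, base_url: str = "") -> list[str]:
    """Extract links to downloadable data files from a portal listing page."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        base_url + a["href"]
        for a in soup.find_all("a", href=True)
        if a["href"].lower().endswith(DATA_SUFFIXES)
    ]
```
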

    • Data Processing & Integration

      • Databricks Compute Cluster orchestrates ingestion, filtering redundant files and transferring only new data to the Azure Data Lake.

      • The pipeline is optimized for efficiency and scalability, so it can handle growing data volumes.
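
The "transfer only new data" step reduces to comparing the discovered files against what already sits in the lake. A sketch of that dedup logic; on Databricks the lake listing would typically come from `dbutils.fs.ls`, which is assumed here rather than shown.

```python
def new_files(available: list[str], lake_listing: list[str]) -> list[str]:
    """Return only the source files whose names are not yet in the Data Lake.

    `lake_listing` stands in for the result of listing the lake's landing
    folder (e.g. via dbutils.fs.ls on Databricks); deduplication is by
    file name, which assumes source file names are stable and unique.
    """
    ingested = {path.rsplit("/", 1)[-1] for path in lake_listing}
    return [f for f in available if f.rsplit("/", 1)[-1] not in ingested]
```
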

  • Outcome
    • Fully automated and secure data retrieval, significantly reducing manual effort.

    • Seamless integration with cloud storage and data processing environments.

    • Enhanced security with strict credential management via Azure Key Vault.

    • A scalable solution that can adapt to growing data demands.

  • Technology
    • SAML Authentication – Secure access validation via GfK Federation.

    • Azure Key Vault – Secure storage of credentials and secrets.

    • Mechanize Library – Automated form submission for authentication.

    • BeautifulSoup – Automated discovery of data files.

    • Databricks Compute Cluster – Orchestration and processing of ingested data.

    • Azure Data Lake – Storage and integration of processed files.

How does it work?

1
Data Sources
  • Cloud databases
  • On-premises databases
  • Excel files with "pretty" formatting
  • CSV files
2
Python Script
  • Processing Excel files with "pretty" formatting
  • Conversion to *.csv
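
The core of this step is undoing the "pretty" formatting before writing CSV. A hedged pandas sketch: the cleaning rules (dropping fully blank rows/columns, trimming padded headers) are typical examples, not the exact rules used here, and `read_excel` assumes an engine such as `openpyxl` is installed.

```python
import pandas as pd

def tidy(df: pd.DataFrame) -> pd.DataFrame:
    """Strip typical 'pretty'-formatting artefacts: blank rows/columns, padded headers."""
    df = df.dropna(how="all").dropna(axis=1, how="all")
    df.columns = [str(c).strip() for c in df.columns]
    return df

def excel_to_csv(xlsx_path: str, csv_path: str) -> None:
    """Flatten one formatted Excel sheet into a plain CSV file."""
    tidy(pd.read_excel(xlsx_path)).to_csv(csv_path, index=False)
```
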
3
Linux Pipeline
  • Data filtering
4
Staging
  • Staging schema data load
5
Aggregation / MDS
  • Data aggregation at the month level
  • Populating Intermediate Fact Tables
  • Loading MD datamarts
  • Data transfer to MDS
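
The month-level roll-up can be sketched with pandas. The column names (`date`, `product`, `amount`) are illustrative stand-ins for the real fact-table schema, which is not shown in this document.

```python
import pandas as pd

def aggregate_monthly(facts: pd.DataFrame) -> pd.DataFrame:
    """Roll daily facts up to month level before populating intermediate fact tables.

    Column names are assumptions made for illustration, not the real schema.
    """
    facts = facts.assign(month=pd.to_datetime(facts["date"]).dt.to_period("M"))
    return facts.groupby(["month", "product"], as_index=False)["amount"].sum()
```
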
6
MDS
  • MD enrichment by user
  • Entering MD required for calculations: courses, units, conversion rates
  • Launching the continuation of the dataflow
7
DWH Loading
  • Calculation and loading of data marts from fact tables and MDS user data
  • Recording the load log and any errors that occurred, with their causes
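
The load-plus-error-logging pattern can be sketched as a small wrapper around each datamart load. `load_fn` is a hypothetical stand-in for the real load routine, assumed to return the number of rows written; the real pipeline presumably writes its log to a table rather than Python's logging module.

```python
import logging

log = logging.getLogger("dwh_load")

def load_datamart(name: str, load_fn) -> bool:
    """Run one datamart load and record the outcome in the load log.

    `load_fn` is a hypothetical callable standing in for the real load
    routine; it is assumed to return the number of rows written.
    """
    try:
        rows = load_fn()
        log.info("datamart %s loaded: %d rows", name, rows)
        return True
    except Exception as exc:
        # Record the failure and its reason, then let the next mart proceed.
        log.error("datamart %s failed: %s", name, exc)
        return False
```
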
8
PowerBI
  • PowerBI dataset refresh
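
The final refresh can be triggered through the Power BI REST API, which exposes a per-dataset refresh endpoint. A sketch using only the standard library; acquiring the Azure AD `access_token` is out of scope here and is assumed to be handled elsewhere.

```python
import urllib.request

PBI_API = "https://api.powerbi.com/v1.0/myorg"

def refresh_url(dataset_id: str) -> str:
    """Power BI REST endpoint that starts a dataset refresh."""
    return f"{PBI_API}/datasets/{dataset_id}/refreshes"

def trigger_refresh(dataset_id: str, access_token: str) -> None:
    """POST an empty body to start an asynchronous refresh (HTTP 202 on success)."""
    req = urllib.request.Request(
        refresh_url(dataset_id),
        data=b"",
        headers={"Authorization": f"Bearer {access_token}"},
        method="POST",
    )
    urllib.request.urlopen(req)  # raises urllib.error.HTTPError on failure
```
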