-
Challenge
GfK’s valuable market data was only accessible through a manual web-based login process, making sourcing and integration inefficient. The challenge was to automate data retrieval while navigating the complexities of SAML-based authentication and ensuring security.
-
Solution
-
Authentication Automation
-
The system automates SAML authentication, securely logging into GfK Connect using credentials stored in Azure Key Vault.
-
Mechanize library handles the login process programmatically, ensuring a seamless experience.
-
-
Automated File Retrieval
-
BeautifulSoup is used to scan and extract available data files, eliminating the need for manual discovery.
-
-
Data Processing & Integration
-
Databricks Compute Cluster orchestrates ingestion, filtering redundant files and transferring only new data to the Azure Data Lake.
-
The pipeline is optimized to ensure efficiency and scalability, enabling large-scale data handling.
-
-
-
Outcome
-
Fully automated and secure data retrieval, significantly reducing manual effort.
-
Seamless integration with cloud storage and data processing environments.
-
Enhanced security with strict credential management via Azure Key Vault.
-
A scalable solution that can adapt to growing data demands.
-
-
Technology
-
SAML Authentication – Secure access validation via GfK Federation.
-
Azure Key Vault – Secure storage of credentials and secrets.
-
Mechanize Library – Automated form submission for authentication.
-
BeautifulSoup – Automated discovery of data files.
-
Databricks Compute Cluster – Orchestration and processing of ingested data.
-
Azure Data Lake – Storage and integration of processed files.
-
How does it work?
- cloud databases
- on-premise database
- Excel files with "pretty" formatting
- csv files
- processing Excel files with formatting
- conversion to *.csv
- Data filtering
- Staging schema data load
- Data aggregation at the month level
- Populating Intermediate Fact Tables
- Loading MD datamarts
- Data transfer to MDS
- MD Enrichment byuser
- Enter MD required for calculations: courses, units. conversion reates.
- Launch dataflow continuation
- Calculation and loading of data marts from fact tables and MDS user data
- Recording the download log and the errors that occurred with the reasons
- PowerBI dataset refresh