GfK Data Ingestion via SAML

NDA

Sales & Marketing FMCG Connector / Integration Accelerators 2024
GfK data ingestion pipeline — SAML authentication, file discovery, and Data Lake integration

Challenge

GfK, a leading market research company, provides critical consumer goods data, but accessing it required manual downloads from a website whose login flow is protected by SAML federation. Sourcing and integrating that data into a modern analytics layer was cumbersome and error-prone, and the manual step blocked any attempt at automation.

Approach

We built a fully automated workflow that handles SAML login, file discovery, and ingestion end-to-end. Authentication uses the Python mechanize library to drive the GfK Federation login flow programmatically, with credentials retrieved at runtime from Azure Key Vault, so secrets never live in code or config files.
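A minimal sketch of how such a login can be scripted, under stated assumptions rather than as the project's actual code. It assumes the identity provider serves a plain HTML login form as the first form on the page, and that the fields are named `username` and `password`. It also assumes that a successful login returns the usual SAML HTTP-POST page, whose auto-submitting form has to be posted by hand because mechanize does not execute JavaScript. The function names, field names, and secret names are hypothetical.

```python
def get_secret(vault_url, name):
    """Read one secret value from Azure Key Vault at runtime."""
    # Imported lazily so the module loads without the Azure SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())
    return client.get_secret(name).value


def new_browser():
    """Create a mechanize browser; it keeps cookies across the SAML redirects."""
    import mechanize

    browser = mechanize.Browser()
    browser.set_handle_robots(False)  # assumption: robots.txt handling is not wanted here
    return browser


def saml_login(browser, portal_url, username, password,
               user_field="username", password_field="password"):
    """Drive a SAML HTTP-POST login and return the response of the final submit."""
    # The service provider redirects the unauthenticated request to the
    # identity provider's login page.
    browser.open(portal_url)
    browser.select_form(nr=0)  # assumption: the login form is the first form
    browser[user_field] = username
    browser[password_field] = password
    # The identity provider answers with a page holding a SAMLResponse form.
    browser.submit()
    # Without JavaScript the form is not auto-submitted, so post it explicitly
    # to hand the assertion back to the service provider.
    browser.select_form(nr=0)
    return browser.submit()
```

A caller would combine the pieces as `saml_login(new_browser(), portal_url, get_secret(vault_url, "gfk-user"), get_secret(vault_url, "gfk-password"))`, where the secret names are placeholders. Passing the browser in as an argument keeps the login sequence testable against a stub.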

Once authenticated, the system uses BeautifulSoup to parse the GfK Connect portal pages and extract the list of available data files, eliminating manual discovery. A Databricks cluster then orchestrates ingestion: it filters out files already processed and transfers only new data into the Azure Data Lake, so reruns are idempotent and transfer volume grows only with new files.
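The discovery and filtering steps can be illustrated with a short sketch. To stay runnable without third-party packages it swaps in the standard library's `html.parser` for BeautifulSoup; with BeautifulSoup the link extraction would instead be `soup.find_all("a", href=True)`. The file extensions, function names, and the idea of keying files by name are assumptions for illustration, not details of the real portal.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: the file types the portal offers for download.
DATA_EXTENSIONS = (".csv", ".zip", ".xlsx")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def discover_files(html, base_url):
    """Return {file name: absolute URL} for every data-file link in the page."""
    collector = LinkCollector()
    collector.feed(html)
    files = {}
    for href in collector.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        name = posixpath.basename(urlparse(url).path)  # ignore any query string
        if name.lower().endswith(DATA_EXTENSIONS):
            files[name] = url
    return files


def select_new_files(available, already_ingested):
    """Keep only files not yet ingested, so a rerun transfers nothing twice."""
    return {name: url for name, url in available.items()
            if name not in already_ingested}
```

On Databricks, the set of already ingested names could be built from a listing of the target folder, for example `{f.name for f in dbutils.fs.ls(target_dir)}`, with `target_dir` standing in for the Data Lake path. Comparing against what is actually stored, rather than against a separate log, is one simple way to keep the job idempotent.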

Outcomes

  • Fully automated and secure data retrieval, eliminating the manual download step
  • Seamless integration with Azure cloud storage and downstream processing
  • Enhanced security: credentials managed via Azure Key Vault, never in source
  • Scalable architecture that adapts to growing data volumes and additional GfK feeds
  • Reproducible pattern reusable for any SAML-protected data source

Technology

  • SAML authentication
  • Python (mechanize, BeautifulSoup)
  • Azure Key Vault
  • Databricks
  • Azure Data Lake

Solution areas: Data Foundations
