dltHub Automation

This project is a scalable, incremental data automation pipeline that extracts data from Google Tag Manager (GTM), Google Ads, and Facebook Ads. The system uses dltHub for extraction, Apache Airflow for orchestration, and loads structured data into Google BigQuery, Cloud Storage, and Firestore with optimized incremental loading to reduce costs and improve efficiency.

PROJECT DETAILS

ROLE

Data Pipeline Engineer

CHALLENGES

A major challenge was implementing incremental loading for Firestore, which dltHub does not support natively. Previously, the pipeline performed full refreshes, which were costly and inefficient. I solved this by developing a custom incremental extractor that queries Firestore by timestamp and document ID, so each run processes only new or updated records.
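
The timestamp-plus-document-ID idea can be illustrated with a minimal sketch. This is not the project's actual extractor: it filters an in-memory list instead of querying Firestore, and the names `Cursor`, `extract_incremental`, and the `updated_at` and `id` fields are assumptions made for illustration. A real implementation would presumably run an ordered Firestore query and resume after the stored cursor.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Cursor:
    """Position of the last document handled by the previous run (hypothetical)."""
    updated_at: datetime
    doc_id: str


def extract_incremental(
    docs: Iterable[dict], cursor: Optional[Cursor]
) -> Tuple[List[dict], Optional[Cursor]]:
    """Return the documents strictly after `cursor`, plus the advanced cursor."""
    # Order by (timestamp, id) so documents sharing a timestamp have a stable order.
    ordered = sorted(docs, key=lambda d: (d["updated_at"], d["id"]))
    if cursor is not None:
        last_seen = (cursor.updated_at, cursor.doc_id)
        ordered = [d for d in ordered if (d["updated_at"], d["id"]) > last_seen]
    if not ordered:
        return [], cursor  # nothing new: keep the old cursor
    newest = ordered[-1]
    return ordered, Cursor(newest["updated_at"], newest["id"])


def ts(day: int) -> datetime:
    return datetime(2025, 10, day, tzinfo=timezone.utc)


# Simulated collection; in the real pipeline these would be Firestore documents.
collection = [
    {"id": "a", "updated_at": ts(1)},
    {"id": "b", "updated_at": ts(2)},
]
first_batch, cursor = extract_incremental(collection, None)

collection.append({"id": "c", "updated_at": ts(2)})  # same timestamp as "b"
collection[0] = {"id": "a", "updated_at": ts(3)}     # "a" was updated
second_batch, cursor = extract_incremental(collection, cursor)

third_batch, cursor = extract_incremental(collection, cursor)  # no changes
```

The document ID acts as a tie-breaker: without it, a cursor based on the timestamp alone would either re-read "b" or skip "c" in the second run. One caveat of this ordering is that a document written later with the same timestamp but a smaller ID would still be missed, so the approach works best when timestamps are effectively unique.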

SOLUTION

I developed a comprehensive incremental data automation pipeline that pulls tracking and advertising data from Google Tag Manager (GTM), Google Ads, and Facebook Ads. The pipeline is fully orchestrated using Apache Airflow (Dockerized on GCP), leverages dltHub for reliable extraction and normalization, and stores data across BigQuery (analytics), Cloud Storage (raw backup), and Firestore (operational use). Strong emphasis was placed on incremental loading to minimize API costs and processing time.
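
Incremental loading across several sources depends on remembering, per source, where the previous run stopped. The sketch below shows one way this could work under stated assumptions: the JSON state file, the `run_once` function, the fetcher and loader signatures, and the `modified` field are all hypothetical, and in the actual pipeline dltHub and Airflow would handle state and scheduling.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical signatures: fetch(since) returns (rows, newest cursor value seen);
# load(source_name, rows) writes the rows to a destination.
Fetcher = Callable[[Optional[str]], Tuple[List[dict], Optional[str]]]
Loader = Callable[[str, List[dict]], None]


def load_state(path: Path) -> Dict[str, str]:
    """Read the per-source cursors, or start empty on the first run."""
    return json.loads(path.read_text()) if path.exists() else {}


def run_once(sources: Dict[str, Fetcher], state_path: Path, load: Loader) -> Dict[str, int]:
    """Fetch and load only rows newer than each source's saved cursor."""
    state = load_state(state_path)
    counts: Dict[str, int] = {}
    for name, fetch in sources.items():
        rows, newest = fetch(state.get(name))
        if rows:
            load(name, rows)
        counts[name] = len(rows)
        # Advance and persist the cursor only after the load call has returned.
        if newest is not None:
            state[name] = newest
            state_path.write_text(json.dumps(state, sort_keys=True))
    return counts


def make_fake_source(rows: List[dict]) -> Fetcher:
    """Stand-in for an API client; filters rows by an ISO date string."""
    def fetch(since: Optional[str]) -> Tuple[List[dict], Optional[str]]:
        fresh = [r for r in rows if since is None or r["modified"] > since]
        newest = max((r["modified"] for r in fresh), default=None)
        return fresh, newest
    return fetch


rows = [{"id": 1, "modified": "2025-10-01"}, {"id": 2, "modified": "2025-10-02"}]
loaded: List[dict] = []

with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "state.json"
    sources = {"google_ads": make_fake_source(rows)}
    sink = lambda name, batch: loaded.extend(batch)

    first = run_once(sources, state_file, sink)    # full initial load
    second = run_once(sources, state_file, sink)   # nothing new
    rows.append({"id": 3, "modified": "2025-10-03"})
    third = run_once(sources, state_file, sink)    # only the new row
    final_state = load_state(state_file)
```

Saving the cursor only after a successful load means a failed run is retried from the old position rather than silently skipping data.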

PERFORMANCE

Built true incremental extraction for GTM, Google Ads, and Facebook Ads
Significantly reduced daily processing costs by fetching only new or changed data
Automated daily orchestration using Apache Airflow DAGs
Seamless data delivery to BigQuery, Google Cloud Storage, and Firestore
Implemented secure authentication using Google service accounts and OAuth

TECH STACK

Python

Apache Airflow

dltHub

Google Cloud Platform

BigQuery

Firestore

Docker

Google Auth

ARCHITECTURE

The architecture uses Apache Airflow as the central orchestrator, running in Docker containers on GCP. dltHub handles data extraction and normalization from GTM, Google Ads, and Facebook Ads, and a custom incremental layer adds Firestore support. Data flows into BigQuery for analysis, Cloud Storage for archiving, and Firestore for real-time access, and every destination is loaded incrementally so that each run handles only new or changed records.
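
The fan-out from one extracted batch to several destinations can be sketched as follows. This is an illustration only: `Sink`, `InMemorySink`, and `deliver` are hypothetical names, and the in-memory sinks stand in for the real BigQuery, Cloud Storage, and Firestore writers.

```python
from typing import Dict, Iterable, List, Protocol


class Sink(Protocol):
    """Anything that can accept a batch of rows for a named source."""
    def write(self, source: str, rows: List[dict]) -> None: ...


class InMemorySink:
    """Test double standing in for a real destination client."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.written: Dict[str, List[dict]] = {}

    def write(self, source: str, rows: List[dict]) -> None:
        self.written.setdefault(source, []).extend(rows)


def deliver(source: str, rows: List[dict], sinks: Iterable[Sink]) -> int:
    """Send one incremental batch to every sink; return how many sinks were written."""
    if not rows:
        return 0  # an empty increment costs nothing downstream
    written = 0
    for sink in sinks:
        sink.write(source, rows)
        written += 1
    return written


bigquery = InMemorySink("bigquery")
storage = InMemorySink("cloud_storage")
firestore = InMemorySink("firestore")
sinks = [bigquery, storage, firestore]

batch = [{"campaign_id": "c1", "clicks": 10}]
delivered = deliver("facebook_ads", batch, sinks)
skipped = deliver("facebook_ads", [], sinks)
```

Keeping the destinations behind one small interface lets the same incremental batch be reused for analytics, archiving, and operational reads without extracting it again.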

October, 2025

Incremental Multi-Source Data Pipeline