
October 2025
Incremental Multi-Source Data Pipeline
This project is a scalable, incremental data pipeline that extracts data from Google Tag Manager (GTM), Google Ads, and Facebook Ads. It uses dltHub for extraction and Apache Airflow for orchestration, and loads structured data into Google BigQuery, Cloud Storage, and Firestore. Incremental loading keeps API costs and processing time down by avoiding full reloads.
ROLE
Data Pipeline Engineer
CHALLENGES
A major challenge was implementing incremental loading for Firestore, which dltHub does not support natively. Previously, the pipeline performed a full refresh on every run, which was costly and inefficient. I solved this by developing a custom incremental extractor that queries Firestore by timestamp and document ID, so each run processes only the records that are new or updated since the previous run.
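The cursor logic behind such an extractor can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the field names `updated_at` and `id`, the `Cursor` class, and the `select_incremental` function are all assumptions, and an in-memory list stands in for the Firestore collection. Against Firestore itself, the same idea would be a query ordered by the timestamp field and then by document ID, resumed after the stored cursor.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Cursor:
    """Position of the last processed document (hypothetical state object)."""
    updated_at: datetime
    doc_id: str


def select_incremental(
    docs: Iterable[dict], cursor: Optional[Cursor]
) -> Tuple[List[dict], Optional[Cursor]]:
    """Return the documents strictly after the cursor and the advanced cursor.

    Documents are ordered by (updated_at, id). The document ID breaks ties
    when several documents share the same timestamp, so none are skipped
    or processed twice.
    """
    ordered = sorted(docs, key=lambda d: (d["updated_at"], d["id"]))
    if cursor is not None:
        position = (cursor.updated_at, cursor.doc_id)
        ordered = [d for d in ordered if (d["updated_at"], d["id"]) > position]
    if not ordered:
        # Nothing new: keep the previous cursor so the next run resumes there.
        return [], cursor
    last = ordered[-1]
    return ordered, Cursor(last["updated_at"], last["id"])
```

The first run passes `cursor=None` and receives every document. Later runs pass the cursor returned by the previous run, which would need to be persisted between runs (for example in the pipeline's state store).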
SOLUTION
I developed an incremental data automation pipeline that pulls tracking and advertising data from GTM, Google Ads, and Facebook Ads. The pipeline is orchestrated end to end by Apache Airflow (Dockerized on GCP), uses dltHub for extraction and normalization, and stores data in BigQuery (analytics), Cloud Storage (raw backup), and Firestore (operational use). I prioritized incremental loading throughout to minimize API costs and processing time.
PERFORMANCE
TECH STACK
ARCHITECTURE
The architecture uses Apache Airflow as the central orchestrator, running in Docker containers on GCP. dltHub handles data extraction and normalization from GTM, Google Ads, and Facebook Ads. A custom incremental layer was built for Firestore support. Data flows into BigQuery for analysis, Cloud Storage for archiving, and Firestore for real-time access, with every load processed incrementally.
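The flow described above can be pictured as a small dependency graph. The sketch below is a stand-in written with Python's standard-library `graphlib`, not an Airflow DAG. The task names and the exact dependencies are assumptions made for illustration, since the actual DAG layout is not given here.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each key maps to the set of tasks it depends on.
PIPELINE = {
    "extract_gtm": set(),
    "extract_google_ads": set(),
    "extract_facebook_ads": set(),
    "normalize": {"extract_gtm", "extract_google_ads", "extract_facebook_ads"},
    "load_bigquery": {"normalize"},
    "load_cloud_storage": {"normalize"},
    "load_firestore": {"normalize"},
}


def execution_waves(graph):
    """Group tasks into waves; tasks within one wave have no mutual dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


for wave in execution_waves(PIPELINE):
    print(wave)
```

Under these assumptions the three extractions can run in parallel, normalization waits for all of them, and the three loads fan out once normalization finishes. An orchestrator such as Airflow expresses the same ordering through task dependencies.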