
April 2025
Scraping & Data Pipeline
This project is an automation pipeline that scrapes real estate listings from a website, stores the results in structured JSONL files, and downloads the property images. It was designed for fast, reliable large-scale extraction using Scrapy and Playwright.
ROLE
Automation Developer
CHALLENGES
The main challenge was handling authenticated sessions and dynamic website interactions. Playwright was used to automate login flows and browser actions, while rotating proxies and custom user agents helped reduce blocking during large-scale scraping.
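The rotation idea can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the proxy URLs, user agent strings, and the RequestIdentity class are all hypothetical, and it uses only the Python standard library.

```python
import itertools
import random

# Hypothetical values for illustration; the real proxy pool and
# user agents are not given in the project description.
PROXIES = [
    "http://proxy-a.example:8000",
    "http://proxy-b.example:8000",
    "http://proxy-c.example:8000",
]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]


class RequestIdentity:
    """Hands out a proxy and a User-Agent header for each outgoing request."""

    def __init__(self, proxies, user_agents, seed=None):
        if not proxies or not user_agents:
            raise ValueError("need at least one proxy and one user agent")
        self._proxies = itertools.cycle(proxies)  # round-robin over the pool
        self._user_agents = list(user_agents)
        self._rng = random.Random(seed)  # seedable so runs can be reproduced

    def pick(self):
        """Return the next proxy plus a randomly chosen User-Agent header."""
        return {
            "proxy": next(self._proxies),
            "headers": {"User-Agent": self._rng.choice(self._user_agents)},
        }
```

In a Scrapy spider, the chosen proxy would typically be passed through the request's meta["proxy"] key and the header through the request's headers. Round-robin spreads load evenly across the pool, while the random user agent avoids a fixed proxy-to-agent pairing.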
SOLUTION
I built two automation scripts for the client. The first takes search inputs such as location, title keywords, or filters, then scrapes every matching listing from the platform. The second reads the user's liked (favorites) list and extracts every saved property. The scraped data is stored as JSONL files, and the property images are downloaded into organized folders.
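As a rough sketch of the storage step, the helpers below append one record per line to a JSONL file and compute a per-listing image path. The function names, the folder layout (one folder per listing, zero-padded image index), and the ".jpg" fallback are assumptions made for illustration; the project's actual layout is not described.

```python
import json
import re
from pathlib import Path
from urllib.parse import urlparse


def append_jsonl(path, record):
    """Append one record as a single JSON line, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def image_path(root, listing_id, image_url, index):
    """Build <root>/<listing_id>/<index>.<ext> for one property image."""
    # Replace characters that are unsafe in folder names.
    safe_id = re.sub(r"[^A-Za-z0-9_-]", "_", str(listing_id))
    # Take the extension from the URL path, ignoring any query string.
    ext = Path(urlparse(image_url).path).suffix.lower() or ".jpg"
    return Path(root) / safe_id / f"{index:03d}{ext}"
```

Appending line by line means a crash partway through a long run leaves every earlier record intact and readable, which is one reason JSONL suits large scraping jobs.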
PERFORMANCE
TECH STACK
ARCHITECTURE
The automation system combines Scrapy for high-performance crawling with Playwright for browser automation and login handling. Extracted data is processed and written to JSONL datasets, with property images downloaded automatically and sorted into folders.
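One common way to wire these pieces together is through Scrapy's settings module. The fragment below is a sketch under stated assumptions: it assumes the scrapy-playwright plugin and Pillow are installed, and the output paths are hypothetical. The project description does not say which integration or paths were actually used.

```python
# settings.py (sketch). Assumes scrapy-playwright and Pillow are installed;
# the integration actually used by the project is not documented.

# Route HTTP(S) downloads through Playwright so individual requests
# can opt in to a real browser session.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Export every scraped item as one JSON object per line.
# The output path is a hypothetical example.
FEEDS = {
    "output/listings.jsonl": {"format": "jsonlines", "encoding": "utf8"},
}

# Download the URLs listed in each item's `image_urls` field.
ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
IMAGES_STORE = "output/images"  # hypothetical folder
```

With this setup, a request is rendered in the browser only when it carries meta={"playwright": True}, so plain pages can still go through Scrapy's faster default downloader. Note that the stock ImagesPipeline names files by hash under a single folder; per-property folders would need a subclass that overrides its file_path method.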