Web Scraper Monitoring System
A production-grade, distributed web scraping and monitoring system. Designed for high throughput and stealth, it incorporates asynchronous task queues, automatic rate throttling, proxy rotation, CAPTCHA integration, and automated database deduplication.
System Architecture
The project splits scraping work into two distinct streams based on the scraper's execution requirements (static parsing vs. dynamic browser emulation).
flowchart TD
API[FastAPI Server] -->|CRUD Configs| Mongo[(MongoDB - Beanie ODM)]
API -->|Trigger Search| RedisBroker[(Redis Broker)]
subgraph Celery Task Pipeline
RedisBroker -->|Read http Queue| WorkerHTTP[HTTP Worker - 10 Concurrency]
RedisBroker -->|Read browser Queue| WorkerBrowser[Browser Worker - 3 Concurrency]
WorkerHTTP -->|fetch_page| Web1[Target Site HTML]
WorkerBrowser -->|fetch_browser Playwright| Web2[Target Site JS SPA]
end
WorkerHTTP -->|Check Deduplication| RedisCache[(Redis Deduplication)]
WorkerBrowser -->|Check Deduplication| RedisCache
WorkerHTTP -->|Insert Listings & ScrapeRuns| Mongo
WorkerBrowser -->|Insert Listings & ScrapeRuns| Mongo
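The split between the two streams can be sketched as a small routing function. This is a minimal illustration, not the project's actual code: the function name `pick_queue` and the engine labels are assumptions, and only the queue names `http` and `browser` come from the diagram above.

```python
# Hypothetical engine labels; the real site configuration may use other values.
BROWSER_ENGINES = {"playwright"}


def pick_queue(engine: str) -> str:
    """Return the name of the Celery queue a scrape task should be sent to."""
    # Sites that need JavaScript rendering go to the low-concurrency browser
    # queue; everything else goes to the lightweight HTTP queue.
    return "browser" if engine.lower() in BROWSER_ENGINES else "http"
```

Keeping the decision in one place means a site can be moved between streams by changing its configuration rather than its scraper code.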
Key Components
- FastAPI Web Server: Exposes RESTful endpoints for CRUD operations on Sites, Searches, Results, and Activity Logs (FastAPI router definitions located in scraper/api/routes).
- Celery Worker Pipeline (celery_app.py):
  - http Queue: Tailored for fast, lightweight static HTTP clients utilizing httpx with exponential backoff and proxy/User-Agent rotation.
  - browser Queue: Orchestrates high-resource headless browser tasks using Playwright in stealth mode to bypass aggressive bot protection.
- Database Layer (MongoDB + Beanie ODM):
  - Automatically maintains indexes on startup (no Alembic migrations required).
  - TTL Automation: Un-favorited listings expire automatically after 3 days using a MongoDB partial TTL index on expires_at.
  - Logs Auto-cleanup: System activity logs are automatically purged after 30 days.
- Redis Cache Layer:
  - Functions as the Celery task broker and results backend.
  - Caches listing hashes for 3 days to avoid database lookups and prevent duplicate insertions.
- Daily Scheduler (APScheduler):
  - Runs in the FastAPI process space.
  - Automatically registers jobs on FastAPI startup using context manager lifespans.
  - Includes two primary jobs: daily_refresh (runs every night at midnight Amsterdam time to queue active searches) and retry_failed_runs (runs every 2 hours to retry failed scraping runs from the past 6 hours).
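The hash-based deduplication handled by the Redis cache layer can be sketched as follows. This is a rough illustration under stated assumptions: the hash input (site id plus listing URL), the `listing_hash` function, and the `SeenCache` class are all hypothetical, and an in-memory dictionary stands in for Redis so the example runs without a server.

```python
import hashlib
import time
from typing import Dict, Optional

THREE_DAYS = 3 * 24 * 3600  # cache lifetime stated in this README


def listing_hash(site_id: str, url: str) -> str:
    """Stable fingerprint for a listing (assumed key: site id + listing URL)."""
    return hashlib.sha256(f"{site_id}|{url}".encode("utf-8")).hexdigest()


class SeenCache:
    """In-memory stand-in for the Redis cache, with per-key expiry."""

    def __init__(self, ttl_seconds: int = THREE_DAYS) -> None:
        self.ttl_seconds = ttl_seconds
        self._expires_at: Dict[str, float] = {}

    def add_if_new(self, key: str, now: Optional[float] = None) -> bool:
        """Record the key and return True if it was unseen or had expired."""
        now = time.time() if now is None else now
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # still cached: treat the listing as a duplicate
        self._expires_at[key] = now + self.ttl_seconds
        return True
```

A worker would insert a listing into MongoDB only when `add_if_new` returns True. Against a real Redis server, the same check-and-set is commonly done atomically with a single `SET` command using the `NX` and `EX` options.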
Prerequisites
Ensure you have the following installed on your machine:
* Python 3.12+ (fully compatible with Python 3.14)
* Docker & Docker Compose
* Google Chrome / Chromium (for Playwright browser tasks)
Getting Started
1. Environment Setup
Clone the repository and navigate to the project directory:
cd "Scraper System"
Create a virtual environment and activate it:
# Windows
python -m venv .venv
.venv\Scripts\activate
# Unix/macOS
python3 -m venv .venv
source .venv/bin/activate
2. Install Project Dependencies
Install the scraper module and all required libraries:
pip install .
Initialize standard Playwright browser binaries:
playwright install chromium
3. Configure Variables
Copy the template configuration to a local .env file:
# Windows
copy .env.example .env
# Unix/macOS
cp .env.example .env
Open .env and fill in the details (MongoDB, Redis connection strings, proxies, and API keys).
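A quick way to catch an incomplete configuration is to check for blank values before starting the workers. The sketch below is only an illustration: EBAY_APP_ID and VINTED_SESSION_COOKIE are named elsewhere in this README, while MONGODB_URI and REDIS_URL are hypothetical placeholders, so check .env.example for the real variable names. It assumes the values have already been loaded into a mapping such as `os.environ`.

```python
from typing import Iterable, List, Mapping

# MONGODB_URI and REDIS_URL are assumed names, not taken from the project.
REQUIRED = ("MONGODB_URI", "REDIS_URL", "EBAY_APP_ID", "VINTED_SESSION_COOKIE")


def missing_settings(
    env: Mapping[str, str], required: Iterable[str] = REQUIRED
) -> List[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]
```

Calling `missing_settings(os.environ)` at startup and aborting on a non-empty result fails fast instead of surfacing the problem midway through a scrape run.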
Running the Platform
To start the system, launch the backing storage instances, seed the database, and then start the pipeline workers.
Step 1: Run Services (Docker Compose)
Start MongoDB, Redis, and Mongo Express (web GUI) in detached mode:
docker-compose up -d
- MongoDB Port: 27017
- Redis Port: 6379
- Mongo Express UI: http://localhost:8081 (Credentials: admin/pass)
Step 2: Seed the Site Configurations
Seed all 11 scraping platforms (Marktplaats + 10 new sites) into MongoDB:
python scripts/seed_sites.py
Step 3: Create an Admin Account
Create a default admin user for JWT-secured endpoints and WebSocket connections:
python scripts/create_admin.py --username admin --password adminpass
Step 4: Launch Celery Workers
Open two new terminals (with .venv activated) and launch the distinct worker queues:
- HTTP Worker (Standard Scrapers: Marktplaats, eBay, Vinted, Subito, Willhaben, Allegro, Wallapop, Blocket, Delcampe):
  celery -A celery_app worker -Q http --concurrency 10 --loglevel=info -P solo
- Browser Worker (Playwright Emulation: Leboncoin, Catawiki):
  celery -A celery_app worker -Q browser --concurrency 3 --loglevel=info -P solo

(Note: -P solo is recommended when running Celery directly on Windows. The solo pool executes one task at a time, so --concurrency has no effect while it is set; omit -P solo on Linux/macOS to use the configured concurrency.)
Step 5: Run the API Server
Start the FastAPI development server:
uvicorn main:app --reload --port 8000
- Interactive Swagger API Docs: http://localhost:8000/docs
- Base Endpoint Prefix: http://localhost:8000/api/v1
Authentication & Real-time WebSockets
JWT Authentication
All mutating endpoints (POST, PATCH, DELETE) require JWT authentication.
1. Authenticate via POST /api/v1/auth/login to obtain an access_token and its expires_in lifetime in seconds (24-hour validity).
2. Pass the token as a Bearer token in the Authorization header: Authorization: Bearer <your_token>.
3. Refresh the token via POST /api/v1/auth/refresh.
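The login and header steps above might look like this from a Python client. This is a sketch under assumptions: the login endpoint is assumed to accept a JSON body with `username` and `password` fields, which this README does not confirm, and the function names are hypothetical. The request is only constructed here, not sent.

```python
import json
import urllib.request
from typing import Dict

BASE_URL = "http://localhost:8000/api/v1"


def build_login_request(username: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) the login request.

    Assumption: the endpoint accepts a JSON body with username and password.
    """
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/auth/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def auth_headers(access_token: str) -> Dict[str, str]:
    """Build the Authorization header required by mutating endpoints."""
    return {"Authorization": f"Bearer {access_token}"}
```

With the API server running, passing the prepared request to `urllib.request.urlopen` and reading `access_token` from the JSON response yields the value to hand to `auth_headers`.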
WebSocket Live Event Stream
Connect to GET /ws/live?token={jwt} to listen to system events. The server validates the token on connection, closing with code 4001 if invalid.
JSON events pushed in real time include:
* scrape_run_started: {type, site_id, site_name, search_id, run_id}
* scrape_run_finished: {type, site_id, site_name, run_id, status, new_count, duration_ms}
* new_listing: {type, listing_id, site_id, title, price, currency, url}
* scheduler_fired: {type, job_name, searches_dispatched}
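A client consuming this stream typically dispatches on the `type` field. The sketch below shows one way to do that for the four events listed above. It assumes the value of `type` equals the event name, and `describe_event` is a hypothetical helper; the WebSocket connection itself is left out so the example needs only the standard library.

```python
import json


def describe_event(raw: str) -> str:
    """Turn one JSON event from /ws/live into a one-line summary."""
    event = json.loads(raw)
    kind = event.get("type")
    if kind == "scrape_run_started":
        return f"run {event['run_id']} started on {event['site_name']}"
    if kind == "scrape_run_finished":
        return (
            f"run {event['run_id']} {event['status']}: "
            f"{event['new_count']} new in {event['duration_ms']} ms"
        )
    if kind == "new_listing":
        return f"{event['title']} ({event['price']} {event['currency']}) {event['url']}"
    if kind == "scheduler_fired":
        return f"{event['job_name']} dispatched {event['searches_dispatched']} searches"
    # Unknown types are reported rather than raised, so new server-side
    # events do not break older clients.
    return f"unknown event type: {kind!r}"
```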
Supported Scraping Engines
The platform includes 11 specialized scrapers normalized to the standard Listing model:
- Marktplaats (Netherlands): HTML scraper using httpx.
- eBay (Global): REST API utilizing the eBay Finding API. Requires EBAY_APP_ID.
- Vinted (Global): Unofficial catalog REST API. Requires VINTED_SESSION_COOKIE for session bypass.
- Leboncoin (France): Browser-based Playwright scraper. Uses stealth, random mouse movements, and post-load delays to bypass heavy bot detection.
- Subito (Italy): HTML scraper using httpx.
- Willhaben (Austria): HTML scraper using httpx.
- Allegro (Poland): HTML scraper using httpx. Normalizes Polish Zloty (PLN) pricing.
- Wallapop (Spain/Global): REST API search endpoint.
- Blocket (Sweden): HTML scraper using httpx. Normalizes Swedish Krona (SEK) pricing.
- Delcampe (Global): HTML scraper for collectibles.
- Catawiki (Global): Browser-based Playwright scraper for active auctions.
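Normalizing every scraper's output to one shape could look roughly like the sketch below. The field names are borrowed from the new_listing WebSocket event; the dataclass, the `normalize` function, and the EUR default are assumptions, and the project's real Listing model is a Beanie document that may carry more fields. The sketch only cleans the price and currency code and performs no currency conversion.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Listing:
    site_id: str
    title: str
    price: Decimal
    currency: str  # ISO 4217 code such as EUR, PLN, or SEK
    url: str


def normalize(site_id: str, raw: dict) -> Listing:
    """Map one scraper-specific record onto the shared Listing shape."""
    return Listing(
        site_id=site_id,
        title=raw["title"].strip(),
        # Going through str() avoids binary float artifacts in the Decimal.
        price=Decimal(str(raw["price"])),
        # Assumed default: EUR when a scraper does not report a currency.
        currency=raw.get("currency", "EUR").upper(),
        url=raw["url"],
    )
```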
Pipeline & Parser Verification
We provide testing scripts to verify parser extraction accuracy, API endpoints, and WebSocket event integrations.
WebSocket Live Feed Test
Connects to the WebSocket server, triggers a manual search config, and streams live events for 60 seconds:
python scripts/test_websocket.py
Parse Extractor Test
Run the isolated Marktplaats parser script to check extraction output:
python scripts/test_marktplaats.py
Full Pipeline Test
Run the end-to-end task integration test:
python scripts/test_pipeline.py
Scheduler Verification
Run the standalone scheduler test to verify job registration, manual triggering, and ActivityLog persistence:
python scripts/test_scheduler.py
REST API Verification
Verify the complete FastAPI REST API endpoints using the automated test suite:
python scripts/test_api.py
Project Directory Structure
scraper-system/
├── scraper/
│   ├── api/             # FastAPI routers
│   │   └── routes/      # Endpoints (auth, searches, results, sites, logs, tasks)
│   ├── core/            # Infrastructure configurations (db, redis, settings, pubsub)
│   ├── models/          # Beanie ODM MongoDB Document definitions (user, site, listing, etc.)
│   ├── scrapers/        # Extraction engines (BaseScraper, 11 site scrapers)
│   ├── scheduler/       # Cron/APScheduler orchestration definitions
│   └── workers/         # Celery task definitions & scraper class registry
├── scripts/             # Script runners for seeding and standalone testing
├── celery_app.py        # Celery broker configuration & custom AsyncTask base
├── docker-compose.yml   # Multi-container orchestration config
├── main.py              # FastAPI application server entrypoint
├── pyproject.toml       # Pip packaging configurations & dependencies
└── README.md            # Documentation guide
Developer & Website
Developed by Daniel Agbeni.
Check out my website and other projects at danielagbeni.uploaddoc.app.