Web Scraper Monitoring System

A production-grade, distributed web scraping and monitoring system. Designed for high throughput and stealth, it incorporates asynchronous task queues, automatic rate throttling, proxy rotation, CAPTCHA integration, and automated database deduplication.

🏗 System Architecture

The project splits scraping work into two distinct streams based on the scraper's execution requirements (static parsing vs. dynamic browser emulation).

flowchart TD
    API[FastAPI Server] -->|CRUD Configs| Mongo[(MongoDB - Beanie ODM)]
    API -->|Trigger Search| RedisBroker[(Redis Broker)]

    subgraph Celery Task Pipeline
        RedisBroker -->|Read http Queue| WorkerHTTP[HTTP Worker - 10 Concurrency]
        RedisBroker -->|Read browser Queue| WorkerBrowser[Browser Worker - 3 Concurrency]

        WorkerHTTP -->|fetch_page| Web1[Target Site HTML]
        WorkerBrowser -->|fetch_browser Playwright| Web2[Target Site JS SPA]
    end

    WorkerHTTP -->|Check Deduplication| RedisCache[(Redis Deduplication)]
    WorkerBrowser -->|Check Deduplication| RedisCache

    WorkerHTTP -->|Insert Listings & ScrapeRuns| Mongo
    WorkerBrowser -->|Insert Listings & ScrapeRuns| Mongo
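
The deduplication check in the diagram can be pictured as a "set if absent, with expiry" operation. The sketch below is only an illustration, not the project's code: `listing_key` and `DedupCache` are hypothetical names, hashing the site ID plus URL is an assumption about what identifies a listing, and an in-memory dict stands in for Redis (where a single `SET key 1 NX EX <seconds>` command could play the same role). The 3-day retention mirrors the window listed under Key Components.

```python
import hashlib
import time

THREE_DAYS = 3 * 24 * 60 * 60  # retention window in seconds


def listing_key(site_id: str, url: str) -> str:
    """Build a stable cache key from the fields assumed to identify a listing."""
    digest = hashlib.sha256(f"{site_id}|{url}".encode("utf-8")).hexdigest()
    return f"dedup:{digest}"


class DedupCache:
    """In-memory stand-in for the Redis deduplication cache (hypothetical)."""

    def __init__(self, ttl_seconds: int = THREE_DAYS, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._expiry: dict[str, float] = {}  # key -> expiry timestamp

    def mark_if_new(self, key: str) -> bool:
        """Return True and remember the key if it is unseen or has expired."""
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False  # seen recently: the worker would skip the insert
        self._expiry[key] = now + self._ttl
        return True
```

Under these assumptions, a worker would call `mark_if_new` before writing to MongoDB and insert the listing only when it returns `True`.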

Key Components

* FastAPI Web Server: Exposes RESTful endpoints for CRUD operations on Sites, Searches, Results, and Activity Logs (FastAPI router definitions located in scraper/api/routes).
* Celery Worker Pipeline (celery_app.py):
  * http Queue: Tailored for fast, lightweight static HTTP clients utilizing httpx with exponential backoff and proxy/User-Agent rotation.
  * browser Queue: Orchestrates high-resource headless browser tasks using Playwright in stealth mode to bypass aggressive bot protection.
* Database Layer (MongoDB + Beanie ODM):
  * Automatically maintains indexes on startup (No Alembic migrations required).
  * TTL Automation: Un-favorited listings expire automatically after 3 days using a MongoDB Partial TTL Index on expires_at.
  * Logs Auto-cleanup: System activity logs are automatically purged after 30 days.
* Redis Cache Layer:
  * Functions as the Celery task broker and results backend.
  * Caches listing hashes for 3 days to avoid database lookups and prevent duplicate insertions.
* Daily Scheduler (APScheduler):
  * Runs in the FastAPI process space.
  * Automatically registers jobs on FastAPI startup using context manager lifespans.
  * Includes two primary jobs: daily_refresh (runs every night at midnight Amsterdam time to queue active searches) and retry_failed_runs (runs every 2 hours to retry failed scraping runs from the past 6 hours).
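
To make the retry_failed_runs rule concrete, here is a minimal sketch of the selection step: pick runs that failed within the past 6 hours. `RunRecord`, its field names, and the `"failed"` status value are hypothetical stand-ins for whatever the real ScrapeRun document stores; only the 6-hour window comes from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

RETRY_WINDOW = timedelta(hours=6)  # window stated for retry_failed_runs


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical, trimmed-down view of a scrape run."""
    run_id: str
    status: str            # assumed values: "success" or "failed"
    finished_at: datetime  # timezone-aware timestamp


def select_runs_to_retry(runs, now=None, window=RETRY_WINDOW):
    """Return failed runs that finished within `window` before `now`."""
    now = datetime.now(timezone.utc) if now is None else now
    cutoff = now - window
    return [
        run for run in runs
        if run.status == "failed" and cutoff <= run.finished_at <= now
    ]
```

In the real system this filter would more likely be expressed as a database query; the list comprehension just shows the rule itself.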

🛠 Prerequisites

Ensure you have the following installed on your machine:

* Python 3.12+ (fully compatible with Python 3.14)
* Docker & Docker Compose
* Google Chrome / Chromium (for Playwright browser tasks)

🚀 Getting Started

1. Environment Setup

Clone the repository and navigate to the project directory:

cd "Scraper System"

Create a virtual environment and activate it:

# Windows
python -m venv .venv
.venv\Scripts\activate

# Unix/macOS
python3 -m venv .venv
source .venv/bin/activate

2. Install Project Dependencies

Install the scraper module and all required libraries:

pip install .

Install the Chromium browser binary that Playwright drives:

playwright install chromium

3. Configure Variables

Copy the template configuration to a local .env file:

# Windows
copy .env.example .env

# Unix/macOS
cp .env.example .env

Open .env and fill in the details (MongoDB, Redis connection strings, proxies, and API keys).
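
Before starting the services, it can help to confirm that nothing in .env was left blank. The sketch below is an assumption-laden helper, not part of the project: `parse_env` and `missing_vars` are hypothetical names, and of the variable names shown only EBAY_APP_ID and VINTED_SESSION_COOKIE appear in this README. MONGODB_URI and REDIS_URL are placeholders, so check .env.example for the real keys.

```python
from pathlib import Path

# EBAY_APP_ID and VINTED_SESSION_COOKIE are named in this README;
# MONGODB_URI and REDIS_URL are placeholder names (assumption).
REQUIRED_VARS = ("MONGODB_URI", "REDIS_URL", "EBAY_APP_ID", "VINTED_SESSION_COOKIE")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(values: dict[str, str], required=REQUIRED_VARS) -> list[str]:
    """Return required names that are absent or empty."""
    return [name for name in required if not values.get(name, "").strip()]


if __name__ == "__main__":
    env_file = Path(".env")
    found = parse_env(env_file.read_text(encoding="utf-8")) if env_file.exists() else {}
    for name in missing_vars(found):
        print(f"missing or empty: {name}")
```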

⚡ Running the Platform

To start the system, launch the backing storage instances, seed the database, and then start the pipeline workers.

Step 1: Run Services (Docker Compose)

Start MongoDB, Redis, and Mongo Express (web GUI) in detached mode:

docker-compose up -d

MongoDB Port: 27017
Redis Port: 6379
Mongo Express UI: http://localhost:8081 (Credentials: admin/pass)

Step 2: Seed the Site Configurations

Seed all 11 scraping platforms (Marktplaats + 10 new sites) into MongoDB:

python scripts/seed_sites.py

Step 3: Create an Admin Account

Create a default admin user for JWT-secured endpoints and WebSocket connections:

python scripts/create_admin.py --username admin --password adminpass

Step 4: Launch Celery Workers

Open two new terminals (with .venv activated) and launch one worker per queue:

HTTP Worker (Standard Scrapers: Marktplaats, eBay, Vinted, Subito, Willhaben, Allegro, Wallapop, Blocket, Delcampe):

celery -A celery_app worker -Q http --concurrency 10 --loglevel=info -P solo

Browser Worker (Playwright Emulation: Leboncoin, Catawiki):

celery -A celery_app worker -Q browser --concurrency 3 --loglevel=info -P solo

Note: -P solo is recommended when running Celery directly on Windows. The solo pool executes one task at a time, so --concurrency has no effect while it is set; on Linux or macOS, omit -P solo to get the stated concurrency.
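
The site-to-queue split used by the two workers can be sketched as a simple lookup. The site groupings come from the worker descriptions above; `queue_for` and the lowercase site keys are hypothetical, and the project may well store the queue name on each Site record instead.

```python
# Sites grouped by the worker that handles them, as listed above.
HTTP_SITES = frozenset({
    "marktplaats", "ebay", "vinted", "subito", "willhaben",
    "allegro", "wallapop", "blocket", "delcampe",
})
BROWSER_SITES = frozenset({"leboncoin", "catawiki"})


def queue_for(site: str) -> str:
    """Return the Celery queue name ("http" or "browser") for a site."""
    name = site.strip().lower()
    if name in BROWSER_SITES:
        return "browser"
    if name in HTTP_SITES:
        return "http"
    raise ValueError(f"unknown site: {site!r}")
```

With Celery, the returned name could then be passed as the `queue` option when dispatching a task, for example `some_task.apply_async(args=[...], queue=queue_for(site))`, where `some_task` is a placeholder for the project's scrape task.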

Step 5: Run the API Server

Start the FastAPI development server:

uvicorn main:app --reload --port 8000

Interactive Swagger API Docs: http://localhost:8000/docs
Base Endpoint Prefix: http://localhost:8000/api/v1

🔒 Authentication & Real-time WebSockets

JWT Authentication

All mutating endpoints (POST, PATCH, DELETE) require JWT authentication.

1. Authenticate via POST /api/v1/auth/login to obtain an access_token and its expires_in seconds (24h validity).
2. Pass the token as a Bearer token in the Authorization header: Authorization: Bearer <your_token>.
3. Refresh the token via POST /api/v1/auth/refresh.
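
The login-then-Bearer flow above can be sketched with the standard library alone. Several details here are assumptions: the login body is shown as JSON with `username` and `password` fields (the server may expect form data instead), and the `/searches` path in the usage note is only a placeholder. Check the Swagger docs for the actual request shapes.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"


def login_request(username: str, password: str) -> urllib.request.Request:
    """Build the login request. The JSON body shape is an assumption."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/auth/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def authorized_request(path, token, method="GET", payload=None):
    """Build a request carrying the JWT as a Bearer token."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"Authorization": f"Bearer {token}"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        f"{BASE_URL}{path}", data=data, headers=headers, method=method
    )


def send(request: urllib.request.Request):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running, `token = send(login_request("admin", "adminpass"))["access_token"]` would fetch a token, and `send(authorized_request("/searches", token, method="POST", payload={...}))` would issue an authenticated call to a hypothetical endpoint.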

WebSocket Live Event Stream

Connect to GET /ws/live?token={jwt} to listen to system events. The server validates the token on connection and closes with code 4001 if it is invalid. JSON events pushed in real time include:

* scrape_run_started: {type, site_id, site_name, search_id, run_id}
* scrape_run_finished: {type, site_id, site_name, run_id, status, new_count, duration_ms}
* new_listing: {type, listing_id, site_id, title, price, currency, url}
* scheduler_fired: {type, job_name, searches_dispatched}
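
A client consuming this stream mostly needs to branch on the event type. The sketch below shows one way to turn each message into a log line. It assumes the `type` field carries the event name exactly as listed above, and `describe_event` is a hypothetical helper; the WebSocket connection itself is left out.

```python
import json


def describe_event(raw: str) -> str:
    """Turn one JSON event from /ws/live into a one-line summary."""
    event = json.loads(raw)
    kind = event.get("type")  # assumed to hold the event name
    if kind == "scrape_run_started":
        return f"run {event['run_id']} started on {event['site_name']}"
    if kind == "scrape_run_finished":
        return (
            f"run {event['run_id']} {event['status']}: "
            f"{event['new_count']} new in {event['duration_ms']} ms"
        )
    if kind == "new_listing":
        return f"{event['title']} ({event['price']} {event['currency']}) {event['url']}"
    if kind == "scheduler_fired":
        return f"{event['job_name']} dispatched {event['searches_dispatched']} searches"
    return f"unhandled event type: {kind!r}"
```

Falling through to an "unhandled" line rather than raising keeps the client running if the server adds new event types later.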

🕷 Supported Scraping Engines

The platform includes 11 specialized scrapers normalized to the standard Listing model:

Marktplaats (Netherlands): HTML scraper using httpx.
eBay (Global): REST API utilizing the eBay Finding API. Requires EBAY_APP_ID.
Vinted (Global): Unofficial catalog REST API. Requires a valid session cookie supplied via VINTED_SESSION_COOKIE.
Leboncoin (France): Browser-based Playwright scraper. Uses stealth, random mouse movements, and post-load delays to bypass heavy bot detection.
Subito (Italy): HTML scraper using httpx.
Willhaben (Austria): HTML scraper using httpx.
Allegro (Poland): HTML scraper using httpx. Normalizes Polish Zloty (PLN) pricing.
Wallapop (Spain/Global): REST API search endpoint.
Blocket (Sweden): HTML scraper using httpx. Normalizes Swedish Krona (SEK) pricing.
Delcampe (Global): HTML scraper for collectibles.
Catawiki (Global): Browser-based Playwright scraper for active auctions.
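
The httpx-based scrapers are described earlier as retrying with exponential backoff. As a rough sketch of that technique, assuming a doubling delay with a cap and optional jitter (the real base, cap, and retry count are not stated in this README), the delay schedule and retry loop might look like this. `backoff_delays` and `call_with_retry` are hypothetical names.

```python
import random
import time


def backoff_delays(retries, base=1.0, factor=2.0, cap=30.0, jitter=0.0, rng=random):
    """Return `retries` delays: base * factor**n, capped, plus optional jitter."""
    delays = []
    for attempt in range(retries):
        delay = min(cap, base * factor ** attempt)
        if jitter:
            # Add up to `jitter` (a fraction) of the delay to spread out retries.
            delay += rng.uniform(0, jitter * delay)
        delays.append(delay)
    return delays


def call_with_retry(fetch, attempts=4, sleep=time.sleep, **backoff):
    """Call `fetch` up to `attempts` times, sleeping between failures."""
    delays = backoff_delays(attempts - 1, **backoff)
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            # A real worker would catch only transient transport errors.
            if attempt == attempts - 1:
                raise
            sleep(delays[attempt])
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; in a scraper, `fetch` would wrap the actual HTTP call.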

🧪 Pipeline & Parser Verification

We provide testing scripts to verify parser extraction accuracy, API endpoints, and WebSocket event integrations.

WebSocket Live Feed Test

Connects to the WebSocket server, triggers a manual search config, and streams live events for 60 seconds:

python scripts/test_websocket.py

Parser Extraction Test

Run the isolated Marktplaats parser script to check extraction output:

python scripts/test_marktplaats.py

Full Pipeline Test

Run the end-to-end task integration test:

python scripts/test_pipeline.py

Scheduler Verification

Run the standalone scheduler test to verify job registration, manual triggering, and ActivityLog persistence:

python scripts/test_scheduler.py

REST API Verification

Verify the FastAPI REST endpoints with the automated test suite:

python scripts/test_api.py

📂 Project Directory Structure

scraper-system/
├── scraper/
│   ├── api/             # FastAPI routers
│   │   └── routes/      # Endpoints (auth, searches, results, sites, logs, tasks)
│   ├── core/            # Infrastructure configurations (db, redis, settings, pubsub)
│   ├── models/          # Beanie ODM MongoDB Document definitions (user, site, listing, etc.)
│   ├── scrapers/        # Extraction engines (BaseScraper, 11 site scrapers)
│   ├── scheduler/       # Cron/APScheduler orchestration definitions
│   └── workers/         # Celery task definitions & scraper class registry
├── scripts/             # Script runners for seeding and standalone testing
├── celery_app.py        # Celery broker configuration & custom AsyncTask base
├── docker-compose.yml   # Multi-container orchestration config
├── main.py              # FastAPI application server entrypoint
├── pyproject.toml       # Pip packaging configurations & dependencies
└── README.md            # Documentation guide

👤 Developer & Website

Developed by Daniel Agbeni.
Check out my website and other projects at danielagbeni.uploaddoc.app.