Web Scraper Monitoring System

Python 3.12+ · FastAPI · Beanie ODM · Celery · Docker

A production-grade, distributed web scraping and monitoring system. Designed for high throughput and stealth, it incorporates asynchronous task queues, automatic rate throttling, proxy rotation, CAPTCHA integration, and automated database deduplication.


🏗 System Architecture

The project splits scraping work into two distinct streams based on the scraper's execution requirements (static parsing vs. dynamic browser emulation).

flowchart TD
    API[FastAPI Server] -->|CRUD Configs| Mongo[(MongoDB - Beanie ODM)]
    API -->|Trigger Search| RedisBroker[(Redis Broker)]

    subgraph Celery Task Pipeline
        RedisBroker -->|Read http Queue| WorkerHTTP[HTTP Worker - 10 Concurrency]
        RedisBroker -->|Read browser Queue| WorkerBrowser[Browser Worker - 3 Concurrency]

        WorkerHTTP -->|fetch_page| Web1[Target Site HTML]
        WorkerBrowser -->|fetch_browser Playwright| Web2[Target Site JS SPA]
    end

    WorkerHTTP -->|Check Deduplication| RedisCache[(Redis Deduplication)]
    WorkerBrowser -->|Check Deduplication| RedisCache

    WorkerHTTP -->|Insert Listings & ScrapeRuns| Mongo
    WorkerBrowser -->|Insert Listings & ScrapeRuns| Mongo

Key Components

  1. FastAPI Web Server: Exposes RESTful endpoints for CRUD operations on Sites, Searches, Results, and Activity Logs (FastAPI router definitions located in scraper/api/routes).
  2. Celery Worker Pipeline (celery_app.py):
     * http Queue: Tailored for fast, lightweight static HTTP clients utilizing httpx with exponential backoff and proxy/User-Agent rotation.
     * browser Queue: Orchestrates high-resource headless browser tasks using Playwright in stealth mode to bypass aggressive bot protection.
  3. Database Layer (MongoDB + Beanie ODM):
     * Automatically maintains indexes on startup (no Alembic migrations required).
     * TTL Automation: Un-favorited listings expire automatically after 3 days using a MongoDB Partial TTL Index on expires_at.
     * Logs Auto-cleanup: System activity logs are automatically purged after 30 days.
  4. Redis Cache Layer:
     * Functions as the Celery task broker and results backend.
     * Caches listing hashes for 3 days to avoid database lookups and prevent duplicate insertions.
  5. Daily Scheduler (APScheduler):
     * Runs in the FastAPI process space.
     * Automatically registers jobs on FastAPI startup using context manager lifespans.
     * Includes two primary jobs: daily_refresh (runs every night at midnight Amsterdam time to queue active searches) and retry_failed_runs (runs every 2 hours to retry failed scraping runs from the past 6 hours).

🛠 Prerequisites

Ensure you have the following installed on your machine:

* Python 3.12+ (fully compatible with Python 3.14)
* Docker & Docker Compose
* Google Chrome / Chromium (for Playwright browser tasks)


🚀 Getting Started

1. Environment Setup

Clone the repository and navigate to the project directory:

cd "Scraper System"

Create a virtual environment and activate it:

# Windows
python -m venv .venv
.venv\Scripts\activate

# Unix/macOS
python3 -m venv .venv
source .venv/bin/activate

2. Install Project Dependencies

Install the scraper module and all required libraries:

pip install .

Initialize standard Playwright browser binaries:

playwright install chromium

3. Configure Variables

Copy the template configurations to a local .env file:

# Windows
copy .env.example .env

# Unix/macOS
cp .env.example .env

Open .env and fill in the details (MongoDB, Redis connection strings, proxies, and API keys).
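Before starting the services, it can help to check that the expected variables are actually set. The sketch below is only an illustration: MONGODB_URI and REDIS_URL are hypothetical names (use the keys that appear in .env.example), while EBAY_APP_ID and VINTED_SESSION_COOKIE are the names this README mentions for the eBay and Vinted scrapers. It inspects the process environment and does not parse the .env file itself.

```python
import os
from collections.abc import Mapping

# Hypothetical names; replace them with the keys from .env.example.
REQUIRED_VARS = ("MONGODB_URI", "REDIS_URL")
# Named in this README as required by the eBay and Vinted scrapers.
OPTIONAL_VARS = ("EBAY_APP_ID", "VINTED_SESSION_COOKIE")


def missing_settings(env: Mapping[str, str] = os.environ) -> dict[str, list[str]]:
    """Report which variables are unset or blank, grouped by importance."""

    def unset(names: tuple[str, ...]) -> list[str]:
        return [name for name in names if not env.get(name, "").strip()]

    return {"required": unset(REQUIRED_VARS), "optional": unset(OPTIONAL_VARS)}


if __name__ == "__main__":
    report = missing_settings()
    for group, names in report.items():
        if names:
            print(f"missing {group} settings: {', '.join(names)}")
```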


⚡ Running the Platform

To start the system, launch the backing storage instances, seed the database, and then start the pipeline workers.

Step 1: Run Services (Docker Compose)

Start MongoDB, Redis, and Mongo Express (web GUI) in detached mode:

docker-compose up -d

Step 2: Seed the Site Configurations

Seed all 11 scraping platforms (Marktplaats + 10 new sites) into MongoDB:

python scripts/seed_sites.py

Step 3: Create an Admin Account

Create a default admin user for JWT-secured endpoints and WebSocket connections:

python scripts/create_admin.py --username admin --password adminpass

Step 4: Launch Celery Workers

Open two new terminals (with .venv activated) and launch the distinct worker queues:
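A sketch of the two worker invocations, assembled from details given elsewhere in this README: the app module celery_app, the queue names http and browser, and the concurrency values of 10 and 3 from the architecture diagram. The node names and log level are assumptions, so verify the output against the repository before relying on it.

```python
import shlex

# Queue name -> concurrency, as shown in the architecture diagram.
QUEUE_CONCURRENCY = {"http": 10, "browser": 3}


def worker_command(queue: str) -> list[str]:
    """Build the argv for one Celery worker bound to a single queue."""
    concurrency = QUEUE_CONCURRENCY[queue]
    return [
        "celery", "-A", "celery_app", "worker",
        "-Q", queue,                          # consume only this queue
        "--concurrency", str(concurrency),
        "-n", f"{queue}@%h",                  # node name (assumption)
        "--loglevel", "INFO",                 # log level (assumption)
    ]


if __name__ == "__main__":
    # Print one command per terminal.
    for queue in QUEUE_CONCURRENCY:
        print(shlex.join(worker_command(queue)))
```

Running the script prints one command line per queue, which can then be pasted into each terminal.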

Step 5: Run the API Server

Start the FastAPI development server:

uvicorn main:app --reload --port 8000

🔒 Authentication & Real-time WebSockets

JWT Authentication

All mutating endpoints (POST, PATCH, DELETE) require JWT authentication.

  1. Authenticate via POST /api/v1/auth/login to obtain an access_token and its lifetime in seconds, expires_in (24h validity).
  2. Pass the token as a Bearer token in the Authorization header: Authorization: Bearer <your_token>.
  3. Refresh the token via POST /api/v1/auth/refresh.
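The login step and the Bearer header can be sketched with the standard library as follows. The JSON body shape (username and password fields) and the localhost host name are assumptions; the endpoint path, the port, and the access_token field come from this README. The request is only constructed here, not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # port from the uvicorn command; host is an assumption


def build_login_request(username: str, password: str) -> urllib.request.Request:
    """Prepare POST /api/v1/auth/login (the JSON body shape is an assumption)."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/auth/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def auth_headers(login_response: dict) -> dict[str, str]:
    """Turn a decoded login response into the header for mutating calls."""
    return {"Authorization": f"Bearer {login_response['access_token']}"}
```

To send the request, pass it to `urllib.request.urlopen`, decode the JSON response, and hand the result to `auth_headers`.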

WebSocket Live Event Stream

Connect to GET /ws/live?token={jwt} to listen to system events. The server validates the token on connection, closing with code 4001 if invalid. Incoming JSON events pushed in real-time include:

* scrape_run_started: {type, site_id, site_name, search_id, run_id}
* scrape_run_finished: {type, site_id, site_name, run_id, status, new_count, duration_ms}
* new_listing: {type, listing_id, site_id, title, price, currency, url}
* scheduler_fired: {type, job_name, searches_dispatched}
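A client receiving these events might validate them as in the sketch below. It assumes each message is a flat JSON object whose type field matches one of the four event names and which carries at least the keys listed for that event; connection handling is omitted.

```python
import json

# Expected keys per event type, copied from the event list in this README.
EVENT_FIELDS = {
    "scrape_run_started": {"type", "site_id", "site_name", "search_id", "run_id"},
    "scrape_run_finished": {"type", "site_id", "site_name", "run_id",
                            "status", "new_count", "duration_ms"},
    "new_listing": {"type", "listing_id", "site_id", "title",
                    "price", "currency", "url"},
    "scheduler_fired": {"type", "job_name", "searches_dispatched"},
}


def parse_event(raw: str) -> dict:
    """Decode one WebSocket message and check it against EVENT_FIELDS."""
    event = json.loads(raw)
    if not isinstance(event, dict):
        raise ValueError("event must be a JSON object")
    kind = event.get("type")
    expected = EVENT_FIELDS.get(kind)
    if expected is None:
        raise ValueError(f"unknown event type: {kind!r}")
    missing = expected - event.keys()
    if missing:
        raise ValueError(f"{kind} event is missing: {sorted(missing)}")
    return event
```

Rejecting unknown or incomplete events early keeps downstream handlers free of defensive key checks.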


🕷 Supported Scraping Engines

The platform includes 11 specialized scrapers normalized to the standard Listing model:

  1. Marktplaats (Netherlands): HTML scraper using httpx.
  2. eBay (Global): REST API utilizing the eBay Finding API. Requires EBAY_APP_ID.
  3. Vinted (Global): Unofficial catalog REST API. Requires a session cookie supplied via VINTED_SESSION_COOKIE.
  4. Leboncoin (France): Browser-based Playwright scraper. Uses stealth, random mouse movements, and post-load delays to bypass heavy bot detection.
  5. Subito (Italy): HTML scraper using httpx.
  6. Willhaben (Austria): HTML scraper using httpx.
  7. Allegro (Poland): HTML scraper using httpx. Normalizes Polish Zloty (PLN) pricing.
  8. Wallapop (Spain/Global): REST API search endpoint.
  9. Blocket (Sweden): HTML scraper using httpx. Normalizes Swedish Krona (SEK) pricing.
  10. Delcampe (Global): HTML scraper for collectibles.
  11. Catawiki (Global): Browser-based Playwright scraper for active auctions.
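The list above implies a routing rule between the two Celery queues: the Playwright scrapers (Leboncoin and Catawiki) need the browser queue, and the rest can run on the http queue. A minimal sketch follows. The lowercase site slugs are hypothetical identifiers, and placing the REST API scrapers (eBay, Vinted, Wallapop) on the http queue is an assumption.

```python
# Hypothetical lowercase slugs for the 11 scrapers listed above.
BROWSER_SITES = frozenset({"leboncoin", "catawiki"})
HTTP_SITES = frozenset({
    "marktplaats", "ebay", "vinted", "subito", "willhaben",
    "allegro", "wallapop", "blocket", "delcampe",
})


def queue_for(site: str) -> str:
    """Return the Celery queue name ("http" or "browser") for a site."""
    slug = site.strip().lower()
    if slug in BROWSER_SITES:
        return "browser"
    if slug in HTTP_SITES:
        return "http"
    raise ValueError(f"unknown site: {site!r}")
```

Raising on unknown slugs, rather than defaulting to one queue, surfaces a missing registry entry as soon as a new site is added.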

🧪 Pipeline & Parser Verification

We provide testing scripts to verify parser extraction accuracy, API endpoints, and WebSocket event integrations.

WebSocket Live Feed Test

Connects to the WebSocket server, triggers a manual search config, and streams live events for 60 seconds:

python scripts/test_websocket.py

Parse Extractor Test

Run the isolated Marktplaats parser script to check extraction output:

python scripts/test_marktplaats.py

Full Pipeline Test

Run the end-to-end task integration test:

python scripts/test_pipeline.py

Scheduler Verification

Run the standalone scheduler test to verify job registration, manual triggering, and ActivityLog persistence:

python scripts/test_scheduler.py
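For reference, the selection rule behind retry_failed_runs (failed runs from the past 6 hours) can be sketched as below. `RunRecord` is a hypothetical stand-in for the ScrapeRun document, and the "failed" status value and the finished_at field name are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

RETRY_WINDOW = timedelta(hours=6)  # only runs from the past 6 hours are retried


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical stand-in for a ScrapeRun document."""
    run_id: str
    status: str
    finished_at: datetime


def runs_to_retry(runs: list[RunRecord], now: datetime) -> list[RunRecord]:
    """Keep failed runs that finished within RETRY_WINDOW before `now`."""
    cutoff = now - RETRY_WINDOW
    return [r for r in runs if r.status == "failed" and r.finished_at >= cutoff]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        RunRecord("recent-failure", "failed", now - timedelta(hours=1)),
        RunRecord("old-failure", "failed", now - timedelta(hours=7)),
        RunRecord("recent-success", "success", now - timedelta(hours=2)),
    ]
    print([r.run_id for r in runs_to_retry(sample, now)])
```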

REST API Verification

Verify the complete FastAPI REST API endpoints using the automated test suite:

python scripts/test_api.py

📂 Project Directory Structure

scraper-system/
├── scraper/
│   ├── api/             # FastAPI routers
│   │   └── routes/      # Endpoints (auth, searches, results, sites, logs, tasks)
│   ├── core/            # Infrastructure configurations (db, redis, settings, pubsub)
│   ├── models/          # Beanie ODM MongoDB Document definitions (user, site, listing, etc.)
│   ├── scrapers/        # Extraction engines (BaseScraper, 11 site scrapers)
│   ├── scheduler/       # Cron/APScheduler orchestration definitions
│   └── workers/         # Celery task definitions & scraper class registry
├── scripts/             # Script runners for seeding and standalone testing
├── celery_app.py        # Celery broker configuration & custom AsyncTask base
├── docker-compose.yml   # Multi-container orchestration config
├── main.py              # FastAPI application server entrypoint
├── pyproject.toml       # Pip packaging configurations & dependencies
└── README.md            # Documentation guide

👤 Developer & Website

Developed by Daniel Agbeni.
Check out my website and other projects at danielagbeni.uploaddoc.app.