6.3 Background Scheduler
The Background Scheduler is a dedicated service responsible for orchestrating the end-to-end data pipeline. It ensures that the ATLAS and OSM datasets are periodically synchronized, matched, and imported into the database without manual intervention.
Core Role
The scheduler automates the transition through the four main phases of the project:
1. Download and Process Data: fetching official ATLAS exports and OSM Overpass data.
2. Matching Process: running the multi-stage geospatial association logic.
3. Problem Detection: identifying data quality issues.
4. Import Process: rebuilding the `import_db` with fresh results.
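The sequencing of these phases can be sketched as a runner that executes them in order and stops at the first failure. The function names below are illustrative placeholders, not the project's actual API:

```python
import logging

logger = logging.getLogger("pipeline")

# Illustrative phase functions; the project's real implementations live in
# their own modules. Each returns True on success.
def download_and_process() -> bool:
    return True  # fetch ATLAS exports and OSM Overpass data

def run_matching() -> bool:
    return True  # multi-stage geospatial association

def detect_problems() -> bool:
    return True  # data quality checks

def rebuild_import_db() -> bool:
    return True  # rebuild import_db with fresh results

def run_pipeline() -> bool:
    """Run the four phases in order; stop at the first failure."""
    phases = [
        ("downloading", download_and_process),
        ("matching", run_matching),
        ("problems", detect_problems),
        ("importing", rebuild_import_db),
    ]
    for name, phase in phases:
        logger.info("starting phase: %s", name)
        if not phase():
            logger.error("phase failed: %s", name)
            return False
    return True
```

Running later phases only after earlier ones succeed is what lets the import step preserve the old database state on failure.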
Implementation Details
Service Architecture
The scheduler is implemented as an APScheduler (BlockingScheduler) instance running within a dedicated Docker container (scheduler).
- Entrypoint: `matching_and_import_db/scheduler/service.py`
- Logic Runner: `matching_and_import_db/scheduler/job_runner.py`
Redis Integration & Locking
To ensure system stability, the scheduler interacts with Redis for two critical functions:
- Distributed Lock: Before starting a run, the scheduler attempts to acquire a `pipeline_lock` in Redis. This prevents multiple triggers (e.g., a scheduled task and a manual `docker exec`) from running simultaneously and corrupting the data.
- Status Reporting: The scheduler publishes its current state (e.g., `downloading`, `matching`, `importing`) to Redis. The Flask web application consumes this data to display a real-time progress bar and status message to users.
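A lock like this is typically built on Redis's atomic `SET` with the `NX` (set only if absent) and `EX` (expiry) options. The sketch below uses a tiny in-memory stand-in so it is self-contained; in the service, the same `set`/`delete` calls would go to a real redis-py client, and the TTL value is an assumption for illustration:

```python
import time

class FakeRedis:
    """In-memory stand-in exposing the subset of the redis-py API used here."""
    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp or None)

    def set(self, key, value, nx=False, ex=None):
        entry = self._data.get(key)
        alive = entry is not None and (entry[1] is None or entry[1] > time.time())
        if nx and alive:
            return None  # key already exists: NX write is rejected
        self._data[key] = (value, time.time() + ex if ex is not None else None)
        return True

    def delete(self, key):
        self._data.pop(key, None)

def acquire_pipeline_lock(client, ttl_seconds=3600):
    """Try to take the lock; True means this process now owns the run.

    The TTL guards against a crashed run holding the lock forever.
    """
    return bool(client.set("pipeline_lock", "locked", nx=True, ex=ttl_seconds))

client = FakeRedis()
assert acquire_pipeline_lock(client)        # first trigger wins
assert not acquire_pipeline_lock(client)    # concurrent trigger is rejected
client.delete("pipeline_lock")              # released when the run finishes
```

Because `SET NX` is atomic on the Redis server, two triggers racing for the lock cannot both succeed.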
Configuration
The scheduler's behavior is controlled via environment variables in the scheduler service:
| Variable | Description | Default |
|---|---|---|
| `PIPELINE_SCHEDULE_INTERVAL_HOURS` | Interval between automatic runs, in hours | `24` |
| `PIPELINE_TIMEZONE` | Timezone used when computing the next run timestamp | `Europe/Zurich` |
| `PIPELINE_LOG_LEVEL` | Verbosity of the pipeline logs | `INFO` |
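At startup these variables might be read along the following lines (a sketch; the actual handling lives in the scheduler service, and `load_pipeline_config` is an illustrative name):

```python
import os

def load_pipeline_config(env=os.environ):
    """Read scheduler settings, falling back to the documented defaults."""
    return {
        "interval_hours": int(env.get("PIPELINE_SCHEDULE_INTERVAL_HOURS", "24")),
        "timezone": env.get("PIPELINE_TIMEZONE", "Europe/Zurich"),
        "log_level": env.get("PIPELINE_LOG_LEVEL", "INFO"),
    }

# With no overrides set, the documented defaults apply:
config = load_pipeline_config(env={})
```

Note that `PIPELINE_SCHEDULE_INTERVAL_HOURS` must parse as an integer; a non-numeric value would raise a `ValueError` at startup rather than silently misconfigure the interval.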
Operational Commands
Manual Trigger
You can force a pipeline run immediately by executing the job runner inside the running scheduler container:
```bash
docker compose exec scheduler python -m matching_and_import_db.scheduler.job_runner --mode full --trigger manual
```
Checking Status
The status can be checked via the API endpoint:
`GET /api/system/pipeline_status`
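The endpoint essentially echoes the state the scheduler published to Redis as JSON. The payload shape below is illustrative, not the service's documented schema, and `build_pipeline_status` is a hypothetical helper:

```python
import json

def build_pipeline_status(redis_values):
    """Assemble the status response from values the scheduler wrote to Redis.

    The exact field names are assumptions for illustration.
    """
    return {
        "state": redis_values.get("state", "idle"),  # e.g. downloading/matching/importing
        "progress": redis_values.get("progress"),    # fraction driving the progress bar
        "message": redis_values.get("message", ""),
    }

payload = build_pipeline_status({"state": "matching", "progress": 0.4})
print(json.dumps(payload))
```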
Error Handling
If a phase fails (e.g., a network timeout during OSM download), the scheduler:
- Logs the traceback to `stdout`.
- Updates the Redis status to `failure` with the error message.
- Releases the distributed lock so subsequent runs can still execute.
- Retains the old database state (since the `import` phase is only reached after successful matching).
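Put together, this failure behaviour amounts to a try/except/finally around the run. The sketch below uses a plain dict as a stand-in for the Redis status keys and a callback for the lock release; the real service writes to Redis instead:

```python
import sys
import traceback

def run_with_error_handling(run, status_store, release_lock):
    """Execute one pipeline run, reporting failure and always freeing the lock."""
    try:
        run()
        status_store["state"] = "success"
    except Exception as exc:
        traceback.print_exc(file=sys.stdout)  # traceback goes to stdout
        status_store["state"] = "failure"     # surfaced to the web UI via Redis
        status_store["message"] = str(exc)
        # The import phase was never reached, so the old database state survives.
    finally:
        release_lock()                        # subsequent runs can still execute

def failing_run():
    raise TimeoutError("OSM download timed out")

status = {}
released = []
run_with_error_handling(failing_run, status, lambda: released.append(True))
```

Releasing the lock in `finally` is the key detail: even an unexpected exception cannot leave the system in a state where no future run can start.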