7.3 Background Scheduler
The Background Scheduler is a dedicated Docker service that automatically runs the full data pipeline at a configurable interval. It coordinates downloads, matching, problem detection, and database import — then publishes progress and timing metadata so the web app can keep users informed.
Architecture Overview
The scheduler lives in its own container (scheduler) and is composed of two Python modules:
| Module | Role |
|---|---|
matching_and_import_db/scheduler/service.py |
Owns the APScheduler loop, computes the next-run timestamp, and delegates work to the job runner |
matching_and_import_db/scheduler/job_runner.py |
Executes the pipeline phases in order, manages the distributed lock, probes source freshness, and reports progress to the shared pipeline-state backend |
The app container never runs the pipeline itself. It only reads pipeline status from the shared pipeline-state store to drive the UI (navbar timestamp, maintenance overlay, progress bar).
Run Types
Run types are defined in matching_and_import_db/scheduler/job_types.py and selected by job_runner.py.
The job runner can execute three refresh scopes:
| Run type | Meaning |
|---|---|
complete |
Run preprocessing and rewrite the full import schema |
atlas_cached |
Reuse static ATLAS/GTFS preprocessing outputs and rewrite only the dynamic OSM + match-dependent tables |
atlas_cached_bootstrap |
ATLAS/GTFS sources are unchanged, but required static import tables are missing/empty; skip source download and rewrite static + dynamic import tables to bootstrap the DB |
PIPELINE_FORCE_FULL_REFRESH=1 forces the broader complete table rewrite even when the ATLAS and GTFS validators are unchanged.
Pipeline Phases
When a run is triggered (scheduled or manual), the job runner walks through these phases in sequence. Each phase publishes its name and a human-readable message to the shared pipeline-state backend so the web app can display live progress.
Only the import phase sets blocking_maintenance = true. During that window the web app shows a full-screen overlay to users. All other phases run silently in the background.
The match-import mode skips all download-related phases and jumps straight to matching, reusing whatever data files are already on disk.
Source freshness gate
Before re-running ATLAS and GTFS preprocessing, the scheduler probes the two permalinks with a tiny ranged GET (Range: bytes=0-0). This avoids downloading the full file while still reaching the final object response, which exposes file-level validators such as ETag and Last-Modified.
If both the ATLAS and GTFS validators match the last successfully preprocessed values stored in data/data_meta.json, the scheduler skips the get_atlas_data subprocess and reuses the existing preprocessed artifacts. If the probe fails or either validator changed, the scheduler fails open and runs preprocessing.
See 7.4 Atlas-Cached Import Optimization for the table scope used when those ATLAS/GTFS artifacts are reused.
Timing Model
How APScheduler's IntervalTrigger works
The scheduler uses IntervalTrigger(hours=N) which creates a fixed-cadence clock anchored to the moment the job is first registered. Fire times are computed independently of how long each run takes:
Container starts at T₀
First fire: T₀ + N hours
Second fire: T₀ + 2N hours
Third fire: T₀ + 3N hours
...
This means the interval is measured start-to-start on a fixed grid, not end-to-end and not relative to when a run finishes.
What if a run is still going when the next fire time arrives?
Two safeguards prevent overlapping runs:
| Setting | Effect |
|---|---|
max_instances=1 |
APScheduler will not launch a second instance of the job while the first is still running. The fire is simply skipped. |
coalesce=True |
If multiple fires were missed (e.g. because a run was very long), they collapse into a single execution rather than queuing up a burst. |
Practical example
With PIPELINE_SCHEDULE_INTERVAL_HOURS=24 and a pipeline that takes ~10 minutes:
T₀ : container starts
T₀ + 24h : first run starts, finishes at ~T₀ + 24h 10m
T₀ + 48h : second run starts
...
The 10 minutes of execution time do not push the next fire later. The schedule stays on the original 24-hour grid.
What happens when the container is stopped and restarted?
The scheduler uses an in-memory BlockingScheduler with no persistent job store. All scheduling state is lost when the container stops. On restart:
- [
service.py](https://github.com/openTdataCH/stop_sync_osm_atlas/blob/main/matching_and_import_db/scheduler/service.py) → main()runs fresh. - A new
IntervalTriggeris created → the first fire is scheduled atnow + N hours. - Any runs that were "supposed" to happen while the container was down are silently lost.
Key insight: if the container was down when a run was due, that run is lost. The clock resets from the moment the container starts, not from any previously scheduled time. In the diagram above, the 24:00 run never happens; the next actual run is at 40:00 (16:00 + 24h).
Distributed Locking
Before executing any pipeline phase, the job runner acquires a shared pipeline lock. In the default Docker configuration this uses a Redis lock (SET NX EX). When STATE_BACKEND=file, it uses a file-backed JSON lock protected with fcntl.flock(). This prevents two triggers — for example a scheduled run and a manual docker exec — from running simultaneously.
Lock heartbeat
Long-running pipelines could outlive the lock TTL. To prevent this, a background daemon thread (_RunLockHeartbeat) periodically refreshes the lock's expiry while the run is active:
| Variable | Purpose | Default |
|---|---|---|
PIPELINE_LOCK_TTL_SECONDS |
Initial (and refresh) TTL for the lock | 14400 (4 hours) |
PIPELINE_LOCK_HEARTBEAT_SECONDS |
How often the heartbeat thread refreshes | TTL ÷ 4 (clamped to 5–60s) |
If the scheduler crashes without releasing the lock, the lock auto-expires after PIPELINE_LOCK_TTL_SECONDS seconds. Additionally, stale-lock recovery only clears a lock when the persisted status is newer than the lock itself, which avoids a race where a second trigger could delete a freshly acquired lock before start_run() updates the status.
Refresh-Scope Reporting
At the start of each run, the job runner publishes:
run_typerefresh_scope_tables_rewrittenrefresh_scope_tables_reused
These values are stored both in the shared pipeline-status backend and in data/data_meta.json when the run finishes successfully. That makes it visible whether the last import was a full rewrite or an atlas_cached refresh.
Status Reporting and the "Next Run" Timestamp
How status flows to the UI
The browser polls /api/system/pipeline_status every 10 seconds when idle and every 1.5 seconds while a run is active. The response drives:
- Navbar label: shows "Data updated: YYYY-MM-DD HH:MM" or "Pipeline running in the background"
- Info tooltip: shows the next scheduled run time (ℹ️ icon)
- Maintenance overlay: full-screen blocker during the
importphase - Progress timer: elapsed time and ETA during maintenance
When next_run_at is updated
The _update_next_run_timestamp() function reads job.next_run_time from APScheduler and writes it to the shared state store. It is called at four points:
- On startup — right after the job is registered, before
scheduler.start(). - When the scheduler fires the
EVENT_SCHEDULER_STARTEDevent — to ensure Redis has the correct value even if the scheduler loop takes a moment to begin. - Before each run — inside
_scheduled_job(), so the UI shows the next fire time while the current run is in progress. - After each run — so the timestamp reflects the updated schedule after the run completes.
Configuration
All values are set via environment variables on the scheduler service in docker-compose.yml, with optional overrides in .env:
| Variable | Description | Default |
|---|---|---|
PIPELINE_SCHEDULE_INTERVAL_HOURS |
Hours between automatic runs | 24 |
PIPELINE_TIMEZONE |
Timezone for fire-time computation and display | Europe/Zurich |
PIPELINE_LOG_LEVEL |
Log verbosity (DEBUG, INFO, WARNING, etc.) |
INFO |
PIPELINE_IMPORT_ETA_SECONDS |
Estimated duration of the import phase (drives the UI countdown) | 150 |
PIPELINE_LOCK_TTL_SECONDS |
Pipeline lock TTL | 14400 |
PIPELINE_LOCK_HEARTBEAT_SECONDS |
Lock refresh interval | TTL ÷ 4 |
STATE_BACKEND |
Shared runtime-state backend (redis or file) |
redis in Docker |
STATE_REDIS_URL |
Redis URL for shared runtime state | redis://redis:6379/0 |
STATE_DIR |
Directory used by the file-backed runtime-state backend | data/runtime |
Operational Commands
Manual trigger
Force a full pipeline run immediately by executing the job runner inside the scheduler container:
docker compose exec scheduler python -m matching_and_import_db.scheduler.job_runner --mode full --trigger manual
To re-run matching and import without re-downloading data:
docker compose exec scheduler python -m matching_and_import_db.scheduler.job_runner --mode match-import --trigger manual
Both commands respect the distributed lock — if a scheduled run is already in progress, the manual trigger will log "Another pipeline run is already active" and exit with code 2.
Checking status
GET /api/system/pipeline_status
Returns the full status object including status, phase, message, blocking_maintenance, next_run_at, data_updated_at, and more.
Error Handling
If any phase fails (e.g. a network timeout during OSM download, a matching assertion, or a database error):
- The traceback is logged to
stdout(visible viadocker compose logs scheduler). - Redis status is set to
{ status: "failed", phase: "failed", last_error: "..." }. - The distributed lock is released in the
finallyblock, so subsequent runs are not blocked. - Data Integrity:
- Database: The previous state is preserved. The destructive
TRUNCATE + INSERTonly happens in theimportphase, which is only reached if all downloads and matching succeed. - Files: Raw data files are protected. Downloads use atomic writes or
.partfiles, ensuring that a failed download does not leave a corrupted file that would break the next run.
- Database: The previous state is preserved. The destructive
- UI Stability: The "Data updated" timestamp in the navbar remains unchanged. It only refreshes at the very end of a successful run, so it always accurately reflects the freshness of the data currently in the database.
- Self-Healing: The scheduler loop continues running. The next fire at
T₀ + Nwill automatically attempt a fresh run from scratch.
Key Source Files
| File | Description |
|---|---|
service.py |
APScheduler setup, interval trigger, next-run publishing |
job_runner.py |
Pipeline orchestration, phase sequencing, lock management |
job_types.py |
Canonical pipeline run-type enum (complete, atlas_cached, atlas_cached_bootstrap) |
pipeline_status.py |
Public status + lock API backed by the selected pipeline-state store |
pipeline_state_store.py |
Redis, file, and memory implementations for shared pipeline state |
source_freshness.py |
HTTP-validator probing for ATLAS and GTFS permalinks |
pipeline-status.js |
Browser-side polling, navbar/overlay updates |
navbar.html |
Server-rendered "Data updated" label + next-run tooltip |
docker-compose.yml |
Container definition, env vars, depends_on ordering |