5.1 Import Process

The import process turns matching output plus processed CSV artifacts into the PostgreSQL schema that powers the web application.

The scheduled flow is split into two broad phases:

precompute phase outside the maintenance window
write phase inside the maintenance window

The goal is to keep the blocking phase limited to TRUNCATE plus bulk INSERT work. The same precompute/write functions are also used when matching_and_import_db/database/importer.py is run directly.

High-Level Flow

flowchart TB A[run_matching output] --> B[precompute_problem_artifacts] A --> C[precompute_route_artifacts] B --> D[build_fast_insert_payloads] C --> D E[GTFS cache CSVs or GTFS rebuild] --> D D --> F[import_to_database] F --> G[record import timestamp] G --> H[stats finalization]

Precompute Phase

The scheduler runs these steps before entering the maintenance window.

1. Problem precompute

precompute_problem_artifacts() builds the shared ProblemContext and attaches the stop-level ProblemResult objects needed later by the importer.

2. Route precompute

precompute_route_artifacts():

loads the processed route CSVs
collects the ATLAS SLOIDs that will actually exist in the import
calls build_route_write_payload()

That route payload already contains:

source ATLAS line-family rows (atlas_line_families)
source route relation rows used by the web and stats (osm_route_relations)
normalized line_families, itineraries, and stop_calls
line_family_matches
itinerary_matches

3. GTFS identity payloads

_build_gtfs_insert_payloads() prepares the raw GTFS stop rows and GTFS-to-ATLAS identity rows while build_fast_insert_payloads() is assembling the final write payload. It uses the cached import artifacts when both files are present:

data/processed/gtfs_stops_raw.csv
data/processed/gtfs_stop_identity_resolution.csv

If either cache file is missing, it rebuilds from data/raw/stops_ATLAS.csv plus data/raw/gtfs/.

The rebuild path reuses the canonical matcher from matching_and_import_db/downloader/get_atlas_gtfs.py, rewrites data/gtfs_atlas_stats.json, and refreshes the cache CSVs.

4. Build insert payloads

build_fast_insert_payloads() converts everything into plain row dictionaries.

The payload is split conceptually into:

static rows reused by atlas_cached runs: ATLAS/GTFS tables
dynamic rows rewritten every run: OSM tables, stops_matched, problems, osm_route_relations, normalized route tables, and route match tables
meta values such as matched_routes and no_nearby_osm_sloids

Write Phase

import_to_database() performs the blocking database work.

Refresh-scope selection

Before truncation, the importer chooses the table set from get_refresh_scope_tables(run_type):

complete rewrites everything
atlas_cached_bootstrap also rewrites the full table set, but is used specifically to rebuild missing or empty ATLAS/GTFS static tables from cached preprocessing artifacts
atlas_cached rewrites only the dynamic tables and validates that the required static ATLAS/GTFS tables already exist

Bulk insert order

The write order is intentionally staged:

osm_nodes
osm_stops
osm_stop_members
static stop/GTFS tables in complete and atlas_cached_bootstrap modes only:
- atlas_operators
- atlas_stops
- gtfs_stops_raw
- gtfs_stop_identity_resolution
stops_matched
problems after the new stops_matched.id values are known
static ATLAS route source table in complete and atlas_cached_bootstrap modes only:
- atlas_line_families
dynamic OSM route source table:
- osm_route_relations
normalized route tables:
- line_families
- itineraries
- stop_calls
route match tables:

line_family_matches
itinerary_matches

The importer commits in batches using DB_IMPORT_BATCH_SIZE.

export DB_IMPORT_BATCH_SIZE=5000

Synthetic OSM Route Nodes

Before route rows are inserted, build_fast_insert_payloads() checks whether any route stop rows reference OSM node IDs that are not already present in the stop-matching output. If they do, it creates minimal synthetic osm_nodes rows so the route tables can keep referential integrity.

Stats Finalization

After the write phase, the scheduler records the completed import timestamp and then runs stats finalization. The detailed stats sources, timestamp fields, and regeneration behavior are documented in 5.2 Stats and timestamps.

Running the Import Directly

Environment Variables

Variable	Description	Default
`DATABASE_URI`	Import database connection string	`postgresql+psycopg://stops_user:1234@localhost:5432/import_db`
`DB_IMPORT_BATCH_SIZE`	Records per batch commit	`10000`
`ATLAS_STOPS_CSV`	Override used when GTFS cache files must be rebuilt	`data/raw/stops_ATLAS.csv`
`GTFS_FOLDER`	Extracted GTFS folder used when GTFS cache files must be rebuilt	`data/raw/gtfs`

Command

export DATABASE_URI="postgresql://user:pass@localhost/stopdb"
python [matching_and_import_db/database/importer.py](https://github.com/openTdataCH/stop_sync_osm_atlas/blob/main/matching_and_import_db/database/importer.py)

Persistence Behavior

The importer does not preserve manual database edits between runs. The import database is expected to be reproducible from the current files and matching output.

Generating Report

Documentation