5.1 Import Process

The import process turns matching output plus processed CSV artifacts into the PostgreSQL schema that powers the web application.

The scheduled flow is split into two broad phases:

  • precompute phase outside the maintenance window
  • write phase inside the maintenance window

The goal is to keep the blocking phase limited to TRUNCATE plus bulk INSERT work. The same precompute/write functions are also used when matching_and_import_db/database/importer.py is run directly.

High-Level Flow

flowchart TB A[run_matching output] --> B[precompute_problem_artifacts] A --> C[precompute_route_artifacts] B --> D[build_fast_insert_payloads] C --> D E[GTFS cache CSVs or GTFS rebuild] --> D D --> F[import_to_database] F --> G[record import timestamp] G --> H[stats finalization]

Precompute Phase

The scheduler runs these steps before entering the maintenance window.

1. Problem precompute

precompute_problem_artifacts() builds the shared ProblemContext and attaches the stop-level ProblemResult objects needed later by the importer.

2. Route precompute

precompute_route_artifacts():

  1. loads the processed route CSVs
  2. collects the ATLAS SLOIDs that will actually exist in the import
  3. calls build_route_write_payload()

That route payload already contains:

  • source ATLAS line-family rows (atlas_line_families)
  • source route relation rows used by the web and stats (osm_route_relations)
  • normalized line_families, itineraries, and stop_calls
  • line_family_matches
  • itinerary_matches

3. GTFS identity payloads

_build_gtfs_insert_payloads() prepares the raw GTFS stop rows and GTFS-to-ATLAS identity rows while build_fast_insert_payloads() is assembling the final write payload. It uses the cached import artifacts when both files are present:

  1. data/processed/gtfs_stops_raw.csv
  2. data/processed/gtfs_stop_identity_resolution.csv

If either cache file is missing, it rebuilds from data/raw/stops_ATLAS.csv plus data/raw/gtfs/.

The rebuild path reuses the canonical matcher from matching_and_import_db/downloader/get_atlas_gtfs.py, rewrites data/gtfs_atlas_stats.json, and refreshes the cache CSVs.

4. Build insert payloads

build_fast_insert_payloads() converts everything into plain row dictionaries.

The payload is split conceptually into:

  • static rows reused by atlas_cached runs: ATLAS/GTFS tables
  • dynamic rows rewritten every run: OSM tables, stops_matched, problems, osm_route_relations, normalized route tables, and route match tables
  • meta values such as matched_routes and no_nearby_osm_sloids

Write Phase

import_to_database() performs the blocking database work.

Refresh-scope selection

Before truncation, the importer chooses the table set from get_refresh_scope_tables(run_type):

  • complete rewrites everything
  • atlas_cached_bootstrap also rewrites the full table set, but is used specifically to rebuild missing or empty ATLAS/GTFS static tables from cached preprocessing artifacts
  • atlas_cached rewrites only the dynamic tables and validates that the required static ATLAS/GTFS tables already exist

Bulk insert order

The write order is intentionally staged:

  1. osm_nodes
  2. osm_stops
  3. osm_stop_members
  4. static stop/GTFS tables in complete and atlas_cached_bootstrap modes only:
    • atlas_operators
    • atlas_stops
    • gtfs_stops_raw
    • gtfs_stop_identity_resolution
  5. stops_matched
  6. problems after the new stops_matched.id values are known
  7. static ATLAS route source table in complete and atlas_cached_bootstrap modes only:
    • atlas_line_families
  8. dynamic OSM route source table:
    • osm_route_relations
  9. normalized route tables:
    • line_families
    • itineraries
    • stop_calls
  10. route match tables:
  • line_family_matches
  • itinerary_matches

The importer commits in batches using DB_IMPORT_BATCH_SIZE.

export DB_IMPORT_BATCH_SIZE=5000

Synthetic OSM Route Nodes

Before route rows are inserted, build_fast_insert_payloads() checks whether any route stop rows reference OSM node IDs that are not already present in the stop-matching output. If they do, it creates minimal synthetic osm_nodes rows so the route tables can keep referential integrity.

Stats Finalization

After the write phase, the scheduler records the completed import timestamp and then runs stats finalization. The detailed stats sources, timestamp fields, and regeneration behavior are documented in 5.2 Stats and timestamps.

Running the Import Directly

Environment Variables

Variable Description Default
DATABASE_URI Import database connection string postgresql+psycopg://stops_user:1234@localhost:5432/import_db
DB_IMPORT_BATCH_SIZE Records per batch commit 10000
ATLAS_STOPS_CSV Override used when GTFS cache files must be rebuilt data/raw/stops_ATLAS.csv
GTFS_FOLDER Extracted GTFS folder used when GTFS cache files must be rebuilt data/raw/gtfs

Command

export DATABASE_URI="postgresql://user:pass@localhost/stopdb"
python [matching_and_import_db/database/importer.py](https://github.com/openTdataCH/stop_sync_osm_atlas/blob/main/matching_and_import_db/database/importer.py)

Persistence Behavior

The importer does not preserve manual database edits between runs. The import database is expected to be reproducible from the current files and matching output.

Related Documentation

Data update in progress
Elapsed: -- ETA: -- Phase: idle