GTFS ATLAS Data

The GTFS integration stage produces two complementary output families:

  1. data/processed/atlas_routes_gtfs.csv: A stop-level GTFS route sidecar keyed by sloid, used by AtlasState for stop-level route matching and route stats/UI helpers.
  2. data/processed/atlas_routes.csv, atlas_route_directions.csv, and atlas_route_stops.csv: Entity-first GTFS route tables used during route import and route-route linking.

Together, these files map ATLAS stops to their associated transit routes and directions using GTFS (General Transit Feed Specification) timetable data.

GTFS provides the underlying data that allows us to know which routes (e.g., "Tram 11") serve which stops, and in which direction (e.g., "towards Rehalp").

Statistics

We track how many ATLAS stops are successfully matched to GTFS routes.

  • Total ATLAS Stops:
  • Stops Matched to GTFS:
  • Coverage: %

We also track how the GTFS stop_ids map to the ATLAS sloids:

  • Total GTFS Stops Processed: 50,790
  • Matched GTFS Stops: 19,911 (39.2%)
  • Unmatched GTFS Stops: 30,879
  • Strict Matching Assignments: 26,941
  • Unique-Number Fallback Assignments: 2,277

GTFS File Structure

Before processing the data, it is helpful to understand the core structure of a GTFS feed. A GTFS feed is a collection of CSV files (with a .txt extension) zipped together. The files we use to build route associations are:

  1. stops.txt: Contains the physical locations of transit stops. Each stop has a unique stop_id (e.g., 8503000:0:1 for a specific platform), coordinates, and names.
  2. routes.txt: Contains the definition of transit lines, such as "Tram 11" or "S12". Each route has a unique route_id, a route_short_name, and optional route metadata such as route_long_name, route_desc, and route_type.
  3. trips.txt: Defines individual journeys along a route. It links a specific trip_id to a route_id and includes direction information (direction_id). The file also contains fields like trip_headsign, but the current route export focuses on route/direction linkage.
  4. stop_times.txt: This is the connective tissue. It defines the exact schedule. Each row links a trip_id to a stop_id and provides the arrival/departure times, as well as the order the stop appears in the trip (stop_sequence).

Example of stop_times.txt:

trip_id arrival_time departure_time stop_id stop_sequence
trip_1001 08:00:00 08:00:00 8503000:0:1 1
trip_1001 08:05:00 08:05:00 8503001:0:2 2
trip_1001 08:10:00 08:10:00 8503002:0:1 3

To know what route serves a physical stop_id, we must join these tables: stop_times.txt -> trips.txt -> routes.txt.

Entity Relationship Diagram

Here is a conceptual database diagram illustrating how the GTFS files relate to each other, and how they ultimately map to the ATLAS stops data:

erDiagram ATLAS_STOPS ||--o{ GTFS_STOPS : "matches via stop rules" GTFS_STOPS ||--o{ GTFS_STOP_TIMES : "contains" GTFS_TRIPS ||--o{ GTFS_STOP_TIMES : "contains" GTFS_ROUTES ||--o{ GTFS_TRIPS : "contains" ATLAS_STOPS { string sloid PK string number string designation } GTFS_STOPS { string stop_id PK string stop_name float stop_lat float stop_lon } GTFS_STOP_TIMES { string trip_id FK string stop_id FK int stop_sequence } GTFS_TRIPS { string trip_id PK string route_id FK int direction_id } GTFS_ROUTES { string route_id PK string route_short_name string route_long_name }

The Challenge

The Swiss GTFS dataset is massive. The stop_times.txt file, which links stops to trips, contains over 25 million rows. Loading this entire dataset into memory using standard tools like pandas would require excessive RAM (2.5GB+).

To handle this efficiently, we use a streaming approach that reads stop_times.txt in chunks (e.g., 500,000 rows at a time), keeping only the data relevant to Swiss stops.

Processing Pipeline

The Entity Relationship diagram above shows a chain of joins: stopsstop_timestripsroutes. In a standard database, you would join all these tables together. However, because stop_times.txt is too large for memory, we perform this "chaining join" virtually while streaming, using pre-loaded lookup dictionaries.

flowchart TD subgraph Inputs S["stops.txt"] ST["stop_times.txt"] T["trips.txt"] R["routes.txt"] end subgraph Step1["1. Filter Swiss Stops"] F1["Filter stops.txt<br/>Keep only Swiss stops (UIC 85...)<br/>inside CH border"] end subgraph Step2["2. Pre-load Trip Lookups"] F1B["Load trips.txt in full<br/>Build trip→route and<br/>trip→direction lookup maps"] end subgraph Step3["3. Stream stop_times (The 'Join')"] F2["Read next chunk<br/>(500k rows)"] F3["Keep rows matching Swiss stops"] F4["Per chunk, apply lookup maps:<br/>stop_id → trip_id → route_id<br/>Output: Unique (stop_id, route_id, direction_id)"] F4B["Global reduce:<br/>Find true first/last stop per trip"] end subgraph Step4["4. Route Metadata & Directions"] F5["Load routes.txt<br/>Join route metadata (names)"] F6["Build direction strings<br/>(First Stop → Last Stop)"] end subgraph Step5["5. Match to ATLAS"] M1["Match GTFS stop_id<br/>to ATLAS sloid"] end S --> F1 T --> F1B ST --> F2 R --> F5 F1 -.->|DataFrame of<br/>Swiss stop_ids| F3 F1B -.->|Dicts: trip_id→route_id<br/>trip_id→direction_id| F4 F2 --> F3 F3 --> F4 F4 -->|Loop until EOF| F2 F4 -->|Once all chunks processed<br/>List of unique stop-route-dir triples| F4B F4B -->|List of unique routes and termini| Step4 Step4 -->|DataFrame: stop_id, route_id, metadata| Step5

Detailed Steps & Data Structures

To understand how the "chaining joins" (stopsstop_timestripsroutes) happen without loading everything into memory, here is the breakdown of the data structures at each step:

1. Filter Swiss Stops

Goal: Reduce the number of stops we care about.
Action: Load stops.txt and keep only geographically Swiss stops.
Data Structure: A pandas DataFrame of swiss_stops containing stop_id, stop_name, stop_lat, and stop_lon. (e.g. ['8503000:0:1', 'Zürich HB', ...])

2. Pre-load Trip Lookups

Goal: Prepare for the tripsroutes join.
Action: Load trips.txt completely. Instead of a bulky table, we create two lightweight Python dictionary maps (pandas Series) to instantly look up a route or direction for any given trip.
Data Structure:

  • route_by_trip map: {'trip_1001': 'route_11', ...}
  • dir_by_trip map: {'trip_1001': '0', ...}

3. Stream stop_times (The "Join" Step)

Goal: Perform the stopsstop_timestrips join in a memory-efficient way.
Action: We read stop_times.txt in chunks of 500,000 rows. For each chunk:

  1. Filter (stopsstop_times): Discard any row where the stop_id is not in our swiss_stops DataFrame from Step 1.
  2. Join (stop_timestripsroutes): For the surviving rows, we have a stop_id and a trip_id. We immediately use our maps from Step 2 to find the route_id and direction_id.
  3. Collect: We save the unique combinations of (stop_id, route_id, direction_id).
    Data Structure Output: A concise DataFrame listing every unique route and direction serving a stop: [['8503000:0:1', 'route_11', '0'], ...].

4. Fetch Route Metadata & Finalize Directions

Goal: Now that we know exactly which route_ids serve our stops, we can safely load their metadata (like "Tram 11").
Action:

  1. We load routes.txt and filter it to only the route_ids we discovered in Step 3.
  2. We join this route metadata (like route_short_name) onto our main DataFrame.
  3. We format the "first and last stops" we tracked during the stream into a readable string (e.g., Zürich HB → Bern).
    Data Structure Output: A DataFrame containing stop_id, route_id, route_short_name, direction_id, and direction string.

5. Match GTFS to ATLAS (stop_idsloid)

Goal: Link the physical GTFS platform to our ATLAS database.
Action: Apply matching algorithms to connect the GTFS stop_id (e.g., 8503000:0:1) to the ATLAS sloid (e.g., ch:1:sloid:8503000:1).
Data Structure Output: The final joined table linking ATLAS stops to Route metadata.

Strategy Description
1. Strict Match Parse the GTFS stop_id into uic_number and local_ref, normalize special platform codes (10000 -> 1, 10001 -> 2), then match (uic_number, normalized_local_ref) against ATLAS (number, designation).
2. Unique-Number Fallback Only for GTFS stops that failed strict matching: if the ATLAS number appears exactly once, assign that single sloid.

GTFS Route Identifiers

A Note on GTFS route_ids

The route_id is an internally generated ID specific to the system creating the timetable data. In Switzerland, it follows a strict naming convention:
<Betriebszweig>-<Liniennummer>-<Projektkurzbezeichnung>-<Linienversionsnummer>
(Branch of operation) - (Line number) - (Project short designation) - (Line version number)

Normalization Strategy

In Switzerland's GTFS data, the route_id format includes a <Projektkurzbezeichnung> (project short designation) that denotes the specific timetable year (e.g., -j24, -j25, -j26). This means a physical route gets a new ID every year.

If an OSM mapper entered gtfs:route_id=11-T-j24-1 in 2024, it wouldn't match the backend's 2026 data (11-T-j26-1). The route_id_normalized field replaces these year-specific codes with -jXX (e.g., 11-T-jXX-1), allowing historical OSM tags to successfully match against current datasets.

Outputs

The pipeline materializes two GTFS-derived output families in data/processed/.

1. atlas_routes_gtfs.csv

This is the stop-level route sidecar consumed by AtlasState.from_dataframe() for the stop-level route predicate.

Column Description
sloid ATLAS unique identifier
route_id GTFS route ID
route_id_normalized Simplified -jXX version
route_name_short Short route name
route_name_long Long route name
direction_id Direction ID (0 or 1)
direction_name Derived direction label (e.g. Zürich HB → Bern)

2. atlas_routes.csv

Defines the route entities for the normalized route import path.

Column Description
run_id Pipeline run timestamp
route_id GTFS route ID
route_id_normalized Simplified -jXX version
agency_id GTFS Agency ID
route_short_name Short route name
route_long_name Long route name
route_desc Route description
route_type Transport mode

3. atlas_route_directions.csv

Defines the directions for each route.

Column Description
route_id Foreign key to atlas_routes.csv
direction_id Direction ID (0 or 1)
representative_headsign Reserved headsign field for downstream route UI/import use
direction_label Derived direction name (e.g. Zürich HB → Bern)
trip_count Placeholder count currently emitted as 0

4. atlas_route_stops.csv

Maps individual stops to the routes and directions.

Column Description
route_id Foreign key to atlas_routes.csv
direction_id Direction ID
sloid ATLAS unique identifier
stop_sequence Sequence of the stop (relative)
mapping_method Method used to match GTFS stop to ATLAS sloid

Implementation

  • Scripts: matching_and_import_db/downloader/get_atlas_gtfs.py, matching_and_import_db/downloader/get_atlas_data.py
  • Key Functions:
    • load_gtfs_data_streaming(): Efficiently streams the massive GTFS dataset in a single pass, using pre-loaded trip lookup maps to avoid per-chunk merges.
    • build_integrated_gtfs_data_streaming(): Integrates the loaded GTFS data with ATLAS stops.
    • match_gtfs_to_atlas(): Maps GTFS stop_id to ATLAS sloid using strict matching plus unique-number fallback.
    • write_atlas_route_csvs(): Writes the entity-first route tables from the integrated GTFS data.
Data update running in background
Preparing update... | Phase: initializing
Data update in progress
Core data is being refreshed. Use this time to read the documentation.
Elapsed: -- ETA: -- Phase: idle