GTFS ATLAS Data

The GTFS integration stage produces two complementary output families:

data/processed/atlas_routes_gtfs.csv: A stop-level GTFS route sidecar keyed by sloid, used by AtlasState for stop-level route matching and route stats/UI helpers.
data/processed/atlas_routes.csv, atlas_route_directions.csv, and atlas_route_stops.csv: Entity-first GTFS route tables used during route import and route-route linking.

Together, these files map ATLAS stops to their associated transit routes and directions using GTFS (General Transit Feed Specification) timetable data.

GTFS provides the underlying data that allows us to know which routes (e.g., "Tram 11") serve which stops, and in which direction (e.g., "towards Rehalp").

Statistics

We track how many ATLAS stops are successfully matched to GTFS routes.

Total ATLAS Stops: —
Stops Matched to GTFS: —
Coverage: —%

We also track how the GTFS stop_ids map to the ATLAS sloids:

Total GTFS Stops Processed: 50,790
Matched GTFS Stops: 19,911 (39.2%)
Unmatched GTFS Stops: 30,879
Strict Matching Assignments: 26,941
Unique-Number Fallback Assignments: 2,277

GTFS File Structure

Before processing the data, it is helpful to understand the core structure of a GTFS feed. A GTFS feed is a collection of CSV files (with a .txt extension) zipped together. The files we use to build route associations are:

stops.txt: Contains the physical locations of transit stops. Each stop has a unique stop_id (e.g., 8503000:0:1 for a specific platform), coordinates, and names.
routes.txt: Contains the definition of transit lines, such as "Tram 11" or "S12". Each route has a unique route_id, a route_short_name, and optional route metadata such as route_long_name, route_desc, and route_type.
trips.txt: Defines individual journeys along a route. It links a specific trip_id to a route_id and includes direction information (direction_id). The file also contains fields like trip_headsign, but the current route export focuses on route/direction linkage.
stop_times.txt: This is the connective tissue. It defines the exact schedule. Each row links a trip_id to a stop_id and provides the arrival/departure times, as well as the order the stop appears in the trip (stop_sequence).

Example of stop_times.txt:

`trip_id`	`arrival_time`	`departure_time`	`stop_id`	`stop_sequence`
`trip_1001`	`08:00:00`	`08:00:00`	`8503000:0:1`	`1`
`trip_1001`	`08:05:00`	`08:05:00`	`8503001:0:2`	`2`
`trip_1001`	`08:10:00`	`08:10:00`	`8503002:0:1`	`3`

To know what route serves a physical stop_id, we must join these tables: stop_times.txt -> trips.txt -> routes.txt.

Entity Relationship Diagram

Here is a conceptual database diagram illustrating how the GTFS files relate to each other, and how they ultimately map to the ATLAS stops data:

erDiagram ATLAS_STOPS ||--o{ GTFS_STOPS : "matches via stop rules" GTFS_STOPS ||--o{ GTFS_STOP_TIMES : "contains" GTFS_TRIPS ||--o{ GTFS_STOP_TIMES : "contains" GTFS_ROUTES ||--o{ GTFS_TRIPS : "contains" ATLAS_STOPS { string sloid PK string number string designation } GTFS_STOPS { string stop_id PK string stop_name float stop_lat float stop_lon } GTFS_STOP_TIMES { string trip_id FK string stop_id FK int stop_sequence } GTFS_TRIPS { string trip_id PK string route_id FK int direction_id } GTFS_ROUTES { string route_id PK string route_short_name string route_long_name }

The Challenge

The Swiss GTFS dataset is massive. The stop_times.txt file, which links stops to trips, contains over 25 million rows. Loading this entire dataset into memory using standard tools like pandas would require excessive RAM (2.5GB+).

To handle this efficiently, we use a streaming approach that reads stop_times.txt in chunks (e.g., 500,000 rows at a time), keeping only the data relevant to Swiss stops.

Processing Pipeline

The Entity Relationship diagram above shows a chain of joins: stops → stop_times → trips → routes. In a standard database, you would join all these tables together. However, because stop_times.txt is too large for memory, we perform this "chaining join" virtually while streaming, using pre-loaded lookup dictionaries.

flowchart TD subgraph Inputs S["stops.txt"] ST["stop_times.txt"] T["trips.txt"] R["routes.txt"] end subgraph Step1["1. Filter Swiss Stops"] F1["Filter stops.txt Keep only Swiss stops (UIC 85...) inside CH border"] end subgraph Step2["2. Pre-load Trip Lookups"] F1B["Load trips.txt in full Build trip→route and trip→direction lookup maps"] end subgraph Step3["3. Stream stop_times (The 'Join')"] F2["Read next chunk (500k rows)"] F3["Keep rows matching Swiss stops"] F4["Per chunk, apply lookup maps: stop_id → trip_id → route_id Output: Unique (stop_id, route_id, direction_id)"] F4B["Global reduce: Find true first/last stop per trip"] end subgraph Step4["4. Route Metadata & Directions"] F5["Load routes.txt Join route metadata (names)"] F6["Build direction strings (First Stop → Last Stop)"] end subgraph Step5["5. Match to ATLAS"] M1["Match GTFS stop_id to ATLAS sloid"] end S --> F1 T --> F1B ST --> F2 R --> F5 F1 -.->|DataFrame of Swiss stop_ids| F3 F1B -.->|Dicts: trip_id→route_id trip_id→direction_id| F4 F2 --> F3 F3 --> F4 F4 -->|Loop until EOF| F2 F4 -->|Once all chunks processed List of unique stop-route-dir triples| F4B F4B -->|List of unique routes and termini| Step4 Step4 -->|DataFrame: stop_id, route_id, metadata| Step5

Detailed Steps & Data Structures

To understand how the "chaining joins" (stops → stop_times → trips → routes) happen without loading everything into memory, here is the breakdown of the data structures at each step:

1. Filter Swiss Stops

Goal: Reduce the number of stops we care about.
Action: Load stops.txt and keep only geographically Swiss stops.
Data Structure: A pandas DataFrame of swiss_stops containing stop_id, stop_name, stop_lat, and stop_lon. (e.g. ['8503000:0:1', 'Zürich HB', ...])

2. Pre-load Trip Lookups

Goal: Prepare for the trips → routes join.
Action: Load trips.txt completely. Instead of a bulky table, we create two lightweight Python dictionary maps (pandas Series) to instantly look up a route or direction for any given trip.
Data Structure:

route_by_trip map: {'trip_1001': 'route_11', ...}
dir_by_trip map: {'trip_1001': '0', ...}

3. Stream `stop_times` (The "Join" Step)

Goal: Perform the stops → stop_times → trips join in a memory-efficient way.
Action: We read stop_times.txt in chunks of 500,000 rows. For each chunk:

Filter (stops → stop_times): Discard any row where the stop_id is not in our swiss_stops DataFrame from Step 1.
Join (stop_times → trips → routes): For the surviving rows, we have a stop_id and a trip_id. We immediately use our maps from Step 2 to find the route_id and direction_id.
Collect: We save the unique combinations of (stop_id, route_id, direction_id).
Data Structure Output: A concise DataFrame listing every unique route and direction serving a stop: [['8503000:0:1', 'route_11', '0'], ...].

4. Fetch Route Metadata & Finalize Directions

Goal: Now that we know exactly which route_ids serve our stops, we can safely load their metadata (like "Tram 11").
Action:

We load routes.txt and filter it to only the route_ids we discovered in Step 3.
We join this route metadata (like route_short_name) onto our main DataFrame.
We format the "first and last stops" we tracked during the stream into a readable string (e.g., Zürich HB → Bern).
Data Structure Output: A DataFrame containing stop_id, route_id, route_short_name, direction_id, and direction string.

5. Match GTFS to ATLAS (`stop_id` → `sloid`)

Goal: Link the physical GTFS platform to our ATLAS database.
Action: Apply matching algorithms to connect the GTFS stop_id (e.g., 8503000:0:1) to the ATLAS sloid (e.g., ch:1:sloid:8503000:1).
Data Structure Output: The final joined table linking ATLAS stops to Route metadata.

Strategy	Description
1. Strict Match	Parse the GTFS `stop_id` into `uic_number` and `local_ref`, normalize special platform codes (`10000 -> 1`, `10001 -> 2`), then match `(uic_number, normalized_local_ref)` against ATLAS `(number, designation)`.
2. Unique-Number Fallback	Only for GTFS stops that failed strict matching: if the ATLAS `number` appears exactly once, assign that single `sloid`.

GTFS Route Identifiers

A Note on GTFS route_ids

The route_id is an internally generated ID specific to the system creating the timetable data. In Switzerland, it follows a strict naming convention:
<Betriebszweig>-<Liniennummer>-<Projektkurzbezeichnung>-<Linienversionsnummer>
(Branch of operation) - (Line number) - (Project short designation) - (Line version number)

Normalization Strategy

In Switzerland's GTFS data, the route_id format includes a <Projektkurzbezeichnung> (project short designation) that denotes the specific timetable year (e.g., -j24, -j25, -j26). This means a physical route gets a new ID every year.

If an OSM mapper entered gtfs:route_id=11-T-j24-1 in 2024, it wouldn't match the backend's 2026 data (11-T-j26-1). The route_id_normalized field replaces these year-specific codes with -jXX (e.g., 11-T-jXX-1), allowing historical OSM tags to successfully match against current datasets.

Outputs

The pipeline materializes two GTFS-derived output families in data/processed/.

1. `atlas_routes_gtfs.csv`

This is the stop-level route sidecar consumed by AtlasState.from_dataframe() for the stop-level route predicate.

Column	Description
`sloid`	ATLAS unique identifier
`route_id`	GTFS route ID
`route_id_normalized`	Simplified `-jXX` version
`route_name_short`	Short route name
`route_name_long`	Long route name
`direction_id`	Direction ID (`0` or `1`)
`direction_name`	Derived direction label (e.g. `Zürich HB → Bern`)

2. `atlas_routes.csv`

Defines the route entities for the normalized route import path.

Column	Description
`run_id`	Pipeline run timestamp
`route_id`	GTFS route ID
`route_id_normalized`	Simplified `-jXX` version
`agency_id`	GTFS Agency ID
`route_short_name`	Short route name
`route_long_name`	Long route name
`route_desc`	Route description
`route_type`	Transport mode

3. `atlas_route_directions.csv`

Defines the directions for each route.

Column	Description
`route_id`	Foreign key to `atlas_routes.csv`
`direction_id`	Direction ID (`0` or `1`)
`representative_headsign`	Reserved headsign field for downstream route UI/import use
`direction_label`	Derived direction name (e.g. `Zürich HB → Bern`)
`trip_count`	Placeholder count currently emitted as `0`

4. `atlas_route_stops.csv`

Maps individual stops to the routes and directions.

Column	Description
`route_id`	Foreign key to `atlas_routes.csv`
`direction_id`	Direction ID
`sloid`	ATLAS unique identifier
`stop_sequence`	Sequence of the stop (relative)
`mapping_method`	Method used to match GTFS stop to ATLAS sloid

Implementation

Scripts: matching_and_import_db/downloader/get_atlas_gtfs.py, matching_and_import_db/downloader/get_atlas_data.py
Key Functions:
- load_gtfs_data_streaming(): Efficiently streams the massive GTFS dataset in a single pass, using pre-loaded trip lookup maps to avoid per-chunk merges.
- build_integrated_gtfs_data_streaming(): Integrates the loaded GTFS data with ATLAS stops.
- match_gtfs_to_atlas(): Maps GTFS stop_id to ATLAS sloid using strict matching plus unique-number fallback.
- write_atlas_route_csvs(): Writes the entity-first route tables from the integrated GTFS data.

1.1 Atlas Stops

1.3 Osm Data

Generating Report

Documentation

GTFS ATLAS Data

Statistics

GTFS File Structure

Entity Relationship Diagram

The Challenge

Processing Pipeline

Detailed Steps & Data Structures

1. Filter Swiss Stops

2. Pre-load Trip Lookups

3. Stream `stop_times` (The "Join" Step)

4. Fetch Route Metadata & Finalize Directions

5. Match GTFS to ATLAS (`stop_id` → `sloid`)

GTFS Route Identifiers

A Note on GTFS route_ids

Normalization Strategy

Outputs

1. `atlas_routes_gtfs.csv`

2. `atlas_routes.csv`

3. `atlas_route_directions.csv`

4. `atlas_route_stops.csv`

Implementation

Generating Report

Documentation

GTFS ATLAS Data

Statistics

GTFS File Structure

Entity Relationship Diagram

The Challenge

Processing Pipeline

Detailed Steps & Data Structures

1. Filter Swiss Stops

2. Pre-load Trip Lookups

3. Stream stop_times (The "Join" Step)

4. Fetch Route Metadata & Finalize Directions

5. Match GTFS to ATLAS (stop_id → sloid)

GTFS Route Identifiers

A Note on GTFS route_ids

Normalization Strategy

Outputs

1. atlas_routes_gtfs.csv

2. atlas_routes.csv

3. atlas_route_directions.csv

4. atlas_route_stops.csv

Implementation

3. Stream `stop_times` (The "Join" Step)

5. Match GTFS to ATLAS (`stop_id` → `sloid`)

1. `atlas_routes_gtfs.csv`

2. `atlas_routes.csv`

3. `atlas_route_directions.csv`

4. `atlas_route_stops.csv`