GTFS ATLAS Data
The GTFS integration stage produces a unified output family:
data/processed/atlas_line_families.csv,atlas_itineraries.csv, andatlas_itinerary_stop_calls.csv: normalized ATLAS route artifacts used during route import, stop-level route matching (AtlasState), and line-family / itinerary comparison.data/processed/gtfs_stops_raw.csvandgtfs_stop_identity_resolution.csv: cached GTFS stop identity artifacts used by the main database refresh and the GTFS diagnostic map.data/gtfs_atlas_stats.json: canonical GTFS-to-ATLAS stats sidecar later embedded intodata/stats.json.
Together, these artifacts map ATLAS stops to their associated transit routes and directions using GTFS (General Transit Feed Specification) timetable data, and they also materialize the canonical GTFS stop_id <-> ATLAS sloid state used by the Routes diagnostic map.
GTFS provides the underlying data that allows us to know which routes (e.g., "Tram 11") serve which stops, and in which direction (e.g., "towards Rehalp").
Statistics
We track how many ATLAS stops are successfully touched by GTFS-derived route data.
- ATLAS Stops: 54,998
- ATLAS Stops Touched by GTFS Routes: 42,494
- ATLAS Coverage: 77.3%
These ATLAS-side coverage stats are distinct from the GTFS stop_id mapping stats below. An ATLAS stop is counted here if at least one GTFS-derived route row touches its sloid, while the next block measures how many GTFS stop_ids were successfully mapped to any ATLAS sloid.
How the GTFS stop_ids map to the ATLAS sloids:
- Total GTFS Stops Processed: 81,545
- Matched GTFS Stops: 48,421 (59.4%)
- Unmatched GTFS Stops: 33,124
- Original stop_id Assignments: 40,920
- Strict Matching Assignments: 15
- Coordinate Proximity Assignments: 911
- Unique-Number Fallback Assignments: 6,575
GTFS File Structure
Before processing the data, it is helpful to understand the core structure of a GTFS feed. A GTFS feed is a collection of CSV files (with a .txt extension) zipped together. The files we use to build route associations are:
stops.txt: Contains the physical locations of transit stops. Each stop has a uniquestop_id(e.g.,8503000:0:1for a specific platform), coordinates, and names.routes.txt: Contains the definition of transit lines, such as "Tram 11" or "S12". Each route has a uniqueroute_id, aroute_short_name, and optional route metadata such asroute_long_name,route_desc, androute_type.trips.txt: Defines individual journeys along a route. It links a specifictrip_idto aroute_idand includes direction information (direction_id). The file also contains fields liketrip_headsign, but the current route export focuses on route/direction linkage.stop_times.txt: This is the connective tissue. It defines the exact schedule. Each row links atrip_idto astop_idand provides the arrival/departure times, as well as the order the stop appears in the trip (stop_sequence).
Example of stop_times.txt:
trip_id |
arrival_time |
departure_time |
stop_id |
stop_sequence |
|---|---|---|---|---|
trip_1001 |
08:00:00 |
08:00:00 |
8503000:0:1 |
1 |
trip_1001 |
08:05:00 |
08:05:00 |
8503001:0:2 |
2 |
trip_1001 |
08:10:00 |
08:10:00 |
8503002:0:1 |
3 |
To know what route serves a physical stop_id, we must join these tables: stop_times.txt -> trips.txt -> routes.txt.
Entity Relationship Diagram
Here is a conceptual database diagram illustrating how the GTFS files relate to each other, and how they ultimately map to the ATLAS stops data:
The Challenge
The Swiss GTFS dataset is massive. The stop_times.txt file, which links stops to trips, contains over 25 million rows. Loading this entire dataset into memory using standard tools like pandas would require excessive RAM (2.5GB+).
To handle this efficiently, we use a streaming approach that reads stop_times.txt in chunks (e.g., 500,000 rows at a time), keeping only the data relevant to Swiss stops.
Processing Pipeline
The Entity Relationship diagram above shows a chain of joins: stops → stop_times → trips → routes. In a standard database, you would join all these tables together. However, because stop_times.txt is too large for memory, we perform this "chaining join" virtually while streaming, using pre-loaded lookup dictionaries.
Detailed Steps & Data Structures
How the "chaining joins" (stops → stop_times → trips → routes) happen without loading everything into memory:
1. Filter Swiss Stops
Goal: Reduce the number of stops we care about.
Action: Load stops.txt and keep only geographically Swiss stops.
Data Structure: A pandas DataFrame of swiss_stops containing stop_id, stop_name, stop_lat, and stop_lon. (e.g. ['8503000:0:1', 'Zürich HB', ...])
2. Pre-load Trip Lookups
Goal: Prepare for the trips → routes join.
Action: Load trips.txt completely. Instead of a bulky table, we create two lightweight Python dictionary maps (pandas Series) to instantly look up a route or direction for any given trip.
Data Structure:
route_by_tripmap:{'trip_1001': 'route_11', ...}dir_by_tripmap:{'trip_1001': '0', ...}
3. Stream stop_times (The "Join" Step)
Goal: Perform the stops → stop_times → trips join in a memory-efficient way.
Action: We read stop_times.txt in chunks of 500,000 rows. For each chunk:
- Filter (
stops→stop_times): Discard any row where thestop_idis not in ourswiss_stopsDataFrame from Step 1. - Join (
stop_times→trips→routes): For the surviving rows, we have astop_idand atrip_id. We immediately use our maps from Step 2 to find theroute_idanddirection_id. - Collect: We save the unique combinations of
(stop_id, route_id, direction_id).
Data Structure Output: A concise DataFrame listing every unique route and direction serving a stop:[['8503000:0:1', 'route_11', '0'], ...].
4. Fetch Route Metadata & Finalize Directions
Goal: Now that we know exactly which route_ids serve our stops, we can safely load their metadata (like "Tram 11").
Action:
- We load
routes.txtand filter it to only theroute_ids we discovered in Step 3. - We join this route metadata (like
route_short_name) onto our main DataFrame. - We format the "first and last stops" we tracked during the stream into a readable string (e.g.,
Zürich HB → Bern).
Data Structure Output: A DataFrame containingstop_id,route_id,route_short_name,direction_id, anddirectionstring.
5. Match GTFS to ATLAS (stop_id → sloid)
Goal: Link the physical GTFS platform to our ATLAS database.
Action: Apply matching algorithms to connect the GTFS stop_id (e.g., 8503000:0:1) to the ATLAS sloid (e.g., ch:1:sloid:8503000:1).
Data Structure Output: A reusable GTFS-to-ATLAS payload containing the integrated route rows, the canonical stop_id <-> sloid matches, and exported mapping stats.
| Strategy | Description |
|---|---|
| 1. Original stop_id | If the GTFS feed already exposes an original_stop_id equal to an ATLAS sloid, that identity is accepted directly. |
| 2. Strict Match | Parse the GTFS stop_id into uic_number and local_ref, normalize special platform codes (10000 -> 1, 10001 -> 2), then match (uic_number, normalized_local_ref) against ATLAS (number, designation). |
| 3. Coordinate Proximity | For stops that fail strict matching, the system searches for ATLAS stops with the same UIC number within a 0.5-meter radius. The match is only assigned if it is unambiguous (1-to-1). |
| 4. Unique-Number Fallback | Only for GTFS stops that failed the earlier steps: if the ATLAS number appears exactly once in the dataset, assign that single sloid. |
The canonical matcher is reused in three places:
- route integration (
atlas_line_families.csv,atlas_itineraries.csv,atlas_itinerary_stop_calls.csv) - import payload generation (
gtfs_stops_raw.csv,gtfs_stop_identity_resolution.csv) - exported stats (
data/gtfs_atlas_stats.json)
How GTFS Trips Become ATLAS Itineraries
The route export does not write one row per raw GTFS trip_id.
Instead, the exporter reconstructs a compact ATLAS-side route model with three levels:
- Line family: one row per GTFS
route_id - Itinerary: one representative direction/variant bucket inside that line family
- Stop call: one ordered stop row inside the chosen itinerary
The itinerary bucketing logic is intentionally conservative:
- the outer grouping key is
route_idplus GTFSdirection_id - within that bucket,
representative_headsignis preferred when present - if no headsign exists,
trip_short_nameis used - if neither text field is present, the fallback grouping key is an ordered UIC-level stop-pattern hash
When several raw trips collapse into the same emitted itinerary:
- the dominant underlying stop pattern becomes the representative stop sequence
- merged platform-level SLOID variants are retained on the stop-call rows
trip_countrecords how many GTFS trips were collapsed into that itinerary
This keeps the exported route layer compact enough for comparison while preserving the physical-stop identity needed by the importer.
GTFS Route Identifiers
A Note on GTFS route_ids
The route_id is an internally generated ID specific to the system creating the timetable data. In Switzerland, it follows a strict naming convention:
<Betriebszweig>-<Liniennummer>-<Projektkurzbezeichnung>-<Linienversionsnummer>
(Branch of operation) - (Line number) - (Project short designation) - (Line version number)
Normalization Strategy
In Switzerland's GTFS data, the route_id format includes a <Projektkurzbezeichnung> (project short designation) that denotes the specific timetable year (e.g., -j24, -j25, -j26). This means a physical route gets a new ID every year.
If an OSM mapper entered gtfs:route_id=11-T-j24-1 in 2024, it wouldn't match the backend's 2026 data (11-T-j26-1). The route_id_normalized field replaces these year-specific codes with -jXX (e.g., 11-T-jXX-1), allowing historical OSM tags to successfully match against current datasets.
Outputs
The pipeline materializes the GTFS-derived output family in data/processed/.
1. atlas_line_families.csv
Defines the ATLAS line-family entities for the normalized route import path.
| Column | Description |
|---|---|
atlas_line_id |
Stable ATLAS line-family identifier |
route_id |
GTFS route ID |
route_id_normalized |
Simplified -jXX version |
agency_id |
GTFS Agency ID |
route_short_name |
Short route name |
route_long_name |
Long route name |
route_desc |
Route description |
route_type |
Transport mode |
2. atlas_itineraries.csv
Defines the itinerary rows derived from GTFS trips/directions.
| Column | Description |
|---|---|
atlas_itinerary_id |
Stable itinerary identifier |
atlas_line_id |
Foreign key to atlas_line_families.csv |
route_id |
GTFS route ID |
direction_id |
Direction ID (0 or 1) |
representative_headsign |
Reserved headsign field for downstream route UI/import use |
direction_label |
Derived direction name (e.g. Zürich HB → Bern) |
3. atlas_itinerary_stop_calls.csv
Maps individual stops to the normalized itineraries.
| Column | Description |
|---|---|
atlas_itinerary_id |
Foreign key to atlas_itineraries.csv |
sloid |
ATLAS unique identifier |
stop_sequence |
Sequence of the stop (relative) |
mapping_method |
GTFS stop-to-ATLAS assignment method used for that row (strict, coordinate_proximity, or unique_number) |
4. gtfs_stops_raw.csv
Cached raw GTFS stop rows for the database import and GTFS diagnostic map.
| Column | Description |
|---|---|
stop_id |
GTFS stop identifier |
stop_code |
Raw GTFS stop code |
stop_name |
GTFS stop name |
platform_code |
GTFS platform code when present |
original_stop_id |
Original feed-level stop identifier used by direct sloid matches |
location_type |
GTFS location type |
parent_station |
GTFS parent station identifier |
uic_number |
Parsed Swiss UIC number |
local_ref |
Raw local/platform suffix parsed from stop_id |
normalized_local_ref |
Local ref normalized for strict matching |
stop_lat / stop_lon |
GTFS coordinates |
5. gtfs_stop_identity_resolution.csv
Cached GTFS <-> ATLAS identity-resolution rows for the database import and the /routes/gtfs-stop-id-sloid map.
| Column | Description |
|---|---|
stop_id |
GTFS stop identifier |
source_location_type |
GTFS location_type copied onto the state row |
identity_level |
Derived identity class such as platform, station, or child_stop |
resolved_sloid |
Matched ATLAS SLOID, or NULL when unresolved |
resolution_method |
original_stop_id, strict, coordinate_proximity, unique_number, or unmatched |
confidence |
Confidence score derived from the resolution method |
distance_m |
Coordinate-proximity distance when applicable |
gtfs_stop_lat / gtfs_stop_lon |
GTFS-side coordinates copied onto the state row |
atlas_lat / atlas_lon |
ATLAS-side coordinates copied onto the state row |
details_json |
Extra GTFS identity fields such as original_stop_id, platform_code, and parent_station |
6. data/gtfs_atlas_stats.json
Canonical GTFS-to-ATLAS stats sidecar later embedded into data/stats.json under the gtfs_atlas key.
It includes:
- ATLAS-side coverage (
atlas.total,atlas.touched_by_gtfs_routes,atlas.coverage_percent) - GTFS-side mapping coverage (
gtfs_stop_ids.*) - assignment counts (
original_stop_id,strict,coordinate_proximity,unique_number_fallback) - coordinate-proximity diagnostics
- cardinality and unmatched-reason diagnostics
Implementation
- Scripts:
matching_and_import_db/downloader/get_atlas_gtfs.py,matching_and_import_db/downloader/get_atlas_data.py - Key Functions:
load_gtfs_data_streaming(): Efficiently streams the massive GTFS dataset in a single pass, using pre-loaded trip lookup maps to avoid per-chunk merges.build_gtfs_atlas_payload(): Builds the canonical reusable GTFS-to-ATLAS payload for route export, stats, and DB import.build_integrated_gtfs_data_streaming(): Integrates the loaded GTFS data with ATLAS stops for route-level outputs.match_gtfs_to_atlas(): Maps GTFSstop_idto ATLASsloidusing original-stop identity, strict matching, coordinate proximity rescue, and unique-number fallback.build_gtfs_db_payload_rows(): Materializes GTFS stop rows and GTFS <-> ATLAS state rows for database import.write_gtfs_db_payload_cache(): Writesgtfs_stops_raw.csvandgtfs_stop_identity_resolution.csvintodata/processed/.write_atlas_route_csvs(): Writes the normalized ATLAS route tables from the integrated GTFS data.
The Routes tab diagnostic map reads the imported versions of these payloads from the gtfs_stops_raw and gtfs_stop_identity_resolution tables. See 6.6 Routes Pages.