GTFS ATLAS Data
The GTFS integration stage produces two complementary output families:
data/processed/atlas_routes_gtfs.csv: A stop-level GTFS route sidecar keyed bysloid, used byAtlasStatefor stop-level route matching and route stats/UI helpers.data/processed/atlas_routes.csv,atlas_route_directions.csv, andatlas_route_stops.csv: Entity-first GTFS route tables used during route import and route-route linking.
Together, these files map ATLAS stops to their associated transit routes and directions using GTFS (General Transit Feed Specification) timetable data.
GTFS provides the underlying data that allows us to know which routes (e.g., "Tram 11") serve which stops, and in which direction (e.g., "towards Rehalp").
Statistics
We track how many ATLAS stops are successfully matched to GTFS routes.
- Total ATLAS Stops: —
- Stops Matched to GTFS: —
- Coverage: —%
We also track how the GTFS stop_ids map to the ATLAS sloids:
- Total GTFS Stops Processed: 50,790
- Matched GTFS Stops: 19,911 (39.2%)
- Unmatched GTFS Stops: 30,879
- Strict Matching Assignments: 26,941
- Unique-Number Fallback Assignments: 2,277
GTFS File Structure
Before processing the data, it is helpful to understand the core structure of a GTFS feed. A GTFS feed is a collection of CSV files (with a .txt extension) zipped together. The files we use to build route associations are:
stops.txt: Contains the physical locations of transit stops. Each stop has a uniquestop_id(e.g.,8503000:0:1for a specific platform), coordinates, and names.routes.txt: Contains the definition of transit lines, such as "Tram 11" or "S12". Each route has a uniqueroute_id, aroute_short_name, and optional route metadata such asroute_long_name,route_desc, androute_type.trips.txt: Defines individual journeys along a route. It links a specifictrip_idto aroute_idand includes direction information (direction_id). The file also contains fields liketrip_headsign, but the current route export focuses on route/direction linkage.stop_times.txt: This is the connective tissue. It defines the exact schedule. Each row links atrip_idto astop_idand provides the arrival/departure times, as well as the order the stop appears in the trip (stop_sequence).
Example of stop_times.txt:
trip_id |
arrival_time |
departure_time |
stop_id |
stop_sequence |
|---|---|---|---|---|
trip_1001 |
08:00:00 |
08:00:00 |
8503000:0:1 |
1 |
trip_1001 |
08:05:00 |
08:05:00 |
8503001:0:2 |
2 |
trip_1001 |
08:10:00 |
08:10:00 |
8503002:0:1 |
3 |
To know what route serves a physical stop_id, we must join these tables: stop_times.txt -> trips.txt -> routes.txt.
Entity Relationship Diagram
Here is a conceptual database diagram illustrating how the GTFS files relate to each other, and how they ultimately map to the ATLAS stops data:
The Challenge
The Swiss GTFS dataset is massive. The stop_times.txt file, which links stops to trips, contains over 25 million rows. Loading this entire dataset into memory using standard tools like pandas would require excessive RAM (2.5GB+).
To handle this efficiently, we use a streaming approach that reads stop_times.txt in chunks (e.g., 500,000 rows at a time), keeping only the data relevant to Swiss stops.
Processing Pipeline
The Entity Relationship diagram above shows a chain of joins: stops → stop_times → trips → routes. In a standard database, you would join all these tables together. However, because stop_times.txt is too large for memory, we perform this "chaining join" virtually while streaming, using pre-loaded lookup dictionaries.
Detailed Steps & Data Structures
To understand how the "chaining joins" (stops → stop_times → trips → routes) happen without loading everything into memory, here is the breakdown of the data structures at each step:
1. Filter Swiss Stops
Goal: Reduce the number of stops we care about.
Action: Load stops.txt and keep only geographically Swiss stops.
Data Structure: A pandas DataFrame of swiss_stops containing stop_id, stop_name, stop_lat, and stop_lon. (e.g. ['8503000:0:1', 'Zürich HB', ...])
2. Pre-load Trip Lookups
Goal: Prepare for the trips → routes join.
Action: Load trips.txt completely. Instead of a bulky table, we create two lightweight Python dictionary maps (pandas Series) to instantly look up a route or direction for any given trip.
Data Structure:
route_by_tripmap:{'trip_1001': 'route_11', ...}dir_by_tripmap:{'trip_1001': '0', ...}
3. Stream stop_times (The "Join" Step)
Goal: Perform the stops → stop_times → trips join in a memory-efficient way.
Action: We read stop_times.txt in chunks of 500,000 rows. For each chunk:
- Filter (
stops→stop_times): Discard any row where thestop_idis not in ourswiss_stopsDataFrame from Step 1. - Join (
stop_times→trips→routes): For the surviving rows, we have astop_idand atrip_id. We immediately use our maps from Step 2 to find theroute_idanddirection_id. - Collect: We save the unique combinations of
(stop_id, route_id, direction_id).
Data Structure Output: A concise DataFrame listing every unique route and direction serving a stop:[['8503000:0:1', 'route_11', '0'], ...].
4. Fetch Route Metadata & Finalize Directions
Goal: Now that we know exactly which route_ids serve our stops, we can safely load their metadata (like "Tram 11").
Action:
- We load
routes.txtand filter it to only theroute_ids we discovered in Step 3. - We join this route metadata (like
route_short_name) onto our main DataFrame. - We format the "first and last stops" we tracked during the stream into a readable string (e.g.,
Zürich HB → Bern).
Data Structure Output: A DataFrame containingstop_id,route_id,route_short_name,direction_id, anddirectionstring.
5. Match GTFS to ATLAS (stop_id → sloid)
Goal: Link the physical GTFS platform to our ATLAS database.
Action: Apply matching algorithms to connect the GTFS stop_id (e.g., 8503000:0:1) to the ATLAS sloid (e.g., ch:1:sloid:8503000:1).
Data Structure Output: The final joined table linking ATLAS stops to Route metadata.
| Strategy | Description |
|---|---|
| 1. Strict Match | Parse the GTFS stop_id into uic_number and local_ref, normalize special platform codes (10000 -> 1, 10001 -> 2), then match (uic_number, normalized_local_ref) against ATLAS (number, designation). |
| 2. Unique-Number Fallback | Only for GTFS stops that failed strict matching: if the ATLAS number appears exactly once, assign that single sloid. |
GTFS Route Identifiers
A Note on GTFS route_ids
The route_id is an internally generated ID specific to the system creating the timetable data. In Switzerland, it follows a strict naming convention:
<Betriebszweig>-<Liniennummer>-<Projektkurzbezeichnung>-<Linienversionsnummer>
(Branch of operation) - (Line number) - (Project short designation) - (Line version number)
Normalization Strategy
In Switzerland's GTFS data, the route_id format includes a <Projektkurzbezeichnung> (project short designation) that denotes the specific timetable year (e.g., -j24, -j25, -j26). This means a physical route gets a new ID every year.
If an OSM mapper entered gtfs:route_id=11-T-j24-1 in 2024, it wouldn't match the backend's 2026 data (11-T-j26-1). The route_id_normalized field replaces these year-specific codes with -jXX (e.g., 11-T-jXX-1), allowing historical OSM tags to successfully match against current datasets.
Outputs
The pipeline materializes two GTFS-derived output families in data/processed/.
1. atlas_routes_gtfs.csv
This is the stop-level route sidecar consumed by AtlasState.from_dataframe() for the stop-level route predicate.
| Column | Description |
|---|---|
sloid |
ATLAS unique identifier |
route_id |
GTFS route ID |
route_id_normalized |
Simplified -jXX version |
route_name_short |
Short route name |
route_name_long |
Long route name |
direction_id |
Direction ID (0 or 1) |
direction_name |
Derived direction label (e.g. Zürich HB → Bern) |
2. atlas_routes.csv
Defines the route entities for the normalized route import path.
| Column | Description |
|---|---|
run_id |
Pipeline run timestamp |
route_id |
GTFS route ID |
route_id_normalized |
Simplified -jXX version |
agency_id |
GTFS Agency ID |
route_short_name |
Short route name |
route_long_name |
Long route name |
route_desc |
Route description |
route_type |
Transport mode |
3. atlas_route_directions.csv
Defines the directions for each route.
| Column | Description |
|---|---|
route_id |
Foreign key to atlas_routes.csv |
direction_id |
Direction ID (0 or 1) |
representative_headsign |
Reserved headsign field for downstream route UI/import use |
direction_label |
Derived direction name (e.g. Zürich HB → Bern) |
trip_count |
Placeholder count currently emitted as 0 |
4. atlas_route_stops.csv
Maps individual stops to the routes and directions.
| Column | Description |
|---|---|
route_id |
Foreign key to atlas_routes.csv |
direction_id |
Direction ID |
sloid |
ATLAS unique identifier |
stop_sequence |
Sequence of the stop (relative) |
mapping_method |
Method used to match GTFS stop to ATLAS sloid |
Implementation
- Scripts:
matching_and_import_db/downloader/get_atlas_gtfs.py,matching_and_import_db/downloader/get_atlas_data.py - Key Functions:
load_gtfs_data_streaming(): Efficiently streams the massive GTFS dataset in a single pass, using pre-loaded trip lookup maps to avoid per-chunk merges.build_integrated_gtfs_data_streaming(): Integrates the loaded GTFS data with ATLAS stops.match_gtfs_to_atlas(): Maps GTFSstop_idto ATLASsloidusing strict matching plus unique-number fallback.write_atlas_route_csvs(): Writes the entity-first route tables from the integrated GTFS data.