Matching Process

The matching pipeline correlates ATLAS boarding platforms with OSM public transport nodes. It is ATLAS-driven and sequential: predicates run in a fixed priority order, and every successful commit() immediately locks the matched ATLAS and OSM representatives so later logic only sees what is still free.

This page is organized in the same order the code runs:

1. Load raw data into state managers
2. Pre-group duplicates and obvious OSM pair/trio groups
3. Run the 10 predicate executions in order
4. Record matches through ctx.commit()
5. Return PipelineResult / MatchingOutput

End-to-End Flow

flowchart LR
    A["Load ATLAS CSV"] --> B["AtlasState.from_dataframe()"]
    C["Load OSM XML"] --> D["OsmState.from_xml_file()"]
    B --> E["Pre-group ATLAS duplicates"]
    D --> F["Build OSM groups"]
    E --> G["MatchingContext"]
    F --> G
    G --> H["Predicates"]
    H --> I["PipelineResult"]
    I --> J["MatchingOutput"]

Conceptual Abstractions

The runtime flow above shows execution order. The diagram below shows the conceptual layers the code is built around: predicates decide, the context orchestrates, the state layer answers queries and tracks locks, and domain models flow upward as the shared data abstraction.

flowchart TD
    subgraph T1["Tier 1 - Behavior"]
        P["8 concrete predicate classes\n10 predicate executions"]:::behavior
    end
    subgraph T2["Tier 2 - Orchestration"]
        Ctx["MatchingContext"]:::orchestration
        Commit["commit(...)"]:::action
    end
    subgraph T3["Tier 3 - State and Indexing"]
        AtlasState["AtlasState\nDataFrame + duplicate groups + matched_ids"]:::state
        OsmState["OsmState\nUIC/name maps + routes + lazy KDTree + used_ids"]:::state
    end
    subgraph T4["Tier 4 - Domain Models"]
        AtlasEntity["AtlasEntity / AtlasNode"]:::model
        OsmEntity["OsmEntity / OsmNode"]:::model
        MatchRecord["MatchRecord"]:::model
        Output["PipelineResult / MatchingOutput"]:::model
    end
    P -->|query unmatched records and candidates| Ctx
    Ctx -->|read view| AtlasState
    Ctx -->|read view| OsmState
    AtlasState -.->|returns| AtlasEntity
    OsmState -.->|returns| OsmEntity
    P -->|successful match| Commit
    Commit -->|lock representative + expand siblings| AtlasState
    Commit -->|lock representative + expand siblings| OsmState
    Commit --->|create| MatchRecord
    MatchRecord --> Output

What Counts as a Match

A match is represented by a MatchRecord, which links one AtlasNode to one OsmNode plus metadata about how the link was made.

Field	Description
`atlas_node`	The matched ATLAS platform
`osm_node`	The matched OSM node
`match_type`	Which rule produced the link, for example `exact`, `name`, `distance_matching_3b`
`distance_m`	Haversine distance stored by the predicate
`notes`	Short explanation such as `gtfs_tokens` or `Exact local_ref match within max_distance`
`problems`	Quality flags populated later by problem evaluation
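The field table above can be pictured as a small set of dataclasses. This is only an illustrative sketch: the field names come from this page, but the types, defaults, and the sample values (SLOID, node id, distance) are assumptions and are not taken from models.py.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class AtlasNode:
    """Immutable ATLAS platform (only a few fields shown; types are assumed)."""
    sloid: str
    uic_ref: str
    designation: Optional[str] = None


@dataclass(frozen=True)
class OsmNode:
    """Immutable OSM node (only a few fields shown; types are assumed)."""
    node_id: str
    uic_ref: Optional[str] = None
    local_ref: Optional[str] = None


@dataclass
class MatchRecord:
    """Mutable link between one AtlasNode and one OsmNode."""
    atlas_node: AtlasNode
    osm_node: OsmNode
    match_type: str
    distance_m: Optional[float] = None
    notes: str = ""
    # Quality flags are appended later by problem evaluation.
    problems: List[str] = field(default_factory=list)


# Placeholder values, invented for the example.
record = MatchRecord(
    atlas_node=AtlasNode(sloid="sloid-placeholder", uic_ref="0000001", designation="1"),
    osm_node=OsmNode(node_id="100", uic_ref="0000001", local_ref="1"),
    match_type="exact",
    distance_m=4.2,
    notes="Exact local_ref match within max_distance",
)
```

Keeping the two node models frozen while leaving MatchRecord mutable mirrors the split described under Domain Models: nodes never change after loading, while a record can still receive problem flags later.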

Most links are 1 ATLAS → 1 OSM, but the pipeline intentionally allows these exceptions:

Cardinality	Why it happens	Where it comes from
N ATLAS → 1 OSM	One OSM node is the only candidate for a UIC	`ExactUicPredicate` single-OSM case
1 ATLAS → N OSM	One ATLAS entry is matched to multiple OSM nodes sharing the same UIC	`ExactUicPredicate` single-ATLAS case
N ATLAS → 1 OSM	Duplicate ATLAS siblings are auto-expanded after the representative matches	`ctx.commit()`
1 ATLAS → N OSM	OSM pair siblings are auto-expanded after the representative matches	`ctx.commit()` for pair groups only

For osm_trio, the pipeline still creates only the two primary side-node MatchRecords. The middle stop_position is not a MatchRecord; it becomes an effectively_matched database row later during import when both trio sides were matched.

Phase 1: Build State Before Matching

The matcher does not work directly on raw DataFrame rows or raw XML tags. It first builds two state managers.

State	Built from	Main job
`AtlasState`	`stops_ATLAS.csv`	Convert ATLAS rows into `AtlasNode`s, track matched SLOIDs, load route evidence
`OsmState`	`osm_data.xml`	Parse OSM nodes and route relations, build attribute indexes, build spatial index on demand

`AtlasState`

AtlasState.from_dataframe() does three important things before the first predicate runs:

Converts ATLAS rows into immutable AtlasNode models on demand.
Detects duplicate groups where entries share the same number + designation.
Loads data/processed/atlas_routes_gtfs.csv into get_routes(sloid) so route matching can stay file-I/O free.

ATLAS duplicate grouping

ATLAS duplicates are grouped before matching, with no distance heuristic.

duplicate_sloid_map groups rows sharing the same (number, designation).
The first sorted SLOID becomes the representative.
The remaining SLOIDs become hidden siblings.
get_unmatched_records() returns one AtlasEntity wrapper for the representative instead of exposing all siblings individually.

When that representative is committed, siblings get their own MatchRecords with match_type='duplicate_propagation'.
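The grouping rule above can be sketched in a few lines. The function name `build_duplicate_groups`, the dict-based rows, and the shape of the returned mapping are hypothetical; the real duplicate_sloid_map may be structured differently.

```python
from collections import defaultdict


def build_duplicate_groups(rows):
    """Return {representative_sloid: [sibling_sloids]} for rows sharing (number, designation)."""
    by_key = defaultdict(list)
    for row in rows:
        by_key[(row["number"], row["designation"])].append(row["sloid"])

    groups = {}
    for sloids in by_key.values():
        if len(sloids) < 2:
            continue  # a single row is not a duplicate group
        ordered = sorted(sloids)
        # First sorted SLOID is the representative, the rest are hidden siblings.
        groups[ordered[0]] = ordered[1:]
    return groups


# Invented sample rows.
rows = [
    {"sloid": "s:2", "number": 8500001, "designation": "A"},
    {"sloid": "s:1", "number": 8500001, "designation": "A"},
    {"sloid": "s:3", "number": 8500001, "designation": "B"},
]
groups = build_duplicate_groups(rows)
```

Here `s:1` and `s:2` share a key, so `s:1` becomes the representative and `s:2` its hidden sibling, while `s:3` stays ungrouped. No coordinates are consulted, in line with the "no distance heuristic" rule.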

`OsmState`

OsmState.from_xml_file() parses OSM nodes and route relations into several indexes over the same underlying node set.

Index / Store	Purpose	Used by
`_uic_ref_dict`	Lookup by `uic_ref`	Exact and post-pass matching
`_name_index`	Lookup by `name`, `uic_name`, `gtfs:name`	Name matching
`_node_routes`	Per-node GTFS route memberships from route relations	Route matching
`name_dirs` / `uic_dirs`	Per-node direction strings	Route matching
Lazy KDTree	Radius queries in metres	Distance and route matching

OsmNode.is_station returns True for public_transport=station or railway=station, but explicitly not for public_transport=stop_position and not for aerialway=station.
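As a rough illustration of that rule, the check could look like the function below. It works on a plain tag dict rather than on OsmNode, and the order in which the exclusions are tested is an assumption.

```python
def is_station(tags):
    """Sketch of the station test described above, operating on an OSM tag dict."""
    # Explicit exclusions come first (assumed precedence).
    if tags.get("public_transport") == "stop_position":
        return False
    if tags.get("aerialway") == "station":
        return False
    return tags.get("public_transport") == "station" or tags.get("railway") == "station"
```

With this ordering, a node tagged both `railway=station` and `public_transport=stop_position` is not treated as a station, so it stays eligible for matching.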

OSM grouping before predicates

OsmState.build_groups() runs before matching so obvious OSM siblings behave like one logical entity during predicate execution.

Grouping runs in two layers, in this order:

1. Trios first (osm_trio) for strict 3-node UIC cases.
2. Pairs second for all remaining reciprocal sibling pairs, with a perfect-count branch first.

The map uses one consistent visual convention for grouped OSM structures:

Solid blue lines are ATLAS-OSM matches.
Dashed green lines are OSM-OSM structural links inside a pair or trio.
Trio middles are shown on the map once imported as effectively_matched, but they are still not pipeline MatchRecords.

Group type	Representative	Sibling(s)	Core rule
`osm_trio`	One non-middle side node	Side partner + identified middle	UIC-scoped trio: exactly 3 OSM nodes + exactly 1 stop_position + exactly 2 ATLAS rows for that UIC, with both side nodes within 15 m of the middle
`osm_pair_uic_equal_15m`	`public_transport=platform`	`public_transport=stop_position`	UIC-scoped perfect-count branch: `platform_count == stop_position_count == atlas_effective_count`; reciprocal nearest-neighbour within 15 m; no ratio gate
`osm_pair_uic`	`public_transport=platform`	`public_transport=stop_position`	Reciprocal nearest-neighbour pairing within 12 m, UIC-scoped
`osm_pair_name_equal_15m`	`public_transport=platform`	`public_transport=stop_position`	Name-scoped perfect-count branch (anchored to UIC): equal counts + reciprocal nearest-neighbour within 15 m; no ratio gate
`osm_pair_name`	`public_transport=platform`	`public_transport=stop_position`	Same spatial pairing, but name-scoped and anchored back to a UIC
`osm_pair_tram_equal_15m`	`railway=tram_stop`	`public_transport=stop_position`	Tram-scoped perfect-count branch: equal counts + reciprocal nearest-neighbour within 15 m; no ratio gate
`osm_pair_tram`	`railway=tram_stop`	`public_transport=stop_position`	Same pairing logic for tram stops

Visual examples

The screenshots below are represented as simplified SVG schematics so the pair/trio semantics remain readable in the documentation PDF as well.

OSM trio matched to 2 ATLAS stops

The trio's two side nodes are the only true pipeline matches. The middle stop_position remains unmatched during predicate execution, then becomes effectively_matched during import so it can be rendered on the map with dashed green OSM-OSM links to both sides.

OSM pair matched to 1 ATLAS stop

For a pair, the representative node matches first and ctx.commit() propagates that result to the sibling as osm_group_propagation. On the map this reads as two blue ATLAS-OSM links plus one dashed green OSM-OSM pair link.

Important details:

Grouping uses reciprocal nearest-neighbour checks, not just raw proximity.
Trios are only registered when both side nodes are within 15 m of the middle stop_position; otherwise those nodes stay available for the later pair grouping paths.
For perfect-count anchors (left_count == right_count == atlas_effective_count), grouping first tries reciprocal conflict-free pairing within 15 m and bypasses ratio checks.
atlas_effective_count is computed per UIC by counting only ATLAS rows whose nearest same-UIC OSM node is within 30 m.
The strict/relaxed fallback path still uses the original (unfiltered) atlas_count.
If the perfect-count branch does not apply or cannot produce a full 1:1 pairing, grouping falls back to a strict pass with a 1.5 ratio threshold and accepts those pairs only when grouped + ungrouped OSM entities exactly match the ATLAS count for that anchor.
If the strict complete-count check fails, grouping retries with a 2.0 ratio threshold and accepts the reciprocal pairs it finds as an incomplete group set.
In ratio-based paths, the ratio test compares the nearest candidate (which must be within 12 m) against the true second-nearest candidate, even if that second candidate is beyond 12 m. If no second candidate exists at all, the ratio test does not reject the pair.
The same perfect-count-first then strict-vs-incomplete policy applies to UIC, name, and tram pairing paths.
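The reciprocal nearest-neighbour check with a ratio gate can be sketched as follows. This is a simplified illustration, not the code in state.py: it uses planar coordinates in metres instead of geographic distances, it assumes the ratio test means "the second-nearest candidate must be at least `ratio` times as far as the nearest", and it leaves out the perfect-count branch and the complete-count acceptance policy. All function names are hypothetical.

```python
import math

GROUP_MAX_DISTANCE_M = 12.0


def best_candidate(point, others, ratio):
    """Return the key of the accepted nearest neighbour in `others`, or None."""
    ranked = sorted((math.dist(point, xy), key) for key, xy in others.items())
    if not ranked or ranked[0][0] > GROUP_MAX_DISTANCE_M:
        return None  # nearest candidate must be within 12 m
    d1, best = ranked[0]
    if len(ranked) > 1:
        # True second-nearest candidate, even if it lies beyond 12 m.
        d2 = ranked[1][0]
        if d2 < ratio * d1:
            return None  # ambiguous: second candidate is too close
    # With no second candidate the ratio test does not reject the pair.
    return best


def reciprocal_pairs(platforms, stop_positions, ratio=1.5):
    """Pair each platform with a stop_position only if each picks the other."""
    pairs = []
    for p_key, p_xy in platforms.items():
        s_key = best_candidate(p_xy, stop_positions, ratio)
        if s_key is None:
            continue
        if best_candidate(stop_positions[s_key], platforms, ratio) == p_key:
            pairs.append((p_key, s_key))
    return pairs


clear = reciprocal_pairs(
    {"P1": (0, 0), "P2": (40, 0)},
    {"S1": (3, 0), "S2": (43, 0)},
)
ambiguous = reciprocal_pairs({"P1": (0, 0)}, {"S1": (10, 0), "S2": (14, 0)})
```

In the second call the nearest stop_position is 10 m away and the second-nearest is 14 m away, outside the 12 m radius. It still counts for the ratio test, so the pair is rejected under a 1.5 threshold.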

After grouping, lookup methods such as get_by_uic(), get_by_name(), and batch_query_radius() return OsmEntity wrappers for representatives while hiding siblings.

Phase 2: Run the Predicate Pipeline

The runner executes predicates in this order:

flowchart LR
    T["2.1 Trio distance"] --> E["2.2 Exact"] --> N["2.3 Name"]
    N --> D1["2.3a Group proximity"] --> D2["2.3b Local ref"]
    D2 --> D3["2.3c Nearest 3a"] --> D4["2.3d Nearest 3b"] --> D5["2.3e Nearest 3a retry"]
    D5 --> R["2.4 Route"] --> P["2.5 Post-pass"]

Each predicate sees only the current unmatched view of the state. Once a representative is committed, later predicates skip it automatically.

Step	Predicate	Match types	What it actually does
`2.1`	`TrioDistanceMatchingPredicate`	`distance_matching_trio`	For each trio UIC with exactly 2 unmatched ATLAS rows, match the two non-middle side nodes via minimum-total-distance 2x2 assignment; keep middle node unmatched
`2.2`	`ExactUicPredicate`	`exact`	Matches `AtlasNode.uic_ref` to `OsmNode.uic_ref`; when both sides have multiples, refines by `designation == local_ref`
`2.3`	`NameMatchPredicate`	`name`	Matches `designationOfficial` through the OSM name index and optionally refines by `designation == local_ref`
`2.3a`	`GroupProximityPredicate`	`distance_matching_1_uic_ref`, `distance_matching_1_uic_name`, `distance_matching_1_name`	Conflict-free maximum-cardinality assignment within grouped ATLAS and OSM candidates
`2.3b`	`LocalRefDistancePredicate`	`distance_matching_2`	Exact `designation == local_ref` within 50 m
`2.3c`	`NearestDistancePredicate`	`distance_matching_3a`	First single-candidate nearest-distance pass within 50 m
`2.3d`	`NearestDistancePredicate`	`distance_matching_3b`	Ratio-test nearest-distance pass within 50 m
`2.3e`	`NearestDistancePredicate`	`distance_matching_3a_second_pass`	Second single-candidate nearest-distance retry after ratio-pass locking
`2.4`	`RouteMatchPredicate`	`route_gtfs_gtfs`	Restricts candidates to 50 m, then tries normalized GTFS route-id tokens, with direction names as a fallback
`2.5`	`PostpassUniqueUicPredicate`	`exact_postpass`	Last pass for `1 ATLAS + 1 OSM` left for a UIC when the OSM node has no `local_ref`
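The minimum-total-distance 2x2 assignment used in step 2.1 has only two possible outcomes, so it can be illustrated without a general assignment solver. The function below is a hypothetical sketch; the tie-break in favour of the straight assignment is an assumption.

```python
def assign_2x2(dist):
    """dist[i][j] is the distance from ATLAS row i to trio side node j.

    Returns the two (atlas_index, side_index) pairs with the smaller total distance.
    """
    straight = dist[0][0] + dist[1][1]
    crossed = dist[0][1] + dist[1][0]
    if straight <= crossed:
        return [(0, 0), (1, 1)]
    return [(0, 1), (1, 0)]
```

For distances `[[10, 4], [3, 12]]` the crossed assignment totals 7 m against 22 m for the straight one, so ATLAS row 0 goes to side node 1 and row 1 to side node 0. The middle stop_position never enters the matrix.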

Predicate contract

All predicates implement the same interface:

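from abc import ABC, abstractmethod

class BasePredicate(ABC):
    @abstractmethod
    def run(self, ctx: "MatchingContext") -> None:
        """Query state through ctx and call ctx.commit() for each accepted match."""
        ...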

They do not return match records. They query state and call ctx.commit() when they decide to match.

Phase 3: Record Matches Through `ctx.commit()`

MatchingContext.commit() is the only mutation gateway in the pipeline.

flowchart TB
    A["Predicate found a match"] --> B["Append primary MatchRecord"]
    B --> C["Lock ATLAS representative in matched_ids"]
    C --> D["Lock OSM representative in used_ids"]
    D --> E["Expand ATLAS siblings as duplicate_propagation"]
    E --> F["Expand OSM siblings as osm_group_propagation (pairs only)"]

For osm_trio groups, ctx.commit() intentionally skips OSM sibling propagation so the trio middle node does not become an osm_group_propagation match. During the database import phase, if both of its side nodes are successfully matched, the importer promotes the middle node directly to effectively_matched for fast querying and map rendering.

That immediate locking is why the pipeline is safe even inside a single predicate loop: later rows consult the updated state, not just the original snapshot.

This matters especially for the predicates that batch KDTree lookups up front:

LocalRefDistancePredicate
NearestDistancePredicate
RouteMatchPredicate

They all re-check used_ids against the live state before committing so they do not reuse a stale candidate computed earlier in the same run.
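The commit flow and the live re-check can be illustrated together with a toy context. Everything here is a simplified assumption: `SketchContext`, `run_batched`, and the tuple-shaped records are hypothetical stand-ins, and the sketch locks only the representatives, as the flow above states. How the real code tracks siblings in the lock sets is not described on this page.

```python
from dataclasses import dataclass, field


@dataclass
class SketchContext:
    atlas_siblings: dict = field(default_factory=dict)  # representative sloid -> sibling sloids
    osm_siblings: dict = field(default_factory=dict)    # representative node id -> sibling node ids
    osm_group_type: dict = field(default_factory=dict)  # representative node id -> group type
    matched_ids: set = field(default_factory=set)
    used_ids: set = field(default_factory=set)
    records: list = field(default_factory=list)

    def commit(self, sloid, node_id, match_type):
        # 1. Primary record, then lock both representatives immediately.
        self.records.append((sloid, node_id, match_type))
        self.matched_ids.add(sloid)
        self.used_ids.add(node_id)
        # 2. Duplicate ATLAS siblings receive their own records.
        for sibling in self.atlas_siblings.get(sloid, []):
            self.records.append((sibling, node_id, "duplicate_propagation"))
        # 3. OSM siblings are expanded for pair groups only, never for trios.
        if self.osm_group_type.get(node_id) != "osm_trio":
            for sibling in self.osm_siblings.get(node_id, []):
                self.records.append((sloid, sibling, "osm_group_propagation"))


def run_batched(ctx, candidates, match_type):
    """`candidates` ({sloid: node_id}) stands in for a batched KDTree lookup done up front."""
    for sloid, node_id in candidates.items():
        # Re-check the live state: an earlier commit in this loop may have taken the node.
        if sloid in ctx.matched_ids or node_id in ctx.used_ids:
            continue
        ctx.commit(sloid, node_id, match_type)


ctx = SketchContext(
    atlas_siblings={"a1": ["a2"]},
    osm_siblings={"n1": ["n2"], "t1": ["t2", "tm"]},
    osm_group_type={"n1": "osm_pair_uic", "t1": "osm_trio"},
)
run_batched(ctx, {"a1": "n1", "a3": "n1"}, "distance_matching_3a")
ctx.commit("a4", "t1", "distance_matching_trio")
```

In this run `a3` is skipped because `n1` was locked by the commit for `a1` a moment earlier, even though the candidate map was computed before either commit. The trio commit produces a single record and no osm_group_propagation.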

Phase 4: Global Matching Rules

Station filtering

Predicate family	Stations excluded?	Notes
Exact, name, distance, post-pass	Partially	`get_by_uic()`, `get_by_name()`, and non-station KDTree queries exclude `public_transport=station` and `railway=station`, but `aerialway=station` remains eligible because `OsmNode.is_station` explicitly returns `False` for aerialway stations
Route matching	Partially	Uses the same non-station filters as the other matching predicates, so `aerialway=station` remains eligible

Spatial constants

Constant	Value	Used by
`max_distance`	`50 m`	All spatial predicates, including route matching
`RATIO_TEST_FACTOR`	`4`	`NearestDistancePredicate`
`RATIO_TEST_MIN_D2`	`10 m`	`NearestDistancePredicate`
`GROUP_MAX_DISTANCE_M`	`12 m`	OSM pre-grouping
`GROUP_PERFECT_COUNT_MAX_DISTANCE_M`	`15 m`	OSM pre-grouping perfect-count branch
`ATLAS_NEARBY_OSM_MAX_DISTANCE_M`	`30 m`	Perfect-count ATLAS effective count

Distance storage

All predicates store the actual haversine distance they used, including exact UIC matches.
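For reference, a standard haversine implementation is shown below. The function name and the Earth radius constant are assumptions; the pipeline's own helper may use a slightly different radius.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, assumed value


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
```

With this radius, one degree of latitude comes out at roughly 111.2 km, which is a convenient sanity check when reading stored `distance_m` values.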

Outputs

run_pipeline() returns a PipelineResult, and run_matching() wraps it into MatchingOutput.

Type	Fields	Description
`PipelineResult`	`matched`, `unmatched_atlas`, `unmatched_osm`	Pure pipeline output
`MatchingOutput`	`PipelineResult` fields plus `duplicate_sloid_map`, `osm_stop_units`, `all_osm_nodes`	Pipeline output plus pre-pipeline grouping state needed by the importer

unmatched_atlas contains all unmatched nodes, including hidden duplicate siblings of unmatched representatives. unmatched_osm contains all unmatched OSM nodes, including group siblings and stations if they were never used.

No-nearby-OSM detection

After matching, the importer separately flags unmatched ATLAS rows that have no OSM node at all within 50 m. That does not create additional matches; it feeds problem detection.

Domain Models

The pipeline uses three core domain models plus two entity wrappers.

Type	Purpose
`AtlasNode`	Immutable ATLAS platform model
`OsmNode`	Immutable OSM node model
`MatchRecord`	Mutable link object with match metadata
`AtlasEntity`	Wrapper that exposes one representative plus duplicate siblings
`OsmEntity`	Wrapper that exposes one representative plus grouped OSM siblings

Key fields on AtlasNode:

Field	Source
`sloid`	`sloid`
`uic_ref`	`number`
`designation`	`designation`
`designation_official`	`designationOfficial`
`business_org_abbr`	`servicePointBusinessOrganisationAbbreviationEn`

Key fields on OsmNode:

Field	Source
`node_id`	XML node id
`local_ref`	`local_ref`, falling back to `ref`
`name`	`name`
`uic_name`	`uic_name`
`uic_ref`	`uic_ref`
`network`, `operator`	Same-named OSM tags
`public_transport`, `railway`, `amenity`, `aerialway`	Same-named OSM tags
`tags`	Full tag dict

Performance Notes

Spatial matching uses a lazy scipy.spatial.cKDTree over 3D unit-sphere coordinates. The tree is cached and only rebuilt when the station-inclusion mode changes; matched nodes are filtered at query time rather than by rebuilding the tree after every commit.
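The idea behind indexing unit-sphere coordinates is that a radius in metres converts to a straight-line (chord) length between unit vectors, which a Euclidean tree can query directly. The sketch below shows that conversion and the query-time filtering of used nodes. It substitutes a brute-force loop for the tree so that it runs with the standard library alone; with SciPy, the same chord length would be passed as `r` to `cKDTree.query_ball_point`. All names and the Earth radius are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius


def to_unit_vector(lat, lon):
    """Map degrees latitude/longitude to a point on the 3D unit sphere."""
    phi, lmb = math.radians(lat), math.radians(lon)
    return (math.cos(phi) * math.cos(lmb), math.cos(phi) * math.sin(lmb), math.sin(phi))


def metres_to_chord(radius_m):
    """Chord length on the unit sphere for a great-circle distance in metres."""
    return 2.0 * math.sin(radius_m / (2.0 * EARTH_RADIUS_M))


def query_radius(center, nodes, radius_m, used_ids=frozenset()):
    """Brute-force stand-in for a tree radius query; skips used nodes at query time."""
    chord = metres_to_chord(radius_m)
    c = to_unit_vector(*center)
    hits = []
    for node_id, (lat, lon) in nodes.items():
        if node_id in used_ids:
            continue  # filtered at query time, no index rebuild needed
        if math.dist(c, to_unit_vector(lat, lon)) <= chord:
            hits.append(node_id)
    return hits


# Invented coordinates: "near" is about 33 m north of the centre, "far" about 111 m.
nodes = {"near": (47.0003, 8.0), "far": (47.001, 8.0)}
within_50m = query_radius((47.0, 8.0), nodes, 50.0)
```

Filtering `used_ids` inside the query is what lets the index stay untouched after each commit: the set grows, the tree does not change.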

The batched KDTree path is used by the local-ref, nearest-distance, and route predicates. GroupProximityPredicate does not use the KDTree; it builds a full pairwise distance matrix with NumPy broadcasting over each grouped candidate set.

Results Summary

Metric	Value
ATLAS platforms	54,882
Matched pairs	60,863
Match rate	88.7%
Unmatched ATLAS	6,200
Unmatched OSM	7,987

Code References

Component	File
Domain models	models.py
Orchestrator	orchestrator.py
Pipeline framework	pipeline.py
State management	state.py
Exact matching	predicates/exact_matching.py
Name matching	predicates/name_matching.py
Distance matching	predicates/distance_matching.py
Route matching	predicates/route_matching_gtfs.py
Post-processing	predicates/postpass_matching.py
Spatial index	utils/spatial_index.py
