Problem Detection and Prioritization

After the matching pipeline completes (Section 2), the system analyzes both datasets (ATLAS and OSM) for data quality issues.

Problem Categories

The system detects three categories of problems:

| Category | Question | User Action |
|---|---|---|
| Stop Problems | "Is this stop at the correct location with correct attributes?" | Move node, update tags, fix matching. |
| Route Entity Problems | "Does this route exist and is it defined correctly?" | Create or delete route relation; fix route tags. |
| Route Membership Problems | "Is the list of stops for this route consistent?" | Add/remove stops from relation; reorder stops; fix roles. |

1. Stop Problems (Implemented)

Issues with individual stops (Distance, Attributes, Isolation, Duplicates). See 3.1 Stop Problems.

2. Route Entity Problems (Planned)

Issues with the Route object itself.

3. Route Membership Problems (Planned)

Issues with the stop-route relationship.


Architecture

Problem detection is built on the same domain models as the matching pipeline. The PipelineResult (containing MatchRecord, AtlasNode, and OsmNode entities) flows directly into problem detection — no ORM or dictionary conversion is needed.

Polymorphic Predicates

Each problem predicate is a plain function with a polymorphic signature:

def predicate(ctx: ProblemContext, record: MatchRecord | AtlasNode | OsmNode) -> list[ProblemResult]: ...

Predicates use isinstance checks to decide what to evaluate:

  • distance_problem and attributes_problem only act on MatchRecord (return [] for bare nodes)
  • unmatched_problem only acts on bare AtlasNode or OsmNode (returns [] for MatchRecord)
  • duplicates_problem acts on all three types

This allows the same STOP_PROBLEM_PIPELINE list to be used for both matched and unmatched records.
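
The isinstance dispatch described above can be sketched as follows. This is a minimal illustration, not the project's code: the stand-in classes, their fields (ref, node_id, distance_m), the threshold value, and the priorities chosen are all assumptions made so the example runs on its own.

```python
from __future__ import annotations

from dataclasses import dataclass


# Stand-ins for the real domain models; every field here is hypothetical.
@dataclass(frozen=True)
class AtlasNode:
    ref: str


@dataclass(frozen=True)
class OsmNode:
    node_id: int


@dataclass(frozen=True)
class MatchRecord:
    atlas: AtlasNode
    osm: OsmNode
    distance_m: float


@dataclass(frozen=True)
class ProblemContext:
    distance_threshold_m: float = 25.0  # hypothetical threshold


@dataclass(frozen=True)
class ProblemResult:
    problem_type: str
    priority: int


def distance_problem(
    ctx: ProblemContext, record: MatchRecord | AtlasNode | OsmNode
) -> list[ProblemResult]:
    # Only matched pairs have a distance to check; bare nodes yield nothing.
    if not isinstance(record, MatchRecord):
        return []
    if record.distance_m > ctx.distance_threshold_m:
        return [ProblemResult("distance", 2)]
    return []


def unmatched_problem(
    ctx: ProblemContext, record: MatchRecord | AtlasNode | OsmNode
) -> list[ProblemResult]:
    # The mirror case: matched records are skipped, bare nodes are flagged.
    if isinstance(record, MatchRecord):
        return []
    return [ProblemResult("unmatched", 3)]
```

Because each predicate returns an empty list for record types it does not handle, the caller never needs to filter the pipeline by record type.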

Two Invocation Paths

| Record type | Invocation | Where |
|---|---|---|
| Matched (MatchRecord) | match_record.evaluate_problems(problem_ctx, STOP_PROBLEM_PIPELINE) | importer.py: calls the method natively on the domain entity |
| Unmatched (AtlasNode / OsmNode) | run_problem_pipeline(STOP_PROBLEM_PIPELINE, problem_ctx, node) | importer.py: uses the standalone runner |

Both paths produce list[ProblemResult], which is then mapped to ORM Problem rows via apply_problem_results().
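
A rough sketch of how the two paths might relate is shown below. The argument order follows the calls in the table above; everything else is an assumption, in particular that the runner simply concatenates predicate results and that evaluate_problems() delegates to it. The toy predicates are hypothetical and exist only to exercise the runner.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass(frozen=True)
class ProblemResult:
    problem_type: str
    priority: int


# A predicate takes (ctx, record) and returns zero or more results.
Predicate = Callable[[Any, Any], list]


def run_problem_pipeline(
    pipeline: Sequence[Predicate], ctx: Any, record: Any
) -> list:
    # Assumed behavior: run every predicate in order and collect all results.
    results = []
    for predicate in pipeline:
        results.extend(predicate(ctx, record))
    return results


class MatchRecord:
    # Stand-in for the real domain entity.
    def evaluate_problems(self, ctx: Any, pipeline: Sequence[Predicate]) -> list:
        # Assumed: the method is a thin wrapper around the same runner.
        return run_problem_pipeline(pipeline, ctx, self)


# Hypothetical toy predicates used only for this demonstration.
def always_minor(ctx: Any, record: Any) -> list:
    return [ProblemResult("attributes", 3)]


def never_fires(ctx: Any, record: Any) -> list:
    return []


DEMO_PIPELINE = [always_minor, never_fires]

matched_results = MatchRecord().evaluate_problems(None, DEMO_PIPELINE)
unmatched_results = run_problem_pipeline(DEMO_PIPELINE, None, object())
```

Under these assumptions both calls return the same list shape, so a single mapping step to ORM rows can serve matched and unmatched records alike.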

ProblemResult Value Object

Every predicate returns a list of ProblemResult objects. ProblemResult is a lightweight frozen dataclass, decoupled from SQLAlchemy:

from dataclasses import dataclass

@dataclass(frozen=True)
class ProblemResult:
    problem_type: str        # 'distance', 'attributes', 'unmatched', 'duplicates'
    priority: int            # 1 = P1, 2 = P2, 3 = P3
    has_atlas_duplicate: bool = False
    has_osm_duplicate: bool = False
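
To illustrate what the frozen dataclass gives callers, the short example below repeats the definition so it runs on its own, then shows that instances reject mutation, can be copied with changes via dataclasses.replace, and compare and hash by value. The sample values are made up for the demonstration.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ProblemResult:
    problem_type: str
    priority: int
    has_atlas_duplicate: bool = False
    has_osm_duplicate: bool = False


result = ProblemResult("duplicates", 2, has_osm_duplicate=True)

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
try:
    result.priority = 1
except FrozenInstanceError:
    mutated = False
else:
    mutated = True

# replace() builds a modified copy and leaves the original untouched.
escalated = replace(result, priority=1)

# Frozen dataclasses with the default eq=True are hashable, so equal
# results collapse to one element in a set.
unique = {result, ProblemResult("duplicates", 2, has_osm_duplicate=True)}
```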

Priority Levels

All problems use a consistent three-level priority system:

| Level | Meaning |
|---|---|
| P1 | Critical |
| P2 | Significant |
| P3 | Minor |

Priority assignment is rule-based and considers factors like:

  • Distance thresholds (configurable constants in context.py)
  • Operator type (e.g., SBB railway platforms have higher distance tolerance)
  • Attribute importance (UIC references are more critical than operator names)
  • Isolation status (stops with no nearby counterparts are higher priority)
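
As an illustration of rule-based assignment, the sketch below maps a match distance to a priority level. The threshold values, the tolerance factor, and the function name are hypothetical; the real constants live in context.py and may be structured quite differently.

```python
from typing import Optional

P1, P2, P3 = 1, 2, 3

# Hypothetical thresholds in metres; not the project's actual constants.
DISTANCE_P1_M = 100.0
DISTANCE_P2_M = 50.0
DISTANCE_P3_M = 25.0

# Hypothetical factor widening the thresholds for railway platforms.
RAIL_TOLERANCE_FACTOR = 2.0


def distance_priority(
    distance_m: float, is_railway_platform: bool = False
) -> Optional[int]:
    """Return P1, P2, or P3 for a match distance, or None if within tolerance."""
    factor = RAIL_TOLERANCE_FACTOR if is_railway_platform else 1.0
    if distance_m > DISTANCE_P1_M * factor:
        return P1
    if distance_m > DISTANCE_P2_M * factor:
        return P2
    if distance_m > DISTANCE_P3_M * factor:
        return P3
    return None
```

With these made-up numbers, a 120 m offset is P1 for an ordinary stop but only P2 for a railway platform, which mirrors the higher tolerance mentioned in the list above.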


Code Reference

| Component | File | Purpose |
|---|---|---|
| Result value object | problem_detection/result.py | ProblemResult: frozen dataclass |
| Shared context | problem_detection/context.py | ProblemContext.build(): KDTrees, UIC counts, duplicate maps |
| Pipeline runner | problem_detection/pipeline.py | run_problem_pipeline(), STOP_PROBLEM_PIPELINE |
| Distance predicate | predicates/distance.py | distance_problem() |
| Attributes predicate | predicates/attributes.py | attributes_problem() |
| Unmatched predicate | predicates/unmatched.py | unmatched_problem() |
| Duplicates predicate | predicates/duplicates.py | duplicates_problem() |
| Domain models | models.py | MatchRecord.evaluate_problems(), AtlasNode, OsmNode |
| Database import | database/importer.py | Calls problem detection per stop during insertion |
| API endpoints | problems.py | Problem listing, aggregation, and duplicates grouping |