Importing 617 East Bay Trails: Geospatial Clustering and Path Assembly

The East Bay Regional Parks District maintains a GeoJSON dataset of trails across its 73 parks. At face value it’s a straightforward import: read the file, generate a record per trail. In practice, the geometry is a bag of disconnected LineString segments that need substantial reassembly before they’re useful.

The two main problems are:

Same-named trails in different parks. “Ridge Trail” appears in a dozen parks. A naive import would merge all those segments into a single incoherent path spanning the entire East Bay.
Fragmented trail segments. A single trail is often split into dozens of LineString features by intersections, park boundaries, or quirks of how the data was captured. They need to be chained back into a coherent route.

Here’s how I solved both.

Step 1: Geographic Clustering with Union-Find

Before chaining segments, I need to separate same-named trails by location. The approach: if two segments with the same name are within 1.5 km of each other (endpoint-to-endpoint), treat them as part of the same trail. Otherwise, they’re distinct trails that happen to share a name.

Union-Find (disjoint set union) is the natural data structure here. Each segment starts in its own set. I iterate over all pairs of same-named segments and union them if any endpoint pair is within the distance threshold:

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def cluster_segments(segments: list[dict], threshold_km: float = 1.5) -> list[list[dict]]:
    uf = UnionFind(len(segments))
    for i, a in enumerate(segments):
        for j, b in enumerate(segments[i+1:], i+1):
            if min_endpoint_distance_km(a, b) < threshold_km:
                uf.union(i, j)
    groups: dict[int, list[dict]] = {}
    for i, seg in enumerate(segments):
        root = uf.find(i)
        groups.setdefault(root, []).append(seg)
    return list(groups.values())

This separates “Ridge Trail in Tilden” from “Ridge Trail in Las Trampas” without needing any park-boundary data.

Step 2: Greedy Nearest-Neighbor Chaining

Within each cluster, the segments need to be assembled into a single path. The approach is greedy: start with an arbitrary segment, then repeatedly find the unvisited segment whose nearest endpoint is closest to the current path’s leading endpoint, and append it (flipping it if needed so the endpoints connect).

def chain_segments(segments: list[list[tuple]]) -> list[tuple]:
    remaining = list(segments)
    chain = remaining.pop(0)

    while remaining:
        head = chain[-1]
        best_idx, best_dist, best_flip = None, float("inf"), False
        for i, seg in enumerate(remaining):
            d_fwd = haversine_km(head, seg[0])
            d_rev = haversine_km(head, seg[-1])
            if d_fwd < best_dist:
                best_idx, best_dist, best_flip = i, d_fwd, False
            if d_rev < best_dist:
                best_idx, best_dist, best_flip = i, d_rev, True
        next_seg = remaining.pop(best_idx)
        chain.extend(reversed(next_seg) if best_flip else next_seg)

    return chain

Greedy nearest-neighbor works for most trails. It fails at genuine junctions: when three or more segments meet at a point, you can assemble them in multiple orders, some of which produce a coherent route and some of which don’t.

Step 3: DFS Backtracking for Branching Junctions

For the trails where greedy chaining produces a tangled mess, I added DFS backtracking. Build a graph where segments are edges and endpoints are nodes (snapped to a grid at ~10m resolution to handle floating-point mismatches). Then do a depth-first search for the longest simple path through that graph.

def longest_path_dfs(graph: dict, start: str, visited_edges: set) -> list:
    best = []
    stack = [(start, [], visited_edges.copy())]
    while stack:
        node, path, visited = stack.pop()
        if len(path) > len(best):
            best = path[:]
        for neighbor, edge_id, coords in graph.get(node, []):
            if edge_id not in visited:
                stack.append((neighbor, path + [coords], visited | {edge_id}))
    return best

DFS is slower than greedy — O(E!) in the worst case — but trail networks are small graphs. A park with 30 trail segments finds the optimal assembly in milliseconds.

Handling Short Non-Loop Trails

One last edge case: short out-and-back trails where the total assembled distance is under ~1.5 km. These are typically spur trails or short connectors, and displaying a one-way distance of 0.6 km undersells what’s actually a 1.2 km round trip.

The fix: if the trail isn’t a loop (start and end points are more than 200m apart) and the total distance is under a threshold, mirror the coordinates to simulate the return trip. The elevation profile reflects the out-and-back shape, and the displayed distance matches what a hiker would actually walk.

Results

The importer processed the EBRPD GeoJSON in under 3 seconds, producing 617 distinct trail records with clean geometries. About 80% were assembled by greedy chaining alone. The DFS backtracker handled the remaining 20%, mostly in parks with dense, intersecting trail networks like Briones and Redwood Regional.

A small number of features were filtered entirely: parking lots, fire roads under 0.5 km, and maintenance paths that appear in the dataset but aren’t meant for hikers. That left 617 trails ready for the frontend.