Parsing GPX Files with Python

A practical walkthrough of loading, cleaning, and analyzing GPS track data from a hiking trip using Python and gpxpy.

GPX (GPS Exchange Format) is the lingua franca of outdoor GPS data. Nearly every trail app, GPS watch, and handheld device can export it. But raw GPX files are often messy: they contain duplicate points, noisy elevation readings, and timestamp gaps from signal loss.

Here’s a baseline pipeline for loading and cleaning a GPX track.

Loading the File

The gpxpy library handles GPX parsing cleanly:

import gpxpy
import pandas as pd

with open("trail.gpx", "r") as f:
    gpx = gpxpy.parse(f)

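# Flatten tracks -> segments -> points into one row per point.
# Elevation and time are optional in GPX, so either may be None.
points = []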
for track in gpx.tracks:
    for segment in track.segments:
        for point in segment.points:
            points.append({
                "lat": point.latitude,
                "lon": point.longitude,
                "ele": point.elevation,
                "time": point.time,
            })

df = pd.DataFrame(points)
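
The intro mentions duplicate points and timestamp gaps, so here is one possible way to handle them right after loading. This is a minimal sketch on a hand-built DataFrame with the same columns as df. The helper name clean_track and the 30-second gap threshold are assumptions for illustration, not something gpxpy provides.

```python
import pandas as pd

def clean_track(df, max_gap_s=30):
    """Drop repeated fixes and flag timestamp gaps (hypothetical helper)."""
    out = df.sort_values("time").reset_index(drop=True)
    # Treat a point as a duplicate if it repeats an earlier timestamp.
    out = out.drop_duplicates(subset="time", keep="first").reset_index(drop=True)
    # Seconds since the previous point; NaN for the first row.
    dt = out["time"].diff().dt.total_seconds()
    # NaN > threshold evaluates to False, so the first row is never flagged.
    out["gap"] = dt > max_gap_s
    return out

sample = pd.DataFrame({
    "lat": [46.0000, 46.0000, 46.0001, 46.0005],
    "lon": [7.0000, 7.0000, 7.0001, 7.0005],
    "ele": [1500.0, 1500.0, 1501.0, 1504.0],
    "time": pd.to_datetime([
        "2024-06-01 08:00:00",
        "2024-06-01 08:00:00",  # duplicate fix
        "2024-06-01 08:00:01",
        "2024-06-01 08:02:01",  # 120 s after the previous point
    ], utc=True),
})

cleaned = clean_track(sample)
print(len(cleaned))             # 3
print(cleaned["gap"].tolist())  # [False, False, True]
```

Flagging gaps rather than dropping rows keeps the decision about what to do with them (split the track, interpolate, or ignore) for a later step.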

Calculating Distance

The Haversine formula gives the great-circle distance between two lat/lon pairs. For hiking distances, this is accurate to well within GPS measurement error:

import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    R = 6_371_000  # Earth radius in meters
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlambda = np.radians(lon2 - lon1)
    a = np.sin(dphi/2)**2 + np.cos(phi1)*np.cos(phi2)*np.sin(dlambda/2)**2
    return 2 * R * np.arcsin(np.sqrt(a))

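# shift() pairs each point with the one before it. The first row has no
# predecessor, so its distance is NaN and gets filled with 0.
df["dist_m"] = haversine(
    df["lat"].shift(), df["lon"].shift(),
    df["lat"], df["lon"],
).fillna(0)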

total_km = df["dist_m"].sum() / 1000
print(f"Total distance: {total_km:.2f} km")
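
A quick sanity check for the function is worth doing before trusting the totals. The sketch below repeats the haversine definition so it runs on its own and checks two known cases: one degree of latitude, which on a sphere of this radius is about 111.19 km, and a point compared with itself. The test coordinates are arbitrary.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    R = 6_371_000  # Earth radius in meters
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlambda = np.radians(lon2 - lon1)
    a = np.sin(dphi/2)**2 + np.cos(phi1)*np.cos(phi2)*np.sin(dlambda/2)**2
    return 2 * R * np.arcsin(np.sqrt(a))

# One degree of latitude along a meridian: R * pi / 180.
one_degree = haversine(0.0, 0.0, 1.0, 0.0)
print(f"{one_degree / 1000:.2f} km")  # 111.19 km

# Identical points should be exactly zero meters apart.
same_point = haversine(46.0, 7.0, 46.0, 7.0)
print(same_point)  # 0.0
```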

Smoothing Elevation

GPS elevation data is noisy: consumer devices typically have ±5–15 m of vertical error, and the noise compounds when computing cumulative gain, because every upward wiggle counts as climbing. A rolling median removes most spike artifacts without smearing each spike across its neighbors the way a rolling mean would:

df["ele_smooth"] = (
    df["ele"]
    .rolling(window=5, center=True, min_periods=1)
    .median()
)

gain = df["ele_smooth"].diff().clip(lower=0).sum()
loss = df["ele_smooth"].diff().clip(upper=0).abs().sum()

print(f"Elevation gain: {gain:.0f} m")
print(f"Elevation loss: {loss:.0f} m")
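
To illustrate the difference between the two filters, here is a small sketch on a made-up flat track with a single 50 m spike. The numbers are synthetic and chosen only to make the effect visible. The gain helper is a local convenience, not part of pandas.

```python
import pandas as pd

# Flat trail at 100 m with one spurious 150 m reading in the middle.
ele = pd.Series([100.0, 100.0, 100.0, 150.0, 100.0, 100.0, 100.0])

def gain(series):
    """Sum of all positive point-to-point elevation changes."""
    return series.diff().clip(lower=0).sum()

median_smooth = ele.rolling(window=5, center=True, min_periods=1).median()
mean_smooth = ele.rolling(window=5, center=True, min_periods=1).mean()

print(f"raw gain:    {gain(ele):.1f} m")            # 50.0 m
print(f"median gain: {gain(median_smooth):.1f} m")  # 0.0 m
print(f"mean gain:   {gain(mean_smooth):.1f} m")    # 15.0 m
```

The median discards the outlier entirely, while the mean spreads it into a low bump that still registers as climbing.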

Window size is a tunable parameter. A window of 5 works well for tracks recorded at 1-second intervals; for tracks at 5-second intervals, 3 is usually enough.
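
If you process tracks from several devices, the window can be picked from the recording interval instead of being hard-coded. The sketch below is one hypothetical heuristic: pick_window and its 2-second cutoff are assumptions that simply encode the two rules of thumb above, and the timestamps are synthetic.

```python
import pandas as pd

def pick_window(times, fast_interval_s=2.0):
    """Return 5 for roughly 1 s sampling, 3 for slower tracks (assumed cutoff)."""
    # The median step is robust to occasional gaps from signal loss.
    step = times.diff().dt.total_seconds().median()
    return 5 if step <= fast_interval_s else 3

start = pd.Timestamp("2024-06-01 08:00:00")
one_second = pd.Series(start + pd.to_timedelta([i for i in range(10)], unit="s"))
five_second = pd.Series(start + pd.to_timedelta([i * 5 for i in range(10)], unit="s"))

print(pick_window(one_second))   # 5
print(pick_window(five_second))  # 3
```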

Next Steps

This gets you distance and smoothed elevation numbers from a typical GPX track. Future posts will build on this foundation: plotting elevation profiles, computing pace zones, and aggregating multiple tracks across a season into a single dataset.