Research Notes
What I Learned From The Hike Data
A few more details on how this calculator was built: what I downloaded, how I validated the data, and what the different model tests showed.
Archive: 269 activities. Downloaded activity archive from April 25, 2025 to June 7, 2026.
Model set: 257 activities. Usable for the current moving-time model after review flags.
Distance: 991 mi. Total distance in the normal analysis archive.
Elevation gain: 121,784 ft. Total climbing in the normal analysis archive.
Breaks: 101 hr. Estimated stopped time across normal activities.
Step 1: Validating The Data
Before testing any models, I needed to make sure the downloaded files were usable enough to build from. I verified that my calculations of distance, elevation gain, descent, moving time, and stopped time aligned closely with AllTrails' numbers for the same activities.
The results were close enough to build on. One note on elevation: GPS devices and apps all process elevation data differently, so those numbers have a bit more variance than the time-based metrics. Moving time and stopped time were the most reliable.
Mean absolute difference from AllTrails
Distance: 0.05 mi. Compared with AllTrails distance.
Elevation gain: 78 ft. Compared with AllTrails gain.
Time hiking: 1.6 min. Compared with AllTrails moving time.
Total time outside: 0.2 min. Compared with AllTrails elapsed time.
Stopped time: 1.5 min. Compared after subtracting moving time from elapsed time.
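The comparison above can be sketched as a mean absolute difference over paired per-activity values. This is a minimal illustration, not the validation script itself: the function names and the sample numbers are made up, and only the idea (my calculation versus the AllTrails figure for the same activity, with stopped time derived as elapsed minus moving) comes from the notes.

```python
from statistics import mean


def mean_absolute_difference(mine, reference):
    """Average of |mine - reference| across paired per-activity values."""
    if len(mine) != len(reference):
        raise ValueError("both lists must cover the same activities")
    return mean(abs(a - b) for a, b in zip(mine, reference))


def stopped_minutes(elapsed_min, moving_min):
    """Stopped time derived as elapsed time minus moving time, floored at zero."""
    return max(0.0, elapsed_min - moving_min)


# Made-up distances in miles for three activities, not values from the archive.
my_distance = [3.10, 5.42, 8.01]
alltrails_distance = [3.05, 5.50, 8.00]
print(round(mean_absolute_difference(my_distance, alltrails_distance), 3))
```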
The Original Naismith Rule Held Up
I started with the original Naismith rule and tuned it against my activity data. Tuning calibrates two parameters: flat walking speed and the time added for elevation gain.
speed_kmh = 5.85
climb_min_per_100m = 16.5
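As a rough sketch, those two parameters combine into a moving-time estimate like this. The function name and the choice to ignore negative gain are assumptions for illustration; the two constants are the tuned values above.

```python
SPEED_KMH = 5.85            # tuned flat walking speed
CLIMB_MIN_PER_100M = 16.5   # tuned minutes added per 100 m of elevation gain


def naismith_minutes(distance_km, gain_m,
                     speed_kmh=SPEED_KMH,
                     climb_min_per_100m=CLIMB_MIN_PER_100M):
    """Moving-time estimate in minutes: flat walking time plus a climb penalty."""
    flat_minutes = distance_km / speed_kmh * 60.0
    # Assumption: only positive gain adds time; descent is not credited or penalized.
    climb_minutes = max(gain_m, 0.0) / 100.0 * climb_min_per_100m
    return flat_minutes + climb_minutes


# Example: a 10 km route with 500 m of climbing.
print(round(naismith_minutes(10.0, 500.0), 1))
```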
The tuned estimate was hard to beat. I also tested a personal correction factor, which adjusts the estimate based on an individual's recent hiking pace. It reduced average bias, but it did not improve accuracy enough on routes the model had not seen before to become the default.
The default is tuned Naismith. Personal correction only applies when your uploaded history gives a clear reason to trust it for the planned route.
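One way a personal correction factor could work is sketched below. Everything specific here is an assumption for illustration: the median of actual-to-estimated ratios, the minimum of five recent hikes, and the clamp range are hypothetical choices, not the calculator's confirmed rule.

```python
from statistics import median


def personal_factor(recent, min_hikes=5, clamp=(0.8, 1.25)):
    """Multiplier for the formula estimate from recent (actual_min, estimated_min) pairs.

    min_hikes and clamp are hypothetical safeguards, not values from the notes.
    """
    ratios = [actual / est for actual, est in recent if est > 0]
    if len(ratios) < min_hikes:
        return 1.0  # too little history: keep the plain tuned estimate
    low, high = clamp
    return min(high, max(low, median(ratios)))


# Five made-up hikes that each ran about 10% longer than estimated.
history = [(110.0, 100.0)] * 5
print(personal_factor(history))
```

Returning 1.0 when history is thin keeps the tuned formula as the default, which matches the conservative stance described above.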
Original-style Naismith: mean error 12.8%, median error 10.6%, mean absolute time error 12.3 minutes. The original distance-plus-elevation rule, used as the starting point.
Tuned Naismith: mean error 10.3%, median error 6.4%, mean absolute time error 10.7 minutes. The current default because it stayed simple and did well in testing.
Personal correction factor: mean error 11.7%, median error 9.6%, mean absolute time error 10.7 minutes. Adjusted for recent hiking pace. It reduced average bias but did not improve held-out error enough to be the default.
Slope-bucket model: mean error 15.9%, median error 15.9%, mean absolute time error 13.4 minutes. Useful for exploration, not strong enough yet as the main estimate.
Tuned Naismith estimated time vs actual time
Each dot is one clean activity in the model set. Dots near the diagonal are close estimates. The shaded band is about +/- 30 minutes.
Testing Machine Learning
For my own learning, I wanted to see if machine learning could beat a tuned Naismith estimate. Rather than replacing the formula, I set up each ML model to predict the remaining error after Naismith ran first, that is, actual moving time minus the tuned Naismith estimate.
This kept the formula as the foundation. The question was whether machine learning could learn that remaining error and make future-hike estimates meaningfully better.
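The residual setup can be sketched in a few lines. The function names and the stand-in "model" (a plain mean of past residuals) are assumptions for illustration; the real models predicted a per-route residual from route-known features.

```python
from statistics import mean


def residual_minutes(actual_min, naismith_min):
    """Training target: how far the actual moving time landed from the formula."""
    return actual_min - naismith_min


def corrected_estimate(naismith_min, predicted_residual_min):
    """Final estimate: the formula first, then the learned correction on top."""
    return naismith_min + predicted_residual_min


# Stand-in model: predict the mean residual of past hikes.
# Pairs are (actual minutes, Naismith minutes) and are made up.
past = [(95.0, 100.0), (130.0, 120.0), (62.0, 60.0)]
mean_residual = mean(residual_minutes(a, n) for a, n in past)
print(corrected_estimate(90.0, mean_residual))
```

If the model predicts a residual of zero, the estimate falls back to tuned Naismith unchanged, which is what keeps the formula as the foundation.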
Target: difference from Naismith. Difference between actual moving time and tuned Naismith estimate.
Rows: 179 / 39 / 39. Training / validation / final holdout.
Features: 20. Route-known only: things you would know before the hike.
Segments: 47,763. Small GPX route pieces used in the segment-level tests.
Similar-route matching estimated time vs actual time
Each dot is one final-holdout hike scored after model selection. Dots near the diagonal are close estimates. The shaded band is about +/- 30 minutes.
How I Validated The Models
I used chronological validation: earlier hikes trained the models, the next block was used to choose the best model, and the final block was scored once after that choice. The final block had no influence on which model was picked.
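A chronological split of this kind might look like the sketch below. The 179 / 39 / 39 sizes come from the notes; the function name and the "date" field are hypothetical.

```python
def chronological_split(hikes, n_train=179, n_val=39, n_holdout=39):
    """Split hikes into train / validation / holdout by date, oldest first.

    hikes: list of dicts with a sortable "date" value (a hypothetical field name).
    """
    if len(hikes) != n_train + n_val + n_holdout:
        raise ValueError("split sizes must add up to the number of hikes")
    ordered = sorted(hikes, key=lambda h: h["date"])
    train = ordered[:n_train]
    validation = ordered[n_train:n_train + n_val]
    holdout = ordered[n_train + n_val:]
    return train, validation, holdout
```

Sorting before slicing is the important step: every validation hike is later than every training hike, and every holdout hike is later than both, so no model sees the future while it is being fitted or chosen.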
The features I allowed were ones you could know before starting a hike. They were route-known features, not facts learned after the hike was over.
Route size: distance, elevation gain, descent, and tuned Naismith hours
Route steepness: gain per mile, loss per mile, total vertical per mile, and steepness mix
Route profile: steep downhill, downhill, flat, uphill, and steep uphill distance fractions
Route shape: turn density from the planned route file when available
I did not let the models use anything you would only know after the hike: actual elapsed time, stopped time, raw coordinates, or post-hike notes.
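A few of the route-known features listed above could be derived from a planned route roughly as follows. This is a sketch under stated assumptions: the input shape, the function name, and especially the 3% and 15% grade cutoffs are hypothetical, since the notes do not give the actual bucket boundaries.

```python
FEET_PER_MILE = 5280.0


def route_features(segments, gentle=0.03, steep=0.15):
    """Summarize a planned route from (distance_mi, elevation_change_ft) pieces.

    gentle and steep are hypothetical grade cutoffs (rise over run).
    """
    total_mi = sum(d for d, _ in segments if d > 0)
    if total_mi <= 0:
        raise ValueError("route needs some positive distance")
    gain_ft = sum(max(dz, 0.0) for d, dz in segments if d > 0)
    loss_ft = sum(max(-dz, 0.0) for d, dz in segments if d > 0)
    buckets = {"steep_downhill": 0.0, "downhill": 0.0, "flat": 0.0,
               "uphill": 0.0, "steep_uphill": 0.0}
    for d, dz in segments:
        if d <= 0:
            continue
        grade = dz / (d * FEET_PER_MILE)
        if grade <= -steep:
            name = "steep_downhill"
        elif grade < -gentle:
            name = "downhill"
        elif grade <= gentle:
            name = "flat"
        elif grade < steep:
            name = "uphill"
        else:
            name = "steep_uphill"
        buckets[name] += d
    features = {
        "distance_mi": total_mi,
        "gain_per_mi": gain_ft / total_mi,
        "loss_per_mi": loss_ft / total_mi,
        "vertical_per_mi": (gain_ft + loss_ft) / total_mi,
    }
    # Distance fractions per slope bucket; they sum to 1.0 across the route.
    for name, miles in buckets.items():
        features[name + "_fraction"] = miles / total_mi
    return features
```

Nothing in this function needs a timestamp, which is the point: every input is available from the planned route before the hike starts.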
What the rolling test means
Rolling tests only use earlier hikes to predict later hikes. Mean is the average error across all test hikes. Median is the middle value, so it is less skewed by outliers. The 80th percentile is the error that 80% of test hikes stayed at or below.
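The three summary numbers can be computed as in this sketch. The function name is hypothetical, and the nearest-rank percentile is one common convention chosen here as an assumption; other percentile definitions interpolate and give slightly different values.

```python
from statistics import mean, median


def error_summary(actual_min, estimated_min):
    """Mean, median, and 80th-percentile absolute percent error."""
    errors = sorted(abs(est - act) * 100.0 / act
                    for act, est in zip(actual_min, estimated_min))
    if not errors:
        raise ValueError("need at least one test hike")
    # Nearest-rank percentile: the smallest error with at least 80% of hikes
    # at or below it. (4n + 4) // 5 is ceil(0.8 * n) in integer arithmetic.
    rank = (4 * len(errors) + 4) // 5
    return {"mean": mean(errors), "median": median(errors), "p80": errors[rank - 1]}


# Made-up times: one outlier pulls the mean up but leaves the median alone.
summary = error_summary([100.0] * 5, [101.0, 102.0, 103.0, 104.0, 110.0])
print(summary)
```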
ML sweep: validation vs final holdout
Lower error is better. Models are ordered by validation error, because the validation slice is where the model choice was made.
What I Tested
I tested a wide range of approaches: linear shrinkage, robust linear models, polynomial interactions, nearest-neighbor models, kernel models, decision trees, random forests, extra trees, boosting, model blends, and a small neural net.
Similar-route matching won the validation slice and barely beat tuned Naismith on the final holdout: 7.8% mean absolute error vs. 8.0%. Promising, but still not enough evidence to fully move away from the simpler formula.
Tree and boosting models found patterns, but several performed better on validation than on the final holdout. This is the classic small-data problem: the model learned details that did not repeat on new hikes.
Other Hiking Formulas I Compared
I compared a few well-known hiking formulas to see if any of them outperformed tuned Naismith on future-style data. None did.
Tuned Naismith
The current baseline uses about 5.85 km/h on flat distance plus 16.5 minutes per 100 m of climb. This remained the reference model every ML result had to beat.
Tobler-style slope curve
Tobler changes speed continuously with grade. It is useful for thinking about segments, but the original curve did not beat the tuned baseline on this dataset.
Langmuir-style downhill correction
Langmuir gives different treatment to gentle and steep downhill. The tuned Langmuir-style pass was useful as a comparison, but it did not become the selected model.
Tranter-style fatigue correction
Tranter frames time through fitness and fatigue. One fixed fitness row was competitive, but it was not strong enough to become a general rule.
Personal terrain pace profile
This shows how pace differed by slope bucket in my history. It is useful terrain information, but when rolled back into a full-route estimate it was less accurate than the tuned Naismith baseline.
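For reference, the Tobler-style curve mentioned above is usually written as speed = 6 * exp(-3.5 * |grade + 0.05|) km/h. The sketch below applies that original published curve to route segments. It is not the tuned pass from these tests, and the function names and segment input shape are assumptions.

```python
import math


def tobler_speed_kmh(grade):
    """Tobler's hiking function: walking speed in km/h for a slope given as rise/run."""
    return 6.0 * math.exp(-3.5 * abs(grade + 0.05))


def tobler_minutes(segments):
    """Total time for (horizontal_distance_km, elevation_change_m) segments."""
    total = 0.0
    for dist_km, dz_m in segments:
        if dist_km <= 0:
            continue  # skip zero-length pieces rather than divide by zero
        grade = dz_m / (dist_km * 1000.0)
        total += dist_km / tobler_speed_kmh(grade) * 60.0
    return total


# Speed peaks on a gentle 5% downhill and drops off on either side of it.
for g in (-0.20, -0.05, 0.0, 0.10):
    print(g, round(tobler_speed_kmh(g), 2))
```

Because speed changes continuously with grade, this kind of curve needs per-segment slopes, whereas the Naismith baseline only needs total distance and total gain.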
What The Segment Work Added
The segment pass split GPX routes into many small pieces and tested whether route profile information could beat the whole-route residual model.
On my data set, I was slower on downhill terrain and faster on uphill terrain than the tuned Naismith baseline expects. That makes the terrain profile worth showing, even though it did not perform well enough as the main full-route estimate.
The production lesson is conservative. Segment features are good for showing how you handle steep downhill, downhill, flat, uphill, and steep uphill terrain. They are not yet strong enough to replace whole-route similar-hike matching.
Largest Final-Holdout Misses
The biggest misses on the final holdout mostly came from one-off hiking context the route-only model could not know before the hike: morel hunting, off-trail travel, steep off-trail terrain, and log downfall.
How many uploads help?
More useful uploads give the calculator more to work with. Lower lines mean lower error on the same later test hikes.
How Many Uploads Are Enough?
The calculator should work with no history, start learning from a handful of timestamped activities, and become more useful around 10 to 20 good uploads.
One or two uploads are useful for reading and summarizing. Five to nine can start learning pace and breaks. Around 20 or more can support a higher-confidence personal profile.
More data did not automatically solve everything. The model improved when it used recent and route-relevant history, not when it averaged all history into one correction.
The current product tiers use that lesson directly. Pace + breaks starts at 5 usable hikes, terrain patterns start at 10, a higher-confidence personal profile starts at 20, and similar-hike matching starts at 50.
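Those thresholds translate directly into a small lookup. The thresholds and tier names come from the paragraph above; the function name and the choice to return every unlocked tier as a list are assumptions for illustration.

```python
TIERS = [
    (5, "pace + breaks"),
    (10, "terrain patterns"),
    (20, "higher-confidence personal profile"),
    (50, "similar-hike matching"),
]


def unlocked_features(usable_hikes):
    """Return every tier whose minimum number of usable hikes has been reached."""
    return [name for minimum, name in TIERS if usable_hikes >= minimum]


print(unlocked_features(12))
```

With fewer than 5 usable hikes the list is empty, and the calculator would rely on the tuned baseline alone.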
What Similar-Hike Matching Tests
Manually entered route matching
For manually entered routes, similar-hike matching compares the main hiking estimate with past routes that look similar by distance, climbing, downhill, and route size. It asks: when past routes looked similar in those dimensions, did they usually take more or less time than the Naismith baseline?
Route profile matching
For GPX routes, similar-hike matching can also compare route profile. It uses slope buckets and turn density to test whether the planned route resembles past routes in a more detailed way.
The main hiking estimate remains first because it is easier to explain and has fewer ways to overfit. Similar-hike matching is useful learning material: it shows what nearby past routes suggest, how many similar routes it found, and whether terrain buckets add a meaningful route-specific correction.
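A minimal sketch of manually entered route matching, assuming a nearest-neighbor approach: find the past routes closest in distance, gain, and loss, then ask how their actual times compared with the baseline. The field names, k=5, the standard-deviation scaling, and the use of a median ratio are all hypothetical choices, not the calculator's confirmed method.

```python
from statistics import median, pstdev

KEYS = ("distance_mi", "gain_ft", "loss_ft")


def similar_hike_ratio(planned, history, k=5):
    """Median actual-to-Naismith ratio among the k most similar past routes.

    planned: dict with the KEYS above.
    history: list of dicts with the same keys plus actual_min and naismith_min.
    """
    if not history:
        return 1.0  # no history: leave the baseline estimate unchanged
    # Scale each dimension so miles and feet contribute comparably.
    scales = {key: pstdev([h[key] for h in history]) or 1.0 for key in KEYS}

    def distance_to_planned(h):
        return sum(((h[key] - planned[key]) / scales[key]) ** 2 for key in KEYS) ** 0.5

    nearest = sorted(history, key=distance_to_planned)[:k]
    return median(h["actual_min"] / h["naismith_min"] for h in nearest)
```

A ratio above 1.0 means similar past routes usually took longer than the baseline predicted, and a ratio below 1.0 means they usually took less. Reporting how many neighbors were found alongside the ratio would show how much evidence sits behind it.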
Stopped time grows on longer hikes
This is why the calculator keeps time hiking and stopped time separate.
Under 1 hour: 0.0 min/hr. 106 activities, middle range 0.0-2.0 min/hr.
1-2 hours: 3.1 min/hr. 92 activities, middle range 1.1-8.2 min/hr.
2-4 hours: 18.3 min/hr. 48 activities, middle range 14.0-28.0 min/hr.
4+ hours: 25.8 min/hr. 11 activities, middle range 23.2-28.7 min/hr.
Stopped Time Grows On Longer Hikes
I wanted to handle breaks because stopped time is a real part of planning a day on trail. The data backed that up. On short hikes, stopped time was often near zero. On longer hikes, it became a major part of the day.
That is why the calculator estimates time hiking and stopped time separately. And it is why the result is shown as a range. A break estimate is useful, but it is not a promise.
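The bucketed rates reported for stopped time can be turned into a break range like this. The rates are the ones from my data; the function name, and the assumptions that the buckets are keyed on hiking hours and that a rate can simply be multiplied by hours, are illustrative only.

```python
# (bucket upper bound in hours, low, headline, high), all in stopped minutes per hour
STOP_RATES = [
    (1.0, 0.0, 0.0, 2.0),
    (2.0, 1.1, 3.1, 8.2),
    (4.0, 14.0, 18.3, 28.0),
    (float("inf"), 23.2, 25.8, 28.7),
]


def stopped_minutes_range(hours):
    """Return (low, typical, high) stopped minutes for a hike of the given length."""
    if hours < 0:
        raise ValueError("hours must not be negative")
    for upper, low, typical, high in STOP_RATES:
        if hours < upper:
            return (low * hours, typical * hours, high * hours)
    raise ValueError("hours must be a finite number")


low, typical, high = stopped_minutes_range(3.0)
print(round(low), round(typical), round(high))
```

Returning three numbers instead of one mirrors the reason the result is shown as a range: the middle value is a planning aid, and the spread shows how much breaks varied on hikes of similar length.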
A sample route and elevation profile from the data set
The route is normalized before plotting, so this shows route geometry and turns without showing raw coordinates or a precise public location.
A planned route GPX lets the calculator measure downhill, uphill, flat distance, steep sections, turn density, point spacing, and whether the file is detailed enough to trust for route profile matching.
What The Data Supports
The tuned Naismith formula is a strong default estimate.
Downloaded file timestamps are accurate enough to validate time hiking, stopped time, and total time outside.
Personal uploads are valuable, especially for break behavior, route matching, and confidence.
A larger dataset in the future could test whether route type, season, terrain, pack weight, and fitness trends should change the estimate.
What The Model Does Not Cover Yet
A precise pack-weight penalty has not been validated yet. The backpacking notes are promising but too sparse for a general rule.
One person's personal correction does not transfer automatically to other hikers. That needs shared data from more people.
The model does not fully account for weather, snow, trail obstruction, altitude, injury, sleep, group pace, or route-finding problems.
A lower error in one test does not mean a model should become the default. The tool needs to be stable and explainable, not just show a lower number in one table.
Where This Goes Next
Better upload experience
Help users understand what the calculator can learn from their files, which ones were usable, and why any were ignored.
Crowdsource a larger dataset
I would like to build a larger opt-in dataset to verify which findings from my own history hold for other hikers and to improve accuracy. That system is not set up yet.
Keep similar-hike matching secondary
Keep similar-hike matching secondary until it beats the simpler baseline by a clearer margin across more people.
References
Local analysis sources include the AllTrails validation report, calibration prototype, route-family calibration report, residual model report, and upload requirement analysis from the hiking-time-estimate workspace.