Research Notes

What I Learned From The Hike Data

A few more details on how this calculator was built: what I downloaded, how I validated the data, and what the different model tests showed.

AI coding assistants were used in this research and in the development of this calculator.

Archive

269

Downloaded activity archive from April 25, 2025 to June 7, 2026.

Model set

257

Usable for the current moving-time model after review flags.

Distance

991 mi

Total distance in the normal analysis archive.

Elevation gain

121,784 ft

Total climbing in the normal analysis archive.

Breaks

101 hr

Estimated stopped time across normal activities.

Step 1: Validating The Data

Before testing any models, I needed to make sure the downloaded files were usable enough to build from. I verified that my calculations of distance, elevation gain, descent, moving time, and stopped time aligned closely with AllTrails' numbers for the same activities.

The results were close enough to build on. One note on elevation: GPS devices and apps all process elevation data differently, so those numbers have a bit more variance than the time-based metrics. Moving time and stopped time were the most reliable.

Mean absolute difference from AllTrails

Distance

0.05 mi

Compared with AllTrails distance.

Elevation gain

78 ft

Compared with AllTrails gain.

Time hiking

1.6 min

Compared with AllTrails moving time.

Total time outside

0.2 min

Compared with AllTrails elapsed time.

Stopped time

1.5 min

Compared after subtracting moving time from elapsed time.
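As a rough sketch of the comparison described above, the snippet below computes stopped time as elapsed minus moving time and then the mean absolute difference between two lists of per-activity values. The function names and the sample numbers are hypothetical and only illustrate the arithmetic; they are not the actual validation code or data.

```python
def stopped_minutes(elapsed_min, moving_min):
    """Stopped time is whatever part of elapsed time was not spent moving."""
    return max(0.0, elapsed_min - moving_min)


def mean_absolute_difference(mine, reference):
    """Average of |mine - reference| across paired activities."""
    if not mine or len(mine) != len(reference):
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(a - b) for a, b in zip(mine, reference)) / len(mine)


# Made-up distances in miles for two activities (illustrative only).
my_distance = [5.0, 3.2]
reference_distance = [5.1, 3.1]
print(round(mean_absolute_difference(my_distance, reference_distance), 2))
```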

The Original Naismith Rule Held Up

I started with the original Naismith rule and tuned it against my activity data. The tuned formula calibrates flat walking speed and the time added for elevation gain.

moving_time_min = 60 * distance_km / speed_kmh + (elevation_gain_m / 100) * climb_min_per_100m

Current fitted values:
speed_kmh = 5.85
climb_min_per_100m = 16.5
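A minimal sketch of the formula above, working in minutes throughout. The function and parameter names are hypothetical; the defaults are the fitted values quoted above.

```python
def tuned_naismith_minutes(distance_km, gain_m,
                           speed_kmh=5.85, climb_min_per_100m=16.5):
    """Estimate moving time in minutes from distance and elevation gain."""
    if speed_kmh <= 0:
        raise ValueError("speed_kmh must be positive")
    # Time to cover the distance on the flat, converted from hours to minutes.
    flat_minutes = 60.0 * distance_km / speed_kmh
    # Extra time added for every 100 m of climbing.
    climb_minutes = (gain_m / 100.0) * climb_min_per_100m
    return flat_minutes + climb_minutes


# Example: a 10 km route with 500 m of climbing.
print(f"{tuned_naismith_minutes(10.0, 500.0):.1f} min")
```

Passing different values for `speed_kmh` and `climb_min_per_100m` gives other Naismith-style variants without changing the structure of the estimate.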

The tuned estimate was hard to beat. I also tested a personal correction factor, which adjusts the estimate based on an individual's recent hiking pace. It reduced average bias, but it did not improve accuracy on routes the model had not seen before by enough to become the default.

The default is tuned Naismith. Personal correction only applies when your uploaded history gives a clear reason to trust it for the planned route.

Original-style Naismith

mean error 12.8%

Median error 10.6%. Mean absolute time error 12.3 minutes. The original distance-plus-elevation rule, used as the starting point.

Tuned Naismith

mean error 10.3%

Median error 6.4%. Mean absolute time error 10.7 minutes. The current default because it stayed simple and did well in testing.

Personal correction factor

mean error 11.7%

Median error 9.6%. Mean absolute time error 10.7 minutes. Adjusted for recent hiking pace. It reduced average bias but did not improve held-out error enough to be the default.

Slope-bucket model

mean error 15.9%

Median error 15.9%. Mean absolute time error 13.4 minutes. Useful for exploration, not strong enough yet as the main estimate.

Tuned Naismith estimated time vs actual time

Each dot is one clean activity in the model set. Dots near the diagonal are close estimates. The shaded band is about +/- 30 minutes.

[Scatter plot: tuned Naismith estimated time hiking (minutes) vs. actual time hiking (minutes), 0 to 500 minutes on both axes, with a perfect-match diagonal and a +/- 30 min band.]

Testing Machine Learning

For my own learning, I wanted to see if machine learning could beat a tuned Naismith estimate. Rather than replacing the formula, I set up each ML model to predict the remaining error after Naismith ran first:

predicted hiking time = tuned Naismith time + predicted residual

This kept the formula as the foundation. The question was whether machine learning could learn that remaining error and make future-hike estimates meaningfully better.
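The residual setup can be sketched as follows. This uses the simplest possible residual model, the median of recent residuals, as a stand-in for the ML models; the function names and the `recent` window of 10 hikes are assumptions made for illustration.

```python
from statistics import median


def residuals(actual_min, naismith_min):
    """Target for the residual models: actual moving time minus the Naismith estimate."""
    return [a - n for a, n in zip(actual_min, naismith_min)]


def predict_with_median_residual(naismith_min_new, train_residuals, recent=10):
    """Tuned Naismith time plus a predicted residual (here: median of recent residuals)."""
    if not train_residuals:
        return naismith_min_new  # no history, so fall back to the formula alone
    correction = median(train_residuals[-recent:])
    return naismith_min_new + correction


# Illustrative numbers only.
history = residuals([100.0, 130.0, 94.0], [90.0, 135.0, 90.0])  # [10.0, -5.0, 4.0]
print(predict_with_median_residual(120.0, history))
```

Any other residual model can replace the median here, as long as it only sees information available before the hike.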

Target

Difference from Naismith

Difference between actual moving time and tuned Naismith estimate.

Rows

179 / 39 / 39

Training / validation / final holdout.

Features

20

Route-known only: things you would know before the hike.

Segments

47,763

Small GPX route pieces used in the segment-level tests.

Similar-route matching estimated time vs actual time

Each dot is one final-holdout hike scored after model selection. Dots near the diagonal are close estimates. The shaded band is about +/- 30 minutes.

[Scatter plot: similar-route matching estimated time hiking (minutes) vs. actual time hiking (minutes), 0 to 350 minutes on both axes, with a perfect-match diagonal and a +/- 30 min band.]

How I Validated The Models

I used chronological validation: earlier hikes trained the models, the next block was used to choose the best model, and the final block was scored once after that choice. The final block had no influence on which model was picked.
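A chronological split like the one described above might look like this. The function name, the `date` key, and the placeholder hikes are hypothetical; the block sizes match the 179 / 39 / 39 rows listed earlier.

```python
from datetime import date, timedelta


def chronological_split(hikes, n_validation, n_holdout, date_key="date"):
    """Oldest hikes train, the next block validates, the newest block is the holdout."""
    ordered = sorted(hikes, key=lambda h: h[date_key])
    n_train = len(ordered) - n_validation - n_holdout
    if n_train <= 0:
        raise ValueError("not enough hikes for the requested split")
    train = ordered[:n_train]
    validation = ordered[n_train:n_train + n_validation]
    holdout = ordered[n_train + n_validation:]
    return train, validation, holdout


# Illustrative data only: 257 placeholder hikes on consecutive days.
hikes = [{"date": date(2025, 4, 25) + timedelta(days=i)} for i in range(257)]
train, validation, holdout = chronological_split(hikes, n_validation=39, n_holdout=39)
print(len(train), len(validation), len(holdout))
```

Sorting by date before slicing is what keeps later hikes from leaking into the training block.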

The features I allowed were ones you could know before starting a hike. They were route-known features, not facts learned after the hike was over.

Route size

distance, elevation gain, descent, and tuned Naismith hours

Route steepness

gain per mile, loss per mile, total vertical per mile, and steepness mix

Route profile

steep downhill, downhill, flat, uphill, and steep uphill distance fractions

Route shape

turn density from the planned route file when available

I did not let the models use anything you would only know after the hike: actual elapsed time, stopped time, raw coordinates, or post-hike notes.
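As one example of a route-known feature, the slope-bucket distance fractions can be sketched like this. The bucket thresholds (-12%, -3%, +3%, +12%) are taken from the personal terrain pace profile later in these notes; applying the same thresholds to the feature set, the handling of exact boundary values, and all names here are assumptions.

```python
BUCKETS = ("steep_downhill", "downhill", "flat", "uphill", "steep_uphill")


def slope_bucket(grade_pct):
    """Map a segment grade in percent to one of five terrain buckets."""
    if grade_pct < -12:
        return "steep_downhill"
    if grade_pct < -3:
        return "downhill"
    if grade_pct <= 3:
        return "flat"
    if grade_pct <= 12:
        return "uphill"
    return "steep_uphill"


def bucket_fractions(segments):
    """segments: iterable of (horizontal_distance_m, elevation_change_m) pairs."""
    totals = dict.fromkeys(BUCKETS, 0.0)
    for dist_m, climb_m in segments:
        if dist_m <= 0:
            continue  # skip zero-length or malformed segments
        totals[slope_bucket(100.0 * climb_m / dist_m)] += dist_m
    total = sum(totals.values())
    if total == 0:
        return totals
    return {name: dist / total for name, dist in totals.items()}


print(bucket_fractions([(100, 0), (100, 5), (100, 20), (100, -20)]))
```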

The final holdout was scored once after model selection, so it was not used to choose the model.

Model | Mean | Median | 80th percentile
Tuned Naismith | 10.3% | 6% | 14.7%
Recent-history adjustment | 10.3% | 4.7% | 12.4%
Adaptive route family | 9.5% | 5% | 14.2%
Similar route and slope | 9.8% | 5.2% | 12.7%

What the rolling test means

Rolling tests only use earlier hikes to predict later hikes. Mean is the average error across all test hikes. Median is the middle value, so it is less skewed by outliers. The 80th percentile means 80% of hikes fell within that error.
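The three summary numbers can be computed roughly as follows. Errors are absolute and expressed as a percent of actual moving time; the nearest-rank percentile method and the function names are assumptions, and other percentile conventions would give slightly different values on small samples.

```python
import math
from statistics import mean, median


def percent_errors(estimates, actuals):
    """Absolute error as a percent of actual moving time, one value per hike."""
    return [100.0 * abs(e - a) / a for e, a in zip(estimates, actuals)]


def percentile_nearest_rank(values, pct):
    """Smallest value such that at least pct% of values are at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize(estimates, actuals):
    errs = percent_errors(estimates, actuals)
    return {
        "mean": mean(errs),
        "median": median(errs),
        "p80": percentile_nearest_rank(errs, 80),
    }


# Illustrative numbers only: one large miss pulls the mean above the median.
print(summarize([110, 95, 100, 130, 100], [100, 100, 100, 100, 100]))
```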

ML sweep: validation vs final holdout

Lower error is better. Models are ordered by validation error, because the validation slice is where the model choice was made.

[Bar chart: mean absolute error as a percent of actual moving time, validation vs. final holdout, one pair of bars per model. Labeled values:]

Model | Family | Final holdout
Similar-route residual | distance | 7.8%
Kernel ridge residual | kernel | 8.7%
K-nearest residual, k=24 | distance | 8.5%
Recent median residual | personal baseline | 8.1%
Extra trees residual | tree ensemble | 9.9%
Random forest residual | tree ensemble | 10.2%
Tuned Naismith baseline | baseline | 8%
Support-vector residual | kernel | 7.5%
Decision tree, shallow | tree | 11.2%
Small neural net | neural net | 9.3%

What I Tested

I tested a wide range of approaches: linear shrinkage, robust linear models, polynomial interactions, nearest-neighbor models, kernel models, decision trees, random forests, extra trees, boosting, model blends, and a small neural net.

Similar-route matching won the validation slice and barely beat tuned Naismith on the final holdout: 7.8% mean absolute error vs. 8.0%. Promising, but still not enough evidence to fully move away from the simpler formula.

Tree and boosting models found patterns, but several performed better on validation than on the final holdout. This is the classic small-data problem: the models learned details that did not repeat on new hikes.

Family | Tried | Best validation model | Validation | Final holdout
baseline | 1 | Tuned Naismith baseline | 6.1% | 8%
distance | 3 | Similar-route residual | 4.9% | 7.8%
kernel | 2 | Kernel ridge residual | 5.3% | 8.7%
tree ensemble | 3 | Extra trees residual | 5.7% | 9.9%
linear | 5 | Lasso residual | 8.6% | 10.1%
boosting | 3 | AdaBoost tree residual | 7% | 9.4%
ensemble blend | 2 | Voting ensemble residual | 6.4% | 9.4%
neural net | 1 | Small neural-net residual | 11.5% | 9.3%

Other Hiking Formulas I Compared

I compared a few well-known hiking formulas to see if any of them outperformed tuned Naismith on future-style data. None did.

Tuned Naismith

The current baseline uses about 5.85 km/h on flat distance plus 16.5 minutes per 100 m of climb. This remained the reference model every ML result had to beat.

Tobler-style slope curve

Tobler changes speed continuously with grade. It is useful for thinking about segments, but the original curve did not beat the tuned baseline on this dataset.

Langmuir-style downhill correction

Langmuir gives different treatment to gentle and steep downhill. The tuned Langmuir-style pass was useful as a comparison, but it did not become the selected model.

Tranter-style fatigue correction

Tranter frames time through fitness and fatigue. One fixed fitness row was competitive, but it was not strong enough to become a general rule.

Segment test | Family | Validation | Final holdout | Final interpretation
Whole-route similar-route residual | whole-route ML | 4.9% | 7.8% | Best validation result. Kept as the main advanced direction.
Recent median residual | personal baseline | 5.7% | 8.1% | Simple and competitive.
Segment k-nearest residual | segment ML | 5.7% | 8% | Richer route profile features almost matched whole-route similarity.
Segment-integrated tuned Naismith | fixed segment formula | 5.7% | 8.7% | Summed the tuned estimate over route segments.
Langmuir downhill on tuned Naismith | fixed segment formula | 6.1% | 8.7% | A useful comparison, but it did not beat the selected route-matching approach.
Tranter correction, fitness 30 | fixed endurance formula | 6.6% | 8.8% | Best fixed Tranter-style table row in this pass.
Segment support-vector residual | segment ML | 6.7% | 7.8% | Good final score, but the model choice was made before this final holdout score.
Personal Langmuir-style downhill | personal formula | 8.5% | 11.2% | Tested a personal downhill adjustment; it performed worse on the final holdout.
Tobler original slope function | fixed segment formula | 13.4% | 12.3% | Classic slope-speed curve without personal scaling.
Personal grade-bucket paces | personal formula | 23.1% | 12.4% | Useful for terrain insight, but too unstable as the main estimator.

Personal terrain pace profile

This shows how pace differed by slope bucket in my history. It is useful terrain information, but when rolled back into a full-route estimate it was less accurate than the tuned Naismith baseline.

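[Bar chart: observed pace vs. tuned estimate by slope bucket, in minutes per mile.]

Slope bucket | Grade | Distance | Observed pace | Tuned estimate
Steep downhill | steeper than -12% | 56.2 mi | 35.9 min/mi | 16.5 min/mi
Downhill | -12% to -3% | 129.5 mi | 26.8 min/mi | 16.5 min/mi
Flat or rolling | -3% to +3% | 412.5 mi | 21.3 min/mi | 17.8 min/mi
Uphill | +3% to +12% | 132.2 mi | 27.0 min/mi | 34.0 min/mi
Steep uphill | steeper than +12% | 57.3 mi | 34.3 min/mi | 66.3 min/mi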
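Rolling the bucket paces back into a full-route estimate can be sketched as a distance-weighted sum. The paces are the observed values from the profile above; the function name and the dictionary keys are hypothetical, and this is only one plausible way to do the rollup.

```python
# Observed pace in minutes per mile for each slope bucket (from the profile above).
OBSERVED_PACE_MIN_PER_MI = {
    "steep_downhill": 35.9,
    "downhill": 26.8,
    "flat": 21.3,
    "uphill": 27.0,
    "steep_uphill": 34.3,
}


def bucket_pace_estimate_minutes(miles_by_bucket, paces=OBSERVED_PACE_MIN_PER_MI):
    """Sum pace times distance over the slope buckets of a planned route."""
    unknown = set(miles_by_bucket) - set(paces)
    if unknown:
        raise ValueError(f"unknown slope buckets: {sorted(unknown)}")
    return sum(paces[bucket] * miles for bucket, miles in miles_by_bucket.items())


# Example: 2 flat miles plus 1 uphill mile.
print(round(bucket_pace_estimate_minutes({"flat": 2.0, "uphill": 1.0}), 1))
```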

What The Segment Work Added

The segment pass split GPX routes into many small pieces and tested whether route profile information could beat the whole-route residual model.

On my data set, I was slower on downhill terrain and faster on uphill terrain than the tuned Naismith baseline expects. That makes the terrain profile worth showing, even though it did not perform well enough as the main full-route estimate.

The production lesson is conservative. Segment features are good for showing how you handle steep downhill, downhill, flat, uphill, and steep uphill terrain. They are not yet strong enough to replace whole-route similar-hike matching.

Largest Final-Holdout Misses

The biggest misses on the final holdout mostly came from one-off hiking context the route-only model could not know before the hike: morel hunting, off-trail travel, steep off-trail terrain, and log downfall.

Date | Route | Mi | Gain | Error | Known context
2026-05-31 | Stanley Lake to Bridalveil Falls Trail | 9.5 | 1,007 ft | -72.7 min | Morel hunting and slow searching.
2026-05-03 | Afternoon hike | 5.4 | 863 ft | -50 min | Off-trail hike.
2026-05-25 | Mule Creek Trail | 6.4 | 449 ft | -46.1 min | Lots of off-trail travel and steep terrain.
2026-05-17 | Casino Creek Loop | 8.6 | 1,535 ft | +21.8 min | Normal hike, but with log downfall.
2026-04-12 | Afternoon hike | 7.0 | 666 ft | -18.3 min | No unusual context remembered.

How many uploads help?

More useful uploads give the calculator more to work with. Lower lines mean lower error on the same later test hikes.

[Line chart: mean absolute error (%), roughly 8% to 13%, vs. usable uploaded past hikes (0 to 50), with lines for Adaptive, Similar routes, and Baseline.]

How Many Uploads Are Enough?

The calculator should work with no history, start learning from a handful of timestamped activities, and become more useful around 10 to 20 good uploads.

One or two uploads are useful for reading and summarizing. Five to nine can start learning pace and breaks. Around 20 or more can support a higher-confidence personal profile.

More data did not automatically solve everything. The model improved when it used recent and route-relevant history, not when it averaged all history into one correction.

The current product tiers use that lesson directly. Pace + breaks starts at 5 usable hikes, terrain patterns start at 10, a higher-confidence personal profile starts at 20, and similar-hike matching starts at 50.
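The tier thresholds above can be expressed as a small lookup. The thresholds and tier names come from the paragraph above; the function name and data layout are hypothetical.

```python
# (minimum usable hikes, feature unlocked at that count)
TIERS = [
    (5, "pace + breaks"),
    (10, "terrain patterns"),
    (20, "higher-confidence personal profile"),
    (50, "similar-hike matching"),
]


def unlocked_features(usable_hikes):
    """Return every personal feature available for a given number of usable hikes."""
    return [name for threshold, name in sorted(TIERS) if usable_hikes >= threshold]


print(unlocked_features(12))
```

With no usable hikes the list is empty, which corresponds to the calculator falling back to the default estimate.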

What Similar-Hike Matching Tests

Manually entered route matching

For manually entered routes, similar-hike matching compares the main hiking estimate with past routes that look similar by distance, climbing, downhill, and route size. It asks: when past routes looked similar in those dimensions, did they usually take more or less time than the Naismith baseline?
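One way to sketch this kind of matching is a nearest-neighbor lookup over past routes, returning the median residual of the closest matches. The choice of features (distance, gain, descent), the range-based scaling, the value of `k`, and all names below are assumptions; this is not necessarily how the calculator implements it.

```python
import math
from statistics import median


def similar_route_correction(planned, history, k=5):
    """planned: (distance_km, gain_m, descent_m).
    history: list of (features, residual_min) pairs from past hikes, where
    residual_min is actual moving time minus the baseline estimate."""
    if not history:
        return 0.0
    dims = range(len(planned))
    # Scale each dimension by its spread in the history so km and m are comparable.
    scales = []
    for d in dims:
        values = [features[d] for features, _ in history]
        spread = max(values) - min(values)
        scales.append(spread if spread > 0 else 1.0)

    def distance(features):
        return math.sqrt(sum(((features[d] - planned[d]) / scales[d]) ** 2 for d in dims))

    nearest = sorted(history, key=lambda item: distance(item[0]))[:k]
    return median([residual for _, residual in nearest])


# Illustrative history: two long climbs that ran slow, two short walks that ran fast.
history = [
    ((10.0, 500.0, 500.0), 8.0),
    ((11.0, 550.0, 520.0), 12.0),
    ((3.0, 50.0, 50.0), -4.0),
    ((2.5, 40.0, 40.0), -6.0),
]
print(similar_route_correction((10.5, 520.0, 510.0), history, k=2))
```

A positive result means similar past routes usually took longer than the baseline estimate; a negative result means they usually took less time.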

Route profile matching

For GPX routes, similar-hike matching can also compare route profile. It uses slope buckets and turn density to test whether the planned route resembles past routes in a more detailed way.

The main hiking estimate remains first because it is easier to explain and has fewer ways to overfit. Similar-hike matching is useful learning material: it shows what nearby past routes suggest, how many similar routes it found, and whether terrain buckets add a meaningful route-specific correction.

Stopped time grows on longer hikes

This is why the calculator keeps time hiking and stopped time separate.

Under 1 hour

0.0 min/hr

106 activities, middle range 0.0-2.0 min/hr.

1-2 hours

3.1 min/hr

92 activities, middle range 1.1-8.2 min/hr.

2-4 hours

18.3 min/hr

48 activities, middle range 14.0-28.0 min/hr.

4+ hours

25.8 min/hr

11 activities, middle range 23.2-28.7 min/hr.

Stopped Time Grows On Longer Hikes

I wanted to handle breaks because stopped time is a real part of planning a day on trail. The data backed that up. On short hikes, stopped time was often near zero. On longer hikes, it became a major part of the day.

That is why the calculator estimates time hiking and stopped time separately. And it is why the result is shown as a range. A break estimate is useful, but it is not a promise.
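A break estimate shown as a range could be sketched from the per-bucket rates above. This assumes the buckets are keyed on moving time, that the headline rate is a typical (median-like) value, and that the "middle range" gives the low and high ends; the names are hypothetical.

```python
# (upper bound of bucket in hours, typical, low, high), all rates in stopped min per hour.
STOPPED_MIN_PER_HR = [
    (1.0, 0.0, 0.0, 2.0),
    (2.0, 3.1, 1.1, 8.2),
    (4.0, 18.3, 14.0, 28.0),
    (float("inf"), 25.8, 23.2, 28.7),
]


def stopped_time_range(moving_hours):
    """Return (low, typical, high) stopped minutes for a planned moving time."""
    if moving_hours < 0:
        raise ValueError("moving_hours must be non-negative")
    for upper, typical, low, high in STOPPED_MIN_PER_HR:
        if moving_hours < upper:
            return tuple(round(rate * moving_hours, 1) for rate in (low, typical, high))


print(stopped_time_range(3.0))
```

Because the rates jump between buckets, a real implementation would probably smooth across the boundaries rather than switch abruptly at 1, 2, and 4 hours.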

Under 1 hour: median moving-time residual -0.1 min; 80th-percentile absolute residual 5.0 min.
1-2 hours: median moving-time residual 4.6 min; 80th-percentile absolute residual 8.9 min.
2-4 hours: median moving-time residual 11.0 min; 80th-percentile absolute residual 51.0 min.
4+ hours: median moving-time residual 54.7 min; 80th-percentile absolute residual 77.8 min.

A sample route and elevation profile from the data set

13.8 mi | 4,144 ft up | 8 hrs, 11 mins hiking | 11 hrs, 42 mins outside

[Plot: route profile from GPX points, normalized 0-100 on both axes (normalized route x position vs. normalized route y position), with the start / finish marked.]

The route is normalized before plotting, so this shows route geometry and turns without showing raw coordinates or a precise public location.
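The normalization can be sketched as shifting the route to the origin and scaling its longer side to 100, which keeps the shape but discards the real location. Treating the input points as planar x/y pairs is a simplifying assumption (real GPX longitude and latitude would need projecting first), and the function name is hypothetical.

```python
def normalize_route(points, size=100.0):
    """Shift and scale (x, y) points into a 0..size box, preserving aspect ratio."""
    if not points:
        return []
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    # Use the larger of the two spans so the route is not stretched.
    span = max(max(xs) - min_x, max(ys) - min_y)
    if span == 0:
        return [(0.0, 0.0) for _ in points]
    scale = size / span
    return [((x - min_x) * scale, (y - min_y) * scale) for x, y in points]


print(normalize_route([(10, 20), (12, 20), (12, 21)]))
```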

[Plot: elevation profile, elevation (ft) vs. distance along route (%), 4,144 ft climbing, high point 10,519 ft.]

A planned route GPX lets the calculator measure downhill, uphill, flat distance, steep sections, turn density, point spacing, and whether the file is detailed enough to trust for route profile matching.

What The Data Supports

The tuned Naismith formula is a strong default estimate.

Downloaded file timestamps are accurate enough to validate time hiking, stopped time, and total time outside.

Personal uploads are valuable, especially for break behavior, route matching, and confidence.

A larger dataset in the future could test whether route type, season, terrain, pack weight, and fitness trends should change the estimate.

What The Model Does Not Cover Yet

A precise pack-weight penalty has not been validated yet. The backpacking notes are promising but too sparse for a general rule.

One person's personal correction does not transfer automatically to other hikers. That needs shared data from more people.

The model does not fully account for weather, snow, trail obstruction, altitude, injury, sleep, group pace, or route-finding problems.

A lower error in one test does not mean a model should become the default. The tool needs to be stable and explainable, not just show a lower number in one table.

Where This Goes Next

Better upload experience

Help users understand what the calculator can learn from their files, which ones were usable, and why any were ignored.

Crowdsource a larger dataset

I would like to build a larger opt-in dataset to verify which findings from my own history hold for other hikers and to improve accuracy. That system is not set up yet.

Keep similar-hike matching secondary

Keep similar-hike matching secondary until it beats the simpler baseline by a clearer margin across more people.

References

Local analysis sources include the AllTrails validation report, calibration prototype, route-family calibration report, residual model report, and upload requirement analysis from the hiking-time-estimate workspace.