Creating a Sample of Marathon Race Results for Age Analysis
The first step in analyzing the age grading system and its alternatives
How do you effectively compare the race results of two athletes in different age groups?
That’s the question I’m tackling in a series that I’ll be working on over the next few weeks. Currently, the most widely adopted method is to use the age grading system maintained by World Masters Athletics.
While this system has its uses, it’s not without its flaws. See the original article in the series for some of the potential issues.
The first step in developing an alternative to the current age grading system is to collect a suitable sample of race results to use as the basis for analysis.
And that’s what I want to talk about today. Which races should we look at, and once I’ve collected the data, what does the sample include?
Identifying the Races to Collect Results For
I already have some of the relevant data, and I’m familiar with what’s out there. In the fall, I wrote a ten-part series that analyzed twenty years of race results to see if American runners were getting faster or slower.
It turns out, if you compare within age and gender groups, times have generally been improving.
That analysis focused mainly on the fastest runners, so while the sample I used was generally representative of all American runners, it may have been weighted a little toward faster runners.
For this analysis, I want to make sure that the sample is broadly representative of the running population as a whole. That means fast, slow, old, young, men, and women. Basically everybody and anybody who runs a marathon in a given year.
Before we get into the actual data, here are a few of the questions I’ve thought about.
What Time Period Should We Cover?
My previous work has shown that times among American marathoners have been improving since 2000, so I’m hesitant to use an overly broad time period. Older results may not be indicative of results today.
At the same time, COVID threw a wrench in the running industry. Results from 2020–2022 are hardly representative of normal years. 2023 is the closest thing to a normal year post-COVID, and even then participation hasn’t fully rebounded.
For now, I’m going to focus on 2010 to 2019. This is ten years, so it’s broad enough to smooth out any outliers. It also limits the length of time over which times may have changed.
I think it’s a good compromise for now to show a baseline — and I’ll return later to see how the data for 2023 compares.
Can We Narrow It Down a Little Further?
Using 2010 to 2019 as a starting point leaves a pretty big population of runners to analyze — 6,649 total races with 5,111,548 total finishes.
We could randomly sample races from this list, but ensuring that the sample is representative gets a little complicated when you consider the huge disparities in size between races.
Another option is to limit the analysis to a specific part of the year. Fall is a popular racing season, so if we target all marathons in September, October, and November, we can get a large sample that should still be pretty representative.
This also narrows things down to one season and reduces the likelihood of an individual runner competing multiple times in a year in different marathons. Although serious runners may compete in 2–3 races per year, they’re typically spread out across the seasons.
This narrows things down considerably — to 2,455 races and 2,340,518 finishes.
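The filter described above (2010 to 2019, fall months only) can be sketched in a few lines. This is only an illustration: the race records, the tuple layout, and the `in_sample` helper are hypothetical and not taken from the actual dataset.

```python
from datetime import date

# Hypothetical race records: (name, race date, number of finishers).
# These are made-up placeholders, not real results.
races = [
    ("Race A", date(2015, 10, 11), 37000),
    ("Race B", date(2015, 4, 20), 26000),
    ("Race C", date(2018, 11, 4), 150),
    ("Race D", date(2021, 10, 10), 900),
]

# September, October, and November, as described in the text.
FALL_MONTHS = {9, 10, 11}


def in_sample(race_date):
    """Return True if the race falls in 2010-2019 and in a fall month."""
    return 2010 <= race_date.year <= 2019 and race_date.month in FALL_MONTHS


# Keep only the races that pass both the year and the season filter.
fall_races = [race for race in races if in_sample(race[1])]
total_finishes = sum(race[2] for race in fall_races)

print(len(fall_races), "races,", total_finishes, "finishes")
```

With the placeholder data, the spring race and the 2021 race drop out, and only the two fall races from the 2010s remain.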
What About Eliminating Smaller Races?
While that’s a smaller sample, it’s still a pretty large number of races. And I’ll bet that the majority of them only have 100 or 200 runners.
Eliminating those smaller races could drastically reduce the total number of races, and with it the complexity of sourcing the data. It would do so without drastically reducing the number of runners, which is what would hurt the quality of the dataset.
If you set the bar at 100 finishers, that brings the number of races down to 1,549 and the number of finishers to 2,295,701. That’s hardly any difference in the number of finishers (about 2%), but it’s still a pretty big number of races.
If you set the bar at 500 finishers, that brings the number of races down to 543 and the number of finishers to 2,061,500. Relative to the full fall sample, that’s about a 78% reduction in the number of races while only reducing the number of finishers by about 12%.
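If you set the bar at 1,000 finishers, you bring down the number of races to 324 and the number of finishers to 1,904,094. Relative to the full fall sample, that’s about an 87% reduction in races and about a 19% reduction in finishers.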
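The trade-off between fewer races and fewer finishers can be checked with some quick arithmetic. The sketch below uses only the counts quoted above and measures each threshold against the full fall sample (2,455 races and 2,340,518 finishes). The variable and function names are just for illustration.

```python
# Counts quoted in the text for fall marathons, 2010-2019.
baseline_races, baseline_finishers = 2455, 2340518

# Minimum finishers per race -> (races remaining, finishers remaining).
thresholds = {
    100: (1549, 2295701),
    500: (543, 2061500),
    1000: (324, 1904094),
}


def pct_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return 100 * (before - after) / before


summary = {}
for min_finishers, (n_races, n_finishers) in thresholds.items():
    race_cut = round(pct_reduction(baseline_races, n_races), 1)
    finisher_cut = round(pct_reduction(baseline_finishers, n_finishers), 1)
    summary[min_finishers] = (race_cut, finisher_cut)
    print(
        f"min {min_finishers:>5} finishers: "
        f"{race_cut}% fewer races, {finisher_cut}% fewer finishers"
    )
```

Measuring every threshold against the same baseline keeps the three options directly comparable.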
