Creating a Sample of Marathon Race Results for Age Analysis

The first step in analyzing the age grading system and its alternatives

How do you effectively compare the race results of two athletes in different age groups?

That’s the question I’m tackling in a series that I’ll be working on over the next few weeks. Currently, the most widely adopted method is to use the age grading system maintained by World Masters Athletics.

Related article: How Do You Compare Race Results Between Age Groups? The promise of and problems with the age grading system (medium.com)

While this system has its uses, it’s not without its flaws. See the original article in the series for some of the potential issues.

The first step in developing an alternative to the current age grading system is to collect a suitable sample of race results to use as the basis for analysis.

And that’s what I want to talk about today. Which races should we look at, and once I’ve collected the data, what does the sample include?

Identifying the Races to Collect Results For

I already have some of the relevant data, and I’m familiar with what’s out there. In the fall, I wrote a ten-part series that analyzed twenty years of race results to see if American runners were getting faster or slower.

It turns out, if you compare within age and gender groups, times have generally been improving.

Related article: A Deep Dive Into Marathon Data (medium.com). Recently, I saw a post on Reddit that sparked some great conversation. Have marathoners really gotten slower over the…

My main focus in that analysis was the fastest runners. So while the sample I used was generally representative of all American runners, it may have been weighted a little more heavily in favor of faster runners.

For this analysis, I want to make sure that the sample is broadly representative of the running population as a whole. That means fast, slow, old, young, men, and women. Basically everybody and anybody who runs a marathon in a given year.

Before we get into the actual data, here are a few of the questions I’ve thought about.

What Time Period Should We Cover?

My previous work has shown that times among American marathoners have been improving since 2000, so I’m hesitant to use a time period that is overly broad. Results from the early part of a long window may not be indicative of how runners perform today.

At the same time, COVID threw a wrench in the running industry. Results from 2020–2022 are hardly representative of normal years. 2023 is the closest thing to a normal year post-COVID, and participation still hasn’t fully rebounded.

For now, I’m going to focus on 2010 to 2019. This is ten years, so it’s broad enough to smooth out any outlier years. It also limits the length of time over which times may have changed.

I think it’s a good compromise for now to show a baseline — and I’ll return later to see how the data for 2023 compares.

Can We Narrow It Down a Little Further?

Using 2010 to 2019 as a starting point leaves a pretty big population of runners to analyze — 6,649 total races with 5,111,548 total finishes.

We could randomly sample races from this list, but ensuring that the sample is representative gets a little complicated when you consider the huge disparities in size between races.

Another option is to limit the analysis to a specific part of the year. Fall is a popular racing season, so if we target all marathons in September, October, and November, we can get a large sample that should still be pretty representative.

This also narrows things down to one season and reduces the likelihood of an individual runner competing multiple times in a year in different marathons. Although serious runners may compete in 2–3 races per year, they’re typically spread out across the seasons.

This narrows things down considerably — to 2,455 races and 2,340,518 finishes.

What About Eliminating Smaller Races?

While that’s a smaller sample, it’s still a pretty large number of races. And I’ll bet that the majority of them only have 100 or 200 runners.

Eliminating those smaller races could drastically reduce the total number of races, and with it the complexity of sourcing the data. Because small races contribute so few runners, it should do so without drastically reducing the number of runners or the quality of the dataset.

If you set the bar at 100 finishers, that brings the number of races down to 1,549 and the number of finishers to 2,295,701. That’s hardly any difference in the number of finishers (about 2%), but it’s still a pretty big number of races.

If you set the bar at 500 finishers, that brings the number of races down to 543 and the number of finishers to 2,061,500. That’s about a 78% reduction in the number of races while only reducing the number of finishers by about 12%.

If you set the bar at 1,000 finishers, you bring down the number of races to 324 and the number of finishers to 1,904,094. That’s about an 87% reduction in races and about a 19% reduction in finishers.

While I think either 500 or 1,000 would be a fine cut-off point, I’m going to choose 500. It significantly shrinks the number of races while keeping most of the finishers in the dataset.
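To make the trade-off between cut-offs concrete, here is a minimal sketch that recomputes the percentage reductions from the counts quoted above. The baseline is the fall sample (2,455 races, 2,340,518 finishes). The function and variable names are my own for illustration and are not part of any data source.

```python
# Counts quoted in the article for fall (Sep-Nov) marathons, 2010-2019.
FALL_RACES = 2_455
FALL_FINISHERS = 2_340_518

# Minimum-finisher cut-off -> (races remaining, finishers remaining),
# copied from the figures in the text.
CUTOFFS = {
    100: (1_549, 2_295_701),
    500: (543, 2_061_500),
    1_000: (324, 1_904_094),
}


def pct_reduction(before, after):
    """Percentage drop going from `before` to `after`."""
    return 100 * (before - after) / before


def summarize():
    """Map each cut-off to (race reduction %, finisher reduction %)."""
    return {
        cutoff: (
            round(pct_reduction(FALL_RACES, races), 1),
            round(pct_reduction(FALL_FINISHERS, finishers), 1),
        )
        for cutoff, (races, finishers) in CUTOFFS.items()
    }


if __name__ == "__main__":
    for cutoff, (race_drop, finisher_drop) in summarize().items():
        print(f"{cutoff:>5}+ finishers: {race_drop:.1f}% fewer races, "
              f"{finisher_drop:.1f}% fewer finishers")
```

For the 500 cut-off this works out to roughly 77.9% fewer races against 11.9% fewer finishers. The 1,000 cut-off removes 86.8% of races but also 18.6% of finishers.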

Should We Include the World Marathon Majors?

Finally, any time you talk about the distribution of finish times, people are quick to point out that some races are different from others.

Boston, for example, is certainly not representative of the population at large. There are strict qualifying standards, and you have to be pretty good to make it in.

Likewise, New York and Chicago have qualifying times that allow fast runners to get in — while everyone else has to get lucky with a lottery or raise money for a charity.

If you were looking at only a handful of races, this would be problematic. You can certainly make a good argument that the population of runners who run one of these races could differ significantly from the overall population of all marathon runners.

But in our case, we’re looking at all marathons (with 500 or more finishers) in a given time period. So for every fast race, there are slow races to balance things out.

If you were to eliminate some races because they’re considered fast, you’d end up tilting the dataset in the other direction. A runner who gets into a World Major by qualifying is probably running it instead of some other race, so dropping the Major drops that runner from the sample entirely.

If Chicago and New York converted 100% to the lottery system, there would be a lot of fast runners who would get redistributed to other races. They’re still part of the population, and they can’t be ignored just because they’re running in a Major.

So What Will the Final Sample Include?

Based on thinking through these questions, here are the parameters that I’ve set for inclusion in the dataset:

All marathons that:

Were held in the United States
Occurred in September, October, or November
Took place from 2010 to 2019
Had 500 or more finishers
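The four criteria can be written as a single filter. This is only a sketch of how I might express them in Python. The `Race` record and its field names are hypothetical (the real listing may be structured differently), and the example rows are made up to show which races pass.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Race:
    """One row of a race listing. Field names are hypothetical."""
    name: str
    country: str       # assumed country code, e.g. "US"
    race_date: date
    finishers: int


FALL_MONTHS = {9, 10, 11}            # September, October, November
FIRST_YEAR, LAST_YEAR = 2010, 2019   # inclusive range
MIN_FINISHERS = 500                  # "500 or more finishers"


def in_sample(race):
    """True if the race meets all four inclusion criteria."""
    return (
        race.country == "US"
        and race.race_date.month in FALL_MONTHS
        and FIRST_YEAR <= race.race_date.year <= LAST_YEAR
        and race.finishers >= MIN_FINISHERS
    )


# Made-up rows purely for illustration; these are not real results.
races = [
    Race("Example Fall Marathon", "US", date(2015, 10, 11), 1_200),
    Race("Example Spring Marathon", "US", date(2015, 4, 20), 25_000),
    Race("Example Small Marathon", "US", date(2012, 11, 4), 180),
    Race("Example Recent Marathon", "US", date(2023, 10, 8), 9_000),
]

sample = [r for r in races if in_sample(r)]
```

Only the first made-up race survives the filter. The others fail on season, size, and year respectively.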

The source I’m using to identify these races is Marathon Guide. It is the most complete list of marathons in the United States, and it includes the date and number of finishers for each race.

The visual above summarizes the total number of finishers and races, per year, that are included in the sample.

Now that I have identified a sample set to use, it’ll take some time to collect the required data from a variety of sources.

Once I’ve collected and cleaned the data, I’ll return with the next part in this series. Before we can get into the statistical analysis, we’ll do some data exploration to see just who is included in this dataset.

You can expect that article to be published early next week — and you can subscribe to email updates if you want to be sure you don’t miss it.

I’m an avid runner and a data nerd. I’m turning 40 this year, so age grading is of particular interest to me. Follow me on Medium for more data-informed stories about running, and check out my blog, Running with Rock, for tips on marathon training. You’ll also find me on Strava.