Brian Rock

Summary

The provided content details an extensive analysis of American marathon data from 2000 to 2021, exploring trends in race numbers, participant demographics, and performance metrics to understand changes in the sport over two decades.

Abstract

The author embarked on a comprehensive exploration of marathon trends in the United States, prompted by an article suggesting American marathoners are slowing down. Over a period of weeks, the author collected and analyzed data from over 11,000 races, using Python and Selenium for scraping and Plotly for visualization. The study revealed a rapid increase in the number of marathons, particularly small and slow races, and a significant rise in female participation. Despite an overall growth in the number of races and finishers, the data indicated a decline in the number of large races and a stagnation in the total number of finishers post-2014, with a notable drop in 2020 due to the COVID-19 pandemic. The author then selected a representative sample of races to analyze deeper trends, aiming to determine whether runners are getting faster or slower and to explore age-related changes and finish times across different race categories.

Opinions

  • The author initially doubted the claim that both average and elite runners were getting slower, given advancements in training, nutrition, and footwear.
  • The author believes that the inclusion of small, multi-day trail races in other datasets may not influence larger trends but is a potential limitation of the data.
  • There is an expectation that the number of races and participants would increase over time, but the actual growth, especially in small and slow races, exceeded the author's expectations.
  • The author suggests that larger races tend to attract faster runners, while smaller races are more likely to have slower winners.
  • The growth in the number of marathons is not necessarily indicative of the overall quality of the fields, as evidenced by the increase in slow races.
  • The author notes that the increase in female participation in marathons has far outpaced that of male participants, leading to a more balanced gender distribution in the sport.
  • The decision to exclude races with fewer than 100 finishers from the final analysis was based on the assumption that these races have little impact on the larger findings and exhibit greater variability in finish times.
  • The author's choice to focus on race series that span the majority of the time period reflects a preference for understanding changes within consistent race samples rather than the impact of new races.
  • The author expresses a personal interest in running, mentioning their anticipation for the Chicago Marathon registration and inviting readers to follow their running blog and Strava profile.

Tracking the Growth in Marathons from 2000 to 2021

And identifying a random sample of races for additional, deeper analysis

For the past few weeks, I’ve been exploring data from American marathons to see how the sport has changed in the last two decades.

This analysis was inspired by an article I read which claimed that American marathoners were getting slower. On the one hand, it seemed plausible that the average runner was slower today than twenty years ago, if more runners were getting involved — changing the composition of the overall population of runners.

But it seemed a bit counterintuitive that the fastest runners should also be getting slower — given advances in training, nutrition, and footwear. And yet the article claimed that the fastest runners were getting slower as well.

I couldn’t find enough readily available data to challenge that conclusion, so I went about the task of collecting the data myself.

Note: This article is part of a larger series that is behind Medium’s paywall. If you’re not a member of Medium, you can use this form to get access to this series. I’ll e-mail you a special link to one article each week.

If you want to read more context, check out the original article in the series here. It also includes links to the previous articles, where you can find detailed analysis of the results from select races.

Today, we’re going to step back from individual marathons and take a look at the big picture.

Specifically:

  • How has the number of races changed since 2000?
  • How has the number of finishers changed since 2000?
  • What is a reasonable subset of races to look at to answer the original question — whether runners are getting faster or slower?
Photo by Roman Biernacki on Pexels

A Note About Data Sources and Methodology

The primary source of data for this analysis is Marathon Guide.

This website includes a full database of race results, from 2000 to 2022, for marathons large and small. It is the closest thing to a comprehensive dataset that exists for marathon race results. I did not include 2023 because the year is not yet finished.

I used Python and Selenium to scrape the relevant data. First, I scraped the results list page for each year (2000 to 2022) to gather the name of each race run in that year and the URL of its results. Then, I went back to each URL to scrape the summary statistics for that race.

The site is primarily focused on the United States, but it does include some international marathons. If a race was not located in the United States, I removed it from the dataset.

In the end, I had a dataset that included the year, date, race name, total number of finishers, number of male finishers, number of female finishers, time of the male winner, and time of the female winner for each race.

Over the 23-year period, this totaled 11,114 races.

Next, I categorized the races in two ways — by the size of the field and by the speed of the winner.

For size, I chose to sort them into three buckets:

  • Small races: fewer than 500 finishers
  • Medium races: 500–1,999 finishers
  • Large races: 2,000 or more finishers
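As a minimal sketch (not the code actually used for this analysis), the size buckets above could be expressed as a small Python function. The function name `size_bucket` is an assumption made for illustration.

```python
def size_bucket(finishers: int) -> str:
    """Return the article's size category for a race with this many finishers."""
    if finishers < 500:
        return "small"
    if finishers < 2000:
        return "medium"
    return "large"


# Boundary cases: 499 is small, 500 and 1,999 are medium, 2,000 is large.
examples = {n: size_bucket(n) for n in (120, 499, 500, 1999, 2000, 45000)}
print(examples)
```

Checking the boundaries explicitly is worthwhile here, because a race with exactly 500 or exactly 2,000 finishers should land in the higher bucket.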

For speed, I also chose to sort them into three buckets:

  • Fast Races: Male Winner < 2:20 or Female Winner < 2:45
  • Medium Races: Not fast, and Male Winner < 2:45 or Female Winner < 3:10
  • Slow Races: All remaining races (Male Winner ≥ 2:45 and Female Winner ≥ 3:10)
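One way to read the three speed rules is as checks applied in order, with slow races as whatever is left over. The sketch below follows that reading. The helper names, the `H:MM:SS` time format, and the handling of a missing winner (treated as not meeting any threshold) are assumptions, not details taken from the original analysis.

```python
from typing import Optional


def to_seconds(clock: str) -> int:
    """Convert 'H:MM:SS' or 'H:MM' to a number of seconds."""
    parts = [int(p) for p in clock.split(":")]
    while len(parts) < 3:
        parts.append(0)  # allow thresholds written as '2:20'
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds


def speed_bucket(male_winner: Optional[str], female_winner: Optional[str]) -> str:
    """Classify a race as fast, medium, or slow from its winning times."""

    def under(clock: Optional[str], limit: str) -> bool:
        # A missing time (None or empty string) never beats a threshold.
        return bool(clock) and to_seconds(clock) < to_seconds(limit)

    if under(male_winner, "2:20") or under(female_winner, "2:45"):
        return "fast"
    if under(male_winner, "2:45") or under(female_winner, "3:10"):
        return "medium"
    return "slow"


print(speed_bucket("2:18:30", "2:50:00"))  # fast, on the men's time
print(speed_bucket("2:50:00", "3:05:00"))  # medium, on the women's time
print(speed_bucket("2:45:00", "3:10:00"))  # slow, neither time is under a limit
```

Because the checks run in order, a race can never fall into two buckets at once, even when one winner is fast and the other is slow.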

Finally, I used Pandas and Plotly packages in Python to clean, format, and visualize the data as seen throughout this article.

Once I had cleaned the data, I compared my numbers to those collected by the Association of Road Racing Statisticians. The total number of finishers lined up pretty neatly, but in most cases, ARRS reported more total marathons per year.

I’m fairly certain that this is the result of the inclusion of some additional small races, often multi-day trail races, that are not included in the Marathon Guide results list.

I don’t think this influences any of the larger trends that we’ll discuss, but I do want to point out this potential limitation with the data. There’s no guarantee that this represents every marathon run in the United States — but it surely covers a vast majority of them.

A Look at the Number and Types of Races

Let’s start by looking at the number of races run each year, and then the breakdown of those into different sizes and speeds.

This first visual is simply a snapshot of the total number of races run in the United States each year, from 2000 to 2022.

At the beginning of the period, 2000 to 2009, the number of races grows slowly each year. There’s a small increase each year, with the exception of 2007. The number almost doubles over that decade, from 216 to 400.

The rate of growth increases rapidly over the next few years, however. Over the next five years (2010–14), the number of races shoots up to 718. From there, it levels off, reaching a peak of 745 in 2016, and then shrinks slightly through 2019.

There’s an obvious dropoff in 2020 — a result of the COVID-19 pandemic shutting things down after March of that year. The number of races rebounded in the following years, but as of 2022, it had not quite reached the pre-pandemic peak.

Although I expected to see the number of races increase, I’ll be honest — I did not expect there to be that many marathons each year and for it to grow that rapidly.

Categorizing Races by Speed and Size

But are all of these races the same?

The chart above plots the same number of races across the time period, but the color indicates the size of the race. The blue bar is large races (2,000+), the red bar is medium races (500–1,999), and the green bar is small races (<500).

A glaringly obvious trend here is that small races far outnumber large and medium races. In 2000, about two-thirds (140) of all races (216) had fewer than 500 finishers.

As the number of races increased, the lion’s share of this increase came from small races. In the peak year — 2016 — there were 600 small races. The number of small races had more than quadrupled, and they now made up over 80% of the total number of races (745).
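The proportions quoted above can be checked with a few lines of arithmetic. This is only a worked example using the four counts given in the text; the variable names are illustrative.

```python
# Counts quoted in the text.
small_2000, total_2000 = 140, 216
small_2016, total_2016 = 600, 745

share_2000 = small_2000 / total_2000  # share of small races in 2000
share_2016 = small_2016 / total_2016  # share of small races in 2016
growth = small_2016 / small_2000      # how many times more small races

print(f"{share_2000:.1%} {share_2016:.1%} {growth:.2f}x")
# prints: 64.8% 80.5% 4.29x
```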

There is some growth among medium races, as well, even though they make up a small proportion of the total number. They increase from 30 at the beginning to about 100 at the peak, and from 2010–2019 the number hangs out there.

Meanwhile, the number of large races grows somewhat — from 30 to 50 — but then it shrinks back to about 30 at the end of the time period.

And what about the quality of the fields in these races?

The graph above represents the speed of each race, based on the time of the male or female winner. Fast races are won in elite or sub-elite times (<2:20 for men, <2:45 for women), medium races have times that are fast for amateur runners (<2:45 for men, <3:10 for women), and the slow races make up the balance.

There is a small group of races throughout the time period with fast finishes. It varies in the 30s and 40s, increasing somewhat over the two decades. The peak is 56, in 2019.

The largest category in the beginning is medium races. In 2000, they make up over half of all races. Although this number increases modestly, they make up a much smaller proportion of the total (slightly over 1/3) in 2019.

Meanwhile, slow races see the most growth. By 2019, they make up well over half of all races. As the number of races increases rapidly over the two decades, there is also a huge increase in the number of races with slow winning times.

Does the Size of a Race Relate to the Time of the Winner?

Finally, is there an interaction between the size of a marathon’s field and the time in which it is won? I would expect that larger races would attract deeper fields — with faster times — and smaller races would be more likely to have slower winners.

And sure enough, that’s what the graph above shows. For each size of race, the colors on the graph indicate how many of those races had fast, medium, or slow finishes.

Large races make up the smallest group. But of the 833 total races in the large category, well over half (502) of them had fast finishes. Most of the remainder finished in the medium category, and very few of those races (12) had slow finish times.

When you look at the medium-sized races, there are some with fast finishes. But the vast majority of them come in with moderate finishing times. These races are likely not big enough to attract elite fields, but they are still big enough to attract serious amateurs who will finish in pretty respectable times.

And the smallest races often come with slow finish times. There’s a tiny fraction of them with fast times, and these are likely championship races (like the Olympic Qualifiers). About a third have moderate times, but the remaining two-thirds have slow winners.
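A size-by-speed breakdown like this is a cross-tabulation. The sketch below shows one way to build it in plain Python, assuming each race has already been labeled with a size and a speed. The toy list is hypothetical and only shows the shape of the result. The large-race row uses the figures quoted above, with the medium count (319) derived on the assumption that every large race falls into exactly one speed category.

```python
from collections import Counter, defaultdict


def crosstab(races):
    """Count races for each (size, speed) pair; `races` yields (size, speed) tuples."""
    table = defaultdict(Counter)
    for size, speed in races:
        table[size][speed] += 1
    return table


def row_shares(row):
    """Convert one row of counts into fractions of that row's total."""
    total = sum(row.values())
    return {speed: count / total for speed, count in row.items()}


# Hypothetical toy input, not real data.
toy = [
    ("small", "slow"), ("small", "slow"), ("small", "medium"),
    ("medium", "medium"), ("large", "fast"),
]
table = crosstab(toy)
print(dict(table["small"]))

# Large races as quoted in the text: 833 in total, 502 fast, 12 slow.
large_row = Counter(fast=502, medium=833 - 502 - 12, slow=12)
large_shares = row_shares(large_row)
print({speed: f"{share:.1%}" for speed, share in large_shares.items()})
```

On those figures, roughly 60% of large races are fast, about 38% are medium, and under 2% are slow.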

A few general conclusions we can draw from this:

  • The number of races grows rapidly from 2000 to 2015 and then stagnates
  • Much of this growth comes from small races
  • Much of this growth comes from slow races
  • Small races tend to have slower winners

A Look at the Growing Field

From the perspective of races, things seem to be slowing down — because there are many more slow races in 2019 than there were in 2000.

But these races are also small, and therefore likely to make up a tiny fraction of the overall field of marathoners. So what happens if we look at things through that lens — the actual number of runners?

The graph above shows the total number of finishers for each year. It follows a similar pattern to the graph of the number of races, but it’s a little different.

There is an increase in the beginning, but that increase is more gradual. There are already over 300,000 finishers in 2000 and that increases to about 560,000 by 2014. So while the number of races more than tripled, the number of runners didn’t quite double in that same time period.

There’s also a difference on the flip side — from 2014 to 2019. The number of finishers decreases significantly. It’s still much higher in 2019 than it was in 2000, but there are about 90,000 fewer finishers in 2019 than there were in 2014.

The rebound post-COVID is also weaker. This may simply be the effect of reduced field sizes in 2021 and 2022 for some of the larger marathons. But it remains to be seen if the field as a whole will bounce back to its pre-COVID levels.

Finally, there is one outlier in here — 2012. It’s significantly lower than 2011 and 2013. That’s because the New York City Marathon was canceled that year in the aftermath of Superstorm Sandy. New York City is the biggest marathon in the country — and quite often the biggest in the world — so its absence is easy to spot in the data.

In looking at individual races, one of the trends I’ve noticed is that the number of women running marathons has increased greatly since 2000 — far outpacing any growth among men.

How does that look when you zoom out to the full field of all marathon runners?

In 2000, marathon running was a male-dominated sport. It was more balanced than it had been back in the 1970s and 1980s, but in 2000 only 112,000 out of about 300,000 — so just over a third — of finishers were women.

From 2000 to the 2014 peak, there is growth among both groups. But the men only increase from 187,000 to 317,000 — about a 70% increase — while the women increase from 112,000 to 244,000 — a 118% increase.

Female participation in marathons increased more quickly than male participation over this time period, and in 2014 almost 44% of the field was composed of women.
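For readers who want to verify the growth figures, here is a short worked calculation using the rounded finisher counts quoted above. The function and variable names are illustrative.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100


# Rounded finisher counts quoted in the text.
men_2000, men_2014 = 187_000, 317_000
women_2000, women_2014 = 112_000, 244_000

men_growth = pct_change(men_2000, men_2014)        # about 69.5
women_growth = pct_change(women_2000, women_2014)  # about 117.9

women_share_2000 = women_2000 / (men_2000 + women_2000)  # about 0.375
women_share_2014 = women_2014 / (men_2014 + women_2014)  # about 0.435

print(f"men +{men_growth:.1f}%, women +{women_growth:.1f}%")
print(f"women: {women_share_2000:.1%} of finishers in 2000, {women_share_2014:.1%} in 2014")
```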

It’s still not quite even, but it’s much more balanced than it was. However, in 2022, things tilted slightly back towards men (about 59–41). So it remains to be seen what things will look like in the post-pandemic era.

Number of Finishers Across Different Types of Races

Earlier, we characterized races based on the size of the field and the speed of the winner.

How are the number of finishers distributed across these different types of races?

Large races are often much larger than small races. The largest races have tens of thousands of runners in them. So even though the number of races is tilted heavily in favor of small races, that likely won’t be the case for finishers.

The graph above shows the total number of finishers broken down by small races (green), medium races (red), and large races (blue).

Sure enough, although large races made up a tiny portion of the total number of races, they make up the lion’s share of the number of actual runners.

In 2000, 228,000 of the 299,000 total finishers were in large races — over 76%. This number grows with the total number of finishers, and it peaks in 2014 — with 375,000 finishers running in 52 races. The two largest races (New York and Chicago) made up close to 100,000 of those finishers by themselves.

There is some growth in the number of finishers in medium and small races. Both categories increase in terms of the total number of finishers. The number of finishers in medium races doubles, and the number of finishers in small races triples.

But despite this growth, they still make up a small proportion of the overall field.

One interesting trend to note, however, is that after the 2014 peak, the steepest decline took place in large races. The number of finishers in large races dropped by almost 79,000 — about 21%. Meanwhile, the number in small races dropped by about 10,000 (13%) and the number in medium races only dropped by 2,000 (2%).

With the number of large races declining, there is a shift towards medium-sized races. I wonder if this will continue in the years to come.

Finally, here’s the distribution of finishers based on the speed of a race.

Again, the largest group of finishers is in the fastest races. This isn’t all that surprising, because the largest races also tended to be fast races.

But it is interesting that there is relatively little growth among that group. It increases slightly and it varies a bit year to year — probably the result of some races right on the cusp of the fast/medium designation.

But the largest growth in the number of finishers comes from runners in medium races. In 2000, that group is a little over 100,000, about a third of all finishers. By 2014, it has grown to 222,000, close to 40%. This moves in the other direction in the last few years, though.

Despite a large increase percentage-wise, the number of runners in slow races remains a tiny fraction of the overall field throughout the entire period.

Photo by Gustavo Fring on Pexels

So How Do We Pick a Representative Sample of Races to Analyze?

This is the important question — and it’s a tougher question than I expected.

The simplest answer is to look at them all. But, scraping the remaining results — about 11,000 races with close to 6,000,000 finishes — would take quite a while. It took a meaningful amount of time to collect the results from the first six races, and it helped that I spread the task out over the course of a month.

One solution would be to simply take a random sample of all of the races. But there is such variation in the size and speed of the races, and an imbalance in the distribution of those sizes and speeds, that a straight random sample could still easily be skewed.

A better approach would be to stratify that sample in some way. With the races already broken down by size and speed, I could take a random sample of each category of race.
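To illustrate the idea, here is a minimal sketch of stratified random sampling using only the standard library. It is not the code used for this project. The function name, the fixed seed, and the toy list of races are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(items, key, fraction, seed=0):
    """Draw the same fraction from every stratum, with at least one item each."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    sample = []
    for label in sorted(strata):
        group = strata[label]
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical population: 90 small races and 10 large ones.
races = [{"name": f"Small {i}", "size": "small"} for i in range(90)]
races += [{"name": f"Large {i}", "size": "large"} for i in range(10)]

sample = stratified_sample(races, key=lambda r: r["size"], fraction=0.1)
counts = Counter(r["size"] for r in sample)
print(counts)
```

A plain 10% random draw from this list could easily contain no large races at all. Sampling within each stratum guarantees that every category is represented.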

Another question is whether I should randomly select a sample of individual races in each year, or whether I should look for race series that existed over the entire time period (or the majority of it).

The advantage of picking individual races is that I can get a consistent number of races each year. This also would bring in some potential change occurring from the addition of new races.

But I think I’m more interested in whether and how things are changing within a given set of races than in whether new races are changing the overall field. So I’d prefer to identify a set of race series that span the majority of the time period, creating a more consistent sample from year to year.

The Decisions and Thought Process

As I worked through the data and narrowed things down to a sample, here are the decisions I made and the thought process behind them.

First, I decided to narrow the original dataset a little bit. I reduced the data to the years 2000 to 2019. COVID drastically changed things in 2020, and I think it will be a couple years before we can look at the post-COVID data and see a new baseline.

Second, I eliminated races with fewer than 100 finishers. With such small fields, they are unlikely to impact the larger findings. And I assume these small races are subject to much greater variation in finish times. There are still a sufficient number of small races, with 100 to 500 finishers, to work with.

Finally, I looked for races that existed and fit their size and speed criteria for at least ten years in the twenty-year period. This eliminates races with brief histories, and it eliminates potential outliers like the Olympic Trials.
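The three filters can be sketched as a single pass over the race summaries. This is one possible interpretation rather than the actual code. It assumes one row per race per year, that the race name identifies a series consistently across years, and that a series is assigned the (size, speed) category it held most often. The field names and toy rows are hypothetical.

```python
from collections import Counter, defaultdict


def eligible_series(rows, min_years=10, first_year=2000, last_year=2019,
                    min_finishers=100):
    """Map each qualifying series name to its most common (size, speed) category."""
    by_series = defaultdict(Counter)
    for row in rows:
        if not first_year <= row["year"] <= last_year:
            continue  # filter 1: keep only 2000 to 2019
        if row["finishers"] < min_finishers:
            continue  # filter 2: drop races with fewer than 100 finishers
        by_series[row["name"]][(row["size"], row["speed"])] += 1

    result = {}
    for name, categories in by_series.items():
        category, years = categories.most_common(1)[0]
        if years >= min_years:  # filter 3: at least ten years in one category
            result[name] = category
    return result


def make_rows(name, years, finishers, size, speed):
    """Build hypothetical yearly summary rows for one series."""
    return [{"name": name, "year": y, "finishers": finishers,
             "size": size, "speed": speed} for y in years]


rows = (
    make_rows("Race A", range(2000, 2012), 1500, "medium", "medium")  # 12 years
    + make_rows("Race A", range(2012, 2015), 2100, "large", "medium")  # 3 years
    + make_rows("Race B", range(2014, 2020), 300, "small", "slow")     # too short
    + make_rows("Race C", range(2000, 2020), 80, "small", "slow")      # too small
)

eligible = eligible_series(rows)
print(eligible)
```

In this toy example only Race A survives, labeled medium and medium, even though it spent three years in the large category.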

At the end of this, I was left with the following:

  • 21 Large and Fast Races
  • 8 Large and Medium Races
  • 1 Large and Slow Race
  • 5 Medium and Fast Races
  • 48 Medium and Medium Races
  • 50 Small and Medium Races
  • 32 Small and Slow Races

Among the large and fast races, I already had results for six of them. I’ve previously written about Boston, NYC, Chicago, LA, and Philly, and I previously collected the data for the California International Marathon. These represent about a third of large, fast races, and I think that’s enough.

There’s only one large, slow race — the Bataan Memorial Death March in New Mexico — so I decided to include that.

For the remaining groups, I randomly selected races to include.

Due to the small size of the categories, I selected 3/8 of the large, medium races and 3/5 of the medium, fast races.

There was a larger group of medium, medium races, so I only selected a quarter of them — 12/48.

For the small, medium races and the small, slow races, I decided to randomly select 50%. I figured these races were small, so scraping them would not take much time. Their small fields would also have little impact on the larger results. And they likely had the most variation in finish times, so a larger sample would smooth that out.
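Putting the selection fractions together gives the size of the sample. The tally below is a worked example based on the counts and fractions stated above. The totals it produces are derived from those figures, not reported separately.

```python
# (eligible series, series selected) for each category, as described in the text.
strata = {
    ("large", "fast"): (21, 6),      # six races already collected, not a random draw
    ("large", "medium"): (8, 3),
    ("large", "slow"): (1, 1),
    ("medium", "fast"): (5, 3),
    ("medium", "medium"): (48, 12),
    ("small", "medium"): (50, 25),   # 50% of 50
    ("small", "slow"): (32, 16),     # 50% of 32
}

eligible_total = sum(pool for pool, _ in strata.values())
sample_total = sum(picked for _, picked in strata.values())

print(eligible_total, sample_total, f"{sample_total / eligible_total:.0%}")
# prints: 165 66 40%
```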

The Actual Sample of Races

Once I determined the parameters for selecting the sample races, I randomly selected them from the original dataset and saved them in a CSV file. That CSV file, with the full list of races, is embedded below.

Note that each race is identified with a size and speed. This is the size and speed of the race for the majority of its existence. In some cases, races spanned multiple categories.

For example, the Steamtown Marathon has been a medium-sized race with a medium finish time for most of its history. But it did dip into the large category briefly (2013–15), and there were some years in which the time of the female winner qualified the race as fast.

Distribution of Races in the Sample

Because I chose to work with race series, instead of a specific number of races per year, the number of races in each year could vary.

So let’s take a look at how the races are distributed throughout the time period.

In the first six years, there’s a steady increase. From there, the number stabilizes between 63 and 65.

This tells me that a subset of the races — about half — have probably existed for the entire period (2000 to 2019). The remainder of the races were likely launched in the early period (2000 to 2005), and by 2006 there should be a consistent dataset.

That may or may not impact the final analysis — but it’s something to keep in mind in case the early period doesn’t match up with the later period.

Finally, how many finishers do these races represent in total — and how does this change over the time period?

The overall trend looks somewhat similar to the trend for finishers among all the races — it increases from 2000 to 2014 before decreasing slightly. There’s also a dip in 2012 when NYC was canceled.

But the increase is more gradual, and the decrease at the end is more of a stagnation. The biggest increase also occurs in the first few years, when there were fewer races in the sample.

So some of the growth in the sample — as well as the full dataset — is likely due to the addition of new races. But there is some growth between 2006 and 2019 within the races identified in the subset.

So What’s Next?

Now that I’ve identified a sample to work with, the next step is to collect the additional data and perform the analysis.

In total, this sample represents 3,624,744 finishers, but the large races I’ve already collected represent 2,948,544 of them. That leaves 676,200 finishers, so it shouldn’t be terribly time-consuming to collect the additional data.

Once collected, I can look at the changes in the various subgroups and compare that to my findings from the six large races. After that, I can collect everything into one dataset and look at everything together.

Specifically, we’ll be looking at:

  • How does the age of finishers change?
  • How have the finish times changed?
  • Is there evidence that runners are getting faster, slower, or staying about the same?

If that sounds interesting to you, be sure to follow me here on Medium for the final articles in this series. You can also check out the first article in the series to find links to each article I’ve written so far. I’ll update that article with additional links moving forward.

And if there are other questions you’ve got — leave a response. I’m coming to the end of this project, and I’m looking forward to looking at this data in different ways and asking some new questions. And your responses often inspire me as to what to ask.

I’m an avid runner, and I’m looking forward to registering for the Chicago Marathon on Tuesday. I’m also a data nerd, and I analyze running data and write about it here on Medium. For more about my own running story, check out my blog, Running with Rock. You can also follow me on Strava to see what I’m up to.
