Age Grading, Percentiles, and Z-Scores: Three Ways to Compare Race Results

How do you effectively and fairly compare race results between two runners from different age groups?

That’s the question I’ve been exploring in an ongoing series of articles.

The current system — age grading — is useful. But it has some flaws. After collecting and analyzing a truckload of data, I’ve offered two alternatives to age grading — percentiles and z-scores.

Analysis of Marathon Results and Age Grading

Age grading makes it possible to compare race results between different age groups. But is there a better alternative…


Today, I wanted to take a step back and compare these three different methods. In particular, one of my problems with age grading is that it seems to skew more favorably towards some groups than others. Do the alternatives have the same problems?

I’ve also set up a calculator that you can use to input a runner’s age, gender, and time — and get all three scores so that you can compare them yourself.

But first, I wanted to take a minute to recap where we’ve been. If this is the first article you’ve read on the topic, this is for you. If you’ve been following along from the beginning, feel free to skip ahead to the next section.

A Recap of the Previous Articles on Age Grading

This series started with a question — how do you effectively compare race results between different age groups and genders?

In the first article, I offered a look into some of the history behind age grading and a basic critique of the system.

The question is important, and age grading is much better than having nothing. However, I find it a bit problematic to try and compare everyone’s results to the hypothetical best result possible. Determining that standard is difficult, and the particular standard can be a bit arbitrary.

It’s also not all that helpful for the average runner trying to understand how their results fit into the bigger picture, because no matter how good they get, they’ll always be a long way off that best possible time.
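To make the mechanics concrete, here is a minimal sketch of the arithmetic as age grading is commonly described: the score is the standard time for the runner’s age and gender divided by their actual time, and the age-graded time rescales the result to the open-class standard. The function name and the standard times below are placeholders invented for illustration; they are not values from any published age grade table.

```python
def age_grade(finish_seconds, age_standard_seconds, open_standard_seconds):
    """Return (age grade score in percent, age-graded time in seconds)."""
    # Score: how close the runner came to the standard for their own age and gender.
    score = 100.0 * age_standard_seconds / finish_seconds
    # Age-graded time: the open-class time that would earn the same score.
    graded_seconds = finish_seconds * open_standard_seconds / age_standard_seconds
    return score, graded_seconds


# Placeholder standards, invented for illustration only (not real table values).
score, graded = age_grade(
    finish_seconds=12000,        # a 3:20:00 marathon
    age_standard_seconds=8000,   # hypothetical best time for this age and gender
    open_standard_seconds=7400,  # hypothetical best time for any age
)
print(round(score, 1), round(graded))
```

With these made-up numbers, a solid 3:20 marathon still scores only about two thirds of the standard, which is the gap described above.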

How Do You Compare Race Results Between Age Groups?

The promise of and problems with the age grading system


Age grading was developed based on stats and studies of the best runners — but I wanted to take a look at this problem from a different vantage point. How do we compare to all of the other runners?

Statistics offer several ways to make these comparisons, but first I needed a large and representative data set to work with. In the second article in the series, I laid out the sample that I’d put together.

In short, I narrowed things down to American marathons that took place in September, October, or November from 2010 to 2019.

I chose these three months to try to capture an entire ‘season’ of running so that I didn’t have to worry about individually weighing which races to include. Some are fast races, some are slow. Collectively, they’re pretty representative.

I also narrowed things down to races with 500 or more runners. This didn’t shrink the sample size much, but it did make the scraping process easier.
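As a rough illustration of those filters, here is a minimal sketch in plain Python. It is not the actual scraping or cleaning code from the series; the `Result` fields, the `select_sample` name, and the idea of counting finishers per race edition are all assumptions made for the example.

```python
from collections import Counter
from datetime import date
from typing import NamedTuple


class Result(NamedTuple):
    # Hypothetical field names, not the schema of the real dataset.
    race: str
    race_date: date
    country: str
    finish_seconds: int


def select_sample(results, min_finishers=500):
    """Keep US fall marathons (Sep-Nov, 2010-2019) with enough finishers."""
    in_window = [
        r for r in results
        if r.country == "USA"
        and 2010 <= r.race_date.year <= 2019
        and r.race_date.month in (9, 10, 11)
    ]
    # Count finishers per race edition so that small races can be dropped.
    race_sizes = Counter((r.race, r.race_date) for r in in_window)
    return [
        r for r in in_window
        if race_sizes[(r.race, r.race_date)] >= min_finishers
    ]
```

Calling `select_sample(all_results)` on a list of `Result` records would return only the rows that pass all three filters.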

Creating a Sample of Marathon Race Results for Age Analysis

The first step in analyzing the age grading system and its alternatives


Once I collected all of the data, I explored what was in the dataset.

The dataset includes over 2 million individual race results. While the biggest group of runners is the under-35 age group, each age group is pretty well represented, at least up until the 70s. There were a decent number of men 70–74, but there was a smaller group of men 75–79 and women 70–74. The group of women 75–79 is smaller still, and there’s just not enough data to support any useful analysis of runners over 80.

Exploring Data to Understand Age Grading and Marathons

Let’s take a look at who’s in our sample and what the data looks like


With the data in hand, I was able to offer up the first alternative: percentiles.

Essentially, this is a method of looking at a distribution and figuring out where a particular result fits into that distribution. By looking at all results, I can say what percentage of runners in a given age group finished slower than a given time — and what percentage beat it.

Assuming the distribution of times between these age groups is fairly consistent, a runner in the top 5% of their age group should be more or less equivalent to another runner in the top 5% of their own age group.
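The idea can be sketched in a few lines of Python. This is only an illustration, assuming each age group’s finish times are available as a list of seconds; the function name and the tiny made-up groups are hypothetical, and the real tables in the series may handle ties or interpolation differently.

```python
from bisect import bisect_right


def percent_beaten(finish_seconds, group_times):
    """Percentage of the age group that finished slower than the given time."""
    ordered = sorted(group_times)
    # Everything to the right of the insertion point is a strictly slower time.
    slower = len(ordered) - bisect_right(ordered, finish_seconds)
    return 100.0 * slower / len(ordered)


# Made-up finish times (seconds) for two age groups, for illustration only.
younger_group = [10000, 11000, 12000, 13000, 14000]
older_group = [13000, 14500, 16000, 17500, 19000]

# Two different times can land at the same spot in their own distributions.
print(percent_beaten(11000, younger_group))
print(percent_beaten(14500, older_group))
```

In this toy data, both runners beat the same share of their own age group, so the percentile method would treat the two results as roughly equivalent.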

Can We Use Percentiles to Compare Running Performances Between Age Groups?

A first look at an alternative to the current age grading system


In a follow-up article, I took a closer look at the individual distributions to ensure that they were similarly constructed. For the most part, they were. Things got a little less clean among the smaller, older age groups, but the general shape of the distributions was the same.

However, upon further analysis, it does seem that the current tables I’ve developed may favor some groups over others — especially at the 99.9th percentile. I think this can be worked out in a future version, but that’s a problem for another day.

Follow Up on Using Percentiles to Compare Race Performances Across Age Groups

Taking a second look and refining the model


Finally, in the most recent article, I took a look at whether z-scores would offer a better way to compare race results.

A z-score is a measure of how far above or below the mean a given data point is. In short, you calculate the mean and standard deviation for each distribution — and then every result can be assigned a standardized number to represent how fast or slow it is.
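In code, that calculation might look like the sketch below. It assumes finish times in seconds and uses the population standard deviation; whether the original analysis used the population or sample version is not stated, so treat that choice (and the function name) as an assumption.

```python
from statistics import mean, pstdev


def z_score(finish_seconds, group_times):
    """Standardize a finish time against its own age group's distribution."""
    mu = mean(group_times)
    # Population standard deviation (assumption: the article does not say which).
    sigma = pstdev(group_times)
    return (finish_seconds - mu) / sigma


# Made-up finish times (seconds) for one age group, for illustration only.
group = [12000, 14000, 16000]
print(z_score(14000, group))  # exactly at the mean
print(z_score(12000, group))  # faster than the mean, so negative
```

Because a lower finish time is better, faster runners get negative z-scores, and the more negative the score, the further the result sits from the pack.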

In a broad sense, it works. Results with z-scores of -2 or lower are on the outskirts of the distribution — and clearly impressive. But this method is unbalanced, and it definitely favors younger women over other age groups.

While I think percentiles can be tweaked and calibrated to offer better comparisons — I don’t think that’s possible with z-scores. Nonetheless, I’m going to keep them in the mix for now to help offer context.

Can We Use Z-Scores to Compare Running Performances Between Age Groups?

A second alternative to the current age grade system


Are These Systems Fair to Different Age Groups?

At this point, we have three different systems for comparing race results between age groups — the existing age grading system, tables with percentiles for each age group, and the means to calculate z-scores for a given result.

How fair are these three different approaches? Do they favor one group over another?

One way to look at this question is to compare each age group’s share of the overall sample with its share of the top runners under each method.

In a general sense, you’d expect those distributions to be fairly similar. Perhaps younger age groups will be overrepresented because there are more elite athletes. But otherwise — if the system is fair — there shouldn’t be any huge disparities.
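Here is a minimal sketch of that comparison, assuming each runner is a dictionary with a group label and one score per method. The field names, the function name, and the four made-up runners are hypothetical and only show the shape of the calculation.

```python
from collections import Counter


def representation(runners, top_n, score_key, higher_is_better=True):
    """Map each group to (percent of sample, percent of the top N by score)."""
    ranked = sorted(runners, key=lambda r: r[score_key], reverse=higher_is_better)
    top = ranked[:top_n]
    sample_counts = Counter(r["group"] for r in runners)
    top_counts = Counter(r["group"] for r in top)  # missing groups count as 0
    return {
        group: (
            100.0 * sample_counts[group] / len(runners),
            100.0 * top_counts[group] / len(top),
        )
        for group in sample_counts
    }


# Made-up runners, for illustration only.
runners = [
    {"group": "M under 35", "age_grade": 90, "z": -1.0},
    {"group": "M under 35", "age_grade": 88, "z": 0.0},
    {"group": "F under 35", "age_grade": 70, "z": -3.0},
    {"group": "F under 35", "age_grade": 60, "z": -2.0},
]

# Age grade: higher is better. Z-score: lower (more negative) is better.
print(representation(runners, 2, "age_grade"))
print(representation(runners, 2, "z", higher_is_better=False))
```

A fair method would leave the two numbers in each pair close together. A large gap means a group is over- or underrepresented among the top finishers.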

The visual below shows the top 100 finishers, by each grading methodology, from 2019. It also shows the percentage of all runners who are in each age group.

There are four bars per age group, and if you hover over the bar it will tell you what it represents.

The left bar is simply the percentage of runners in the overall sample in that age group. If things were distributed evenly and randomly, that’s how many runners we’d expect to see in a group of 100.

The second bar is the number of runners in the top 100 based on age grading. The third is the number based on percentiles, and the fourth is the number based on z-scores.

Let’s start with the comparison between age grading and the overall distribution. Men under 35 are a large outlier here. They represent about 18 of every 100 runners, but 38 of the top 100 age grade scores belong to men under 35. Meanwhile, men 35–39, men 40–44, and women under 35 are all heavily underrepresented.

Now, it might be a plausible outcome for the youngest age group to be overrepresented — since there are more professional runners in that age group and fewer in the older groups. But that doesn’t explain why young men are so heavily overrepresented while young women are not.

If we move over a bar and compare percentiles to the overall distribution, the problem is flipped. Women under 35 are heavily overrepresented. Men under 35 are still overrepresented, but it’s a little more even. For the rest of the age groups, it’s not that bad.

Finally, the most out-of-whack bar on the whole graph is women under 35 measured by z-scores. As I noted in that article, they are heavily favored in that model. Although only 17 of every 100 runners in the sample are women under 35, 53 of the top 100 finishers by z-score fall in that age group. That’s a problem.

The top 100 is a small group, which makes it even more prone to outliers. So what happens if we zoom out to the top 500?

When you look at age grading, men under 35 are still far out of line. And women under 35 are still pretty starkly underrepresented. The other male age groups don’t look that out of whack, but the next two women’s age groups, 35–39 and 40–44, are also heavily underrepresented.

When you look at percentiles, the men under 35 come down to earth, but the women under 35 are now overrepresented. Women 35–39 also do pretty well. Men under 35 are overrepresented, but again, the difference is less stark.

Another issue — which I didn’t notice before — is that men 45–49, 50–54, and 55–59 are pretty underrepresented here. They make up a large portion of the sample, but very few of the top 500.

Backing out to the top 1,000 finishers gives an even smoother and more consistent picture.

Again, age grading heavily favors young men and disfavors young women. Z-scores heavily favor young women and disfavor older men.

At this level of precision, the distribution of runners based on percentiles seems the most even. Both young men and women are overrepresented, but not by a ton. Runners in their 50s, on the other hand, don’t fare so well.

At the end of the day, I think this data suggests that all three methods are unbalanced in some way. Age grading tends to favor the results of young men, while z-scores heavily favor young women. Percentiles have their own imbalances, but it may be possible to calibrate this a little better.

It’s also worth noting that distinguishing between the very best runners is not a strong suit for percentiles, and the top 1,000 represents only about the top 0.5% of runners in the sample.

Calculate Your Own Age Grades With This Online Calculator

In order to conduct this analysis, I’ve done a lot of work with Python and the Pandas package. I’ve assembled a large sample set, and based on look-up tables and calculations I can quickly spit out results based on each system.

However, that doesn’t help you — the reader.

To try and rectify that, I’ve put together an online calculator which is available here.

Mind you, this is still a beta version of the calculator and it needs some improvements. Notably, I haven’t implemented any kind of error handling — so if you don’t properly enter the time or you enter some unexpected values you’re liable to get no results.

But it does allow you to input three things (a runner’s age, gender, and finish time) and see their score based on each of the three grading systems. You’ll see their age graded time, their age grade score, their percentile among their peers, and the z-score of their result.
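On the error-handling point, here is one way the finish time input could be validated before any scores are calculated. This is only a sketch and not the calculator’s actual code; it assumes times are entered as H:MM:SS, which is a guess about the input format, and the function name is hypothetical.

```python
def parse_finish_time(text):
    """Convert 'H:MM:SS' into seconds, raising ValueError on malformed input."""
    parts = text.strip().split(":")
    if len(parts) != 3:
        raise ValueError(f"expected H:MM:SS, got {text!r}")
    try:
        hours, minutes, seconds = (int(p) for p in parts)
    except ValueError:
        raise ValueError(f"non-numeric component in {text!r}") from None
    if hours < 0 or not (0 <= minutes < 60) or not (0 <= seconds < 60):
        raise ValueError(f"out-of-range component in {text!r}")
    return hours * 3600 + minutes * 60 + seconds


print(parse_finish_time("3:45:30"))
```

Catching the `ValueError` would let a calculator show a helpful message instead of returning no results.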

I will go back and improve on this a bit in the future, but now that I have a rudimentary system in place I thought it was important to share it with you.

What’s Next?

One of the biggest hurdles to this point has been creating the online calculator. With that out of the way, I can look ahead to the final piece of this project: collecting new data and updating the models.

I wanted to start with the older (and bigger) dataset covering 2010 to 2019 to deal with some of the technical issues first. Now that I’ve got a good handle on how to do this, I plan to collect new data from 2023 and update both the percentile model and the z-score model. I’ll also compare them to the newer 2023 age grade tables.

That will be forthcoming in the next week or two, once I’ve had a chance to scrape the new data.

After that, I think it’s about time to bring this to a close. But I do plan on sharing the full dataset on Kaggle once I’m finished so that others can conduct their own analysis.

So a final installment in this series will likely respond to some critiques that have been offered, make some general conclusions, and share access to the dataset.

If any of that interests you, make sure you subscribe for email updates to get the next couple of articles. I expect to publish them in the next two or three weeks.

And if you have any feedback or ideas that will help with this analysis — please leave a response. It always helps to have a second (or third or fourth) opinion!

I’m an avid runner and a data nerd. I just turned 40, so comparing results across age groups is of particular interest to me.