Can We Use Z-Scores to Compare Running Performances Between Age Groups?

A second alternative to the current age grade system

How can you effectively compare race results between different age groups?

That’s the question I’ve been exploring in an ongoing series of articles.

As we get older, we inevitably slow down. It hits some people harder than others — but the bottom line is that a 60-year-old man can’t compete head-to-head with a 25-year-old man on a level playing field.

To keep things interesting for masters runners and to maintain an inclusive running community, a lot of effort has been put into developing a system — called age grading — to allow for comparisons between those two athletes. In recent weeks, I’ve been using data to explore some possible alternatives.

Previously, I offered up percentiles as another way to make comparisons. I wrote a follow-up on that topic here.

There’s some potential there, but there are also some drawbacks. I’ll return to that topic later and try to square some of those circles and improve things.

But for today, I want to focus on another possible method of comparison: z-scores.

Can this statistical tool help us effectively compare race results from different types of runners? Let’s take a look and find out.

What Are Z-Scores?

First, let’s go over some basic statistics. Before we can compute z-scores for individual race results, we need to go over a few key concepts.

Let’s say you have a sample of race results.

There are different ways to describe the ‘typical’ or ‘average’ runner. For our purposes, we’re going to use the mean. To calculate the mean, you add up all of the results and divide by the number of results.

In our case, I’ve got a sample that includes 360,075 race results from men who are under 35. If you were to do the math (I used the Pandas package in Python and let my computer do the work for me), the mean finish time for this group is 4:16:36 (more on the sample later).

That only tells part of the story, though. While many runners finish around that 4:16 mark, the fastest runner in this sample finished in 2:03:45. Others took 6, 7, or 8 hours to finish.

Another fundamental concept is that of variance and deviation. How much do the individual results vary from the mean? Are they clustered around it tightly or are they spread out?

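Without getting into the math, we can use Pandas to calculate the standard deviation of this sample: 54:07. Loosely speaking, this is the typical distance between a given race result and the mean finish time. Strictly, it is the square root of the average squared distance, so results far from the mean count for more.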

A z-score is a standardized way to understand how far from the mean a given result is — based on the standard deviation of that sample.
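For anyone who prefers symbols, this is the standard textbook definition (nothing specific to this dataset), where x is a finish time, μ is the mean of its group, and σ is that group's standard deviation:

```latex
z = \frac{x - \mu}{\sigma}
```

A negative z means the runner was faster than the group mean, and a positive z means slower.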

If a runner finished 54:07 below the mean (3:22:29), the z-score of that result would be -1, or one standard deviation below the mean. If a runner finished twice as far below the mean (2:28:22), the z-score would be -2: two standard deviations below the mean.
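Here is a minimal sketch of that arithmetic in plain Python, using the mean and standard deviation quoted above for men under 35. The helper names `to_seconds` and `z_score` are my own for illustration, not anything from the original analysis.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' string to a number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


# Figures quoted in the article for men under 35.
MEAN = to_seconds("4:16:36")
STD = to_seconds("0:54:07")


def z_score(finish: str, mean: int = MEAN, std: int = STD) -> float:
    """Number of standard deviations a finish time sits from the mean."""
    return (to_seconds(finish) - mean) / std


print(z_score("3:22:29"))  # -1.0, one standard deviation below the mean
print(z_score("2:28:22"))  # -2.0, two standard deviations below the mean
```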

The visual above illustrates this concept.

The histogram represents the 360,075 individual race results and what percentage of them fall into each 5-minute bucket (e.g. 3:55 to 4:00).

The dashed lines show where the mean is, as well as one and two standard deviations above and below the mean.
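If you want to rebuild the buckets behind a histogram like this, one possible approach is to floor each finish time to the start of its 5-minute window and count the share in each. This is only a sketch on a handful of made-up times; the series name `finish_s` and the idea of storing times as seconds are assumptions on my part.

```python
import pandas as pd

# Made-up finish times in seconds (the real sample has 360,075 rows).
finish_s = pd.Series([14100, 14250, 14399, 14400, 15000, 16200])

# Floor each time to the start of its 5-minute (300-second) bucket,
# so 3:55:00 through 3:59:59 all land in the 3:55 bucket.
bucket_start = (finish_s // 300) * 300

# Percentage of all results that fall in each bucket, ordered by time.
pct = bucket_start.value_counts(normalize=True).sort_index() * 100
print(pct)
```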

Notice that the farther a time sits from the mean, the fewer runners there are beyond it: fewer have beaten a time far below the mean, and fewer have run slower than a time far above it.

If you apply this concept to different samples — like other age groups — the actual mean and standard deviations will vary. But the general principle will hold — far fewer runners will have finished the race at two standard deviations below the mean than one standard deviation below the mean.

This could, potentially, offer a way to compare how “good” an individual result is — by measuring how far below the mean it is.

Calculating Mean and Standard Deviation for Each Age Bracket

The first step is to collect a bunch of data — which I’ve already done. You can read more here and here, but this is the short version.

To facilitate this analysis, I identified a group of races to use as a sample. This includes every marathon run in the United States in September, October, or November from 2010 to 2019 with 500 or more finishers. Then, I collected gender, age, and finish time for each finisher — totaling 2,017,493 results.
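The selection rules above could be expressed in Pandas roughly like this. It's a sketch on a toy table, and the column names (`race`, `date`, `finishers`) are hypothetical stand-ins rather than the actual layout of my data.

```python
import pandas as pd

# Toy race-level table; names and numbers are made up for illustration.
races = pd.DataFrame({
    "race": ["A", "B", "C", "D"],
    "date": pd.to_datetime(
        ["2015-10-04", "2015-04-20", "2021-11-07", "2012-09-16"]
    ),
    "finishers": [8000, 26000, 25000, 450],
})

in_season = races["date"].dt.month.isin([9, 10, 11])    # Sept, Oct, Nov
in_years = races["date"].dt.year.between(2010, 2019)    # inclusive range
big_enough = races["finishers"] >= 500                  # 500 or more finishers

sample = races[in_season & in_years & big_enough]
print(sample["race"].tolist())
```

Only race A survives all three filters here: B is a spring race, C falls outside the year range, and D is too small.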

After some cleaning and preparation, I’ve loaded these results into a Pandas dataframe to enable easy analysis. When I finish this series, I’ll be sharing the full dataset on Kaggle in case you’re interested in doing your own analysis.

With the results in Pandas, it’s fairly simple to group the results by gender and age bracket and then calculate both the mean and the standard deviation for those groups. Note, I’m using the age brackets that the BAA uses for Boston Marathon qualifying purposes, and I’m not including any runners over 80 because the sample is just too small.
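For the curious, the grouping step might look something like the sketch below. The column names (`gender`, `age`, `finish_s`) and the toy rows are assumptions, and the bracket edges are my reading of the Boston qualifying groups (everyone under 35 together, then five-year brackets). Note that Pandas computes the sample standard deviation (dividing by n - 1) by default.

```python
import pandas as pd

# Toy results; finish_s is the finish time in seconds.
results = pd.DataFrame({
    "gender": ["M", "M", "M", "F", "F", "F", "M"],
    "age": [25, 31, 34, 36, 38, 39, 82],
    "finish_s": [12000, 15000, 18000, 14000, 16000, 18000, 25000],
})

# Left-closed bins: [0, 35), [35, 40), ... [75, 80).
edges = [0, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80]
labels = ["<35", "35-39", "40-44", "45-49", "50-54",
          "55-59", "60-64", "65-69", "70-74", "75-79"]
results["bracket"] = pd.cut(
    results["age"], bins=edges, labels=labels, right=False
)

# Ages of 80 and above fall outside every bin and come back as NaN.
results = results.dropna(subset=["bracket"])

# Mean and standard deviation of finish time per gender and bracket.
stats = (
    results.groupby(["gender", "bracket"], observed=True)["finish_s"]
    .agg(["mean", "std"])
)
print(stats)
```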

The visual above plots the mean finish time for each age bracket. The red dots are men and the green dots are women.

There’s nothing surprising here. As you move to the right, the mean finish times get slower. The differences aren’t big among the younger runners, but they grow increasingly significant as you move into the 50s and 60s.

On average, women finish slower than men of a similar age. But the trend between ages is similar for both women and men.

The visual above shows what the standard deviation is for each of those groups.

This is not exactly what I expected — and it might present a problem.

The standard deviation is pretty similar across groups. Aside from the older men, every group has a standard deviation between 50 and 55 minutes. The three older male age brackets are only slightly higher (55 to 60 minutes).

I’m not entirely sure what I expected — but I thought this would scale in some way with the mean. We’ll see how things play out below, but I have a feeling that this may result in one group or another having more extreme z-scores than others — and thus being overvalued in any comparison.

Applying Normalized Z-Scores to Individual Results

Once I calculated the mean and standard deviation for each group, I took a subset of the results (2019) and calculated a z-score for each of them.
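As a rough illustration of that step, the sketch below computes group statistics from a full (toy) sample, attaches them to the 2019 rows, and scores each result. The column names `year`, `group`, and `finish_s` are hypothetical, and the numbers are invented so the arithmetic is easy to follow.

```python
import pandas as pd

# Toy sample spanning two years and two groups.
results = pd.DataFrame({
    "year": [2018, 2018, 2018, 2019, 2019] * 2,
    "group": ["M<35"] * 5 + ["F<35"] * 5,
    "finish_s": [12000, 15000, 18000, 12000, 18000,
                 14000, 17000, 20000, 14000, 20000],
})

# Mean and standard deviation per group, from the full sample.
stats = results.groupby("group")["finish_s"].agg(["mean", "std"])

# Attach each group's statistics to the 2019 rows, then score them.
subset = results[results["year"] == 2019].join(stats, on="group")
subset["z"] = (subset["finish_s"] - subset["mean"]) / subset["std"]

# The lowest z-scores are the strongest results under this measure.
top = subset.nsmallest(2, "z")
print(top[["group", "finish_s", "z"]])
```

Scoring a single race then comes down to filtering to that race before calling `nsmallest`.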

Let’s take a look at a few races to see how this works out in practice.

We’ll start with a major race — the Twin Cities Marathon in Minneapolis. Here are the top 10 finishers by z-score. For the sake of comparison, I’ve also included the age grade score (according to the 2020 age grade tables).

+----------+-------+----------+----------+-------------+
| Gender   |   Age | Finish   |   zScore |   Age Grade |
|----------+-------+----------+----------+-------------|
| F        |    40 | 02:34:07 |    -2.55 |       89.12 |
| F        |    27 | 02:31:29 |    -2.54 |       88.5  |
| F        |    24 | 02:32:49 |    -2.51 |       87.73 |
| F        |    30 | 02:35:50 |    -2.45 |       86.03 |
| F        |    33 | 02:36:34 |    -2.44 |       85.69 |
| F        |    31 | 02:38:46 |    -2.4  |       84.44 |
| F        |    26 | 02:40:08 |    -2.37 |       83.72 |
| F        |    37 | 02:40:24 |    -2.37 |       84.42 |
| F        |    30 | 02:40:13 |    -2.37 |       83.68 |
| F        |    29 | 02:41:13 |    -2.35 |       83.16 |
+----------+-------+----------+----------+-------------+

Something that immediately jumps out at me is that all ten of these finishers are women. They do have high age grades and these are impressive finish times … but I find it hard to believe that there are no men at all with comparable times.

For a second example, let’s look at a smaller race — the Atlantic City Marathon in New Jersey (which, incidentally, is the first marathon I ran).

+----------+-------+----------+----------+-------------+
| Gender   |   Age | Finish   |   zScore |   Age Grade |
|----------+-------+----------+----------+-------------|
| F        |    28 | 02:42:48 |    -2.32 |       82.35 |
| M        |    35 | 02:19:15 |    -2.27 |       87.84 |
| M        |    32 | 02:21:46 |    -2.13 |       85.83 |
| M        |    61 | 03:00:04 |    -2.06 |       83.75 |
| F        |    41 | 03:07:27 |    -1.92 |       73.72 |
| M        |    34 | 02:33:10 |    -1.92 |       79.66 |
| M        |    56 | 02:58:22 |    -1.91 |       80.66 |
| M        |    56 | 03:01:43 |    -1.84 |       79.17 |
| F        |    41 | 03:12:30 |    -1.83 |       71.79 |
| F        |    24 | 03:11:47 |    -1.76 |       69.91 |
+----------+-------+----------+----------+-------------+

In this case, there is a mix of men and women in the results. But take a look at the top two results.

The top result — with a z-score of -2.32 — is a 28-year-old woman who ran 2:42. That’s a great time (she was the first female finisher and the fourth finisher overall), but is it better than the 35-year-old man who ran 2:19?

Here’s one final example — the Richmond Marathon in Virginia.

+----------+-------+----------+----------+-------------+
| Gender   |   Age | Finish   |   zScore |   Age Grade |
|----------+-------+----------+----------+-------------|
| F        |    23 | 02:36:19 |    -2.44 |       85.77 |
| F        |    30 | 02:36:30 |    -2.44 |       85.67 |
| F        |    28 | 02:40:08 |    -2.37 |       83.72 |
| F        |    29 | 02:43:31 |    -2.3  |       81.99 |
| F        |    23 | 02:47:03 |    -2.24 |       80.26 |
| F        |    36 | 02:47:54 |    -2.23 |       80.38 |
| F        |    35 | 02:48:45 |    -2.21 |       79.76 |
| F        |    26 | 02:49:08 |    -2.2  |       79.27 |
| F        |    28 | 02:49:29 |    -2.19 |       79.1  |
| F        |    28 | 02:50:19 |    -2.17 |       78.72 |
+----------+-------+----------+----------+-------------+

Here, again, all of the top 10 finishers are women. The top two (2:36) are pretty impressive, but again it’s hard to believe there are no men who would fit on this list anywhere.

These women are also all young — with no one above the 35–39 age group represented.

To dig a little deeper, here are the top ten men by z-scores from the Richmond Marathon.

+----------+-------+----------+----------+-------------+
| Gender   |   Age | Finish   |   zScore |   Age Grade |
|----------+-------+----------+----------+-------------|
| M        |    31 | 02:19:43 |    -2.17 |       87.07 |
| M        |    25 | 02:20:54 |    -2.15 |       86.34 |
| M        |    30 | 02:21:34 |    -2.14 |       85.93 |
| M        |    24 | 02:22:00 |    -2.13 |       85.67 |
| M        |    35 | 02:27:14 |    -2.11 |       83.08 |
| M        |    25 | 02:24:14 |    -2.09 |       84.34 |
| M        |    55 | 02:48:50 |    -2.08 |       84.45 |
| M        |    22 | 02:25:32 |    -2.06 |       83.59 |
| M        |    26 | 02:26:26 |    -2.05 |       83.08 |
| M        |    42 | 02:36:17 |    -2.01 |       81.55 |
+----------+-------+----------+----------+-------------+

So there were some really high-performing men. But the 31-year-old man who ran 2:19 only had a z-score of -2.17. Coincidentally, that puts him just behind the 10th-place woman who ran 2:50 (both scores round to -2.17).

There’s a little more age variation here (a 55-year-old man and a 42-year-old man), but the majority of these finishers are in the men under 35 age group.

The Problem with Using Z-Scores

I think the obvious problem here is that this measure tends to over-value the results of women — especially young women. In the case of two large-ish marathons, the full top ten list was taken up by women.

Why is this happening?

Think back to the mean and standard deviation for each age group.

The mean finish time for women under 35 is 4:43:20. This is 27 minutes slower than the men of the same age, and slower than every male age group through 55–59.

At the same time, their standard deviation (51:59) is one of the lowest. It’s about two minutes lower than the men of the same age.

Combined, this creates much more room for the best women to perform well below the mean — and achieve lower z-scores than are possible for other age groups.

At the time (2019), the men’s world record in the marathon was 2:01:39 (set by Eliud Kipchoge at Berlin 2018). This would achieve a z-score of -2.51.

Meanwhile, the then women’s world record was 2:14:04 (set by Brigid Kosgei at Chicago 2019). This would achieve a z-score of -2.86.

There’s a huge built-in advantage for women under 35 with this system. Even a woman running 2:30 would achieve a z-score (-2.55) below Kipchoge’s world record.
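You can sanity-check the world record comparison with the group figures quoted in this article. This is a sketch with my own helper names, and because it uses means and standard deviations rounded to the second, the results come out close to, but not exactly on, the z-scores quoted above (roughly -2.49 and -2.87 here). I'm assuming the small gap is down to rounding.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' string to a number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def z_score(finish: str, mean: str, std: str) -> float:
    """Standard deviations between a finish time and the group mean."""
    return (to_seconds(finish) - to_seconds(mean)) / to_seconds(std)


# Group figures quoted in the article, rounded to the second.
men_u35 = {"mean": "4:16:36", "std": "0:54:07"}
women_u35 = {"mean": "4:43:20", "std": "0:51:59"}

kipchoge = z_score("2:01:39", **men_u35)
kosgei = z_score("2:14:04", **women_u35)

print(round(kipchoge, 2), round(kosgei, 2))
```

Either way the ordering is the same: the women's record scores well below the men's record.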

This visual shows what percentage of runners in each age group achieved a z-score below -2 (blue) or between -1 and -2 (purple).

Although the group is still small, there are far more women under 35 with scores below -2 (1.27%) than men under 35 (0.79%).

At 35–39, more than twice as many women (0.89%) as men (0.38%) score below -2.
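Percentages like these can be tallied with a couple of boolean columns and a group-wise mean. The sketch below uses made-up z-scores, and the names `group`, `z`, `below_2`, and `minus2_to_minus1` are placeholders of mine.

```python
import pandas as pd

# Made-up z-scores for two groups.
scored = pd.DataFrame({
    "group": ["F <35"] * 4 + ["M <35"] * 4,
    "z": [-2.3, -1.5, -0.2, 0.8, -1.1, -0.4, 0.3, 1.9],
})

# Flag results below -2, and results from -2 up to (not including) -1.
flags = scored.assign(
    below_2=scored["z"] < -2,
    minus2_to_minus1=(scored["z"] >= -2) & (scored["z"] < -1),
)

# The mean of a boolean column is the share of True values.
share = (
    flags.groupby("group")[["below_2", "minus2_to_minus1"]].mean() * 100
)
print(share)
```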

There’s also some funky stuff going on among runners in their 60s and 70s — but the groups are so small that it may not be apparent in individual race results.

So Are Z-Scores Useful for Understanding Race Results?

If we’re looking for a complete alternative to age grading, then I’d say no.

It’s pretty clear that this is unbalanced, and that some groups have an advantage over others. The higher mean finish times among young women give them a lot more room to finish below that mean and receive a low z-score.

As a concept, I’d say this is useful if you want to make some general comparisons and judgments. Knowing that a runner is either one or two standard deviations beneath the mean does give you a general sense of how accomplished they are — and how difficult their result is.

But I don’t think there’s a good way to calibrate this to make better comparisons at the extremes while staying true to the concept.

Age grading may have its own problems when it comes to calibration, but I don’t see this system as an improvement.

Maybe someone else can take the data and make this work better, but I think I’m going to put this on the shelf for now and cross it off my list.

So What’s Next?

At this point, we’ve looked at two alternatives — percentiles and z-scores.

Moving forward, I want to take a step back and see how these two alternatives compare to age grading. I also want to create and share a calculator that you can use to score your own results and see how they compare.

After that, I want to update the data I’m using to 2023, tweak the percentile system a bit, and compare it to the latest age grade tables from 2023.

I think that will bring this series to a close — and I’ll wrap up by sharing the full dataset on Kaggle if you want to conduct your own analysis.

If any of that interests you, make sure you subscribe for email updates to get the next couple of articles. I expect to publish them in the next two or three weeks.

And if you have any feedback or ideas that will help with this analysis — please leave a response. It always helps to have a second (or third or fourth) opinion!

I’m an avid runner and a data nerd. I just turned 40, so comparing results across age groups is of particular interest to me. Here’s how you can keep up with what I’m doing: