Momin Mehmood Butt


Pythagorean Expectation and Indian Premier League (IPL)

In our last example for this article, we’ll look at cricket’s most high-profile competition, the IPL. We will be using data from the matches played in the 2018 IPL season, and the dataset has the following columns:

Image by Author

First we identify when the home team is the winning team, and when the visiting team is the winner. Next we identify the runs scored by the home team and the away team (note: unlike baseball, where there are nine innings for each team, in T20 cricket each team gets only one inning, and once the first team completes its inning, the opposing team has its inning). Finally, we include a counter which we can add up to give the total number of games for each team.

IPL18['hwin'] = np.where(IPL18['home_team']==IPL18['winning_team'],1,0)
IPL18['awin'] = np.where(IPL18['away_team']==IPL18['winning_team'],1,0)
IPL18['htruns'] = np.where(IPL18['home_team']==IPL18['inn1team'],IPL18['innings1'],IPL18['innings2'])
IPL18['atruns'] = np.where(IPL18['away_team']==IPL18['inn1team'],IPL18['innings1'],IPL18['innings2'])
IPL18['count'] = 1

One thing to note here is that there are only 60 rows (matches) in the IPL18 Data Frame. The amount of data we have for the cricket example is therefore far smaller than in the basketball and baseball examples we covered earlier, and this could be a potential issue, as we’ll find out later.

As in the MLB example, we have to create two separate Data Frames for home and away teams in the IPL’s case, too. We use the same .groupby command to aggregate the performance of home and away teams during the 2018 season and merge these two Data Frames to get a combined Data Frame that shows the performance of each of the eight IPL teams:

# Select the columns with a list (double brackets); recent pandas versions reject the bare tuple form
IPLhome = IPL18.groupby('home_team')[['count','hwin','htruns','atruns']].sum().reset_index()
IPLhome = IPLhome.rename(columns={'home_team':'team','count':'Ph','htruns':'htrunsh','atruns':'atrunsh'})

IPLaway = IPL18.groupby('away_team')[['count','awin','htruns','atruns']].sum().reset_index()
IPLaway = IPLaway.rename(columns={'away_team':'team','count':'Pa','htruns':'htrunsa','atruns':'atrunsa'})

IPL18 = pd.merge(IPLhome, IPLaway, on = ['team'])

This gives us the basic data we need for each team: wins as home team and as away team, games played as home team and as away team, runs scored as home team and as away team, and runs against when playing at home and when playing away. We add these together to get the season totals:

IPL18['W'] = IPL18['hwin']+IPL18['awin']
IPL18['G'] = IPL18['Ph']+IPL18['Pa']
IPL18['R'] = IPL18['htrunsh']+IPL18['atrunsa']
IPL18['RA'] = IPL18['atrunsh']+IPL18['htrunsa']

Image by Author

The win percentage, which is the wins divided by the number of games played, and the Pythagorean Expectation, which is runs scored squared divided by the sum of runs scored squared and runs against squared, can now be easily calculated:

IPL18['wpc'] = IPL18['W']/IPL18['G']
IPL18['pyth'] = IPL18['R']**2/(IPL18['R']**2 + IPL18['RA']**2)

Having prepared the data, we are now ready to examine the relationship between the dependent and the independent variable using a scatterplot.

Image by Author

We can see that there is a very weak correlation between win percentage and the Pythagorean Expectation. Firstly, because we’ve only got eight teams, we have far fewer dots, so it’s harder to discern any relationship. The second thing to notice is that the dots tend to be scattered all over the plot; they’re not neatly organized from left to right in an upward-sloping relationship as we saw in the previous two examples.

Image by Author

This is further confirmed when we fit a linear regression model to this relationship. This time, while the coefficient on pyth is positive, implying that a higher Pythagorean Expectation is associated with a larger win percentage, the standard error is also very large, and the t statistic of 1.353 implies a p-value of 0.225, well above the usual threshold of 0.05. This, in turn, means that the coefficient estimate is not significantly different from zero, so we find no statistically significant relationship between Pythagorean Expectation and win percentage in the IPL example.

There could be several reasons why the Pythagorean Expectation model didn’t produce a good fit for the IPL dataset. For starters, as established above, the data we had for the IPL was very limited: 60 matches and 8 teams as opposed to some 2,300 matches and 30 teams in MLB. Random variations are likely to be smoothed out when analyzing data on a large scale, so with so little data there is a much greater chance that random variations overwhelmed the Pythagorean relationship in the IPL example, even if the model is correct.

Another interpretation could be that there is some fundamental difference between cricket and sports like baseball which makes the Pythagorean model appropriate for one but not the other. For example, in cricket, the team batting second need only score one more run than the opponent to win, and so the inning ends if it reaches this milestone. If the team batting second is the winning team, then the gap in the scores will be small. However, if the team batting first can get all ten wickets cheaply, then the gap in scores could be very large. In our data the average runs difference when the team batting second won was 2, and when the team batting first won was 30. This asymmetry may explain why the Pythagorean Expectation is not a good guide to winning in the IPL.

Perhaps we can look into the data innings-wise in a separate article and analyze separately the games where the winning team bats first and those where it bats second. For now, this article concludes here. In the subsequent article, we’ll look into how the Pythagorean Expectation can be used as a predictor in the English Premier League (EPL).

References:

1. Foundations of Sports Analytics: Data, Representation, and Models in Sports: https://www.coursera.org/learn/foundations-sports-analytics/home/welcome
2. Baseball Reference: https://www.baseball-reference.com/
3. Pythagorean Expectation: https://en.wikipedia.org/wiki/Pythagorean_expectation
4. Dataset from Retrosheet: https://www.retrosheet.org/gamelogs/index.html (License: https://www.retrosheet.org/notice.txt)

Pythagorean Expectation in Sports Analytics, with Examples From Different Sports

Pythagorean Expectation is used in sports such as baseball, basketball, football and hockey to support data-driven analytics and predictive modeling

Image by Tim Gouw on Unsplash

Pythagorean Expectation is a sports analytics formula, the brainchild of one of the great baseball analysts and statisticians, Bill James. Originally devised for baseball, it was eventually adopted in other professional sports as well, such as basketball, soccer, American football and ice hockey.

The formula states that the percentage of games a professional sports team wins across a given season should be proportional to the square of the points/runs/goals scored by the team in the season, divided by the sum of the squares of the points/runs/goals scored by the team and by its opponents across the whole season:

Pythagorean Expectation = (points scored)² / ((points scored)² + (points conceded)²)

This concept can help to explain why teams are successful, and it can also be used as the basis for predicting future results. It’s a relationship that we can measure with data: we can calculate the Pythagorean Expectation for each team and then test whether it truly is related to the win percentage of the team across a given season.

Over time, the Pythagorean Expectation formula has been tinkered with and enhanced according to different use cases. The modifications have mostly centered on the value of the exponent. The ideal exponent in baseball’s case was found to be 1.83 rather than 2. Pythagenport and Pythagenpat are two modified forms of Bill James’ original formula that have been used in baseball to calculate the ideal exponent from the run environment rather than using a fixed exponent value.

Similarly, statisticians have derived different ideal exponents for other sports as well: 13.91 in basketball and 2.37 in ice hockey. Basketball’s higher exponent is usually attributed to the smaller role that chance plays in basketball as opposed to a sport like baseball.
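To make the role of the exponent concrete, here is a minimal sketch of the formula with the exponent as a parameter. The function name and the season totals (800 runs scored, 700 conceded) are made up for illustration and are not taken from the article's data.

```python
def pythagorean_expectation(scored, conceded, exponent=2.0):
    """Expected win fraction from points/runs scored and conceded."""
    s = scored ** exponent
    c = conceded ** exponent
    return s / (s + c)

# Hypothetical season totals, chosen only for illustration.
runs_scored, runs_conceded = 800, 700

classic = pythagorean_expectation(runs_scored, runs_conceded)        # exponent 2
refined = pythagorean_expectation(runs_scored, runs_conceded, 1.83)  # baseball exponent quoted above

print(round(classic, 4), round(refined, 4))
```

A smaller exponent pulls the expectation towards 0.5 for the same scoring totals, while a larger one (as quoted for basketball) pushes it away from 0.5.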

In this article, however, we’ll delve into the basic form of the Pythagorean Expectation formula and see how it relates to the win percentage of teams in different professional sports. In a subsequent article, we will also look at how Pythagorean Expectation can be used as a predictor, i.e., how we can forecast future win percentage using historical Pythagorean Expectation.

Pythagorean Expectation and Major League Baseball (MLB)

We will start by importing the following modules into our Jupyter Notebook:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

In this section, we’ll be looking at the MLB games played in the 2018 season, a log of which can be downloaded from Retrosheet. Here’s a glimpse of the Data Frame:

Image by Author

The screenshot above covers only the first few columns. In total, there are 161 columns/features/variables in the MLB dataset and 2,431 rows, where each row represents a single game. For this article, though, we need only a handful of columns: home team, visiting team, runs scored by the home team, runs scored by the visiting team, and the date of the match. We also rename some columns to slightly shorter names for ease of use:

MLB18 = MLB[['VisitingTeam','HomeTeam','VisitorRunsScored','HomeRunsScore','Date']]
MLB18 = MLB18.rename(columns={'VisitorRunsScored':'VisR','HomeRunsScore':'HomR'})

Now, our dataset consists of individual games, and in each game there are two teams: a home team and a visiting team. If we want to calculate the total number of runs scored by a team across the entire season, we need to take into account the runs it scored as the home team and the runs it scored as the visiting team. We also need the runs scored against it when it was the home team and when it was the visiting team. To do that, we cut this Data Frame up into two smaller Data Frames: one for teams when they’re visiting teams, one for teams when they’re home teams. Then we merge those two data sets to get the aggregate for each team across the whole season.

Before we do that, we have to define the winner of each game, which in baseball’s case is simple: the team that scores the most runs wins. We can use NumPy’s np.where function to record wins in two different columns, home team wins and away team wins:

MLB18['hwin'] = np.where(MLB18['HomR'] > MLB18['VisR'],1,0)
MLB18['awin'] = np.where(MLB18['HomR'] < MLB18['VisR'],1,0)
MLB18['count'] = 1

The new column, count, will be used later, when we aggregate the records by team, to give the number of games each team played:

Image by Author
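The win flags can be tried out on a tiny frame. This is only a sketch: the three games and team codes below are invented, while the column names follow the article's MLB18 frame.

```python
import numpy as np
import pandas as pd

# Three made-up games; column names follow the article's MLB18 frame.
games = pd.DataFrame({
    "HomeTeam": ["AAA", "BBB", "AAA"],
    "VisitingTeam": ["BBB", "AAA", "CCC"],
    "HomR": [5, 2, 3],
    "VisR": [3, 4, 3],
})

# 1 where the condition holds, 0 otherwise.
games["hwin"] = np.where(games["HomR"] > games["VisR"], 1, 0)
games["awin"] = np.where(games["HomR"] < games["VisR"], 1, 0)
games["count"] = 1

print(games)
```

Because both comparisons are strict, a row with equal scores (the third game here) gets 0 in both win columns but still counts as a game.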

Moving on, we will now create two separate Data Frames, starting with the Data Frame for home teams. We group the MLB18 data set by home team to obtain the sum of wins and runs (scored and conceded) and also the counter variable to show how many games were played (in MLB, the teams do not necessarily play the same number of games in the regular season):

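# Select the columns with a list (double brackets); recent pandas versions reject the bare tuple form
MLBhome = MLB18.groupby('HomeTeam')[['hwin','HomR','VisR','count']].sum().reset_index()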
MLBhome = MLBhome.rename(columns={'HomeTeam':'team','VisR':'VisRh','HomR':'HomRh','count':'Gh'})

There are a total of 30 teams for which the following information is displayed in the table below: name of the team, number of wins for the team as the home team, number of runs scored by the team as the home team, number of runs scored by visitors against the team when it was the home team and the total number of games played by the team as the home team in a given season.

Image by Author

We repeat the same process for the visiting teams. We get the following details from the below code snippet: name of the team, number of wins for the team as the visiting team, number of runs scored by the team as the visiting team, number of runs scored by hosts against the team when it was the visiting team and the total number of games played by the team as the visiting team in a given season.

# Same aggregation for visiting teams, again selecting the columns with a list
MLBaway = MLB18.groupby('VisitingTeam')[['awin','HomR','VisR','count']].sum().reset_index()
MLBaway = MLBaway.rename(columns={'VisitingTeam':'team','VisR':'VisRa','HomR':'HomRa','count':'Ga'})

Image by Author

These two Data Frames summarize the performance of teams as home teams and as visiting teams. What we need to do next is merge them to get the total performance of each team across the season. For this, we use the pandas pd.merge function to combine the two Data Frames on the ‘team’ column.

Image by Author
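The merge call itself only appears as a screenshot, so the following is a sketch of what it presumably looks like, assuming the result is assigned back to MLB18 (the later code uses that name). The two-team frames are invented, and `validate="one_to_one"` is an optional guard added here, not something the article uses.

```python
import pandas as pd

# Made-up per-team aggregates with the column names produced by the rename steps.
MLBhome = pd.DataFrame({"team": ["AAA", "BBB"], "hwin": [50, 40],
                        "HomRh": [400, 350], "VisRh": [330, 380], "Gh": [81, 81]})
MLBaway = pd.DataFrame({"team": ["BBB", "AAA"], "awin": [35, 45],
                        "HomRa": [390, 340], "VisRa": [320, 380], "Ga": [81, 81]})

# One row per team; validate raises if a team appears more than once on either side.
MLB18 = pd.merge(MLBhome, MLBaway, on="team", validate="one_to_one")

print(MLB18)
```

The default inner join keeps only teams present in both frames, which is fine here because every team plays both home and away games.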

From this consolidated dataset, we can now add together these columns to get the total number of wins, games played, runs scored by the team and the runs scored against it as a team across the entire season.

MLB18['W']=MLB18['hwin']+MLB18['awin']
MLB18['G']=MLB18['Gh']+MLB18['Ga']
MLB18['R']=MLB18['HomRh']+MLB18['VisRa']
MLB18['RA']=MLB18['VisRh']+MLB18['HomRa']

Note that there are 30 different teams but for the sake of visibility, we’re displaying the first 10 in the list:

Image by Author
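A quick way to check the home/away bookkeeping is to run the whole pipeline on a toy schedule and verify a few identities. The sketch below assumes three invented games with no ties; the `season` name is hypothetical and stands in for the merged MLB18 frame.

```python
import numpy as np
import pandas as pd

games = pd.DataFrame({
    "HomeTeam": ["AAA", "BBB", "CCC"],
    "VisitingTeam": ["BBB", "CCC", "AAA"],
    "HomR": [5, 2, 1],
    "VisR": [3, 4, 6],
})
games["hwin"] = np.where(games["HomR"] > games["VisR"], 1, 0)
games["awin"] = np.where(games["HomR"] < games["VisR"], 1, 0)
games["count"] = 1

home = (games.groupby("HomeTeam")[["hwin", "HomR", "VisR", "count"]].sum().reset_index()
        .rename(columns={"HomeTeam": "team", "HomR": "HomRh", "VisR": "VisRh", "count": "Gh"}))
away = (games.groupby("VisitingTeam")[["awin", "HomR", "VisR", "count"]].sum().reset_index()
        .rename(columns={"VisitingTeam": "team", "HomR": "HomRa", "VisR": "VisRa", "count": "Ga"}))

season = pd.merge(home, away, on="team")
season["W"] = season["hwin"] + season["awin"]
season["G"] = season["Gh"] + season["Ga"]
season["R"] = season["HomRh"] + season["VisRa"]    # own runs: at home + away
season["RA"] = season["VisRh"] + season["HomRa"]   # opponents' runs: at home + away

# Sanity checks that hold for any schedule without ties.
assert season["W"].sum() == len(games)             # one winner per game
assert season["G"].sum() == 2 * len(games)         # two teams per game
assert season["R"].sum() == season["RA"].sum()     # every run is scored against someone
```

If any of these checks fails on real data, the usual suspects are swapped column names in the rename step or a team missing from one side of the merge.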

The final step in preparing the data is to define win percentage and the Pythagorean Expectation. Win percentage is simply the ratio of the total number of matches won to the total number of matches played in a particular season.

MLB18['wpc'] = MLB18['W']/MLB18['G']
MLB18['pyth'] = MLB18['R']**2/(MLB18['R']**2 + MLB18['RA']**2)

ax = sns.scatterplot(x="pyth", y="wpc", data=MLB18)
plt.show()

Image by Author

The scatterplot above tells us fairly clearly that there is a strong correlation between the Pythagorean Expectation and win percentage in our particular use case: the higher the Pythagorean Expectation, the higher the win percentage of the team is likely to be. This is consistent with the relationship described by Bill James.
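The strength of the relationship seen in a scatterplot can also be summarized as a single number, the Pearson correlation coefficient. The sketch below uses five invented teams rather than the article's data; on the real frame the same call would be `MLB18['pyth'].corr(MLB18['wpc'])`.

```python
import pandas as pd

# Made-up values for five teams, for illustration only.
teams = pd.DataFrame({
    "pyth": [0.40, 0.45, 0.50, 0.55, 0.60],
    "wpc":  [0.42, 0.44, 0.51, 0.54, 0.61],
})

# Series.corr computes the Pearson correlation by default.
r = teams["pyth"].corr(teams["wpc"])
print(round(r, 3))
```

A value close to 1 means the points lie close to an upward-sloping straight line.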

To quantify this relationship, we can fit a regression equation and see how much the win percentage increases for each unit increase in the Pythagorean Expectation.

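# The formula interface adds the intercept automatically, matching the 'Intercept' row in the output
model = sm.formula.ols(formula='wpc ~ pyth', data=MLB18)
results = model.fit()
results.summary()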

The regression output tells you many things about the fitted relationship between win percentage and the Pythagorean Expectation. Regression is a method for identifying an equation which best fits the data. In this case that relationship is: wpc = Intercept + coef x pyth

Image by Author

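We can see that the value of Intercept is 0.0609 and the coefficient is 0.8770. It’s this latter value we are interested in. It means that for every one unit increase in Pythagorean Expectation, the value of win percentage goes up by 0.877. Because the Pythagorean Expectation itself lies between 0 and 1, a more practical reading is that an increase of 0.01 is associated with a rise of about 0.0088 in win percentage.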
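As a worked example, the fitted equation can be evaluated directly. The sketch below uses the two values reported in the table; the helper name `predicted_wpc` and the chosen inputs are just for illustration.

```python
intercept, coef = 0.0609, 0.8770  # values reported in the regression table above

def predicted_wpc(pyth):
    """Win percentage implied by the fitted line wpc = Intercept + coef x pyth."""
    return intercept + coef * pyth

# A team that scores exactly as many runs as it concedes has pyth = 0.5.
print(round(predicted_wpc(0.5), 4))                       # 0.4994
# Raising pyth by 0.1 raises the predicted win percentage by 0.1 * coef.
print(round(predicted_wpc(0.6) - predicted_wpc(0.5), 4))  # 0.0877
```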

(i) The standard error (std err) gives us an idea of the precision of the estimate. The ratio of the coefficient (coef) to the standard error is called the t statistic (t), and its value informs us about statistical significance. This is summarized by the p-value (P > |t|): the probability of observing a coefficient at least as far from zero as 0.8770 by chance if the true value were really zero. This probability is shown here as 0.000 (it is not exactly zero, but the table doesn’t include enough decimal places to show this), which means we can be confident the coefficient is not zero. By convention, we conclude that we cannot be confident that the coefficient differs from zero if the p-value is greater than 0.05.

(ii) In the top right-hand corner of the table is the R-squared. This statistic tells you the share of variation in the y-variable (wpc) which can be accounted for by the variation in the x-variable (pyth). R-squared can be read as a percentage: here the Pythagorean Expectation can account for 89.4% of the variation in win percentage.
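The quantities in notes (i) and (ii) can be reproduced by hand for a one-variable regression. This is a minimal NumPy sketch on five invented teams, not the article's data, and it stops at the t statistic (turning t into a p-value needs the t distribution, which is left out here).

```python
import numpy as np

# Made-up values for five teams, for illustration only.
pyth = np.array([0.40, 0.45, 0.50, 0.55, 0.60])
wpc = np.array([0.42, 0.44, 0.51, 0.54, 0.61])

n = len(pyth)
x_dev = pyth - pyth.mean()
y_dev = wpc - wpc.mean()

coef = (x_dev * y_dev).sum() / (x_dev ** 2).sum()   # slope of the fitted line
intercept = wpc.mean() - coef * pyth.mean()

residuals = wpc - (intercept + coef * pyth)
sigma2 = (residuals ** 2).sum() / (n - 2)           # residual variance (two estimated parameters)
std_err = np.sqrt(sigma2 / (x_dev ** 2).sum())      # standard error of the slope
t_stat = coef / std_err                             # the "t" column: coef / std err
r_squared = 1 - (residuals ** 2).sum() / (y_dev ** 2).sum()

print(coef, intercept, std_err, t_stat, r_squared)
```

A small standard error relative to the coefficient gives a large t statistic, and an R-squared near 1 means the fitted line leaves little variation unexplained.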

Pythagorean Expectation and National Basketball Association (NBA)

In the case of basketball, we have a dataset with a very different layout. An important difference from the MLB example is that here each game appears in two rows, one for each team: once from the home team’s perspective and once from the away team’s. In that sense, we have twice as many rows as we have games. We therefore don’t need to create two separate Data Frames for home and away teams, because that split has already been done for us.

The data consists of games played in the 2018 season and here is a list of columns/features/variables that we have in the Data Frame:

The game result is the column labeled ‘WL’. We create a variable which has a value of ‘1’ if the team won, and zero if it lost. To calculate the Pythagorean Expectation we then need only the result, points scored (PTS) and points conceded (PTSAGN).

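NBAR18['result'] = np.where(NBAR18['WL']== 'W',1,0)
# Select several columns with a list (double brackets); tuple-style selection fails in recent pandas
NBAteams18 = NBAR18.groupby('TEAM_NAME')[['result','PTS','PTSAGN']].sum().reset_index()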

Since every team plays 82 games in an NBA season, we can calculate the win percentage and Pythagorean Expectation for each team (n=30) in the following way:

NBAteams18['wpc'] = NBAteams18['result']/82
NBAteams18['pyth'] = NBAteams18['PTS']**2/(NBAteams18['PTS']**2 + NBAteams18['PTSAGN']**2)
Image by Author
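Dividing by 82 assumes every team has exactly 82 rows in the data. As a hedged alternative, the sketch below counts games per team directly, so the same code works if the data is incomplete. The three-game mini-league is hypothetical, and the column names `W` and `G` are illustrative choices rather than names from the original dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-league: three games, each appearing once per team
# (two rows per game), mirroring the layout of the NBA data.
games = pd.DataFrame({
    "TEAM_NAME": ["A", "B", "A", "B", "A", "B"],
    "WL":        ["W", "L", "L", "W", "W", "L"],
    "PTS":       [110, 100, 95, 105, 120, 100],
    "PTSAGN":    [100, 110, 105, 95, 100, 120],
})
games["result"] = np.where(games["WL"] == "W", 1, 0)

# Named aggregation: wins, games played, points for and against per team.
teams = games.groupby("TEAM_NAME").agg(
    W=("result", "sum"),
    G=("result", "count"),
    PTS=("PTS", "sum"),
    PTSAGN=("PTSAGN", "sum"),
).reset_index()

# Win percentage from the observed game count instead of a hard-coded 82.
teams["wpc"] = teams["W"] / teams["G"]
teams["pyth"] = teams["PTS"] ** 2 / (teams["PTS"] ** 2 + teams["PTSAGN"] ** 2)
print(teams)
```

In a two-team league the two Pythagorean Expectations sum to one, since each team’s points scored are the other’s points conceded. That makes a quick sanity check on the aggregation.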

Turning to the statistical analysis, we first create a scatterplot in Seaborn to see what the relationship looks like. As depicted, it looks very similar to the baseball example.

Image by Author

We can again fit a regression equation to estimate how much the win percentage increases for each unit increase in the Pythagorean Expectation, this time for basketball.

Image by Author

The results summary above shows a very large t statistic and a p-value of 0.000, which means the relationship is highly statistically significant. The R-squared (coefficient of determination) is close to 100%, which means that almost all of the variation in the dependent variable (wpc) is explained by variation in the independent variable (pyth).

Pythagorean Expectation and Indian Premier League (IPL)

In our last example for this article, we’ll look into an example from cricket’s most high-profile competition known as the IPL. We will be using data from the matches played in the 2018 IPL season and the dataset has the following columns:

Image by Author

First we identify when the home team is the winning team, and when the visiting team is the winner. Next we identify the runs scored by the home team and the away team (note: unlike baseball, where there are nine innings for each team, in T20 cricket each team gets only one inning, and once the first team completes its inning, the opposing team has its inning). Finally, we include a counter which we can add up to give the total number of games for each team.

IPL18['hwin']= np.where(IPL18['home_team']==IPL18['winning_team'],1,0)
IPL18['awin']= np.where(IPL18['away_team']==IPL18['winning_team'],1,0)
IPL18['htruns']= np.where(IPL18['home_team']==IPL18['inn1team'],IPL18['innings1'],IPL18['innings2'])
IPL18['atruns']= np.where(IPL18['away_team']==IPL18['inn1team'],IPL18['innings1'],IPL18['innings2'])
IPL18['count']=1
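To see what the np.where calls do, here is a small sketch on two made-up matches that uses the same column names as the IPL18 Data Frame. The teams and scores are hypothetical. The key idea is that runs are recorded per innings, so we need to know which side batted first to assign them to the home or away team.

```python
import numpy as np
import pandas as pd

# Two hypothetical matches; column names mirror the IPL18 frame.
matches = pd.DataFrame({
    "home_team":    ["X", "Y"],
    "away_team":    ["Y", "X"],
    "inn1team":     ["Y", "Y"],    # team that batted first
    "innings1":     [160, 180],    # runs in the first innings
    "innings2":     [161, 150],    # runs in the second innings
    "winning_team": ["X", "Y"],
})

# If the home team batted first, its runs are innings1, otherwise innings2;
# the away team gets whichever innings is left.
home_batted_first = matches["home_team"] == matches["inn1team"]
matches["htruns"] = np.where(home_batted_first, matches["innings1"], matches["innings2"])
matches["atruns"] = np.where(home_batted_first, matches["innings2"], matches["innings1"])

matches["hwin"] = np.where(matches["home_team"] == matches["winning_team"], 1, 0)
matches["awin"] = np.where(matches["away_team"] == matches["winning_team"], 1, 0)

# Sanity checks: no runs are lost, and every match has exactly one winner.
assert (matches["htruns"] + matches["atruns"]
        == matches["innings1"] + matches["innings2"]).all()
assert (matches["hwin"] + matches["awin"] == 1).all()
print(matches[["home_team", "away_team", "htruns", "atruns", "hwin", "awin"]])
```

The second check assumes every match has a winner. In real data, tied or abandoned matches would fail it, so it is worth running before aggregating.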

One thing to note here is that there are only 60 rows (matches) in the IPL18 Data Frame. The amount of data we have for the cricket example is therefore significantly smaller than in the basketball and baseball examples we covered earlier, and this could be a potential issue, as we’ll find out later.

Similar to how we did in the MLB example, we’ll have to create two separate Data Frames for home and away teams in IPL’s case, too. We use the same .groupby command to aggregate the performance of home and away teams during the 2018 season and merge these two Data Frames to get a combined Data Frame that shows the performances of each of the eight IPL teams:

# Column lists use double brackets; tuple-style selection fails in recent pandas
IPLhome = IPL18.groupby('home_team')[['count','hwin', 'htruns','atruns']].sum().reset_index()
IPLhome = IPLhome.rename(columns={'home_team':'team','count':'Ph','htruns':'htrunsh','atruns':'atrunsh'})
IPLaway = IPL18.groupby('away_team')[['count','awin', 'htruns','atruns']].sum().reset_index()
IPLaway = IPLaway.rename(columns={'away_team':'team','count':'Pa','htruns':'htrunsa','atruns':'atrunsa'})
# Note: this replaces the match-level IPL18 frame with the team-level one
IPL18 = pd.merge(IPLhome, IPLaway, on = ['team'])

That gives us the basic data we need to compute the season totals for each team: total wins (home wins plus away wins), games played (home plus away), runs scored (as home team plus as away team), and runs against (when playing at home plus when playing away):

IPL18['W'] = IPL18['hwin']+IPL18['awin']
IPL18['G'] = IPL18['Ph']+IPL18['Pa']
IPL18['R'] = IPL18['htrunsh']+IPL18['atrunsa']
IPL18['RA'] = IPL18['atrunsh']+IPL18['htrunsa']
Image by Author
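Because the home/away bookkeeping is easy to get wrong, it can help to run the whole aggregate-and-merge step on a tiny dataset and check a few invariants. The sketch below does this for a hypothetical three-match league using the same column names as above. The `how="outer"` argument and the `fillna(0)` call are assumptions added here so that a team with no home (or no away) matches would not silently drop out of the merge.

```python
import pandas as pd

# Hypothetical three-match league; column names mirror the IPL18 frame.
matches = pd.DataFrame({
    "home_team": ["X", "Y", "Z"],
    "away_team": ["Y", "Z", "X"],
    "hwin":      [1, 0, 1],
    "awin":      [0, 1, 0],
    "htruns":    [161, 150, 170],
    "atruns":    [160, 155, 140],
    "count":     [1, 1, 1],
})

home = (matches.groupby("home_team")[["count", "hwin", "htruns", "atruns"]]
        .sum().reset_index()
        .rename(columns={"home_team": "team", "count": "Ph",
                         "htruns": "htrunsh", "atruns": "atrunsh"}))
away = (matches.groupby("away_team")[["count", "awin", "htruns", "atruns"]]
        .sum().reset_index()
        .rename(columns={"away_team": "team", "count": "Pa",
                         "htruns": "htrunsa", "atruns": "atrunsa"}))

# Outer merge keeps teams that appear on only one side; missing totals become 0.
season = pd.merge(home, away, on="team", how="outer").fillna(0)

season["W"] = season["hwin"] + season["awin"]
season["G"] = season["Ph"] + season["Pa"]
season["R"] = season["htrunsh"] + season["atrunsa"]    # own runs at home + own runs away
season["RA"] = season["atrunsh"] + season["htrunsa"]   # conceded at home + conceded away

# Invariants: one winner per match, two teams per match,
# and every run scored is also a run conceded by someone.
assert season["W"].sum() == len(matches)
assert season["G"].sum() == 2 * len(matches)
assert season["R"].sum() == season["RA"].sum()
print(season[["team", "W", "G", "R", "RA"]])
```

If any of the three assertions fails on real data, the usual suspects are swapped home/away columns in the R and RA sums, or matches without a recorded winner.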

The win percentage, which is the wins divided by the number of games played, and the Pythagorean expectation, which is runs scored squared divided by the sum of runs scored squared and runs against squared can now be easily calculated:

IPL18['wpc'] = IPL18['W']/IPL18['G']
IPL18['pyth'] = IPL18['R']**2/(IPL18['R']**2 + IPL18['RA']**2)

Having prepared the data, we are now ready to examine the relationship between the dependent and the independent variable using the scatterplot.

Image by Author

We can see that there is only a very weak correlation between win percentage and the Pythagorean Expectation. Firstly, because we’ve only got eight teams, we have far fewer dots, and it’s harder to discern any relationship with so few observations. Secondly, the dots are scattered all over the plot rather than neatly organized from left to right in an upward-sloping pattern as in the previous two examples.

Image by Author

This is further confirmed when we fit a linear regression model to this relationship. This time, while the coefficient on pyth is positive, implying that a higher Pythagorean Expectation goes with a larger win percentage, the standard error is also very large, and the t statistic of 1.353 implies a p-value of 0.225, well above the usual threshold of 0.05. This means that the coefficient estimate is not statistically distinguishable from zero, so we cannot conclude from this data that there is a relationship between Pythagorean Expectation and win percentage in the IPL example.
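The link between the t statistic and the p-value can be checked by hand. The sketch below computes a two-sided p-value from Student's t distribution using only the standard library. It assumes 6 degrees of freedom (8 teams minus the 2 estimated parameters, intercept and slope), which is the usual choice for this kind of regression but is not stated in the output above. The helper names `t_pdf` and `two_sided_p` are illustrative.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat: float, df: int, steps: int = 10_000) -> float:
    """Two-sided p-value: probability mass outside [-|t|, |t|].

    The area under the density on [0, |t|] is approximated with
    Simpson's rule (steps must be even) and doubled by symmetry.
    """
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    inside = 2 * (total * h / 3)
    return 1 - inside


# t statistic reported above; df = 8 teams - 2 parameters is an assumption.
p_value = two_sided_p(1.353, df=6)
print(f"p-value = {p_value:.3f}")
```

Under that assumption the result rounds to the 0.225 reported in the regression table. With so few degrees of freedom the t distribution has heavy tails, which is why a t statistic of 1.353 is far from significant here.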

There could be several reasons why the Pythagorean Expectation model didn’t produce a good fit for the IPL dataset. For starters, as established above, the data we had for the IPL was very limited: 60 matches and 8 teams, as opposed to some 2,300 matches and 30 teams in MLB. Random variations tend to be smoothed out in large samples, so in the much smaller IPL sample there is a far greater chance that random variation overwhelmed the Pythagorean relationship, even if the model is correct.
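The effect of sample size can be illustrated with a small simulation. The sketch below assumes a world in which win percentage really does equal the Pythagorean Expectation plus random noise, then measures how much the estimated correlation varies from season to season with 8 teams versus 30. Every number in it (the range of pyth, the noise level, the number of trials) is an arbitrary assumption chosen for illustration, not an estimate from the data.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_r(n_teams, rng, noise_sd=0.05):
    """One simulated season: wpc = pyth + noise, then the observed correlation."""
    pyth = [rng.uniform(0.4, 0.6) for _ in range(n_teams)]
    wpc = [p + rng.gauss(0, noise_sd) for p in pyth]
    return pearson_r(pyth, wpc)


def spread(n_teams, trials=2000, seed=0):
    """Standard deviation of the observed correlation across simulated seasons."""
    rng = random.Random(seed)
    rs = [simulate_r(n_teams, rng) for _ in range(trials)]
    mean = sum(rs) / trials
    return math.sqrt(sum((r - mean) ** 2 for r in rs) / (trials - 1))


spread_8 = spread(8)
spread_30 = spread(30)
print(f"std. dev. of observed correlation, 8 teams:  {spread_8:.3f}")
print(f"std. dev. of observed correlation, 30 teams: {spread_30:.3f}")
```

Even though the underlying relationship is identical in both cases, the correlation observed in an 8-team league swings much more widely from one simulated season to the next, so a single season can easily look like "no relationship".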

Another interpretation could be that there is some fundamental difference between cricket and sports like baseball which makes the Pythagorean model appropriate for one but not the other. For example, in cricket, the team batting second need only score one more run than the opponent to win, and its inning ends as soon as it reaches this target. If the team batting second is the winning team, then the gap in the scores will be small. However, if the team batting first can take all ten wickets cheaply, then the gap in scores could be very large. In our data the average run difference was 2 when the team batting second won, and 30 when the team batting first won. This asymmetry may explain why the Pythagorean Expectation is not a good guide to winning in the IPL.
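The margin comparison described above can be computed along the following lines. This is a sketch on four made-up matches, not the calculation behind the figures of 2 and 30. It assumes match-level data with the columns inn1team, innings1, innings2 and winning_team, which means working from the original match-level frame rather than the merged team-level one. The names `margin` and `winner_batted` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical match results; column names mirror the match-level IPL data.
matches = pd.DataFrame({
    "inn1team":     ["A", "B", "C", "A"],
    "winning_team": ["A", "C", "C", "B"],
    "innings1":     [180, 160, 200, 150],
    "innings2":     [150, 162, 160, 153],
})

# Absolute run difference between the two innings.
matches["margin"] = (matches["innings1"] - matches["innings2"]).abs()

# Did the winner bat first or second?
matches["winner_batted"] = np.where(
    matches["winning_team"] == matches["inn1team"], "first", "second"
)

avg_margin = matches.groupby("winner_batted")["margin"].mean()
print(avg_margin)
```

When the side batting second wins, the chase stops at the target, so the run margin understates how dominant that team was. That truncation is exactly the asymmetry that a runs-based Pythagorean Expectation cannot see.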

Perhaps we can look at the data innings-wise in a separate article and analyze games separately according to whether the winning team batted first or second. For now, this article concludes here. In the subsequent article, we’ll look into how the Pythagorean Expectation can be used as a predictor in the English Premier League (EPL).

References:

  1. Foundations of Sports Analytics: Data, Representation, and Models in Sports
  2. Baseball Reference
  3. Pythagorean Expectation
  4. Dataset from Retrosheet (License: https://www.retrosheet.org/notice.txt)