Free AI web copilot to create summaries, insights and extended knowledge, download it at here
7218
Abstract
mples. Alternative methods such as the method of <a href="https://www.jstor.org/stable/2345653">L-moments</a> exist and may give a better fit in some cases. Fortunately the package we will use to do the GEV fit has both MLE and L-moments methods.</p><h1 id="8a98">Time Series EVA on Daily Reported Crimes:</h1><p id="5af2">We will now apply block maxima EVA to predict the probability of monthly crime spikes in Vancouver BC and the expected period of months before a certain monthly reported crime count threshold is exceeded. This dataset is a good candidate for EVA because of the presence of an <a href="https://en.wikipedia.org/wiki/2011_Vancouver_Stanley_Cup_riot">extreme outlier event in 2011</a> which caused an all-time-high spike in the reported crime counts dataset. Interestingly enough this was not <a href="https://en.wikipedia.org/wiki/1994_Vancouver_Stanley_Cup_riot">the first hockey riot in Vancouver</a>.</p><figure id="201b"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*ET4eN0qonBJr6LYY"><figcaption>I’m more of a American handball fan.</figcaption></figure><p id="6884">So let’s go and fetch the Vancouver Crimes dataset from kaggle: <a href="https://www.kaggle.com/wosaku/crime-in-vancouver">https://www.kaggle.com/wosaku/crime-in-vancouver</a>. Props to Willan Osaku for the data wrangling as done in his kernel here: <a href="https://www.kaggle.com/wosaku/eda-of-crime-in-vancouver-2003-2017">https://www.kaggle.com/wosaku/eda-of-crime-in-vancouver-2003-2017</a>. The dataset contains reports of property crimes and crimes committed against people from Jan 1 2003 to Jul 15 2017. We will set the block maxima size to 30 days regardless of the number of days in calendar months. <b>Edit: for your convenience, all data sets (the original and the daily aggregate) are available here: <a href="https://github.com/hhl60492/blog">https://github.com/hhl60492/blog</a></b></p><p id="2a60">For this analysis we will be using a combination of python and R. Python for the data wrangling and R for the actual GEV fitting (at this point Python lacks the high quality EVA packages that R has). For the python script, basically we aggregate the reported number of crimes by day, plot it and dump that data series to a csv with the dates.</p>
<figure id="c61a">
<div>
<div>
<iframe class="gist-iframe" src="/gist/hhl60492/1a65ace756543ab631b11bd6cc6d82bb.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><figure id="ee6b"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*Cam_wBY0UPhVoKM65gppIg.png"><figcaption>We can clearly see the all-time-high spike due to the 2011 Hockey Riot. Also noticeable is the non-linear downward trend from the start of the series to the end. I challenge the reader to correlate some factors to this trend.</figcaption></figure><p id="0188">In the above daily crime reports time series, we have a median of 95 crimes per day, with min = 29 and max = 647, with stddev = 26.</p><p id="fcce">Next, we are going to download and install the package ‘<a href="https://cran.r-project.org/web/packages/extRemes/index.html">extRemes</a>’ package in RStudio. Then EVA is as straightforward as loading the package, aggregating the data and doing the necessary fits. However, interpreting the ‘return ’ of the GEV is where the meat and potatoes are and we’ll get to that in a bit.</p>
<figure id="883d">
<div>
<div>
<iframe class="gist-iframe" src="/gist/hhl60492/d08502086cb924085252bec69b40a08c.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="24da">Note the way the block maxima was aggregated was on a 30-day interval regardless of calendar months (aggregates on calendar months giving variable block maxima sizes does not really affect our fit). Now, the diagnostic plots given by</p><div id="dc7e"><pre><span class="hljs-function"><span class="hljs-title">plot</span><span class="hljs-params">(fit_mle)</span></span></pre></div><p id="1171">such as the Emperical vs Model Q-Q Plot gives an indication of how good the fit is (the more dots that fall on the line the better). We see evidence of an extreme outlier that was too extreme for even our GEV.</p><figure id="e781"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*DqXn6tGVBcujg8B5qHh2xA.png"><figcaption></figcaption></figure><p id="33b1">And below is what our fitted GEV actually looks like compared to the empirical block max sample distribution/histogram.</p><figure id="0230"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*lwg2twlcctZwVmngCN5IEQ.png"><figcaption>Our empirical distribution is slightly bi-modal, this explains why our GEV (a unimodal distribution) had trouble fitting the extreme outlier.</figcaption></figure><div id="4235"><pre><span class="hljs-function"><span class="hljs-title">fevd</span><span class="hljs-params">(x = as.vector(month_bm<span class="hljs-variable">$month_max</span>)</span></span>, type = <span class="hljs-string">"GEV"</span>, method = <span class="hljs-string">"Lmoments"</span>,
period<span class="hljs-selector-class">.basis</span> = <span class="hljs-string">"month"</span>)
<span class="hljs-selector-attr">[1]</span> <span class="hljs-string">"GEV Fitted to as.vector(month_bmmonth_max) using L-moments estimation."</span>
location scale shape
<span class="hljs-number">114.0563798</span> <span class="hljs-number">19.6206000</span> <span class="hljs-number">0.2034706</span></pre></div><p id="41e0">There really isn’t too much of a difference in fit between the MLE and L-moment GEV models, so let’s just pick the L-moment model and print out the return levels.</p><div id="8369"><pre><span class="hljs-function"><span class="hljs-title">fevd</span><span class="hljs-params">(x = as.vector(month_bm<span class="hljs-variable">month_max</span>)</span></span>, type = <span class="hljs-string">"GEV"</span>, method = <span class="hljs-string">"Lmoments"</span>,
period<span class="hljs-selector-class">.basis</span> = <span class="hljs-string">"month"</span>)
<span class="hljs-function"><span class="hljs-title">get</span><span class="hljs-params">(paste(<span class="hljs-string">"return.level.fevd."</span>, newcl, sep = <span class="hljs-string">""</span>)</span></span>)(x = x, return<span class="hljs-selector-class">.period</span> = return<span class="hljs-selector-class">.period</span>,
conf = <span class="hljs-number">0.05</span>)</pre></div><div id="4dd0"><pre>GEV model fitted <span class="hljs-keyword">to</span> <span class="hljs-keyword">as</span>.vector(month_bm$month_max)
Data are assumed <span class="hljs-keyword">to</span> be stationary
[<span class="hljs-number">1</span>] <span class="hljs-string">"Return Levels for period units in months"</span>
<span class="hljs-number">3</span>-<span class="hljs-built_in">month</span> level <span class="hljs-number">6</
Options
span>-<span class="hljs-built_in">month</span> level <span class="hljs-number">12</span>-<span class="hljs-built_in">month</span> level <span class="hljs-number">24</span>-<span class="hljs-built_in">month</span> level <span class="hljs-number">48</span>-<span class="hljs-built_in">month</span> level <span class="hljs-number">120</span>-<span class="hljs-built_in">month</span> level
<span class="hljs-number">133.4993</span> <span class="hljs-number">153.9620</span> <span class="hljs-number">176.1072</span> <span class="hljs-number">200.9296</span> <span class="hljs-number">229.1534</span> <span class="hljs-number">272.8342</span></pre></div><p id="b2aa">The results from “Return Levels for period units in months” on are the <i>return period </i>and corresponding <i>return level </i>on the line below it. The<b> return value</b> is defined <b>as a value that is expected to be equaled or exceeded on average once every return period. </b>The<b> return period <i>T</i> </b>of a particular event<b> is the inverse of the probability that the event will be exceeded in any given block maxima time unit. </b>Note<b> </b><i>T </i>is always a multiple of our base block maxima time unit (in our case 1 month = 30 days) and is related to the <i>threshold exceedance probability p </i>by <i>T = 1/p.</i></p><p id="76fe">So what this all means is that for the foreseeable future (<b><i>assuming the distribution does not change, i.e. stationarity; also one should refit and revise predictions after every block max period has elapsed</i></b>) from the first return period we can expect on average one instance of seeing more than 133.5 crimes reported per day in a 3-month period.</p><p id="027f">Conversely if we wanted to see how likely it is for one of next month’s daily crime reports to exceed 133.5, it would be 1/(3 months) or 33.3% (again, under strict stationarity assumptions). Looking at the largest return period, our interpretation would be that there is likely to be a day exceeding 272.8 crime reports in the next 120 months (10 years) or conversely the probability of seeing a day exceeding 272.8 crime reports in the next 30 days is 1/120 = 0.83%.</p><p id="6779">As an fun exercise let’s try to predict the probability of another hockey riot in Vancouver. The empirical data (with a terrible sample size of <i>n</i>=2) gives us 2 hockey riots in 17 years. Thus the yearly rate of hockey riots according to the empirical data is 2/17 or 11.76%. You can run a block max and GEV fit on hockey riot counts distributed across years or decades but fitting on only 2 empirical data points is terrible idea, only slightly better than participating in a hockey riot yourself. So let’s find a better way of estimating the probability of another hockey riot. We know that during hockey riots many crimes reports will get submitted to the police. Thus we can use the daily reported crime counts as a proxy for the riots. Now there are no data points going back to 1994 regarding how many crimes were reported on the day of the riot, but this <a href="https://www.cbc.ca/news/canada/a-tale-of-two-riots-1.1079520">CBC article</a> states that there were 150 arrests made in connection to the riot. So for that day of the riot in 1994, let’s assume a base crime report count of 96 (the median of our 2003–2017) and a conversion factor of 1.5 for arrests to reported crime numbers, as the rioters may have created more than one crime report per arrest or conversely some rioters got away and were not arrested after a crime report was created. So for the day of the 1994 riot we have perhaps 96 + 1.5 * 150 = 321 crimes reported. Then the average number of crimes reported between both riots is (647+321) / 2 = 484.</p><p id="ea86">Next, we will work backwards and modify the return period in our call to the return.level() function and use trial and error to find the period that gives us a return level close to 484 in our GEV fitted to the 2003 to 2017 dataset.</p><div id="3406"><pre><span class="hljs-attribute">return</span>.level(fit_lmom, conf = <span class="hljs-number">0</span>.<span class="hljs-number">05</span>, return.period= c(<span class="hljs-number">3</span>,<span class="hljs-number">6</span>,<span class="hljs-number">12</span>,<span class="hljs-number">24</span>,<span class="hljs-number">48</span>,<span class="hljs-number">120</span>*<span class="hljs-number">20</span>))</pre></div><p id="e901">Gives us return levels of</p><div id="8eb4"><pre><span class="hljs-attribute">3</span>-month level <span class="hljs-number">6</span>-month level <span class="hljs-number">12</span>-month level <span class="hljs-number">24</span>-month level <span class="hljs-number">48</span>-month level <span class="hljs-number">2400</span>-month level
<span class="hljs-attribute">133</span>.<span class="hljs-number">4993</span> <span class="hljs-number">153</span>.<span class="hljs-number">9620</span> <span class="hljs-number">176</span>.<span class="hljs-number">1072</span> <span class="hljs-number">200</span>.<span class="hljs-number">9296</span> <span class="hljs-number">229</span>.<span class="hljs-number">1534</span> <span class="hljs-number">487</span>.<span class="hljs-number">4839</span></pre></div><p id="d31c">Bingo. So the chance of one hockey riot in the next month based on this analysis is approximately 1/2400 or 0.042%. Assuming the GEV remained stationary and that hockey riots occur independently of one and another, we will have on average 1 hockey riot or other mass crime incident where the daily crime count exceed 484 reports in the next 200 years. <b>Disclaimer: this analysis is for educational purposes only. It is not meant to inform any decision making on part of any person currently residing in, moving into or out of Vancouver, affected businesses, the Municipal Government, Provincial Government, the VPD, nor the RCMP.</b></p><h1 id="f09e">Conclusion:</h1><p id="e118"><b>In practice, it is wise to refit the GEV and revise the updates after every single block maxima period has elapsed. </b>Moreso, standard time series practices such as de-trending and seasonality removal may help to give the GEV a better fit. Recently, non-stationary EVA methods have been developed, but the theoretical framework for non-stationary GEV models has not been well established. If you are interested, checkout <a href="http://amir.eng.uci.edu/neva.php">NEVA</a> (Matlab) and <a href="https://cran.r-project.org/web/packages/GEVcdn/index.html">GEV-CDN</a> (R). However in general the limitation of EVA must be acknowledged. The EVA distributions fit only what is observed from a univariate time series, and makes no assumptions regarding external factors that affect the system under study. Even in the case of non-stationary EVA, the non-stationarities are chosen a-priori to be some simple continuous functions and are still mostly agnostic to external factors. Good luck doing your own Extreme Value Analysis, and be sure to check off with your senior Statistician specializing in Extreme Value Theory before making any decision based on your predictions!</p></article></body>