Scraping the results of a 10K Run for fun!


Women’s Day Run — 10th March 2019, Bangalore.

Recently I took part in a 10K run held for Women’s Day. When I got my race time and rank after the run, I was curious to see how others had performed and, most importantly, what the chip times were for the people who finished ahead of me. I decided to put my rusty scripting skills to use. I started poking around the results website and found that looking up a result was a simple GET request with the BIB number as a parameter.

wget -O - http://resultsite.com/2019nebwdr/?bibNo=11019\&submit=SUBMIT

The next step was to identify the range of BIB numbers. After a few brute-force trials I figured out that the BIB numbers for the 10K started at 10000, that around 620 people had run the race, and that some more had registered but didn’t turn up. So I wrote a small bash script to fetch the results for BIB numbers 10000 to 11500. The initial script, which fetched each result in a for loop, looked like this:

for i in {10000..11500}
do
  wget -O -  http://resultsite.com/2019nebwdr/?bibNo="$i"\&submit=SUBMIT 
done
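The same loop can also be written as a pair of small functions, which keeps the URL in one place. This is only a sketch: the function names `result_url` and `fetch_range` are made up for illustration, and the one-second pause between requests is my own assumption, not something the original script did.

```shell
#!/bin/sh
# Base URL as used in the post.
BASE_URL="http://resultsite.com/2019nebwdr/"

# Print the result-page URL for one bib number (hypothetical helper).
result_url() {
  printf '%s?bibNo=%s&submit=SUBMIT\n' "$BASE_URL" "$1"
}

# Fetch every bib number from $1 to $2 (inclusive) and write the pages to stdout.
# The sleep is an assumption added here to go easy on the server.
fetch_range() (
  i=$1
  while [ "$i" -le "$2" ]; do
    wget -q -O - "$(result_url "$i")"
    sleep 1
    i=$((i + 1))
  done
)
```

Calling `fetch_range 10000 11500` would then walk the same range as the loop above. Quoting the whole URL means the `&` no longer needs a backslash.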

Quite a few bib numbers had no results because the runner didn’t participate, so I had to filter those out.

My initial approach was to download every result page into a folder, with files named bib-number.html, and then parse each one for the useful data. But after looking carefully at the returned HTML, I saw that a simple grep was enough to pull out the data I wanted from each result page and to skip the pages without results at the same time. In other words, with a simple wget and grep I got the table row HTML with the needed data for each BIB number. Then, rather than writing the output to separate files, I decided to append the result rows to a single text file. The code snippet became:

for i in {10000..11500}
do
  wget -O - http://resultsite.com/2019nebwdr/?bibNo="$i"\&submit=SUBMIT | grep "10 KM" >> out.txt
done

And yes, I got lucky that the string “10 KM” appeared only in the result row; otherwise the parsing would have been more complex.
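To see the filter in isolation, here is a small sketch that runs it over two made-up page fragments. The sample HTML is hypothetical (I am not reproducing the real pages), and `keep_result_rows` and `sample_pages` are names invented for this example.

```shell
#!/bin/sh
# Keep only lines containing the literal marker "10 KM".
# -F treats the pattern as a fixed string rather than a regular expression.
keep_result_rows() {
  grep -F '10 KM'
}

# Hypothetical page fragments: one without a result, one with a result row.
sample_pages() {
  printf '%s\n' \
    '<p>No result found for this bib number</p>' \
    '<tbody><tr><td>10001</td><td>Runner A</td><td>10 KM</td></tr></tbody></table>'
}

# Only the line with the result row survives the filter.
sample_pages | keep_result_rows
```

The trick only works because the marker never shows up on the "no result" pages; if it did, the filter would need a more specific pattern.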

Voila, now I had a text file with one result row per line. All that remained was to parse the HTML into a CSV file, which I could do with awk in a jiffy. The CSV file can be opened in Microsoft Excel or Google Sheets. To make it easy to share, I uploaded it to a Google spreadsheet and shared it with all my friends who took part in the run. Below is the final script and the link to the results spreadsheet.

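#!/bin/bash
# Brace expansion such as {10000..11500} needs bash, not plain sh.
for i in {10000..11500}
do
  # Fetch one result page, keep the row containing "10 KM", and turn its cells into CSV fields.
  wget -O - http://resultsite.com/2019nebwdr/?bibNo="$i"\&submit=SUBMIT | grep "10 KM" | awk -F'</td><td>' -v OFS="," '{print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13}' | sed -E 's/<tbody><tr><td>|<\/td><\/tr><\/tbody><\/table>//g' >> race_results_10k.csv
done

http://chakrapani.me/womensday-run/results_excel.html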
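The awk and sed step is easier to follow on a single row. Below is a sketch of a variant that does not hard-code thirteen columns: assigning `$1 = $1` makes awk rebuild the line with the output separator, however many cells there are. The sample row and the name `row_to_csv` are hypothetical, and the real rows may carry different cells.

```shell
#!/bin/sh
# Convert one HTML table row per input line into a CSV line.
# awk splits on "</td><td>" and rejoins the fields with commas;
# sed then strips the leading and trailing wrapper tags.
row_to_csv() {
  awk -F'</td><td>' -v OFS=',' '{ $1 = $1; print }' |
    sed -E 's/<tbody><tr><td>|<\/td><\/tr><\/tbody><\/table>//g'
}

# Hypothetical row in the shape the post describes.
printf '%s\n' \
  '<tbody><tr><td>10001</td><td>Runner A</td><td>10 KM</td><td>00:55:10</td></tr></tbody></table>' |
  row_to_csv
```

One caveat with either version: a cell that itself contains a comma would end up split across two CSV columns, so this is fine for quick personal use but not a general HTML-to-CSV converter.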

While writing this post, it hit me that there was no need to parse the output at all if I only wanted to view the results as a table in a browser. Since each line is already a proper table row, all I need to do is add the table headers to the file and save it as an HTML file. Voila!

Here is the simple 4-line script that did the job. It could be optimized further, but I just needed to get the data quickly, and it did that.

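#!/bin/bash
# Brace expansion such as {10000..11500} needs bash, not plain sh.
for i in {10000..11500}
do
  # Keep the result row and drop the tbody and table wrappers so only <tr>...</tr> remains.
  wget -O - http://resultsite.com/2019nebwdr/?bibNo="$i"\&submit=SUBMIT | grep "10 KM" | sed -E 's/<tbody>|<\/tbody>|<\/table>//g' >> output.html
done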

I used sed to remove the unwanted tbody and table tags from the data. Once I had the HTML file with the results from the above script, I added the basic HTML and table tags at the top and bottom of the file to make it a proper HTML file. Here’s the link to the final results: http://chakrapani.me/womensday-run/10k-results.html

Preview of the results table.
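I added the wrapper tags by hand, but that last step could be scripted too. Here is a minimal sketch, assuming the rows sit in a file with one `<tr>...</tr>` per line. The function name `wrap_rows_as_html` is made up, and the column headings are passed in as arguments because they depend on what the results site returns.

```shell
#!/bin/sh
# Print a complete HTML document around the table rows stored in file $1.
# The remaining arguments become the column headings.
wrap_rows_as_html() (
  rows_file=$1
  shift
  printf '<!DOCTYPE html>\n<html>\n<body>\n<table border="1">\n<tr>'
  for heading in "$@"; do
    printf '<th>%s</th>' "$heading"
  done
  printf '</tr>\n'
  cat "$rows_file"
  printf '</table>\n</body>\n</html>\n'
)
```

For example, `wrap_rows_as_html output.html "Bib" "Name" "Chip Time" > 10k-results.html` would produce a page a browser can open directly. The three headings here are placeholders, not the real column names.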

I’d love to hear your thoughts on how to accomplish the same task more efficiently. Cheers!!