Jason Chong

Summary

This article is a guide to web scraping in R, focusing on extracting and analyzing data about NBA players from ESPN's website.

Abstract

The article offers a practical introduction to web scraping using R, detailing the process of collecting data on NBA players from the ESPN website. It emphasizes the utility of web scraping in extracting structured data from unconventional sources, such as web pages. The author demonstrates the use of the rvest package to harvest player information, including names, positions, and statistics, and then integrates this data into a usable format for analysis. The guide also touches on the importance of CSS selectors in identifying relevant data, which can be facilitated by tools like SelectorGadget, and illustrates how to automate the scraping process for all NBA teams. The article concludes with an example of exploratory data analysis conducted on the scraped data, providing insights into player statistics and salaries, and encourages readers to explore further by accessing the complete code on the author's GitHub repository.

Opinions

  • The author suggests that web scraping is a valuable skill for gaining competitive insights, such as analyzing customer reviews, ticket prices, and job postings.
  • Consistency in the structure of web pages is highlighted as a critical factor for successful automated web scraping across multiple pages.
  • The use of CSS selectors is presented as an essential technique for web scraping, with SelectorGadget being recommended as a tool to simplify this process.
  • The author expresses a preference for metric units over imperial units by converting player height and weight measurements.
  • The article implies that web scraping, when combined with data visualization and exploratory data analysis, can lead to meaningful insights, such as correlations between player performance and salary.
  • The author promotes Medium membership, suggesting that it supports content creators and encourages the production of high-quality articles.
  • The author provides additional reading suggestions, indicating the value of continuous learning in SQL commands and machine learning explainability for data analysts and scientists.

Practical Introduction to Web Scraping in R — NBA Players

Understanding the basics of web scraping, when and why it is useful, and how to implement it in R

Photo by Olivier Collet on Unsplash

Data does not always come in a neat tabular format that we can readily use to analyze and derive insights from. Sometimes we may find ourselves in situations where we need to extract data from alternative sources, for example, the internet.

Some companies even use web scraping to gain an advantage over their competitors by accessing data that most people overlook. When used right, web scraping enables us to obtain data from any website and transform it into a usable form in order to supplement an analysis or project.

Web scraping is the method of extracting data from websites in an automated way. It saves the trouble of having to manually download or copy data but instead automates the entire process from start to finish.

Some interesting use cases of web scraping include analyzing customer or product reviews, getting real-time ticket prices for flights or hotels, or even aggregating open job postings.

There are many existing packages in common programming languages that are built for the purpose of parsing HTML documents. For instance, Beautiful Soup in Python is widely used for web scraping.

However, for today’s exercise, we will be using rvest, a package in R used for harvesting data from web pages. Furthermore, we will learn how to apply web scraping techniques to obtain information about current players in the National Basketball Association (NBA) from the ESPN website.

Player profile

An important thing to keep in mind when performing web scraping is to check for consistency across the web pages that you intend to scrape. Specifically, we can only automate scraping across multiple pages if the data is structured in the same way on each page.

For example, let’s take a closer look at the roster page for the Boston Celtics and the New York Knicks.

Team roster for the Boston Celtics.
Team roster for the New York Knicks.

As you can see, both teams have their roster set up in a table format which contains information about each player’s name, position, age, height, weight, college, and annual salary. In addition, you might also notice the end of the URL changes according to the name of the team: ../bos/boston-celtics for the Boston Celtics and ../ny/new-york-knicks for the New York Knicks. This will come in handy later.
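To make the URL pattern concrete, here is a minimal sketch of a helper that assembles a roster URL from a team's abbreviation and slug. Note that `build_roster_url` is a name made up for this illustration, and `base_url` is only a placeholder for the part of the address abbreviated as `..` above, so substitute the real prefix before using it.

```r
# Build roster URLs from a team abbreviation and a team slug.
# `base_url` is a placeholder (an assumption), not the real ESPN prefix.
build_roster_url <- function(abbr, slug, base_url = "https://example.com/roster") {
  paste(base_url, abbr, slug, sep = "/")
}

teams <- data.frame(
  abbr = c("bos", "ny"),
  slug = c("boston-celtics", "new-york-knicks"),
  stringsAsFactors = FALSE
)

# paste() is vectorised, so one call builds a URL for every team
roster_urls <- build_roster_url(teams$abbr, teams$slug)
print(roster_urls)
```

Because only the last two path segments change, adding another team is just a matter of adding a row to `teams`.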

Before we look at how to scrape data for each NBA team, let’s only focus on one team for now. Once we succeed, we can easily write a for loop to repeat the whole process for the other teams.

Traditionally, web scraping calls for some basic knowledge of HTML and CSS, which are the two fundamental technologies for building and designing web pages.

HTML is used to create content and provide structure to a web page. CSS, on the other hand, is used for design, layout, and styling to make the web page appear more slick and presentable.

We are going to cheat slightly around this process via a Chrome extension called SelectorGadget, which you can add to your browser and immediately start using. It generates the CSS selector of an element on a web page when you simply highlight the element that you are interested in.
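As a rough illustration of what a CSS selector does once you have one, the sketch below runs rvest against a tiny hand-written page instead of the live site. The class names `player-name` and `player-pos` are invented for this example; the real selectors are whatever SelectorGadget reports for the page you are scraping.

```r
library(rvest)

# A tiny stand-in page; the class names are assumptions made for this example.
page <- minimal_html('
  <table>
    <tr>
      <td class="player-name"><a href="/player/a">Player A</a></td>
      <td class="player-pos">SF</td>
    </tr>
    <tr>
      <td class="player-name"><a href="/player/b">Player B</a></td>
      <td class="player-pos">PG</td>
    </tr>
  </table>
')

# ".player-name a" matches every <a> nested inside an element with class "player-name"
player_names <- page %>% html_elements(".player-name a") %>% html_text2()

# ".player-pos" matches every element with class "player-pos"
positions <- page %>% html_elements(".player-pos") %>% html_text2()

print(player_names)
print(positions)
```

On a real page, `minimal_html()` would be replaced by `read_html()` pointed at the page URL, and the selector strings by the ones SelectorGadget generates.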

Here, we are going to highlight the elements that contain each player’s name, position, age, height, weight, college, and salary for the Boston Celtics and subsequently store them in their respective variables. Then, we will collate them together to create a data frame.
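The steps just described can be sketched roughly as follows. This is not the exact code from the workbook: the page is a hand-written stand-in, the class names are invented, and `grab` is a hypothetical helper. The player values are placeholders rather than real roster data.

```r
library(rvest)

# Stand-in for a roster page; in practice this would come from read_html() on the roster URL.
# All class names and values below are assumptions made for illustration.
page <- minimal_html('
  <table>
    <tr>
      <td class="name">Player A</td><td class="pos">SF</td><td class="age">24</td>
      <td class="ht">6\' 8"</td><td class="wt">210 lbs</td>
      <td class="college">College X</td><td class="salary">$1,000,000</td>
    </tr>
    <tr>
      <td class="name">Player B</td><td class="pos">PG</td><td class="age">31</td>
      <td class="ht">6\' 3"</td><td class="wt">200 lbs</td>
      <td class="college">College Y</td><td class="salary">$2,000,000</td>
    </tr>
  </table>
')

# Hypothetical helper: return the text of every element matching a CSS selector
grab <- function(page, css) {
  page %>% html_elements(css) %>% html_text2()
}

# Store each field in its own column, then collate everything into one data frame
roster <- data.frame(
  name     = grab(page, ".name"),
  position = grab(page, ".pos"),
  age      = as.integer(grab(page, ".age")),
  height   = grab(page, ".ht"),
  weight   = grab(page, ".wt"),
  college  = grab(page, ".college"),
  salary   = grab(page, ".salary"),
  stringsAsFactors = FALSE
)

print(roster)
```

Since the roster is laid out as an HTML table, rvest's `html_table()` may be a shorter route to the same data frame, though selecting fields one by one gives more control over what is kept.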

Player regular season statistics

Let’s take our scraping one step further, shall we?

You may have noticed that each player’s name is a hyperlink. Clicking it brings you to a separate web page containing that player’s performance in the most recent NBA regular season.

For example, let’s look at Jayson Tatum and Derrick Rose.

Jayson Tatum averaged 26.9 points, 8.0 rebounds, 4.4 assists per game, and had an efficiency rating of 21.87 in the 2021–22 NBA season.
Derrick Rose averaged 12.0 points, 3.0 rebounds, 4.0 assists per game, and had an efficiency rating of 16.93 in the 2021–22 NBA season.

As part of this exercise, suppose we would like to combine this information with our original data frame from earlier on.

To do this, we will have to write a function that fetches the season statistics for each player on the team roster.
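Here is one way such a function might look. Treat it as a sketch under assumptions: `get_player_stats`, the `.stat-value` selector, the `player-link` class, and the link paths are all invented for illustration. The `read_page` argument exists only so the function can be tried on canned HTML instead of the live site, and the numbers in the canned pages are the ones quoted above for Jayson Tatum and Derrick Rose.

```r
library(rvest)

# Fetch season statistics for one player page.
# `stat_css` is a hypothetical selector; `read_page` defaults to read_html().
get_player_stats <- function(url, read_page = read_html, stat_css = ".stat-value") {
  values <- read_page(url) %>%
    html_elements(stat_css) %>%
    html_text2() %>%
    as.numeric()
  data.frame(
    url = url,
    pts = values[1], reb = values[2], ast = values[3], eff = values[4],
    stringsAsFactors = FALSE
  )
}

# Canned player pages keyed by link, standing in for the real site
fake_pages <- list(
  "/player/tatum" = '<div class="stat-value">26.9</div><div class="stat-value">8.0</div>
                     <div class="stat-value">4.4</div><div class="stat-value">21.87</div>',
  "/player/rose"  = '<div class="stat-value">12.0</div><div class="stat-value">3.0</div>
                     <div class="stat-value">4.0</div><div class="stat-value">16.93</div>'
)
fake_reader <- function(url) minimal_html(fake_pages[[url]])

# A stand-in page holding the player links; html_attr("href") pulls out each link target
links_page <- minimal_html('
  <a class="player-link" href="/player/tatum">Jayson Tatum</a>
  <a class="player-link" href="/player/rose">Derrick Rose</a>
')
player_urls <- links_page %>% html_elements("a.player-link") %>% html_attr("href")

# Apply the function to every link and stack the one-row results
stats <- do.call(rbind, lapply(player_urls, get_player_stats, read_page = fake_reader))
print(stats)
```

The resulting data frame can then be combined with the roster data frame, for example with `merge()` on a shared key such as the player link.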

Automate for all NBA teams

Now that we understand how the process works for the Boston Celtics, we can repeat it for all 30 NBA teams.

We simply need to use a for loop that changes the end of the URL for each team. The whole process took my computer close to 9 minutes to complete. Not bad.
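A loop along these lines could look like the sketch below. The names `scrape_all_teams` and `scrape_team` are hypothetical: `scrape_team` stands in for whatever function scrapes a single roster URL. The pause between requests and the error handling are my own additions for illustration, not something the timing above depends on.

```r
# Loop over team URLs, scrape each one, and stack the results.
# `scrape_team` is any function that takes a URL and returns a data frame.
scrape_all_teams <- function(urls, scrape_team, delay = 1) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    # If one team fails, warn and move on instead of losing the whole run
    out <- tryCatch(
      scrape_team(urls[i]),
      error = function(e) {
        warning("Skipping ", urls[i], ": ", conditionMessage(e))
        NULL
      }
    )
    if (!is.null(out)) results[[i]] <- out
    Sys.sleep(delay)  # short pause between requests
  }
  # Drop the teams that failed, then bind the rest into one data frame
  do.call(rbind, Filter(Negate(is.null), results))
}

# Demo with a fake scraper so the sketch runs without touching any website
fake_scrape <- function(url) {
  if (url == "bad-url") stop("page not found")
  data.frame(team_url = url, stringsAsFactors = FALSE)
}

all_teams <- suppressWarnings(
  scrape_all_teams(c("bos/boston-celtics", "bad-url", "ny/new-york-knicks"),
                   fake_scrape, delay = 0)
)
print(all_teams)
```

Pre-allocating `results` and binding once at the end avoids growing a data frame inside the loop, which gets slow as the number of rows increases.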

After some further data cleaning, such as converting player height and weight to centimetres and kilograms respectively (I’m not the biggest fan of the imperial system), fixing data types, and renaming values, we get a final data frame that looks like the following.
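The unit conversions can be sketched as below. The input formats are assumptions on my part: heights written as feet and inches (such as `6' 8"`), weights as a number followed by `lbs`, and salaries as dollar amounts with commas. The function names are made up for this illustration, so adjust the patterns if the scraped strings look different.

```r
# Convert a height like 6' 8" to centimetres (1 inch = 2.54 cm)
height_to_cm <- function(x) {
  # Pull out the numbers: feet first, inches second
  parts <- regmatches(x, gregexpr("[0-9]+", x))
  vapply(parts, function(p) {
    p <- as.numeric(p)
    if (length(p) < 2) return(NA_real_)  # unexpected format
    (p[1] * 12 + p[2]) * 2.54
  }, numeric(1))
}

# Convert a weight like "210 lbs" to kilograms (1 lb = 0.45359237 kg)
weight_to_kg <- function(x) {
  as.numeric(gsub("[^0-9.]", "", x)) * 0.45359237
}

# Convert a salary like "$1,000,000" to a plain number
salary_to_numeric <- function(x) {
  as.numeric(gsub("[$,]", "", x))
}

print(height_to_cm(c("6' 8\"", "6' 3\"")))
print(weight_to_kg("210 lbs"))
print(salary_to_numeric("$1,000,000"))
```

For example, `6' 8"` is 80 inches, which comes to 203.2 cm.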

Bonus: exploratory data analysis

This section of the blog post is more for my own entertainment than it is about web scraping. Effectively, I have used some basic data visualizations to gather insights from the data we have prepared about current NBA players.

While it is not the most groundbreaking analysis, I have nevertheless included my comments in the caption for each of the charts below.

Point guards are the shortest players and centers are the tallest players.
Point guards weigh the least and centers weigh the most.
Average points scored per game is highly positively correlated with salary. Height is negatively correlated with average rebounds per game.
The higher the average points scored per game, the higher the player’s salary.
Kentucky and Duke are by far the most popular colleges for NBA prospects.
The combined annual salary for the top 10 highest-paid players is around $438 million. That’s insane.
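For readers who want to reproduce this kind of summary, here is a minimal base R sketch of the calculations behind the captions. The data frame is a toy example with invented values, not real player data, and the column names are assumptions about how the cleaned data might be laid out.

```r
# Toy data with made-up values, for illustration only (not real player data)
players <- data.frame(
  position  = c("PG", "PG", "C", "C", "SF"),
  height_cm = c(185, 191, 211, 213, 203),
  pts       = c(20, 10, 15, 5, 25),
  salary    = c(30, 10, 20, 5, 40),  # in millions of dollars
  stringsAsFactors = FALSE
)

# Average height by position
avg_height <- aggregate(height_cm ~ position, data = players, FUN = mean)

# Correlation between points per game and salary
pts_salary_cor <- cor(players$pts, players$salary)

# Combined salary of the top n highest-paid players (n = 2 for this toy data)
top_n <- 2
top_salary_total <- sum(head(sort(players$salary, decreasing = TRUE), top_n))

print(avg_height)
print(pts_salary_cor)
print(top_salary_total)
```

With the full data set, setting `top_n` to 10 would give the combined salary of the ten highest-paid players.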

Thank you for reading, and I hope that you took away some new knowledge about the basics of web scraping in R. I also encourage you to take a look at the workbook for this exercise on my GitHub here. It contains all the code for this project from start to end.

