Practical Introduction to Web Scraping in R — NBA Players
Understanding the basics of web scraping, when and why it is useful, and how to implement it in R
Data does not always come in a neat tabular format that we can readily analyze and derive insights from. Sometimes we need to extract data from alternative sources, for example, the internet.
Some companies even use web scraping to gain an advantage over their competitors by accessing data that most people overlook. When used right, web scraping enables us to obtain data from any website and transform it into a usable form in order to supplement an analysis or project.
Web scraping is the automated extraction of data from websites. Instead of manually downloading or copying data, we automate the entire process from start to finish.
Some interesting use cases of web scraping include analyzing customer or product reviews, getting real-time ticket prices for flights or hotels, or even aggregating open job postings.
There are many existing packages in common programming languages that are built for the purpose of parsing HTML documents. For instance, Beautiful Soup in Python is widely used for web scraping.
However, for today’s exercise, we will be using rvest, a package in R used for harvesting data from web pages. Specifically, we will learn how to apply web scraping techniques to obtain information about current players in the National Basketball Association (NBA) from the ESPN website.
Player profile
An important thing to keep in mind when performing web scraping is to check for consistency across the web pages that you intend to scrape. Specifically, we can only automate scraping across multiple pages if the data is structured in the same pattern on each of them.
For example, let’s take a closer look at the roster page for the Boston Celtics and the New York Knicks.
As you can see, both teams have their roster set up in a table format that contains each player’s name, position, age, height, weight, college, and annual salary. In addition, you might also notice that the end of the URL changes according to the name of the team: ../bos/boston-celtics for the Boston Celtics and ../ny/new-york-knicks for the New York Knicks. This will come in handy later.
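To make the URL pattern concrete, here is a minimal sketch of how the team-specific endings could be assembled in R. The base_url value is a hypothetical placeholder (the start of the roster URL is not spelled out above), and the teams data frame is just an illustrative way to hold the two endings we observed.

```r
# Hypothetical placeholder: the start of the roster URL is elided in the
# article, so substitute the real prefix before using this.
base_url <- "<roster-url-prefix>"

# Abbreviation and slug pairs taken from the two URL endings shown above.
teams <- data.frame(
  abbr = c("bos", "ny"),
  slug = c("boston-celtics", "new-york-knicks"),
  stringsAsFactors = FALSE
)

# Join the pieces with "/" to get one roster URL per team.
teams$url <- paste(base_url, teams$abbr, teams$slug, sep = "/")

print(teams$url)
```

Because only the last two path segments change, adding another team would simply mean adding another row to teams.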
Before we look at how to scrape data for each NBA team, let’s only focus on one team for now. Once we succeed, we can easily write a for loop to repeat the whole process for the other teams.
Traditionally, web scraping calls for some basic knowledge of HTML and CSS, which are the two fundamental technologies for building and designing web pages.
HTML is used to create content and provide structure to a web page. CSS, on the other hand, is used for design, layout, and styling to make the web page appear more slick and presentable.
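To see how the two fit together from a scraper’s point of view, consider this small sketch. It parses a hand-written HTML snippet with rvest and uses a CSS selector to pick out elements. The snippet, the class names, and the player names are all made up for illustration and do not come from the ESPN page.

```r
library(rvest)

# A tiny hand-written page: HTML supplies the structure (table rows and cells),
# while the class attributes are the hooks that CSS rules attach to.
page <- read_html('
  <html><body>
    <table>
      <tr><td class="name">Player A</td><td class="pos">G</td></tr>
      <tr><td class="name">Player B</td><td class="pos">F</td></tr>
    </table>
  </body></html>
')

# The CSS selector "td.name" matches every <td> whose class is "name".
player_names <- html_text(html_elements(page, "td.name"))

print(player_names)
```

The same selector syntax that a stylesheet uses to style an element is what we hand to html_elements() to extract it.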
We are going to cheat slightly by using a Chrome extension called SelectorGadget, which you can add to your browser and immediately start using. It generates the CSS selector of an element on a web page when you simply highlight the element that you are interested in.
Here, we are going to highlight the elements that contain each player’s name, position, age, height, weight, college, and salary for the Boston Celtics and subsequently store them in their respective variables. Then, we will collate them together to create a data frame.
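As a rough sketch of that workflow, the snippet below runs against a stand-in page rather than the live site. The selectors (.name, .pos, .age), the grab() helper, and the rows themselves are assumptions for illustration; on the real page you would use whatever selectors SelectorGadget reports, and repeat the step for the remaining fields.

```r
library(rvest)

# Stand-in roster page with made-up rows; the class names are assumptions,
# not the selectors SelectorGadget would report for the real site.
page <- read_html('
  <table>
    <tr><td class="name">Player A</td><td class="pos">G</td><td class="age">25</td></tr>
    <tr><td class="name">Player B</td><td class="pos">F</td><td class="age">31</td></tr>
  </table>
')

# Hypothetical helper: return the text of every element matching a selector.
grab <- function(selector) html_text(html_elements(page, selector))

# Store each field in its own variable.
name <- grab(".name")
position <- grab(".pos")
age <- as.integer(grab(".age"))

# Collate the vectors into a single data frame, one row per player.
roster <- data.frame(name, position, age, stringsAsFactors = FALSE)

print(roster)
```

This only works if every selector returns the same number of elements, so it is worth checking the lengths of the vectors before combining them.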