Dr. Roi Yehoshua

Summary

The article presents a series of advanced Pandas exercises using a dataset of US flight delays and cancellations from 2015, aimed at enhancing data science skills.

Abstract

The web content introduces a collection of master-level questions specifically designed to challenge and improve the data analysis skills of individuals working with Pandas in Python. These questions are based on a comprehensive dataset detailing flight delays and cancellations in the US during 2015, which includes 5.8 million flights across 31 columns of information. The exercises progress in difficulty, starting with straightforward queries such as counting flights from LAX in July 2015, and advancing to more complex tasks like identifying airlines with high cancellation rates and plotting the number of flights per airline. The article also provides a link to the solutions and encourages readers to engage with previous sets of master-level questions in NumPy, Data Science, and Deep Learning.

Opinions

  • The author believes that the provided questions will effectively test and enhance the reader's proficiency in data analysis with Pandas.
  • There is an emphasis on the practical application of data science skills, as evidenced by the use of a real-world dataset from Kaggle.
  • The article suggests that mastering Pandas is a critical step in becoming adept at data science, as shown by the range and depth of the questions presented.
  • By offering solutions and linking to related content, the author shows a commitment to comprehensive learning and continuous improvement in the field of data science.

Master-Level Questions in Pandas

Photo by Matthew Smith on Unsplash

The following questions deal with the flight delays and cancellations data set from Kaggle: https://www.kaggle.com/datasets/usdot/flight-delays

The file flights.csv contains 5.8 million rows with data on flights that took place in the US in the year 2015. It has 31 columns with detailed information on each flight, such as the date of the flight, the airline identifier, origin and destination airports, and whether the flight was cancelled or diverted.

Here are the first five rows from this data set (showing only the first 9 columns):

Download the flights.csv file and use Pandas to answer the following questions. The questions are organized from easy to hard.
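As a starting point, the preview described above (first five rows, first nine columns) can be reproduced with a short sketch like the one below. The helper names `preview_csv` and `load_flights` are hypothetical and used only for illustration. The sketch assumes flights.csv sits in the working directory. Passing `low_memory=False` is an assumption intended to avoid mixed-type warnings on a file this large, at the cost of more memory.

```python
import pandas as pd


def preview_csv(path, n_rows=5, n_cols=9):
    """Return the first n_rows rows and first n_cols columns of a CSV file.

    Only n_rows rows are parsed, so this stays fast even on a very large file.
    """
    head = pd.read_csv(path, nrows=n_rows)
    # Select columns by position so no column names need to be assumed.
    return head.iloc[:, :n_cols]


def load_flights(path="flights.csv"):
    """Load the whole file into a DataFrame (hypothetical helper).

    low_memory=False makes pandas infer each column's dtype from the
    full column rather than chunk by chunk.
    """
    return pd.read_csv(path, low_memory=False)
```

Calling `preview_csv("flights.csv")` should give a table comparable to the preview, and `load_flights()` returns the full DataFrame to use for the questions below.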

  1. Find how many flights departed from LAX airport in July 2015.
  2. Find the number of the flight that had the longest arrival delay.
  3. Find the airport with the highest number of arriving flights.
  4. Find the day of week that had the highest number of flight cancellations.
  5. Create a bar plot showing the number of flights for each airline.
  6. Find the mean arrival delay for each airline.
  7. Find the airlines that had more than 10,000 cancellations.
  8. Find airlines having more than 2% of their flights cancelled. For each such airline, print its identifier and the percentage of cancelled flights.
  9. Find the top three airlines with the highest number of cancelled or diverted flights.
  10. Find the longest sequence of on-time flights for each airline (an on-time flight is a flight with an arrival delay of less than 15 minutes).

The solutions can be found in this post.

Don’t forget to check out my previous master-level questions in NumPy, Data Science, and Deep Learning.
