Summary

The article discusses the use of SQL ranking functions (ROW_NUMBER, DENSE_RANK, and RANK) to identify top-selling products in an Amazon interview question scenario.

Abstract

The article, titled "The One SQL Function You Need to Ace the Amazon Interview Question: Highest-Grossing Items," provides insight into the application of SQL window functions for solving a common interview problem. It explains the difference between ROW_NUMBER, DENSE_RANK, and RANK functions and demonstrates how to use them to find the top two highest-grossing products within each category for the year 2022. The author illustrates the use of these functions with examples and provides a solution using the ROW_NUMBER function, later optimizing it with a Common Table Expression (CTE) for better readability and performance. The article emphasizes the importance of understanding these SQL functions for aspiring data engineers and concludes by offering a free guide to help readers improve their SQL skills.

Opinions

The author believes that practicing SQL, particularly window functions, is crucial for success in data engineering interviews, such as those at Amazon.
The article suggests that the ROW_NUMBER function is particularly useful for the task of identifying top sellers quickly and easily.
The author values the use of CTEs for enhancing the readability and performance of SQL queries.
There is an opinion that learning SQL through practice on platforms like Datalemur is essential for developing problem-solving skills in the context of data engineering.
The author implies that mastering SQL ranking functions can give candidates an edge in interviews by demonstrating their ability to handle complex data analysis tasks.

The One SQL Function You Need to Ace the Amazon Interview Question: Highest-Grossing Items

How to Use ROW_NUMBER, DENSE_RANK, and RANK to Identify Top Sellers Quickly and Easily

Besides working on personal projects, a good idea is to practice programming skills and develop problem-solving skills through coding games.

Learning SQL, Python, and algorithms is essential for anyone who wants to pursue a career in Data engineering. There are a considerable amount of platforms that help practise and work on programming skills.

In this article, I will present the three most used Windows functions (ROW_NUMBER, DENSE_RANK, and RANK) and how I used one of them to solve the Amazon highest-grossing items question.

I used the Datalemur platform to practice SQL interview questions. I chose a medium-difficulty question for the company Amazon.

The problem was :

“Assume you’re given a table containing data on Amazon customers and their spending on products in different category, write a query to identify the top two highest-grossing products within each category in the year 2022. The output should include the category, product, and total spend.”

This problem is relatively simple, and the part that could be problematic is to find a way to get the top two highest per category based on the spend. It made me think immediately about ranking functions in SQL, and for me, it was the key to this problem.

There are three ranking functions in SQL: rank(), dense_rank(), and row_number().

Knowing which one to use came from the understanding of each of them:

row_number(): The row_number() function always generates a unique ranking for every row, even if they have the same values;

SELECT id, name, salary, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rank
FROM employees;

+----+-----------+---------+----+
| id | name      | salary | rank |
+----+-----------+---------+----+
|  1 | Alice     | 1000    |  1  |
|  2 | Bob       | 1000    |  1  |
|  3 | Charlie   | 900     |  2  |
|  4 | David     | 800     |  3  |
+----+-----------+---------+----+

Rank (): The rank() function will generate a ranking but assign the same rank to rows with the same values. The rank() function will create a gap. And the next different rank will not have the immediate following number.

SELECT id, name, salary, rank() OVER (PARTITION BY department ORDER BY salary DESC) AS rank
FROM employees;

+----+-----------+---------+----+
| id | name      | salary | rank |
+----+-----------+---------+----+
|  1 | Alice     | 1000    |  1  |
|  2 | Bob       | 1000    |  1  |
|  3 | Charlie   | 900     |  3  |
|  4 | David     | 800     |  4  |
+----+-----------+---------+----+

dense_rank(): The dense_rank() is similar to rank() but will create a contiguous sequence of ranks. Given two rows ranking the same, the next ranking value will be increment by 1.

SELECT id, name, salary, dense_rank() OVER (PARTITION BY department ORDER BY salary DESC) AS rank
FROM employees;

+----+-----------+---------+----+
| id | name      | salary | rank |
+----+-----------+---------+----+
|  1 | Alice     | 1000    |  1  |
|  2 | Bob       | 1000    |  1  |
|  3 | Charlie   | 900     |  2  |
|  4 | David     | 800     |  4  |
+----+-----------+---------+----+

I used in this exercise the row_number() function. Here is my solution :

SELECT category, product, total_spend
FROM
(SELECT 
    p_s.category as category,
    p_s.product as product,
    SUM(p_s.spend) as total_spend,
    row_number() over (PARTITION BY category order by SUM(p_s.spend) desc) row_number
  FROM product_spend as p_s
  WHERE
    EXTRACT(YEAR FROM p_s.transaction_date)= '2022'
  GROUP BY category, product) subquery
WHERE row_number<=2
ORDER BY category, row_number;

This script can be improved by using CTE for more readability and performance.

WITH subquery AS (
  SELECT
    p_s.category AS category,
    p_s.product AS product,
    SUM(p_s.spend) AS total_spend,
    row_number() OVER (PARTITION BY category ORDER BY SUM(p_s.spend) DESC) AS row_number
  FROM product_spend AS p_s
  WHERE
    EXTRACT(YEAR FROM p_s.transaction_date) = '2022'
  GROUP BY category, product
)

SELECT
  category,
  product,
  total_spend
FROM subquery
WHERE row_number <= 2
ORDER BY category, row_number;

Conclusion

Row_number, Dense_rank and rank are powerful SQL functions that can be used to solve various problems, including the Amazon interview question on highest-grossing items.

Here are some tips :

Use the PARTITION BY clause to group the rows in the dataset by category.
Use the ORDER BY clause to order the rows in each category by sales in descending order.
Use the ROW_NUMBER function to rank each row in each category.
Filter the results only to include the top two highest rankings.

I hope this article has been helpful.

Go from SELECT * to interview-worthy project. Get our free 5-page guide.