avatarYuen

Summary

The provided content is a detailed guide on how to solve 50 SQL interview questions from LeetCode, offering insights into the thought process and multiple approaches to problem-solving.

Abstract

The text serves as a comprehensive walkthrough for tackling SQL challenges presented in LeetCode's "Top SQL 50" study plan. It is structured around specific questions, each accompanied by a step-by-step explanation and SQL code examples. The author shares personal experiences and thought patterns when approaching these problems, emphasizing the importance of understanding the underlying logic and the use of SQL functions such as JOIN, COUNT, SUM, and CASE statements. The guide also highlights common SQL concepts such as aggregate functions, window functions, and subqueries, providing a valuable resource for both beginners and experienced SQL users looking to enhance their problem-solving skills in database querying.

Opinions

  • The author believes in the importance of a strong foundational understanding of SQL concepts to tackle interview questions effectively.
  • There is an emphasis on the utility of various SQL functions and clauses, such as JOINs for combining tables and aggregate functions for summarizing data.
  • The author suggests that there can be multiple correct approaches to a problem, demonstrating this with alternative solutions for some questions.
  • The personal anecdotes and reflections on the learning process imply that persistence and practice are key to mastering SQL.
  • The guide conveys the idea that SQL problems can often be solved through a combination of logical reasoning and familiarity with SQL syntax and capabilities.
  • The author encourages the use of Common Table Expressions (CTEs) and window functions as powerful tools for complex queries, indicating a preference for these modern SQL features.

LeetCode SQL 50 Problems — Step by Step Solution

Hi friends, in this article we’ll go through the 50 SQL interview questions on LeetCode in a step-by-step process.

Hey guys, it’s been a while I hope you’ve missed me :D

How is everyone’s winter so far? For me it’s trying my best to survive Toronto’s brutal winter (yet to come lolol) in my SuperPuff hehe :)

Living in Toronto for the past 5 months has been a lonely experience for me — starting from 0 with no family/friends here, feeling like I’m living with a mini-NYC lifestyle again = =, and taking forever to finally find a job that is non-Data Analysis related…

It took me quite some time to figure things out & accept my current situation (the reality blablabla). I’m still trying, but I have to say, I’m absolutely in a much better mental state now. There is definitely a difference between moving to a new country when you’re 17 and when you’re 28, OK I GET IT NOW.

I guess y’all already have enough of my venting, let’s get into the juicy stuff now — shall we?

So yeah, as the title itself says, let’s take some time and go through the 50 SQL interview questions on LeetCode.

1. Recyclable and Low Fat Products (SELECT — Easy):

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| product_id  | int     |
| low_fats    | enum    |
| recyclable  | enum    |
+-------------+---------+
product_id is the primary key (column with unique values) for this table.
low_fats is an ENUM (category) of type ('Y', 'N') where 'Y' means this product is low fat and 'N' means it is not.
recyclable is an ENUM (category) of types ('Y', 'N') where 'Y' means this product is recyclable and 'N' means it is not.

Write a solution to find the ids of products that are both low fat and recyclable.

Return the result table in any order.

The question asks us to find the ids of products that are both low fat and recyclable, so we can set the criterias in the WHERE clause, our sql statement would look like this:

select product_id
from Products
where low_fats = 'Y'
and recyclable = 'Y'

2. Find Customer Referee (SELECT — Easy):

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| id          | int     |
| name        | varchar |
| referee_id  | int     |
+-------------+---------+
In SQL, id is the primary key column for this table.
Each row of this table indicates the id of a customer, their name, and the id of the customer who referred them.

Find the names of the customer that are not referred by the customer with id = 2.

Return the result table in any order.

The question asks us to find only the name of the customer that are not referred by customer with id = 2. Again, we can set up a simple criteria in the WHERE clause, but we also need to considers those customers with no referees at all — the result sql looks like this:

select name
from Customer
where referee_id <> 2
or referee_id is null

3. Big Countries (SELECT — Easy):

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| name        | varchar |
| continent   | varchar |
| area        | int     |
| population  | int     |
| gdp         | bigint  |
+-------------+---------+
name is the primary key (column with unique values) for this table.
Each row of this table gives information about the name of a country, the continent to which it belongs, its area, the population, and its GDP value.

A country is big if:
1. it has an area of at least three million (i.e., 3000000 km2), or
2. it has a population of at least twenty-five million (i.e., 25000000).
Write a solution to find the name, population, and area of the big countries.

Return the result table in any order.

The question asks for big countries only, and return three columns — name, population, and area. A country is classified as “big” if its area ≥ 3000000 or its population ≥ 25000000. We can set up these 2 criterias in the WHERE clause, result sql looks like this:

select name, population, area
from World
where area >= 3000000
or population >= 25000000

4. Article Views I (SELECT — Easy):

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| article_id    | int     |
| author_id     | int     |
| viewer_id     | int     |
| view_date     | date    |
+---------------+---------+
There is no primary key (column with unique values) for this table, the table may have duplicate rows.
Each row of this table indicates that some viewer viewed an article (written by some author) on some date. 
Note that equal author_id and viewer_id indicate the same person.

The question asks us to list only the author_ids who views at least one of their own articles. As stated, if author_id = viewer_id, which means the author has been viewing his own work. We can set up a criteria in the WHERE clause, and select only the DISTINCT author_ids — the result sql looks like this:

select distinct author_id as id
from Views
where author_id = viewer_id
order by 1

5. Invalid Tweets (SELECT — Easy):

+----------------+---------+
| Column Name    | Type    |
+----------------+---------+
| tweet_id       | int     |
| content        | varchar |
+----------------+---------+
tweet_id is the primary key (column with unique values) for this table.
This table contains all the tweets in a social media app.

Write a solution to find the IDs of the invalid tweets. 
The tweet is invalid if the number of characters used in the content of the tweet is strictly greater than 15.
Return the result table in any order.

The question wants us to return a list of invalid tweet_ids — and a tweet is invalid if the content length is > 15 characters. We can set it up in the WHERE clause, here’s the result sql:

select tweet_id
from Tweets
where length(content) > 15

6. Replace Employee ID with the Unique Identifier (Basic Joins — Easy):

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| id            | int     |
| name          | varchar |
+---------------+---------+
id is the primary key (column with unique values) for this table.
Each row of this table contains the id and the name of an employee in a company.

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| id            | int     |
| unique_id     | int     |
+---------------+---------+
(id, unique_id) is the primary key (combination of columns with unique values) for this table.
Each row of this table contains the id and the corresponding unique id of an employee in the company.



Write a solution to show the unique ID of each user, If a user does not have a unique ID replace just show null.

Return the result table in any order

The question wants us to find a unique id for each employee, but if there isn’t any, just mark it as a null value. We can use the colesce() function in the SELECT clause, and join the 2 tables on the common column id. The result sql looks like this:

select coalesce(unique_id, null) as unique_id,
       name
from Employees a
left join EmployeeUNI b
on a.id = b.id

7. Product Sales Analysis I (Basic Joins — Easy):

+-------------+-------+
| Column Name | Type  |
+-------------+-------+
| sale_id     | int   |
| product_id  | int   |
| year        | int   |
| quantity    | int   |
| price       | int   |
+-------------+-------+
(sale_id, year) is the primary key (combination of columns with unique values) of this table.
product_id is a foreign key (reference column) to Product table.
Each row of this table shows a sale on the product product_id in a certain year.
Note that the price is per unit.

+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| product_id   | int     |
| product_name | varchar |
+--------------+---------+
product_id is the primary key (column with unique values) of this table.
Each row of this table indicates the product name of each product.



Write a solution to report the product_name, year, and price for each sale_id in the Sales table.

Return the resulting table in any order.

Output: 
+--------------+-------+-------+
| product_name | year  | price |
+--------------+-------+-------+
| Nokia        | 2008  | 5000  |
| Nokia        | 2009  | 5000  |
| Apple        | 2011  | 9000  |
+--------------+-------+-------+

We have 2 tables in this question — Sales and Product. The question wants us to report the product_name, year, and price for each sale_id in the Sales table — so it’s obvious that the Sales table would be our reference table. In order to report the product_name column from the Product table, we need to join the 2 tables together with a LEFT JOIN. Here’s the result sql:

select product_name,
       year,
       price
from Sales s
left join Product p
on s.product_id = p.product_id

8. Customer Who Visited But Did Not Make Any Transactions (Basic Joins — Easy):

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| visit_id    | int     |
| customer_id | int     |
+-------------+---------+
visit_id is the column with unique values for this table.
This table contains information about the customers who visited the mall.

+----------------+---------+
| Column Name    | Type    |
+----------------+---------+
| transaction_id | int     |
| visit_id       | int     |
| amount         | int     |
+----------------+---------+
transaction_id is column with unique values for this table.
This table contains information about the transactions made during the visit_id.



Write a solution to find the IDs of the users who visited without making any transactions 
and the number of times they made these types of visits.
Return the result table sorted in any order.


Output:
+-------------+----------------+
| customer_id | count_no_trans |
+-------------+----------------+
| 54          | 2              |
| 30          | 1              |
| 96          | 1              |
+-------------+----------------+

The question wants us to find out a list of customers who visit the mall but didn’t make any purchases, and count such visits. OKKKK, I’ve attempted this question a few times, and everytime I come up with a different solution, idk what’s going on w/ my brain but I’ll share all my solutions below:

  1. After reading through the problem, the first thought in my head is — alright let’s filter the Visits table to include only those customers who didn’t make a purchase — which means that in the Transactions table, their visit_id should not be recorded/do not exist. This can be achieved in a simple subquery in the WHERE clause. Then I save the results in a CTE table named no_trans. Finally we’re free to select the customer_id column from the no_trans table & count the number of their visits with the count() function and GROUP BY statement. Here’s the final sql:
with no_trans as (
    select *
    from Visits
    where visit_id not in (
        select visit_id
        from Transactions
    )
)

select distinct customer_id,
       count(visit_id) as count_no_trans
from no_trans
group by 1

2. Another way to approach this problem is to join the two tables together with a LEFT JOIN — using the Visits table as the reference table. This will select all the records in the Visits table, along with the common records from the Transactions table. Since we need to find out which customers didn’t make any transactions, we can do a simple filtering in the WHERE clause — “transaction_id” is null. And finally we can count such visits for each customer with a GROUP BY statement. Here’s the final sql:

select v.customer_id, 
       count(*) as count_no_trans
from Visits v
left join Transactions t
on v.visit_id = t.visit_id
where transaction_id is null
group by 1

3. Alert — this is probably the most unnecessary solution so feel free to skip this one :) Similar to the second solution, we use a LEFT JOIN to join the 2 tables together, but this time we wrap the joined results as a subquery in the FROM clause, and we use the case statement to count the number of no-transaction visits. And finally, we filter the counting results to exclude the customers who did make any purchases (count_no_trans=0). Here’s the final sql:

select *
from
    (select customer_id, 
        sum(case when transaction_id is null then 1 else 0 end) as count_no_trans
    from Visits a
    left join Transactions b
    on a.visit_id = b.visit_id
    group by 1) sub
where sub.count_no_trans > 0

9. Rising Temperature (Basic Joins — Easy):

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| id            | int     |
| recordDate    | date    |
| temperature   | int     |
+---------------+---------+
id is the column with unique values for this table.
This table contains information about the temperature on a certain day.


Write a solution to find all dates'Id with higher temperatures compared to its previous dates (yesterday).

Return the result table in any order.

Output: 
+----+
| id |
+----+
| 2  |
| 4  |
+----+

The question asks us to compare the temperature between 2 consecutive days, and find out which days have higher temperature than their previous days — and return only the ids of the dates. We can first do a Self Join to join all the records by itself, then set a filter in the WHERE clause to include only the records that shows consecutive days and the second day’s temperature is higher than the first day’s. We can use the datediff() function to filter for consecutive days. Result sql looks like below:

select a.id
from Weather a, Weather b
where datediff(a.recordDate, b.recordDate) = 1
and a.temperature > b.temperature

10. Average Time of Process Per Machine (Basic Joins — Easy):

+----------------+---------+
| Column Name    | Type    |
+----------------+---------+
| machine_id     | int     |
| process_id     | int     |
| activity_type  | enum    |
| timestamp      | float   |
+----------------+---------+
The table shows the user activities for a factory website.
(machine_id, process_id, activity_type) is the primary key (combination of columns with unique values) of this table.
machine_id is the ID of a machine.
process_id is the ID of a process running on the machine with ID machine_id.
activity_type is an ENUM (category) of type ('start', 'end').
timestamp is a float representing the current time in seconds.
'start' means the machine starts the process at the given timestamp and 'end' means the machine ends the process at the given timestamp.
The 'start' timestamp will always be before the 'end' timestamp for every (machine_id, process_id) pair.




There is a factory website that has several machines each running the same number of processes. Write a solution to find the average time each machine takes to complete a process.

The time to complete a process is the 'end' timestamp minus the 'start' timestamp. The average time is calculated by the total time to complete every process on the machine divided by the number of processes that were run.

The resulting table should have the machine_id along with the average time as processing_time, which should be rounded to 3 decimal places.

Return the result table in any order.

This question is asking for the average processing time of each machine — we need to sum up the processing time for each individual process, and divide the number of processes.

When I read through this problem, I realized it’s a SELF-JOIN kind of question. In this case, we need to pivot the table horizontally in order to separate the “start” and “end” activity_type. As you can see in the final SQL below, we join the table by itself using the 2 common keys — machine_id and process_id.

Then we can find out the the processing time for each individual process and sum them up with the sum() function.

Next, we can use count() to figure out the number of processes for each machine. And finally divide the 2 values to calculate the average processing time for each machine — need to pair it with the GROUP BY statement of course, since it’s an aggregate function.

In order to select only 2 columns as our result set table (machine_id and average processing time), we can wrap the whole SELF-JOIN statement in the FROM clause. It can also be done through a CTE/subquery, whatever works for you.

The final SQL looks like this:

select a.machine_id,
       round(sum(b.timestamp - a.timestamp) / count(a.process_id),3) as processing_time
from Activity a
join
Activity b
on a.machine_id = b.machine_id
and a.process_id = b.process_id
where a.activity_type = "start"
and b.activity_type = "end"
group by 1

11. Employee Bonus (Basic Joins — Easy):

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| empId       | int     |
| name        | varchar |
| supervisor  | int     |
| salary      | int     |
+-------------+---------+
empId is the column with unique values for this table.
Each row of this table indicates the name and the ID of an employee in addition to their salary and the id of their manager.

+-------------+------+
| Column Name | Type |
+-------------+------+
| empId       | int  |
| bonus       | int  |
+-------------+------+
empId is the column of unique values for this table.
empId is a foreign key (reference column) to empId from the Employee table.
Each row of this table contains the id of an employee and their respective bonus.


Write a solution to report the name and bonus amount of each employee with a bonus less than 1000.

Return the result table in any order.




Input: 
Employee table:
+-------+--------+------------+--------+
| empId | name   | supervisor | salary |
+-------+--------+------------+--------+
| 3     | Brad   | null       | 4000   |
| 1     | John   | 3          | 1000   |
| 2     | Dan    | 3          | 2000   |
| 4     | Thomas | 3          | 4000   |
+-------+--------+------------+--------+
Bonus table:
+-------+-------+
| empId | bonus |
+-------+-------+
| 2     | 500   |
| 4     | 2000  |
+-------+-------+
Output: 
+------+-------+
| name | bonus |
+------+-------+
| Brad | null  |
| John | null  |
| Dan  | 500   |
+------+-------+

The question is looking for the bonus salary for each employee, and wants us to display “null” if an employee has no bonus. Since we need to find the bonus for all the employees, we need to left join the Employee table with the Bonus table — on the common key empId. Then we set up two filters to filter out bonus ≥ 1000 and null bonus values. Here’s the final SQL:

select name, coalesce(bonus, null) as bonus
from Employee e
left join Bonus b
on e.empId = b.empId
where bonus < 1000
or bonus is null

12. Students and Examinations (Basic Joins — Easy):

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| student_id    | int     |
| student_name  | varchar |
+---------------+---------+
student_id is the primary key (column with unique values) for this table.
Each row of this table contains the ID and the name of one student in the school.

+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| subject_name | varchar |
+--------------+---------+
subject_name is the primary key (column with unique values) for this table.
Each row of this table contains the name of one subject in the school.

+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| student_id   | int     |
| subject_name | varchar |
+--------------+---------+
There is no primary key (column with unique values) for this table. It may contain duplicates.
Each student from the Students table takes every course from the Subjects table.
Each row of this table indicates that a student with ID student_id attended the exam of subject_name.

Write a solution to find the number of times each student attended each exam.

Return the result table ordered by student_id and subject_name.

The result format is in the following example.


Input: 
Students table:
+------------+--------------+
| student_id | student_name |
+------------+--------------+
| 1          | Alice        |
| 2          | Bob          |
| 13         | John         |
| 6          | Alex         |
+------------+--------------+
Subjects table:
+--------------+
| subject_name |
+--------------+
| Math         |
| Physics      |
| Programming  |
+--------------+
Examinations table:
+------------+--------------+
| student_id | subject_name |
+------------+--------------+
| 1          | Math         |
| 1          | Physics      |
| 1          | Programming  |
| 2          | Programming  |
| 1          | Physics      |
| 1          | Math         |
| 13         | Math         |
| 13         | Programming  |
| 13         | Physics      |
| 2          | Math         |
| 1          | Math         |
+------------+--------------+
Output: 
+------------+--------------+--------------+----------------+
| student_id | student_name | subject_name | attended_exams |
+------------+--------------+--------------+----------------+
| 1          | Alice        | Math         | 3              |
| 1          | Alice        | Physics      | 2              |
| 1          | Alice        | Programming  | 1              |
| 2          | Bob          | Math         | 1              |
| 2          | Bob          | Physics      | 0              |
| 2          | Bob          | Programming  | 1              |
| 6          | Alex         | Math         | 0              |
| 6          | Alex         | Physics      | 0              |
| 6          | Alex         | Programming  | 0              |
| 13         | John         | Math         | 1              |
| 13         | John         | Physics      | 1              |
| 13         | John         | Programming  | 1              |
+------------+--------------+--------------+----------------+

In this question, we need to find out the number of times students attend each exams — also considering the possibilities that some students may not attend an exam for one/more subjects. I feel like the most straightforward way is to CROSS JOIN the students and subjects table first, and then perform a LEFT JOIN between the joined results and examinations table to do the exam counts.

One thing to notice is that we need to count the number of times each subject appears in the examinations table after we join everything together, since some students might not take an exam for specific subject(s). The final SQL is below:

select stu.student_id, 
       stu.student_name,
       sub.subject_name,
       count(exam.subject_name) as attended_exams
from students stu
cross join subjects sub
left join examinations exam
on stu.student_id = exam.student_id
and sub.subject_name = exam.subject_name
group by 1,2,3
order by 1,2

13. Managers with at Least 5 Direct Reports (Basic Joins — Medium):

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| id          | int     |
| name        | varchar |
| department  | varchar |
| managerId   | int     |
+-------------+---------+
id is the primary key (column with unique values) for this table.
Each row of this table indicates the name of an employee, their department, and the id of their manager.
If managerId is null, then the employee does not have a manager.
No employee will be the manager of themself.


Write a solution to find managers with at least five direct reports.

Return the result table in any order.


Input: 
Employee table:
+-----+-------+------------+-----------+
| id  | name  | department | managerId |
+-----+-------+------------+-----------+
| 101 | John  | A          | null      |
| 102 | Dan   | A          | 101       |
| 103 | James | A          | 101       |
| 104 | Amy   | A          | 101       |
| 105 | Anne  | A          | 101       |
| 106 | Ron   | B          | 101       |
+-----+-------+------------+-----------+
Output: 
+------+
| name |
+------+
| John |
+------+

the question wants us to find all managers who have at least five direct reports (≥ 5). Using GROUP BY & HAVING, we can first count & filter managerIds to include only those who have 5 or more direct reports.

We need to refer these managerIds back to their employee ids in order to get the names of the managers — by performing a one-to-one matching between “id” and “managerId” . Here’s the resutl SQL:

select name
from Employee
where id in
(
    select managerId
    from Employee
    group by 1
    having count(id) >= 5
)

14. Confirmation Rate (Basic Joins — Medium):

+----------------+----------+
| Column Name    | Type     |
+----------------+----------+
| user_id        | int      |
| time_stamp     | datetime |
+----------------+----------+
user_id is the column of unique values for this table.
Each row contains information about the signup time for the user with ID user_id.

+----------------+----------+
| Column Name    | Type     |
+----------------+----------+
| user_id        | int      |
| time_stamp     | datetime |
| action         | ENUM     |
+----------------+----------+
(user_id, time_stamp) is the primary key (combination of columns with unique values) for this table.
user_id is a foreign key (reference column) to the Signups table.
action is an ENUM (category) of the type ('confirmed', 'timeout')
Each row of this table indicates that the user with ID user_id requested a confirmation message at time_stamp and that confirmation message was either confirmed ('confirmed') or expired without confirming ('timeout').


The confirmation rate of a user is the number of 'confirmed' messages divided by the total number of requested confirmation messages. The confirmation rate of a user that did not request any confirmation messages is 0. Round the confirmation rate to two decimal places.

Write a solution to find the confirmation rate of each user.

Return the result table in any order.


Input: 
Signups table:
+---------+---------------------+
| user_id | time_stamp          |
+---------+---------------------+
| 3       | 2020-03-21 10:16:13 |
| 7       | 2020-01-04 13:57:59 |
| 2       | 2020-07-29 23:09:44 |
| 6       | 2020-12-09 10:39:37 |
+---------+---------------------+
Confirmations table:
+---------+---------------------+-----------+
| user_id | time_stamp          | action    |
+---------+---------------------+-----------+
| 3       | 2021-01-06 03:30:46 | timeout   |
| 3       | 2021-07-14 14:00:00 | timeout   |
| 7       | 2021-06-12 11:57:29 | confirmed |
| 7       | 2021-06-13 12:58:28 | confirmed |
| 7       | 2021-06-14 13:59:27 | confirmed |
| 2       | 2021-01-22 00:00:00 | confirmed |
| 2       | 2021-02-28 23:59:59 | timeout   |
+---------+---------------------+-----------+
Output: 
+---------+-------------------+
| user_id | confirmation_rate |
+---------+-------------------+
| 6       | 0.00              |
| 3       | 0.00              |
| 7       | 1.00              |
| 2       | 0.50              |
+---------+-------------------+

In this question, we need to calculate the confirmation rate for each user & we have 2 tables — Signups and Confirmations. We need to first do a LEFT JOIN between the Signups and Confirmations table, based on the common key user_id.

Then we can calculate the number of confirmation messages with the CASE statement & sum() function, and count the total number of user actions. In this case, we can GROUP BY user_id to calculate these 2 numbers for each user.

I prefer to wrap the results in a CTE table first and then calculate the confirmation rate, but you’re welcome to do the whole thing in one big query. Here’s the final SQL:

with cte as (
    select a.user_id,
        sum(case when action = 'confirmed' then 1 else 0 end) as confirmed,
        count(action) as total
    from Signups a
    left join Confirmations b
    on a.user_id = b.user_id
    group by 1
)

select user_id,
       round(coalesce(confirmed/total,0),2) as confirmation_rate
from cte
group by 1

15. Not Boring Movies (Basic Aggregate Functions — Easy):

+----------------+----------+
| Column Name    | Type     |
+----------------+----------+
| id             | int      |
| movie          | varchar  |
| description    | varchar  |
| rating         | float    |
+----------------+----------+
id is the primary key (column with unique values) for this table.
Each row contains information about the name of a movie, its genre, and its rating.
rating is a 2 decimal places float in the range [0, 10]


Write a solution to report the movies with an odd-numbered ID and a description that is not "boring".

Return the result table ordered by rating in descending order.


Input: 
Cinema table:
+----+------------+-------------+--------+
| id | movie      | description | rating |
+----+------------+-------------+--------+
| 1  | War        | great 3D    | 8.9    |
| 2  | Science    | fiction     | 8.5    |
| 3  | irish      | boring      | 6.2    |
| 4  | Ice song   | Fantacy     | 8.6    |
| 5  | House card | Interesting | 9.1    |
+----+------------+-------------+--------+
Output: 
+----+------------+-------------+--------+
| id | movie      | description | rating |
+----+------------+-------------+--------+
| 5  | House card | Interesting | 9.1    |
| 1  | War        | great 3D    | 8.9    |
+----+------------+-------------+--------+

This question wants us to find all the movies with an odd id number & exclude the ones with the description “boring”. I’m not sure if this is exactly an aggregate-function type of problem, but what I did was to include these 2 criterias in the WHERE clause. Here’s the final SQL:

select *
from Cinema
where id %2 != 0
and description != "boring"
order by rating desc

16. Average Selling Price (Basic Aggregate Functions — Easy):

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| product_id    | int     |
| start_date    | date    |
| end_date      | date    |
| price         | int     |
+---------------+---------+
(product_id, start_date, end_date) is the primary key (combination of columns with unique values) for this table.
Each row of this table indicates the price of the product_id in the period from start_date to end_date.
For each product_id there will be no two overlapping periods. That means there will be no two intersecting periods for the same product_id.

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| product_id    | int     |
| purchase_date | date    |
| units         | int     |
+---------------+---------+
This table may contain duplicate rows.
Each row of this table indicates the date, units, and product_id of each product sold.


Write a solution to find the average selling price for each product. average_price should be rounded to 2 decimal places.

Return the result table in any order.



Prices table:
+------------+------------+------------+--------+
| product_id | start_date | end_date   | price  |
+------------+------------+------------+--------+
| 1          | 2019-02-17 | 2019-02-28 | 5      |
| 1          | 2019-03-01 | 2019-03-22 | 20     |
| 2          | 2019-02-01 | 2019-02-20 | 15     |
| 2          | 2019-02-21 | 2019-03-31 | 30     |
+------------+------------+------------+--------+
UnitsSold table:
+------------+---------------+-------+
| product_id | purchase_date | units |
+------------+---------------+-------+
| 1          | 2019-02-25    | 100   |
| 1          | 2019-03-01    | 15    |
| 2          | 2019-02-10    | 200   |
| 2          | 2019-03-22    | 30    |
+------------+---------------+-------+
Output: 
+------------+---------------+
| product_id | average_price |
+------------+---------------+
| 1          | 6.96          |
| 2          | 16.96         |
+------------+---------------+

This question wants us to find out the average price for each product, and the way to calculate the average price is: sum of prices / sum of units sold. So okkkk this is definitely the aggregate-function problem, we can first find out the sums separately, and then divide the 2 numbers & round the result by 2 decimal places. Let’s LEFT JOIN the two tables by their common field — product_id, and make sure that the purchase_date happens between the product’s start_date and end_date.

We should also be aware that some products might not have any sales so far, and the average price for them should be 0. We can use the coalesce() function on the final number to achieve this.

Here’s the final SQL:

select a.product_id,
       coalesce(round((sum(price * units) / sum(units)), 2),0) as average_price
from Prices a
left join UNitsSold b
on a.product_id = b.product_id
and b.purchase_date between a.start_date and a.end_date
group by 1

17. Project Employees I (Basic Aggregate Functions — Easy):

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| project_id  | int     |
| employee_id | int     |
+-------------+---------+
(project_id, employee_id) is the primary key of this table.
employee_id is a foreign key to Employee table.
Each row of this table indicates that the employee with employee_id is working on the project with project_id.

+------------------+---------+
| Column Name      | Type    |
+------------------+---------+
| employee_id      | int     |
| name             | varchar |
| experience_years | int     |
+------------------+---------+
employee_id is the primary key of this table. It's guaranteed that experience_years is not NULL.
Each row of this table contains information about one employee.



Write an SQL query that reports the average experience years of all the employees for each project, rounded to 2 digits.

Return the result table in any order.



Input: 
Project table:
+-------------+-------------+
| project_id  | employee_id |
+-------------+-------------+
| 1           | 1           |
| 1           | 2           |
| 1           | 3           |
| 2           | 1           |
| 2           | 4           |
+-------------+-------------+
Employee table:
+-------------+--------+------------------+
| employee_id | name   | experience_years |
+-------------+--------+------------------+
| 1           | Khaled | 3                |
| 2           | Ali    | 2                |
| 3           | John   | 1                |
| 4           | Doe    | 2                |
+-------------+--------+------------------+
Output: 
+-------------+---------------+
| project_id  | average_years |
+-------------+---------------+
| 1           | 2.00          |
| 2           | 2.50          |
+-------------+---------------+

This question wants us to find employees’ average experience years for each project, and we know intuitively that it’s a GROUP BY/aggregate-function type of problem. We should first do a LEFT JOIN between the 2 tables, based on the common key — employee_id. Then we can use the avg() function to calculate the average years and group the results by project_id. Here’s the final SQL:

select project_id,
       round(avg(experience_years),2) as average_years
from Project a
left join Employee b
on a.employee_id = b.employee_id
group by 1

18. Percentage of Users Attended a Contest (Basic Aggregate Functions — Easy):

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| user_id     | int     |
| user_name   | varchar |
+-------------+---------+
user_id is the primary key (column with unique values) for this table.
Each row of this table contains the name and the id of a user.

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| contest_id  | int     |
| user_id     | int     |
+-------------+---------+
(contest_id, user_id) is the primary key (combination of columns with unique values) for this table.
Each row of this table contains the id of a user and the contest they registered into.


Write a solution to find the percentage of the users registered in each contest rounded to two decimals.

Return the result table ordered by percentage in descending order. In case of a tie, order it by contest_id in ascending order.



Input: 
Users table:
+---------+-----------+
| user_id | user_name |
+---------+-----------+
| 6       | Alice     |
| 2       | Bob       |
| 7       | Alex      |
+---------+-----------+
Register table:
+------------+---------+
| contest_id | user_id |
+------------+---------+
| 215        | 6       |
| 209        | 2       |
| 208        | 2       |
| 210        | 6       |
| 208        | 6       |
| 209        | 7       |
| 209        | 6       |
| 215        | 7       |
| 208        | 7       |
| 210        | 2       |
| 207        | 2       |
| 210        | 7       |
+------------+---------+
Output: 
+------------+------------+
| contest_id | percentage |
+------------+------------+
| 208        | 100.0      |
| 209        | 100.0      |
| 210        | 100.0      |
| 215        | 66.67      |
| 207        | 33.33      |
+------------+------------+

This question wants us to calculate the attendance percentage for each contest. The two main numbers we need to figure out are the total number of users, and the number of users participated in each contest. I show two different methods below, one with cte and one with subquery:

with cte as (
    select count(user_id) as total
    from Users
)

select contest_id,
       round(100* count(user_id) / total,2) as percentage
from cte
join Register
group by 1
order by 2 desc, 1



# subquery
select contest_id,
       round(100*count(user_id) / (select count(user_id) from Users),2) as percentage
from Register
group by 1
order by 2 desc, 1

19. Queries Quality and Percentage (Basic Aggregate Functions — Easy):

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| query_name  | varchar |
| result      | varchar |
| position    | int     |
| rating      | int     |
+-------------+---------+
This table may have duplicate rows.
This table contains information collected from some queries on a database.
The position column has a value from 1 to 500.
The rating column has a value from 1 to 5. Query with rating less than 3 is a poor query.



We define query quality as:

The average of the ratio between query rating and its position.

We also define poor query percentage as:

The percentage of all queries with rating less than 3.

Write a solution to find each query_name, the quality and poor_query_percentage.

Both quality and poor_query_percentage should be rounded to 2 decimal places.

Return the result table in any order.



Input: 
Queries table:
+------------+-------------------+----------+--------+
| query_name | result            | position | rating |
+------------+-------------------+----------+--------+
| Dog        | Golden Retriever  | 1        | 5      |
| Dog        | German Shepherd   | 2        | 5      |
| Dog        | Mule              | 200      | 1      |
| Cat        | Shirazi           | 5        | 2      |
| Cat        | Siamese           | 3        | 3      |
| Cat        | Sphynx            | 7        | 4      |
+------------+-------------------+----------+--------+
Output: 
+------------+---------+-----------------------+
| query_name | quality | poor_query_percentage |
+------------+---------+-----------------------+
| Dog        | 2.50    | 33.33                 |
| Cat        | 0.66    | 33.33                 |
+------------+---------+-----------------------+

This question is trying to find out each query’s quality and its the poor query percentage. Based on the definitions, we know that the quality for a query is the ratio between its rating and its position, and we consider a query as poor if its has a rating <3.

To calculate each query’s quality & poor query percentage separately, we can use use aggregate function & GROUP BY statement. Here’s the final result:

select query_name,
       round(sum(rating/position) / count(query_name),2) as quality,
       round(100*sum(case when rating < 3 then 1 else 0 end) / count(query_name),2) as poor_query_percentage
from Queries
where query_name is not null
group by 1

20. Monthly Transactions I(Basic Aggregate Functions — Medium):

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| id            | int     |
| country       | varchar |
| state         | enum    |
| amount        | int     |
| trans_date    | date    |
+---------------+---------+
id is the primary key of this table.
The table has information about incoming transactions.
The state column is an enum of type ["approved", "declined"].



Write an SQL query to find for each month and country, the number of transactions and their total amount, the number of approved transactions and their total amount.

Return the result table in any order.



Input: 
Transactions table:
+------+---------+----------+--------+------------+
| id   | country | state    | amount | trans_date |
+------+---------+----------+--------+------------+
| 121  | US      | approved | 1000   | 2018-12-18 |
| 122  | US      | declined | 2000   | 2018-12-19 |
| 123  | US      | approved | 2000   | 2019-01-01 |
| 124  | DE      | approved | 2000   | 2019-01-07 |
+------+---------+----------+--------+------------+
Output: 
+----------+---------+-------------+----------------+--------------------+-----------------------+
| month    | country | trans_count | approved_count | trans_total_amount | approved_total_amount |
+----------+---------+-------------+----------------+--------------------+-----------------------+
| 2018-12  | US      | 2           | 1              | 3000               | 1000                  |
| 2019-01  | US      | 1           | 1              | 2000               | 2000                  |
| 2019-01  | DE      | 1           | 1              | 2000               | 2000                  |
+----------+---------+-------------+----------------+--------------------+-----------------------+

This question is marked as a “medium” level, but I personally feel that it’s easier than the previous one as long as we understand how aggregate function works. In this problem, we need to find out several things — for each month & country, what’s the number of transactions and total amount, what’s the number of approved transactions and total approved amount. To GROUP BY month and country column, we get the final SQL below:

select left(trans_date,7) as month,
       country,
       count(id) as trans_count,
       sum(case when state = 'approved' then 1 else 0 end) as approved_count,
       sum(amount) as trans_total_amount,
       sum(case when state = 'approved' then amount else 0 end) as approved_total_amount

from Transactions
group by 1,2

21. Immediate Food Delivery II(Basic Aggregate Functions — Medium):

+-----------------------------+---------+
| Column Name                 | Type    |
+-----------------------------+---------+
| delivery_id                 | int     |
| customer_id                 | int     |
| order_date                  | date    |
| customer_pref_delivery_date | date    |
+-----------------------------+---------+
delivery_id is the column of unique values of this table.
The table holds information about food delivery to customers that make orders at some date and specify a preferred delivery date (on the same order date or after it).


If the customer's preferred delivery date is the same as the order date, then the order is called immediate; otherwise, it is called scheduled.

The first order of a customer is the order with the earliest order date that the customer made. It is guaranteed that a customer has precisely one first order.

Write a solution to find the percentage of immediate orders in the first orders of all customers, rounded to 2 decimal places.



Input: 
Delivery table:
+-------------+-------------+------------+-----------------------------+
| delivery_id | customer_id | order_date | customer_pref_delivery_date |
+-------------+-------------+------------+-----------------------------+
| 1           | 1           | 2019-08-01 | 2019-08-02                  |
| 2           | 2           | 2019-08-02 | 2019-08-02                  |
| 3           | 1           | 2019-08-11 | 2019-08-12                  |
| 4           | 3           | 2019-08-24 | 2019-08-24                  |
| 5           | 3           | 2019-08-21 | 2019-08-22                  |
| 6           | 2           | 2019-08-11 | 2019-08-13                  |
| 7           | 4           | 2019-08-09 | 2019-08-09                  |
+-------------+-------------+------------+-----------------------------+
Output: 
+----------------------+
| immediate_percentage |
+----------------------+
| 50.00                |
+----------------------+

In this question, we need to first find the first orders for each customer, then find out the immediate orders of these first orders, and finally calculate the immediate percentage.

There are 2 different ways to approach this problem: aggregate functions OR window functions. I have listed the 2 methods below, but let me briefly explain each solution:

  1. aggregate function: Grouping by the customer_id column, we use min() function to locate the first orders for each customer. One thing to note here that we need to use min() on the customer_pref_delivery_date column as well, because we need the corresponding customer_pref_delivery_date value for each first_order_date.
  2. window function: one cool thing about window function is that it doesn’t reduce the number of rows like aggregate functions, and it assigns the correct value to each row based on the partition column! One thing to note here is that when we calculate the immediate percentage, we need to count the number of distinct customer_id here, since we did not reduct the number of rows when we tried to find out the first orders for each customer.
with first_order as (
    select delivery_id,
        customer_id,
        min(order_date) as first_order_date,
        min(customer_pref_delivery_date) as customer_pref_delivery_date
    from Delivery
    group by 2
)

select round(100 * sum(case when first_order_date = customer_pref_delivery_date then 1 else 0 end) / count(delivery_id),2) as immediate_percentage
from first_order




with first_order as (
    select delivery_id,
        customer_id,
        order_date,
        min(order_date) over (partition by customer_id) as first_order_date,
        customer_pref_delivery_date
    from Delivery
)

select round(100*sum(case when first_order_date = customer_pref_delivery_date then 1 else 0 end) / count(distinct customer_id),2) as immediate_percentage
from first_order

22. Game Play Analysis IV (Basic Aggregate Functions — Medium):

+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| player_id    | int     |
| device_id    | int     |
| event_date   | date    |
| games_played | int     |
+--------------+---------+
(player_id, event_date) is the primary key (combination of columns with unique values) of this table.
This table shows the activity of players of some games.
Each row is a record of a player who logged in and played a number of games (possibly 0) before logging out on someday using some device.

Write a solution to report the fraction of players that logged in again on the day after the day they first logged in, rounded to 2 decimal places. In other words, you need to count the number of players that logged in for at least two consecutive days starting from their first login date, then divide that number by the total number of players.


Input: 
Activity table:
+-----------+-----------+------------+--------------+
| player_id | device_id | event_date | games_played |
+-----------+-----------+------------+--------------+
| 1         | 2         | 2016-03-01 | 5            |
| 1         | 2         | 2016-03-02 | 6            |
| 2         | 3         | 2017-06-25 | 1            |
| 3         | 1         | 2016-03-02 | 0            |
| 3         | 4         | 2018-07-03 | 5            |
+-----------+-----------+------------+--------------+
Output: 
+-----------+
| fraction  |
+-----------+
| 0.33      |
+-----------+

This question is very similar to the last one — sooo we need to report the fraction of players who log on two consecutive days, using their first login day as reference. Here I use the window function to find out each user’s first logon date (without reducing any number of rows in the dataset), then wrap the results as a cte table, and then calculate the fraction by: first compare the first logon date and event date and find out the number of players who log on two consecutive-days, then count the total number of players using distinct, and finally divide the two numbers to calculate the fraction. Here’s the final SQL:

with login as (
    select player_id,
        device_id,
        min(event_date) over (partition by player_id) as first_log,
        event_date,
        games_played
    from Activity
)

select round(sum(case when datediff(event_date,first_log) = 1 then 1 else 0 end)/
       count(distinct player_id),2) as fraction
from login

23. Number of Unique Subjects Taught by Each Teacher (Sorting and Grouping — Easy):

+-------------+------+
| Column Name | Type |
+-------------+------+
| teacher_id  | int  |
| subject_id  | int  |
| dept_id     | int  |
+-------------+------+
(subject_id, dept_id) is the primary key (combinations of columns with unique values) of this table.
Each row in this table indicates that the teacher with teacher_id teaches the subject subject_id in the department dept_id.

Write a solution to calculate the number of unique subjects each teacher teaches in the university.

Return the result table in any order.




Input: 
Teacher table:
+------------+------------+---------+
| teacher_id | subject_id | dept_id |
+------------+------------+---------+
| 1          | 2          | 3       |
| 1          | 2          | 4       |
| 1          | 3          | 3       |
| 2          | 1          | 1       |
| 2          | 2          | 1       |
| 2          | 3          | 1       |
| 2          | 4          | 1       |
+------------+------------+---------+
Output:  
+------------+-----+
| teacher_id | cnt |
+------------+-----+
| 1          | 2   |
| 2          | 4   |
+------------+-----+

In this question, we need to count the number of unique subjects each teacher teaches. We can use count() and GROUP BY on the column subject_id, here’s the final SQL:

select teacher_id,
       count(distinct subject_id) as cnt
from Teacher
group by 1

24. User Activity for the Past 30 Days I (Sorting and Grouping — Easy):

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| user_id       | int     |
| session_id    | int     |
| activity_date | date    |
| activity_type | enum    |
+---------------+---------+
This table may have duplicate rows.
The activity_type column is an ENUM (category) of type ('open_session', 'end_session', 'scroll_down', 'send_message').
The table shows the user activities for a social media website. 
Note that each session belongs to exactly one user.


Write a solution to find the daily active user count for a period of 30 days ending 2019-07-27 inclusively. A user was active on someday if they made at least one activity on that day.

Return the result table in any order.


Input: 
Activity table:
+---------+------------+---------------+---------------+
| user_id | session_id | activity_date | activity_type |
+---------+------------+---------------+---------------+
| 1       | 1          | 2019-07-20    | open_session  |
| 1       | 1          | 2019-07-20    | scroll_down   |
| 1       | 1          | 2019-07-20    | end_session   |
| 2       | 4          | 2019-07-20    | open_session  |
| 2       | 4          | 2019-07-21    | send_message  |
| 2       | 4          | 2019-07-21    | end_session   |
| 3       | 2          | 2019-07-21    | open_session  |
| 3       | 2          | 2019-07-21    | send_message  |
| 3       | 2          | 2019-07-21    | end_session   |
| 4       | 3          | 2019-06-25    | open_session  |
| 4       | 3          | 2019-06-25    | end_session   |
+---------+------------+---------------+---------------+
Output: 
+------------+--------------+ 
| day        | active_users |
+------------+--------------+ 
| 2019-07-20 | 2            |
| 2019-07-21 | 2            |
+------------+--------------+ 

In this question, we need to count the number of active users for a 30-day period of time, during 2019–06–28 and 2019–07–27, both inclusively. We can use the count() distinct function and set up the criteria in the WHERE clause, here’s the final SQL:

select activity_date as day,
       count(distinct user_id) as active_users
from Activity
where activity_date between '2019-06-28' and '2019-07-27'
group by 1

25. Product Sales Analysis III (Sorting and Grouping — Medium):

+-------------+-------+
| Column Name | Type  |
+-------------+-------+
| sale_id     | int   |
| product_id  | int   |
| year        | int   |
| quantity    | int   |
| price       | int   |
+-------------+-------+
(sale_id, year) is the primary key (combination of columns with unique values) of this table.
product_id is a foreign key (reference column) to Product table.
Each row of this table shows a sale on the product product_id in a certain year.
Note that the price is per unit.


+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| product_id   | int     |
| product_name | varchar |
+--------------+---------+
product_id is the primary key (column with unique values) of this table.
Each row of this table indicates the product name of each product.


Write a solution to select the product id, year, quantity, and price for the first year of every product sold.

Return the resulting table in any order.



Input: 
Sales table:
+---------+------------+------+----------+-------+
| sale_id | product_id | year | quantity | price |
+---------+------------+------+----------+-------+ 
| 1       | 100        | 2008 | 10       | 5000  |
| 2       | 100        | 2009 | 12       | 5000  |
| 7       | 200        | 2011 | 15       | 9000  |
+---------+------------+------+----------+-------+
Product table:
+------------+--------------+
| product_id | product_name |
+------------+--------------+
| 100        | Nokia        |
| 200        | Apple        |
| 300        | Samsung      |
+------------+--------------+
Output: 
+------------+------------+----------+-------+
| product_id | first_year | quantity | price |
+------------+------------+----------+-------+ 
| 100        | 2008       | 10       | 5000  |
| 200        | 2011       | 15       | 9000  |
+------------+------------+----------+-------+

To be honest, I don’t understand the purpose of the “Product” table here, since we only need to find out information for each product sold. So let’s ignore this table for now — in this question, we need to find out which year each product was sold first, and retrieve some other information: product_id, quantity, and price.

In the first solution below I find out out the first year for each sold product, and wrap results in a sub-query in the WHERE clause.

However, I also realized that it’s possible that one product can be sold multiple times with different quantities during their first year, and that’s how I come up with the second solution — butttt the LeetCode platform won’t take it as the correct answer. I will still include the SQL here for y’all, thanks :)

select product_id,
       year as first_year,
       quantity,
       price
from Sales
where (product_id, year) in (
        select product_id,
            min(year) as first_year
        from Sales
        group by 1
        )







select distinct product_id,
       min(year) over (partition by product_id) as first_year,
       quantity,
       price
from Sales
Sample Output from the second query:

| product_id | first_year | quantity | price |
| ---------- | ---------- | -------- | ----- |
| 100        | 2008       | 10       | 5000  |
| 100        | 2008       | 12       | 5000  |
| 200        | 2011       | 15       | 9000  |

26. Classes More Than 5 Students (Sorting and Grouping — Easy):

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| student     | varchar |
| class       | varchar |
+-------------+---------+
(student, class) is the primary key (combination of columns with unique values) for this table.
Each row of this table indicates the name of a student and the class in which they are enrolled.


Write a solution to find all the classes that have at least five students.

Return the result table in any order.


Input: 
Courses table:
+---------+----------+
| student | class    |
+---------+----------+
| A       | Math     |
| B       | English  |
| C       | Math     |
| D       | Biology  |
| E       | Math     |
| F       | Computer |
| G       | Math     |
| H       | Math     |
| I       | Math     |
+---------+----------+
Output: 
+---------+
| class   |
+---------+
| Math    |
+---------+

In this question, we need to list all the classes that have 5 or more students. First let’s use the count() function & GROUP BY to calculate how many students we have for each class, then wrap the result set in the WHERE clause in order to choose the class column only as the output table, and add the criteria in the WHERE clause — number of students ≥ 5. Here’s the SQL:

select class
from
(select class,
       count(student) as num
from Courses
group by 1) sub
where sub.num >=5

27. Find Followers Count (Sorting and Grouping — Easy):

+-------------+------+
| Column Name | Type |
+-------------+------+
| user_id     | int  |
| follower_id | int  |
+-------------+------+
(user_id, follower_id) is the primary key (combination of columns with unique values) for this table.
This table contains the IDs of a user and a follower in a social media app where the follower follows the user.


Write a solution that will, for each user, return the number of followers.

Return the result table ordered by user_id in ascending order.


Input: 
Followers table:
+---------+-------------+
| user_id | follower_id |
+---------+-------------+
| 0       | 1           |
| 1       | 0           |
| 2       | 0           |
| 2       | 1           |
+---------+-------------+
Output: 
+---------+----------------+
| user_id | followers_count|
+---------+----------------+
| 0       | 1              |
| 1       | 1              |
| 2       | 2              |
+---------+----------------+

This question is a no-brainer aggregate function + GROUP BY combo, we need to count the number of followers for each user. Here’s the final SQL:

select user_id,
       count(follower_id) as followers_count
from Followers
group by 1
order by 1

28. Biggest Single Number (Sorting and Grouping — Easy):

+-------------+------+
| Column Name | Type |
+-------------+------+
| num         | int  |
+-------------+------+
This table may contain duplicates (In other words, there is no primary key for this table in SQL).
Each row of this table contains an integer.


A single number is a number that appeared only once in the MyNumbers table.

Find the largest single number. If there is no single number, report null.


Input: 
MyNumbers table:
+-----+
| num |
+-----+
| 8   |
| 8   |
| 3   |
| 3   |
| 1   |
| 4   |
| 5   |
| 6   |
+-----+
Output: 
+-----+
| num |
+-----+
| 6   |
+-----+
Explanation: The single numbers are 1, 4, 5, and 6.
Since 6 is the largest single number, we return it.

I was confused by the wording of this question when I first read through it, I thought it’s asking for the largest odd number that appears only once — but apparently not... In this question, we need to first count the number of times each integer appears in the table, then select the largest integer. There are soooo many ways to approach this problem, here’s my SQL below:

select max(num) as num
from (
    select num, 
        count(num) as count
    from MyNumbers
    group by 1
) sub
where sub.count = 1





with cte as (
    select num,
           count(num) as count
    from MyNumbers
    group by 1
    having count(num) = 1
)

select max(num) as num
from cte

29. Customers Who Bought All Products (Sorting and Grouping — Medium):

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| customer_id | int     |
| product_key | int     |
+-------------+---------+
This table may contain duplicates rows. 
customer_id is not NULL.
product_key is a foreign key (reference column) to Product table.

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| product_key | int     |
+-------------+---------+
product_key is the primary key (column with unique values) for this table.


Write a solution to report the customer ids from the Customer table that bought all the products in the Product table.

Return the result table in any order.



Input: 
Customer table:
+-------------+-------------+
| customer_id | product_key |
+-------------+-------------+
| 1           | 5           |
| 2           | 6           |
| 3           | 5           |
| 3           | 6           |
| 1           | 6           |
+-------------+-------------+
Product table:
+-------------+
| product_key |
+-------------+
| 5           |
| 6           |
+-------------+
Output: 
+-------------+
| customer_id |
+-------------+
| 1           |
| 3           |
+-------------+

This question tricks me a little bit, first I thought I can solve it by setting “Customer.product_key” IN “Product.product_key”, but then I realized it doesn’t make sense!

Instead, weneed to first count the number of products each customer has bought, and compare the numbers with the total number of products in the Product table — to see if there’s any matches.

Well, you can probably write the SQL in a much cleaner way (see my second solution), but I like the first one better because it shows my thought process:

select customer_id
from (
    select customer_id,
        count(distinct product_key) as num_buy
    from Customer
    group by 1
    having count(distinct product_key) = (
        select count(product_key)
        from Product)
    ) sub




select customer_id
from Customer
group by 1
having count(distinct product_key) = (
    select count(product_key)
    from Product
)

30. The Number of Employees Which Report to Each Employee (Advanced Select and Joins — Easy):

+-------------+----------+
| Column Name | Type     |
+-------------+----------+
| employee_id | int      |
| name        | varchar  |
| reports_to  | int      |
| age         | int      |
+-------------+----------+
employee_id is the column with unique values for this table.
This table contains information about the employees and the id of the manager they report to. Some employees do not report to anyone (reports_to is null).


For this problem, we will consider a manager an employee who has at least 1 other employee reporting to them.

Write a solution to report the ids and the names of all managers, the number of employees who report directly to them, and the average age of the reports rounded to the nearest integer.

Return the result table ordered by employee_id.



Employees table:
+-------------+---------+------------+-----+
| employee_id | name    | reports_to | age |
+-------------+---------+------------+-----+
| 9           | Hercy   | null       | 43  |
| 6           | Alice   | 9          | 41  |
| 4           | Bob     | 9          | 36  |
| 2           | Winston | null       | 37  |
+-------------+---------+------------+-----+
Output: 
+-------------+-------+---------------+-------------+
| employee_id | name  | reports_count | average_age |
+-------------+-------+---------------+-------------+
| 9           | Hercy | 2             | 39          |
+-------------+-------+---------------+-------------+

In this question, for each manager, we need to find out the number of employees who report directly to them, along with these employees’ average age.

It’s a typical self-join problem, we can join the table’s reports_to column with its employee_id in order to locate the names of the managers. Then we can use aggregate function and GROUP BY to calculate the number of direct reports & their average average for each manager. Here’s the final SQL:

select distinct manager.employee_id, 
                manager.name,
                count(emp.reports_to) as reports_count,
                round(avg(emp.age),0) as average_age
from Employees emp
join Employees manager
on emp.reports_to = manager.employee_id
group by 1,2
order by 1

31. Primary Department for Each Employee (Advanced Select and Joins — Easy):

+---------------+---------+
| Column Name   |  Type   |
+---------------+---------+
| employee_id   | int     |
| department_id | int     |
| primary_flag  | varchar |
+---------------+---------+
(employee_id, department_id) is the primary key (combination of columns with unique values) for this table.
employee_id is the id of the employee.
department_id is the id of the department to which the employee belongs.
primary_flag is an ENUM (category) of type ('Y', 'N'). If the flag is 'Y', the department is the primary department for the employee. If the flag is 'N', the department is not the primary.



Employees can belong to multiple departments. When the employee joins other departments, they need to decide which department is their primary department. Note that when an employee belongs to only one department, their primary column is 'N'.

Write a solution to report all the employees with their primary department. For employees who belong to one department, report their only department.


Input: 
Employee table:
+-------------+---------------+--------------+
| employee_id | department_id | primary_flag |
+-------------+---------------+--------------+
| 1           | 1             | N            |
| 2           | 1             | Y            |
| 2           | 2             | N            |
| 3           | 3             | N            |
| 4           | 2             | N            |
| 4           | 3             | Y            |
| 4           | 4             | N            |
+-------------+---------------+--------------+
Output: 
+-------------+---------------+
| employee_id | department_id |
+-------------+---------------+
| 1           | 1             |
| 2           | 1             |
| 3           | 3             |
| 4           | 3             |
+-------------+---------------+

In this question, we need to report the primary departments for all employees, with the note that each employee can only have one primary department — primary_flag = ‘Y’. We first need to find out which employees already have a primary department & report them, then UNION the ones who don’t have a primary department.

The wording of this question is VERY misleading, and if you’re looking for a solution that will pass the Leetcode test, go for the second one below. However, I personally prefer my first solution because it makes more sense to me. Unfortunately, the first solution failed one of the Leetcode test cases — when one employee has 2 non-primary (“N”) departments, which disobeys the rule clearly stated in the question: when an employee belongs to more than one departments, he/she needs to decide one as their primary department. Anyway, here are the final SQL:

select employee_id,
       department_id
from Employee
where primary_flag = 'Y'
union
select  employee_id,
       department_id
from Employee
where employee_id not in (
    select employee_id
    from Employee
    where primary_flag = 'Y')




select employee_id,
       department_id
from Employee
group by 1
having count(department_id) = 1
union
select employee_id,
       department_id
from Employee
where primary_flag = 'Y'

32. Triangle Judgement (Advanced Select and Joins — Easy):

+-------------+------+
| Column Name | Type |
+-------------+------+
| x           | int  |
| y           | int  |
| z           | int  |
+-------------+------+
In SQL, (x, y, z) is the primary key column for this table.
Each row of this table contains the lengths of three line segments.


Report for every three line segments whether they can form a triangle.

Return the result table in any order.


Input: 
Triangle table:
+----+----+----+
| x  | y  | z  |
+----+----+----+
| 13 | 15 | 30 |
| 10 | 20 | 15 |
+----+----+----+
Output: 
+----+----+----+----------+
| x  | y  | z  | triangle |
+----+----+----+----------+
| 13 | 15 | 30 | No       |
| 10 | 20 | 15 | Yes      |
+----+----+----+----------+

In this question, we need to judge if three numbers can form a triangle. The rule to form a triangle is — the sum of any two lines is greater than the third line. We can use a CASE statement to test this rule, here’s the final SQL:

select *,
       case when x+y > z and x+z > y and y+z > x
            then 'Yes' else 'No' end as triangle
from Triangle

33. Consecutive Numbers (Advanced Select and Joins — Medium):

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| id          | int     |
| num         | varchar |
+-------------+---------+
In SQL, id is the primary key for this table.
id is an autoincrement column.


Find all numbers that appear at least three times consecutively.

Return the result table in any order.


Logs table:
+----+-----+
| id | num |
+----+-----+
| 1  | 1   |
| 2  | 1   |
| 3  | 1   |
| 4  | 2   |
| 5  | 1   |
| 6  | 2   |
| 7  | 2   |
+----+-----+
Output: 
+-----------------+
| ConsecutiveNums |
+-----------------+
| 1               |
+-----------------+

In this question, we need to find out if any number appears consecutively three times or more in a row. We can use lead() or lag() function to create 2 new columns for the second and third number (pivot the num column), then wrap the results as a subquery/CTE, and compare if the three numbers are the same.

Orrr we can do 2 SELF JOINs to join the three tables together, and only select the rows with three identical numbers & consecutive ids.

Here are the final SQLs:

select distinct num as ConsecutiveNums
from
(select id, 
       num,
       lead(num,1) over () as second_num,
       lead(num,2) over () as third_num
from Logs) sub
where num = second_num
and num = third_num




with a as
    (select id, num,
        lead(num, 1) over () as second_num,
        lead(num,2) over () as third_num
    from Logs) 

select distinct num as ConsecutiveNums
from a
where num = second_num
and num = third_num



select distinct num as ConsecutiveNums
from
(select id,
       num,
       lag(num,1) over () as second_num,
       lag(num,2) over () as third_num
from Logs) sub
where num = second_num
and num = third_num



select distinct a.num as ConsecutiveNums
from Logs a
join Logs b
join Logs c
where a.id+1 = b.id
and a.id+2 = c.id
and a.num = b.num
and a.num = c.num

34. Product Price at a Given Date (Advanced Select and Joins — Medium):

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| product_id    | int     |
| new_price     | int     |
| change_date   | date    |
+---------------+---------+
(product_id, change_date) is the primary key (combination of columns with unique values) of this table.
Each row of this table indicates that the price of some product was changed to a new price at some date.



Write a solution to find the prices of all products on 2019-08-16. Assume the price of all products before any change is 10.

Return the result table in any order.


Input: 
Products table:
+------------+-----------+-------------+
| product_id | new_price | change_date |
+------------+-----------+-------------+
| 1          | 20        | 2019-08-14  |
| 2          | 50        | 2019-08-14  |
| 1          | 30        | 2019-08-15  |
| 1          | 35        | 2019-08-16  |
| 2          | 65        | 2019-08-17  |
| 3          | 20        | 2019-08-18  |
+------------+-----------+-------------+
Output: 
+------------+-------+
| product_id | price |
+------------+-------+
| 2          | 50    |
| 1          | 35    |
| 3          | 10    |
+------------+-------+

This is a great problem, really stretched my mind a bit (tho the wording is so confusing lol)— we need to find the prices for all products on the day of ‘2019–08–16’. This question can be split into two parts: for products that did change price on or before this date, report the most recent/updated price; and for products that never changed price or changed after ‘2019–08–16’, report 10 as the default price.

In order to find out the most recent price for all the products on this date, we can use window function rank() to rank the prices for each product_id, order by change_date in descending order — and then select the rows with the highest rank (rank = 1).

Then we can use UNION to combine with the products who never changed price/changed after this date, by setting criteria in the WHERE clause to exclude those products that have been reported already. Here’s the final SQL:

with cte as (
    select product_id,
        change_date,
        rank() over (partition by product_id order by change_date desc) as day_rank,
        new_price
    from Products
    where change_date <= '2019-08-16'
)

select product_id, new_price as price
from cte
where day_rank = 1
union
select product_id, 10 as price
from Products
where product_id not in (
    select product_id
    from cte
)

35. Last Person to Fit in the Bus (Advanced Select and Joins — Medium):

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| person_id   | int     |
| person_name | varchar |
| weight      | int     |
| turn        | int     |
+-------------+---------+
person_id column contains unique values.
This table has the information about all people waiting for a bus.
The person_id and turn columns will contain all numbers from 1 to n, where n is the number of rows in the table.
turn determines the order of which the people will board the bus, where turn=1 denotes the first person to board and turn=n denotes the last person to board.
weight is the weight of the person in kilograms.


There is a queue of people waiting to board a bus. However, the bus has a weight limit of 1000 kilograms, so there may be some people who cannot board.

Write a solution to find the person_name of the last person that can fit on the bus without exceeding the weight limit. The test cases are generated such that the first person does not exceed the weight limit.



Input: 
Queue table:
+-----------+-------------+--------+------+
| person_id | person_name | weight | turn |
+-----------+-------------+--------+------+
| 5         | Alice       | 250    | 1    |
| 4         | Bob         | 175    | 5    |
| 3         | Alex        | 350    | 2    |
| 6         | John Cena   | 400    | 3    |
| 1         | Winston     | 500    | 6    |
| 2         | Marie       | 200    | 4    |
+-----------+-------------+--------+------+
Output: 
+-------------+
| person_name |
+-------------+
| John Cena   |
+-------------+

In this question, we need to find out who is the last person to board on the bus by calculating the cumulative sum of passengers’ weights. We can either use the window function or aggregate function + self-join to calculate the cumulative sum of weights.

Then we can filter the results to include only the total weights ≤ 1000, and wrap in a CTE table. Finally, we can select the last row by using a simple subquery — select the max turn number from the CTE table. Here’re the two SQLs:

with cte as (
    select *
    from (
        select *,
            sum(weight) over (order by turn) as cumu_weight
        from Queue
    ) sub
    where cumu_weight <= 1000
)

select person_name
from cte
where turn = (select max(turn) from cte)





with cte as (
    select *
    from (
        select a.*,
            sum(b.weight) as cumu_weight
        from queue a
        join queue b
        on a.turn >= b.turn
        group by  1,2,3,4
        order by 4
    ) sub
    where cumu_weight <= 1000
)

select person_name
from cte
where turn = (select max(turn) from cte)

36. Count Salary Categories (Advanced Select and Joins — Medium):

+-------------+------+
| Column Name | Type |
+-------------+------+
| account_id  | int  |
| income      | int  |
+-------------+------+
account_id is the primary key (column with unique values) for this table.
Each row contains information about the monthly income for one bank account.



Write a solution to calculate the number of bank accounts for each salary category. The salary categories are:

"Low Salary": All the salaries strictly less than $20000.
"Average Salary": All the salaries in the inclusive range [$20000, $50000].
"High Salary": All the salaries strictly greater than $50000.
The result table must contain all three categories. If there are no accounts in a category, return 0.

Return the result table in any order.



Input: 
Accounts table:
+------------+--------+
| account_id | income |
+------------+--------+
| 3          | 108939 |
| 2          | 12747  |
| 8          | 87709  |
| 6          | 91796  |
+------------+--------+
Output: 
+----------------+----------------+
| category       | accounts_count |
+----------------+----------------+
| Low Salary     | 1              |
| Average Salary | 0              |
| High Salary    | 3              |
+----------------+----------------+

In this problem, we need to report the number of accounts fall into each income category: Low Salary, Average Salary, High Salary. Need to note that there might be 0 accounts for certain income category(s), and in this case we still need to report it as 0.

At first I didn’t realize it’s a UNION type of question, and got super stuck after categorizing each account’s income into its corresponding category. I wasn’t sure how to report something that doesn’t exisit in a dataset, coalesce() wouldn’t be much of use in this case.

But anyway, UNION is our best friend in this problem — kindly see below my two solutions (one is cleaner, one is longer lol):

select 'Low Salary' as category,
       count(account_id) as accounts_count
from Accounts
where income < 20000
union
select 'Average Salary' as category,
       count(account_id) as accounts_count
from Accounts
where income between 20000 and 50000
union
select 'High Salary' as caregory,
       count(account_id) as accounts_count
from Accounts
where income > 50000






with cte as (
    select account_id,
        income,
        case when income < 20000 then 'Low Salary'
                when income between 20000 and 50000 then 'Average Salary'
                when income > 50000 then 'High Salary'
            end as category
    from Accounts
)

select 'Low Salary' as category,
       count(account_id) as accounts_count
from cte
where category = 'Low Salary'
union
select 'Average Salary' as category,
       count(account_id) as accounts_count
from cte
where category = 'Average Salary'
union
select 'High Salary' as category,
       count(account_id) as accounts_count
from cte
where category = 'High Salary'

37. Employees Whose Manager Left the Company (Subqueries — Easy):

+-------------+----------+
| Column Name | Type     |
+-------------+----------+
| employee_id | int      |
| name        | varchar  |
| manager_id  | int      |
| salary      | int      |
+-------------+----------+
In SQL, employee_id is the primary key for this table.
This table contains information about the employees, their salary, and the ID of their manager. Some employees do not have a manager (manager_id is null).


Find the IDs of the employees whose salary is strictly less than $30000 and whose manager left the company. When a manager leaves the company, their information is deleted from the Employees table, but the reports still have their manager_id set to the manager that left.

Return the result table ordered by employee_id.


Input:  
Employees table:
+-------------+-----------+------------+--------+
| employee_id | name      | manager_id | salary |
+-------------+-----------+------------+--------+
| 3           | Mila      | 9          | 60301  |
| 12          | Antonella | null       | 31000  |
| 13          | Emery     | null       | 67084  |
| 1           | Kalel     | 11         | 21241  |
| 9           | Mikaela   | null       | 50937  |
| 11          | Joziah    | 6          | 28485  |
+-------------+-----------+------------+--------+
Output: 
+-------------+
| employee_id |
+-------------+
| 11          |
+-------------+

In this question, we need to report a list of employees who earn less then $30000 a year and their managers have left the company — “have left” meaning these managers’ employee ids are deleted in the database. We can easily adapt these 2 criteria in a subquery in the WHERE clause below:

select employee_id
from Employees
where salary < 30000
and manager_id not in (
    select employee_id
    from Employees
)
order by 1

38. Exchange Seats (Subqueries — Medium):

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| id          | int     |
| student     | varchar |
+-------------+---------+
id is the primary key (unique value) column for this table.
Each row of this table indicates the name and the ID of a student.
id is a continuous increment.


Write a solution to swap the seat id of every two consecutive students. If the number of students is odd, the id of the last student is not swapped.

Return the result table ordered by id in ascending order.


Input: 
Seat table:
+----+---------+
| id | student |
+----+---------+
| 1  | Abbot   |
| 2  | Doris   |
| 3  | Emerson |
| 4  | Green   |
| 5  | Jeames  |
+----+---------+
Output: 
+----+---------+
| id | student |
+----+---------+
| 1  | Doris   |
| 2  | Abbot   |
| 3  | Green   |
| 4  | Emerson |
| 5  | Jeames  |
+----+---------+

In this question, we need to swap the seating of each 2 consecutive students, and make the last student staying at their original seat if we have an odd number of students. To approach this problem, we can use the CASE statement and lag() & lead() functions to swap the seating.

If we have an odd number of students, the last student’s seat will be null (since it doesn’t exist). In this case, we can use coalesce() function to fill the spot by the student’s original seat. Here’s the final SQL:

select id,
       case when id%2=0 then lag(student,1) over (order by id) 
            when id%2<>0 then coalesce(lead(student,1) over (order by id), student) end as student
from Seat

39. Movie Rating (Subqueries — Medium):

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| movie_id      | int     |
| title         | varchar |
+---------------+---------+
movie_id is the primary key (column with unique values) for this table.
title is the name of the movie.


+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| user_id       | int     |
| name          | varchar |
+---------------+---------+
user_id is the primary key (column with unique values) for this table.


+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| movie_id      | int     |
| user_id       | int     |
| rating        | int     |
| created_at    | date    |
+---------------+---------+
(movie_id, user_id) is the primary key (column with unique values) for this table.
This table contains the rating of a movie by a user in their review.
created_at is the user's review date. 


Write a solution to:

Find the name of the user who has rated the greatest number of movies. In case of a tie, return the lexicographically smaller user name.
Find the movie name with the highest average rating in February 2020. In case of a tie, return the lexicographically smaller movie name.


Input: 
Movies table:
+-------------+--------------+
| movie_id    |  title       |
+-------------+--------------+
| 1           | Avengers     |
| 2           | Frozen 2     |
| 3           | Joker        |
+-------------+--------------+
Users table:
+-------------+--------------+
| user_id     |  name        |
+-------------+--------------+
| 1           | Daniel       |
| 2           | Monica       |
| 3           | Maria        |
| 4           | James        |
+-------------+--------------+
MovieRating table:
+-------------+--------------+--------------+-------------+
| movie_id    | user_id      | rating       | created_at  |
+-------------+--------------+--------------+-------------+
| 1           | 1            | 3            | 2020-01-12  |
| 1           | 2            | 4            | 2020-02-11  |
| 1           | 3            | 2            | 2020-02-12  |
| 1           | 4            | 1            | 2020-01-01  |
| 2           | 1            | 5            | 2020-02-17  | 
| 2           | 2            | 2            | 2020-02-01  | 
| 2           | 3            | 2            | 2020-03-01  |
| 3           | 1            | 3            | 2020-02-22  | 
| 3           | 2            | 4            | 2020-02-25  | 
+-------------+--------------+--------------+-------------+
Output: 
+--------------+
| results      |
+--------------+
| Daniel       |
| Frozen 2     |
+--------------+

In this problem, we need to report the user who had rated the most movies, and find out which movie has the highest average rating in Feb 2020. There are many different ways to approach this problem, you can join the “Movies” and “Users” tables with the “MovieRating” table separately & then do the calculation & filtering, or you’re welcome to try my long-ass solution below lol.

First I create two CTE tables — one to count and rank the number of movies each user had rated in descending order, and one to rank the average movie rating in descending order. Then I select the highest rank in each table, and use UNION ALL to link them together. Here’s the final SQL:

with rate_info as (
    select user_id,
        rate_count,
        dense_rank() over (order by rate_count desc) as count_rank
    from (
        select user_id,
            count(rating) as rate_count
        from MovieRating
        group by 1
        order by 2 desc
    ) sub
),

movie_rank as (
    select movie_id,
        avg_rate,
        dense_rank() over (order by avg_rate desc) as avg_rank
    from (
        select movie_id,
            avg(rating) as avg_rate
        from MovieRating
        where left(created_at, 7) = '2020-02'
        group by 1
        order by 2 desc
    ) sub
)


(select name as results
from Users
where user_id in (
    select user_id
    from rate_info
    where count_rank = 1
)
order by 1 
limit 1)
union all
(select title as results
from Movies
where movie_id in (
    select movie_id
    from movie_rank
    where avg_rank = 1
)
order by 1
limit 1)





# second solution here!!
with user_rate as (
    select name,
        count(movie_id) as rate_count
    from MovieRating a
    left join Users b
    on a.user_id = b.user_id
    group by 1
    order by 2 desc, 1
),

 movie_rate as (
    select title,
        avg(rating) as avg_rate
    from MovieRating a
    left join Movies b
    on a.movie_id = b.movie_id
    where left(created_at, 7) = '2020-02'
    group by 1
    order by 2 desc, 1
)


(select name as results
from user_rate
limit 1)
union all
(select title as results
from movie_rate
limit 1)

40. Restaurant Growth (Subqueries — Medium):

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| customer_id   | int     |
| name          | varchar |
| visited_on    | date    |
| amount        | int     |
+---------------+---------+
In SQL,(customer_id, visited_on) is the primary key for this table.
This table contains data about customer transactions in a restaurant.
visited_on is the date on which the customer with ID (customer_id) has visited the restaurant.
amount is the total paid by a customer.


You are the restaurant owner and you want to analyze a possible expansion (there will be at least one customer every day).

Compute the moving average of how much the customer paid in a seven days window (i.e., current day + 6 days before). average_amount should be rounded to two decimal places.

Return the result table ordered by visited_on in ascending order.


Input: 
Customer table:
+-------------+--------------+--------------+-------------+
| customer_id | name         | visited_on   | amount      |
+-------------+--------------+--------------+-------------+
| 1           | Jhon         | 2019-01-01   | 100         |
| 2           | Daniel       | 2019-01-02   | 110         |
| 3           | Jade         | 2019-01-03   | 120         |
| 4           | Khaled       | 2019-01-04   | 130         |
| 5           | Winston      | 2019-01-05   | 110         | 
| 6           | Elvis        | 2019-01-06   | 140         | 
| 7           | Anna         | 2019-01-07   | 150         |
| 8           | Maria        | 2019-01-08   | 80          |
| 9           | Jaze         | 2019-01-09   | 110         | 
| 1           | Jhon         | 2019-01-10   | 130         | 
| 3           | Jade         | 2019-01-10   | 150         | 
+-------------+--------------+--------------+-------------+
Output: 
+--------------+--------------+----------------+
| visited_on   | amount       | average_amount |
+--------------+--------------+----------------+
| 2019-01-07   | 860          | 122.86         |
| 2019-01-08   | 840          | 120            |
| 2019-01-09   | 840          | 120            |
| 2019-01-10   | 1000         | 142.86         |
+--------------+--------------+----------------+

In this question, starting on the 7th day, we need to report the sale amount of a 7 days window for each visit date, along with its moving average (7 days average sale). In my opinion, there are two key points here: first, starting on the 7th day, display the visit date consecutively, and second — calculate the sum and average sale of each 7 days window.

The first part is kinda easy (tho I took a hint on this ughh), in the WHERE clause, we can filter on the visited_on column to only show the 7th day and onward.

For the second calculation part, we can write two subqueries to calculate sum and average separately, and wrap the results in the SELECT clause to only select the rows that match with the correct visit date (≥ the 7th visit day). Don’t forget to order the results by visited_on column at the end. The final SQL is below:

select visited_on,
       (
           select sum(amount)
           from Customer
           where visited_on between date_sub(c.visited_on, interval 6 day) and c.visited_on
       ) as amount,
       (
           select round(sum(amount) / 7,2)
           from Customer
           where visited_on between date_sub(c.visited_on, interval 6 day) and c.visited_on
       ) as average_amount
from Customer c
where visited_on >= (
    select date_add(min(visited_on), interval 6 day)
    from Customer
)
group by 1
order by 1

41. Friend Requests II: Who has the most friends (Subqueries — Medium):

+----------------+---------+
| Column Name    | Type    |
+----------------+---------+
| requester_id   | int     |
| accepter_id    | int     |
| accept_date    | date    |
+----------------+---------+
(requester_id, accepter_id) is the primary key (combination of columns with unique values) for this table.
This table contains the ID of the user who sent the request, the ID of the user who received the request, and the date when the request was accepted.

Write a solution to find the people who have the most friends and the most friends number.

The test cases are generated so that only one person has the most friends.


Input: 
RequestAccepted table:
+--------------+-------------+-------------+
| requester_id | accepter_id | accept_date |
+--------------+-------------+-------------+
| 1            | 2           | 2016/06/03  |
| 1            | 3           | 2016/06/08  |
| 2            | 3           | 2016/06/08  |
| 3            | 4           | 2016/06/09  |
+--------------+-------------+-------------+
Output: 
+----+-----+
| id | num |
+----+-----+
| 3  | 3   |
+----+-----+

In this question, we need to calculate the number of friends each user has and figure out who has the most friends. Well, based on how I understand the question, we sorta need to count the number of times each user id appears in both requester_id and accepter_id columns. Then we can order the counts in descending order and use LIMIT 1 to select the highest number — since the problem specifically mentions that only 1 friend can have the most friends.

In my opinion, the easiest approach to use UNION ALL to combine all the rows in these two columns, and GROUP BY user id to count once and for all. Then we can wrap the results in a subquery in the FROM clause, and rank the counts in descending order, in order to select the most “popular” user. Here’s the final SQL:

select id,
       count(id) as num
from(
    select requester_id as id
    from RequestAccepted
    union all 
    select accepter_id as id
    from RequestAccepted
    order by 1 
) sub
group by 1
order by 2 desc
limit 1

42. Investment in 2016 (Subqueries — Medium):

+-------------+-------+
| Column Name | Type  |
+-------------+-------+
| pid         | int   |
| tiv_2015    | float |
| tiv_2016    | float |
| lat         | float |
| lon         | float |
+-------------+-------+
pid is the primary key (column with unique values) for this table.
Each row of this table contains information about one policy where:
pid is the policyholder's policy ID.
tiv_2015 is the total investment value in 2015 and tiv_2016 is the total investment value in 2016.
lat is the latitude of the policy holder's city. It's guaranteed that lat is not NULL.
lon is the longitude of the policy holder's city. It's guaranteed that lon is not NULL.

Write a solution to report the sum of all total investment values in 2016 tiv_2016, for all policyholders who:

have the same tiv_2015 value as one or more other policyholders, and
are not located in the same city as any other policyholder (i.e., the (lat, lon) attribute pairs must be unique).
Round tiv_2016 to two decimal places.



Input: 
Insurance table:
+-----+----------+----------+-----+-----+
| pid | tiv_2015 | tiv_2016 | lat | lon |
+-----+----------+----------+-----+-----+
| 1   | 10       | 5        | 10  | 10  |
| 2   | 20       | 20       | 20  | 20  |
| 3   | 10       | 30       | 20  | 20  |
| 4   | 10       | 40       | 40  | 40  |
+-----+----------+----------+-----+-----+
Output: 
+----------+
| tiv_2016 |
+----------+
| 45.00    |
+----------+

In this question, we need to report the total investment in 2016 for all policyholders under two specific conditions: policyholders who have the same investment value in 2015, and policyholders who locate in different cities — with different lat & lon.

To be honest, when I first read through this question, I was like excuse me what are you talking about?! This question is very poorly worded in my opinion, but we can make it work.

To figure out which policyholders have the same investment value in 2015, we can use aggregate functions to calculate the frequency of the column tiv_2015, and limit the rows to those appear more than once. For policyholders who locate in different cities — with unique lat and lon, we again use count() and GROUP BY to find out which pair of lat & lon only appears once.

Let’s then wrap the results in 2 different cte table, and combine them with an INNER JOIN on the common columns — lat and lon. Finally, we can calculate the sum of tiv_2016 and round it in 2 decimal places. Here’s the final SQL:

with same_invest_info as (
    select pid,
        tiv_2016,
        lat,
        lon
    from Insurance
    where tiv_2015 in (
        select tiv_2015
        from Insurance
        group by 1
        having count(tiv_2015) > 1
    )
),
loc as (
    select lat,
        lon
    from Insurance
    group by 1,2
    having count(lat) = 1
    and count(lon) = 1
)

select round(sum(tiv_2016),2) as tiv_2016
from loc a, same_invest_info b
where a.lat = b.lat
and a.lon = b.lon

43. Department Top 3 Salaries (Subqueries — Hard):

+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| id           | int     |
| name         | varchar |
| salary       | int     |
| departmentId | int     |
+--------------+---------+
id is the primary key (column with unique values) for this table.
departmentId is a foreign key (reference column) of the ID from the Department table.
Each row of this table indicates the ID, name, and salary of an employee. It also contains the ID of their department.

Table: Department

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| id          | int     |
| name        | varchar |
+-------------+---------+
id is the primary key (column with unique values) for this table.
Each row of this table indicates the ID of a department and its name.

A company's executives are interested in seeing who earns the most money in each of the company's departments. A high earner in a department is an employee who has a salary in the top three unique salaries for that department.

Write a solution to find the employees who are high earners in each of the departments.

Return the result table in any order.



Input: 
Employee table:
+----+-------+--------+--------------+
| id | name  | salary | departmentId |
+----+-------+--------+--------------+
| 1  | Joe   | 85000  | 1            |
| 2  | Henry | 80000  | 2            |
| 3  | Sam   | 60000  | 2            |
| 4  | Max   | 90000  | 1            |
| 5  | Janet | 69000  | 1            |
| 6  | Randy | 85000  | 1            |
| 7  | Will  | 70000  | 1            |
+----+-------+--------+--------------+
Department table:
+----+-------+
| id | name  |
+----+-------+
| 1  | IT    |
| 2  | Sales |
+----+-------+
Output: 
+------------+----------+--------+
| Department | Employee | Salary |
+------------+----------+--------+
| IT         | Max      | 90000  |
| IT         | Joe      | 85000  |
| IT         | Randy    | 85000  |
| IT         | Will     | 70000  |
| Sales      | Henry    | 80000  |
| Sales      | Sam      | 60000  |
+------------+----------+--------+

In this question, we need to report the top 3 employees who earn the highest salary in each department. First we can use window function to rank employees’ salaries in each department, wrap the results in a subquery in the FROM clause, and use SELF JOIN to fetch the department name based on its department id. And finally, we can select the top 3 earners by setting a filter in the outer query’s WHERE clause. Here’s the final SQL:

select d.name as Department,
       sub.name as Employee,
       salary
from (
    select name,
        departmentId,
        salary,
        dense_rank() over (partition by departmentId order by salary desc) as salary_rank
    from Employee
) sub
left join Department d
on sub.departmentId = d.id
where salary_rank <= 3

44. Fix Names in a Table (Advanced String Functions / Regex / Clause — Easy):

+----------------+---------+
| Column Name    | Type    |
+----------------+---------+
| user_id        | int     |
| name           | varchar |
+----------------+---------+
user_id is the primary key (column with unique values) for this table.
This table contains the ID and the name of the user. The name consists of only lowercase and uppercase characters.

Write a solution to fix the names so that only the first character is uppercase and the rest are lowercase.

Return the result table ordered by user_id.



Input: 
Users table:
+---------+-------+
| user_id | name  |
+---------+-------+
| 1       | aLice |
| 2       | bOB   |
+---------+-------+
Output: 
+---------+-------+
| user_id | name  |
+---------+-------+
| 1       | Alice |
| 2       | Bob   |
+---------+-------+

In this question, we need to convert the name column into proper casing — camel case. Well my solution below is a pretty lengthy one, using several different functions: left, substr, length, upper, lower, concat. Feel free to come up with your own solution as mine is just an example:

 select user_id,
       concat(upper(left(name,1)), lower(substr(name,2,length(name)))) as name
from Users
order by 1

45. Patients with a Condition (Advanced String Functions / Regex / Clause — Easy):

+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| patient_id   | int     |
| patient_name | varchar |
| conditions   | varchar |
+--------------+---------+
patient_id is the primary key (column with unique values) for this table.
'conditions' contains 0 or more code separated by spaces. 
This table contains information of the patients in the hospital.

Write a solution to find the patient_id, patient_name, and conditions of the patients who have Type I Diabetes. 
Type I Diabetes always starts with DIAB1 prefix.

Return the result table in any order.


Input: 
Patients table:
+------------+--------------+--------------+
| patient_id | patient_name | conditions   |
+------------+--------------+--------------+
| 1          | Daniel       | YFEV COUGH   |
| 2          | Alice        |              |
| 3          | Bob          | DIAB100 MYOP |
| 4          | George       | ACNE DIAB100 |
| 5          | Alain        | DIAB201      |
+------------+--------------+--------------+
Output: 
+------------+--------------+--------------+
| patient_id | patient_name | conditions   |
+------------+--------------+--------------+
| 3          | Bob          | DIAB100 MYOP |
| 4          | George       | ACNE DIAB100 | 
+------------+--------------+--------------+

In this question, we need to find out which patients have Type I Diabetes, and report their patient_id, patient_name, along with their conditions. We can use the string operator LIKE and wildcard ‘%’ to approach this problem. Since Type I Diabetes always start with ‘DIAB1" prefix, I thought ‘DIAB1%’ should do the job — but it doesn’t work for patients with multiple conditions. We should also add another case — ‘% DIAB1%’ for such patients, here’s the final SQL:

select patient_id,
       patient_name,
       conditions
from Patients
where conditions like '% DIAB1%' or
conditions like 'DIAB1%'

46. Delete Duplicate Emails (Advanced String Functions / Regex / Clause — Easy):

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| id          | int     |
| email       | varchar |
+-------------+---------+
id is the primary key (column with unique values) for this table.
Each row of this table contains an email. The emails will not contain uppercase letters.


Write a solution to delete all duplicate emails, keeping only one unique email with the smallest id.

For SQL users, please note that you are supposed to write a DELETE statement and not a SELECT one.


Input: 
Person table:
+----+------------------+
| id | email            |
+----+------------------+
| 1  | [email protected] |
| 2  | [email protected]  |
| 3  | [email protected] |
+----+------------------+
Output: 
+----+------------------+
| id | email            |
+----+------------------+
| 1  | [email protected] |
| 2  | [email protected]  |
+----+------------------+

In this question, we need to find out which ids have duplicate emails and delete the duplicate ones that have greater id number — in other words, we want to keep only the rows with smaller user id. The question specifies that we need to use the DELETE clause instead of SELECT.

We can first do a SELF JOIN on the table itself by setting the common field to email, and making the first ‘table’ ’s id smaller then the second ‘table’ — so we can then safely delete the records in the second ‘table’. Here’s the final SQL:

delete b.*
from Person a
join Person b
on a.email = b.email
and a.id < b.id

47. Second Highest Salary (Advanced String Functions / Regex / Clause — Medium):

+-------------+------+
| Column Name | Type |
+-------------+------+
| id          | int  |
| salary      | int  |
+-------------+------+
id is the primary key (column with unique values) for this table.
Each row of this table contains information about the salary of an employee.

Write a solution to find the second highest salary from the Employee table. 
If there is no second highest salary, return null. 


Input: 
Employee table:
+----+--------+
| id | salary |
+----+--------+
| 1  | 100    |
| 2  | 200    |
| 3  | 300    |
+----+--------+
Output: 
+---------------------+
| SecondHighestSalary |
+---------------------+
| 200                 |
+---------------------+
Example 2:

Input: 
Employee table:
+----+--------+
| id | salary |
+----+--------+
| 1  | 100    |
+----+--------+
Output: 
+---------------------+
| SecondHighestSalary |
+---------------------+
| null                |
+---------------------+

In this question, we need to report the employee with the second highest salary, and report NULL if there’s none (e.g., only one record in the table). There are SO MANY different ways to approach this problem, I will just show you the 2 solutions that make most sense to me.

We can rank the salary in descending order for each user id and save the results as a CTE table, and then select the second highest salary by setting a filter in the WHERE clause (salary_rank = 2). To make the output to be NULL when there’s no record for the second highest earner, we can add a MAX() function on the salary column. For max(), it will automatically return NULL if there’s nothing to return. See my first solution below for more details :)

Alternatively, after building the CTE table, we can use the CASE statement and 2 subqueries to sort out when to output the second highest salary, and when to output the NULL value.

with cte as (
    select id,
        salary,
        dense_rank() over (order by salary desc) as salary_rank
    from Employee
)

select max(salary) as SecondHighestSalary
from cte
where salary_rank = 2




with cte as (
    select id,
        salary,
        dense_rank() over (order by salary desc) as salary_rank
    from Employee
)

select case when (select count(id)>1 from cte) 
            then (select distinct salary from cte where salary_rank = 2)
            else null end as SecondHighestSalary

48. Group Sold Products By Date (Advanced String Functions / Regex / Clause — Easy):

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| sell_date   | date    |
| product     | varchar |
+-------------+---------+
There is no primary key (column with unique values) for this table. It may contain duplicates.
Each row of this table contains the product name and the date it was sold in a market.


Write a solution to find for each date the number of different products sold and their names.

The sold products names for each date should be sorted lexicographically.

Return the result table ordered by sell_date.


Input: 
Activities table:
+------------+------------+
| sell_date  | product     |
+------------+------------+
| 2020-05-30 | Headphone  |
| 2020-06-01 | Pencil     |
| 2020-06-02 | Mask       |
| 2020-05-30 | Basketball |
| 2020-06-01 | Bible      |
| 2020-06-02 | Mask       |
| 2020-05-30 | T-Shirt    |
+------------+------------+
Output: 
+------------+----------+------------------------------+
| sell_date  | num_sold | products                     |
+------------+----------+------------------------------+
| 2020-05-30 | 3        | Basketball,Headphone,T-shirt |
| 2020-06-01 | 2        | Bible,Pencil                 |
| 2020-06-02 | 1        | Mask                         |
+------------+----------+------------------------------+

In this question, for each sell_date, we need the count the number of products it was sold and list them out in alphabetical order by a comma. We need to use the aggregate functions COUNT() to count the number of products sold on each day, and GROUP_CONCAT() to combine multiple rows into a single string (column). Here’s the final SQL:

select sell_date,
       count(distinct product) as num_sold,
       group_concat(distinct product order by product asc separator ',') as products
from Activities
group by 1

49. Lists the Products Ordered in a Period (Advanced String Functions / Regex / Clause — Easy):

+------------------+---------+
| Column Name      | Type    |
+------------------+---------+
| product_id       | int     |
| product_name     | varchar |
| product_category | varchar |
+------------------+---------+
product_id is the primary key (column with unique values) for this table.
This table contains data about the company's products.

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| product_id    | int     |
| order_date    | date    |
| unit          | int     |
+---------------+---------+
This table may have duplicate rows.
product_id is a foreign key (reference column) to the Products table.
unit is the number of products ordered in order_date.


Write a solution to get the names of products that have at least 100 units ordered in February 2020 and their amount.

Return the result table in any order.


Input: 
Products table:
+-------------+-----------------------+------------------+
| product_id  | product_name          | product_category |
+-------------+-----------------------+------------------+
| 1           | Leetcode Solutions    | Book             |
| 2           | Jewels of Stringology | Book             |
| 3           | HP                    | Laptop           |
| 4           | Lenovo                | Laptop           |
| 5           | Leetcode Kit          | T-shirt          |
+-------------+-----------------------+------------------+
Orders table:
+--------------+--------------+----------+
| product_id   | order_date   | unit     |
+--------------+--------------+----------+
| 1            | 2020-02-05   | 60       |
| 1            | 2020-02-10   | 70       |
| 2            | 2020-01-18   | 30       |
| 2            | 2020-02-11   | 80       |
| 3            | 2020-02-17   | 2        |
| 3            | 2020-02-24   | 3        |
| 4            | 2020-03-01   | 20       |
| 4            | 2020-03-04   | 30       |
| 4            | 2020-03-04   | 60       |
| 5            | 2020-02-25   | 50       |
| 5            | 2020-02-27   | 50       |
| 5            | 2020-03-01   | 50       |
+--------------+--------------+----------+
Output: 
+--------------------+---------+
| product_name       | unit    |
+--------------------+---------+
| Leetcode Solutions | 130     |
| Leetcode Kit       | 100     |
+--------------------+---------+

In this question, we need to report a list of products ordered in Feb 2020 with at least 100 units ordered. First, in the Orders table, we can use the aggregate function sum() to calculate the total number of units ordered for each product. Then we can set up the criteria to only include products that were ordered in Feb 2020 & had at least 100 units ordered, and save the results in a CTE table. Finally, we can join the CTE table with the Products table on the common field product_id, in order to fetch the product names. Here’s the final SQL:

with cte as (
    select product_id,
        sum(unit) as unit
    from Orders
    where left(order_date,7) = '2020-02'
    group by 1
    having unit >= 100
)

select product_name,
       unit 
from cte
left join Products p
on cte.product_id =  p.product_id

50. Find Users with Valid Emails (Advanced String Functions / Regex / Clause — Easy):

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| user_id       | int     |
| name          | varchar |
| mail          | varchar |
+---------------+---------+
user_id is the primary key (column with unique values) for this table.
This table contains information of the users signed up in a website. Some e-mails are invalid.

Write a solution to find the users who have valid emails.

A valid e-mail has a prefix name and a domain where:

The prefix name is a string that may contain letters (upper or lower case), 
digits, underscore '_', period '.', and/or dash '-'. 
The prefix name must start with a letter.
The domain is '@leetcode.com'.

Return the result table in any order.


Input: 
Users table:
+---------+-----------+-------------------------+
| user_id | name      | mail                    |
+---------+-----------+-------------------------+
| 1       | Winston   | [email protected]    |
| 2       | Jonathan  | jonathanisgreat         |
| 3       | Annabelle | @leetcode.com">[email protected]     |
| 4       | Sally     | [email protected] |
| 5       | Marwan    | quarz#[email protected] |
| 6       | David     | [email protected]       |
| 7       | Shapiro   | [email protected]     |
+---------+-----------+-------------------------+
Output: 
+---------+-----------+-------------------------+
| user_id | name      | mail                    |
+---------+-----------+-------------------------+
| 1       | Winston   | [email protected]    |
| 3       | Annabelle | @leetcode.com">[email protected]     |
| 4       | Sally     | [email protected] |
+---------+-----------+-------------------------+

We need to find out which users have a valid email. As the question states, a valid email starts with a letter (upper/lowercase), can contain letters, digits, underscore, period, and dash, and ends with the domain ‘@leetcode.com’. We should use regular expressions to approach this problem, let’s break it into smaller chunks:

  • email must starts with a letter (can be upper or lower case): ^[a-zA-Z]
  • email may contain letters, digits, underscore, period, dash: [a-zA-Z0–9_\.\-] we need to use backslash for period and dash here because they are wildcard characters
  • the asterisk ‘*’ sign — match 0 or more characters
  • email must ends with the domain ‘@leetcode.com’ — *@leetcode\.com
  • the last test case for this question has an email ended with a domain like ‘@leetcode?com’, so we’re ending another criterion to exclude such cases: and mail not like ‘%?%’
select *
from Users
where mail REGEXP '^[a-zA-Z][a-zA-Z0-9_\.\-]*@leetcode\.com'
and mail not like '%?%'

Here’s the end of this little SQL project, thanks so much for reading I’m sure you’re all SQL experts by now!!! Till next time :)

Sql
MySQL
Interview Questions
Leetcode
Towards Data Science
Recommended from ReadMedium