From Basic to Intermediate SQL in 10 Minutes

Basics - SELECT, WHERE, ORDER BY - WHERE details - Strings & Dates
Intermediate - Aggregation - Aggregation filtering - Joins - Unions

Introduction

If you need a more general introduction to SQL and Databases, check out the first part of this tutorial:

The Easiest Local SQL Server

Use SQLite to hone your skills

adam-shafi.medium.com

In this article, we will look at basic SQL syntax; selecting data from tables, filtering and working with data types. We will then look at Joins and aggregate measures.

The syntax in this article is all you really need to work with data in SQL competently. Anything else just makes life easier.

SQL for Data Science — why?

SQL remains one of the most important skills for all data related roles in 2020. Regardless of the advances and uptake in NoSQL databases, companies still use relational databases widely and SQL is essential to extract data out.

Having SQL syntax at your fingertips ensures you are able to get data fast, understand complex databases quickly and spend more time on analysis. Using SQL to filter and roll up data before bringing it into another tool is especially helpful when working with very large datasets.

We can see from the below chart that SQL is consistently one of the most requested skills in job listings and many companies will test for SQL as part of their interview process.

https://cvcompiler.com/blog/how-to-become-more-marketable-as-a-data-scientist/?utm=357d78dbc8a2d4a3

Using SQL

SQL Best Practice

There are multiple different SQL database management systems. These have slightly different syntax and are provided by different vendors. Generally, there is a lot of equivalence between them so learning the fundamentals in one is highly transferrable. The main names you’ll hear are MySQL, PostgreSQL, MS SQL Server.

Best practice is to put syntax such as SELECT into capital letters, separate columns on new lines and indent after joins. You’ll see me use these however, SQL doesn’t need these formatting steps to work.

This tutorial uses SQLite and an Avocados dataset. If you want to build this yourself in under 5 minutes, take a look here.

Basics — SELECT, WHERE, ORDER BY

Exploring the data

Let’s take a quick look at the table before we filter it.

SELECT * selects every column in the table.
LIMIT 50 means we only get the first 50 rows.

SELECT *
FROM avocados
LIMIT 50

SELECT, WHERE, ORDER BY

We’re going to tackle these in one query.

The order of functions is essential in SQL. The order shown in these queries is generally the only order that works

SELECT 
   date
 , avgprice
 , totalvol
FROM avocados
WHERE type = 1
ORDER BY totalvol DESC

SELECT

Here we define our columns via a comma separated list.
We can use * to select all columns in the table.

FROM

Here we define the table we want to pull the data from.
In this case, the name of the table is avocados.

WHERE

This is a filter. We can filter by any column in the table even if we haven’t selected it.
Multiple filters can be chained together using AND/OR.

ORDER BY

This sorts our data by the specified column.
We can specify ASCending or DESCending afterwards to change the order.

Using WHERE

Comparison Operators

We can use typical comparison operators with WHERE such as equals, greater than, less than etc.
These are standard symbols. You can see a full list here.

Contains

We can also use LIKE to find strings that contain a word.
We combine LIKE with a percent symbol to specify anything after or anything before.
The below query selects anything starting with ‘A’. We could put the percent sign before to find anything ending with ‘A’ (case-sensitive) or on both sides to find anything containing this character.

SELECT * 
FROM region
WHERE region LIKE ‘A%’

Filtering with a list

One of the most commonly used functions with WHERE is IN which allows you to pass a list of items to filter by.
In the example below, we treat date as a string and pass 2 dates to filter.
We also combine this with a regionid filter.

SELECT *
FROM avocados
WHERE date IN ('2015-01-04','2015-01-11')
AND regionid=1

Working with Strings and Dates

Strings

String and Date syntax can vary a lot depending on which database management system you are using. In general, most of the string and date operations you can do in Excel are transferable to SQL but with different syntax.

SELECT 
   UPPER(region)
 , SUBSTR(region,2,5)
 , SUBSTR(region,-3,3)
FROM region

UPPER

This converts the text to uppercase.
This is very useful when used in conjunction with WHERE clauses if you are searching for a single word in text that is a mixture of cases.

SUBSTR

This takes a subsection of the string using the numbers we specify, the first indicates the character to start on and the second is the number of characters we take.
We can also use negative numbers to get the last characters.

For more information on strings, try: https://www.geeksforgeeks.org/sql-string-functions/

Dates

When working with dates, the key is to make sure your data has the correct data type; it should be stored as an actual date.

If it isn’t you’ll need to either talk to your database administrator and get them to update the ‘create table’ statement to convert this.

In SQLite we can’t work with dates as easily, but the Syntax below will work in Postgres.

SELECT my_date,
 EXTRACT(‘year’ FROM my_date) AS year,
 EXTRACT(‘month’ FROM my_date) AS month,
 EXTRACT(‘day’ FROM my_date) AS day,
 EXTRACT(‘hour’ FROM my_date) AS hour,
 EXTRACT(‘minute’ FROM my_date) AS minute,
 EXTRACT(‘second’ FROM my_date) AS second,
 EXTRACT(‘decade’ FROM my_date) AS decade,
 EXTRACT(‘dow’ FROM my_date) AS day_of_week
FROM table_name

If in Postgres, you can also use the DATE_PART function

SELECT DATE_PART(‘year’, my_date) AS year 
FROM table_name

Intermediate SQL — Aggregation, JOIN, UNION

Aggregation — SUM, AVG, MIN, MAX, GROUP BY

Let’s aggregate the data using the type column. We have 2 types, which at the moment are called ‘1’ and ‘2’. We’ll find out what those mean later.

The order is also very important here. The GROUP BY must go at the end, and if we are using a WHERE, this should go before the GROUP BY.

SELECT 
   type
 , AVG(avgprice)
 , SUM(totalvol)
FROM avocados
GROUP BY type

AVG

This takes the average of the column in brackets by row

SUM

This takes the sum of the column in brackets by row

These aggregate functions are calculated over the rows selected which means we need to group the rows to get to what we want — the average or sum by type

GROUP BY

This is the column we want to group by.
SQL is performing the AVG or SUM calculation on all the rows within a group based on this function.

Other commonly used aggregations are MIN, MAX, COUNT.

Aggregate Filtering

Let’s say we want to take our previous example and filter out the types where the avgprice is less than 1.5.

Obviously there are only 2 types here so we could do this by eye but imagine we had hundreds!

We’ve already learnt to filter using WHERE but this won’t work to achieve what we want.

We can’t use WHERE with aggregate functions such as SUM and AVG because this will filter the underlying rows, not our summarised data.

Filtering the underlying data means we actually filter the individual rows and not our summary table. See the example below.

If we use WHERE, the AVG and SUM are now calculated using only rows with avgprice > 1.5.

Note the different values for Average and Sum compared to last time. This isn’t what we were trying to achieve!

Enter HAVING. This allows us to filter the aggregated rows after they’ve been calculated.

SELECT 
   type
 , AVG(avgprice)
 , SUM(totalvol)
FROM avocados
GROUP BY type
HAVING AVG(avgprice) > 1.5

JOIN Theory

JOIN might be the single most important function in SQL. Understanding how joins work, in my opinion, takes you from a beginner to intermediate SQL user.

In order to understand joins, you should understand the basics of relational databases and STAR schema tables. The first part of this guide gives a good explanation.

This is the schema of our set of tables. Without making joins we can’t get the actual type or region names because our Avocados table uses IDs in place of the actual names.

There are 4 main types of join; Inner, Outer, Left and Right. The differences are the rows that are kept in the joined table once the join is made.

INNER JOIN: Returns all rows where there is at least one match in BOTH tables.
LEFT JOIN: Returns all rows from the left table and matched rows from the right table.
RIGHT JOIN: Returns all rows from the right table and matched rows from the left table.
FULL JOIN: Returns all rows where there is a match in ONE of the tables.

Consider a situation where a table has missing rows — eg if our Region table above was missing some region ids that were present in the main Avocado table.

In this situation, an INNER JOIN would cause us to lose any rows without a corresponding region id, so here we might want to use a left, right or outer join to keep all rows in the Avocado table.

It’s also worth noting that if there are duplicates in your dataset with the same ID, joining these will cause multiple lines to be created.

Making sure you understand duplicates in your data is one of the key things to be careful of when making a join.

JOIN Composition

Specify the join type
Specify the table you are bringing on
After the ON, we name the columns
The order here doesn’t matter, if doing a LEFT JOIN, the left table is always the original table. The table you are bringing in is always the right table.
Specify the column
We put an equals sign between each column
This is the table you are joining to
This is the columns we are joining on. It should be equivalent to the table in ‘table1’ eg the same id.

You can use AND to join on multiple columns.

JOIN in SQL

We’ll just use left joins for the tutorial, but if you are following along, feel free to experiment with other join types.

SELECT *
FROM avocados
LEFT JOIN avo_type ON avocados.type = avo_type.typeid

We now have the columns typeid and type from the avo_type table. Let’s select only the date, avgprice and type columns.

Take a look at the query below, we’ve changed 2 parts of this query.

First, we’ve specified exactly which columns we’ll be using.

Then, tables have been given aliases.

In SQL aliases are used to give a table, or a column in a table, a temporary name. Aliases are often used to make column names more readable. An alias only exists for the duration of the query.

We’ve called the avocados table ‘av’ and the avo_type table ‘avt’. We’ve also used these as an easy way to specify which columns to select. You’ll notice the type column is in both tables, so specifying which type we want is essential here.

SELECT 
   av.date
 , av.avgprice
 , avt.type
FROM avocados av
LEFT JOIN avo_type avt ON av.type = avt.typeid

UNION

Unions are used when we want to put one table on top of another. It isn’t really applicable for the data we have in this example, but I’ll show the syntax anyway.

In this query, we filter to type =1 and then use a union to stack this on top of a table for type !=1

Note that we’ve renamed the sum column here but the UNION still stacks it as it uses the position of the column and not the column name.

SELECT type, sum(totalvol) as totalvol
FROM avocados av
WHERE type = '1'
UNION
SELECT type, sum(totalvol) as t
FROM avocados av
WHERE type != '1'

Summary

If you got to the end of this article, congratulations! You now have the required to knowledge to use SQL to a working standard.

Honestly, it doesn’t get much more complicated than this. Getting all the functions in this article mastered will ensure you can pass most SQL tests.

Advanced SQL for Data Scientists

Master SQL to save time and show off

medium.com

The Easiest Local SQL Server

Use SQLite to hone your skills

adam-shafi.medium.com

From Basic to Intermediate SQL in 10 Minutes

Table of Contents

Introduction

The Easiest Local SQL Server

Use SQLite to hone your skills

SQL for Data Science — why?

Using SQL

Basics — SELECT, WHERE, ORDER BY

Exploring the data

SELECT, WHERE, ORDER BY

Using WHERE

Working with Strings and Dates

Strings

Dates

Intermediate SQL — Aggregation, JOIN, UNION

Aggregation — SUM, AVG, MIN, MAX, GROUP BY

Aggregate Filtering

JOIN Theory

JOIN Composition

JOIN in SQL

UNION

Summary

More SQL Articles:

Advanced SQL for Data Scientists

Master SQL to save time and show off

The Easiest Local SQL Server

Use SQLite to hone your skills

Contact Me:

Adam Shafi - Data Science Immersive - General Assembly | LinkedIn

Data Scientist with over 4 years experience in Analytics including managing teams, delivering projects, and turning…