avatarAdam Shafi

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

7964

Abstract

lass="hljs-keyword">WHERE</span> regionid = <span class="hljs-number">1</span></pre></div><figure id="15f8"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*3LJEcoDlELYKc6PycP2aDw.png"><figcaption></figcaption></figure><h1 id="81a3">Subqueries</h1><p id="9c09">A subquery is a nested query; it’s a query within a query</p><p id="128e">Let’s say we want the full dataset but only for the top 10 regions by total volume.</p><p id="b6ab">We could find the top 10 with the query below, then we could use this list of regions to filter the main table. But what if we got new data and the top 10 changed? We’d have to run both queries again. We can make this more dynamic using a subquery.</p><p id="32f0">To get the top 10, we’d run something like this:</p><div id="86ac"><pre><span class="hljs-keyword">SELECT</span> b.region, <span class="hljs-built_in">ROUND</span>(<span class="hljs-built_in">SUM</span>(a.totalvol)) <span class="hljs-keyword">FROM</span> avocados a <span class="hljs-keyword">LEFT</span> <span class="hljs-keyword">JOIN</span> region b <span class="hljs-keyword">ON</span> a.regionid=b.regionid <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-number">1</span> <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> <span class="hljs-number">2</span> <span class="hljs-keyword">DESC</span> <span class="hljs-keyword">LIMIT</span> <span class="hljs-number">1</span></pre></div><p id="b6f1">In the query below, we use an <b>INNER JOIN </b>to filter the list to the <i>regions</i> in the subquery. We can also bring back data from the subquery because we’ve joined it.</p><p id="74a8">Please note how aliases work here. The subquery aliases are independent of the main query aliases, which is why we have 2 ‘a’ and ‘b’ aliases. To refer to subquery columns we have to use the subquery alias which is c in this case.</p><div id="4c9c"><pre><span class="hljs-keyword">SELECT</span> date, totalvol, avo_a, avo_b, b.region, c.total_totalvol <span class="hljs-keyword">FROM</span> avocados a <span class="hljs-keyword">LEFT</span> <span class="hljs-keyword">JOIN</span> region b <span class="hljs-keyword">ON</span> a.regionid=b.regionid <span class="hljs-keyword">INNER</span> <span class="hljs-keyword">JOIN</span> ( <span class="hljs-keyword">SELECT</span> b.regionid, <span class="hljs-built_in">ROUND</span>(<span class="hljs-built_in">SUM</span>(a.totalvol)) <span class="hljs-keyword">as</span> total_totalvol <span class="hljs-keyword">FROM</span> avocados a <span class="hljs-keyword">LEFT</span> <span class="hljs-keyword">JOIN</span> region b <span class="hljs-keyword">ON</span> a.regionid=b.regionid <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-number">1</span> <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> <span class="hljs-number">2</span> <span class="hljs-keyword">DESC</span> <span class="hljs-keyword">LIMIT</span> <span class="hljs-number">10</span>) c <span class="hljs-keyword">ON</span> a.regionid = c.regionid</pre></div><figure id="314d"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*63Rkhs7TAl-9msETSoprwg.png"><figcaption></figcaption></figure><p id="cb81">Subqueries can also be used after a <b>FROM</b> or <b>WHERE</b> clause. Let’s illustrate these with a simple example.</p><p id="7600"><b>FROM</b> Subquery</p><div id="857f"><pre><span class="hljs-keyword">SELECT</span> <span class="hljs-built_in">AVG</span>(total_totalvol) <span class="hljs-keyword">FROM</span> ( <span class="hljs-keyword">SELECT</span> b.regionid, <span class="hljs-built_in">ROUND</span>(<span class="hljs-built_in">SUM</span>(a.totalvol)) <span class="hljs-keyword">as</span> total_totalvol <span class="hljs-keyword">FROM</span> avocados a <span class="hljs-keyword">LEFT</span> <span class="hljs-keyword">JOIN</span> region b <span class="hljs-keyword">ON</span> a.regionid=b.regionid <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-number">1</span> <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> <span class="hljs-number">2</span> <span class="hljs-keyword">DESC</span> <span class="hljs-keyword">LIMIT</span> <span class="hljs-number">10</span>)</pre></div><figure id="b0f7"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*5JViDoLl1BwNJ3W8xRCoMg.png"><figcaption></figcaption></figure><p id="c91a"><b>WHERE</b> Subquery</p><div id="8c36"><pre><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> avocados <span class="hljs-keyword">WHERE</span> regionid <span class="hljs-keyword">IN</span> ( <span class="hljs-keyword">SELECT</span> regionid <span class="hljs-keyword">FROM</span> region <span class="hljs-keyword">WHERE</span> region <span class="hljs-keyword">LIKE</span> <span class="hljs-string">'West%'</span>)</pre></div><figure id="a6ac"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*msuWVP-cJNsIhdtCpAxzwQ.png"><figcaption></figcaption></figure><h1 id="a29a">Common Table Expressions</h1><p id="f8d4">A Common Table Expression (aka <b>CTE</b>, aka <b>WITH</b> statement) is a temporary data set to be used as part of a query. It only exists during the execution of that query; it cannot be used in other queries even within the same session</p><p id="ece1">Common Table Expressions are recommended if you plan to reuse the subquery within the same query because the subquery is temporarily saved to memory, meaning it doesn’t need to be run multiple times. They generally also look much cleaner, so if you are sharing code it can be helpful for legibility.</p><p id="f201">In this example we’ll just use a CTE in a way that could also be done with a subquery and we’ll take the <b>INNER JOIN</b> example from above.</p><p id="6d2d">Here, we specify the subquery above our main query. We name it ‘top’ and specify the column names, then write a subquery. We can now call ‘top’ throughout the query to use this.</p><p id="17af">If we call it multiple types, SQL only has to generate the table once.</p><div id="f26c"><pre><span class="hljs-keyword">WITH</span> top (regionid, total_totalvol) <span class="hljs-keyword">AS</span> ( <span class="hljs-keyword">SELECT</span> b.regionid, ROUND(SUM(a.totalvol)) <span class="hljs-keyword">FROM</span> avocados a <span class="hljs-keyword">LEFT JOIN</span> region b <span class="hljs-keyword">ON</span> a.regionid=b.regionid <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-number">1</span> <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> <span class="hljs-number">2</span> <span class="hljs-keyword">DESC</span> <span class="hljs-keyword">LIMIT</span> <span class="hljs-number">10</span>)

<span class="hljs-keyword">SELECT</span> <span class="hljs-type">date</span>, totalvol, avo_a, avo_b, b.region, top.total_totalvol <span class="hljs-keyword">FROM</span> avocados a <span class="hljs-keyword">LEFT JOIN</span> region b <span class="hljs-keyword">ON</span> a.regionid=b.regionid <span class="hljs-keyword">INNER</span> <span class="hljs-keyword">JOIN</span> top <span class="hljs-keyword">ON</span> a.regionid = top.regionid</pre></div><figure id="2e05"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*__uFRwzL4f26ClPfQbxonA.png"><figcaption></figcaption></figure><h1 id="18c9">Window Functions</h1><p id="7800">From PostgreSQL’s <a href="https://www.postgresql.org/docs/9.1/tutorial-window.html">documentation</a>:</p><blockquote id="88cf"><p>A <i>window function</i> performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function d

Options

oes not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.</p></blockquote><p id="6e3d">Basically — we can use Window functions to look at rows above and below each row.</p><p id="9bcb">For more information and challenges, take a look at this fantastic resource: <a href="https://www.windowfunctions.com/">https://www.windowfunctions.com/</a></p><p id="0406"><b>Running Total</b></p><p id="7ff9">The <b>OVER ORDER BY</b> sequence is the window function. <b>OVER</b> is almost like a <b>GROUP BY</b>, specifying how we want to construct our sum. The <b>ORDER BY </b>allows us to create the running total</p><div id="cd9b"><pre><span class="hljs-keyword">SELECT</span> <span class="hljs-type">date</span> , avo_a , SUM(avo_a) <span class="hljs-keyword">OVER</span> (<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> <span class="hljs-type">date</span>) <span class="hljs-keyword">AS</span> running_total <span class="hljs-keyword">FROM</span> avocados <span class="hljs-keyword">WHERE</span> regionid =<span class="hljs-number">1</span> <span class="hljs-keyword">AND</span> <span class="hljs-keyword">type</span> =<span class="hljs-number">1</span></pre></div><figure id="549c"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*KUMmvq5OMofxnupx3rJWvw.png"><figcaption></figcaption></figure><p id="dbd0"><b>Total by date</b></p><p id="3d27">Here, we use <b>PARTITION BY</b> to group the <b>SUM</b> by a given column (<i>Date</i> in this case)</p><div id="e09f"><pre><span class="hljs-keyword">SELECT</span> <span class="hljs-type">date</span> , avo_a , <span class="hljs-keyword">type</span> , SUM(avo_a) <span class="hljs-keyword">OVER</span> (<span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">BY</span> <span class="hljs-type">date</span>) <span class="hljs-keyword">AS</span> date_total <span class="hljs-keyword">FROM</span> avocados <span class="hljs-keyword">WHERE</span> regionid =<span class="hljs-number">1</span></pre></div><figure id="8888"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*v0VGLaEjYnLKA8MYZgBffA.png"><figcaption></figcaption></figure><p id="40b6"><b>Moving Average</b></p><p id="a791">This is where things start to heat up. By using <b>AVG</b> and <b>ORDER BY </b>date, we can specify <b>ROWS</b> to take a moving average. 3 <b>PRECEDING</b> is a 3 day moving average, but we can specify any number.</p><div id="8844"><pre><span class="hljs-keyword">SELECT</span> <span class="hljs-type">date</span> , avo_a , AVG(avo_a) <span class="hljs-keyword">OVER</span> (<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> <span class="hljs-type">date</span> <span class="hljs-keyword">ASC</span> <span class="hljs-keyword">ROWS</span> <span class="hljs-number">3</span> <span class="hljs-keyword">PRECEDING</span>) <span class="hljs-keyword">AS</span> date_total <span class="hljs-keyword">FROM</span> avocados <span class="hljs-keyword">WHERE</span> regionid =<span class="hljs-number">1</span> <span class="hljs-keyword">AND</span> <span class="hljs-keyword">type</span>=<span class="hljs-number">1</span></pre></div><figure id="0025"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*BGZEVP289-EfeNkOZ8dfrw.png"><figcaption></figcaption></figure><p id="67e5"><b>Get First Row</b></p><p id="63af">Let’s say we have varying end dates for each region. We can use Window Functions combined with a Subquery to get the first or last dates.</p><p id="7041">First, let’s define what our Subquery will be. The below query adds a row number, grouped by region id and ordered from most recent to oldest. We can select the first row from each grouping to get the newest date</p><div id="1edc"><pre><span class="hljs-keyword">SELECT</span> <span class="hljs-type">date</span> , regionid , ROW_NUMBER() <span class="hljs-keyword">OVER</span> (<span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">BY</span> regionid <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> <span class="hljs-type">date</span> <span class="hljs-keyword">desc</span>) <span class="hljs-keyword">AS</span> row_number <span class="hljs-keyword">FROM</span> avocados <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> regionid</pre></div><figure id="6f01"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*k9SWVuPtE0zKi9itXVq0eA.png"><figcaption></figcaption></figure><p id="2327">Now, we subquery the original and take only the first row.</p><div id="0f91"><pre><span class="hljs-keyword">SELECT</span> <span class="hljs-type">date</span>, regionid <span class="hljs-keyword">FROM</span> (<span class="hljs-keyword">SELECT</span> <span class="hljs-type">date</span> , regionid , ROW_NUMBER() <span class="hljs-keyword">OVER</span> (<span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">BY</span> regionid <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> <span class="hljs-type">date</span> <span class="hljs-keyword">desc</span>) <span class="hljs-keyword">AS</span> row_number <span class="hljs-keyword">FROM</span> avocados <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> regionid) <span class="hljs-keyword">WHERE</span> row_number = <span class="hljs-number">1</span></pre></div><figure id="8867"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*zS9yjcgriPzp9dGldmVjVQ.png"><figcaption></figcaption></figure><h1 id="6b77">Conclusion</h1><p id="bd8d">Mastering these is a bonus, many of these functions help to optimise your queries or roll data up. This can be important when working with large datasets as rolling up in the SQL layer of your analysis pipeline will speed things up down the line.</p><h2 id="75f1">More articles:</h2><div id="befd" class="link-block"> <a href="https://readmedium.com/from-basic-to-intermediate-sql-in-10-minutes-42b960ed6f9e"> <div> <div> <h2>From Basic to Intermediate SQL in 10 Minutes</h2> <div><h3>A guide for aspiring Data Scientists</h3></div> <div><p>medium.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*o0JrsphlUW7yeutR)"></div> </div> </div> </a> </div><div id="0efc" class="link-block"> <a href="https://towardsdatascience.com/random-forest-29cf337c68d4"> <div> <div> <h2>Random Forest for a non-Technical Audience</h2> <div><h3>One of the most widely used algorithms today is actually quite tricky to explain…</h3></div> <div><p>towardsdatascience.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*hS0GWfZ04-JxHPhD)"></div> </div> </div> </a> </div><h2 id="1fda">Contact me:</h2><div id="52ad" class="link-block"> <a href="https://www.linkedin.com/in/adamshafi/"> <div> <div> <h2>Adam Shafi - Data Science Immersive - General Assembly | LinkedIn</h2> <div><h3>Data Scientist with over 4 years experience in Analytics including managing teams, delivering projects, and turning…</h3></div> <div><p>www.linkedin.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*R8o1Ct_mTmaCaNTy)"></div> </div> </div> </a> </div></article></body>

Advanced SQL for Data Scientists

Photo by Daniel Fazio on Unsplash

Introduction

This article looks at a few techniques that if mastered, equips the user with the tools to deal with a wide range of data types. This article does not cover anything relation to database management such as table creation or schemas.

If you’d like to follow along, you can set up a local SQL Server using SQLite :

If these techniques are too complex, you can get up to speed in less than 10 minutes here:

Table of Contents

Exploring the Example Data

Let’s take a quick look at the table before we filter it.

SELECT *
FROM avocados
LIMIT 50

There are also 2 mapping tables in this dataset for type and regionid which map these numeric categories to their actual text.

JOIN as a Filter

By combining an INNER JOIN with a filter using AND, we can filter the table as we are joining it, providing a small performance boost over a regular join and WHERE filter. The is handy for big datasets but you’ll need to to know your data well as you may drop rows you didn’t intend to because this technique uses the INNER JOIN.

SELECT *
FROM avocados a
INNER JOIN region b ON a.regionid = b.regionid
AND b.region = 'Denver'

Self Joins

Self joins are particularly useful when we need information in different rows to be in the same row. My dataset isn’t exactly set up to do this but we can still illustrate the point.

Let’s try to get the total volume for 2 regions on the same date on the same row.

First, note the aliases we’ve given to each avocados table. It’s the same table, but we’ve called each one a and b.

Inside the join, we’ve joined on date and regionid but we’ve used the does not equal operator (<>) to ensure we don’t join the same types together.

The result is a single row with both types on it, instead of each appearing in separate rows.

Please note, we do end up with duplicate data here as we have every combination of types. This could be avoided by filtering.

SELECT 
  a.date, a.avo_a, a.type, 
  b.avo_a, b.type, b.regionid
FROM avocados a
INNER JOIN avocados b 
  ON a.date = b.date 
  AND a.regionid=b.regionid
  AND a.type <> b.type 
WHERE a.date = '2015-01-04'
  AND a.regionid = 1

CASE WHEN

CASE WHEN is essentially an IF function, which is found in almost all coding languages. As always, remember the ‘flow’: IF -> THEN -> ELSE when writing these.

We’ll use this to map types to the actual types (we could also do this with a join to our types table).

Note that the formula starts with CASE and ends with END. We can give the column an alias using AS after the END.

We can have as many WHEN/THEN clases as needed. Here we only need one because we have 2 types and the 2nd is covered by the ELSE

SELECT date, avo_a,
 CASE
  WHEN type =1 THEN 'conventional'
  ELSE 'organic'
 END AS avo_type
FROM avocados
WHERE regionid = 1

Subqueries

A subquery is a nested query; it’s a query within a query

Let’s say we want the full dataset but only for the top 10 regions by total volume.

We could find the top 10 with the query below, then we could use this list of regions to filter the main table. But what if we got new data and the top 10 changed? We’d have to run both queries again. We can make this more dynamic using a subquery.

To get the top 10, we’d run something like this:

SELECT b.region, ROUND(SUM(a.totalvol))
FROM avocados a
LEFT JOIN region b ON a.regionid=b.regionid
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1

In the query below, we use an INNER JOIN to filter the list to the regions in the subquery. We can also bring back data from the subquery because we’ve joined it.

Please note how aliases work here. The subquery aliases are independent of the main query aliases, which is why we have 2 ‘a’ and ‘b’ aliases. To refer to subquery columns we have to use the subquery alias which is c in this case.

SELECT date, totalvol, avo_a, avo_b, b.region, c.total_totalvol
FROM avocados a
LEFT JOIN region b ON a.regionid=b.regionid
INNER JOIN (
  SELECT b.regionid, ROUND(SUM(a.totalvol)) as total_totalvol
  FROM avocados a
  LEFT JOIN region b ON a.regionid=b.regionid
  GROUP BY 1
  ORDER BY 2 DESC
  LIMIT 10) c ON a.regionid = c.regionid

Subqueries can also be used after a FROM or WHERE clause. Let’s illustrate these with a simple example.

FROM Subquery

SELECT AVG(total_totalvol)
FROM (
  SELECT b.regionid, ROUND(SUM(a.totalvol)) as total_totalvol
  FROM avocados a
  LEFT JOIN region b ON a.regionid=b.regionid
  GROUP BY 1
  ORDER BY 2 DESC
  LIMIT 10)

WHERE Subquery

SELECT *
FROM avocados
WHERE regionid IN (
  SELECT regionid
  FROM  region 
  WHERE region LIKE 'West%')

Common Table Expressions

A Common Table Expression (aka CTE, aka WITH statement) is a temporary data set to be used as part of a query. It only exists during the execution of that query; it cannot be used in other queries even within the same session

Common Table Expressions are recommended if you plan to reuse the subquery within the same query because the subquery is temporarily saved to memory, meaning it doesn’t need to be run multiple times. They generally also look much cleaner, so if you are sharing code it can be helpful for legibility.

In this example we’ll just use a CTE in a way that could also be done with a subquery and we’ll take the INNER JOIN example from above.

Here, we specify the subquery above our main query. We name it ‘top’ and specify the column names, then write a subquery. We can now call ‘top’ throughout the query to use this.

If we call it multiple types, SQL only has to generate the table once.

WITH top (regionid, total_totalvol) AS (
SELECT b.regionid, ROUND(SUM(a.totalvol))
  FROM avocados a
  LEFT JOIN region b ON a.regionid=b.regionid
  GROUP BY 1
  ORDER BY 2 DESC
  LIMIT 10)
  
SELECT date, totalvol, avo_a, avo_b, b.region, top.total_totalvol
FROM avocados a
LEFT JOIN region b ON a.regionid=b.regionid
INNER JOIN top ON a.regionid = top.regionid

Window Functions

From PostgreSQL’s documentation:

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

Basically — we can use Window functions to look at rows above and below each row.

For more information and challenges, take a look at this fantastic resource: https://www.windowfunctions.com/

Running Total

The OVER ORDER BY sequence is the window function. OVER is almost like a GROUP BY, specifying how we want to construct our sum. The ORDER BY allows us to create the running total

SELECT 
   date
 , avo_a
 , SUM(avo_a) OVER (ORDER BY date) AS running_total
FROM avocados
WHERE regionid =1 AND type =1

Total by date

Here, we use PARTITION BY to group the SUM by a given column (Date in this case)

SELECT 
   date
 , avo_a
 , type
 , SUM(avo_a) OVER (PARTITION BY date) AS date_total
FROM avocados
WHERE regionid =1

Moving Average

This is where things start to heat up. By using AVG and ORDER BY date, we can specify ROWS to take a moving average. 3 PRECEDING is a 3 day moving average, but we can specify any number.

SELECT 
   date
 , avo_a
 , AVG(avo_a) OVER (ORDER BY date ASC  ROWS 3 PRECEDING) AS date_total
FROM avocados
WHERE regionid =1 AND type=1

Get First Row

Let’s say we have varying end dates for each region. We can use Window Functions combined with a Subquery to get the first or last dates.

First, let’s define what our Subquery will be. The below query adds a row number, grouped by region id and ordered from most recent to oldest. We can select the first row from each grouping to get the newest date

SELECT 
   date
   , regionid
   , ROW_NUMBER() OVER (PARTITION BY regionid ORDER BY date desc) AS row_number
FROM avocados
ORDER BY regionid

Now, we subquery the original and take only the first row.

SELECT date, regionid
FROM
  (SELECT 
     date
     , regionid
     , ROW_NUMBER() OVER (PARTITION BY regionid 
     ORDER BY date desc) AS row_number
  FROM avocados
  ORDER BY regionid)
WHERE row_number = 1

Conclusion

Mastering these is a bonus, many of these functions help to optimise your queries or roll data up. This can be important when working with large datasets as rolling up in the SQL layer of your analysis pipeline will speed things up down the line.

More articles:

Contact me:

Sql
Advanced
Data Science
Window Functions
Subquery
Recommended from ReadMedium