avatarZach Quinn

Summary

This article discusses five bad habits in SQL programming that developers should avoid to improve their code quality and efficiency.

Abstract

The article "5 SQL Bad Habits You Need to Break" highlights common mistakes made by SQL programmers, from beginners to professionals. These bad habits include using numbers instead of column names, overusing subqueries, using the EXCEPT clause, not using dynamic date filters, and equating processing time with run time. The author emphasizes the importance of breaking these habits to improve code legibility, scalability, and efficiency. The article also provides examples and explanations for each bad habit, making it a valuable resource for SQL developers looking to improve their skills.

Bullet points

  • Using numbers instead of column names can lead to illegible and hard-to-follow queries, especially in large datasets.
  • Overusing subqueries can result in slower query execution and redundant operations.
  • Using the EXCEPT clause can lead to careless coding and potential duplicate columns.
  • Not using dynamic date filters can make queries less efficient and more prone to errors.
  • Equating processing time with run time can lead to inaccurate expectations and slower query execution.

5 SQL Bad Habits You Need to Break

From beginners to professionals, chances are we’re all guilty of at least one of the following bad SQL habits.

Nurture good SQL habits before starting your first SQL coding project with Learning SQL’s free, 5-page project guide. Download a PDF here.

Photo by Manan Chhabra on Unsplash

Unlike a bad habit such as smoking, no nicotine gum exists to wean programmers off bad habits in SQL or any other language.

However, unlike smoking, these SQL habits aren’t hurting anyone, so this begs a somewhat obvious question.

Why stop?

You need to stop writing good queries badly for the following reasons:

  • Illegible, hard-to-follow queries are an obvious sign of a beginner (or a beginner who lacks the motivation to self-police and improve)
  • Badly written queries are harder to reproduce or alter (a must if working in a team environment or sharing work with non-technical stakeholder)
  • Your bad habits can spread and become a bad influence on your susceptible teammates (meaning the whole team’s standards for style and syntax go out the window)

Recently, a fellow Learning SQL contributor Aparna Gurav wrote about ways to improve your SQL queries.

While the advice she offered was sound, I was shocked at how many of the tips were surprisingly subtle.

Be warned.

Just like subtle improvements can elevate your queries, subtle bad habits can reduce the quality and clarity of your code in the long run.

Your Days of Using Numbers Are Numbered

I learned SQL to write cool, complex, awe-inspiring queries.

I don’t want to write column names over and over again.

Since looping in SQL is a complex task, especially in BigQuery, I need some kind of variable that I can use repeatedly.

So I use numeric values as shorthand, which, on the surface is totally fine.

-- These two queries are the same 
-- Ver. 1 
SELECT col_1, col_2, col_3, COUNT(1) AS col_count
FROM sample_tab
GROUP BY col_1, col_2, col_3
ORDER BY col_count DESC
-- Ver. 2 
SELECT col_1, col_2, col_3, COUNT(1) AS col_count
FROM sample_tab
GROUP BY 1, 2, 3
ORDER BY 4 DESC

The problem with this habit is its lack of scalability.

In other words, if your query contains, let’s say, more than 50 columns and has aggregate functions interspersed, it’s going to be difficult to keep track of your operations by the time you get to any GROUP BY clause.

-- Query with a lot of columns. First, with words. 
-- Column names will be numbered so the example can be better understood. 
SELECT col_1, col_2, col_3, MAX(col_4) AS max_col, col_5, 
col_6, col_7, col_8, col_9, col_10, MIN(col_11) AS min_col, col_12, col_13,
col_14, col_15, col_16, SUM(col_17) AS sum_col, col_18, col_19, COUNT(1) AS col_count
FROM sample_tab
-- There are aggregate columns. This is where the real fun starts.
GROUP BY col_1, col_2, col_3, col_5, col_6, col_7, col_8, col_9, col_10, col_12, col_13, col_14, col_15, col_16, col_18, col_19

If this looks messy to you, you’re going to want to avert your eyes for this next bit.

-- Now let's see it with numbers.
SELECT col_1, col_2, col_3, MAX(col_4) AS max_col, col_5, 
col_6, col_7, col_8, col_9, col_10, MIN(col_11) AS min_col, col_12, col_13,
col_14, col_15, col_16, SUM(col_17) AS sum_col, col_18, col_19, COUNT(1) AS col_count
FROM sample_tab
GROUP BY 
1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19

Assuming you can follow that (and depending on how much caffeine I have sometimes I can), I’ll show you how to ruin a SQL developers day in one step.

And to make it more difficult, I’m taking away the numbers appended to ‘col’ and changing the variable names to more realistic column names.

SELECT age, address, ID, income, purchase, 
payment, gross_income, credit_card_number, MAX(purchase) AS max_purchase, revenue, last_position, insurance_y_n, demo_info,
MAX(date) AS recent_date, spend_ytd, phone_number, country_of_origin, state, MIN(amount_spent) AS min_amount, zip, po_box, COUNT(1) AS col_count
FROM sample_tab
GROUP BY 
1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19

So, a couple things here.

I’m sure you noticed that I moved the aggregate operations to different columns.

You may not have noticed that I increased the column count.

Either way, I don’t even need to plug this into a SQL IDE to know this won’t run.

And this is the downside of using numbers:

Numbered columns are static values. Variable names are forever (or at least until they’re reassigned).

Sooo Many Sub Queries

Photo by Thomas Haas on Unsplash

Google BigQuery supports substantial sub query depth.

While I haven’t found an official limit of GCP sub query depth, for comparison, Microsoft’s SQL Server can support up to 32 levels of nested queries.

Does this mean you should test those limits?

Absolutely not.

Even if you’re an absolute beginner at SQL you need to know the following:

SQL queries execute from the inside out.

That means if you have a query with multiple sub queries, this it may take longer to run because it needs to complete more steps in order to execute.

For context, here is a query I wrote that involves about eight levels of sub queries (a quarter of the number supported by MS SQL Server).

Aside from execution issues, the fact that I’m only conducting one operation per query means that many of these sub queries are fairly redundant.

Thanks to BigQuery formatting, it’s not necessarily the readability that’s the issue here.

It’s the fact that many of these operations can be completed elsewhere in the code.

To shake this bad habit, look for opportunities to conduct operations earlier in the code and, better yet, ways to combine operations/steps to avoid blocks like this monstrosity.

Full disclosure: This is a query I wrote when I began my current position to test both GCP and my limits for sub query depth. The fact that I’m able to reflect and re-think some of these operations and logic goes back to my earlier point about good code reflecting a certain degree of maturity and authority as you gain real-world, professional experience.

Pardon the interruption: To receive my latest writing, consider following me.

Try to Avoid EXCEPT

Sometimes I think I’m an experienced SQL developer.

Then I find myself writing queries like the following:

SELECT * EXCEPT (a_col, b_col, c_col) 
FROM sample_tab

It looks like I’m being smart and using EXCEPT to select only the columns I want.

But this negates a pretty crucial fact.

You’re still using SELECT *, which tops the list of SQL bad habits.

SELECT * of any variety, whether it’s SELECT DISTINCT * or SELECT * EXCEPT() shows a kind of carelessness in composing a query.

You’re not thinking about why you’re doing what you’re doing. There’s also the possibility that if you forget a column using SELECT * EXCEPT() you’ll end up with duplicate columns, which is decidedly unprofessional.

This can be an issue if you are trying to make any changes to the column order when overwriting or copying a table.

To me, it is much more professional to state what a query must include rather than what it can ignore.

Not Using a Dynamic Date Filter

Using a date as a filter parameter in SQL is super easy.

Unlike Python, you don’t have to use an external library like datetime or time delta to specify a date range.

You can just use BETWEEN or a comparison operator (>, <, etc.).

SELECT some_col, date_col
FROM sample_tab
WHERE date_col BETWEEN '2022-07-01' AND '2022-07-15'

Note: This query assumes date_col is purely a DATE type and not a DATETIME field, which would involve additional conversion.

However, both dates are static values.

If you’re working in a production environment and scheduling your queries to run on specific increments, you can’t get away with manually inserting a date each time.

Like manually numbering columns, having to repeatedly insert a date can get tedious, redundant and raise the potential for breakage.

Instead, in many SQL dialects (and certainly in BigQuery), it is more efficient (and frankly, less time-consuming) to define a variable using DECLARE.

However, don’t make the mistake of thinking defining a variable with a static value is much better.

DECLARE from_date DATE DEFAULT '2022-07-01';
DECLARE to_date DATE DEFAULT '2022-07-15';

Instead, you can use a date function to create the conversion dynamically once and then use it throughout the script.

DECLARE from_date DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY);
DECLARE to_date DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);

Admittedly, this advice also extends to the use of any other repeated variable, whether they are of type STRING, INT, FLOAT or BOOL.

Equating Processing Time with Run Time

My favorite and least-favorite feature of BigQuery is the data processing estimate displayed before executing a query.

GCP processing estimate. Screenshot by the author.

I’m not going to lie, seeing that green check pop up after debugging a query can be quite a dopamine rush (as lame as that sounds).

However, after working with BigQuery for a while I’ve learned (rather painfully) a fundamental truth about SQL environments:

Processing time != run time.

The infrastructure underlying a SQL database/environment is rather complex, so I’ll refer you to an overview of BigQuery’s Dremel engine for greater context.

While it typically takes larger queries (> 1 TB) longer to execute, this is due to many factors, not just the amount of data to be processed:

  • Operations (aggregate or window functions that need to iterate over every single row will slow down processing time)
  • Query depth (more queries usually means longer execution)
  • Meta reads (multiple references to a dataset’s table schema will decrease query speed)
  • Slot usage and resource allocation (running at peak times will likely yield a different performance than running at off-peak times)

Although this final bad habit doesn’t necessarily relate to syntax it should shape your mindset and approach to writing functional and replicable queries.

Recognizing and self-correcting your own SQL bad habits can help you become a more professional, confident and efficient SQL programmer.

Create a job-worthy data portfolio. Learn how with my free project guide.

Sql
Data Engineering
Data Science
Business Inteligence
Learning To Code
Recommended from ReadMedium