Benoit Pimpaud

SQL is not Designed for Analytics

It sounds like a hot take. But is a language created more than 30 years ago still relevant to our analytics needs?

SQL was designed for OLTP (Online Transaction Processing). For CRUD operations (Create, Read, Update, Delete).

With the advent of data analytics, we now use SQL to transform data. To run ad-hoc analyses. To build business intelligence dashboards.

We have created tools (e.g. dbt) to streamline this process. To bring in “software best practices”. We have made SQL our de facto lingua franca for anything analytics-related.
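
For illustration, here is a minimal sketch of the kind of model dbt manages; the `stg_orders` model and its columns are hypothetical:

```sql
-- models/daily_revenue.sql: a hypothetical dbt model.
-- dbt resolves ref('stg_orders') to the upstream model's relation
-- and materializes this query as a view or table in the warehouse.
select
    order_date,
    sum(amount) as daily_revenue
from {{ ref('stg_orders') }}
group by order_date
```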

SQL doesn’t need to change. It has worked fine for decades. It’s the keystone of most of the modern world’s databases.

It’s the data and what we do with it that has changed.

Still, we rely on quite low-level frameworks (Spark, Hadoop/MapReduce) and have built our analytics semantics on top of SQL to deal with data that is no longer rectangular.

Are we missing something? What’s next after SQL?

We’re not prophets, but we can feel there is something more to add here.

This blog post is definitely not a critique of SQL. It’s a discussion of what could come next for the analytics engineering space, and why having a proper semantic for specific tasks is key to productivity and system maintainability.

Why SQL in the first place?

SQL (Structured Query Language) is a domain-specific language used for managing and querying relational databases. It is not itself implemented in C++ or any particular programming language; rather, it is a language specification that can be implemented by various database management systems (DBMS) using different programming languages. Some popular database systems are implemented in C or C++, like MySQL and PostgreSQL. Others, like Oracle, use a mix of C and assembly language.

So, while SQL itself is a language for querying and managing databases, the underlying database system is a complex piece of software that involves a variety of components, often implemented in low-level languages for performance reasons.

SQL was originally made to deal with OLTP databases. With the advent of OLAP databases, it became the natural query language there too.

For example, Google Cloud’s BigQuery service (a data warehouse) offers a SQL interface built on top of custom services that use low-level technologies from Google’s internal systems.

And it works quite well. Thanks to OLAP architectures, we can query petabytes of data in seconds, using an intuitive and easy-to-learn declarative language: SQL.
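
For example, one declarative statement (over a hypothetical events table) is enough to ask a warehouse to scan and aggregate billions of rows:

```sql
-- One declarative statement; the engine decides how to
-- distribute, parallelize, and execute the scan.
select
    event_date,
    count(*) as events,
    count(distinct user_id) as active_users
from analytics.events
group by event_date
order by event_date;
```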

End of story? Not really.

The need for another semantic

If you have ever worked on large analytics projects, you have probably had a hard time managing SQL queries. Reading and maintaining many queries of more than 200 lines is hard (not to mention more than 500…).

Projects such as dbt answer this problem by providing templating and documentation out of the box. Still, the challenge then becomes managing hundreds of dbt models. Along the same lines, we have seen a comeback of basic data modeling in our discussions, and even some new approaches.

But in the end the problem is the same: we are thinking in rectangles, with SQL as our main semantic (which can only return rectangles). It works, but it is neither the most natural nor the most optimized way to express our data models.

Everyone is like “Haha, it’s like Yoda-speak!”, but beyond toy examples it’s like “Yoda reads The Odyssey”.

The reason we struggle with SQL syntax for longer queries isn’t because we are dumb; it’s because it’s not a good syntax for our problem.

SQL was supposed to be so easy that non-technical professionals could use it. This is nothing short of comical now.

Data analytics requires more than rectangular tables. We often group over different dimensions, nest results over different time ranges, filter at different levels, etc.
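
A minimal sketch of the friction, over a hypothetical orders table: to report revenue per product alongside its category total, plain SQL forces us to build two rectangles and join them back together:

```sql
-- Two aggregation levels means two grouped subqueries stitched together,
-- because a SQL query can only return one flat rectangle.
with category_totals as (
    select category, sum(amount) as category_revenue
    from orders
    group by category
),
product_totals as (
    select category, product, sum(amount) as product_revenue
    from orders
    group by category, product
)
select p.category, p.product, p.product_revenue, c.category_revenue
from product_totals p
join category_totals c on c.category = p.category;
```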

It sometimes feels like we are putting a square peg in a round hole.

We are thinking in fully Euclidean terms, while our data reality is non-Euclidean.

We should seek a proper semantic for the problem at hand. We are in a quite young field, one that has only recently become an “engineering” domain (“analytics engineering”). Shouldn’t we try new ways?

We can see a similar trend in other domains that have moved toward more declarative approaches with proper semantics.

For Terraform, the limits of HashiCorp’s HCL are also its strengths. Those limitations mean it is easy for humans and computers to read, write, and understand. With HCL, humans write Terraform code when it suits them, while computers can also provide different views of that code as needed. Likewise, moving away from jQuery to React made composable and reusable code the baseline in web development.

Across my experience, one of the biggest pain points has been managing large codebases of SQL (with dbt or not) within teams of data analysts and data engineers. Even after putting software best practices in place (CI/CD, linting, peer review, etc.), things were not easy. Everyone has a different background, a different experience.

And dbt is jQuery, not Terraform (nor React).

A well-written, complicated mathematical equation can express clearly and tersely what English cannot.

What’s Next?

We might think the end-game solution is talking to the computer in plain English. It’s not:

If you’re using a language model to write code, what you’re doing is using English as a programming language. It’s the exact same job, just a different representation. It’s exactly like using ChatGPT to compile English into Python code the way gcc compiles C++ into binary. So you’re choosing English as your programming language over Python, C++, and all the other programming languages. And English is a terrible programming language.

Because that’s another thing that people are confused about when it comes to programming, and this is something that even programmers don’t all recognize. The code isn’t for the computer — it’s for you. It’s for humans. If the coding language was meant for the computer, we would all be writing in pure binary instead of these abstracted and symbolic languages. The Python/C++/whatever-code isn’t some obstacle that we are trying to overcome. The code is the interface that we designed to be able to program the computer. It’s what we need. It’s objective, explicit, unambiguous, (relatively) static, internally consistent, and robust. English has none of these properties — it’s subjective, meaning is often implicit, and ambiguous, it’s always changing, contradictions appear, and its structure does not hold up to analysis.

Designing a good semantic for our problem is the key. And we have recently seen many new ideas going in that direction. New languages and engines are being created: Malloy, SOL, PRQL, Substrait, ibis, etc.

[Figure: the nest operator in Malloy]
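
For contrast, producing a comparable nested result in plain SQL takes explicit struct-and-array machinery; a sketch in BigQuery dialect over the same hypothetical orders table:

```sql
-- One row per category with a nested array of per-product totals:
-- what a nest operator states in one clause takes ARRAY_AGG(STRUCT(...)) here.
select
    category,
    sum(product_revenue) as category_revenue,
    array_agg(struct(product, product_revenue)
              order by product_revenue desc) as by_product
from (
    select category, product, sum(amount) as product_revenue
    from orders
    group by category, product
) as per_product
group by category;
```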

They are full of promise. They are still young, but they pave the way to something easier. Easier to write, to understand, to share, to maintain.

We see the same kind of idea in other data engineering spaces with Kestra (declarative pipeline orchestration) or RillData (BI as code), with the accent put on the “as code” paradigm.

Under the hood it means easier maintainability, easier onboarding, and thus easier discussion. It brings everyone back to the table. Different personas with different backgrounds can finally speak the same language.

We’re only at the beginning of defining our new semantic. New languages and new tools will be created. But whatever the “winner”, the end result will be better communication between our brains and the computer. Between engineering and business.

I’m a data engineer who has worked in several industries such as journalism, retail, professional sports, music, and now data software. If you’re looking for great data resources and thinking, subscribe to my newsletter 👀 From An Engineer Sight.
