This article aims to introduce the world of SQL, the concepts involved, and some common confusions and pitfalls that people face on their journey.
- DQL - Data Query Language
- DML - Data Manipulation Language
- DDL - Data Definition Language
- TCL - Transaction Control Language
- DCL - Data Control Language
Most people begin with DQL (the familiar SELECT * FROM table) because usually, the tables are already properly designed with data inserted, and the common use case is to answer practical questions by querying the data.
DML comes next, when you are not satisfied with the data you are given and want to INSERT, UPDATE, or DELETE something, either to clean data up or to test how application code behaves with specific data properties.
Then comes DDL, where you CREATE or TRUNCATE entire tables instead of just manipulating parts of the data in a table. This happens when data is migrated across tables for backup purposes, or during database design when normalization or denormalization is done.
TCL is when you want to batch statements up to either complete an entire sequence of steps successfully with COMMIT or not do it at all by ROLLBACK. This mechanism ensures that more complex changes to the database do not leave it in an unstable state when something fails midway.
Lastly, DCL is used to control who can run what SQL statements on what tables. For example, some people may only be granted permission to do DQL while others may be able to do DML and DDL. This is to protect the database from data corruption from users running commands and tools they may not be familiar with.
DQL is mainly the job of Data Analysts, while Data Engineers would additionally have to be good at DML, DDL, and TCL. Database Administrators would also know DCL.
Beginners may think that a SQL statement begins with SELECT, but once they learn that execution logically starts with FROM, things become much easier to reason about. Specifying FROM first also helps DBeaver autocomplete the columns when you reach SELECT. This order of execution also explains why column aliases defined in SELECT may not be available in clauses that execute earlier (e.g. GROUP BY, HAVING).
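As a small illustration using Python's built-in sqlite3 module (the items table and its columns are invented for this demo), an alias defined in SELECT is safely usable in ORDER BY, which executes after SELECT; whether earlier clauses such as WHERE, GROUP BY, or HAVING accept it varies by engine, as discussed later in the article.

```python
import sqlite3

# demo table, not from the article
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, price INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [("a", 2), ("b", 5), ("c", 3)])

# ORDER BY comes after SELECT in the logical order of execution,
# so the alias 'doubled' is available to it in every engine
rows = conn.execute(
    "SELECT name, price * 2 AS doubled FROM items ORDER BY doubled DESC"
).fetchall()
```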
JOIN is arguably the most important keyword, since it occurs in the FROM clause (1st in the order of execution) and defines the data available to the rest of the keywords.
When faced with a problem, a useful thought process is to start thinking about what data is required, in which columns and which tables it exists, and whether a JOIN is even necessary. If a JOIN is confirmed necessary, the next decision is whether it should be INNER, LEFT, or FULL OUTER. The latter two can generate NULLs for unmatched columns.
This affects whether COUNT(*) or COUNT(right_table_col) should be used with GROUP BY or OVER, since COUNT(*) counts rows even when the column is NULL while COUNT(col) does not.
A common pattern is having Customers in the left table LEFT JOINed to Orders in the right table, with COUNT(any_col_from_right_table) to count how many orders each customer made after grouping the joined tables on customer_id, the key being to also include customers with 0 orders. This is not possible with an INNER JOIN, or even with a LEFT JOIN and COUNT(any_col_from_left_table). These subtleties in how keywords and SELECT patterns interact make SQL difficult for beginners.
We should also be aware that NULLs may not only be generated by JOINs; they may already be part of the data for whatever reason. This means COUNT(any_col_from_right_table) is no longer exactly accurate, as there is a risk of undercounting when the column can contain NULLs. In this case it is more accurate to COUNT(any_col_with_NOT_NULL_constraint), such as the right table's primary key, to be confident that the counts are not reduced by NULLs.
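A minimal sketch of this counting subtlety, run through Python's sqlite3 module (the customers/orders tables and their columns are invented for the demo): bob has no orders, and one of alice's two orders carries a NULL note.

```python
import sqlite3

# demo tables, not from the article
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, note TEXT);
INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
INSERT INTO orders VALUES (10, 1, NULL), (11, 1, 'gift wrap');
""")

rows = conn.execute("""
SELECT customers.name,
       COUNT(*)           AS count_star,  -- counts every joined row, even bob's all-NULL one
       COUNT(orders.id)   AS count_pk,    -- NOT NULL right-table column: correct order count
       COUNT(orders.note) AS count_note   -- nullable column: undercounts alice
FROM customers
LEFT JOIN orders ON orders.customer_id = customers.id
GROUP BY customers.id
ORDER BY customers.name
""").fetchall()
```

COUNT(*) wrongly credits bob with 1 order, COUNT(orders.note) wrongly credits alice with 1; only the NOT NULL primary key gives 2 and 0.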
Besides the mechanics of joining, we should also be aware of the physical meaning of a table and why we may want to join it multiple times or in a certain way into an existing set of tables.
There are 2 black lines and 1 red line among the 3 tables; we could use any 2 of the 3 lines to connect the 3 tables, so there are 3C2 = 3 ways to join them together. Which of these ways is correct, and whether we should use all 3 lines instead of just 2, depends on the problem.
In this example, a User can take multiple actions (writing posts and appearing as a user_id in the Posts table, or liking posts and appearing as a user_id in the Likes table).
We must then be clear whether to join the Users table into the existing set of tables (Posts JOIN Likes) on Posts.user_id (bottom black line) or on Likes.user_id (top black line), and we may even need to join the same Users table twice (both black lines: once for Author details, once for Liker details) into the existing set of tables.
This is counterintuitive because people may think that since we already joined once, we should have all the necessary data. The missing concept here is that JOIN is not only about what tables get added, but also the joining conditions, which are determined by the physical meaning of the analysis conducted.
```sql
WITH liker_per_author AS (
    SELECT posts.user_id AS author_id,
           likes.user_id AS liker_id,
           COUNT(*) AS like_count
    FROM likes
    JOIN posts ON likes.post_id = posts.id
    GROUP BY author_id, liker_id
    ORDER BY author_id, like_count DESC
)
SELECT authors.first_name || ' ' || authors.last_name AS author_full_name,
       likers.first_name || ' ' || likers.last_name AS biggest_fan_full_name,
       like_count
FROM liker_per_author
JOIN users authors ON liker_per_author.author_id = authors.id
JOIN users likers ON liker_per_author.liker_id = likers.id
GROUP BY authors.id
ORDER BY author_full_name
```
```sql
WITH liker_per_author AS (
    SELECT posts.user_id AS author_id,
           likes.user_id AS liker_id,
           COUNT(*) AS like_count
    FROM likes
    JOIN posts ON likes.post_id = posts.id
    GROUP BY author_id, liker_id
    -- ORDER BY author_id, like_count DESC
), ranks AS (
    SELECT authors.first_name || ' ' || authors.last_name AS author_full_name,
           likers.first_name || ' ' || likers.last_name AS biggest_fan_full_name,
           like_count,
           RANK() OVER (PARTITION BY author_id ORDER BY like_count DESC) AS rank
    FROM liker_per_author
    JOIN users authors ON liker_per_author.author_id = authors.id
    JOIN users likers ON liker_per_author.liker_id = likers.id
    ORDER BY author_full_name
)
SELECT author_full_name, biggest_fan_full_name, like_count
FROM ranks
WHERE rank = 1
```
The 1st solution did ORDER BY author_id, like_count DESC inside the WITH clause just so it could exploit sqlite's lax behaviour to SELECT the first biggest_fan_full_name of each group in the final SELECT.
Note how these 2 columns do not appear in the GROUP BY of the final SELECT, and are not functionally dependent on the grouped column authors.id. (You can't see this point without the data; there is more detail under the GROUP BY section later in the article.)
It also does not cater for ties in like_count, which is what the window RANK function in the 2nd solution handles.
It may not be obvious to beginners that filtering can be done at the joining stage, by adding extra conditions to the JOIN ... ON clause, instead of after the join with WHERE. Depending on the type of join, this can yield different behaviour: https://www.pluralsight.com/guides/using-on-versus-where-clauses-to-combine-and-filter-data-in-postgresql-joins
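To see the difference concretely, here is a small sketch using Python's sqlite3 module (the customers/orders tables and the amount > 100 predicate are invented for the demo): the same filter placed in ON versus WHERE changes which rows of a LEFT JOIN survive.

```python
import sqlite3

# demo tables, not from the article
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
INSERT INTO orders VALUES (10, 1, 50), (11, 2, 200);
""")

# Filter in ON: unmatched left rows are kept (alice survives with a NULL amount)
on_rows = conn.execute("""
SELECT customers.name, orders.amount
FROM customers
LEFT JOIN orders ON orders.customer_id = customers.id AND orders.amount > 100
ORDER BY customers.name
""").fetchall()

# Filter in WHERE: runs after the join; alice's NULL amount fails the predicate
where_rows = conn.execute("""
SELECT customers.name, orders.amount
FROM customers
LEFT JOIN orders ON orders.customer_id = customers.id
WHERE orders.amount > 100
ORDER BY customers.name
""").fetchall()
```

The ON version keeps the LEFT JOIN's promise of one row per customer; the WHERE version silently turns it back into an INNER JOIN.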
Knowing what conditions to use in these filters is often how interviews test problem-solving skills. Here's a challenging example involving non-equi joins (https://ryxcommar.com/2019/06/24/a-cool-sql-problem-and-why-it-is-also-a-bullshit-sql-problem/)
Non-equi joins are joins that do not use = in the JOIN ... ON condition, but one of the other 5 operators (<>, <, <=, >, >=). A one-sided comparison is a neat trick for deduplication when self-joins are used to pair every row with every other row and the problem requires enumerating all pairs.
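The deduplication trick can be sketched like this with Python's sqlite3 module (the players table is invented for the demo): joining on a.id < b.id keeps each unordered pair exactly once and drops self-pairs, where = or <> would not.

```python
import sqlite3

# demo table, not from the article
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO players VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

# a.id < b.id: (a,b) is kept, (b,a) and (a,a) are not
pairs = conn.execute("""
SELECT a.name, b.name
FROM players a
JOIN players b ON a.id < b.id
ORDER BY a.id, b.id
""").fetchall()
```

Three rows enumerate all 3C2 = 3 pairs, instead of the 9 rows a full self-join would produce.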
In Feature Engineering for Machine Learning, we often want to calculate statistics that were true at a given point in time, ignoring all the future data that we now have. It is common to JOIN ... ON ... AND col <= datetime_col in such cases to exclude rows past that point in time, so each row gets a different aggregation window based on its own point in time.
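A sketch of this point-in-time pattern with Python's sqlite3 module (the events table, its columns, and the running-total statistic are invented for the demo): the table is self-joined so each row aggregates only rows at or before its own timestamp.

```python
import sqlite3

# demo table, not from the article
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INTEGER, ts TEXT, amount INTEGER);
INSERT INTO events VALUES
  (1, '2021-01-01', 10),
  (1, '2021-02-01', 20),
  (1, '2021-03-01', 30);
""")

# h.ts <= e.ts: each event e sees only history h up to its own point in time
rows = conn.execute("""
SELECT e.ts, SUM(h.amount) AS running_total
FROM events e
JOIN events h ON h.user_id = e.user_id AND h.ts <= e.ts
GROUP BY e.user_id, e.ts
ORDER BY e.ts
""").fetchall()
```

Each row's aggregation window grows with its timestamp: 10, then 10+20, then 10+20+30.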
Try this fun exercise! (The top answer does not even need a non-equi join, and demonstrates how one-sided comparison operators achieve deduplication: https://www.interviewquery.com/questions/flight-records)
Can you substitute them for each other? https://stackoverflow.com/questions/6589695/sql-join-using-union-union-using-join
This stretches the mental gymnastics a little, but those who enjoy a bit more relational algebra theory behind SQL may like to take on this challenge to understand or explain their similarities and differences.
To summarize, the decision here is: do you want only a boolean result, with the ability to short-circuit the query on the subqueried table (EXISTS), or do you also want the data from the table on the right-hand side of the JOIN?
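The semantic difference can be sketched with Python's sqlite3 module (the customers/orders tables are invented for the demo): EXISTS yields at most one row per customer and exposes no order columns, while JOIN duplicates the customer once per matching order and makes the order columns available.

```python
import sqlite3

# demo tables, not from the article
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
INSERT INTO orders VALUES (10, 1), (11, 1);
""")

# EXISTS: boolean test, one row per qualifying customer, no order data
exists_rows = conn.execute("""
SELECT name FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
""").fetchall()

# JOIN: alice appears once per matching order, with the order's columns
join_rows = conn.execute("""
SELECT c.name, o.id FROM customers c
JOIN orders o ON o.customer_id = c.id
""").fetchall()
```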
Cardinality (https://en.wikipedia.org/wiki/Cardinality_(data_modeling)) is critically important to understand, not only in data modelling (DDL) but also during querying (DQL). It tells the analyst how many rows in the right table of a JOIN can be expected to match a single row in the left table.
For example, a customer can have many orders, so the same Customer_Id is expected to appear in the Orders table multiple times, but 0 times is possible too. (More details: https://vertabelo.com/blog/crow-s-foot-notation/).
Note that these one-to-one, one-to-many, and many-to-many relationships are properties of the data model, not the data. The cardinality of customer-orders is not one-to-one just because every customer so far has placed only 1 order. A data model is expected to prepare for likely future scenarios, and in real life it is very likely that the same customer will return to order again.
Understanding cardinalities prevents surprises when the joined result set bloats in number of rows, and influences whether DISTINCT should be used to deduplicate. Such row bloating often also causes inflated aggregation results: https://alexpetralia.com/posts/2017/7/19/more-dangerous-subtleties-of-joins-in-sql.
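Here is a minimal sketch of such row bloat with Python's sqlite3 module (the customers/orders tables and the credit column are invented for the demo): a one-to-many join fans one customer row out to three, tripling a SUM over a customer-level column.

```python
import sqlite3

# demo tables, not from the article
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, credit INTEGER);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'alice', 100);
INSERT INTO orders VALUES (10, 1), (11, 1), (12, 1);
""")

# One customer row fans out to three joined rows (one-to-many cardinality),
# so alice's credit is counted three times
bloated = conn.execute("""
SELECT SUM(customers.credit)
FROM customers
JOIN orders ON orders.customer_id = customers.id
""").fetchone()[0]

true_total = conn.execute("SELECT SUM(credit) FROM customers").fetchone()[0]
```

The joined SUM reports 3x the real credit; aggregating before the join, or deduplicating, avoids the inflation.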
Fan Traps and Chasm Traps are also things to watch out for: see the "Fan trap and chasm trap" conceptual-model question on Stack Overflow.
People starting to learn SQL with sqlite unfortunately miss the chance to be alerted by this error (SQL Server's wording):
Column 'X' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
This message means that columns you SELECT have to either appear in GROUP BY or be aggregated (barring functional dependencies, which are out of scope for this article). The intuition for aggregating is that GROUP BY condenses the many rows in a group into a single row, so a statistic condensing a column's list of values into a single value must be applied; otherwise, how would the grouping make sense?
Sqlite, being a very lax engine, simply picks a value from one row of each group (an arbitrary one, per the SQLite docs) if this mistake is made. Stricter engines like SQL Server or Postgres would have errored on this.
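This laxness can be demonstrated with Python's sqlite3 module (the sales table is invented for the demo): amount is neither grouped nor aggregated, yet sqlite runs the query and quietly returns one row's value per group instead of raising an error.

```python
import sqlite3

# demo table, not from the article
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, amount INTEGER);
INSERT INTO sales VALUES ('east', 10), ('east', 99), ('west', 5);
""")

# 'amount' is a bare column: stricter engines reject this query,
# sqlite silently picks one row's amount per group
rows = conn.execute(
    "SELECT region, amount FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

Which amount the east group reports (10 or 99) is not guaranteed, which is exactly why the stricter engines refuse to run the query.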
To be clear, there are table aliases and column aliases. Someone who understands the order of execution may then be confused to see that MySQL allows column aliases in GROUP BY and HAVING, Postgres allows column aliases in GROUP BY but not HAVING, and the strictest engines (MS SQL and Oracle) allow column aliases only in ORDER BY, not in GROUP BY or HAVING. The ORDER BY case makes sense, since it comes after SELECT in the order of execution.
These conflicting behaviours exist because different database engines have different implementations that aim to provide convenience to the user.
It is not always about convenience though, as Postgres requires a table alias after a subquery in FROM even when it is never referenced, which Oracle does not need.
Division can really trip people up, because the output depends on the datatype of the input columns. If INTEGER inputs are provided, the output is truncated to an INTEGER, so a true ratio of 1.9 comes out as 1 (note the massive round down).
Even harder to spot is AVG, because it implicitly does a division under the hood but contains no visible division sign like /.
Additionally, if the numerator/denominator or the inputs to AVG are integers, it may not be immediately obvious that the output should be fractional, e.g. 2/3 returning 0, or AVG over an integer column whose true mean is fractional.
The explicit way to solve this is to convert the INTEGER to FLOAT with CAST(col AS FLOAT), or simply 1. * col. The division outputs a FLOAT if either the numerator or the denominator is type cast; casting the denominator with the multiplication trick needs extra brackets, as in numerator / (1. * denominator), since without brackets the division would happen before the multiplication.
However, such integer division is not always a bad thing. It is a very helpful feature when we want to map multiple items into 1 destination, such as creating buckets for GROUP BY. Dividing 0,1,2,3,4,5 by 2 gives 0,0,1,1,2,2 (3 perfect buckets for grouping).
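Both faces of integer division can be shown with Python's sqlite3 module (the inline values are invented for the demo; note sqlite's FLOAT-affinity type is spelled REAL): truncation when a true ratio is wanted, and clean bucketing when grouping is wanted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Integer inputs truncate; casting either side yields a float
int_div, float_div = conn.execute("SELECT 3 / 2, CAST(3 AS REAL) / 2").fetchone()

# Bucketing: n / 2 maps 0,1,2,3,4,5 into buckets 0,0,1,1,2,2
buckets = conn.execute("""
SELECT n / 2 AS bucket, COUNT(*)
FROM (SELECT 0 AS n UNION ALL SELECT 1 UNION ALL SELECT 2
      UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5)
GROUP BY bucket
ORDER BY bucket
""").fetchall()
```

The same truncation that silently turns 1.5 into 1 also folds every 2 consecutive integers into one GROUP BY bucket.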
SUM, COUNT, and AVG normally do their calculations on the entire list of values, but if they appear in a window function whose OVER clause uses ORDER BY, they become cumulative.
For example, GROUP BY with SUM on a column of 1,2,3 gives 1+2+3 = 6, but in SUM(col) OVER (PARTITION BY grp_col ORDER BY other_col) it becomes 1, 1+2, 1+2+3 = 1, 3, 6. This behaviour can be changed if the frame spec of the window is changed from its default of RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, but that is out of scope for this article.
If no ORDER BY is used, it becomes 6, 6, 6, as if the same value were copy-pasted across the entire group of 3 rows (partition, to be exact in SQL terminology).
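Both behaviours can be sketched with Python's sqlite3 module (the table and its grp/ord/val columns are invented for the demo; window functions need SQLite 3.25+, bundled with recent Python builds):

```python
import sqlite3

# demo table, not from the article
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (grp TEXT, ord INTEGER, val INTEGER);
INSERT INTO t VALUES ('a', 1, 1), ('a', 2, 2), ('a', 3, 3);
""")

# With ORDER BY in the window: cumulative sums 1, 3, 6
with_order = [r[0] for r in conn.execute(
    "SELECT SUM(val) OVER (PARTITION BY grp ORDER BY ord) FROM t ORDER BY ord"
).fetchall()]

# Without ORDER BY: the partition total 6 repeated on every row
without_order = [r[0] for r in conn.execute(
    "SELECT SUM(val) OVER (PARTITION BY grp) FROM t ORDER BY ord"
).fetchall()]
```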
The bottom of this page shows a summary of what aggregations allows/requires what syntax: https://www.sqlservercentral.com/articles/understanding-the-over-clause
- What information is required?
- What columns in which tables does such information exist in?
- Must I join? Which tables? INNER, LEFT, OUTER?
- Are there aggregations to be done? Which columns to GROUP BY, which ones to aggregate?
- How does my query handle NULL?
- Could I have used Common Table Expressions (WITH) to break down deeply nested subqueries?
- Could I have avoided joining a table at all because the information is already available in another table I'm using?
- This may not be obvious at the initial stages of problem solving, when you started with fewer tables that contain less duplication of information
- An example where I identified joining unnecessary tables: https://hanqi01.medium.com/i-like-point-5-on-using-cte-to-show-dependencies-9665707d637b (long comment referencing long article)
- Are the indentations clear?
- Are the table and column aliases easily understood?
- Are the queries Sargable (make use of indexes): https://www.brentozar.com/archive/2019/12/the-two-ways-to-fix-non-sargable-queries/
Most people who come to SQL from an existing programming language may be used to implementing things in a procedural way. SQL opens up an entirely new thought process of set-based programming.
- Interview Query: https://www.interviewquery.com/questions?t=sql
- LinkedIn Skill Assessments: https://www.linkedin.com/help/linkedin/answer/111413/available-skill-assessments-on-linkedin?lang=en
- Hackerrank: https://www.hackerrank.com/challenges/15-days-of-learning-sql/problem
- SQL Antipatterns: https://pragprog.com/titles/bksqla/sql-antipatterns/
- SQL Tips and Techniques: https://flylib.com/books/en/1.490.1/
- LinkedIn: https://www.linkedin.com/in/eric-weber-060397b7/
We can connect on LinkedIn (https://www.linkedin.com/in/hanqi91/) if you want to discuss more.