The difference of Union and join

Hi! I am a little confused about the operator ( union,intersect, and except) and join( left/right,outer,inner).

I dont understand the concept of join rows. Does it mean append the rows of table B to the table A if table a and table b have the same columns names?

Hi @candiceliu93

Union allows you to combine results from two select statements into a single result, wich will include ALL the rows from both statements (it works like concat statement in pandas)

Join combines the columns from two tables (like merge in pandas) only when the data is the same, so if column a has a data entry that doesn’t exist in column b, then that entry is “deleted”

This graphic explain it better
imagen

pd.concat not only acts as union, but can act as join too if axis=1.

You can join 1 table to itself, any number of times, called a self-join.

Pandas merge allows equi-joins only, not non-equi joins. equi-joins mean ON table.join_col = table.join_col, non-equi join means any other comparison operator. These query patterns affect what indexes are good for the DB.

“Joining rows” sounds unintuitive to me too. Think of joining as getting more information about an object (customer, order). In sql, objects are stored per row and attributes are stored in columns, so getting more information means getting more columns from the same/disparate tables to enrich information about a row.

Joining is a result of normalization during database design to save space and prevent error prone updating. For analytic purposes, if memory allows, it is better to denormalize and put all information in 1 table, avoiding the need for writing lengthy joining code.

Column names do not have to be the same, usually the result of union follows the names from the 1st SELECT. What must be the same are the types of columns and number of columns. Also, even if types are same, you must make sure the physical meaning are correct (same column types can be ordered differently, a common mistake is different One-Hot Encoding runs resulting in different column orders even through they are same type).
Once you google “constraints/requirements for union”, you will know how to break it, then go try break it and see the error statements returned to enforce your memory, then think of hacks to get pass the error message while still breaking it. This process is useful for anything tech related.

Usually people think of union as stacking the rows of 2 column-aligned sets of data, but this sql exercise shows a beautiful use of union on misaligned/different columns (i just partially spoilt the answer): https://ryxcommar.com/2019/06/24/a-cool-sql-problem-and-why-it-is-also-a-bullshit-sql-problem/

1 Like

Hi! can i say union equal to outer join?

it seems that operator( union, intersect and except) can replace join at some situation. If the number of columns and columns types are same for 2 dataset, I can use union to combine data instead of using outer join? am in right?
btw,Union wont not return duplicate data, correct?

1 Like

Yes union will deduplicate. Use union all to keep the duplicates.

Any sample code you tried to substantiate this statement? Talking about concepts is good for idea exploration, but staying in that space, holding on to more and more ideas without getting to work proving or disproving them will serve to confuse.

Try setting up your own small database and tables on sqlfiddle.com or sqliteonline.com, and create your own union and join operations to see the effects.

Here is a simple explanation with visuals: https://www.essentialsql.com/what-is-the-difference-between-a-join-and-a-union/. (this website is really good explaining other concepts too) You can copy paste small code snippets from online to set up simple databases to do your experiments. Here are some experiments i have done, you can see how to set up tables from them:
Window Functions vs DISTINCT vs GROUP BY relationship
Postgres SQL GROUP BY refering to column in SELECT?
SQL summing down columns and across rows, and COALESCE
PIVOTING in SQL - Comparing CASE vs JOIN

Here too you can find examples of how other’s set up data to prove a feature: https://www.sql-workbench.eu/dbms_comparison.html

UNION and JOIN are totally different things, so cannot replace each other’s functions, but if used as part of a solution to solve a problem, they can contribute in different ways to the same result.

  1. Replacing JOIN with UNION: https://dba.stackexchange.com/questions/168699/can-i-replace-these-left-joins-with-a-union-or-some-other-optimization

In this example, the desired result is to find whether every object is in group A and/or group B.
The UNION method (2nd solution) is basically hardcoding the 1 and 0 created from IS NULL in the 1st solution. Also the 1st solution begins join from object table, the 2nd solution joins in the object table last.

  1. Replacing UNION with JOIN: https://stackoverflow.com/questions/10157468/sql-query-replace-union-all-with-joins-or-something

In the JOIN method, the OrderID = O.ID are used for alignment when Date, Machine columns are removed from GoodPieces, BadPieces tables compared to the UNION method. In the sqlfiddle demo, you don’t have to use the exact OrderID selected in that example, as long as the Date,Machine matches, you will get same results.
Eg.

insert GoodPieces (OrderID, Pieces) values
  (3, 2),
  (3, 4),
  (4, 11),
  (7, 15),
  (8, 11),
  (9, 5);

can be replaced with

  (4, 2),
  (4, 4),
  (4, 11),
  (7, 15),
  (7, 11),
  (9, 5);
1 Like

Hi @hanqi
Thank you for sharing the website and explaining the samples.

I created the mini table on sqlfiddle.com, I tried to use intersect, except and full outer join to see the difference of operator and join. But I can’t use it on that website. any other way I can test the operator?

1 Like

What does “can’t use it” mean exactly? I know what’s wrong here but it’s more helpful to share your fiddle/code and paste the error message for future questions.

The default implementation chosen on sqlfiddle is MySQL, it has none of the 3 operators.
Use another flavour or use other commands in mysql to emulate.


1 Like

I found another good example of using UNION to replace JOIN. It’s from this exercise on HackerRank: https://www.hackerrank.com/challenges/interviews/problem, from this solution: https://www.hackerrank.com/challenges/interviews/forum/comments/658211

Basically, he UNION 2 tables of 3 columns that have only the foreign key column challenge_id in common, the other 2 columns of the UNIONed tables are unrelated stats. Then added a custom column to differentiate the 2 blocks of tables so other/outer queries can recognize which block has which metrics. After joining to other information, the final SELECT contains custom logic to point to the correct block and correct metric, and the correct column(2x2=4 combinations in total) before SUM aggregation. Very nice solution, this UNION should be faster than first JOINING the 2 tables on challenge_id before joining to other tables.

To solve this, you need to look at the sample input and consider cardinalities of tables and whether joins will create duplicate rows that bloat the aggregated value wrongly.