How to split a table column which contains a list and put it in a separate normalised table?

So let's say I have data that looks like this, where a vendor used a variety of tags in one column. However, this isn't a normalised table, so I would have to split it up:

Vendor_tag | Vendor_tag_name
---------- | -------------------------
1,2        | burger, shawarma
2,3,4      | shawarma,icecream,sweets
1,4        | burger, sweets

Using pandas, how would I be able to turn it into this table:

VendorTagID | VendorID | TagID
----------- | -------- | -----
          1 |        1 |     4
          2 |        1 |     8
          3 |        3 |    10
          4 |        3 |    91

I can't seem to get my head around what kind of logic I would need to output that table. I guess I would have to access the list in the column and iterate through it, but there are going to be duplicates? I'm not too sure how to go about doing this.

Hello,

What you can do in this case is:

  • Step 1 - convert the comma-separated data in one column into multiple columns (wide data format)
  • Step 2 - convert from the wide data format (multiple columns) to the long data format, where the multiple columns are squeezed into one column and their values become extra rows (a pandas sketch of both steps follows below)

messy data to long format.ipynb (17.8 KB)

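Here is a minimal sketch of those two steps, using the sample data from the question (the vendor_id column is my assumption, since the original table shows no explicit key):

import pandas as pd

df = pd.DataFrame({
    "vendor_id": [1, 2, 3],
    "Vendor_tag": ["1,2", "2,3,4", "1,4"],
})

# Step 1: comma-separated column -> multiple columns (wide format).
wide = df[["vendor_id"]].join(df["Vendor_tag"].str.split(",", expand=True))

# Step 2: wide -> long; the expanded columns are squeezed into rows.
long = (
    wide.melt(id_vars="vendor_id", value_name="tag_id")
        .dropna(subset=["tag_id"])   # rows with fewer tags were padded with None
        .drop(columns="variable")
        .astype({"tag_id": int})
        .sort_values(["vendor_id", "tag_id"], ignore_index=True)
)
long.insert(0, "vendor_tag_id", long.index + 1)   # surrogate key, like VendorTagID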

1 Like

Besides semenchuk's solution, you can also use pd.melt(df.VendorTagID.str.split(pat=',', expand=True)).

3 Likes

That's such an amazing explanation! It will take a while to get used to, for sure. Thank you!!

Also, how would I do the same thing using SQL?

You can use STRING_SPLIT + CROSS APPLY, or PIVOT/UNPIVOT, either with hardcoded columns or with dynamically generated columns (STRING_AGG or FOR XML PATH needed).

You can type the following code on sqliteonline.com and select the SQL SERVER engine, then open a new tab for each query. (It's good to use a strict engine so errors don't pass silently as they do in SQLite; see the damage done by selecting columns that are neither in GROUP BY nor aggregated: Why is the Solution Ignoring 'Quantity' in invoice_line when Calculating Sold Tracks per Genre per Country?)

CREATE TABLE, and the pd.melt equivalent using SQL CROSS APPLY + STRING_SPLIT

DROP TABLE if EXISTS vendor;
CREATE TABLE vendor (vendor_id INT, vendor_tagid VARCHAR(max));

INSERT INTO vendor (vendor_id,vendor_tagid)
VALUES
(6,'1,2'),
(7,'2,3,4'),
(8,'1'),
(9,'1,2,3,4');

SELECT vendor_id, value AS split_vendor_tagid   -- value is the only possible output name
FROM vendor  
CROSS APPLY STRING_SPLIT(vendor_tagid, ',');
vendor_id split_vendor_tagid
6 1
6 2
7 2
7 3
7 4
8 1
9 1
9 2
9 3
9 4

Done!
But let’s learn more.

Store the melted table into a new table to practice PIVOT and UNPIVOT

DROP TABLE if EXISTS split_tags;
CREATE table split_tags(vendor_id INT, split_vendor_tagid INT)

INSERT INTO split_tags
SELECT vendor_id, value AS split_vendor_tagid   -- value is the only possible output name
FROM vendor  
CROSS APPLY STRING_SPLIT(vendor_tagid, ',')

PIVOT by supplying hardcoded column names (Tutorial: https://www.youtube.com/watch?v=ozy31aJpW-o&t=659s&ab_channel=BeardedDev)

SELECT *
FROM
(
	SELECT * FROM split_tags
) 
as useless_alias_prevent_error
PIVOT
(
  COUNT(split_vendor_tagid) 
  for split_vendor_tagid IN ([1],[2],[3],[4]) -- hardcoded names here
)
AS useless_alias_prevent_error2;
vendor_id 1 2 3 4
6 1 1 0 0
7 0 1 1 1
8 1 0 0 0
9 1 1 1 1

This is more organized than the output from pandas. Note how the columns 1,2,3,4 represent the hardcoded tag names 1,2,3,4, unlike the 1st/2nd/3rd/4th expanded-position columns in pandas. The pandas output is therefore not column-aligned, while here each column stands for one fixed tag, with 0 filled in instead of None thanks to the COUNT aggregation, which adds information.

See how pandas flushes the numbers to the left and fills None on the right for any vendor id with fewer than 4 tags (the maximum among all vendors).

0 1 2 3
0 1 2 None None
1 2 3 4 None
2 1 None None None
3 1 2 3 4

Throughout these PIVOT/UNPIVOT queries, a lot of seemingly useless aliases have to be added, or else you get an uninformative syntax error that doesn't explain what's wrong, so look out. I don't know why they are required; the same applies to Postgres.

Here you can get a deeper understanding of SQL by using CASE (how people lived before PIVOT was invented, still used today for its greater flexibility) instead of PIVOT to get the same result: https://www.youtube.com/watch?v=xPMbgM8Eb2k&ab_channel=BeardedDev.

A second way to get the same results above

What if there are too many tags to type manually? Dynamically find the distinct tag names and use them to fill in a SQL string to execute.

DECLARE @cols AS NVARCHAR(MAX),
		@query AS NVARCHAR(MAX);

--SET @cols = (SELECT STRING_AGG(QUOTENAME(split_vendor_tagid),',') FROM (SELECT DISTINCT split_vendor_tagid FROM split_tags) AS useless_alias_prevent_error)
SELECT @cols = STRING_AGG(QUOTENAME(split_vendor_tagid),',') from (SELECT DISTINCT split_vendor_tagid FROM split_tags) AS useless_alias_prevent_error;
--SELECT @cols;   -- to print and debug whether the correct unique columns and format were generated

set @query = 'SELECT * 
FROM
(
	SELECT * FROM split_tags
) as useless_alias_prevent_error2
PIVOT
(
  COUNT(split_vendor_tagid) for split_vendor_tagid IN (' + @cols + ')
)
AS useless_alias_prevent_error3';

DROP TABLE if EXISTS pivoted_tags 
CREATE TABLE pivoted_tags(id INT, tag1 int, tag2 int, tag3 int, tag4 int)
--INSERT INTO pivoted_tags
execute(@query)

Here we declare 2 variables: one stores the list of tag names used in the PIVOT to replace the hardcoded IN() part, the other holds the formatted query string before execution.

There are 2 ways to get the list of distinct tag names into @cols: SET or SELECT. SET is good for not silently ignoring errors, SELECT is good for brevity: https://www.sqlshack.com/what-to-choose-when-assigning-values-to-sql-server-variables-set-vs-select-t-sql-statements/.

You can later uncomment the INSERT INTO pivoted_tags to store the results and practice unpivoting. Notice I created names for the tags (tag1, tag2, ...) so they are more meaningful than the plain ints 1,2,3,4. This would not be needed if the values were meaningful strings like Jan, Feb, Mar. It is also why the previous hardcoded version had to write [1],[2],... rather than 1,2,..., to make SQL recognize these ints as column names and prevent an error. QUOTENAME() handles this in the code above.

UNPIVOT (like melt in pandas) to recover split_tags

SELECT * 
from pivoted_tags
UNPIVOT
(tag_count for split_vendor_tagid IN (tag1,tag2,tag3,tag4)) as useless_alias_prevent_error4

You must have previously uncommented INSERT INTO pivoted_tags for data to be inserted into the dropped-and-recreated table, otherwise there is no output to see.

Well, technically we didn't recover it, because we added a new column of tag names that came from the column names (specified in CREATE TABLE before running the INSERT INTO query) in pivoted_tags.
Maybe I'm wrong, but this seems to be compulsory syntax, and UNPIVOT must end up with 3 columns. Note how every id now gets duplicated 4x to show 4 tags. (This is not the case when we UNPIVOT a table containing NULLs, which throws those rows away and does not recover the pre-PIVOT rows, unlike here.)

A more normal example of PIVOT

More normal because this table has 3 columns, unlike the earlier split_tags with only 2, and this PIVOT performs a more meaningful aggregation on sales, whereas previously we were just counting 1/0 presence/absence for the sake of satisfying the syntax, which forces an aggregation.

CREATE TABLE sales(
  id int,
  month varchar(10),
  sales INT)

INSERT INTO sales (id,month,sales)
VALUES
(1,'jan',100),
(1,'jan',200),
(2,'jan',150),
(2,'feb',200)
DROP table if exists pivoted_sales;
CREATE TABLE pivoted_sales (id int, feb int ,jan int);

DECLARE @cols AS NVARCHAR(MAX),
		@query AS NVARCHAR(MAX), 
        @query2 AS NVARCHAR(MAX);
        
select @cols = STUFF((SELECT distinct ',' + month 
                    from sales
            		FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)') 
        			,1,1,'');
 
SET @cols = (SELECT STRING_AGG(month,',') FROM (SELECT DISTINCT month FROM sales) AS useless_alias_prevent_error)
--SELECT @cols = STRING_AGG(month,',') from (SELECT DISTINCT month FROM sales) AS useless_alias_prevent_error;
--SELECT @cols;

set @query = 'SELECT * 
FROM
(
	SELECT * FROM sales
) as useless_alias_prevent_error2
PIVOT
(
  SUM(sales) for month IN (' + @cols + ')
)
  
AS useless_alias_prevent_error3';
  
--INSERT INTO pivoted_sales
execute(@query)
id feb jan
1 (null) 300
2 200 150

See how the columns are now naturally named feb and jan, much nicer than the 1,2,3,4 previously, so you don't have to wrap them in [] when specifying columns during PIVOT/UNPIVOT.
Also note there is a NULL because id 1 never had a sale in Feb; that row will be missing from the UNPIVOT later. Another reason the data --> PIVOT --> UNPIVOT round trip does not recover the original data is that PIVOT applied an aggregation, summing the 2 Jan sales of id = 1.

The FOR XML PATH approach is just the old way to generate the jan,feb list for @cols. You can learn about it from https://stackoverflow.com/questions/31211506/how-stuff-and-for-xml-path-work-in-sql-server and https://www.youtube.com/watch?v=QwPjbQosjgw&ab_channel=BlakeB.

UNPIVOT sales

Remember to uncomment the previous INSERT INTO pivoted_sales to materialize the table.

SELECT id,
       month,
       sales
FROM pivoted_sales
UNPIVOT
(sales for month in (jan,feb)) AS useless_alias_prevent_error
id month sales
1 jan 300
2 jan 150
2 feb 200

See how the NULL for id=1, feb is gone after UNPIVOT. If you want it to stay, you can use ISNULL(Value, 99) before UNPIVOT so that a placeholder number like 99 represents NULL and survives the unpivoting, then NULLIF(Value, 99) to convert all values of 99 back to NULL afterwards.

Besides PIVOT having alternative CASE to implement, UNPIVOT has Cross-Apply and UNION alternatives: https://www.tutorialgateway.org/unpivot-in-sql/

CROSS APPLY will keep the rows with NULL, since it is basically copy-pasting values from the table on the left-hand side, including NULLs.

SELECT id,
       month,
       sales
FROM pivoted_sales
CROSS APPLY
(
  VALUES
    ('jan', jan),
    ('feb', feb)
) not_useless_alias(month, sales)

To learn cross-apply

DROP TABLE IF EXISTS sales_pivoted 
CREATE TABLE sales_pivoted(
  id int,
  jan varchar(10),
  feb VARCHAR(10)
  )

INSERT INTO sales_pivoted (id,jan,feb)
VALUES
(1,50,100),
(2,30,200),
(3,50,150)
SELECT * 
FROM sales_pivoted 
CROSS APPLY 
( VALUES 
  ([jan],'jan_custom'),
  ([feb],'feb_custom')
 ) alias(qty, month)


See how, if we selected id, qty, month, that is exactly the unpivoted table.
The VALUES you see here is a table value constructor. It normally appears in INSERT INTO statements, but it can be used to generate tables on the fly without creating them first.
CROSS APPLY works by taking each row on the left-hand side and duplicating it once for each row in the right-hand side table.

In the above example, if I had added ([jan],'mar_custom'), that would have worked too (CROSS APPLY does not break) and would create 3 rows per id, but it makes no sense in the context of the unpivot we are trying to achieve here.

If you wonder how the RHS table joins to the LHS, or which columns align them: they don't care about that. CROSS APPLY simply duplicates the LHS row as many times as the RHS requires. This example shows how the values in the RHS table do not have to match the LHS: https://www.youtube.com/watch?v=eVsG9oQsr-c&ab_channel=JiFacts
He simply created 2 new columns (gender, m) per row from per-row information, spending time up front to build a bigger table with 2 extra preprocessed columns to select from, to save time writing CASE code when WHERE-filtering later.

Both examples above duplicate the LHS rows the same number of times for every LHS row (2 in my example, 1 in the YouTube link above), because the RHS always has that many rows.

The STRING_SPLIT() I mentioned above, however, gives a different number of RHS rows for each LHS row, so note that the amount of duplication is not always the same. You can play with example A in https://docs.microsoft.com/en-us/sql/t-sql/functions/string-split-transact-sql?view=sql-server-ver15 to understand it better, and indirectly understand CROSS APPLY better.

So, to summarize using CROSS APPLY with a VALUES() table value constructor on the RHS: you just have to decide what information you want to grab from each row on the LHS, what custom information you want to add on the RHS (derived from the left, or hardcoded by your own logic), and how to lay that information out on the RHS (across rows, down columns).

Good with PIVOT and CROSS APPLY? Go for DOUBLE UNPIVOT: https://www.sqlservercentral.com/articles/an-alternative-better-method-to-unpivot-sql-spackle

PIVOT puts data in wide form, UNPIVOT puts it in long form. Wide is better for analysis, long is better for storage (reads/writes). You will probably need to do these in pandas one day anyway, even if not in SQL; a sketch follows below.
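For reference, here is the same wide/long round trip in pandas, using the sales example above (a minimal sketch; pivot_table aggregates just like PIVOT, so the summed duplicate is not recoverable by melt either):

import pandas as pd

sales = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "month": ["jan", "jan", "jan", "feb"],
    "sales": [100, 200, 150, 200],
})

# Like PIVOT with SUM: long -> wide (id 1's two jan sales collapse into 300).
pivoted = sales.pivot_table(index="id", columns="month",
                            values="sales", aggfunc="sum")

# Like UNPIVOT: wide -> long; dropna mimics UNPIVOT discarding the NULL cell.
melted = (pivoted.reset_index()
                 .melt(id_vars="id", var_name="month", value_name="sales")
                 .dropna(subset=["sales"]))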

Self-promotion if you like my tutorial: https://towardsdatascience.com/transformations-of-stack-melt-pivot-table-901292196d9e

2 Likes

Oh my, that's definitely a lot to take in!
But I'll be taking notes of what you have written. Can't thank you enough for your time and effort in explaining this!

So from my original table, how would I be able to relate or query the vendor_tag and vendor_tag_name together? I can't seem to find anything about it.

Vendor_tag | Vendor_tag_name           | vendor_id
---------- | ------------------------- | ---------
1,2        | burger, shawarma          | 1
2,3,4      | shawarma,icecream,sweets  | 2
1,4        | burger, sweets            | 3

In a bigger dataset, the vendor_tags are larger, with different combinations corresponding to a vendor id. How would I match a specific vendor_tag_id to the vendor tag name? I know I can use SELECT DISTINCT to find the unique values/names, but I'm just not sure how to make the unique names and ids relate to each other, such as
[vendor_tag_id = 4] == [vendor_tag_name = sweets]
I'm not too sure if I'm explaining this correctly, sorry!

Vendor_tag_id | Vendor_tag_name
------------- | ---------------
1             | burger
2             | shawarma
3             | icecream
4             | sweets

where it results in this:

VendorTagID | VendorID | TagID
----------- | -------- | -----
          1 |        1 |     4
          2 |        1 |     8
          3 |        3 |    10
          4 |        3 |    91

This image might express my question better.
The top dataset is the original un-normalised form and the second one is the normalised form. So with data like that below, I want the specific tag number to be detected, e.g. that number 30 is fries, etc.

Oh wait, maybe I'd need to cross split and distinct both vendor_tag and vendor_tag_name. How would I be able to do that with one query?

Maybe like this?

SELECT DISTINCT (vendor_tag, vendor_tag_name)
FROM vendor
CROSS APPLY STRING_SPLIT((vendor_tag, vendor_tag_name), ',');

With the diagram, now I see what you're trying to accomplish.
It was so hard to understand from the 1st question, because I was wondering where VendorID and TagID in the 2nd table came from, and how the various numbers and columns in the 1st table related to the 2nd table. The only pattern I could see was that VendorTagID in the 2nd table comes from Vendor_tag in the 1st table.
If you had prepared the DML code and sample data, that would have motivated more people to help.

Prepare data

CREATE TABLE unnormalized(
  vendor_tag varchar(max),
  vendor_tag_name varchar(max),
  vendor_id int
  )
  
 INSERT INTO unnormalized
VALUES
('5,8,30,24','Burgers,Desserts,Fries,Salads',1),
('5','Burgers',2),
('8,42','Desserts,Mexican',3),
('1,5,30,16','American,Burgers,Fries,Sandwiches',4),
('1,5,30,16','American,Burgers,Fries,Sandwiches',5)

Query

SELECT
	*
	--DISTINCT CAST(tag_id AS INT) as tag_id ,tag_name
FROM unnormalized 
CROSS APPLY 
(
 	(SELECT 
  		value as tag_id,
  		ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rn
 	 FROM STRING_SPLIT(vendor_tag,',') 
	) a1
  	INNER JOIN 
 	(SELECT 
  		value as tag_name,
  		ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rn
  	 FROM STRING_SPLIT(vendor_tag_name,',') 
 	 ) a2
  	ON a1.rn = a2.rn
) 
--ORDER BY tag_id

Explanation
You can get the Tag Id : Tag Name relationship by uncommenting the 2nd SELECT line and the ORDER BY (optional). tag_id was split from a varchar, so casting to INT is necessary for a proper ORDER BY. (Similarly, it is very important to cast strings to dates before plotting dates in pandas, so pandas can fill in missing datetime values in the index; otherwise time periods with no data disappear from the plot, making it look like every period had data when some are actually empty and the data lines should be broken.)

The general idea is still the same: take the source table and use CROSS APPLY + STRING_SPLIT to break it down. The only difference now is that we have to split both the tag_id and tag_name columns.
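For comparison, the row alignment comes almost for free on the pandas side: split both columns into lists and explode them together (a sketch using the sample data above; exploding multiple columns at once requires pandas >= 1.3):

import pandas as pd

df = pd.DataFrame({
    "vendor_id": [1, 2, 3],
    "vendor_tag": ["5,8,30,24", "5", "8,42"],
    "vendor_tag_name": ["Burgers,Desserts,Fries,Salads", "Burgers", "Desserts,Mexican"],
})

# Split both CSV columns into lists, then explode them in parallel;
# list positions stay aligned, like the ON a1.rn = a2.rn join in SQL.
exploded = (
    df.assign(tag_id=df["vendor_tag"].str.split(","),
              tag_name=df["vendor_tag_name"].str.split(","))
      .explode(["tag_id", "tag_name"])
)

# The unique tag_id -> tag_name mapping, like the DISTINCT variant of the query.
tags = (exploded[["tag_id", "tag_name"]]
        .astype({"tag_id": int})
        .drop_duplicates()
        .sort_values("tag_id"))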

The previous article shows 2 alternative ways to row-align after doing UNPIVOT 2 times sequentially: based on the last digit of the column names, and a more robust WHERE method for custom matching.

If we tried sequential CROSS APPLYs, one on tag_id then one on tag_name (because I don't know how STRING_SPLIT can mix with UNPIVOT), it seems impossible to align the correct tag_id to its tag_name, since the two splits would produce a cross product per row.

Both ways in that article do not work here, because all the column data is stored in CSV form in a single column, so there is no fixed set of prepared columns, and thus no column-name matching to speak of.

To achieve what you see in that article, you could prepare the unique items under vendor_tag and vendor_tag_name as columns beforehand, assuming you know all the possibilities (which you should, if you are the store owner who owns the menu strategy). But if you could do that, you wouldn't be doing this exercise to discover tag id to tag_name relationships. So I'm still unsure why you are tackling this problem: is this a database migration exercise, or are you starting a new design from scratch?
If the former, this solution of constantly querying for new values and inserting into a new schema doesn't feel sustainable; if the latter, then you should already know the relationships and wouldn't be asking this question.

Random Quirks

You see rn shown as Blob because sqliteonline has a problem displaying same-named columns from multiple tables. Sqlfiddle can, though. Try running this experiment in both places to see the difference:

SELECT A.VALUE, B.VALUE FROM (
    SELECT VALUE, ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS RW
    FROM STRING_SPLIT('b;a;c;d;e', ';')
) A
INNER JOIN (
    SELECT VALUE, ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS RW
    FROM STRING_SPLIT('3;1;4;2;5', ';')
) B
ON A.RW = B.RW

This also demonstrates how to row-align tables by producing a list of row numbers using ORDER BY (SELECT NULL) (ORDER BY is compulsory in ROW_NUMBER(), but we don't want to order by any column, just positionally).

Because STRING_SPLIT always produces its output under the column name value, I aliased it to 2 different names to prevent a name clash. (It seems a clash can cause actual SQL errors, not just the sqliteonline Blob display issue.)

Why other solutions fail
The CROSS JOIN and UNION alternatives in the article linked earlier also do not apply here, because they require knowing and hardcoding the number and names of the columns, which is difficult here since we don't know how many items each row contains. (Maybe correlated subqueries could make this dynamic per-row column tracking work, but that is too tedious.)
So it looks like letting STRING_SPLIT automatically explode the CSV and ROW_NUMBER automatically align the rows is good for now. I'm not sure whether it is performant on large data; you can generate random numbers and test (explained in the sqlservercentral double-unpivot article).

Just to clarify the query you wrote:

SELECT
	*
	--DISTINCT CAST(tag_id AS INT) as tag_id ,tag_name
FROM unnormalized 
CROSS APPLY 
(
 	(SELECT 
  		value as tag_id,
  		ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rn
 	 FROM STRING_SPLIT(vendor_tag,',') 
	) a1
  	INNER JOIN 
 	(SELECT 
  		value as tag_name,
  		ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rn
  	 FROM STRING_SPLIT(vendor_tag_name,',') 
 	 ) a2
  	ON a1.rn = a2.rn
) 
--ORDER BY tag_id
  1. It selects a created value called tag_id, and ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) called rn, from the newly made table produced by splitting vendor_tag on the commas, renamed to a1.
  2. It then repeats step one for tag_name, renamed to a2.
  3. Both tables are joined together on the rn column.
  4. Then all columns are selected from the unnormalized table?

I tried repeating your code above in SQL online, but I get this syntax error, which I assume is because it doesn't recognise CROSS APPLY, since it doesn't seem to highlight the 'APPLY' part? Therefore I renamed it to INNER JOIN, as CROSS APPLY and INNER JOIN are quite similar?:

Now, once I've changed CROSS APPLY to INNER JOIN and run the query, I get another syntax error regarding STRING_SPLIT, as it doesn't seem to recognise it. I've also noticed that it doesn't highlight 'value'; is that normal?

This is just a little project I'm doing to practice database creation using sqlite3, so I can migrate CSV data to a database and then use Python to create some kind of GUI to add things like:

  1. input new customers into the database
  2. buttons which print things like the mean cost of items, etc.
  3. a button to plot graphs for the data, etc.

The Kaggle data didn't look too complicated, which is why I chose it. [https://www.kaggle.com/mrmorj/restaurant-recommendation-challenge]
I only chose the vendor, customer, location and order CSV files.
Do you think this is a decent project to put on my GitHub to show off?
And my apologies, I struggled to explain my problem from the start, as I had trouble trying to word it.

I'm inspired by your deep knowledge of SQL. How many languages are you proficient in, and how long did it take you to get to that level?
I've been doing a month of hardcore R. Now I'm doing a combination of SQL and Python this month. I feel like I've forgotten most of what I learnt in R, haha.

The syntax error is because my code is in SQL Server dialect, and the default engine on that site is SQLite; see the tabs on the left and connect MS SQL. SQLite's syntax is too lax, so I wanted to learn with SQL Server first, which has better error checking and more built-in functions. I'll wait for you to convert my answer to SQLite once you understand it, then share it here.

vendor_tag is a column from the LHS unnormalized table; it is not a table.
STRING_SPLIT is a table-valued function that takes in a column of separator-separated values (vendor_tag in this case) and returns a table.

tag_name is not a table either; it is a column alias that renames the value column STRING_SPLIT outputs.

The rest of your understanding is right.

They are slightly similar, but different in the sense that in CROSS APPLY, what you write on the RHS has access to information from the LHS. In an INNER JOIN, the RHS is totally independent and cannot use anything from the LHS. Also, in a JOIN, the joining conditions and columns from both sides must be specified; that is not necessary in CROSS APPLY. (Explained in my 1st answer.)

Nice project. I think it is unique for a data scientist to be doing such work, assuming you want to go towards such a role rather than data engineer. However, the risk is that interviewers may not value the manipulation skills, being more focused on modelling and practical impact. My guess for data engineer interviews is that they would value experience working at bigger scale and with distributed systems. Nevertheless, it is still a good attempt and good practice, since SQL is so prevalent and important.

I think it's not linear growth; it probably takes a year. I learnt C in university, then Python and a bit of R from here, but not only here: Realpython, pydata youtube videos, pythonmorsels, reuvenlerner. So having extensive exposure is one source of growth. A second is having high standards for yourself: writing proper code, making it fast and readable. A third is being curious and not taking things at face value: asking why something is done one way and whether it can be done another way, having alternative solutions to every problem, and thinking about how what you learnt transfers to future scenarios.

I didn't know any of STRING_SPLIT, CROSS APPLY, PIVOT, UNPIVOT, joining row numbers to align rows, local variable declaration, or dynamic query-string formatting before coming across your question, but I learnt them all before answering you.

To understand my above answer, make sure you understand these steps independently.

  1. STRING_SPLIT (used alone as the RHS of CROSS APPLY in my 1st answer, and behind FROM in a SELECT in my 2nd answer).
  2. CROSS APPLY (explained in the 1st answer through my own experimental data, which you can edit).
  3. OVER() window functions: most importantly, note that some functions like ROW_NUMBER() make ORDER BY compulsory, while the aggregation functions (COUNT/SUM) change behavior based on whether ORDER BY is written in the OVER clause (they become cumulative if an ORDER is specified). A pandas analogy is sketched below.
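Since you are coming from pandas, here is a rough pandas analogy for point 3 (a sketch; the mapping is approximate, not an exact equivalence):

import pandas as pd

sales = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "month": ["jan", "jan", "jan", "feb"],
    "sales": [100, 200, 150, 200],
})

# ROW_NUMBER() OVER (PARTITION BY id ORDER BY ...)  ~  cumcount() + 1
sales["rn"] = sales.groupby("id").cumcount() + 1

# SUM(sales) OVER (PARTITION BY id)                 ~  transform("sum")
sales["total"] = sales.groupby("id")["sales"].transform("sum")

# SUM(sales) OVER (PARTITION BY id ORDER BY ...)    ~  cumsum() (cumulative once ordered)
sales["running"] = sales.groupby("id")["sales"].cumsum()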

Definitely go through the 2 articles from my 1st answer (the UNION alternatives to UNPIVOT, and the double UNPIVOT) to build a clear mental framework of exactly what UNPIVOT and CROSS APPLY are doing.

1 Like

Thanks for the explanation; I think I'm understanding how the string splitting, pivot, and unpivot stuff works a lot better now!
The next problem to tackle is translating it to SQLite.
I've found a way to split the rows of values separated by commas.
Here is what I gathered from my research, with the help of your explanations.
To split the tag ids I did the following:

WITH split(vendor_id, vendor_tag, str) AS (
    SELECT vendor_id, '', vendor_tag||',' FROM unnormalized
    UNION ALL SELECT vendor_id,
    substr(str, 0, instr(str, ',')),
    substr(str, instr(str, ',')+1)
    FROM split 
  	WHERE str
) 

SELECT vendor_id, vendor_tag
FROM split 
WHERE vendor_tag
ORDER BY vendor_id;

which gave me this result:

Now for the vendor_tag_names I did the following.

WITH split(vendor_id, vendor_tag_name, str) AS (
   		 SELECT vendor_id, '', vendor_tag_name||',' FROM unnormalized
         UNION ALL SELECT 
                       vendor_id,
                       substr(str, 0, instr(str, ',')),
                       substr(str, instr(str, ',')+1)
         FROM split 
         WHERE str!=''
  )
  
SELECT vendor_id, vendor_tag_name 
FROM split 
WHERE vendor_tag_name != '' 
ORDER BY vendor_id;

to get the result:

Basically, each time a string is passed into the query, it splits the string at the first comma it finds. It returns the left part of the string as a vendor_tag_name, and the right part is passed into the next iteration of the query to be processed again.
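The same chop-off-the-head recursion, written as a plain Python loop to make the mechanics visible (just a sketch for intuition, not part of the SQL solution):

def split_csv(s):
    # Mirror of the recursive CTE: append a trailing comma, then repeatedly
    # take the part before the first comma and continue with the remainder.
    parts, rest = [], s + ","
    while rest:
        head, rest = rest.split(",", 1)   # like the substr/instr pair in the CTE
        parts.append(head)
    return parts

print(split_csv("2,3,4"))   # ['2', '3', '4']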

Knowing what I know so far, I tried your SQL query with SQLite syntax, where I want vendor_tag and vendor_tag_name to be joined together on a common column, vendor_id. Therefore I did the following:

SELECT *
FROM unnormalized
CROSS JOIN 
(
    WITH split(vendor_id, vendor_tag, str) AS (
      SELECT vendor_id, '', vendor_tag||',' FROM unnormalized
      UNION ALL SELECT 
      				vendor_id,
      				substr(str, 0, instr(str, ',')),
      				substr(str, instr(str, ',')+1)      				
      FROM split 
      WHERE str
	) a1
INNER JOIN
(
	WITH split(vendor_id, vendor_tag_name, str) AS (
   		 SELECT vendor_id, '', vendor_tag_name||',' FROM unnormalized
         UNION ALL SELECT 
                       vendor_id,
                       substr(str, 0, instr(str, ',')),
                       substr(str, instr(str, ',')+1)
         FROM split 
         WHERE str!=''
	) a2
  	ON a1.vendor_id = a2.vendor_id
)

I'm not too sure if I see any problems in my code, as it seems to make sense to me, but unfortunately I get some errors. I replaced CROSS APPLY with CROSS JOIN, as that was the most similar clause to CROSS APPLY that I could find for SQLite.

Here is my error:

Just in case I don't get a data scientist job straight away and need more experience, I may go for a data analyst or data engineer job first. I guess I want to look versatile. :slight_smile:

Thanks for the advice, I really appreciate it!

I had a bit of help, and they said to use a recursive CTE.
Here is the solution I gathered:

WITH cte AS (
  SELECT 
    vendor_tag, 
    vendor_tag_name,
    SUBSTR(vendor_tag, 1, INSTR(vendor_tag || ',', ',') - 1) col1,
    SUBSTR(vendor_tag_name, 1, INSTR(vendor_tag_name || ',', ',') - 1) col2
  FROM unnormalized 
  UNION ALL 
  SELECT 
    SUBSTR(vendor_tag, LENGTH(col1) + 2), 
    SUBSTR(vendor_tag_name, LENGTH(col2) + 2), 
    SUBSTR(SUBSTR(vendor_tag, LENGTH(col1) + 2), 1, INSTR(SUBSTR(vendor_tag, LENGTH(col1) + 2) || ',', ',') - 1),
    SUBSTR(SUBSTR(vendor_tag_name, LENGTH(col2) + 2), 1, INSTR(SUBSTR(vendor_tag_name, LENGTH(col2) + 2) || ',', ',') - 1)
  FROM cte  
  WHERE LENGTH(vendor_tag) AND LENGTH(vendor_tag_name)
)
SELECT DISTINCT col1 vendor_tag, col2 vendor_tag_name
FROM cte
WHERE NOT (INSTR(col1, ',') OR INSTR(col2, ',')) AND (LENGTH(col1) AND LENGTH(col2))
ORDER BY vendor_tag + 0
1 Like

Brilliant use of a recursive CTE here! Where did you find such help?
The theoretical runtime shouldn't be too bad either: every string gets its 1st word/number before the comma chopped off in each recursion.
Also, nice use of INSTR to search for the 1st comma appearance, which naturally handles the appended comma for strings without comma separators.

I removed the DISTINCT and the empty-row filtering (just SELECT COUNT(*) FROM cte) to see that there are actually 20 rows (increase each starting row's count by 1 to also count the empty string in the last recursion, giving 5+2+3+5+5) instead of the visible 17 (which includes 1 empty row in the middle). This taught me to be careful when dealing with empty strings.

I don't understand why WHERE NOT (INSTR(col1, ',') OR INSTR(col2, ',')) was added. It makes no difference? By the end of the recursion, every cell must have been split on its commas and the non-comma strings extracted, so there is no way for any comma to be left in the col1 and col2 columns.
It would make more sense to apply WHERE NOT (INSTR(vendor_tag, ',') OR INSTR(vendor_tag_name, ',')) to the first two columns instead, since those are the ones that may still contain commas, except for the rows from each starting row's 2nd-last recursion level (which contain only 1 number/word) before the last recursion returns the empty string.

I had some guidance from Stack Overflow :slight_smile:
Nice catch, you are right: NOT (INSTR(col1, ',') OR INSTR(col2, ',')) is not needed. Maybe I initially put it there while testing (I don't remember). I tested again, even with inconsistent data, and I agree that it can be removed.