SQL I notice it is hard to understand or debug in SQL, any suggestion?

I notice it is hard for me to read and understand errors from our script.sql and therefore harder to debug.

For example yesterday, this small exercise took me an hour to debug and the last 15 minutes to realize I missed a “,” in the second field name.

I tried to understand the error information but it barely pointed where the problem could be.
I believe in the last sql project under jupyter notebook, it is a lot easier to understand error information.

Any suggestion? Or it is because our console didn’t display the detailed information?

Another question is about how to organize what we have learned so far and to learn more effectively.

I felt I just learned some new syntax or new tools and then about one week later, I forget everything. I know it is good to google and read each documentation. Or leave some comments while coding to make sure I remember some mistakes I made. I have already left so many detailed comments in the console page, it took even some time to go back and find the exact page where I was stuck with the same problem before.

Do you think it is a good idea to keep some key notes in the paper notebook? As a new programmer, I want to have a good habit, stay focused while organize what I have learned so far. I heard that DQ will provide some cheat sheets at the very end of the courses. I am new to programming.
Programming is like learning new language. If I don’t practice it everyday, I will easily forget a lot.

BTW, how to leave comments in sql?

Line one of error message: near “mt”: syntax error.

The first line should give you ample hints about the error. You have a syntax error near mt. mt is first used on line 5. Checking the syntax, you observe there is a missing comma on line 4.

Keep on practicing. If you forget, you can refer to the documentation.

Store it online - i.e. upload it to github - in a form of tutorial to teach yourself. In the future, you can refer to your own guide.

1 Like

Single line comments start with --.

Any text between – and the end of the line will be ignored (will not be executed).

The following example uses a single-line comment as an explanation:


--Select all:
SELECT * FROM Customers;
1 Like

Understanding Order of Execution is useful in having a thinking framework when writing queries, i believe it would help during debugging too. If you knew FROM comes before SELECT, you will then know that database management software like Dbeaver can help you autocomplete the columns in the SELECT clause if you specified which table you are querying FROM first, but this autocomplete will not appear if you typed SELECT first without FROM.

For practice, try to think frequently about business/analytical questions and how to implement them in sql, it is important to practice filtering values at the correct stage of the sql pipeline, lest you filter too early and undercount or too late and overcount rows. You can even take all the code in pandas and do it in SQL. (That’s how you can learn df.groupby excludes NaN groups while SQL GROUP BY includes NaN groups).

Being clear about the cardinality of joins is super important too. validate param in pd.merge is a useful and self-documenting check that should always be specified. Here’s what can go wrong with joins: https://alexpetralia.com/posts/2017/7/19/more-dangerous-subtleties-of-joins-in-sql

Eventually learning to analyze purely in SQL has to happen as pandas with RAM is not big enough (for storing merged tables) and speed of merge in pandas is much slower (eg. 15 minutes vs 30 secs)

1 Like

@alvinctk Thank you. I will see to it. :grinning:

Thank you for your advice. I really enjoy reading your code in the FAQ section. I found lots of
programming tips from a skilled programmer simply by looking at his code.
From those lines of code of yours, I learned 3 things:

SELECT  f.name country, SUM(c.population) urban_pop, f.population total_pop, SUM(1.0*c.population)/f.population urban_pct
FROM facts f INNER JOIN cities c
ON f.id == c.facts_id
GROUP BY country
HAVING urban_pct > 0.5

First, 1.0*c.population is an easier way to convert to float. I do not like to write CAST(), especially together with count or sum function, I will end up missing some silly () somewhere.

Second, if a/b, I just need to convert either a or b, just one variable to float, not both.

Third, group by field name, not field number is an indeed better practice.

The only thing I am not sure of is:
Do I need to write f.id == c.facts_id or f.id = c.facts_id

I know == is not the same as = in python. But I am not sure if they are equal in sql.

I guess I will have to write on a scratch paper of a flowchart or pseudo-code for those intermediate or complicated sql task. It is more like a brain teaser for me know.

1 Like