Is pd.concat() equvalent to an Outer join or a Cross join?

I’m at the ‘Data Cleaning and Analysis’ course on ‘Combining Data with pandas’ mission.

It seems to say here that pd.concat() works as an Outer join. However, it does not seem to work as a join at all but rather just glue the tables together as the function does not attempt to combine the dataframes by any key (join rows from different dataframes by the key). In the Outer join, on the other hand, this joining by the keys is still being done. However, all the non matching rows from left and right are still kept also.

If I am not mistaking then the pd.concat() function works the same as a Cartesian (Cross) join in SQL.

Or have I understood something wrong here?

1 Like

pd.concat has nothing to do with sql joins. It is unfortunate the api has a parameter named join, but too bad designers have only so few synonyms to express basic concepts.
If you read the first line in function docstring, with optional set logic along the other axes.
There are 2 key points. 1. set logic 2. along the other axes

  1. Have you thought why the join parameter only allows {‘inner’, ‘outer’} but not left right like SQL joins?
    Because it is actually describing set operations of intersection, union rather than the sql type of join.
    Why default outer? Because keeping more information is practically more useful, and the user can dropna(axis=1) manually later to get the same effect as join=‘inner’ but the reverse is impossible.

  2. “the other axes” refers to the axes that you are not concatenating along specfified in the 2nd parameter of pd.concat. I’ll just explain using row concatenation (eg 6,7 on https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html shows the difference between inner, outer) and you can apply the same reasoning for columns. When concatenating rows, it is expected that you have the same column names. (Maybe there were multiple files you were trying to process into a single giant dataframe with each file producing the same columns for you to join the row records together).
    In an environment with less data integrity or bad pre-processing code, these df to concat may mismatch in columns, so the library wants you to tell it whether to get only columns that appear in both df (join=‘inner’) or all columns that have appeared even if not both df have them.(join=‘outer’)
    The former (setting to ‘inner’) is good hack for ensuring functions down your pipeline doesn’t see surprising additional columns appearing.

You seem like a great thinker, if you haven’t yet, how about subscribing to pandas on https://www.codetriage.com/ and get on board how the core maintainers make decisions and how things work, or even making your own pull requests to grow this library. Doing this helps you learn how other libraries are designed too (eg. matplotlib) since most data analysis libraries are written in OOP way and pandas calls matplotlib in df.plot