Return rows by extracting string in columns

Hi all! I have trouble in understanding the below code. I understood that we use str. contains() to select the rows we want from a column, but why we can returns all rows and columns by using this codemerged_national_accounts = merged[national_accounts] In the merged dataset we dont have columns called national_accounts, so why we put this inside[ ]? Thank you in advance.

My Code:

pattern = r"[Nn]ational accounts"

national_accounts=merged['SpecialNotes'].str.contains(pattern,na=False)
merged_national_accounts = merged[national_accounts]

merged_national_accounts.head(5)
1 Like

Hello @candiceliu93!

The above according to pandas documentation

Test if pattern or regex is contained within a string of a Series or Index.

Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.

This variable national_accounts is a boolean Series containing True or False of whether a row in SpecialNotes column contains the pattern National accounts or national accounts

I hope you know that in Pandas we can use a boolean series to index the a dataframe if the boolean Series has the same length as the DataFrame.

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

So the national_accounts Series was used to index the DataFrame to select only the rows that had the pattern [Nn]ational accounts

Hi @candiceliu93, I am new to DQ so maybe I am not the best to answer this but I will try!

I think your confusion might be coming from picking up on a pattern and drawing a false conclusion:
merged['SpecialNotes'] looks a lot like: merged[national_accounts] and 'SpecialNotes' is a column in merged…so that must mean national_accounts is a column too, right? Not necessarily…the fact that 'SpecialNotes' is inside '' and national_accounts is not, is a hint that they might not be the same kind of thing.

What happens when you try type(merged_accounts)…what kind of object is it? Like I said, I’m still very new to DQ and programming in general but I think merged_accounts is a list of boolean values (ie a bunch of True/False) which you then use as a mask on your dataset to select specific rows based on the T/F values inside merged_accounts.

Both merged['SpecialNotes'] and merged[national_accounts] are ways of selecting specific parts of your larger dataset (merged) but they go about it in very different ways; one selects an entire column whereas the other selects particular rows based on a boolean list.

Does this make sense? Does it help? I really hope so! If not, I’m sure a more experience DQ user will be along to help you out.

Hi!! Thank you for your help!

It helps!

wow! I undersetood!! your explanation is so detailed!! thank you!

Now i know that in Pandas we can use a boolean series to index the dataframe if the boolean Series has the same length as the DataFrame.

Thank you again!!

1 Like