Code Check: Using re.match() to solve 136-4

Screen Link: https://app.dataquest.io/m/136/data-cleaning-walkthrough/4/reading-in-the-data

Hi,
I was working on 136-4, Data Cleaning Walkthrough: Reading in the Data. I figured I would use a regular expression to pull the file names, and then use str.format() to read the csv files into the dictionary.

The code I came up with:

k = re.match(r"\w+(?=.csv)",f)
data[k] = pd.read_csv("schools/{}".format(f))

I know through testing with regexr.com that the regex works, and running the code doesn’t seem to pop an error. However, I got a popup that says “Your code doesn’t seem to have the correct side effects. Please re-check the instructions and your code”.

I’ve been trying to troubleshoot, but I can’t figure out why this doesn’t work. It seems to be an issue with using the re.match() to identify the strings I’m using as keys.

Can someone please explain why this isn’t working?

Hey Chris,

Nice work taking a regex approach to this. I’ve taken a screenshot of me running your code, with an added line below that prints the keys of the dictionary:

As you can see, each dictionary key is a match object, whereas we expect it to be a string, for example 'demographics'.

You’ll need to extract the text from the match object before you use it as the key, and then everything should work fine.
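Here is a minimal sketch of the difference. The file name is only an illustrative value, and the dictionary holds placeholder strings instead of dataframes so that it runs without the course files.

```python
import re

f = "demographics.csv"  # illustrative file name, nothing is read from disk

m = re.match(r"\w+(?=.csv)", f)

# re.match() returns a Match object (or None), not the matched text.
print(isinstance(m, str))

# .group() pulls the matched text out as a plain string.
key = m.group()
print(key)

data = {}
data[m] = "placeholder"    # key is a Match object
data[key] = "placeholder"  # key is a plain string
print([type(k).__name__ for k in data])
```

Both assignments succeed without an error, which is why the original code ran cleanly but still produced keys of the wrong type.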

I hope this is helpful.

Thanks, Josh! That makes a lot of sense.

For anyone who might stumble upon this thread in the future, by adding .group() to the end of my line, I was able to extract the match result as a string, per Josh’s suggestion.

Final code:
k = re.match(r"\w+(?=.csv)",f).group()
data[k] = pd.read_csv("schools/{}".format(f))
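Two edge cases are worth keeping in mind with this pattern. The sketch below assumes file names normally end in .csv, and the helper name key_from_filename is made up for illustration. First, an unescaped . matches any character, so writing \. is stricter. Second, re.match() returns None when nothing matches, in which case calling .group() directly raises an AttributeError.

```python
import re

def key_from_filename(filename):
    """Return the name before '.csv', or None if the pattern does not match.

    Hypothetical helper for illustration; not part of the course code.
    """
    m = re.match(r"\w+(?=\.csv)", filename)  # \. matches a literal dot only
    return m.group() if m else None

print(key_from_filename("class_size.csv"))
print(key_from_filename("notes.txt"))

# The unescaped dot accepts any character in that position,
# so a name with no dot at all can still match.
loose = re.match(r"\w+(?=.csv)", "dataXcsv")
strict = re.match(r"\w+(?=\.csv)", "dataXcsv")
print(loose.group(), strict)
```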

Good job, Chris. It all worked out perfectly fine.

Hi Chris, I have a question about the regex you created. I’m trying to understand the pattern here: we need to extract the name before ‘.csv’, and some file names contain ‘_’. So my understanding is that the pattern should be r"(\w_?)(?=.csv)". I used ? after the underscore because a name has zero or one underscore, put ( ) around \w_? since that is the group we want to capture, and used the lookahead to extract only the word before .csv.

And since we want to extract the name without .csv to use as the key, shouldn’t we use str.extract()?
So my code is:
for f in data_files:
    extract_name = f.str.extract(r'(\w_?)(?=.csv)')

But it did not work…

Hi Candace,
From a quick glance at the documentation of Series.str.extract(), my first thought is that inside the loop we’re no longer dealing with a Series. That is to say, f is a plain string, not a Series, so Series.str.extract() won’t work on it.
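A quick way to check this, assuming data_files is a plain Python list of file names (the two names here are illustrative): each f in the loop is a built-in str, which has no .str accessor, so inside the loop the re module is the tool to reach for.

```python
import re

data_files = ["demographics.csv", "class_size.csv"]  # illustrative list

keys = []
accessor_found = []
for f in data_files:
    # f is a built-in str here; only a pandas Series has the .str accessor.
    accessor_found.append(hasattr(f, "str"))
    # On a single string, re.match() with a capture group does the job
    # that str.extract() does on a Series.
    keys.append(re.match(r"(\w+)(?=\.csv)", f).group(1))

print(accessor_found)
print(keys)
```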

My second thought was to try running str.extract() on data_files directly, but that won’t work either, since data_files is a list, not a Series. We would first need to convert the list into a Series (pd.Series), and then we’d probably use extractall instead of extract. extract only grabs the first match in each string, while extractall returns every match.

Regarding your regex specifically, (\w_?) matches only a single word character, optionally followed by one _. You need a + in order to capture as many characters as necessary. (As a side note, \w matches underscores as well, according to regexr.com.)
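The difference is easy to see on a single illustrative file name. This sketch uses re.findall() and escapes the dot as \. so that only a literal ".csv" satisfies the lookahead.

```python
import re

name = "class_size.csv"  # illustrative file name

# Without +, the group holds one word character (plus an optional underscore),
# so only the last character before ".csv" is captured.
short = re.findall(r"(\w_?)(?=\.csv)", name)

# With +, the group grows to cover every word character before ".csv".
full = re.findall(r"(\w+)(?=\.csv)", name)

# \w already matches an underscore, so no separate _? is needed.
underscore_is_word_char = re.fullmatch(r"\w", "_") is not None

print(short, full, underscore_is_word_char)
```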

So in order to extract the list, you’d need something along the lines of:

s = pd.Series(data_files)
names = s.str.extractall(r"(\w+)(?=\.csv)")  # new variable name, so the data dictionary isn't overwritten

After that, you’d still need to pair each extracted name with its corresponding dataframe. Unfortunately, I have to go to work now, but I hope this helped a little bit!
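A rough sketch of how that last step could look, assuming data_files is a list of file names (the two shown are illustrative) and using path strings as placeholders where the real code would call pd.read_csv(). Because each file name contains a single match, str.extract() with expand=False is enough here: it returns a Series of names in the same order as the input, which makes the pairing straightforward. extractall() is shown alongside for comparison.

```python
import pandas as pd

data_files = ["demographics.csv", "class_size.csv"]  # illustrative list

s = pd.Series(data_files)

# One capture group with expand=False gives a Series holding
# the first match for each element.
names = s.str.extract(r"(\w+)(?=\.csv)", expand=False)

# extractall() returns a DataFrame with one row per match; column 0 is the group.
all_matches = s.str.extractall(r"(\w+)(?=\.csv)")

# Pair each extracted name with its file. The real code would store
# pd.read_csv("schools/{}".format(f)) instead of the path string.
data = {name: "schools/{}".format(f) for name, f in zip(names, data_files)}
print(data)
```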

Thank you, Chris!! It is so clear! It solves many problems for me, and now I have a clear concept of regex!

Thank you!
