Different outcomes from the same code in course terminal and local Jupyter notebook

Hi, I am working on the regular expression mission, “https://app.dataquest.io/m/354/regular-expression-basics/7/accessing-the-matching-text-with-capture-groups”. The code below passes in the course terminal, but it fails when I run it in my local Jupyter notebook. In the course, using a capture group seems to extract the desired information:

pattern = r"[(\w+)]"
tag_freq = titles.str.extract(pattern).value_counts()

However, when I ran the above code in my Jupyter notebook, I got an attribute error:
“AttributeError: ‘DataFrame’ object has no attribute ‘value_counts’”. I checked “titles” in the cell below and it is a pandas Series, not a DataFrame! I also checked the values extracted by “titles.str.extract(pattern)” and found that they are all NaN. The file I am using should be the same as the one used by the course, since I downloaded it from the course terminal. Does anyone know why this is happening?

Thanks
Xuehong


Did you execute all of the Jupyter notebook cells?

Yes I did, multiple times. That’s how I can check “titles”.

Thanks
Xuehong

You will need to share your files and notebook so that someone can figure out the problem more easily.

I think it’s the regex pattern. Shouldn’t the brackets be escaped?
r"\[(\w+)\]"
I get an error saying that there are no capture groups when using r"[(\w+)]" without the escapes.


Thanks April,

I am not sure how I missed copying the “\” before the “[”. I did have it in the Jupyter notebook (see below). The issue is that the notebook seems to mistake a pandas Series for a DataFrame in this particular cell.

This is the code,

pattern = r"[(\w+)]"
tag_freq = titles.str.extract(pattern).value_counts()

This is the error message,
AttributeError Traceback (most recent call last)
in
1 pattern = r"[(\w+)]"
----> 2 tag_freq = titles.str.extract(pattern).value_counts()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5065 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5066 return self[name]
-> 5067 return object.__getattribute__(self, name)
5068
5069 def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'value_counts'

pattern = r"[(\w+)]"

Your post does not say in which order the notebook cells were executed. To avoid any further confusion, the code below is written to run in a single, independent code cell:

Here is the working code:

import pandas as pd

hn = pd.read_csv("hacker_news.csv")
titles = hn["title"]

pattern = r"\[(\w+)\]"
tag_freq = titles.str.extract(pattern).value_counts()

How pd.Series.str.extract behaves:

s = pd.Series(['a1', 'b2', 'c3'])

b = s.str.extract(r'([ab])(\d)')
>>> b
     0    1
0    a    1
1    b    2
2  NaN  NaN

c = s.str.extract(r'([ab])')
>>> c
     0
0    a
1    b
2  NaN

d = s.str.extract(r'[zZ]')
>>> ValueError: pattern contains no capture groups
  • If there is one capture group and expand=False, then pd.Series.str.extract returns a pd.Series object.

  • If there is one capture group and expand=True (by default), then pd.Series.str.extract returns a pd.DataFrame object.

  • If there is more than one capture group, then pd.Series.str.extract returns a pd.DataFrame object

  • If there is no capture group, then ValueError: pattern contains no capture groups is raised.
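The four cases above can be checked with a small sketch. The Series `s` is the same toy data as in the example; the variable names are only for illustration, and the "default" case assumes a pandas version where `expand=True` is the default.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"])

# One capture group, default expand=True: a one-column DataFrame
one_group_default = s.str.extract(r"([ab])")

# One capture group, expand=False: a Series
one_group_series = s.str.extract(r"([ab])", expand=False)

# More than one capture group: always a DataFrame, even with expand=False
two_groups = s.str.extract(r"([ab])(\d)", expand=False)

# No capture group: ValueError
try:
    s.str.extract(r"[ab]")
    no_group_error = None
except ValueError as exc:
    no_group_error = str(exc)

print(type(one_group_default).__name__)
print(type(one_group_series).__name__)
print(type(two_groups).__name__)
print(no_group_error)
```

Only the second call returns an object with the Series methods you expect, which is why it matters for chaining `.value_counts()`.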

For further reading, see the pd.Series.str.extract documentation.

About your issues:

The problem needs only a single capture group, described by pattern. Note that in current pandas versions expand=True is the default, so pd.Series.str.extract(pattern) returns a one-column pd.DataFrame. To get a pd.Series back, pass expand=False.

You have to ensure the following:

  • titles is of type pd.Series in order to use pd.Series.str.extract.
  • pattern is of type str.

Your error is due to titles.str.extract(pattern) being a pd.DataFrame object. Your post above does not say exactly what titles represents, so please provide the code in the order in which it executes.
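A quick way to compare the two environments is to print the pandas version and the types involved in both places. This is only a diagnostic sketch: the two titles are made-up stand-ins for the real hn["title"] column.

```python
import pandas as pd

# Hypothetical stand-in for hn["title"]; replace with the real column
titles = pd.Series(["[pdf] A sample title", "No tag in this one"])
pattern = r"\[(\w+)\]"

result = titles.str.extract(pattern)

# Run these in the course terminal and in the local notebook, then compare
print(pd.__version__)
print(type(titles).__name__)
print(type(pattern).__name__)
print(type(result).__name__)
```

If the last line differs between the two environments, the difference comes from the pandas version and not from the data file.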

Next issue:

You have to escape the brackets [ and ]. That is, \[ and \]. This tells the regex engine to treat [ and ] as literal characters to match, not as the delimiters of a character class.

Otherwise, as in your example code, [(\w+)] is read as a character class that matches a single character: an opening parenthesis, a word character, a plus sign, or a closing parenthesis. The parentheses inside the class are literal, so the pattern contains no capture group, and ValueError: pattern contains no capture groups is raised.
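The difference between the two patterns can be seen with Python's re module alone. In this sketch the title string is a made-up example; `groups` is the attribute of a compiled pattern that reports how many capture groups it contains.

```python
import re

unescaped = re.compile(r"[(\w+)]")   # a character class, no capture group
escaped = re.compile(r"\[(\w+)\]")   # literal brackets around one capture group

print(unescaped.groups)  # 0
print(escaped.groups)    # 1

title = "[pdf] Regex basics"  # hypothetical title

# The character class matches one character at a time
first_char = unescaped.search(title).group(0)
print(first_char)  # p

# The escaped pattern captures the word between the brackets
tag = escaped.search(title).group(1)
print(tag)  # pdf
```

A pattern with zero groups is what makes str.extract raise the ValueError.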

Formatting your code in your post:

Using triple backticks ``` to format a code block improves code readability, results in a better quality discussion of your question, and gets you a faster response from the community.

You can refer to this discourse post about markdown formatting a block of code.

Hi Alvinctk,

That was an error I made. I was checking whether I would get different errors if I modified the pattern, and I forgot to copy the original pattern when I answered your question the second time. I just copied and pasted your code into a Jupyter cell and got the same error. I even added a statement to make sure titles is a pandas Series, and that didn’t change anything. I am not sure why my Jupyter notebook acts so oddly. Could you run the code in your Jupyter notebook and see what happens? This has never happened to me before.

here is the code,

import pandas as pd

hn = pd.read_csv("hacker_news.csv")
titles = hn["title"]
titles = pd.Series(titles) # note: commenting this out didn’t make any difference.

pattern = "[(\w+)]"
tag_freq = titles.str.extract(pattern).value_counts()

here is the error message

AttributeError Traceback (most recent call last)
in
6
7 pattern = “[(\w+)]”
----> 8 tag_freq = titles.str.extract(pattern).value_counts()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5065 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5066 return self[name]
-> 5067 return object.__getattribute__(self, name)
5068
5069 def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'value_counts'

The same thing happened again after I copied the code from the instructions into my notebook. It seems this only happens when str.extract() is used.

titles = hn['title']
pattern = r"(\w+SQL)"
sql_flavors = titles.str.extract(pattern, flags=re.I)
sql_flavors_freq = sql_flavors.value_counts()
print(sql_flavors_freq)

The above code resulted in the error below,

AttributeError Traceback (most recent call last)
in
1 pattern = r"(\w+SQL)"
2 sql_flavors = titles.str.extract(pattern, flags=re.I)
----> 3 sql_flavors_freq = sql_flavors.value_counts()
4 print(sql_flavors_freq)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5065 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5066 return self[name]
-> 5067 return object.__getattribute__(self, name)
5068
5069 def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'value_counts'

I’ve been learning a lot while looking at your problem! I downloaded the Hacker News dataset to work with it in Jupyter notebook and pasted in your code to see your error. Then I read the documentation for str.extract along with rereading alvinctk’s post. It looks like expand=True is the default, so titles.str.extract(pattern, flags=re.I) is returning a DataFrame object and not a series. I added expand=False which caused it to return a series.

pattern = r"(\w+SQL)"
sql_flavors = titles.str.extract(pattern, flags=re.I, expand=False)
sql_flavors_freq = sql_flavors.value_counts()
print(sql_flavors_freq)

output:

PostgreSQL    27
NoSQL         16
MySQL         12
nosql          1
MemSQL         1
mySql          1
CloudSQL       1
SparkSQL       1
Name: title, dtype: int64
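For anyone who wants to try this without the dataset, here is a small self-contained sketch. The four titles are invented for illustration and are not from hacker_news.csv. It also shows that selecting column 0 of the DataFrame gives the same counts as passing expand=False.

```python
import re
import pandas as pd

# Invented titles, used only to illustrate the behaviour
titles = pd.Series([
    "PostgreSQL 9.5 released",
    "Scaling PostgreSQL",
    "Why NoSQL matters",
    "No database here",
])
pattern = r"(\w+SQL)"

# expand=False returns a Series, which has value_counts()
as_series = titles.str.extract(pattern, flags=re.I, expand=False)
freq_from_series = as_series.value_counts()

# With the default expand=True, the single group is column 0 of a DataFrame
as_frame = titles.str.extract(pattern, flags=re.I)
freq_from_frame = as_frame[0].value_counts()

print(freq_from_series)
```

Either route should work; passing expand=False is shorter and does not depend on the column label.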

I had been stuck on the same problem as xuehong.liu.pdx, since I’m also using Jupyter on my computer. Thank you all for the help and the solution!!! :smiley: