Pandas Data Cleaning Practice Problem 8

I’m Minh. I’m in the process of practicing data cleaning, but I’m getting a mysterious error that I can’t explain. The problem is in my code for problem 8, shown below:

Screen Link:

(To Python library):
https://docs.python.org/3/library/re.html
My Code:

In DQ:
def split_group(str_):
    m = re.match(r"([\w]+) ([-?'?\w]+)",str_)
    return m.group(2)
names['lastname'][names['lastname'].isnull()] = check_2
check_3 = names['lastname'].apply(split_group)

On my local machine:
def split_group(str_):
    m = re.match(r"([\w]+) ([-?'?\w]+)",str_)
    return m.group(1)
series_1 = pd.Series(list_test)
check_2 = series_1.apply(split_group)
check_2

What I expected to happen:
0 Kalindi
1 Roderich
2 Nickolas
3 Althea
dtype: object

What actually happened:

AttributeError: 'NoneType' object has no attribute 'group'

I’m sorry, but m = re.match(r"([\w]+) ([-?'?\w]+)",str_) should give a Match object in Python, and it actually runs successfully in my local Jupyter notebook.
Is there a mismatch somewhere? The function runs successfully on the list that I created locally, but it seems like I can’t use the same function here?


Here is the result of the code that I’m planning to use in DQ

You are running your code locally on test input, not on the actual contents of the names.csv dataset.

In DQ, you apply the function to the lastname column, which contains just one string per row from what I can see. However, your test example contains full names. Your regex pattern is written for a full name, so when you apply the function to a lastname value that is a single word, re.match() finds no match and returns None. Hence the error. (I haven’t played with your code enough to be entirely sure of this, but this is what seems to be happening as of now.)

Make sure your approach works on the actual dataset and not just on your test case.
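
To illustrate the mechanism I am describing, here is a minimal sketch. It assumes a lastname value that holds only a single word; the sample strings come from your test list, not from names.csv, so treat it as an illustration rather than a diagnosis of your actual data.

```python
import re

PATTERN = r"([\w]+) ([-?'?\w]+)"  # the pattern from the question

# A full name has the "word, space, word" shape the pattern expects.
full_name = re.match(PATTERN, "Kalindi Ivanonko")
# A single word has no space, so the pattern cannot match at all.
single_word = re.match(PATTERN, "Ivanonko")

print(full_name.group(1))  # Kalindi
print(single_word)         # None

# Calling .group() on None raises the AttributeError from the question.
try:
    single_word.group(1)
except AttributeError as err:
    error_message = str(err)
    print(error_message)
```

So re.match() does return a Match object, but only when the pattern matches. Otherwise it returns None, and None has no group attribute.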

About the None error that you warned me about: yes, I know, and I’ve handled it.
I changed my method to re.findall plus an if-else statement, and it works. But that isn’t what I’m confused about.
I’m confused because the Python documentation says that m = re.match(r"([\w]+) ([-?'?\w]+)",str_) returns a Match object, and that object has the method Match.group, which I can use to get the lastname. This runs successfully in my Jupyter notebook, but it does not work here.
The error that I get is AttributeError: 'NoneType' object has no attribute 'group', not a None value from cleaning the column (because I added the if-else statement). The error says that the object is detected as a NoneType object with no such attribute, but in contrast, the Python documentation says it is actually an object from re, called a Match object.
I don’t know whether this is caused by a difference between the environments or by something else?

I would recommend sharing the exact code (and all of it) that you are running so far on your system and on DQ’s platform. It doesn’t seem the code you shared above matches exactly with what you are currently attempting given your last reply. I can then try running it and try to see what the issue might be.

The first is the code that I’m planning to use (with the Match object, like I said):

import pandas as pd
import re

names = pd.read_csv('names.csv')

check_1 = names['lastname'].str.extract(r'(^[\w-]+)',expand=False)
check_2 = names['firstname'].str.extract(r"([-?'?\w]+$)",expand=False)
names['firstname'][names['firstname'].isnull()] = check_1
names['firstname'] = names['firstname'].str.split().str[0]

def split_group(str_):
    if len(str_.split())>1:
        m = re.match(r"([\w]+) ([-?'?\w]+)",str_)
        return m.group(1)
    else:
        return str_
names['lastname'][names['lastname'].isnull()] = check_2
names['lastname'] = names['lastname'].apply(split_group)

names[names['firstname'] == 'Terri-jo']

The second is the version after I changed to re.findall:

import pandas as pd
import re

names = pd.read_csv('names.csv')

check_1 = names['lastname'].str.extract(r'(^[\w-]+)',expand=False)
check_2 = names['firstname'].str.extract(r"([-?'?\w]+$)",expand=False)
names['firstname'][names['firstname'].isnull()] = check_1
names['firstname'] = names['firstname'].str.split().str[0]

def split_group(str_):
    if len(str_.split())>1:
        store = re.findall(r"([\w]+\s[-?'?\w]+)",str_)
        for i in store:
            return i.split()[1]
    else:
        return str_
names['lastname'][names['lastname'].isnull()] = check_2
names['lastname'] = names['lastname'].apply(split_group)

names[names['firstname'] == 'Terri-jo']

And this is the code that I used to test a simple case locally:

import re
import pandas as pd

list_test = ['Kalindi Ivanonko','Roderich Biffin','Nickolas Kiernan','Althea Hoyle']

def split_group(str_):
    m = re.match(r"([\w]+) ([-?'?\w]+)",str_)
    return m.group(1)

rel = []
for i in list_test:
    syx = split_group(i)
    rel.append(syx)

series_1 = pd.Series(list_test)
check_2 = series_1.apply(split_group)
check_2

The issue is in the code above. For the name "Terri-jo Dobell", re.match() returns None. Your pattern can only match the "jo Dobell" part and not the whole name, because [\w]+ does not match the hyphen in "Terri-jo".

It returns None because of how match() works. As per the documentation:

If zero or more characters at the beginning of string match this regular expression, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

Since your pattern cannot match starting at the beginning of the string, match() returns None.

If you add the name "Terri-jo Dobell" to your list_test and run the code again, you should get the same error. I tried that in DQ’s environment and got the error.
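
Here is a small sketch of that experiment in plain Python, without pandas. It uses a shortened version of your list_test plus the extra name, and it records what re.match() returns instead of calling .group() right away, so you can see which name fails.

```python
import re

PATTERN = r"([\w]+) ([-?'?\w]+)"  # pattern from the question

# Shortened test list with the problematic name added.
list_test = ["Kalindi Ivanonko", "Roderich Biffin", "Terri-jo Dobell"]

# Keep the raw return value of re.match() for every name.
results = {name: re.match(PATTERN, name) for name in list_test}
for name, m in results.items():
    print(name, "->", m.group(1) if m else None)

# findall() scans the whole string, so it finds a match that starts at "jo".
found = re.findall(r"([\w]+\s[-?'?\w]+)", "Terri-jo Dobell")
print(found)  # ['jo Dobell']
```

The first two names give a Match object, while "Terri-jo Dobell" gives None. That is why the same function behaves differently on your local list and on the dataset: the input differs, not the environment.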

You need to modify the pattern to account for that specific full name if you want to use match(). Alternatively, you can use findall() as you did, without changing the pattern.
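
If you want to stay with match(), one possible adjustment is sketched below. It assumes that both name parts may contain hyphens or apostrophes; NAME_PATTERN and split_name are names I made up for the illustration, and I have not checked this against every row of names.csv.

```python
import re

# Hypothetical adjusted pattern: word characters, apostrophes and hyphens
# are allowed in both the first and the last name part.
NAME_PATTERN = re.compile(r"([\w'-]+)\s+([\w'-]+)")

def split_name(full_name):
    """Return (first, last), or None if the string does not start with two name parts."""
    m = NAME_PATTERN.match(full_name)
    if m is None:  # guard against the 'NoneType' AttributeError
        return None
    return m.group(1), m.group(2)

print(split_name("Terri-jo Dobell"))  # ('Terri-jo', 'Dobell')
print(split_name("Dobell"))           # None
```

Checking for None before calling .group() is worth keeping even with a better pattern, since any row with an unexpected shape would otherwise raise the same error.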
