Exploring Ebay Car Sales Data: Next Steps: how to remove some substrings?

After a whole day of trying to solve a part of additional exercise I give up.

The task is: “See if there are particular keywords in the name column that you can extract as new columns”. I found that the ‘name’ column has brand name in the string ( same as ‘brand’ column). ex:
image
I was trying to drop this name from a string ( column “name”: starts with this name/brand). The problem is: not all rows have a brand name, so I can’t use autos[‘name’].str[1:].str.join(sep=" "). for ex (row 7):
image

My code:

# See if there are particular keywords in the name column that you can extract as new columns:

#The only reasonable split we can find is the **"name"** column. In a matter of fact, it won't be split, #because the name of the brand exists now. Because the **"name"** column has it too, we will drop it.

autos['name'] = autos['name'].str.replace("_", ' ').astype(str)

autos['name'] = autos['name'].str.split() # split string to list
brand_maybe = autos['name'].str[0].str.lower() # looking for brands; lowercase conversion



test_name = autos['brand'] == brand_maybe # boolean comparison
print(test_name) # test

I tried to use boolean comparison: compare if in each row autos[‘name’].str[1] is in autos[‘brand’] series.
I have no idea how I can process those strings with this method further.

I can’t use “If” statements: they are not allowed

I tried to use str.replace with regex: but I didn’t learn about it, so I don’t know how to use it .

No idea how to solve it :cry:

The problem here is that your boolean comparison is between a string and a Series, so it’ll always be False.

But what do you mean if statements are not allowed?

The way I see it you’ll need to check if the brand_maybe variable is among all the brands. Or you can filter the DataFrame by the brand column to match all rows where the brand is equal to brand_maybe, if the length of the resultant DataFrame is greater than 0, then it is a brand.

Hi.
Thank you for the reply. :slight_smile:

I tried that, but without a success - somewhere is an error, after few hours I’m still not sure where( I bet it’s in the part: “autos[‘name’] = autos[‘name’].iloc[index_num].str[1:].str.join(sep=’ ')”. My code is:

autos['name'] = autos['name'].str.replace("_", ' ').astype(str)

autos['name'] = autos['name'].str.split() # split string to list
brand_maybe = autos['name'].str[0].str.lower() # looking for brands; lowercase conversion

# autos # test

output for this part ( when autos is activated[the test]):

below the code I mentioned at the beginning:

# test code below works when all 3 parts are activated
# test shows that our code works till the end of "if" statement
# the boolean comparison gave 32865 instances from a total amount of rows: 48023
# it seems working ok

# index_test= [] # test code 1/3
for c in autos['brand'].value_counts().index:
    for index_num in range(48023): # there is 48023 rows in DataFrame
        if brand_maybe.iloc[index_num] == c:
            autos['name'] = autos['name'].iloc[index_num].str[1:].str.join(sep=' ')
        else:
            continue
#               index_test.append(index_num) # test code 2/3
# print(len(index_test)) # test code 3/3


# test_name = autos['brand'] == brand_maybe # boolean comparison
# print(test_name.head(20)) # test


autos

ps. I’m not sure how your second idea should work. I do not quite understand this part of your reply:

1 Like

Hi @drill_n_bass,

The issue here is that the first item in the name column (which in your code is str[0]) doesn’t always represent a brand, because some brand names contain 2 words (or sometimes, probably, even more). So this brand_maybe will be never equal to autos['brand'].

To confirm it, try the following code (here we’re visualizing the name column for the cases where the brand is “mercedes_benz”, taking only the first 5 examples)

autos.loc[autos['brand'] == 'mercedes_benz', 'name'].head()

Output:

11                           Mercedes_A140_Motorschaden
29    Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz...
31                            Noch_gut_erhaltenen_C_320
39                   Mercedes_C_180_Kombi__zu_verkaufen
47                                  Mercedes_Benz_A_150
Name: name, dtype: object

Now, let’s try to apply this part of your code:

autos['name'] = autos['name'].str.replace("_", ' ').astype(str)
autos['name'] = autos['name'].str.split()

(by the way, you can combine it in one line and also remove the astype(str) part, since it’s already a string, i.e. autos['name'] = autos['name'].str.replace("_", ' ').str.split())

After that, we again add the same code line as above (autos.loc[autos['brand'] == 'mercedes_benz', 'name'].head()), to check the same 5 examples:
Output:

11                       [Mercedes, A140, Motorschaden]
29    [Mercedes, Benz, E, 320, T, CDI, Avantgarde, D...
31                      [Noch, gut, erhaltenen, C, 320]
39             [Mercedes, C, 180, Kombi, zu, verkaufen]
47                             [Mercedes, Benz, A, 150]
Name: name, dtype: object

Now when we’ll try to create the brand_maybe column (applying .str[0].str.lower() to the name column), then for the row 29 (where name is now ['Mercedes', 'Benz', 'E', '320', 'T', 'CDI', 'Avantgarde', 'DPF7', 'Sitze', 'Voll']) we’ll obtain ‘mercedes’, instead of ‘mercedes_benz’, which is the value of the brand column for this row.

So we see that we cannot apply your code for extracting brand names from the name column, because the brand can sometimes consist of 2 or probably more words. Honestly, about that task (“See if there are particular keywords in the name column that you can extract as new columns”) I failed to find any particular keywords.

2 Likes

Ok, so I misunderstood your code. brand_maybe is a Series, not a string. My bad.

So here are the solutions I had mentioned (now considering brand_maybe as a Series):

# Create a list with all brands
all_brands = autos['brand'].unique().tolist()

# Iterate over the possible brands
for brand in brand_maybe:
    # Check if each possible brand is in the list containing all the brands
    if brand in all_brands:
        #Then it's a brand, do whatever you want
# Iterate over the possible brands
for brand in brand_maybe:
    # Filter the autos df by the 'brand' column, creating a new df
    new_df = autos[autos['brand'] == brand]
    
    # If it's not a brand, then the new_df length will be 0, if it's greater than zero, then it's a brand
    if len(new_df) > 0:
        #Then it's a brand, do whatever you want

However, now that I’m aware that brand_maybe is a Series, I believe your code is actually correct so far as your boolean comparison shows in which rows the first word in the name column is a brand.

The problem is (and this is also true for my solution) that when the code says it’s a brand, you can be pretty confident it’s correct. But when it says it is not a brand, you can’t trust it.

Honestly, I can’t see the value of extracting the brand for the column name either. You already have a brand column, so you’re not using the extracted name for anything, you just want to delete it. And even after you do it, the name column would remain pretty hard to use as it would still have lots of different names, numbers, initials, and patterns in each row.

1 Like

The only way is to add all variations for names of the brands to some list and use it as a part of a boolean comparison, then, there will be always match ( for brand in name column).

It’s thought exercise :wink: To be confident that I can do almost everything :smiley:

True. That’s why I will skip further digging deeper in this, I need to accelerate my learning :rocket: I have a delay in my learning plan.

1 Like

@drill_n_bass

The only way is to add all variations for names of the brands to some list and use it as a part of a boolean comparison, then, there will be always match ( for brand in name column).

Yes, sounds like a good approach, if you want to do this exercise, or when it’s really inevitable. Remember, though, that it can be quite time-consuming, since there are plenty of brands and especially their variations in this dataframe. And don’t forget to use unique() :innocent: Probably you’ll have to use it in several iterations,considering the amount of data in both columns (name and brand).

Anyway, before diving into this approach, try to run nunique() (number of unique values) for both columns. If you get really astronomical numbers (and I suspect it will be exactly the case), then it’s better not to continue, for not to lose too much time on this exercise.