Regroup string by prefix

I have a dataset with a company name column and I have to regroup them in company group. For example: Dell UK, Dell France, Dell US => Dell Group The data is not clean and some cells might contain mistakes like dots, or mispelling.

I tried some things like fuzzy comparaisons but sometimes the group prefix is small like 3M Group and the end of name is longer.

Do you have some tracks for me ?

Thank you and sorry for my English

Data :
Dell France
Dell UK
Del US
3M Italy
3 M

Output expected:

{ Dell : [Dell France, Dell UK, Dell US], 3M : [3M Italy, 3M COMPANY, 3M SCIENCE, 3.M SPAIN, 3 M], MCDONALDS : [MCDONALDS France, MC DONALDS COMPANY, MCDONALD’S INC.], Toshiba : [Toshiba]}

Thank you and sorry for my English :slight_smile:

1 Like

Hi arouf!

A few questions to get a better idea of your problem -

What format is your data in? A Pandas DataFrame or something else?

Would you like to create a new column containing the group name, or overwrite the current supplier names with the group name?

Are you familiar with regular expressions?

You English is excellent by the way :upside_down_face:

Hi Slavina, Thank you for your answer and for my english :grin:
Exactly, my data is in a pandas dataframe. And I want to write in a new column the common group name.
I do not know a lot about regular expressions, but the data is not 100% clean, for example I can have 1&1, 1 & 1, 1x1 companies, but these three companies are the same group 1&1.

I hope I have been clear :sweat_smile:

Don’t have the time for a long answer right now but have you been through Dataquest’s Data Cleaning missions?

Not yet, I just discovered Dataquest. I will do a research hoping someone had the same issue as me.
It is ok if you do not have time :wink:

@arouf Going through the Data Cleaning missions would help you for sure! How many rows are there in your dataset?

I have about 50,000 rows in my dataset

And how many unique “Company Group” names?

I can not know. It is just for the procurement team to have a better view of this dataset. It has to be an automated script or maybe with some machine learning, but first i have to assign them a common company group. I tried differents methods like this one and others similarity functions but they take all of the string, and will regroup for example McDo FRANCE and BK FRANCE by FRANCE…

TF-IDF works best for bigger corpuses (longer documents) and you have short names. It would be helpful if you had a list of your “groups”, which you then tried to match to the “unclean” supplier names based on similarity, I think.

That is the point that I want to reach. But I can not have this clean table of company groups

It’s a very interesting problem. Let me think about this for a bit :slight_smile:

Thanks a lot for your help :smile:

1 Like

You’re more than welcome :smiley:

1 Like

Would you mind saving say 1000 rows of your SUPPLIER_NAME column in whatever format is convenient to you and sharing it with me so that I can run some tests?

It is sensitive data from my company, but I could share a sample with in private