Data Cleaning Walkthrough - Parsing Geographic Locations - use str.extract instead of a custom function?

I’m working on this walkthrough, and it seems to me it would be easier to use Series.str.extract rather than creating a custom function and the apply() function. Am I missing something?

My Code:

# Raw strings keep the regex backslashes intact; expand=False returns a Series
data['hs_directory']['coords'] = data['hs_directory']['Location 1'].str.extract(r"(\(.+\))", expand=False)
data['hs_directory']['latitude'] = data['hs_directory']['Location 1'].str.extract(r"\((.+),", expand=False)
data['hs_directory']['longitude'] = data['hs_directory']['Location 1'].str.extract(r",\s(.+)\)", expand=False)

# Print out to test
print(data['hs_directory'][['coords','latitude','longitude']].head())

Hi @PatrickSmith

The word “easier” is a bit tricky here.

Let’s take a simple case. Here there is one field from which we need to extract latitudes and longitudes, and you have written three lines of code for that one field.

Imagine we have to do the same for 25 columns. That would be 24 * 3 = 72 extra lines of code, all serving the same purpose, plus however many extra variables we would need to define. Would it not make more sense to define a function that can be applied to these columns as and when necessary?

You can also modify or update this function to handle multiple cases of string extraction or manipulation (think if-else or switch-case within the function!).
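
To make the idea concrete, here is a rough sketch of such a reusable function. The names parse_coord and COORD_PATTERN, the "Location 2" column, and the sample strings are all made up for illustration; this is not the walkthrough's own function.

```python
import re
import pandas as pd

# Hypothetical pattern for text that contains "(lat, lon)"
COORD_PATTERN = re.compile(r"\(([^,]+),\s*([^)]+)\)")

def parse_coord(text, part):
    """Return the 'latitude' or 'longitude' substring of text, or None."""
    if not isinstance(text, str):
        return None  # missing values (NaN/None) pass through as None
    match = COORD_PATTERN.search(text)
    if match is None:
        return None
    if part == "latitude":
        return match.group(1)
    elif part == "longitude":
        return match.group(2)
    raise ValueError(f"unknown part: {part}")

# Made-up sample data; the second column is an assumption for the demo
df = pd.DataFrame({
    "Location 1": ["883 Classon Avenue\nBrooklyn, NY 11225\n(40.67029890700047, -73.96164787599963)"],
    "Location 2": ["1 Main St\n(40.5, -73.9)"],
})

# One function, reused for every column that needs parsing
for col in ["Location 1", "Location 2"]:
    df[col + " latitude"] = df[col].apply(parse_coord, part="latitude")
    df[col + " longitude"] = df[col].apply(parse_coord, part="longitude")

print(df.filter(like="itude"))
```

The if/elif branch is where further cases could be added if other kinds of extraction were needed.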

You may refer to this article for more details - The Principles of Good Programming (artima.com)

@Rucha Fair enough. I think in this case a function violates the KISS principle, since we only need it once, but I see how it’s a good exercise and good practice for larger projects. Still, if you’re going to create functions, would it not be better to use Series.str.extract() rather than Series.apply(), since it’s vectorized, as covered in the lesson “Working with Strings In Pandas”?

Hi @PatrickSmith

Yes. Vectorization would make the extraction of values much faster than the “.apply” method. For a small number of records, though, the two would be comparable. I am not sure how many rows there are in this task. Assuming there are only a few, that may be why the author suggested the latter.
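
If you want to check this on your own machine, something like the following rough sketch would do. The data here is synthetic (10,000 copies of a made-up address), and find_lat is only a stand-in for a row-by-row function, so the timings will not match the real dataset.

```python
import re
import timeit
import pandas as pd

# Synthetic data: 10,000 made-up location strings (not the real dataset)
locations = pd.Series(["1 Main St\nBrooklyn, NY 11225\n(40.5, -73.9)"] * 10_000)

pattern = r"\((.+),"

def find_lat(loc):
    # Row-by-row version, intended for Series.apply
    return re.search(pattern, loc).group(1)

via_apply = locations.apply(find_lat)
via_extract = locations.str.extract(pattern, expand=False)

# Both approaches should produce the same values
assert via_apply.tolist() == via_extract.tolist()

# Time five runs of each approach
apply_time = timeit.timeit(lambda: locations.apply(find_lat), number=5)
extract_time = timeit.timeit(
    lambda: locations.str.extract(pattern, expand=False), number=5
)
print(f"apply:   {apply_time:.3f}s")
print(f"extract: {extract_time:.3f}s")
```

How large the gap is will depend on the pandas version and the number of rows, so it is worth measuring rather than assuming.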

Out of curiosity, have you thought about writing a regexp that combines the three-step extraction into, say, one?
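
One possible way to do that, as a sketch only: a single pattern with nested named groups, so that one str.extract call returns all three columns at once. The sample rows below are made up, and the pattern assumes the coordinates always appear as "(lat, lon)".

```python
import pandas as pd

# Made-up sample rows standing in for the 'Location 1' column
hs_directory = pd.DataFrame({
    "Location 1": [
        "883 Classon Avenue\nBrooklyn, NY 11225\n(40.67029890700047, -73.96164787599963)",
        "address with no coordinates",
    ]
})

# Outer group captures the whole "(lat, lon)" text; inner groups capture each part
pattern = r"(?P<coords>\((?P<latitude>[^,]+),\s*(?P<longitude>[^)]+)\))"

# With named groups, str.extract returns a DataFrame whose columns are the group names
extracted = hs_directory["Location 1"].str.extract(pattern)
hs_directory = hs_directory.join(extracted)

# Convert the captured strings to floats; rows without a match stay NaN
hs_directory[["latitude", "longitude"]] = hs_directory[
    ["latitude", "longitude"]
].apply(pd.to_numeric, errors="coerce")

print(hs_directory[["coords", "latitude", "longitude"]])
```

Rows where the pattern does not match get NaN in all three columns, which makes missing coordinates easy to spot.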