Two approaches to update column names...?

Screen Link:
https://app.dataquest.io/m/294/guided-project%3A-exploring-ebay-car-sales-data/2/cleaning-column-names

In the above-mentioned screen, the challenge is to clean column names. I was looking back to check what I’d learned to do so, and then noticed that it seems I learned two different approaches.

  1. Create a new list of column names to replace the current ones. As per:
    https://app.dataquest.io/m/293/data-cleaning-basics/3/cleaning-column-names-continued

  2. Direct renaming of columns. As per:
    https://app.dataquest.io/m/293/data-cleaning-basics/7/renaming-columns

My questions:

  • Am I correct that these are two approaches to achieve the same thing?
  • If so, what are the reasons to choose one over the other?
  • (Related) In the guided project, it seems the first of the two is suggested. (Or actually, the suggestion is now to make a copy of the column names, which is slightly different again?) Is there a particular reason for that, and would the other approach work too?

Kind regards,
Jasper


They are not doing the same thing.

First link edits all column names, ignoring values.
Second link edits all values and types in 1 column, ignoring any column name.

For a list comprehension, take out a single element and see how it is transformed. For a method chain, break the chain down into its individual components and print each one to see the data transformation process.
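To make that concrete, here is a minimal sketch. The clean_col function below is a hypothetical stand-in for the one in the lesson, and the two-row laptops frame is made up for illustration.

```python
import pandas as pd

# Hypothetical cleaning function, standing in for the lesson's clean_col.
def clean_col(col):
    return col.strip().lower().replace(" ", "_")

# Made-up data, for illustration only.
laptops = pd.DataFrame({" RAM ": ["8GB", "16GB"],
                        "Operating System": ["Windows", "macOS"]})

# List comprehension: run a single element through the function on its own.
first = laptops.columns[0]
cleaned_first = clean_col(first)
print(repr(first), "->", repr(cleaned_first))

# Method chain: store each intermediate result instead of chaining.
step1 = laptops[" RAM "]            # select the column
step2 = step1.str.replace("GB", "")  # strip the unit, values are still strings
step3 = step2.astype(int)            # convert the strings to integers
print(step1.tolist(), step2.tolist(), step3.tolist())
```

Printing each step shows exactly where the values change and where the type changes.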

You may need to edit column names to make life easier for the person dealing with databases. Imagine you had a comma in a column name. That would really mess up SELECT a,b. Is that 2 columns, or 1 column called a,b?

You may need to edit column names to make it easier for you to write regex in future to control the columns.

I haven’t done the project, so I can’t answer why there’s copying there.
Generally, you copy something so that the original data is left untouched. That could be because you need the original for something else, such as displaying labels on a graph (what is good for visualisation as text labels may not be what is good for the computer), or using it as the key in a key-value mapping that records the before:after conversion, to be kept in RAM or on disk for future use.
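As a rough sketch of the before:after mapping idea, assuming a tiny made-up autos frame with two of the project's column names:

```python
import pandas as pd

# Made-up one-row frame, for illustration only.
autos = pd.DataFrame({"dateCrawled": ["2016-03-26"], "offerType": ["Angebot"]})

# Keep an untouched copy of the original names before changing anything.
original_columns = autos.columns.copy()

# Replace the names on the DataFrame itself.
autos.columns = ["date_crawled", "offer_type"]

# Pair each original name with its cleaned version.
before_after = dict(zip(original_columns, autos.columns))
print(before_after)
```

The original names stay available, for example as readable labels or as a lookup table.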


Hey hanqi,
A (very) belated ‘thank you’ for the response. I wasn’t sure I could follow your reply completely, because for the second link, too, the result is that the column names are updated, not the data in the columns. (But never mind, I have been able to achieve what I needed to achieve.)
Thank you,
Jasper

I now have an idea why we’re talking past each other.
When I open that page, I only see the default starter code of laptops["ram"] = laptops["ram"].str.replace('GB','').astype(int). That’s why I said there was no renaming here.

Now I opened the answer and see

laptops["ram"] = laptops["ram"].str.replace('GB','').astype(int)
laptops.rename({"ram": "ram_gb"}, axis=1, inplace=True)
ram_gb_desc = laptops["ram_gb"].describe()

so I guess you are referring to the middle line then?

And for the 1st example you are referring to laptops.columns = [clean_col(c) for c in laptops.columns] ?

Yes, they do the same thing, if you are talking about the patterns df.rename vs df.columns = edited_columns.

However, their flexibility is different.
.rename() supports method chaining, meaning you can write something like df.sort_index().merge(df2).rename(...). As long as each step returns a DataFrame, the chain can be as long as you like and rename() can be called anywhere in it.

That is not possible with df.columns = edited_columns. It is a single assignment statement that cannot be chained, because it does not call a DataFrame method that returns a DataFrame; it simply reads the columns attribute, changes it a little, and puts the result back in the same place. (You can understand it this way until you go more in depth into how name binding works in Python.)
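A small sketch of the difference, using a made-up laptops frame (the data and the other steps in the chain are assumptions for illustration):

```python
import pandas as pd

# Made-up data, for illustration only.
laptops = pd.DataFrame({"ram": ["8GB", "16GB"], "price": [900, 1500]})

# rename() returns a new DataFrame, so it can sit in the middle of a chain.
chained = (
    laptops
    .sort_values("price", ascending=False)
    .rename(columns={"ram": "ram_gb"})
    .reset_index(drop=True)
)
print(chained.columns.tolist())

# Attribute assignment is a statement: it changes the frame it is applied to
# and produces no value that a further method could be called on.
renamed = laptops.copy()
renamed.columns = ["ram_gb", "price"]
print(renamed.columns.tolist())

# The original frame is unchanged by the chain above.
print(laptops.columns.tolist())
```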

A major jump forward when learning programming is to think in terms of objects/types and methods. Every time you learn a new framework, keep in mind what object you are dealing with. That determines which methods the object has, which in turn determines what actions you can perform on it.

df.rename works on a DataFrame object, while df.columns is an Index object. You can check this by running type(df) to see pandas.core.frame.DataFrame and type(df.columns) to see pandas.core.indexes.base.Index.

For the available actions, you can pass any object to dir() to see its attributes (which include the data members and methods of the class).
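For example, a quick sketch on a throwaway frame (the frame and the variable names are made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})

# Which object am I dealing with?
print(type(df))
print(type(df.columns))

# dir() lists attribute names; filter out the private ones to keep it readable.
frame_attrs = [name for name in dir(df) if not name.startswith("_")]
index_attrs = [name for name in dir(df.columns) if not name.startswith("_")]

# Some methods exist on one object but not on the other.
print("merge" in frame_attrs, "merge" in index_attrs)
print("intersection" in index_attrs)
```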

With columns, you can treat them as sets and study intersections and unions with the columns of other DataFrames. If all you care about is the columns themselves, doing the same thing by merging whole DataFrames and analysing the result is much more convoluted.
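A minimal sketch of those set-style operations, using two empty frames with made-up column names:

```python
import pandas as pd

# Two empty frames; only their column names matter here.
df_a = pd.DataFrame(columns=["brand", "model", "price"])
df_b = pd.DataFrame(columns=["brand", "price", "odometer"])

shared = df_a.columns.intersection(df_b.columns)    # names in both frames
combined = df_a.columns.union(df_b.columns)         # names in either frame
only_in_a = df_a.columns.difference(df_b.columns)   # names only in df_a

print(shared.tolist())
print(combined.tolist())
print(only_in_a.tolist())
```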

If looking at dir() is too dry, you can open the colourful https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html and read what inputs the method takes to explore its capabilities. Note that pandas often designs a single parameter to accept several data types, as you can see in the first parameter, mapper: dict-like or function. Accepting a dict-like (e.g. a plain Python dictionary or a pd.Series) is very helpful because dictionaries are among the most important Python objects. Other methods such as Series.map also accept them.
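Here is a short sketch of the two kinds of input, on a made-up one-row frame:

```python
import pandas as pd

# Made-up data, for illustration only.
laptops = pd.DataFrame({"RAM": ["8GB"], "Price": [900]})

# Dict-like: only the listed names change, everything else is left alone.
by_dict = laptops.rename(columns={"RAM": "ram_gb"})
print(by_dict.columns.tolist())

# Function: it is applied to every column name.
by_func = laptops.rename(columns=str.lower)
print(by_func.columns.tolist())
```

The dict form suits a handful of specific renames; the function form suits one rule applied to all names.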

For now, you can just choose whichever one you are comfortable with. One day you will hit the limits of that API, and you will automatically look for new ways. Learning is just continuously doing more difficult things that your existing skill set cannot handle.


Thank you @hanqi!

Yes, indeed those were the pieces of code that I was referring to.
Thank you for the elaborate explanation of the two, and the additional comments.
That’s helpful - it makes more sense now!

Thanks @jasperquak for the great question, came here because I had the same exact one!
Thanks @hanqi for your detailed answers, they really helped.


For anyone with this same question in the future,

I also had the same question, but was left feeling overwhelmed by the responses above. Clearly the two exercises were tackling different problems, but I figured the “copy and update” column names approach could be used for this project. Here is the main difference I found in implementation between the two methods.

Method 1 seems easier since we are renaming many columns, and the code is slightly more straightforward for a simple rename. However, when I tried to implement it, I got an error due to the fact that df.columns returns a pandas Index. This means we can easily iterate through all of the column names using the ‘copy’ we create, but we cannot call a replace method on it the way we would on a Series, because what we have saved is an Index, which has no such method.
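A minimal reproduction of that error, assuming a made-up frame with placeholder column names:

```python
import pandas as pd

# Made-up data; the column names are placeholders.
df = pd.DataFrame({"old_col_name": [1, 2], "other": [3, 4]})
columns_copy = df.columns.copy()

# Iterating over the Index works fine.
for name in columns_copy:
    print(name)

# Calling .replace() on the Index fails, because Index has no such method.
error_message = None
try:
    columns_copy.replace("old_col_name", "new_col_name")
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```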

A quick and easy fix is to call the pd.Series() constructor (capital S) on the pandas Index we retrieved from df.columns. Now we have a Series with all of our column names in it, and we can use the Series.replace() method instead of the df.rename() method.

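col = pd.Series(df.columns)  # wrap the Index in a Series so .replace() is available
col = col.replace("old_col_name", "new_col_name")
df.columns = col  # assign the updated names back to the DataFrame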

Hope you found this helpful.


It was a little cumbersome, but I think it achieves the same thing.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html

Example:
autos.rename({'yearOfRegistration': 'registration_year'}, axis=1, inplace=True)

I used this method at first, with one line of code changing the name of each column. I couldn’t help but think that approach was incorrect, or at least inefficient. I explored the idea of looping through autos.columns and using re.sub() to convert each column name to snake case. That worked fine, but still felt like I was overdoing it.
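For reference, a rough sketch of what such a loop could look like. The to_snake_case helper and its regex are my own assumptions, not the code from the project, and the frame only holds a few of the column names.

```python
import re
import pandas as pd

def to_snake_case(name):
    # Hypothetical helper: insert "_" wherever an uppercase letter follows
    # a lowercase letter or digit, then lowercase the whole name.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

# Empty frame with a few of the project's column names.
autos = pd.DataFrame(columns=["dateCrawled", "name", "yearOfRegistration", "powerPS"])

autos.columns = [to_snake_case(c) for c in autos.columns]
print(autos.columns.tolist())
```

Note that a mechanical conversion gives year_of_registration, not a hand-picked name like registration_year, so some manual renaming may still be wanted afterwards.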

In the end, I opted for a more elementary approach. I copied autos.columns to a new list, manually updated the column names, and assigned them back to the autos.columns attribute.

new_columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'date_created', 'nr_of_pictures', 'postal_code',
       'last_seen']

autos.columns = new_columns

autos.head()

I’m sure this screams beginner, but I felt it was cleaner and more readable than repeating autos.rename() 13 times or writing some complicated regex formula.


For sure. When I saw that similar solution from the github solution page, my immediate thought was “keep it simple stupid” in the future.

Below is my beginner approach. It shows the column names before and after the rename.

mapping_dict = {
    'dateCrawled': 'date_crawled',
    'offerType': 'offer_type',
    'vehicleType': 'vehicle_type',
    'yearOfRegistration': 'registration_year',
    'powerPS': 'power_ps',
    'monthOfRegistration': 'registration_month',
    'fuelType': 'fuel_type',
    'notRepairedDamage': 'unrepaired_damage',
    'dateCreated': 'ad_created',
    'nrOfPictures': 'nr_of_pictures',
    'postalCode': 'postal_code',
    'lastSeen': 'last_seen'
}

autos = autos.rename(columns=mapping_dict)
autos.columns
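One thing worth knowing with the dictionary approach: by default, rename() silently skips keys that do not match any column, so a typo in the mapping goes unnoticed. A small sketch, using a made-up two-column frame and a deliberately misspelled key:

```python
import pandas as pd

# Empty frame with two of the project's column names.
autos = pd.DataFrame(columns=["dateCrawled", "offerType"])

# "offerTyp" is a deliberate typo, for illustration.
mapping_with_typo = {"dateCrawled": "date_crawled", "offerTyp": "offer_type"}

# Default behaviour: the unmatched key is ignored and offerType stays as it was.
silent = autos.rename(columns=mapping_with_typo)
print(silent.columns.tolist())

# With errors="raise", pandas reports the unmatched key instead.
raised = False
try:
    autos.rename(columns=mapping_with_typo, errors="raise")
except KeyError as exc:
    raised = True
    print("KeyError:", exc)
```

Checking autos.columns after the rename, as above, catches the same problem by eye.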