Looping over the correct series

Screen Link: https://app.dataquest.io/m/294/guided-project%3A-exploring-ebay-car-sales-data/7/exploring-price-by-brand

Details: The series top_car_count holds the top 20 most popular brands. I’m not sure how to loop over the values in it. In the code above I am looping over ALL the brands in the dataframe, which yields a dictionary for ALL car brands, and that isn’t what I want. Silly question, yes. Slightly confused. Much appreciated.

My Code:

top_car_count = autos["brand"].value_counts().head(20).index
top_car_mean = {}

for cars in top_car_brands:
    each_row = autos[autos["brand"] == cars] #incorrect line here
    mean_price = each_row["price"].mean()
    top_car_mean[cars] = mean_price

    
sorted_top_car_mean = sorted(top_car_mean.items(), key=lambda x: x[1], reverse=True)
print(sorted_top_car_mean)

When you say “code above” I’m assuming you mean “code below”.

I’m not sure if I correctly understood your problem, but if you want a dictionary with only the top 20 brands, then you should loop through the top_car_count you just created instead of looping through top_car_brands. Also, you should use the sort_values method to have the series sorted, like this:

top_car_count = autos["brand"].value_counts().sort_values().head(20).index

When sorting values like otavios.s wrote above, I would also pass ascending=False to sort_values; otherwise I would get the bottom 20 brands.
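To illustrate the point about sort order, here is a minimal sketch on a tiny made-up Series (the sample data is an assumption, not the real autos dataset). value_counts() already returns counts sorted from largest to smallest, while sort_values() defaults to ascending order:

```python
import pandas as pd

# Hypothetical sample data standing in for autos["brand"].
brands = pd.Series(["vw", "vw", "vw", "bmw", "bmw", "opel"])

# value_counts() sorts by count, largest first, so head() takes the most common.
counts = brands.value_counts()
top_2 = counts.head(2).index.tolist()

# sort_values() defaults to ascending=True, so head() now takes the rarest brands.
bottom_2 = counts.sort_values().head(2).index.tolist()

print(top_2)
print(bottom_2)
```

So for the top 20 brands, value_counts().head(20) on its own should be enough; an extra sort_values() only helps if ascending=False is passed.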


Hello, I am trying to aggregate to get the top brand by vehicle type but nothing prints when I execute the code below. Any suggestions please?

top_brand_by_vehicle_type = {}
vehicle = autos_c["vehicle_type"]

for v in vehicle:
    selected_rows = autos_c[autos_c["vehicle_type"] == v]
    sorted_rows = selected_rows.sort_values("vehicle_type", ascending=False)
    first_sorted_row = sorted_rows.iloc[0,:]
    top_brand_by_vehicle_type = first_sorted_row["brand"]
    
print(top_brand_by_vehicle_type) 

Also, when I print the brands and vehicle types using print(autos_c["brand"]).value_counts()
AND
print(autos_c["vehicle_type"]).value_counts() I get an error that ‘NoneType’ object has no attribute ‘value_counts’. Why is this and how can I resolve it?
https://app.dataquest.io/jupyter/notebooks/notebook/Basics-Copy3.ipynb

Hello @vroomvroom,

To identify the unique values we want to aggregate by, we have to use unique(); otherwise the loop runs once per row instead of once per vehicle type:

vehicle = autos_c["vehicle_type"].unique()

Secondly,

The .value_counts() call should go inside the print() function, because print() itself returns None and None has no value_counts method:

print(autos_c['brand'].value_counts()) and
print(autos_c['vehicle_type'].value_counts()) respectively.
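As a small sketch of why the placement of the parentheses matters (the three-row autos_c below is a made-up stand-in), you can capture what print() returns and check it:

```python
import pandas as pd

# Hypothetical sample standing in for the real autos_c DataFrame.
autos_c = pd.DataFrame({"brand": ["bmw", "audi", "bmw"]})

# Correct order: value_counts() runs first, then print() displays the result.
result = print(autos_c["brand"].value_counts())

# print() only displays its argument; its own return value is None, which is
# why print(...).value_counts() fails with the 'NoneType' AttributeError.
print(result is None)
```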

Happy learning!


Thank you! May I ask what is the difference between unique() and value_counts()? Could you please explain the aggregation process to me? I did the following but am now getting the error “ValueError: Cannot index with multidimensional key”:

top_brand_by_vehicle_type = {}
vehicle = autos_c["vehicle_type"].unique()

for v in vehicle:
    selected_rows = autos_c[autos_c["vehicle_type"] == v]
    sorted_rows = selected_rows.sort_values("brand", ascending=False)
    sorted_brands = autos_c.loc[sorted_rows]
    top_brand_by_vehicle_type = sorted_brands

print(top_brand_by_vehicle_type)

Hi! @vroomvroom
To get the top brand for each vehicle type, you can try the code below.
To get a DataFrame:

import pandas as pd
autos_c.groupby('vehicle_type').brand.agg(pd.Series.mode).to_frame()

To get the Dictionary:

import pandas as pd
autos_c.groupby('vehicle_type').brand.agg(pd.Series.mode).to_dict()

Hope it helps!
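If you would rather stay with the loop approach from the mission, the same aggregation could be sketched roughly like this. The small autos_c below is made-up sample data, and picking the first label of value_counts() is my assumption about what "top brand" means (the most frequent brand); ties are resolved arbitrarily:

```python
import pandas as pd

# Hypothetical sample standing in for the real autos_c DataFrame.
autos_c = pd.DataFrame({
    "vehicle_type": ["suv", "suv", "suv", "coupe", "coupe"],
    "brand": ["bmw", "bmw", "audi", "audi", "audi"],
})

top_brand_by_vehicle_type = {}
for v in autos_c["vehicle_type"].dropna().unique():
    # Subset the rows for this vehicle type.
    selected_rows = autos_c[autos_c["vehicle_type"] == v]
    # value_counts() sorts by frequency, so the first index label
    # is the most common brand within the subset.
    top_brand_by_vehicle_type[v] = selected_rows["brand"].value_counts().index[0]

print(top_brand_by_vehicle_type)
```

Note that the result is stored under the key v; assigning to top_brand_by_vehicle_type directly would overwrite the dictionary on every pass.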

@vroomvroom,

I think I just understood what you are trying to do. To format your code properly, wrap it in a pair of triple backticks (```), and please also specify the mission you are referring to.

I am going to give a brief explanation regarding the line above and this mission, hoping that it will answer all your questions:

Instruction 1:

  • Explore the unique values in the brand column, and decide on which brands you want to aggregate by.
    • You might want to select the top 20, or you might want to select those that have over a certain percentage of the total values (e.g. > 5%).

Solution 1:
The value_counts() function returns a series containing the counts of unique values. The resulting series has the unique values as its index and the count of each unique value as its values. Note that when we include the normalize = True parameter, it returns relative frequencies (proportions between 0 and 1) instead of counts.

while

The unique() function is just used to find the unique elements of an array.
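A minimal side-by-side sketch of the two, using a made-up four-row Series rather than the real brand column:

```python
import pandas as pd

# Hypothetical sample standing in for autos["brand"].
brand = pd.Series(["bmw", "audi", "bmw", "bmw"])

# unique() returns only the distinct values, in order of first appearance.
unique_brands = brand.unique()

# value_counts() returns a Series: distinct values as the index, counts as values.
brand_counts = brand.value_counts()

# With normalize=True the values are proportions that sum to 1.
brand_shares = brand.value_counts(normalize=True)

print(unique_brands)
print(brand_counts)
print(brand_shares)
```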

Now, let’s aggregate by brands that have more than 5% of the total values.

Code 1: (read the comments)

brands = autos['brand'].value_counts(normalize = True) 
#assigns a series to the variable `brands` such that its indexes are the unique values in the `brand` column of the `autos` dataframe and its values are the percentage distribution of each unique value 
significant_brands = brands[brands > 0.05].index #assigns the index labels whose values are greater than 0.05 in our series above, to the variable `significant_brands`
print(significant_brands) #prints the brands that account for more than 0.05 (5%) of the total values

Output: Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')

Instruction 2:

  • Create an empty dictionary to hold your aggregate data.
    • Loop over your selected brands, and assign the mean price to the dictionary, with the brand name as the key.
    • Print your dictionary of aggregate data, and write a paragraph analyzing the results.

Solution 2:
Now, to use loops to aggregate data, the steps involved are:

  • Identify the unique values we want to aggregate by
  • Create an empty dictionary to store our aggregate data
  • Loop over the unique values, and for each:
    • Subset the dataframe by the unique values
    • Calculate the mean of whichever column we’re interested in
    • Assign the unique value and its mean to the dictionary as key and value.

In Code 1, we already identified the unique values we want to aggregate by. The next step is to aggregate.

Code 2:

brand_mean_prices = {} #create an empty dictionary

for brand in significant_brands: #loop over `significant_brands` which holds the brands we want to aggregate by
    selected_brand = autos[autos['brand'] == brand] #subsets the `autos` dataframe by each unique value and assigns it to `selected_brand`
    price_mean = selected_brand['price'].mean() #calculate the mean of the `price` column for the subset
    brand_mean_prices[brand] = int(price_mean) #assigns the key of the dictionary as each unique value and its corresponding value as the mean
print(brand_mean_prices) #prints the dictionary

Output: {'volkswagen': 5402, 'bmw': 8332, 'opel': 2975, 'mercedes_benz': 8628, 'audi': 9336, 'ford': 3749}

You can adapt the code above by replacing the variable names with the ones you already defined in your code, and then check the output.

I tried to explain this as simply as possible. I hope this helps.


Thank you, this is very helpful.

This was very helpful, thanks!

Is there a way to sort the dictionary keys in ascending or descending order based on their assigned values, inside the print function?

I tried adding the .sort_values method, but it did not work, I assume because it is a Series method and not for dictionaries?

Hello @bc330,

It’s my pleasure.

Check out this post on Stack Overflow. I hope it answers your question.
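In case it is useful, one common way to do this is the built-in sorted() with a key function, which is the same pattern used in the very first post of this thread. A minimal sketch, reusing the dictionary printed as the output of Code 2:

```python
# Dictionary taken from the Code 2 output earlier in this thread.
brand_mean_prices = {'volkswagen': 5402, 'bmw': 8332, 'opel': 2975,
                     'mercedes_benz': 8628, 'audi': 9336, 'ford': 3749}

# .items() yields (brand, price) pairs; the key function picks the price,
# and reverse=True sorts from the highest mean price to the lowest.
sorted_prices = sorted(brand_mean_prices.items(),
                       key=lambda item: item[1], reverse=True)

print(sorted_prices)
```

The result is a list of (brand, price) tuples rather than a dictionary. You are right that sort_values is a Series method, so it does not exist on a plain dict.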


You explained it so well!! It is clear and easy to understand. Thank you.


Could you help explain .index? You use it in significant_brands. What is it for?

.index is used to return the index (row labels) of the DataFrame/Series.

When you run brands[brands > 0.05] from Code 1 above, it returns the Series below:

volkswagen       0.211264
bmw              0.110045
opel             0.107581
mercedes_benz    0.096463
audi             0.086566
ford             0.069900
Name: brand, dtype: float64

To then extract the index of that Series, which holds the brands with value counts greater than 0.05, you use Series.index.

Running brands[brands > 0.05].index returns the Index of the Series, so that when you print it out, it gives:

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')


I guess I can use it to return all the columns too, just like we use df.columns?

They perform different functions.
df.index - The row labels of the DataFrame.
df.columns - The column labels of the DataFrame.
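A quick sketch of the difference on a tiny made-up DataFrame (the names and numbers here are only for illustration):

```python
import pandas as pd

# Hypothetical two-row DataFrame with brand names as row labels.
df = pd.DataFrame({"price": [5000, 8000], "year": [2004, 2010]},
                  index=["opel", "bmw"])

row_labels = list(df.index)        # labels along the rows
column_labels = list(df.columns)   # labels along the columns

# A single column is a Series: it has an .index but no .columns attribute.
series_labels = list(df["price"].index)

print(row_labels)
print(column_labels)
print(series_labels)
```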


Simple and extremely helpful.


top_brands = autos["brand"].value_counts().sort_values().head(20).index
top_brands_price = {}
for b in top_brands_price:
    desired_row = autos[autos["brand"] == b]
    mean_price = desired_row["price"].mean()
    aggregate_data[b] = int(mean_price)

print(top_brands_price)

My dictionary is empty. Can someone help me out?

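You didn’t use the dictionary top_brands_price when creating a key. Change aggregate_data[b] = int(mean_price) to top_brands_price[b] = int(mean_price). Also, the loop iterates over top_brands_price, which is still empty at that point, so the loop body never runs; loop over top_brands instead (for b in top_brands:).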
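Putting those fixes together, a corrected version might look like the sketch below. The three-row autos DataFrame is made-up sample data so the snippet runs on its own, and I dropped the extra sort_values() because value_counts() already sorts from most to least common:

```python
import pandas as pd

# Hypothetical sample standing in for the real autos DataFrame.
autos = pd.DataFrame({
    "brand": ["bmw", "bmw", "opel"],
    "price": [8000, 9000, 3000],
})

top_brands = autos["brand"].value_counts().head(20).index
top_brands_price = {}

for b in top_brands:  # iterate over the brands, not over the empty dictionary
    desired_row = autos[autos["brand"] == b]
    mean_price = desired_row["price"].mean()
    top_brands_price[b] = int(mean_price)  # write to the dictionary that is printed

print(top_brands_price)
```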

I have been doing something wrong because in the first line of code:

brands = autos['brand'].value_counts(normalize=True)

I get the following error message:

TypeError                                 Traceback (most recent call last)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

TypeError: an integer is required

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-182-dcfd61c84b5c> in <module>()
----> 1 brands = autos['brand'].value_counts(normalize=True)

/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/series.py in __getitem__(self, key)
    621         key = com._apply_if_callable(key, self)
    622         try:
--> 623             result = self.index.get_value(self, key)
    624 
    625             if not is_scalar(result):

/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   2558         try:
   2559             return self._engine.get_value(s, k,
-> 2560                                           tz=getattr(series.dtype, 'tz', None))
   2561         except KeyError as e1:
   2562             if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

KeyError: 'brand'

I cannot figure out what I missed. I appreciate any feedback.
Thanks :raised_hands: