Aggregated brands across mean price don't match

Screen Link: https://app.dataquest.io/m/294/guided-project%3A-exploring-ebay-car-sales-data/8/storing-aggregate-data-in-a-dataframe

In this step, I wanted to aggregate the brands by mean price using 2 methods:

  1. Top 20 companies

top_20_brands = autos["brand"].value_counts().sort_values(ascending=False).head(20).index

aggregate_data = {} 

for b in top_20_brands: 
    selected_row = autos[autos["brand"] == b]
    mean_price = selected_row["price"].mean() 
    aggregate_data[b] = int(mean_price)

I expected to get:

audi             9336
bmw              8332
ford             3749
mercedes_benz    8628
opel             2975
volkswagen       5402

However, I got:

sonstige_autos : 12338
mini : 10613
audi : 9336
mercedes_benz : 8628
bmw : 8332
skoda : 6368
volkswagen : 5402
hyundai : 5365
toyota : 5167
volvo : 4946
nissan : 4743
seat : 4397
mazda : 4112
citroen : 3779
ford : 3749
smart : 3580
peugeot : 3094
opel : 2975
fiat : 2813
renault : 2474

  2. Brands that make up more than 5% of the listings in the dataset

I wrote the following code:

top_percent = autos["brand"].value_counts(normalize=True)
top_percent_index = autos[top_percent > 0.05].index  

But got the error:

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

Not sure what I’m doing wrong in both of the methods I tried to approach this step. I was hoping that someone could help me out, thanks a lot!

@aminurkhan750 Welcome to DQ Community!

  1. As I understand it, you want to select the top 20 brands based on their mean price, that is, the 20 highest mean prices in descending order.

Then why is your expected output different from the observed one?
The expected output has only 6 brands, while the observed output has 20. Is that the issue, or are you trying to match your results with the provided solution?

If it’s the latter, then you need to review the previous steps as well and compare them to the solution to understand where you may have deviated.

  2. When you have already created a Series, top_percent, which gives you the share of each brand, why are you trying to filter the autos dataset again?

Try this code and check the result: top_percent[top_percent > 0.05] (value_counts(normalize=True) returns fractions, so 5% is 0.05). If it does what you need, think about the difference between that code and this one:
autos[top_percent > 0.05].index
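
A minimal sketch of that difference, using a small made-up stand-in for the autos dataframe (the brand counts below are invented for illustration). The boolean Series produced by the comparison is labelled by brand name, so it lines up with top_percent itself but not with the integer row labels of autos:

```python
import pandas as pd

# Hypothetical stand-in for the course's `autos` dataframe: 40 listings.
autos = pd.DataFrame({
    "brand": ["volkswagen"] * 20 + ["bmw"] * 10 + ["opel"] * 8 + ["lada", "saab"],
})

# Share of listings per brand, as fractions between 0 and 1.
top_percent = autos["brand"].value_counts(normalize=True)

# The mask is labelled by brand name, just like top_percent, so this works.
common_brands = top_percent[top_percent > 0.05].index
print(list(common_brands))

# Applying the same mask to `autos` fails: its rows are labelled 0..39, not by brand.
try:
    autos[top_percent > 0.05]
    raised = False
except Exception as exc:  # pandas reports an IndexingError here
    raised = True
    print(type(exc).__name__)
```

The mask and the object being filtered need to share the same index labels; that is why filtering top_percent by its own comparison works while filtering autos does not.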

Let us know if this doesn’t help you.

Hi Rucha, I have the same issue. I was looking for the highest mean price among the top 20 brands, comparing mean prices across those 20 brands to decide which brand holds the best value. Is that the correct thought process?

If I want to get the same answer as the instructions, I can only analyze data from the 6 most popular brands, correct?

hey @candiceliu93

I am not sure if I understand your question correctly.

Let’s say we have 35 brands in total in the dataset. We group all the cars by their respective brands and calculate the Mean Price for each of the brands.
Then arrange the mean prices in descending order and select the first 20 rows, thereby selecting the top 20 brands by mean price. Out of these 20 brands, you then want to pick only one: the top brand.

If my above assumption is correct, then yes you can do that. Your thought process is not wrong.
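
The steps described above can be sketched with groupby. This is only an illustration on a tiny made-up dataframe (the brands and prices are invented, and it keeps 3 brands instead of 20 so the output stays short):

```python
import pandas as pd

# Hypothetical miniature of `autos`; the prices are made up.
autos = pd.DataFrame({
    "brand": ["audi", "audi", "bmw", "bmw", "opel", "opel", "fiat"],
    "price": [9000, 10000, 8000, 9000, 3000, 2000, 2000],
})

# Mean price per brand, highest first.
mean_by_brand = autos.groupby("brand")["price"].mean().sort_values(ascending=False)

# Keep the n brands with the highest mean price (20 in the thread, 3 here).
top_n = mean_by_brand.head(3)

# The single brand with the highest mean price.
top_brand = mean_by_brand.idxmax()

print(top_n)
print(top_brand)
```

Note that this ranks brands by mean price, which is a different selection from ranking them by how many listings they have.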

Well, it’s not a question of “can analyze”. You can analyze as much data as you like. However, if you do wish to match the solution, then consider only the 6 brands that the solution considers.
The challenge arises if your top 6 brands are something like A, B, C, D, E, and F while the solution’s top 6 brands are A, B, D, F, G, and H.
Then all you need to do is find out where your code deviated from the solution’s.

Please note that I said “deviate”, not “wrong”. Getting a different result does not in any way imply that your code is wrong. The difference can come from many things: for example, you grouped the cars and brands differently, or the DQ solution handled missing values differently.

Do let me know if you lost the actual answer in this worded response of mine :open_mouth:

Got it! Yes, you understood my question correctly!

Thank you for your clear explanation!

Hello! I think I have the same issue of getting a different answer.
This is my code:

brands = autos["brand"].value_counts(normalize=True)
significant_brands = brands[brands > 0.05].index

avg_price_by_brand = {}

for b in significant_brands:
    selected_rows = autos[autos["brand"] == b]
    mean = selected_rows["price_usd"].mean()
    avg_price_by_brand[b] = mean

print(avg_price_by_brand)

And when I sort, I get these answers:

print(sorted(avg_price_by_brand.items(), key=lambda kv: kv[1],reverse=True))

[('audi', 9336.687453600594),
('mercedes_benz', 8628.450366422385),
('bmw', 8571.480147917478),
('ford', 7456.547932618683),
('volkswagen', 6729.81956411556),
('opel', 5432.479195699781)]

Which is different from what I see in DQ:

bmw              8332
mercedes_benz    8628
opel             2975
audi             9336
volkswagen       5402
ford             3749

Can you help me find why?
Thanks so much

Welcome to DQ Community @sue.ancaya!

  • If it’s the order that is different, then I guess the DQ solution hasn’t sorted the result in descending order, which you have done.

  • If it’s the values that don’t match, have you compared only this step between your work and the solution? If yes, I would suggest going through the entire solution workflow to identify where your steps may have deviated.

(When I ran your code in my notebook, the values were different again for me.)

If the answer is no, or if you can’t identify the difference, then please share your notebook.
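
One way an earlier step can shift some brand means and leave others untouched is how implausible prices are handled. The sketch below uses invented numbers and a hypothetical cutoff (neither is taken from the course solution) purely to illustrate the effect:

```python
import pandas as pd

# Made-up listings; the last ford row is an implausible price.
autos = pd.DataFrame({
    "brand": ["audi", "audi", "ford", "ford", "ford"],
    "price_usd": [9000, 10000, 3000, 4000, 999999],
})

def mean_price_by_brand(df):
    """Mean price per brand as a plain dict, like the loop-based approach."""
    return df.groupby("brand")["price_usd"].mean().to_dict()

raw_means = mean_price_by_brand(autos)

# Hypothetical cleaning rule: keep prices between 1 and 100,000 (inclusive).
cleaned = autos[autos["price_usd"].between(1, 100_000)]
clean_means = mean_price_by_brand(cleaned)

print(raw_means)
print(clean_means)
```

Here only the ford mean changes, because only ford has a row removed by the filter. Comparing row counts per brand between two notebooks is a quick way to spot this kind of divergence.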

Also, please use the code formatting option </> for the code in your post. This preserves the syntax and structure of the original code. This guideline has further details.