My answers for additional questions posted in exercise 13

Hi:)

Below are my answers to the additional questions from this exercise (13/14). I did them on the last page where I could, so here (exercise 12/14).

I’m open to suggestions (what could be done better, more simply, etc.). I also posted some additional questions at the end of this post. I would be very happy if someone could shed some light on those topics.

Base code from exercise 293-12:

laptops['weight'] = (laptops['weight']
                     .str.replace('kg','')
                     .str.replace('s','')
                     .astype(float)
                    )

laptops.rename({"weight":'weight_kg'}, axis=1, inplace=True)

Below are the additional DQ tasks (I don’t know why they were on the 13/14 page):

# Convert the price_euros column to a numeric dtype.

laptops['price_euros'] = (laptops['price_euros']
                          .str.replace(',','.')
                          .astype(float)
                         )

# Extract the screen resolution from the screen column.
laptops["screen resolution"] = (laptops["screen"].str.split().str[-1])

# Extract the processor speed from the cpu column.
laptops["processor speed"] = (laptops["cpu"].str.split().str[-1])



# last part of the exercise 12/14 page (it has to go below the 13/14 tasks because of the saving)
dtypes = laptops.dtypes
laptops.to_csv('laptops_cleaned.csv',index=False)

“Here are some questions you might like to answer in your own time by analyzing the cleaned data”:

# 1. Are laptops made by Apple more expensive than those made by other manufacturers?

# Answer: That's a tricky one! The correct answer is YES. No need to write a code to check it.
# The idea of programming is to simplify things. Obvious things like: do birds fly? Is the flame hot? 
# Are laptops made by Apple more expensive than those made by other manufacturers and will they not have usb ports? - those things are from the same "obvious" category. :)  

# Let's suppose that the brand isn't called Apple( at this point you can see how much I like this company), but: worthless-trendy-garbage(aka. WTG) compared with good laptops. The answer is:

# Identify laptops from wtg and good ones:
wtg = laptops[laptops['manufacturer'] == "Apple"]
good_laptops  = laptops[laptops['manufacturer'] != "Apple"]


# Find the minimum, maximum and average prices for both objects written above:
wtg_av = wtg['price_euros'].describe()[1]
wtg_min = wtg['price_euros'].describe()[3]
wtg_max = wtg['price_euros'].describe()[7]

good_laptops_av = good_laptops['price_euros'].describe()[1]
good_laptops_min = good_laptops['price_euros'].describe()[3]
good_laptops_max = good_laptops['price_euros'].describe()[7]

# Print the answer for the first additional mission:
print("Answer 1:")
print('\n')
print("WTG average price is:", int(wtg_av), "EUR")
print('"Good laptops" average price is:', int(good_laptops_av), "EUR")
(print('"Good laptops" are cheaper by', int(wtg_av) 
       - int(good_laptops_av), "EUR", "on average"))
print('\n')

print("WTG maximum price is:", int(wtg_max), "EUR")
print("'Good laptops' max price is:", int(good_laptops_max), "EUR")

# Additional "if statement" needed, 
# because - surprisingly - there are good laptops more expensive than WTG.
# There is some logical reason for that, for sure..
if (int(wtg_max) - int(good_laptops_max)) > 0:
    (print('"Good laptops" are cheaper by', int(wtg_max) 
                  - int(good_laptops_max), "EUR", "on maximum price range")
    )

if (int(wtg_max) - int(good_laptops_max)) < 0:
    (print('WTG are cheaper by', int(good_laptops_max) 
           - int(wtg_max), "EUR", "on maximum price range")
    )
    
# the exception proves the rule on maximum price range and with penguins - penguins don't fly. 
    
print('\n')

print("WTG minimum price is:", int(wtg_min), "EUR")
print("'Good laptops' minimum price is:", int(good_laptops_min), "EUR")
(print('"Good laptops" are cheaper by', int(wtg_min) 
       - int(good_laptops_min), "EUR", "on minimum price range"))



# 2. What is the best value laptop with a screen size of 15" or more?

pc_15_inch_mask = laptops[laptops['screen_size_inches'] >= 15.0]
pc_15_inch_mask_sorted = pc_15_inch_mask.sort_values('price_euros')
sorted_pc_15 = pc_15_inch_mask_sorted.iloc[0]
top_pc_15plus_name = sorted_pc_15['model_name']

# Print the answer for the second additional mission:

print('==========================================================')
print('\n')
print("Answer 2:")
print('\n')
print('top laptop with a screen size of 15" or more is:', top_pc_15plus_name)
print('\n')
print('full data for best 15":', '\n', '\n', sorted_pc_15)



# 3. Which laptop has the most storage space?

# Building def for removing prefix:
clean_storage = (laptops[laptops['storage']
                         .str.extract('(\d+)').astype(float) < 64]
                )# "< 64" allows to filter just TB storages

clean_storage_max = laptops[laptops["storage"] == clean_storage['storage'].max()]
clean_storage_max_the_one = clean_storage_max.iloc[0]
print('==========================================================')
print('\n')
print('Answer 3:')
print('\n')
print(clean_storage_max_the_one['manufacturer':'model_name'])
print(clean_storage_max_the_one['storage'])

Additional questions to DQ community:

question 1

What’s the difference between Series.str.rsplit and Series.str.split? At one point I tried to use this method. But I was defeated…

I thought that the code:

laptops["screen resolution"] = (laptops["screen"].str.split().str[-1])

is equivalent to:

laptops["screen resolution"] = (laptops["screen"].str.rsplit().str[1])

…but it’s not. Why ? Is the “rsplit” not the reverse method to “split”?

question 2:

Why doesn’t Series.str.split() remove the part that goes to the new column from the old one?

Say I have an example screen value: IPS Panel Retina Display 2560x1600, and I want to move just “2560x1600”. After using this method, the screen column still contains “2560x1600”.

Hi @drill_n_bass

Why didn’t you use the min(), max(), and mean() methods here?
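For reference, a minimal sketch of what that could look like. pandas Series offer min(), max(), and mean() (there is no avg()). The tiny DataFrame below is made-up sample data, not the real laptops dataset, and the variable names are only for illustration.

```python
import pandas as pd

# Made-up sample rows standing in for the cleaned laptops data (assumption)
laptops = pd.DataFrame({
    "manufacturer": ["Apple", "Apple", "Dell", "HP"],
    "price_euros": [1500.0, 2500.0, 800.0, 1000.0],
})

is_apple = laptops["manufacturer"] == "Apple"
apple_prices = laptops.loc[is_apple, "price_euros"]
other_prices = laptops.loc[~is_apple, "price_euros"]

# Direct aggregations instead of positional lookups into describe()
apple_stats = (apple_prices.min(), apple_prices.mean(), apple_prices.max())
other_stats = (other_prices.min(), other_prices.mean(), other_prices.max())

print("Apple  min/mean/max:", apple_stats)
print("Others min/mean/max:", other_stats)
```

Compared with describe()[1], the named methods do not depend on the position of each statistic in the describe() output.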

A1. This article will explain the difference between split and rsplit. Basically, there is no difference unless the maxsplit argument (n in pandas) is provided.

A2. I’m not sure I understood the question correctly. split() only splits a string at a delimiter (whitespace by default; it could also be a character such as ; or ,). It doesn’t remove anything from the original string. You can access the pieces by index, e.g. str[0], from the list it creates after splitting.

I hope this answers your questions.

You got me :D. True, I should ! Thank you.

Thank you I will check it tonight ! :slight_smile:

Hi @drill_n_bass,

Well, I’m not an expert, but now I’m curious why you are so against Apple laptops! :grinning: I’m especially worried because you say it’s obvious to everyone, and I am completely out of the loop on the whole topic :see_no_evil:

Now answering your questions.

Question 1: Series.str.rsplit() indeed starts splitting the strings in a Series from the end, and in this sense it’s a kind of “reverse”. But the order of the resulting list of values is the same as for Series.str.split()! This method is useful only when we want, for example, to limit the number of splits (using the n parameter) and we’re interested only in the last value (or values): to keep only them or, just the opposite, to exclude exactly them. The indexing, however, is still negative if we are counting from the end. In your case, it would be:

laptops["screen resolution"] = laptops["screen"].str.rsplit().str[-1]

which is practically identical to the line where you are using Series.str.split(). And it will always be identical when you don’t use the parameter n. So if you have no need for this parameter, you can simply use Series.str.split() instead of Series.str.rsplit(). Look at the examples on this page to see when Series.str.rsplit() matters and makes a difference (always with that n).
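To make the role of n concrete, here is a small sketch on two made-up screen strings (sample values only, not taken from the dataset):

```python
import pandas as pd

screen = pd.Series(["IPS Panel Retina Display 2560x1600", "Full HD 1920x1080"])

# Without n, both methods produce the same lists in the same order
same_without_n = screen.str.split().tolist() == screen.str.rsplit().tolist()

# With n=1, split cuts once from the left, rsplit cuts once from the right
left_cut = screen.str.split(n=1).tolist()
right_cut = screen.str.rsplit(n=1).tolist()

print(same_without_n)   # True
print(left_cut[0])      # ['IPS', 'Panel Retina Display 2560x1600']
print(right_cut[0])     # ['IPS Panel Retina Display', '2560x1600']
```

This is also why .str.rsplit().str[1] does not return the resolution: index 1 is still the second word counted from the left.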

Question 2. After creating the 'screen resolution' column, you want to remove the resolution values from the 'screen' column, right? Then use this code:

laptops["screen"] = laptops["screen"].str.split().str[:-1].str.join(sep=' ')

Or, if you want to use your favourite str.rsplit() :yum:, here is the equivalent code:

laptops["screen"] = laptops["screen"].str.rsplit(n=1).str[:-1].str.join(sep=' ')

Well, you have almost no USB ports, nor other ports. They (laptops) are expensive as he||. You can’t use other brands’ hardware (cables etc.), because it can destroy those dumb ports that they include in the Mac laptop. They have problems with simple tasks, like creating a Zip archive (or unpacking one that someone sends from MS Windows). Many apps do not work (because it’s not a PC platform). You have to pay for everything you want to have, and pay a lot.

The only thing I miss in this company, and what I would do if I were its boss, is a subscription for bragging about having its devices (why should that be free?!). It should also be paid! When buying, you should sign a contract stating that you are aware that bragging about this equipment is an option. Penalty for breach of contract: 10,000. Monthly subscription for bragging: 500. Let them pay more!!! :japanese_ogre:

Could you tell me what happens behind the scenes with the code you suggested? I can’t understand how it removes the resolution values from the 'screen' column. Neither split() nor join() seems to have this functionality. What’s more, this process took place in the same column: laptops["screen"] = laptops["screen"]…

Merry Christmas ! :partying_face:

Hi @drill_n_bass,

Merry Christmas! :christmas_tree:

From your description, these laptops really look like a nightmare! :frowning: And despite all those problems they are even expensive! Probably I didn’t lose anything by not knowing about them before :smile:

About the code that I suggested to you. Starting from the second part of your question

What’s more, this process took place in the same array " laptops[“screen”] = laptops[“screen”]… "

yes, of course, you can create a new column (called 'screen_without_resolution', for example), instead of re-writing the existing column 'screen', and probably this approach is even more preferable:

laptops['screen_without_resolution'] = laptops['screen'].str.split().str[:-1].str.join(sep=' ')

Now, what this code actually does, step by step.

First, with str.split(), it splits each string of the 'screen' column and returns a list of separate words. For example, instead of IPS Panel Retina Display 2560x1600 you will have [IPS, Panel, Retina, Display, 2560x1600] (I didn’t add the quote marks for the strings here and further, since we are talking about the visualized output).

Second, with adding str[:-1] to our code, all the values from each list will be preserved except for the last one. From the list above, we’ll now have [IPS, Panel, Retina, Display], i.e. everything except for the screen resolution, which was always the last value. For the rows where in the initial 'screen' column we had only the screen resolution, we’ll have now an empty list.

Finally, by adding str.join(sep=' ') to our code, we gather each such list into a string again (just the opposite of what we did in str.split()), using a white space as a separator between words (otherwise they would be attached to each other). Now the example above will look like IPS Panel Retina Display. For the rows where the initial 'screen' column had only the screen resolution, and where at the previous step we had an empty list, we’ll now have an empty string.
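The three steps can be checked one at a time on a couple of sample strings. This is only a sketch: the two values are made up, and the step1/step2/step3 names are just for illustration.

```python
import pandas as pd

screen = pd.Series(["IPS Panel Retina Display 2560x1600", "1366x768"])

step1 = screen.str.split()      # each string -> list of words
step2 = step1.str[:-1]          # keep every word except the last one
step3 = step2.str.join(" ")     # each list of words -> one string again

print(step1.tolist())
print(step2.tolist())
print(step3.tolist())
```

The original Series is never modified by any of these calls. Each step returns a new Series, and a column only changes when the result is assigned back to it.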

Hi @drill_n_bass ,

Your solutions are amazing!

For Question 3, how does < 64 in the line of code below work?

clean_storage = (laptops[laptops['storage']
                         .str.extract('(\d+)').astype(float) < 64]
                )# "< 64" allows to filter just TB storages

I tried your lines of code on my solution:

import re

x = laptops_cleaned['storage'].str.extract('(\d+)').astype(float)

print(x.head())
print('\n')
print(x.tail())

I got this:

       0
0  128.0
1  128.0
2  256.0
3  512.0
4  256.0


          0
1298  128.0
1299  512.0
1300   64.0
1301    1.0
1302  500.0

But when I tried this line of code:

clean_storage = laptops_cleaned[x < 64]

I got this:

I don’t understand why it didn’t work. Could you help me with this?

Thanks in advance!

to Elena_Kosourova:

Oh… So this was the part that was removed. I thought it would just pick the last element. Where can I read more about it? I can’t find documentation for this particular function when I search Google for: python .str[ ]

And this function is kind of counterintuitive. I thought it worked the opposite way (that’s why I couldn’t understand your code).

Wait! This function doesn’t remove the string. It moves the string… When I extracted the screen resolution from the screen column I used:

First code:

laptops["screen resolution"] = (laptops["screen"].str.split().str[-1])

Code above moved screen resolution data to the “screen resolution” column.

So, the only difference between the code above and the code below is the fact that it puts data to the same column:

Second code:

laptops["screen"] = (laptops["screen"].str.split().str[:-1]
                     .str.join(sep=' ')
                    ) 

…so how does it happen that the first code moves this data and the second code removes/deletes it?

This point is at the heart of my misunderstanding.

to shipWship:

could you send me full code? Need more data :slight_smile:

second thing: It’s not the best idea to call this object “x”. Quote:

x = laptops_cleaned['storage'].str.extract('(\d+)').astype(float)

It makes it harder to understand what it’s all about :wink:

Hi @drill_n_bass,

No, wait, I want to clarify here that this part of the code str[:-1] doesn’t remove anything, just the opposite: it keeps all the items of our list composed of strings (previously, before becoming a list of strings in the first part of my code, it was a whole big string), apart from the last item with the index [-1] (which is, in fact, the screen resolution).

Similarly, with str[-1] we keep only the last item of our list of strings and, correspondingly, we ignore all the previous items (let’s say it’s not exactly like removing those items: we just ignore them and don’t keep them :blush:).

Now about this piece of code: str.join(sep=' '). Since at the previous step we ignored the last item and kept all the previous ones, we now have a list of strings without the last item (screen resolution) for each row. Now we want to do the operation opposite to str.split(): to re-combine all the items of the list into one big string, for each row. The items (pieces of strings) have to be separated by a white space, hence sep=' '. It is needed to avoid strings like “goodmorning” instead of “good morning”.

To have a more in-depth understanding about how this whole line of code works:

laptops['screen_without_resolution'] = laptops['screen'].str.split().str[:-1].str.join(sep=' ')

just divide it into my 3 steps and run, first, only the first step:

laptops['screen_without_resolution'] = laptops['screen'].str.split()

and observe the result (of course, the result of the newly created column only, laptops['screen_without_resolution'], since nothing else was changed). Then add the second piece of the code and run this line:

laptops['screen_without_resolution'] = laptops['screen'].str.split().str[:-1]

and see what has been changed in our last column. Finally, add the last 3rd step and run the whole code:

laptops['screen_without_resolution'] = laptops['screen'].str.split().str[:-1].str.join(sep=' ')

At every step, it’s a good idea to compare this new last column with the “old” column laptops['screen'], since we decided to keep it. What is happening after each step with the last column? How is it different with respect to the “old” column?

Hi @drill_n_bass. Great solutions, although I’d like to understand your logic for Q3) Most storage space. In your results, the output you get is 8GB SSD, which seems fairly low.

Hi @drill_n_bass,

You are right with the variable x! :slight_smile:

Here is the full code. The code for Question 3 is at the bottom, under the subtitle “13.4.2 Which laptop has the most storage space?”

Just realised that from some perspective you are right! :smiley: But the problem isn’t the low capacity; it’s that we have another string error here. It should be TB, not GB! I’ve upgraded my code to handle this (added a new cleaning step). I’ve also changed the boolean comparison from 64 to 31 (it is safer in case someone lists a 32 GB drive; that’s almost impossible, but still). I will post my code at the end of this post.

Let’s start with your question:

I tried to recreate your NaN “error”, without success. You surely did something that removed all data from the dataframe, but I couldn’t discover what.
From what I checked, try changing this part of your code:

x = laptops_cleaned['storage'].str.extract('(\d+)').astype(float)

to this one:

x = laptops_cleaned[laptops_cleaned['storage'].str.extract('(\d+)').astype(float) < 31]

so now you will filter laptops_cleaned by itself and do the boolean operation “< 31” in the same step. Note that I changed 64 to 31 (it should be a safer value; I did the same in my code).

Let me know: does it help?
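One possible explanation for the all-NaN result, offered as a guess rather than a diagnosis: in recent pandas versions str.extract() returns a DataFrame by default (expand=True), which would match the column header 0 in the printed output above. Indexing a DataFrame with a boolean DataFrame masks cell by cell after aligning column labels, and since laptops_cleaned has no column named 0, no cell is kept. Passing expand=False returns a Series, which works as a row filter. The sample rows below are made up.

```python
import pandas as pd

# Made-up sample rows; the real dataset has many more columns (assumption)
laptops_cleaned = pd.DataFrame({
    "model_name": ["A", "B", "C"],
    "storage": ["128GB SSD", "1TB HDD", "2TB HDD"],
})

# Default expand=True: one capture group still yields a one-column DataFrame
as_frame = laptops_cleaned["storage"].str.extract(r"(\d+)").astype(float)

# expand=False: one capture group yields a Series, usable as a row mask
as_series = (laptops_cleaned["storage"]
             .str.extract(r"(\d+)", expand=False)
             .astype(float))

small_numbers = laptops_cleaned[as_series < 64]

print(type(as_frame).__name__, type(as_series).__name__)
print(small_numbers)
```

Older pandas versions returned a Series here without expand=False, which might be why the same line behaves differently in different environments.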

Below, off-topic, about an earlier part of your code: 13.4.1 Are laptops made by Apple more expensive than those made by other manufacturers?

You should re-examine the whole code from this part (it has a lot of loops).
There are easier approaches than the one you wrote:

mean_price = {}
mean_price_list = []

for um in unique_manufacturer:
    mean_price[um] = mean_price_manufacturer(um)    # This function was written above 

# print(mean_price)

for key, val in mean_price.items():
    val_key = (val, key)
    mean_price_list.append(val_key)

# print(mean_price_list)

mean_price_sorted = sorted(mean_price_list, reverse=True)
# print(mean_price_sorted)

for i in mean_price_sorted:
    print(i[1], ':   ', i[0])

I think that you should check out the pandas methods and approaches to solve the additional mission that you did above. Your code works, but you didn’t use anything from the pandas part of our learning. Another reason why I think you should do it this way: your code is very complex and could be simpler. Anyway, it’s your decision. Either leave it as it is or check what could be used from the lessons. To make it easier you can analyze my code as a “focal point” for this learning, or ask the “elders” :wink:
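As a rough illustration of the pandas route, the loops above could be replaced by a single groupby chain. The sample rows are made up, and the column names manufacturer and price_euros are taken from my code above.

```python
import pandas as pd

# Made-up sample rows standing in for the cleaned laptops data (assumption)
laptops = pd.DataFrame({
    "manufacturer": ["Apple", "Dell", "Apple", "HP", "Dell"],
    "price_euros": [1500.0, 800.0, 2500.0, 1000.0, 1000.0],
})

# Mean price per manufacturer, most expensive first
mean_price_sorted = (
    laptops.groupby("manufacturer")["price_euros"]
    .mean()
    .sort_values(ascending=False)
)

print(mean_price_sorted)
```

The result is a Series indexed by manufacturer, so no dictionary, tuple list, or manual sort is needed.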

My last two cents about this part: it reminds me of when I tried to solve a lot of missions in a hurry in the past. I don’t know if that’s the case here, but I had issues with planning what I wanted to write. I just wanted to do it instantly, right after reading the mission’s tasks. But, like with many things in life, a good plan is 90% of success. If there are flaws in this “part”, everything later will be difficult or impossible to do. So it’s good not to underestimate planning and contemplating how to do things in the most effortless, fastest and simplest way! :slight_smile: Only after that should the code be written.

My new updated code related to third additional mission:

# 3. Which laptop has the most storage space?

# Because the answer (for the third question - posted later) 
# is "8GB SSD" and there is no drive that is SSD and has 8GB 
# (no one produces such a device), the most probable explanation 
# is another string error, where 'GB' was typed instead of 'TB'. 
# There is a need to clean this data further. ATM (end of 2020) 
# the biggest TB SSD drive is 8 TB. There is no HDD drive 
# lower than 128 GB, so if something has "int < 31" 
# it's TB for 99% chances.

# \b keeps the pattern from matching inside larger sizes:
# a plain replace of '8GB' would also turn '128GB' into '128TB'.
laptops['storage'] = (laptops['storage']
                          .str.replace(r'\b([1248])GB', r'\1TB', regex=True)
                         )


# Building def for removing prefix:
clean_storage = (laptops[laptops['storage']
                         .str.extract('(\d+)').astype(float) < 31]
                )# "< 31" allows to filter just TB storages



clean_storage_max = (laptops[laptops["storage"] 
                             == clean_storage['storage'].max()])
clean_storage_max_the_one = clean_storage_max.iloc[0]




print('==========================================================')
print('\n')
print('Answer 3:')
print('\n')
print(clean_storage_max_the_one['manufacturer':'model_name'])
print(clean_storage_max_the_one['storage'])

Hi @drill_n_bass,

I tried the new line of code and checked several lines of code before, but I still got NaN. I think I shall check my whole solution.

Thanks for your advice! I will redo the mission to get a better learning outcome!

Wish you a Happy New Year! :slight_smile:

Try to run all cells. Maybe this is the reason.

Hi @drill_n_bass, I agree that there might be some faulty data lurking around, which needs cleaning. Great solution, although I couldn’t wrap my head around how max() works with clean_storage['storage'], since its datatype is object and not numeric.
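A small sketch of what max() does on a column of strings: the values are compared character by character, like words in a dictionary, not as numbers. The sample values are made up, but this may be why “8GB SSD” came out on top.

```python
import pandas as pd

storage = pd.Series(["512GB SSD", "1TB HDD", "8GB SSD", "256GB SSD"])

# Strings are compared character by character, so a value starting
# with "8" sorts after values starting with "5", "2" or "1"
largest_as_text = storage.max()

print(largest_as_text)   # 8GB SSD
```

So max() does run on an object column, but the “largest” string is not necessarily the largest drive.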

It doesn’t because the code starts with:

laptops['screen_without_resolution'] = laptops['screen']

and in original code it was:

laptops["screen"] = laptops["screen"]..........

Thank you for your guidance Elena! :):):slight_smile:

I have two additional questions:

  1. Is it the only way to remove part of a string? Or is there some other approach, with something like (pseudocode) .str.del[ ] or str.delete[ ], etc.?
  2. In my filtering, when I removed the prefix in laptops[‘storage’] I was using:
# Building def for removing prefix:
clean_storage = (laptops[laptops['storage']
                        .str.extract('(\d+)').astype(float) < 31]
                )# "< 31" allows to filter just TB storages

In this code I used .extract('(\d+)'). It works great, but I would be very happy to understand what is going on inside the brackets: what each part of the pattern '(\d+)' means.

Hi @drill_n_bass,

and in original code it was:

No-no, forget completely about the code starting with

laptops["screen"] = laptops["screen"]..........

since we decided to keep the column laptops['screen'] as it is. Now, we are talking only about the one starting with

laptops['screen_without_resolution'] = laptops['screen']

I meant comparing not the old code with the new one, but the new column laptops['screen_without_resolution'] (in the new code, of course) with the “original” column laptops['screen']. And these 2 columns should be different at each of my 3 steps :slightly_smiling_face:

To remove a part of the string in a column of your dataframe, you can also use str.replace() method, with the following syntax:

df['new_column'] = df['initial_column'].str.replace(pattern_to_replace, "")
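For the screen column, that could look roughly like the sketch below. The pattern is my assumption about how the resolution is written (digits, an x, digits, at the end of the string), and the sample values are made up. regex=True is passed explicitly because the default for this parameter differs between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"screen": ["IPS Panel Retina Display 2560x1600", "1366x768"]})

# Remove a trailing "<digits>x<digits>" block plus any whitespace before it
df["screen_without_resolution"] = df["screen"].str.replace(
    r"\s*\d+x\d+$", "", regex=True
)

print(df["screen_without_resolution"].tolist())
```

As with the split/join approach, the original 'screen' column stays untouched unless the result is assigned back to it.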

As for extracting the pattern '(\d+)' in laptops['storage'], what we are doing here:

  • Quote marks are used here to define a pattern to be extracted (this pattern is also called a regex, or regular expression, which you will learn in this mission). Such patterns are always of string type, hence we are using single or double quote marks around them.
  • Round brackets mean that our pattern is a so-called capture group. Capture groups capture the text matched by the pattern inside them into a numbered group that can be reused with a numbered backreference. We are not going to reuse the group here, but Series.str.extract() needs at least one capture group to know which part of the match to return, so the round brackets cannot be omitted in this call (without them pandas raises a ValueError). You will encounter capture groups later, in this mission, where you will learn more about their usage. But I hope that my explanation here is also more or less clear :blush:
  • \d is one of the standard regex character classes, used to find a digit in the string, i.e. any digit from 0 to 9.
  • + in this pattern (and in regular expressions in general) means that we are looking for one or more occurrences of the pattern that was placed just before that + (in our case, it is \d).
  • Hence with the whole pattern \d+ we are searching for a digit or several digits (going one after another, without other symbols in between) to be extracted from the string. Once again, you will learn all this and much more about regular expressions when you arrive at the above-mentioned missions.
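The points above can be tried out on a single made-up string, first with Python's built-in re module and then with str.extract(). This is only a sketch and the variable names are for illustration.

```python
import re
import pandas as pd

text = "256GB SSD + 1TB HDD"

first_run = re.search(r"\d+", text).group()   # first run of digits
all_runs = re.findall(r"\d+", text)           # every run of digits

storage = pd.Series([text, "1TB HDD"])
# The round brackets mark the part of the match that extract() returns;
# extract() only looks at the first match in each string
extracted = storage.str.extract(r"(\d+)", expand=False)

print(first_run)            # 256
print(all_runs)             # ['256', '1']
print(extracted.tolist())   # ['256', '1']
```

Note that the extracted values are still strings, which is why the original code adds .astype(float) before comparing them with a number.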

Happy New Year! :santa:

Thank you for your explanation and effort. :relaxed:
Thanks to your help I feel that I understand all that happens and because of that I don’t feel burned out :exploding_head:. So there is more dedication to going further.

Happy new year ! :partying_face:

Why did you not opt to convert your TB values to GB, as 1TB = 1000GB, and then proceed to compare/find the max value?
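That idea could look roughly like the sketch below. It assumes storage strings such as "512GB SSD" or "256GB SSD + 2TB HDD" (the "+" format for two drives is an assumption), and storage_to_gb is a hypothetical helper name, not something from the mission.

```python
import re
import pandas as pd

def storage_to_gb(text):
    """Sum every '<number>GB' or '<number>TB' found in text, in GB (1 TB = 1000 GB)."""
    total = 0.0
    for amount, unit in re.findall(r"(\d+(?:\.\d+)?)\s*(GB|TB)", text):
        total += float(amount) * (1000 if unit == "TB" else 1)
    return total

# Made-up sample rows standing in for the cleaned laptops data (assumption)
laptops = pd.DataFrame({
    "model_name": ["A", "B", "C"],
    "storage": ["512GB SSD", "1TB HDD", "256GB SSD + 2TB HDD"],
})

laptops["storage_gb"] = laptops["storage"].apply(storage_to_gb)

# Row with the largest total capacity
biggest = laptops.loc[laptops["storage_gb"].idxmax()]

print(biggest["model_name"], biggest["storage_gb"])
```

With a numeric storage_gb column the comparison is done on numbers, so there is no need for a threshold like < 64 or < 31 to separate TB from GB values.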
