Hi,
While working through the data cleaning basics, I got stuck on a question that I don't know how to answer.
One of the additional questions asks for the laptop with the highest storage capacity, but I can't find a way to sort values that contain both numbers and letters.
Screen Link:
https://app.dataquest.io/m/293/data-cleaning-basics/13/next-steps
Hello, it seems you provided the wrong link. Kindly update it.
Hi,
What you have to do is get rid of the letters (use .str[-2:] to select the last two characters) and keep only the numbers. The challenge is that although most values are in GB, some are in TB.
What I did was replace 'GB' with '' and convert to float the rows where the storage space is measured in GB. For the rows in TB, I strip 'TB' the same way and additionally multiply by 1024.
# Flag the rows by unit before modifying the column
is_gb = laptops['storage_space'].str[-2:] == 'GB'
is_tb = laptops['storage_space'].str[-2:] == 'TB'
# GB rows: strip the unit and convert to float
laptops.loc[is_gb, 'storage_space'] = laptops.loc[is_gb, 'storage_space'].str.replace('GB', '').astype(float)
# TB rows: strip the unit, convert to float, and scale to GB
laptops.loc[is_tb, 'storage_space'] = laptops.loc[is_tb, 'storage_space'].str.replace('TB', '').astype(float) * 1024
laptop_max_storage = laptops[laptops['storage_space'] == laptops['storage_space'].max()]
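The same idea can also be sketched with a single regular expression that separates the number from the unit. This is only an illustration: the sample rows and the storage_gb column name are made up here, and the real dataset may format its values differently.

```python
import pandas as pd

# Hypothetical sample data, not the real laptops dataset.
laptops = pd.DataFrame({
    "model_name": ["A", "B", "C", "D"],
    "storage_space": ["256GB", "1TB", "512GB", "2TB"],
})

# Split each value into its numeric part and its unit (named groups become columns).
parts = laptops["storage_space"].str.extract(
    r"(?P<size>\d+(?:\.\d+)?)\s*(?P<unit>GB|TB)"
)

# Convert everything to GB, using 1 TB = 1024 GB as in the post above.
multiplier = parts["unit"].map({"GB": 1, "TB": 1024})
laptops["storage_gb"] = parts["size"].astype(float) * multiplier

# Keep the row(s) with the largest storage.
laptop_max_storage = laptops[laptops["storage_gb"] == laptops["storage_gb"].max()]
print(laptop_max_storage)
```

Writing the result to a new column keeps the original strings intact, which makes it easier to check the conversion by eye.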
good idea 
Here is another approach: I filter out every value of 64 or higher. There are no drives smaller than 64 GB at the moment, so anything below 64 must be in TB.
By the way, the threshold can be set even lower if an error occurs (for example, if some older laptop with 32 GB appears).
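# 3. Which laptop has the most storage space?
# Extract the first number from each storage string.
# expand=False returns a Series, so the comparison below gives a row mask.
storage_number = laptops['storage'].str.extract(r'(\d+)', expand=False).astype(float)
# "< 64" keeps only the rows whose storage is measured in TB
clean_storage = laptops[storage_number < 64]
# Note: max() compares the storage strings here, not numeric values
clean_storage_max = laptops[laptops["storage"] == clean_storage['storage'].max()]
clean_storage_max_the_one = clean_storage_max.iloc[0]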
print('\n')
print('Answer 3:')
print('\n')
print(clean_storage_max_the_one['manufacturer':'model_name'])
print(clean_storage_max_the_one['storage'])
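A variation on the threshold idea, as a rough sketch only: instead of comparing the strings, treat any number below the threshold as TB and scale it to GB so the comparison is numeric. The sample rows and the TB_THRESHOLD name are assumptions for illustration, and the cutoff of 64 is the heuristic from the post, not a property of the data.

```python
import pandas as pd

# Hypothetical sample data, not the real laptops dataset.
laptops = pd.DataFrame({
    "model_name": ["A", "B", "C"],
    "storage": ["128GB SSD", "1TB HDD", "2TB HDD"],
})

TB_THRESHOLD = 64  # assumption: any number below this is a TB value

# First number in each storage string, as a float Series.
number = laptops["storage"].str.extract(r"(\d+)", expand=False).astype(float)

# Keep values at or above the threshold as GB; scale the rest from TB to GB.
storage_gb = number.where(number >= TB_THRESHOLD, number * 1024)

# Row with the largest storage in GB.
largest = laptops.loc[storage_gb.idxmax()]
print(largest["model_name"], largest["storage"])
```

This only looks at the first number in each string, so a value that lists two drives would be judged by the first drive alone.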