Hi. I just finished my third guided project. It took 7-10 days. I’ve done all the tasks I was told to do. Sometimes I did more than I was asked (because I felt that some things should be investigated further or analyzed better). Any feedback is welcome!
I’d like to thank all the members of DQ who helped me with this project. Special thanks to Elena Kosourova.
“Identify categorical data that uses german words, translate them and map the values to their English counterparts” with “See if there are particular keywords in the name column that you can extract as new columns” in one block.
pps. The version of the project posted here is the latest one. Everything discussed below has already been updated, so the version in this post is the final version.
Thanks for sharing your project! And I also noticed that you mentioned me in it, thanks a lot!!! I’m glad that my suggestions were useful!
Your project looks very nice: well-structured, well-commented where necessary, with a thorough analysis and interesting insights. Also, good emphasis on the main ideas and intermediate conclusions (using italic and bold). And finally, it’s not only me who thinks that it’s not so strange to have free cars on eBay!
Here are some suggestions from my side, hope they will be interesting:
While you wrote intermediate conclusions after different sections of your project, it’s still good practice to also write a general conclusion at the end of the project, with the most important results.
Please add a link to the original dataset in the introduction (or to the modified version, since you mentioned that the original is not available anymore).
A good idea is to use backticks when mentioning column names in markdown cells. Then they will be more eye-catching.
I would omit the technical details in markdown cells that describe different methods and how they work (for example, those right before the code cells , , and ). The reader can always find them in the documentation (hopefully in the latest edition). Alternatively, you can add these technical details as short comments in the code cells.
Now about code comments. Your code is rather well-commented; I would only place each comment above the code lines it describes (i.e., not at the end of the line). For example, in the code cells , , and some others. That said, if a comment is really short (1-3 words), it’s also fine to put it at the end of the line.
It’s better to remove all the commented-out code from the project (since you don’t use it anymore), like in the code cell  .
The code cells -: you can use print('\n') to visually separate all the printed outputs.
About between() (it seems that you had doubts about it in the code cell ): by definition, it includes both limits.
I noticed you tend to use (print()) instead of print(). It’s not an issue, but the outer parentheses are redundant, so it’s better to remove them.
The code cell : I wouldn’t convert datetime data into int.
The code cell : here it’s better to use unique() instead of autos to check “suspicious” columns which potentially can contain German words.
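A minimal sketch of checking columns with unique(), with the printed outputs separated by print('\n') as suggested above. The sample dataframe, the column names, and the list of “suspicious” columns are made up for illustration and are not taken from the project:

```python
import pandas as pd

# Hypothetical sample; the real dataframe has many more rows and columns
autos = pd.DataFrame({
    "gearbox": ["manuell", "automatik", "manuell", None],
    "fuel_type": ["benzin", "diesel", "benzin", "elektro"],
})

# Columns assumed to be "suspicious" for this illustration
suspicious = ["gearbox", "fuel_type"]

unique_values = {}
for col in suspicious:
    # unique() lists each distinct value once, in order of first appearance
    unique_values[col] = autos[col].dropna().unique().tolist()
    print(col, unique_values[col])
    print('\n')  # blank lines to visually separate the outputs
```

Looking at the distinct values like this makes leftover German words easy to spot, because every category shows up exactly once.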
That’s all from my side. Good job indeed!
Happy learning and good luck with your further projects!
I will update everything you mentioned in my initial post. I appreciate your help, thank you!
I don’t understand what you meant here:
Yes, I sometimes did, so that the lines aren’t too long (according to PEP 8, I should break a long line into several lines). Quote from this site: “When you’re using line continuations to keep lines to under 79 characters, it is useful to use indentation to improve readability. It allows the reader to distinguish between two lines of code and a single line of code that spans two lines. There are two styles of indentation you can use.”
You are right. I missed some German words! Now it’s perfect (I had to update a lot to get it right).
About print(). The idea to divide your code into several lines for better readability is great, I also do so. However, in the case of print(), you can still divide your code into several lines without needing the outer parentheses. I mean, for example, this code from the code cell , with the outer parentheses:
Of course, it’s not an error, nor is it a big deal here whether or not you use the outer parentheses. Anyway, it’s always a good idea to reduce unnecessary elements when writing any code.
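To illustrate the point about the outer parentheses, here is a made-up snippet (it is not the code from the project, and the variable names and values are hypothetical). Both calls print exactly the same line, because the call’s own parentheses already allow the arguments to continue on the next line:

```python
column = "price"   # hypothetical column name
n_missing = 0      # hypothetical value

# With the redundant outer parentheses
(print("Missing values in", column,
       "column:", n_missing))

# Without them: same output, one pair of parentheses fewer
print("Missing values in", column,
      "column:", n_missing)
```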
Now about between(). In the code cell  you could directly use between(0,10000000) to preserve the values within these limits (i.e., the values from 0 to 10000000 will be preserved, while all the values outside this range will be excluded). By the way, now that I’m thinking about it, were you perhaps going to exclude the car with the price 99999999$? Then the piece of code above should be between(0,99999998), i.e., the price of 99999999$ will be excluded in this case.
The range was set so that the maximum limit is 10,000,000 USD. I believe that’s a possible price for a car… wait! Update: almost 19,000,000 USD. OK, I will raise it to 20,000,000 USD. Quote: " A new Bugatti costs from 1.7 million USD for the cheapest model, a Bugatti Veyron, to upwards of 18.7 million USD for a Bugatti La Voiture Noire , the current most expensive model on the market." Source link
Ah, the value 10000000 is included, ok. But then you can still use between(0,10000000) anyway, because both 0 and 10000000 will be included in the range to keep (the between() function has a parameter inclusive that by default is True, meaning that both lower and upper limits are included in the range, unless you decide to use inclusive=False). I just mean there is no need to subtract 1 from 0 or add 1 to 10000000.
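A small sketch of this behaviour, assuming a made-up price series (not the real data). It relies only on the default of between(), since in newer pandas versions the inclusive parameter takes strings such as "both" or "neither" rather than True/False:

```python
import pandas as pd

# Hypothetical prices, not taken from the actual dataset
prices = pd.Series([0, 500, 10_000_000, 99_999_999])

# Default behaviour: both the lower and the upper limit belong to the range
mask = prices.between(0, 10_000_000)
kept = prices[mask].tolist()
print(kept)
```

Both 0 and 10000000 survive the filter, so there is no need to widen the limits by 1 on either side.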
Wait, but if you want to preserve both 0 and 10000000 in the dataframe, then you don’t need this between() at all, since these values are already the minimum and maximum of the price column! I only realized it now.
I mean, of course you need to remove outliers, only that in this case you decided not to consider the anomalous values as outliers. Usually people consider exactly these values as outliers for this column: 0 (which I don’t agree with) and 10000000 (which I also have doubts about). Let’s say there are no other anomalous values. You can check it by running autos['price'].max() and autos['price'].min() (before running between(), of course).
Anyway, it’s quite ok not to remove these “outliers” in this case, since they can easily be real values.
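The check mentioned above could look roughly like this. The tiny dataframe is invented for the example; only the column name price and the two limits come from the discussion:

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe
autos = pd.DataFrame({"price": [0, 350, 12_500, 10_000_000]})

lowest = autos["price"].min()
highest = autos["price"].max()
print(lowest, highest)

# If the limits passed to between() equal the column's min and max,
# every row passes the filter and nothing is removed
in_range = autos["price"].between(lowest, highest)
print(in_range.all())
```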
OK, now I know why I couldn’t understand you. I hadn’t checked what the “highest outlier” was. I assumed, without any empirical check, that there would be some outliers like 100,000,000$ or even higher…
Anyway, this project is finally finished. It felt like a “never-ending story”.
I updated it in my first post (in case someone needs it).
I really like your project! I’ve just finished the same one, and it was cool to see how you did the additional tasks. I’ve done them a little bit differently; maybe you would be interested in seeing another approach to the same problem.