
GP- Predicting House Sale Prices. Some Issues! (Though I have completed it)

Hi guys. I have just completed the ML project. I want to go really far with this, but I have some confusions. Please check my work and provide your feedback.
Here are the confusions I have:

  1. Should we scale the features? In the Guided Project’s screens, we aren’t told to do so.
  2. How do I know if my predicted RMSE is good? Are there any standards for judging that?
  3. My CV RMSEs (in the last part) are coming out weird. Out of 20 n_splits, a few of them are extremely large. What does that mean?
  4. What could be done to improve the results? (Like trying out Lasso and Ridge)
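For questions 2 and 3, one way to judge the RMSE and spot the "weird" folds is to compute the per-fold RMSEs yourself and compare them to the scale of the target. This is a minimal sketch on a synthetic dataset from `make_regression`, not the project's actual Ames data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the house-price features and target
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=1)

# 20 shuffled folds, mirroring the n_splits=20 setup in the question
kf = KFold(n_splits=20, shuffle=True, random_state=1)
neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring='neg_mean_squared_error', cv=kf)
rmses = np.sqrt(-neg_mse)

# A few extremely large fold RMSEs usually point to outliers landing in
# tiny folds; inspect the spread (min/max), not just the mean.
print(rmses.mean(), rmses.min(), rmses.max())
```

Comparing the mean RMSE to the spread of `y` (for example its standard deviation) gives a rough sense of whether the error is "good" for this data.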

Predicting House Sale Data.ipynb (682.2 KB)

Machine-Learning/Predicting House Sale Data.ipynb at main · letdatado/Machine-Learning · GitHub



rizvey.ma, hi!
My little notes:

  1. MAE and RMSE are good metrics for finding the optimal values of different slopes or hyperparameters for Ridge or Lasso regression.
    But after predicting y_pred on x_test, everyone doing linear regression must also check the R2 metric - on the one hand you can have great MAE or RMSE values, but on the other you can have an unstable, non-converging model with a negative R2.
  2. To check for collinearity among the features themselves, your first necessary step is to exclude the y values (in our case SalePrice), and after that plot the correlations only between the remaining features, not the value we want to predict.
  3. Before plotting, it would be nice to bring all the features to the same scale - read here:
    6.3. Preprocessing data — scikit-learn 0.24.2 documentation
    Then the correlation plot may have different colors and values.
    In general, Ames is a bad and hard choice for linear regression.
    P.S. If you use Ridge or Lasso regression, use only StandardScaler…
    Simple OLS linear regression is invariant to scaling.

P.P.S. Scaling may decrease collinearity between the features - but the model may still not converge.
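The "StandardScaler before Ridge/Lasso" advice above is commonly done with a pipeline, so that the scaler is fit only inside each training fold. A minimal sketch, with synthetic `make_regression` data standing in for the Ames features:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the house-price data
X, y = make_regression(n_samples=200, n_features=10, noise=20.0, random_state=0)

# Chaining StandardScaler with Ridge keeps the scaling inside each CV fold,
# so the L2 penalty treats every feature on the same footing.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(pipe, X, y, scoring='neg_mean_squared_error', cv=5)
rmse = np.sqrt(-scores).mean()
print(rmse)
```

Plain OLS would give the same predictions with or without the scaler, which is what "invariant to scaling" means here; the scaler only matters once a penalty term is added.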


Hi Vadim Maklakov!
Thank you very much for sharing your notes. I am going to read them thoroughly and update my project.
I don’t understand why they didn’t ask us to scale the data in the guided screens.
Do you think it might be a mistake?

regards,
Ali


ML and DL are very complicated subjects. I think the learning modules take only the simple essentials for a general presentation. See here: https://machinelearningmastery.com/ - and especially pay attention to the "further reading" book notes.

When you start studying module 8 “Deep Learning”, you will feel like the hero of the movie “Enemy at the Gates” - “You must get a rifle in battle!” (a fable of American agitprop) - and module 7 “Machine Learning” will seem easy and understandable. Get ready for it)))

Alright, Vadim! I do keep taking help from that website, but I never gave a thought to the further reading.
I am definitely going to read it.
Kindly help me a bit more. Which readings would you recommend I go for?
You can understand that I am not good at understanding complicated stuff for now.

But I am working towards it

Hahaha! I tried doing the Deep Learning course.
But I felt that it’s important to be really good at “not-so-deep learning” before moving on to “deep learning”.
Do you think it’s correct to work with ML before going into DL?
Or could one proceed to DL with a mediocre level of ML?
This has been quite a big confusion for me.

The classical right learning path is from simple to complex.
I think that before learning DL you must have learned ML.
In fact, DL is a set of bricks from ML, from which you build a regression, classification, or clustering model. And knowledge of the principles behind those ML bricks is very necessary.
It’s a pity that no one has yet written a DL analog of “Learning Python” by Mark Lutz, 5E. The only similarly good ML/DL book for beginners:
“Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow
SECOND EDITION
Concepts, Tools, and Techniques to Build Intelligent Systems” by Aurélien Géron
Also good resources with quality content:
https://machinelearningmastery.com/ - very many links for further reading
https://www.analyticsvidhya.com/ - good descriptions of base concepts with samples…
On our platform, most of the content is in the style of “It is better to be beautiful and healthy than poor and sick” and “It is better to drink vodka than to fight”.
The training modules themselves, beginning from Stats, are very stingy in content - I can say they are nearly empty modules.

Hey Vadim! Sorry about getting back to you late.
I follow both websites for quick references. I have also read a couple of books on ML.
The good thing about DataQuest is that it literally helped me develop coding habits. I don’t know how good or bad they are, but still. Reading raw literature helped me learn concepts quickly, but I forgot them even quicker :joy:
And the best thing about DQ is the presence of extremely helpful people like you who help newbies like me a lot. For these two things, I am really grateful to DQ.

Apart from that, I found DQ jumping around concepts without giving them proper attention. I have been doing a beginner-level Kaggle challenge and am amazed to see the work people are doing there; unfortunately, none of it was touched on in DQ’s ML modules. As you rightly noted, lots of attention was given to styling and aesthetics.
ML’s guided projects were not guided till the end and this disappointed me to some extent.

But Vadim, if we look at the purpose behind what DQ calls the Data Scientist’s Path, it is to help people get nice introductory-level expertise rather than anything above that.

Just my thoughts.

BTW, Vadim, I’d love to connect/follow you on linkedin, if you don’t mind.

my best regards,
Ali

P.S: I have just ordered the book “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, SECOND EDITION: Concepts, Tools, and Techniques to Build Intelligent Systems” by Aurélien Géron that you recommended. Thanks again!


Vadim, what is your take on putting effort into Kaggle?

What I am seeing is that the data they provide is super clean! So there is rarely any effort needed for data cleaning, which is a very important skill in the real world. Feel free to correct me.

However, I think the competitions are very good for learning the optimization game. Do you agree? Do you recommend I put effort there?

Another thing that I heard about Kaggling is that it is a very good platform to practice deep learning.

Again, I request your input on these ideas. Thank you !

Ali, hi!
I looked at Kaggle - IMHO, 90% of the projects are of the same level as ours on DQ. It makes sense to use Kaggle if you want to build your PR in the DS community. I do not have such a goal; the main thing for me is to get an understanding of the principles and practical skills.

I do not see anything wrong with data cleaning, since it prepares you for real life: reading DS reviews on the Internet, you can see that ML/DL really takes 10% of the time, and all the rest is data cleaning and preparation.

Regarding books - there is a resource, http://libgen.li/, where you can find IT books for free. Do not rush to run headlong and order the book - the book you ordered is definitely there. By the way, if you want to get a conceptual understanding of ML/DL, I recommend the book by the creator of Keras, “Deep Learning with Python, Second Edition” by François Chollet (ISBN-13: 978-1617296864, ISBN-10: 1617296864). It’s better to start reading it from the beginning, at least Chapter 1, to understand how ML differs from DL; it is very easily written, and you can download it from the link above - don’t waste your money :grinning:

It is an excellent book, IMHO, but you need to have two or three more books of the same level at hand, no more, so that you see different points of view; each book describes nuances that are not found in the others. This is normal practice - no one can describe a whole subject area in one book.
A lot of books on the same topic are also not needed, otherwise you will get confused.
Now the main problem is finding a good book amid the dominance of a huge number written in the “bull ■■■■” style - there are really no more than ten good, fundamental books in each of ML and DL.

Ali, hi again!
Regarding DL - it seems to me that “Deep Learning with Python, Second Edition” by François Chollet is the more suitable one for beginners, since it is written in simple language, explaining the principles in plain terms - in a word, a real Stalinist textbook where complex things are presented in simple and accessible language. Nowadays such books are rare. And after “Deep Learning with Python, 2E” by François Chollet, the “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2E: Concepts, Tools, and Techniques to Build Intelligent Systems” by Aurélien Géron will be much easier to read. A little note: “E” means Edition.

Hi Vadim!!
So excited to know about these books. I put them on download even before replying to you, haha.

I understand. And I think the same goes for the “courses” offered on these subjects. ML/DL have shown up in an era where, once things go viral, it’s impossible to stop. Sometimes it seems like every other person is either making an online course or publishing books on these topics.

So far, I have been learning basic ML from two sources. One is DataQuest and the other is a book called “Introduction to Machine Learning with Python: A Guide for Data Scientists” by Andreas C. Müller and Sarah Guido.
What is your take on this book? I am sure you must know about it.

So, should “Deep Learning with Python, 2nd Edition” by François Chollet and “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2E: Concepts, Tools, and Techniques to Build Intelligent Systems” by Aurélien Géron be the best ones for me to begin ML/DL with?

One more thing, Vadim: I don’t understand what the right way is to learn these and get the best out of these books. I forget stuff quickly :expressionless: if I just read the books.

Secondly, I just wrote some code. I will really appreciate it if you could provide your feedback, if it doesn’t take much of your time :slight_smile:
I request you to go through it whenever it is convenient for you.

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error


def apply_grid(df, model, features, target, params, test=False):
    '''
    Performs GridSearchCV after re-splitting the dataset, provides a
    comparison between the train MSE and validation MSE to check for
    generalization and, optionally, deploys the best found parameters
    on the test set as well.

    Args:
        df: DataFrame
        model: model class to use
        features: features to consider
        target: name of the label column
        params: param_grid for optimization
        test: False by default; if True, predicts on the test set

    Returns:
        A slice of cv_results_ to compare the model's generalization
        performance (mean train vs. mean validation scores)
    '''
    my_model = model()

    # Split the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], random_state=0)

    # Re-split the train set for GridSearchCV into train2 and valid,
    # keeping the test set separate
    X_train2, X_valid, y_train2, y_valid = train_test_split(
        X_train, y_train, random_state=0)

    # Use grid search to find the best parameters from the param grid
    grid = GridSearchCV(estimator=my_model, param_grid=params, cv=3,
                        return_train_score=True,
                        scoring='neg_mean_squared_error')
    grid.fit(X_train2, y_train2)

    # Evaluate on the validation set
    scores = grid.score(X_valid, y_valid)
    print('Best MSE through GridSearchCV: ', grid.best_score_)  # CONFUSION
    print('Best MSE through GridSearchCV: ', scores)
    print('I AM CONFUSED ABOUT THESE TWO OUTPUTS ABOVE. WHY ARE THEY DIFFERENT')
    print('Best Parameters: ', grid.best_params_)
    print('-' * 120)
    print('mean_test_score is rather mean_valid_score')
    report = pd.DataFrame(grid.cv_results_)

    # If test is True, deploy the best_params_ on the test set
    if test:
        my_model = model(**grid.best_params_)
        my_model.fit(X_train, y_train)

        predictions = my_model.predict(X_test)
        mse = mean_squared_error(y_test, predictions)
        print('TEST MSE with the best params: ', mse)
        print('-' * 120)

    return report[['mean_train_score', 'mean_test_score']]
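On the two differing outputs flagged in the function above: `grid.best_score_` is the mean cross-validated score of the best parameter combination, measured on folds of the data passed to `fit`, while `grid.score(X_valid, y_valid)` evaluates the refit best estimator on the separate validation set - so the two numbers come from different data and generally differ. A minimal sketch, using a synthetic dataset and Ridge as placeholders rather than the project's actual setup:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_fit, X_valid, y_fit, y_valid = train_test_split(X, y, random_state=0)

grid = GridSearchCV(Ridge(), {'alpha': [0.1, 1.0, 10.0]}, cv=3,
                    scoring='neg_mean_squared_error')
grid.fit(X_fit, y_fit)

# Mean CV score of the best params, computed on folds of X_fit
cv_score = grid.best_score_
# Score of the refit best estimator on the held-out validation set
valid_score = grid.score(X_valid, y_valid)
print(cv_score, valid_score)
```

Both values are negative because `neg_mean_squared_error` is a score (higher is better), i.e. the MSE with its sign flipped.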

ML and DL are slightly different subjects.
In my opinion, it makes sense to start learning DL only after studying the fundamental concepts of ML.
The code above, as I understand it, is from the bike prediction project.
I don’t understand why you split the train set again - the idea of cross-validation is that the entire dataset is iterated over several times and the best possible result is evaluated, which is then compared with what we got.
When we randomly shuffle the whole dataset several times and split it in the planned proportion, we determine the hyperparameters we need.
In my opinion, in this project I do not see the point in writing separate functions, since in fact there is no code reuse, and it will take a lot of time to debug the functions and write the internal function logic for various types of models.
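The cross-validation idea described here - shuffle and re-split the whole training set several times instead of carving out one fixed extra validation split - can be handed to GridSearchCV directly via a shuffled KFold. A minimal sketch on synthetic data (Ridge and the alpha grid are illustrative placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic stand-in for the real data
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# Shuffled K-fold CV inside GridSearchCV iterates over the whole train set
# several times, so no separate manual validation split is needed.
cv = KFold(n_splits=5, shuffle=True, random_state=2)
grid = GridSearchCV(Ridge(), {'alpha': [0.01, 0.1, 1.0, 10.0]},
                    cv=cv, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)

best_alpha = grid.best_params_['alpha']
# The refit best model is then checked once on the held-out test set
test_score = grid.score(X_test, y_test)
print(best_alpha, test_score)
```

This keeps the test set untouched until the very end, which is the point Vadim is making about not re-splitting the training data by hand.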
Books:

  1. Aurélien Géron - a more practical book for a reader with some existing knowledge. The basics of how to write a pipeline and a primer on ML can be found at https://machinelearningmastery.com/. Be careful: part of the code there is written in Python 2.
  2. François Chollet - the best book to begin learning DL - simple and accessible; concepts are described from simple to complex.
  3. “Introduction to Machine Learning with Python: A Guide for Data Scientists” by Andreas C. Müller and Sarah Guido - this is a short book with examples in Python but without a detailed description of ML concepts. For understanding the base concepts of ML, I think you have to read “An Introduction to Statistical Learning with Applications in R” by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (ISBN 978-1-4614-7137-0, ISBN 978-1-4614-7138-7, DOI 10.1007/978-1-4614-7138-7), where ML concepts are described in detail.
    And you always have to be ready for your model not to converge using only one simple method like linear regression. :slight_smile:

Hey Vadim!
Thanks a lot for your feedback. I understand that I did a lot in this that was not important at all. I did it to practice the skills I will need in the future, like chaining these things up in functions.
I will soon upgrade this project to match your recommendations. And I have worked on that bike project too - I request you to take a look at that as well.
And Yes, I have downloaded all the books you recommended and going to dive into them very soon :slight_smile:
Vadim, I thank you from the core of my heart for these fantastic recommendations and feedback …

I am really glad to be around such a helpful person.
My best regards.
Ali

Hi everyone watching this convo: I have redone this project Guided Project - Predicting House Prices

It has a lot of improvements. I am not deleting this post, as @vadim.maklakov has shared a lot of recommendations that I do not wish to be deleted.

Enjoy Learning and helping each other.
Best Regards,
ALI