Help on project - understanding linear regression

Hi, I’m working on a personal project based on NYC restaurant inspection grading data. My idea is to see if we can predict a restaurant’s grade through ML based on the type of cuisine, borough, etc. I already know the answer is no, there is no correlation between the two, but I will keep pushing through with this project for practice.

I am, however, having trouble with linear regression. I built a linear regression model with scikit-learn, but I am not sure whether what I did is correct.
Here is my code; the part relevant to my question starts at # turn qualitative values into quantitative (everything before that is my data pre-processing).

Would appreciate any help or pointers, I’m a beginner still trying to wrap her head around statistics notions, linear regressions, Python and all that :slight_smile:

Thanks a lot!

Hi @Eline

Well, your R2 score is really low for train and negative for test (a negative R2 means the model does worse than simply predicting the average score every time, so it is basically not predicting anything). The reason might be that there aren’t enough features feeding the algorithm. You’re just using the city of the restaurant to predict the score (if I’m not mistaken). If I were doing this project, I would use CUISINE DESCRIPTION, VIOLATION CODE and CRITICAL FLAG. I know that these features have a lot of missing data, but if the missing data is less than 5% of the full data I would drop those rows; if not, then I would try to fill them. Or I would choose a random sample of 20-25% of the data and see if it works.
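To make the idea of adding features a bit more concrete, here is a minimal sketch, not the definitive way to do it. The DataFrame is a tiny made-up sample, the column names (BORO, CUISINE DESCRIPTION, CRITICAL FLAG, SCORE) are my assumption about how the dataset is laid out, and handle_missing is a hypothetical helper that applies the 5% rule of thumb: drop rows when missing values are rare, otherwise fill them with a placeholder. The categorical columns are then one-hot encoded so the regression receives numbers.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Tiny made-up sample; column names are assumed, values are invented.
df = pd.DataFrame({
    "BORO": ["Manhattan", "Brooklyn", "Queens", "Bronx",
             "Manhattan", "Queens", "Brooklyn", "Bronx"],
    "CUISINE DESCRIPTION": ["Pizza", "Chinese", None, "Pizza",
                            "American", "Chinese", "American", "Pizza"],
    "CRITICAL FLAG": ["Critical", "Not Critical", "Critical", "Critical",
                      "Not Critical", "Critical", "Not Critical", "Critical"],
    "SCORE": [12, 7, 20, 15, 9, 18, 5, 13],
})

features = ["BORO", "CUISINE DESCRIPTION", "CRITICAL FLAG"]


def handle_missing(frame, columns, threshold=0.05):
    """Hypothetical helper: drop rows with missing values in a column when
    they are rare (share below threshold), otherwise fill with 'Unknown'."""
    frame = frame.copy()
    for col in columns:
        missing_share = frame[col].isna().mean()
        if missing_share == 0:
            continue
        if missing_share < threshold:
            frame = frame.dropna(subset=[col])
        else:
            frame[col] = frame[col].fillna("Unknown")
    return frame


clean = handle_missing(df, features)

# One-hot encode the categorical columns so the model gets numeric inputs.
# drop_first=True removes one redundant dummy column per feature.
X = pd.get_dummies(clean[features], drop_first=True, dtype=float)
y = clean["SCORE"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
reg = LinearRegression().fit(X_train, y_train)
print("Train R2:", reg.score(X_train, y_train))
print("Test R2:", reg.score(X_test, y_test))
```

With only eight invented rows the scores mean nothing; the point is just the shape of the pipeline. On the real data you would swap in your own DataFrame and column names.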

Good luck!


Thank you for your reply @alegiraldo666! Yes I only had one feature here to try it out but was planning to add more. I will work on feeding more features to the algorithm then, unsure how exactly to go about that but I will come back here for questions if needed :slight_smile:

What I have right now is correct though, right? The results are low/negative but the way I wrote the code is correct?

Cool! Hope everything goes well!

Yes, the code is correct. I just noticed something while I was reading it again:
You didn’t call predict anywhere. It’s interesting to compare the scores on the train and test sets, but the real test is the predictions made by the model, compared against the true test values:

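from sklearn.metrics import r2_score
pred = reg.predict(X_test)
# Compare predictions with the true test labels (y_test assumed to be your test target)
print('Score Prediction:', r2_score(y_test, pred))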

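After that you see the actual R2 score of the model on unseen data, or in other words how well the model works. (It is the same number that reg.score(X_test, y_test) returns, since score calls predict internally.)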
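Here is a small sketch of what the R2 score is actually measuring, using synthetic data I made up for illustration (not the restaurant data). It shows three things: scoring predictions against the true test values matches reg.score, scoring predictions against themselves always gives a perfect 1.0 (so it tells you nothing), and a model that always predicts the training mean lands at 0 or below, which is why a negative test R2 is read as "worse than guessing the average".

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: one feature with a clear linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
reg = LinearRegression().fit(X_train, y_train)
pred = reg.predict(X_test)

# Correct: compare predictions with the true test targets.
r2_manual = r2_score(y_test, pred)
# Equivalent shortcut: score() predicts internally, then computes R2.
r2_builtin = reg.score(X_test, y_test)

# Uninformative: predictions compared with themselves are always perfect.
r2_self = r2_score(pred, pred)

# Baseline: always predict the training mean. Its test R2 is never above 0.
baseline = np.full_like(y_test, y_train.mean())
r2_baseline = r2_score(y_test, baseline)

print("R2 (y_test vs pred):", r2_manual)
print("R2 (reg.score):     ", r2_builtin)
print("R2 (pred vs pred):  ", r2_self)
print("R2 (mean baseline): ", r2_baseline)
```

If your test R2 comes out below the baseline line, the features are not carrying useful signal for the target, which fits with the point about needing more features.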

Good luck!


Great, thank you so much again for your help @alegiraldo666 :slight_smile: I’ve just started learning, so it’s nice to get some feedback.