General Question about predicting results after observing them

So far in the Data Analysis and Visualization portion in the course we have focused on observing and visualizing correlations and patterns in the data.

I am wondering where in our data science learning journey we can learn about predicting a result based on the data?

Using the NYC Public Schools project as an example:

We have the SAT scores and lots of demographic data for each school. We can observe, for example, that the percentage of English language learners correlates negatively with SAT scores, while school safety correlates positively with them.

What I was thinking about during this project is how much effect each of these factors has on a student’s eventual SAT score. For example, if we randomly place a hypothetical student in any of the schools, what SAT score can we expect them to have? What are the chances of them getting a specific score based on the school they attend?

To say it another way: for all of the variables (race, safety scores, percent English learners, etc.), what impact does each of them have on the student’s eventual SAT score? We know percent English learners has a significant impact, but to what degree in relation to everything else?

What area of Data Science, statistics, and programming toolkits focus on this?

Thank you for your time and let me know if I can explain my question any further.


This is a very good question.

The task you have described can be achieved with machine learning. I am assuming that you have completed the data preparation steps.

In my opinion, the first step is knowing the machine learning algorithm(s) that can accomplish the task. Since we are talking about SAT score, which is a continuous variable, regression algorithms are your best choice.

After choosing the regression algorithms, you will also have to deal with your features (or columns). Regression algorithms do not work well with string data, so you may have to transform categorical columns like race into numerical values. You can do this with dummy variables (one-hot encoding). You may also want to scale continuous data, measure feature importance, and train your model on the most important features. There is also feature engineering…
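To make the encoding and scaling steps concrete, here is a minimal sketch in plain Python. The rows, the `borough` column, and the helper names `one_hot` and `standardize` are all made up for illustration; in practice you would more likely reach for `pandas.get_dummies` or sklearn's `OneHotEncoder` and `StandardScaler`, which do the same job.

```python
from statistics import mean, pstdev

# Hypothetical toy rows; column names and values are invented for illustration.
schools = [
    {"borough": "Bronx", "ell_percent": 20.0},
    {"borough": "Queens", "ell_percent": 10.0},
    {"borough": "Bronx", "ell_percent": 30.0},
    {"borough": "Brooklyn", "ell_percent": 20.0},
]


def one_hot(values):
    """Return the sorted categories and one 0/1 indicator list per value."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


def standardize(values):
    """Rescale numbers to mean 0 and standard deviation 1 (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


categories, borough_encoded = one_hot([s["borough"] for s in schools])
ell_scaled = standardize([s["ell_percent"] for s in schools])

# One purely numeric feature row per school: indicator columns + scaled value.
features = [enc + [z] for enc, z in zip(borough_encoded, ell_scaled)]

for row in features:
    print(row)
```

Each string category becomes its own 0/1 column, and the continuous column ends up centered on zero, so no single feature dominates just because of its units.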

Then you select an evaluation metric, train several models, and keep the best-performing one. You can use this model to predict the SAT score. The DataQuest machine learning path covers these topics. The course uses the sklearn library.
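As a rough sketch of that workflow, the example below picks RMSE as the metric, fits a few very simple candidate models on a training split, and keeps the one with the lowest error on held-out schools. All numbers are invented (they are not from the NYC schools data), and `fit_line`, `rmse`, and `pick` are hypothetical helpers written in plain Python so the example runs without extra packages; sklearn's estimators and metrics would normally replace them.

```python
from math import sqrt
from statistics import mean

# Made-up numbers for illustration only; not taken from the NYC schools data.
ell_percent = [5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0]
safety = [8.0, 6.5, 7.5, 6.0, 7.0, 5.5, 6.5, 5.0]
sat = [1500.0, 1440.0, 1410.0, 1330.0, 1310.0, 1240.0, 1200.0, 1150.0]


def fit_line(x, y):
    """Ordinary least squares with one feature: returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    variance = sum((a - mx) ** 2 for a in x)
    slope = covariance / variance
    return slope, my - slope * mx


def rmse(actual, predicted):
    """Root mean squared error: the metric used to compare models."""
    return sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))


def pick(values, indices):
    """Select the entries of `values` at the given positions."""
    return [values[i] for i in indices]


# Alternate schools between the training set and the held-out test set.
train_idx, test_idx = range(0, 8, 2), range(1, 8, 2)
y_train, y_test = pick(sat, train_idx), pick(sat, test_idx)

scores = {}

# Candidate 1: always predict the average training score.
baseline = mean(y_train)
scores["mean baseline"] = rmse(y_test, [baseline] * len(y_test))

# Candidates 2 and 3: a straight-line fit on a single feature each.
for name, feature in [("ell_percent", ell_percent), ("safety", safety)]:
    slope, intercept = fit_line(pick(feature, train_idx), y_train)
    predictions = [slope * v + intercept for v in pick(feature, test_idx)]
    scores[name + " line"] = rmse(y_test, predictions)

best_name = min(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: RMSE = {score:.1f}")
print("best model:", best_name)
```

Scoring on schools the model never saw during fitting is the important part: a model can look good on its training data and still lose to a plain average on new data, as the `safety` line does with these toy numbers.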


I see, thank you for this thorough explanation! Sounds very interesting, I’m looking forward to getting into that material.
