I am studying some machine learning models with some data sets that I have found on the internet. I found one quite interesting and it has me stuck without achieving a solution.
The main difficulty with this data set is that there are only cars that were made in China in the training data set. While in the test data set there are cars made in China and the US. The goal is to predict whether the car is made in China or the US and then predict how long the car will last.
Do you have any idea how to identify the country that made these cars?
Training data set has the following variables: Weight, Price, Length, Origin (All from China), Planned number of years to last.
Test data set: Weight, Price, Length.
While I am not a machine learning expert, My approach would be to:
And then, based on my findings, add some artificial rows of US cars to the training data.
As you’ve pointed out, you don’t really have what you need for supervised machine learning.
Seems like there are a few next-best options.
As Sahil mentioned, you can try to improve the training set - right now it can’t help you differentiate between manufacture in US and China because it only has one.
Depending on what you’re doing, you can re-shuffle the training and test sets and make sure the new training set is a stratified sample, or use a technique like leave-one-out cross validation to build a model on all the data (although you won’t be able to validate how well the data generalizes, since you’ve used all your data to build the model).
Another alternative would be to add some simplifying assumptions based on properties of the training list. In the extreme, machine learning can be reduced to asking the question “Is this new car on the list of cars from China?”. In this case, (where you felt like you had most of the specs for all cars you might come across in the training data) you might assume that there are no overlaps in specifications between cars from China and cars from the US. In a more relaxed case, you might assume that all cars from China will resemble each other on these characteristics and that all cars from the US will NOT resemble examples that you have from China. In this case, you might adopt an approach that finds nearest neighbors and assigns China when the nearest neighbor is closer than an arbitrary threshold, and US when it is farther than the arbitrary threshold. These assumptions could very easily turn out to be false, and so you’ll need to do some homework to figure out if any of these might apply. (This is why subject matter expertise makes a big difference in data science - knowing a lot about cars from China and the US can meaningfully guide your assumptions here).
I’ll close by pointing out that machine learning isn’t an all-purpose tool. You can do your best to apply it to this dataset, but if it turns out you can’t, there’s no shame in turning to other methods to try to answer your question. Maybe your local car mechanic can just look the answer up for you in their parts system