How to handle unknown values of categorical features in model input?

Hi, DataQuest community!

I am currently building a regression model for predicting car prices, and one of the features in the dataset is the brand of the car. I tried to create a new feature based on it that marks whether a brand falls in the top half or the bottom half by mean selling price, but ran into the following issue.

Assume that we have the list of unique car brands present in the dataset: [‘Toyota’, ‘Hyundai’, ‘Skoda’] (in reality the list is of course longer; this one is just an example). We train our model, deploy it, and then someone tries to predict the price of a car with the brand ‘Volkswagen’, which wasn’t present in the training dataset. Is there any advice for handling such unexpected inputs?
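To make the situation concrete, here is a minimal sketch of such a top-half/bottom-half feature in plain Python. The prices are made up for illustration, the name `brand_tier` is hypothetical, and returning a separate "unknown" value for an unseen brand is only one possible assumption, not an established answer.

```python
from statistics import mean, median

# Hypothetical training rows: (brand, selling price). Values are invented.
train = [
    ("Toyota", 20000), ("Toyota", 24000),
    ("Hyundai", 15000), ("Hyundai", 17000),
    ("Skoda", 18000), ("Skoda", 19000),
]

# Mean selling price per brand, computed from the training data only.
prices_by_brand = {}
for brand, price in train:
    prices_by_brand.setdefault(brand, []).append(price)
brand_mean = {b: mean(p) for b, p in prices_by_brand.items()}

# Brands whose mean price is above the median of the brand means count as "top".
cutoff = median(brand_mean.values())

def brand_tier(brand):
    """Return 'top' or 'bottom' for a known brand, 'unknown' for an unseen one."""
    if brand not in brand_mean:
        # Assumption: unseen brands get their own explicit value
        # instead of raising a KeyError at prediction time.
        return "unknown"
    return "top" if brand_mean[brand] > cutoff else "bottom"
```

With this sketch, `brand_tier("Volkswagen")` has no mean price to look up, which is exactly the gap described above.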

@Zaika_Bohdan: Generally, it’s much better if the training data and the data you want to predict on cover the same or similar brands (with similar data points or statistics). Otherwise, you might want to find out how to adapt a model trained on other brands to your specific use case (VW cars).

It might also be useful to train on car brands from a similar region, so German or European brands rather than mainly Japanese ones (I would assume), if you are targeting a certain market, because imported cars tend to be pricier than ones produced locally or within the same region. If you want to predict car prices in, say, Germany, where EU/German brands are more common, then maybe you can scale down the share of American/Asian brands.

Hope my advice helps you!

You would have to use dummy variables for the brand of the car. So you’ll have a ‘Toyota’ column, a ‘Hyundai’ column, and so on. If a car is a Hyundai, it’ll get a 1 in the Hyundai column and a 0 in every other brand column.

If you entered a car that’s a Volkswagen, it would get a zero in every brand’s column and there would be no Volkswagen column for it to get a 1 in.

You could change the model to add a Volkswagen column in order to include more vehicles, but barring that, it would be treated as having a 0 in all brand categories.
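The dummy-variable behaviour described above can be sketched in a few lines of plain Python. This is only an illustration under the assumption that the dummy columns are fixed to the brands seen in training; `TRAIN_BRANDS` and `encode_brand` are hypothetical names.

```python
# Brands seen during training; they fix the set of dummy columns.
TRAIN_BRANDS = ["Toyota", "Hyundai", "Skoda"]

def encode_brand(brand, known_brands=TRAIN_BRANDS):
    """Return one 0/1 value per known brand; an unseen brand yields all zeros."""
    return [1 if brand == known else 0 for known in known_brands]

print(encode_brand("Hyundai"))     # [0, 1, 0]
print(encode_brand("Volkswagen"))  # [0, 0, 0]
```

In practice a library would usually do this for you. For example, scikit-learn’s `OneHotEncoder` with `handle_unknown="ignore"` encodes categories not seen during fitting as all zeros.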

Hope this helps!