This was a long one. It started with a simple trick to stand out from the crowd:
- check performance of 3 different versions of the dataset on different models
Then I started experimenting with the random seed, and that opened a Pandora's box:
I ran every model on 100 different random seeds to see how it performs across 100 cases rather than in a single one (I’d really be grateful for feedback on this). Instead of looking for the lowest result among those 100 runs, I looked for the model that performed best on average across all 100 runs (100 different random seeds).
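For anyone who wants to try the seed idea without opening the notebook, here is a minimal sketch of it. It is not the notebook's code. It assumes a k-nearest-neighbors regressor scored with RMSE on a shuffled train/test split, and it uses a made-up one-feature dataset in place of the cars data. All function names (`make_toy_data`, `rmse_for_seed`, `summarize_over_seeds`) are hypothetical.

```python
import random
import statistics

def make_toy_data(n=60):
    """Synthetic (feature, target) pairs; a stand-in for the real cars data."""
    rng = random.Random(0)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        rows.append((x, 3 * x + rng.gauss(0, 1)))
    return rows

def knn_predict(train, x, k):
    """Predict the mean target of the k training rows closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return statistics.fmean(y for _, y in nearest)

def rmse_for_seed(rows, k, seed, test_fraction=0.25):
    """Shuffle with the given seed, split into train/test, return test RMSE."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    train, test = shuffled[:cut], shuffled[cut:]
    squared_errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in test]
    return statistics.fmean(squared_errors) ** 0.5

def summarize_over_seeds(rows, k, seeds):
    """Score one model on every seed and summarize the spread of results."""
    scores = [rmse_for_seed(rows, k, seed) for seed in seeds]
    return {
        "mean": statistics.fmean(scores),
        "std": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }

rows = make_toy_data()
summary = {k: summarize_over_seeds(rows, k, range(100)) for k in (1, 3, 5, 7)}

# Rank by the average over all seeds, not by the single luckiest run.
best_k = min(summary, key=lambda k: summary[k]["mean"])
for k, stats in summary.items():
    print(k, {name: round(value, 3) for name, value in stats.items()})
print("best k on average:", best_k)
```

Keeping the standard deviation next to the mean shows how much a single-seed result can swing, which is the reason for averaging in the first place.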
I also checked all column combinations (just picking the top columns from the single-column model results is not a great solution).
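Enumerating every column combination can be sketched with `itertools.combinations`. The column names below are placeholders, not necessarily the ones used in the notebook, and `all_column_subsets` is a hypothetical helper.

```python
from itertools import combinations

# Hypothetical feature names, used only for illustration.
columns = ["horsepower", "curb_weight", "highway_mpg", "width"]

def all_column_subsets(columns, min_size=1, max_size=None):
    """Yield every combination of columns, from min_size up to max_size."""
    max_size = len(columns) if max_size is None else max_size
    for size in range(min_size, max_size + 1):
        yield from combinations(columns, size)

subsets = list(all_column_subsets(columns))
print(len(subsets))  # 2**4 - 1 = 15 non-empty subsets

# The number of runs multiplies quickly: subsets x k-values x seeds.
k_values = [1, 3, 5, 7]   # illustrative choices
n_seeds = 100
print(len(subsets) * len(k_values) * n_seeds)  # 15 * 4 * 100 = 6000
```

With n columns there are 2**n - 1 non-empty subsets, so the results table grows very fast once k-values and seeds are multiplied in.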
Apart from that, the usual: k-values, number of columns, cross-validation. Check it out. I’d be curious whether the approach with 100 random seeds makes any sense or was a total waste of my laptop’s cooling fans (in one case the results dataframe went over 1.4 million rows).
cars_ml_small.ipynb (2.8 MB)