It’s been a while since I’ve posted in the community; I hope you are all safe and well.
I spent a few days working on one of the guided projects (car rentals), borrowing items and approaches that my colleagues use and that I have picked up on the job, with a focus on model explanation. This comes in the form of model feature importance and SHAP (https://github.com/slundberg/shap). I was completely new to the package, and it was not easy to understand the whys behind certain outputs, but it looks like I got there in the end, more or less.
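As a minimal sketch of what I mean by model feature importance (the feature names and data here are hypothetical stand-ins, not the actual car-rentals columns), the impurity-based importances come straight off a fitted Random Forest; the commented lines show the typical SHAP equivalent, which needs the `shap` package installed:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the rentals data (hypothetical feature names)
X, y = make_regression(n_samples=300, n_features=4, random_state=0)
X = pd.DataFrame(X, columns=["mileage", "engine_power", "car_age", "fuel_econ"])

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, sorted for readability
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# SHAP equivalent (requires `pip install shap`):
# import shap
# explainer = shap.TreeExplainer(model)
# shap_values = explainer.shap_values(X)
# shap.summary_plot(shap_values, X)
```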
I would be happy to get some feedback on:
- The train/test approach, which I followed from Aurélien Géron’s Hands-On Machine Learning book: he isolates a test set completely, to avoid data leakage, and uses the training set to perform validation. I like this approach and I think it really is robust, but it requires setting up the sampling carefully where needed.
- Pipeline processing and methodology (I am struggling a bit with some of the scikit-learn transformers).
- The feature-engineering approach: do the transformations used make sense?
- To scale or not to scale: given the regression problem, I felt I did not need to scale the variables, but at the same time I wonder whether this could have impacted the Random Forest feature importances.
Here is the whole notebook: