I have tried to build an ML model using RandomForest, AdaBoost, LR, and XGBoost, but unfortunately I only got an R2 of 42%, and my model is most likely overfitting. Are there any ideas to improve my model's accuracy, such as building a deep learning model or trying another ML algorithm?
Total samples: 5000
Correlation matrix: weak correlation between the features and the label
To give some more context, you may want to compare the performance of your models to a simple baseline estimator, like one that always predicts the mean of the target.
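A minimal sketch of such a baseline comparison, assuming scikit-learn is available. The data here is synthetic and only stands in for the real dataset; `X`, `y`, and the 80/20 split are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 5000 rows, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
y = X[:, 0] + rng.normal(scale=2.0, size=5000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_r2 = r2_score(y_test, baseline.predict(X_test))

# A constant prediction cannot score above 0 on R2, so this lands at or just below 0.
print(f"baseline R2 on hold-out: {baseline_r2:.4f}")
```

Any real model should beat this number clearly; an R2 of 0.42 is then read relative to a baseline of roughly 0.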
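You can check for overfitting by measuring performance on a hold-out set. If the score on the training data is much higher than on the hold-out set, the model is overfitting.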
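As a rough illustration of that check, the sketch below fits a random forest on synthetic noisy data (an assumption, not the asker's data) and compares R2 on the training part with R2 on the hold-out part.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic, deliberately noisy data (assumption): only one feature carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] + rng.normal(scale=2.0, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# .score() returns R2 for scikit-learn regressors.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
gap = train_r2 - test_r2

print(f"train R2: {train_r2:.3f}, hold-out R2: {test_r2:.3f}, gap: {gap:.3f}")
```

A large positive gap suggests the model memorises the training data; limiting tree depth or increasing `min_samples_leaf` are common ways to shrink it.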
Lastly, to increase performance, you can try some feature engineering, e.g. binning numeric variables, standardizing numeric variables, etc. Another option is to combine models, taking the average prediction, or building a model on top of model predictions (stacking).
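The two ways of combining models could look roughly like this. It is a sketch on synthetic data; the choice of base models and of `RidgeCV` as the model on top are assumptions, not a recommendation for this specific dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): one linear and one non-linear effect plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + rng.normal(scale=0.5, size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Option 1: average the predictions of separately fitted models.
lr = LinearRegression().fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
avg_pred = (lr.predict(X_test) + rf.predict(X_test)) / 2.0
avg_r2 = r2_score(y_test, avg_pred)

# Option 2: stacking. The final estimator is trained on out-of-fold
# predictions of the base models (5-fold here).
stack = StackingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=RidgeCV(),
    cv=5,
)
stack.fit(X_train, y_train)
stack_r2 = stack.score(X_test, y_test)

print(f"averaging R2: {avg_r2:.3f}, stacking R2: {stack_r2:.3f}")
```

Combining models tends to help most when the base models make different kinds of errors, which is why a linear model is paired with a tree ensemble here.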
To confirm overfitting:
- Use cross-validation, or
- Split the data in a 60:20:20 ratio (train 60%, validation 20%, test 20%).
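Both options can be sketched with scikit-learn as follows. The data is synthetic and `Ridge` is only a placeholder model (both are assumptions); the 60:20:20 split is obtained by calling `train_test_split` twice.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.0, size=1000)

# Option 1: 5-fold cross-validation, one R2 score per fold.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print(f"CV R2: mean={scores.mean():.3f}, std={scores.std():.3f}")

# Option 2: 60:20:20 split. First hold out 20% as the test set,
# then take 25% of the remaining 80% (= 20% overall) as validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)
print(len(X_train), len(X_val), len(X_test))
```

The validation set is used for tuning and model choice; the test set is touched once at the end to report the final score.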
As per your post, it looks like most of the models (linear and tree-based) are performing poorly. Maybe you need to refine the data pre-processing and focus more on feature selection and engineering.
Data pre-processing:
- If the data is not normally distributed (for example, strongly right-skewed), you can apply a log transformation.
- Scale your data using StandardScaler or MinMaxScaler.
- Check for multicollinearity and remove highly correlated features.
- Use appropriate encodings for categorical features: label, one-hot, or mean encoding.
- Select features iteratively, e.g. filter out the less important features, see how performance changes, then add some features back and compare.
- Try a stacked ensemble.
- You can try GBM or CatBoost regressors as well.
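Several of the pre-processing steps above can be chained in one scikit-learn pipeline. The sketch below is only an illustration under assumptions: the column names (`size`, `age`, `city`) and the synthetic data are hypothetical, pandas is assumed to be installed, and `GradientBoostingRegressor` stands in for the GBM suggestion.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data (assumption): two numeric columns, one categorical column,
# and a positive, right-skewed target.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "size": rng.lognormal(mean=3.0, sigma=0.5, size=n),
    "age": rng.uniform(0, 50, size=n),
    "city": rng.choice(["a", "b", "c"], size=n),
})
city_effect = df["city"].map({"a": 0.0, "b": 0.3, "c": 0.6})
log_y = (0.5 * np.log(df["size"]) - 0.01 * df["age"] + city_effect
         + rng.normal(scale=0.1, size=n))
y = np.exp(log_y).to_numpy()

# Scale numeric columns and one-hot encode the categorical column.
# Scaling matters little for tree models but is needed for linear ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["size", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Fit on log1p(y) and map predictions back with expm1.
model = TransformedTargetRegressor(
    regressor=Pipeline([
        ("prep", preprocess),
        ("gbm", GradientBoostingRegressor(random_state=0)),
    ]),
    func=np.log1p,
    inverse_func=np.expm1,
)

# Pre-processing is refit inside each fold, which avoids leaking
# information from the validation folds into the scaler or encoder.
cv_r2 = cross_val_score(model, df, y, cv=5, scoring="r2")
print(f"CV R2: mean={cv_r2.mean():.3f}")
```

Keeping the transformations inside the pipeline makes it straightforward to add or remove a step and compare cross-validated scores, which fits the iterative feature-selection advice.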