When I binned the ‘hr’ column in my data set and trained a decision tree model, I got an error of around 30,000, but when I used the column without binning, the error dropped to 2,700. I expected some change in error, but why is the change this big? Is it bad to use binning with a decision tree algorithm?
Here’s a pointer on how visualization affects feature engineering: https://www.youtube.com/watch?v=N9fDIAflCMY
Binning at the right places allows splits at points that might otherwise never be chosen, because the DT algorithm is greedy and other split points can dominate.
@hanqi thanks for your response! Can you explain the big difference between the errors? Is there a proper reason behind this, or is it just by chance?
A decision tree is a partition-based algorithm. New data points to be predicted flow down the feature < value rules (comparison direction varies by implementation) built during model training. Each new data point ends up in exactly one partition of the training instance space. From there, a DT classifier usually assigns the majority class of that partition to the new data point, while a DT regressor uses the average of the training targets in it.
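To make the flow concrete, here is a minimal sketch with a hand-built tree. The thresholds, targets and labels are made up for illustration (they are not from your data set), and the helper names are hypothetical, not any library's API.

```python
from collections import Counter
from statistics import mean

# Hypothetical hand-built tree over one feature, 'hr' (hour of day).
# Internal nodes hold a `feature < value` rule; each leaf holds the
# training targets/labels that fell into that partition (made-up values).
tree = {
    "feature": "hr", "value": 7,
    "left": {"targets": [10, 20, 30], "labels": ["low", "low", "mid"]},
    "right": {
        "feature": "hr", "value": 19,
        "left": {"targets": [200, 220, 240], "labels": ["high", "high", "mid"]},
        "right": {"targets": [80, 100], "labels": ["mid", "mid"]},
    },
}

def find_leaf(node, point):
    """Follow the `feature < value` rules until a leaf (partition) is reached."""
    while "targets" not in node:
        go_left = point[node["feature"]] < node["value"]
        node = node["left"] if go_left else node["right"]
    return node

def predict_regression(node, point):
    """Regressor: average of the training targets in the point's partition."""
    return mean(find_leaf(node, point)["targets"])

def predict_class(node, point):
    """Classifier: majority class among the training labels in the partition."""
    return Counter(find_leaf(node, point)["labels"]).most_common(1)[0][0]

print(predict_regression(tree, {"hr": 12}), predict_class(tree, {"hr": 12}))
```

Every point with 7 <= hr < 19 gets the same prediction here, because they all land in the same partition.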
When you bin, you reduce the number of partitions available. Given the same number of training points, more points fall into each partition and are averaged together to give a prediction. This can also mean less variance among the average values across the partitions, so a new point has less chance of causing a big error value.
If you really want to prove what I said above, you can implement your own DT regressor, simulate some data, and compare binned vs. unbinned. Then you will know decision trees inside out: https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python/
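A rough sketch of that experiment, using only the standard library. Everything in it is an assumption for illustration: the hourly pattern in demand(), the noise level, the 6-hour bins and the single-feature greedy tree are all made up, so it will not reproduce your 30,000 vs 2,700 numbers.

```python
import random

def avg(values):
    return sum(values) / len(values)

def sse(ys):
    """Sum of squared errors around the mean (the split criterion used here)."""
    m = avg(ys)
    return sum((y - m) ** 2 for y in ys)

def build_tree(xs, ys, min_leaf=5):
    """Greedy one-feature regression tree: choose the `x < t` split with lowest SSE."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, t)
    if best is None:
        return avg(ys)  # leaf: predict the partition average
    t = best[1]
    left = [(x, y) for x, y in zip(xs, ys) if x < t]
    right = [(x, y) for x, y in zip(xs, ys) if x >= t]
    return (t, build_tree(*zip(*left), min_leaf), build_tree(*zip(*right), min_leaf))

def predict(tree, x):
    while isinstance(tree, tuple):  # internal node: (threshold, left, right)
        t, left, right = tree
        tree = left if x < t else right
    return tree

def demand(hr):
    # Hypothetical hourly pattern with sharp peaks around 8 and 17.
    return 50 + 300 * (hr in (8, 17)) + 150 * (hr in (7, 9, 16, 18))

rng = random.Random(0)

def sample(n):
    hrs = [rng.randrange(24) for _ in range(n)]
    return hrs, [demand(h) + rng.gauss(0, 10) for h in hrs]

def to_bin(hr):
    return hr // 6  # assumed binning: four 6-hour bins

def mse(tree, xs, ys):
    return avg([(predict(tree, x) - y) ** 2 for x, y in zip(xs, ys)])

train_hr, train_y = sample(2000)
test_hr, test_y = sample(500)

raw_tree = build_tree(train_hr, train_y)
bin_tree = build_tree([to_bin(h) for h in train_hr], train_y)

mse_raw = mse(raw_tree, test_hr, test_y)
mse_binned = mse(bin_tree, [to_bin(h) for h in test_hr], test_y)
print(f"MSE raw hr: {mse_raw:.1f}   MSE binned hr: {mse_binned:.1f}")
```

In this setup the bin edges cut across the peaks, so the binned tree has to average very different hours together and its error comes out higher. With other data or other bin edges the comparison can look quite different, which is why it is worth running on your own column.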
@hanqi thanks! That really helped.