You use the sum of words calculation without explanation.
Why do you treat all the spam messages as one mail when they are not?
Would it not make more sense to average the ratio of filter-word occurrences to word count across messages?
Or to take the minimum ratio as a threshold?
Or, even better, to plot a distribution curve of the ratios (probabilities) and use it to adjust the threshold?
Can you clarify what you mean by “sum of words calculation” here?
The spam messages are not being treated as one mail. The focus is on the words in a given message and whether that message is spam. Then, given the words in a new message, we calculate the probability that the new message is spam.
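To make that concrete, here is a minimal sketch of one common way to do this kind of calculation (multinomial Naive Bayes with add-one smoothing). It is not necessarily the exact code from the post: the function names `train` and `spam_probability`, the `alpha` parameter, and the whitespace tokenization are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(messages):
    """Count words per class from (text, is_spam) pairs (hypothetical helper)."""
    word_counts = {True: Counter(), False: Counter()}
    msg_counts = {True: 0, False: 0}
    for text, is_spam in messages:
        msg_counts[is_spam] += 1
        word_counts[is_spam].update(text.lower().split())
    return word_counts, msg_counts


def spam_probability(text, word_counts, msg_counts, alpha=1.0):
    """Estimate P(spam | words in text); assumes both classes were seen in training."""
    vocab = set(word_counts[True]) | set(word_counts[False])
    total_msgs = msg_counts[True] + msg_counts[False]
    log_score = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        # Start from the class prior, then add one log-likelihood term per word.
        score = math.log(msg_counts[label] / total_msgs)
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen during training
            likelihood = (word_counts[label][word] + alpha) / (
                total_words + alpha * len(vocab)
            )
            score += math.log(likelihood)
        log_score[label] = score
    # Convert the two log scores into a normalized probability.
    top = max(log_score.values())
    spam = math.exp(log_score[True] - top)
    ham = math.exp(log_score[False] - top)
    return spam / (spam + ham)


training = [
    ("win money now", True),
    ("free money offer", True),
    ("meeting at noon", False),
    ("lunch at noon tomorrow", False),
]
word_counts, msg_counts = train(training)
p_spam = spam_probability("free money", word_counts, msg_counts)
p_ham = spam_probability("meeting tomorrow", word_counts, msg_counts)
```

In this sketch the word counts are pooled per class only to estimate how likely each word is in spam versus non-spam. Every new message is still scored on its own words.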
What exactly are these approaches for? To decide whether a message is spam or not? Personally, I would have to think of cases where they might or might not work before I could give a better response. I will think about it.
But, regardless, if you have ideas for your own modifications to the algorithm, then absolutely try them out and share the results with the community as well. I’m sure we could all learn from that too!
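For anyone who wants to experiment with the ratio-based ideas raised in the question, a rough starting point might look like the sketch below. It is only one reading of the suggestion: `filter_word_ratio`, `candidate_thresholds`, and the example filter words are hypothetical, and whether any of these thresholds works well in practice is untested here.

```python
import statistics


def filter_word_ratio(text, filter_words):
    """Fraction of words in a message that appear in the filter-word set."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in filter_words)
    return hits / len(words)


def candidate_thresholds(spam_texts, filter_words):
    """Summarize per-message ratios over known spam (needs at least one message)."""
    ratios = sorted(filter_word_ratio(text, filter_words) for text in spam_texts)
    return {
        "ratios": ratios,  # sorted values, e.g. for plotting a distribution
        "mean": statistics.mean(ratios),
        "min": ratios[0],
        "median": statistics.median(ratios),
    }


filter_words = {"free", "money", "win"}  # assumed example set
spam_texts = ["win free money", "free offer today now"]
summary = candidate_thresholds(spam_texts, filter_words)
```

A new message could then be flagged when its own ratio exceeds whichever of these values you pick as the threshold, and the results compared against the probability-based approach.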