CanIt by Roaring Penguin uses Bayesian filtering, a statistical technique in which CanIt assigns each message a spam probability based on training from users. Bayesian filtering can greatly improve CanIt's accuracy and makes it harder for spammers to evade filtering. You can enable Bayesian filtering under Preferences > Quarantine Settings > Add links to message to train Bayesian analyzer.

Bayesian filtering works as follows:

1) Each incoming e-mail message is broken up into tokens. Roughly speaking, a token corresponds to a word. In addition to single-word tokens, CanIt keeps track of token pairs, which can greatly increase the accuracy of Bayesian filtering.


2) End users train Hosted CanIt by marking a message as spam or non-spam. Each time a message is marked, CanIt updates counters for each token and token pair in the message. The training statistics are unique to each stream; each stream therefore has its own training set and its own notion of what is and isn’t spam. The set of messages on which CanIt is trained is called the training corpus.


3) When the size of the training corpus is large enough (see the Global Settings list below), CanIt applies statistical analysis to incoming messages. Each token in the message is looked up to see how many times it appeared in spam messages and how many times in non-spam messages. The 15 “most interesting” tokens are collected, and a combined probability is computed from their individual token probabilities. A token is considered “interesting” if it is either very likely to appear in a spam message or very likely to appear in a non-spam message. Tokens that appear about equally often in spam and non-spam messages are not considered interesting.


4) After CanIt computes the combined probability, it consults a table to add points to (or subtract points from) the spam score.
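
The four steps above can be sketched in a few lines of Python. This is only an illustration of the general technique, not CanIt's actual implementation: the class and function names, the tokenizing rule, the clamping of probabilities to 0.01 to 0.99, the naive-Bayes combining formula, and the values in the score table are all assumptions made for the example. The minimum corpus size check is left out for brevity.

```python
import re
from collections import Counter
from math import prod


def tokenize(text):
    """Step 1: split a message into word tokens plus adjacent-word pairs."""
    words = re.findall(r"[a-z0-9$'-]+", text.lower())
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return set(words) | set(pairs)


class Stream:
    """Step 2: each stream keeps its own counters (its own training corpus)."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        """Update the counters for every token and token pair in the message."""
        tokens = tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.n_spam += 1
        else:
            self.ham_counts.update(tokens)
            self.n_ham += 1

    def token_probability(self, token):
        """Return the spam probability of one token, or None if never seen."""
        s = self.spam_counts[token]
        h = self.ham_counts[token]
        if s + h == 0:
            return None
        spam_rate = s / max(self.n_spam, 1)
        ham_rate = h / max(self.n_ham, 1)
        p = spam_rate / (spam_rate + ham_rate)
        # Clamp so a single token can never be absolutely decisive (assumption).
        return min(0.99, max(0.01, p))

    def spam_probability(self, text, top_n=15):
        """Step 3: combine the top_n most interesting token probabilities."""
        probs = []
        for token in tokenize(text):
            p = self.token_probability(token)
            if p is not None:
                probs.append(p)
        if not probs:
            return 0.5  # nothing known about this message
        # "Interesting" means far from 0.5 in either direction.
        probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
        top = probs[:top_n]
        spam = prod(top)
        ham = prod(1 - p for p in top)
        return spam / (spam + ham)


# Step 4: hypothetical table mapping probability to score points.
SCORE_TABLE = [(0.95, 5.0), (0.80, 2.5), (0.20, 0.0), (0.05, -2.5)]


def score_adjustment(probability):
    """Look up the points to add to (or subtract from) the spam score."""
    for threshold, points in SCORE_TABLE:
        if probability >= threshold:
            return points
    return -5.0


stream = Stream()
stream.train("Buy cheap pills now", is_spam=True)
stream.train("Meeting agenda for Monday", is_spam=False)
print(score_adjustment(stream.spam_probability("cheap pills")))
print(score_adjustment(stream.spam_probability("Monday meeting")))
```

Because every `Stream` object holds its own counters, training one stream has no effect on how another stream scores the same message, which mirrors the per-stream behavior described in step 2.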