First, let’s recap the key components:
- Prior Odds: These represent our initial belief about whether a message is spam or ham. In our case, we’ll start with 1:1 odds (equivalent to a 50% probability each for spam and ham).
- Likelihood Ratio: This factor tells us how much more likely a certain word or feature is in spam compared to ham. We calculate this by analyzing a large dataset of labeled messages.
Now, let’s consider a single-word message. Suppose we receive the word “discount.” We want to determine whether a message containing it is more likely to be spam or ham.
- Calculate Likelihood Ratio:
  - We look at our dataset and find that the word “discount” appears in 80% of spam messages and only 10% of ham messages.
  - The likelihood ratio for “discount” is:
    $$\frac{\text{Probability of “discount” in spam}}{\text{Probability of “discount” in ham}} = \frac{0.8}{0.1} = 8$$
- Update Prior Odds:
  - We multiply our prior odds (1:1) by the likelihood ratio (8):
    - New odds for spam: $1 \times 8 = 8$
    - New odds for ham: $1 \times 1 = 1$
- Normalize the Odds:
  - To get probabilities, we normalize the odds:
    - Probability of spam: $\frac{\text{New odds for spam}}{\text{Total odds}} = \frac{8}{8 + 1} \approx 0.89$
    - Probability of ham: $\frac{\text{New odds for ham}}{\text{Total odds}} = \frac{1}{8 + 1} \approx 0.11$
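The three steps above can be sketched in a few lines of Python. This is only a minimal illustration using the numbers from the example; the function name `spam_probability` and its parameters are assumptions made for this sketch, not part of any library.

```python
def spam_probability(p_word_given_spam, p_word_given_ham, prior_odds=1.0):
    """Return P(spam | word), given the word's frequency in each class
    and the prior odds of spam to ham (1.0 means 1:1)."""
    # Step 1: likelihood ratio of the word in spam versus ham.
    likelihood_ratio = p_word_given_spam / p_word_given_ham
    # Step 2: update the prior odds; the ham side of the ratio stays at 1.
    posterior_odds = prior_odds * likelihood_ratio
    # Step 3: normalize the odds (posterior_odds : 1) into a probability.
    return posterior_odds / (posterior_odds + 1.0)


p_spam = spam_probability(0.8, 0.1)
print(f"spam: {p_spam:.2f}, ham: {1 - p_spam:.2f}")  # spam: 0.89, ham: 0.11
```

Passing a different `prior_odds` (for example `0.25` for 1:4 odds of spam to ham) shows how a more skeptical starting belief lowers the final probability for the same word.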
So, based on the word “discount,” our updated probabilities are approximately 89% for spam and 11% for ham. If the message contains more words, we can repeat the update, multiplying the current odds by each word’s likelihood ratio in turn, to refine our classification further.
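For a message with several words, one way to apply the same process is to multiply the odds by each word's likelihood ratio in turn. The sketch below assumes that words occur independently of each other within each class (the "naive" Bayes assumption), which the text above does not state explicitly. The `classify` function and the `LIKELIHOOD_RATIOS` table are hypothetical names for this example, and only the value for "discount" comes from the worked example; the other ratios are invented for illustration.

```python
# Likelihood ratios: P(word | spam) / P(word | ham).
# "discount" is taken from the worked example; "free" and "meeting"
# are made-up values used only to illustrate the update.
LIKELIHOOD_RATIOS = {"discount": 8.0, "free": 4.0, "meeting": 0.25}


def classify(words, prior_odds=1.0, ratios=LIKELIHOOD_RATIOS):
    """Return P(spam | words), treating words as independent given the class."""
    odds = prior_odds
    for word in words:
        # Words with no known ratio are treated as neutral (ratio 1).
        odds *= ratios.get(word, 1.0)
    # Normalize the odds (odds : 1) into a probability.
    return odds / (odds + 1.0)


print(round(classify(["discount"]), 2))             # 0.89
print(round(classify(["discount", "meeting"]), 2))  # odds 8 * 0.25 = 2, so 0.67
```

A word that is more common in ham, such as the hypothetical "meeting" here, has a ratio below 1 and pulls the spam probability back down.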