Prior Odds for Spam and Ham:
- We start by specifying the prior odds for spam versus ham. Here we assume a simple 1:1 ratio, meaning that, on average, half of the incoming messages are spam.
- In reality, the share of spam is likely much higher, in which case the prior odds would be greater than 1:1.

Likelihood Ratios:
- To estimate likelihood ratios, we need the probability of each word occurring in spam messages and in ham messages.
- We estimate these probabilities from training data that contains both spam and legitimate messages.
- For each word (e.g., “million”), we count how often it occurs in spam messages and in ham messages, relative to the total number of words in each class.
- Example:
  - Occurrences of “million” in spam messages: 156 out of 95791 words (approximately 1 in 614).
  - Occurrences of “million” in ham messages: 98 out of 306438 words (approximately 1 in 3127).
  - The likelihood ratio for “million” is (1/614) / (1/3127) ≈ 5.1 (rounded to one decimal place).
- A likelihood ratio greater than 1 indicates that the word is more likely to appear in spam than in ham; a ratio less than 1 indicates the opposite.
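
The arithmetic in the “million” example can be checked with a few lines of Python. This is a minimal sketch: the counts come from the example above, while the function name `likelihood_ratio` and the variable names are chosen here for illustration only.

```python
# Counts for the word "million", taken from the example above.
spam_count, spam_total = 156, 95791
ham_count, ham_total = 98, 306438


def likelihood_ratio(spam_count, spam_total, ham_count, ham_total):
    """Return P(word | spam) / P(word | ham), estimated from raw word counts."""
    p_word_given_spam = spam_count / spam_total
    p_word_given_ham = ham_count / ham_total
    return p_word_given_spam / p_word_given_ham


ratio = likelihood_ratio(spam_count, spam_total, ham_count, ham_total)

# "1 in N" form of each probability, as used in the example.
print(f"spam: about 1 in {spam_total / spam_count:.0f}")
print(f"ham:  about 1 in {ham_total / ham_count:.0f}")
print(f"likelihood ratio for 'million': {ratio:.1f}")
```

Working from the raw counts rather than the rounded "1 in 614" and "1 in 3127" figures gives the same result to one decimal place.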

Applying the Method:
- Armed with these likelihood ratios, we can classify new messages as spam or ham based on the words they contain: starting from the prior odds, we multiply by the likelihood ratio of each word in the message to obtain the posterior odds.

Remember that this approach assumes that words occur independently of one another (the “Naïve” part of Naïve Bayes), which simplifies the calculations. If you have difficulty with the math, feel free to review fractions and odds.
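
As a rough illustration of the classification step, the sketch below multiplies the 1:1 prior odds by the likelihood ratio of each word in a message. The ratios are the rounded values from this lesson's likelihood-ratio table; the function names, the example message, and the choice to skip words that have no ratio are assumptions made for this sketch, not part of the lesson.

```python
from math import prod

# Likelihood ratios from the lesson's table (rounded to one decimal place).
LIKELIHOOD_RATIOS = {
    "million": 5.1,
    "dollars": 0.8,
    "adclick": 53.2,
    "conferences": 0.3,
}


def posterior_odds(words, prior_odds=1.0, ratios=LIKELIHOOD_RATIOS):
    """Multiply the prior odds by the likelihood ratio of every known word.

    Treating the words as independent is the "naive" assumption.
    Words without a known ratio are skipped (an assumption of this sketch).
    """
    return prior_odds * prod(ratios[w] for w in words if w in ratios)


def odds_to_probability(odds):
    """Convert odds of spam:ham into the probability of spam."""
    return odds / (1.0 + odds)


# Hypothetical message, already split into words.
message = ["million", "dollars", "adclick"]
odds = posterior_odds(message)

label = "spam" if odds > 1 else "ham"
print(f"posterior odds: {odds:.1f}:1 -> {label}")
print(f"probability of spam: {odds_to_probability(odds):.3f}")
```

Posterior odds above 1 favor spam and odds below 1 favor ham. A different prior, such as one reflecting a higher share of spam, can be passed through `prior_odds`.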
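
Word counts in the training data (the last row gives the total number of words in each class):

word | spam | ham |
---|---|---|
million | 156 | 98 |
dollars | 29 | 119 |
adclick | 51 | 0 |
conferences | 0 | 12 |
total | 95791 | 306438 |

Resulting likelihood ratios:

word | likelihood ratio |
---|---|
million | 5.1 |
dollars | 0.8 |
adclick | 53.2 |
conferences | 0.3 |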
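
Used directly, the zero counts for “adclick” and “conferences” would give an infinite or a zero ratio, so some adjustment is needed to arrive at finite values. The sketch below assumes that every estimated probability is floored at 1/100000. This floor is an assumption and is not stated in the lesson, although under it the computed ratios round to the values listed in the table. The names `COUNTS`, `FLOOR`, and `floored_ratio` are likewise illustrative.

```python
# Word counts (spam, ham) and class totals from the lesson's table.
COUNTS = {
    "million": (156, 98),
    "dollars": (29, 119),
    "adclick": (51, 0),
    "conferences": (0, 12),
}
SPAM_TOTAL, HAM_TOTAL = 95791, 306438

# Assumed lower bound on any word probability (not stated in the lesson).
FLOOR = 1 / 100000


def floored_ratio(spam_count, ham_count, floor=FLOOR):
    """Likelihood ratio with each probability clipped from below at `floor`."""
    p_word_given_spam = max(spam_count / SPAM_TOTAL, floor)
    p_word_given_ham = max(ham_count / HAM_TOTAL, floor)
    return p_word_given_spam / p_word_given_ham


table = {word: round(floored_ratio(s, h), 1) for word, (s, h) in COUNTS.items()}

for word, ratio in table.items():
    print(f"{word} | {ratio} |")
```

Other smoothing schemes, such as adding one to every count, would also avoid division by zero but would generally give somewhat different ratios.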