Making sense of Bayes' rule
February 17, 2019
Bayes' rule has always seemed like witchcraft to me. But today I went through the derivation and then thought about why it is useful, and suddenly it started making a bit more sense.
Thinking about it in a machine learning setting, let's say that we have a dataset containing cats and dogs, along with their names. We would like to classify an unknown pet as either a cat or a dog, given only its name. We can use Bayes' rule to help us.
First, let's understand where the rule comes from. We can derive it from two conditional probability identities.
The probability that the pet is called Rover and is a dog is the probability that the pet is called Rover given that it is a dog, multiplied by the probability that it is a dog.
[latex size=2]P(Rover \cap dog) = P(Rover | dog) \cdot P(dog)[/latex]
Similarly, we have:
[latex size=2]P(dog \cap Rover) = P(dog | Rover) \cdot P(Rover)[/latex]
And now, because the left-hand sides of these two equations describe the same event (the pet is a dog and is called Rover), the right-hand sides must be equal:
[latex size=2]\Rightarrow P(dog | Rover) \cdot P(Rover) = P(Rover | dog) \cdot P(dog) [/latex]
Now for the magic: divide both sides by the probability that the pet is called Rover and, boom, Bayes' rule.
[latex size=2]\Rightarrow P(dog | Rover) = \frac{P(Rover | dog) \cdot P(dog)}{P(Rover)}[/latex]
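As a quick sanity check on that algebra, here is a minimal Python sketch using made-up counts (not the pet data below): both factorisations give the same joint probability, and dividing through recovers the rearranged form.

```python
# Hypothetical counts for two events: A = "called Rover", B = "is a dog".
n_total = 100
n_b = 40         # pets that are dogs
n_a = 25         # pets called Rover
n_a_and_b = 10   # pets that are dogs AND called Rover

p_a = n_a / n_total                # P(A)
p_b = n_b / n_total                # P(B)
p_a_and_b = n_a_and_b / n_total    # P(A and B)
p_a_given_b = n_a_and_b / n_b      # P(A | B)
p_b_given_a = n_a_and_b / n_a      # P(B | A)

# Both factorisations give the same joint probability...
assert abs(p_a_and_b - p_a_given_b * p_b) < 1e-12
assert abs(p_a_and_b - p_b_given_a * p_a) < 1e-12

# ...and dividing through by P(A) recovers Bayes' rule.
assert abs(p_b_given_a - p_a_given_b * p_b / p_a) < 1e-12
```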
Why is this good and useful? Because it allows us to find the probability of a label given a constraint, using known probabilities from the training data!
Replacing dog with the variable class (or label), and Rover with the variable feature, we have derived the following:
[latex size=2]P(class | feature) = \frac{P(feature | class) \cdot P(class)}{P(feature)}[/latex]
In other words, we now have a way of determining the class of an item given one (or more) feature constraints. How does this help with a machine learning classification problem? Well, we can calculate the probability of each class given the constraints and select the class with the highest resulting probability.
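A minimal sketch of that decision rule in Python (the probability dictionaries here are placeholders to be filled in from training data, not a specific library API):

```python
def predict(priors, likelihoods, p_feature):
    """Return the class with the highest P(class | feature).

    priors      -- dict mapping class -> P(class)
    likelihoods -- dict mapping class -> P(feature | class)
    p_feature   -- P(feature), which is the same for every class
    """
    posteriors = {c: likelihoods[c] * priors[c] / p_feature for c in priors}
    return max(posteriors, key=posteriors.get)
```

Note that the denominator P(feature) is the same for every class, so it never changes which class wins; dividing by it only matters if you want the posteriors themselves to sum to one.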
Let’s finish off the pet example. We have the training data in the table below. We are given a mystery pet, which we know is called Rover. We wish to predict whether it’s a cat or a dog using only its name and the probabilities derived from this table.
| Kind | Name    |
|------|---------|
| Dog  | Rover   |
| Dog  | Max     |
| Cat  | Fluffy  |
| Dog  | Rover   |
| Cat  | Leo     |
| Dog  | Bruiser |
| Dog  | John    |
| Cat  | Simba   |
| Cat  | Tigger  |
| Cat  | Rover   |
Now we have [latex size=2]P(dog) = 0.5[/latex], and [latex size=2]P(cat) = 0.5[/latex]. Of the 5 dogs, two are called Rover, so we have [latex size=2]P(Rover | dog) = 0.4[/latex]. Of the 5 cats, one is called Rover, so we have [latex size=2]P(Rover | cat) = 0.2[/latex]. Finally, of all the pets, three are called Rover, so [latex size=2]P(Rover) = 0.3[/latex]. We now have everything required to make a prediction.
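Here is the same bookkeeping as a minimal Python sketch, with the table above written out as (kind, name) pairs:

```python
from collections import Counter

# The training table above, as (kind, name) pairs.
pets = [
    ("dog", "Rover"), ("dog", "Max"), ("cat", "Fluffy"), ("dog", "Rover"),
    ("cat", "Leo"), ("dog", "Bruiser"), ("dog", "John"), ("cat", "Simba"),
    ("cat", "Tigger"), ("cat", "Rover"),
]

n = len(pets)
kind_counts = Counter(kind for kind, _ in pets)

p_dog = kind_counts["dog"] / n                            # P(dog) = 0.5
p_cat = kind_counts["cat"] / n                            # P(cat) = 0.5
p_rover = sum(name == "Rover" for _, name in pets) / n    # P(Rover) = 0.3

p_rover_given_dog = sum(p == ("dog", "Rover") for p in pets) / kind_counts["dog"]  # 0.4
p_rover_given_cat = sum(p == ("cat", "Rover") for p in pets) / kind_counts["cat"]  # 0.2
```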
For a cat:
[latex size=2]P(cat | Rover) = \frac{P(Rover | cat) \cdot P(cat)}{P(Rover)} \\ \\
= \frac{0.2 \cdot 0.5}{0.3} \\ \\ = 0.333[/latex]
And for a dog:
[latex size=2]P(dog | Rover) = \frac{P(Rover | dog) \cdot P(dog)}{P(Rover)} \\ \\
= \frac{0.4 \cdot 0.5}{0.3} \\ \\ = 0.667[/latex]
So our model predicts that the pet is a dog – which sounds good to me!
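Plugging the quantities from the sketch above into the formula reproduces these numbers:

```python
p_cat_given_rover = p_rover_given_cat * p_cat / p_rover   # 0.2 * 0.5 / 0.3 = 0.333...
p_dog_given_rover = p_rover_given_dog * p_dog / p_rover   # 0.4 * 0.5 / 0.3 = 0.667...

print("dog" if p_dog_given_rover > p_cat_given_rover else "cat")  # -> dog
```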
OK, so can we trick it?
From the pet dataset above, it might seem that dog was given the highest probability simply because there are more dogs called Rover than cats called Rover. So can we fool it by introducing a new category, fish, all of which are called Rover?
Let's try, using the above dataset but adding a single fish that belongs to a joker who decided to call it Rover. That means we now have [latex size=2]P(dog) = 5/11 \approx 0.45[/latex], [latex size=2]P(cat) = 5/11 \approx 0.45[/latex], and [latex size=2]P(fish) = 1/11 \approx 0.09[/latex].
Notice how [latex size=2]P(Rover | dog)[/latex] does not change: given a dog, there is still a probability of 0.4 that it is called Rover. Similarly, [latex size=2]P(Rover | cat)[/latex] remains the same. We can also compute the likelihood for a fish: [latex size=2]P(Rover | fish) = 1[/latex], since there is only one fish and its name is Rover. Finally, four of the 11 pets are called Rover, so [latex size=2]P(Rover) = 4/11 \approx 0.36[/latex].
So the question is, will it trick the system? Let’s run the numbers.
For a dog:
[latex size=2]P(dog | Rover) = \frac{P(Rover | dog) \cdot P(dog)}{P(Rover)} \\ \\
= \frac{ 0.4 \cdot 0.45 }{0.36} = 0.5 [/latex]
For a cat:
[latex size=2]P(cat | Rover) = \frac{P(Rover | cat) \cdot P(cat)}{P(Rover)} \\ \\
= \frac{ 0.2 \cdot 0.45 }{0.36} = 0.25 [/latex]
For a fish:
[latex size=2]P(fish | Rover) = \frac{P(Rover | fish) \cdot P(fish)}{P(Rover)} \\ \\
= \frac{ 1 \cdot 0.09 }{0.36} = 0.25 [/latex]
So no! Even though a fish is most likely to be called Rover, there are many more dogs, so there are many more chances for the pet to be a dog than there are for it to be a fish. The model still correctly (according to intuition, at least) predicts that the pet is a dog.
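The earlier sketches extend directly: add the joker's fish to the pets list from the counting example, rebuild the probability dictionaries, and the hypothetical predict helper from before still picks dog.

```python
pets_with_fish = pets + [("fish", "Rover")]
n = len(pets_with_fish)

priors, likelihoods = {}, {}
for kind in {k for k, _ in pets_with_fish}:
    names = [name for k, name in pets_with_fish if k == kind]
    priors[kind] = len(names) / n                                            # e.g. P(fish) = 1/11
    likelihoods[kind] = sum(name == "Rover" for name in names) / len(names)  # P(Rover | kind)

p_rover = sum(name == "Rover" for _, name in pets_with_fish) / n             # 4/11

print(predict(priors, likelihoods, p_rover))  # -> dog (posteriors roughly 0.5, 0.25, 0.25)
```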
And this is what leads to the counterintuitive and surprising results the theorem is known for. You know the ones: if a doctor tests for a rare disease (say, one affecting 1 in 10,000 people), the test is 99% accurate, and you test positive, chances are high that you actually do not have the disease. That's because the disease is so rare that the number of opportunities for a false positive far outweighs the number of opportunities for a true positive.
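To put rough numbers on that example (assuming "99% accurate" means both a 99% true-positive rate and a 99% true-negative rate), Bayes' rule gives:

[latex size=2]P(disease | positive) = \frac{P(positive | disease) \cdot P(disease)}{P(positive)} \\ \\
= \frac{0.99 \cdot 0.0001}{0.99 \cdot 0.0001 + 0.01 \cdot 0.9999} \approx 0.0098[/latex]

The denominator simply adds up the two ways of testing positive: true positives from the rare sick group and false positives from the huge healthy group. So even after a positive result, the chance of actually having the disease is only about 1%.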