Clinton Montague

Developer, learner of things, functional programming enthusiast, hacker, and all round inquisitor.

Making sense of Bayes’ rule

February 17, 2019

Bayes’ rule has always seemed like witchcraft to me. But today I went through the derivation, and then thought through why it is useful. Suddenly it started making a bit more sense.

Thinking about it in a machine learning setting, let’s say that we have a dataset containing cats and dogs, and their respective names. We would like to classify an unknown pet as either a cat or a dog, given only its name. We can use Bayes’ rule to help us.

First, let’s start by understanding where the rule comes from. We can use the definition of conditional probability to get a couple of useful results.

The probability that the pet is called Rover and is a dog is the probability that the pet is called Rover given that it is a dog, multiplied by the probability that it is a dog.

P(Rover \cap dog) = P(Rover | dog) \cdot P(dog)

Similarly, we have:

P(dog \cap Rover) = P(dog | Rover) \cdot P(Rover)

And now, because the left hand sides of both of those equations are the same, we have:

\Rightarrow P(dog | Rover) \cdot P(Rover) = P(Rover | dog) \cdot P(dog)

Now for the magic – divide both sides by the probability that the pet is called Rover, and — Boom! — Bayes’ rule.

\Rightarrow P(dog | Rover) = \frac{P(Rover | dog) \cdot P(dog)}{P(Rover)}

Why is this good and useful? Because it allows us to find the probability of a label given a constraint, using known probabilities from the training data!

Replacing dog with the variable class (or label), and Rover with the variable feature, we have derived the following:

P(class | feature) = \frac{P(feature | class) \cdot P(class)}{P(feature)}

i.e. a way of determining the class of an item given one (or more) feature constraints. How does this help with a machine learning classification problem? Well, we could calculate the probability of each class given the constraints, and select the class with the highest resulting probability.
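In code, that step is just “compute the posterior for each class and take the largest”. Here is a minimal sketch under that reading, with illustrative names of my own, assuming the probabilities have already been estimated (the worked example below shows how to count them from data):

```python
# A minimal sketch of classification with Bayes' rule: compute
# P(class | feature) for every class and keep the class with the
# highest posterior. All names here are illustrative.

def posterior(p_feature_given_class, p_class, p_feature):
    """Bayes' rule: P(class | feature) = P(feature | class) * P(class) / P(feature)."""
    return p_feature_given_class * p_class / p_feature

def classify(p_feature_given_class, p_class, p_feature):
    """Return the class with the highest posterior probability.

    p_feature_given_class and p_class are dicts keyed by class name;
    p_feature is a single number, P(feature).
    """
    return max(
        p_class,
        key=lambda c: posterior(p_feature_given_class[c], p_class[c], p_feature),
    )

# For example, with the numbers from the pet table further down:
print(classify({"Dog": 0.4, "Cat": 0.2}, {"Dog": 0.5, "Cat": 0.5}, 0.3))  # -> Dog
```

Since P(feature) is the same for every class, it does not change which class wins; it is only kept here so that the code mirrors the formula above.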

Let’s finish off the pet example. We have the training data in the table below. We are given a mystery pet, which we know is called Rover. We wish to predict whether it’s a cat or a dog using only its name and the probabilities derived from this table.

Kind    Name
----    -------
Dog     Rover
Dog     Max
Cat     Fluffy
Dog     Rover
Cat     Leo
Dog     Bruiser
Dog     John
Cat     Simba
Cat     Tigger
Cat     Rover

Now we have P(dog) = 0.5 and P(cat) = 0.5. Of the five dogs, two are called Rover, so P(Rover | dog) = 0.4. Of the five cats, one is called Rover, so P(Rover | cat) = 0.2. Finally, of all ten pets, three are called Rover, so P(Rover) = 0.3. We now have everything required to make a prediction.
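These numbers are nothing more than counts and divisions, so they are easy to reproduce in a few lines of Python (a rough sketch; the pets list simply mirrors the table above):

```python
# Estimate the probabilities by counting rows in the table above.
pets = [
    ("Dog", "Rover"), ("Dog", "Max"), ("Cat", "Fluffy"), ("Dog", "Rover"),
    ("Cat", "Leo"), ("Dog", "Bruiser"), ("Dog", "John"), ("Cat", "Simba"),
    ("Cat", "Tigger"), ("Cat", "Rover"),
]

total = len(pets)
dogs = [name for kind, name in pets if kind == "Dog"]
cats = [name for kind, name in pets if kind == "Cat"]

p_dog = len(dogs) / total                                     # 0.5
p_cat = len(cats) / total                                     # 0.5
p_rover = sum(name == "Rover" for _, name in pets) / total    # 0.3
p_rover_given_dog = dogs.count("Rover") / len(dogs)           # 0.4
p_rover_given_cat = cats.count("Rover") / len(cats)           # 0.2
```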

For a cat:

P(cat | Rover) = \frac{P(Rover | cat) \cdot P(cat)}{P(Rover)} = \frac{0.2 \cdot 0.5}{0.3} = 0.333

And for a dog:

P(dog | Rover) = \frac{P(Rover | dog) \cdot P(dog)}{P(Rover)} = \frac{0.4 \cdot 0.5}{0.3} = 0.667

So our model has made a prediction that the pet is a dog – which sounds good to me!
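The same comparison in code is just the formula typed out, using the probabilities counted above (variable names are mine):

```python
# Plug the counted probabilities into Bayes' rule and compare posteriors.
p_dog, p_cat, p_rover = 0.5, 0.5, 0.3
p_rover_given_dog, p_rover_given_cat = 0.4, 0.2

p_dog_given_rover = p_rover_given_dog * p_dog / p_rover   # 0.666...
p_cat_given_rover = p_rover_given_cat * p_cat / p_rover   # 0.333...
print("Dog" if p_dog_given_rover > p_cat_given_rover else "Cat")  # -> Dog
```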

OK, so can we trick it?

From the pet dataset above, it might seem that dog was given the highest probability simply because there are more dogs called Rover than cats called Rover. So can we fool it by introducing a new category, fish, in which every single fish is called Rover?

Let’s try, using the above dataset but adding a single fish which belongs to a joker who decided to call it Rover. That means that we now have P(dog) = 0.45, P(cat) = 0.45, and P(fish) = 0.09 (rounding to two decimal places).

Notice how P(Rover | dog) does not change — given a dog, there is still a probability of 0.4 that it will be called Rover. Similarly, P(Rover | cat) remains the same. We can also do the same for the fish: P(Rover | fish) = 1 — there’s only one fish, and its name is Rover. Finally, four of the eleven pets are called Rover, so P(Rover) = 0.36.

So the question is, will it trick the system? Let’s run the numbers.

For a dog:

P(dog | Rover) = \frac{P(Rover | dog) \cdot P(dog)}{P(Rover)} = \frac{0.4 \cdot 0.45}{0.36} = 0.5

For a cat:

P(cat | Rover) = \frac{P(Rover | cat) \cdot P(cat)}{P(Rover)} = \frac{0.2 \cdot 0.45}{0.36} = 0.25

For a fish:

P(fish | Rover) = \frac{P(Rover | fish) \cdot P(fish)}{P(Rover)} = \frac{1 \cdot 0.09}{0.36} = 0.25

So no! Even though a fish is most likely to be called Rover, there are many more dogs, so there are many more chances for the pet to be a dog than there are for it to be a fish. The model still correctly (according to intuition, at least) predicts that the pet is a dog.
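Re-running the counting with the fish added confirms the hand calculation (again a sketch with illustrative names):

```python
# Re-run the counting with one fish named Rover added to the table.
pets = [
    ("Dog", "Rover"), ("Dog", "Max"), ("Cat", "Fluffy"), ("Dog", "Rover"),
    ("Cat", "Leo"), ("Dog", "Bruiser"), ("Dog", "John"), ("Cat", "Simba"),
    ("Cat", "Tigger"), ("Cat", "Rover"), ("Fish", "Rover"),
]
total = len(pets)
p_rover = sum(name == "Rover" for _, name in pets) / total  # 4/11

def posterior(kind):
    """P(kind | Rover), by counting and then applying Bayes' rule."""
    names = [name for k, name in pets if k == kind]
    p_kind = len(names) / total
    p_rover_given_kind = names.count("Rover") / len(names)
    return p_rover_given_kind * p_kind / p_rover

print({kind: round(posterior(kind), 2) for kind in ("Dog", "Cat", "Fish")})
# -> {'Dog': 0.5, 'Cat': 0.25, 'Fish': 0.25}
```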

And this is what leads to the counterintuitive, often surprising results the theorem is famous for. You know the ones – results like: if a doctor tests for a rare disease (one affecting, say, 1 in 10,000 people), the test is 99% accurate, and you test positive, the chances are still high that you do not actually have the disease. That’s because the disease is so rare that the number of opportunities for a false positive far outweighs the number of opportunities for a true positive.
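For the curious, here is that calculation worked through, assuming that “99% accurate” means a 99% true-positive rate and a 1% false-positive rate (my reading of the example, not exact figures):

```python
# Bayes' rule for the rare-disease example. The accuracy figures are an
# interpretation of "99% accurate": 99% sensitivity, 1% false positives.
p_disease = 1 / 10_000
p_pos_given_disease = 0.99
p_pos_given_healthy = 0.01

# P(positive) over both the diseased and healthy populations.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # 0.01, i.e. roughly a 1% chance of disease
```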