What Are the Odds?

An Intro to Probability with R

Abhijit Dasgupta

The probability of an event represents the likelihood of the event to occur. For example, most of us would agree that the probability of getting a heads after flipping a fair coin is 0.5 or that the probability of getting a one on rolling a fair die is 1/6. However, there are many more places where we encounter probabilities in our lives. During election season, we have pundits and polls speaking to the likelihood (probability) of winning for each candidate. Doctors will often state that a patient has low or high risk for heart attacks or breast cancer, reflecting either data or the doctor's belief of how likely a patient is to experience that outcome. Banks use customer data to figure out how likely a customer is to default on a loan - the credit risk. We see probability appear in many guises, being called likelihood, risk, odds, propensity, and other synonyms.

Broadly speaking, we can arrive at probabilities for events in two ways:

Subjective probability: We can decide and assign a probability to an event based on our experience and beliefs. We see this in the way bookmakers set odds (ratio of probabilities for each team to win) for professional sports, or in how political pundits predict winners, or how likely you think your local football team is to win the game of the week. We will only observe these events once, so we have to base the probabilities on our experience and understanding.
Objective probability: Probabilities are assigned based on some physical and repeatedly observable state of affairs. Most commonly, we use frequentist probabilities. Suppose we observe a random event (say, getting heads) as the result of an experiment (flipping a coin). The probability of getting heads is determined by the relative frequency of seeing heads when the experiment is repeated a number of times.

Frequentist probability

We can easily demonstrate frequentist probability using R, by simulating random
events. For example, let's consider the easiest example, flipping a coin. Instead of
heads and tails, we'll call them 1 and 0, since the computer likes them more.

set.seed(1357)
x <- sample(c("tail","head"), size=10000, replace=TRUE)
est_prob <- sum(x=="head")/length(x) 
est_prob

## [1] 0.4964

We see that, after 10000 experiments, the relative frequency of flipping heads is 0.5, as we would expect from a fair coin. But, how quickly could we have figured this out?

library(ggplot2)
cum_prob <- cumsum(x=='head')/(1:length(x))
qplot(1:length(x), cum_prob, geom='line', ylim=c(0,1), 
      xlab = 'Number of experiments',
      ylab = 'Relative frequency for a head')

Silvrback blog image

It turns out that after about 1000 experiments, we're consistently within 1% of the
true value.

Look at the graph below. It is similar to the graph above, but in this one, we have performed 500 coin-tossing experiments. Do you think the coin is fair? What would be your guess about the probability of flipping heads?

Silvrback blog image

Two things are evident here. First, we need to experience quite a number of
independent experiments to ascertain the probability of a particular event
(fortunately, for many, we have collective experience). Second, for a small number of experiments, our estimates can bounce around quite a bit. Probability allows us to figure out how much they can bounce around and with what likelihood. This leads us to the idea of probability distributions.

Sample spaces, events, and distributions

We'll continue the coin-tossing experiment as before. If we toss a fair coin 5 times, we expect to see 2 or 3 heads. But what if we see no heads? Or 5 heads? If we see either of these, we tend to think there's something wrong with the coin. Let's do an experiment to see how often a fair coin would result in 5 heads in 5 tosses. This is a random experiment, and so we can simulate this in R. We'll run 5000 experiments.

no_of_experiments <- 5000
no_of_events <- rep(0,no_of_experiments) # Initializing the output vector
for(i in 1:no_of_experiments){
  x <- sample(c('tail','head'), size=5, replace=TRUE)
  no_of_events[i] <- sum(x=='head')
}
prob_of_5 <- mean(no_of_events==5)
prob_of_5

## [1] 0.0328

Though small, there is still about a 3.3% chance of seeing 5 heads out of 5 tosses. Not 0!

Out of 5 tosses, we can potentially see 0, 1, 2, 3, 4, or 5 heads. How often would we see each possible outcome? Before we go forward, let's establish some definitions:

Sample space: The sample space of an experiment is the set of all possible outcomes for that experiment. This is often denoted by S or Ω.

Elementary event: An event which consists of a single outcome in the sample space.

Here we will define our experiment as number of heads in 5 tosses of a fair coin. The sample space for this experiment is then S = {0,1,2,3,4,5}, with each of {0}, {1}, {2}, {3}, {4} and {5} being elementary events. First, let's see the probability associated with each of the elementary events.

rel_freq <- table(no_of_events)/length(no_of_events)
barplot(rel_freq, ylab='Probability')

Silvrback blog image

round(rel_freq,3)

## no_of_events
##     0     1     2     3     4     5 
## 0.033 0.163 0.316 0.307 0.148 0.033

This gives us a fair idea about how often we'd see each possible outcome.

This is a popular sample experiment, and you can think about real experiments for which this would be a good model. The theoretical distribution that this experiment gives rise to is the binomial distribution. R, being a statistics and probability oriented language, can simulate various standard theoretical distributions like the binomial distribution directly, as well as elicit probabilities directly. Let's see what the binomial distribution says about the probabilities of each elementary event in our sample space.

S <-  c(0,1,2,3,4,5)
binomial_prob <- dbinom(S, size=5, prob=0.5) # dbinom gives the "probability mass function" for binomial distributions
out <- data.frame(outcome=S, probability=binomial_prob, 
                  experimental_probability=unclass(rel_freq),
                  proportional_error = (unclass(rel_freq)-binomial_prob)/binomial_prob)
out

##   outcome probability experimental_probability proportional_error
## 0       0     0.03125                   0.0326            0.04320
## 1       1     0.15625                   0.1632            0.04448
## 2       2     0.31250                   0.3164            0.01248
## 3       3     0.31250                   0.3068           -0.01824
## 4       4     0.15625                   0.1482           -0.05152
## 5       5     0.03125                   0.0328            0.04960

We see that our experimental probabilities are pretty close to the binomial probabilities. Why are they not exactly the same? The binomial distribution is conceptually based on an infinite number of experiments, and we have used only 5000 experiments. The discrepancy will go down as we increase the number of experiments. Also, notice that error is higher for the rarer outcomes (0 and 5) than for the more common outcomes (2 and 3). This makes sense since we'll see fewer occurences of the rarer outcomes, and so the corresponding estimates would be worse.

Challenge: Increase the number of experiments and see how many experiments you need to get closer to the binomial probabilities. For a particular error rate, see how many more experiments you need to achieve it for Prob(0) compared to Prob(3).

Both the experiment-based distribution and the binomial distribution are examples
of probability distributions. The formal definition is a bit involved, but conceptually, we might consider the following definition for sample spaces like the one we've been playing with -- sample spaces that have a finite number of elements.

A probability distribution for a finite sample space is the set of probabilities
associated with each elementary event in the sample space.

There are corresponding definitions for infinite sample spaces (like, for example, the heights of all people in a city, or the distance a crossbow can fire an arrow), but that is a topic for another day (or today, if you like).

There are many standard probability distributions that are used to model real-life events, including:

Normal or Gaussian distribution, or, if you prefer, the Bell Curve.
Geometric distribution, or, number of failures till your first success.
Pareto distribution, the basis of the "80-20 rule."

Probability of Events

In general, events need not be elementary events. We can consider different kinds of
events that are composed of elementary events. Since a lot of the history of Probablity Theory is entangled with gambling, or, "games of chance," let's consider rolling dice. In particular, let's roll two (fair) dice, like in the game of craps. The full sample space S for this experiment has 36 elements, which we'll denote by the pair of numbers that we see from the first and second die:

| (1,1) (1,2) (1,3) (1,4) (1,5) (1,6)
| (2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
| (3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
| (4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
| (5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
| (6,1) (6,2) (6,3) (6,4) (6,5) (6,6)

We can use R to create a dice-rolling function:

dice <- function(no_of_rolls=1){
  x <- sample(1:6, size=no_of_rolls, replace=TRUE)
  y <- sample(1:6, size=no_of_rolls, replace=TRUE)
  return(cbind(x,y))
}
dice()

##      x y
## [1,] 1 3

To compute the probability of different kinds of events, we have several building blocks that make our lives easier than actually counting out the probabilities of elementary events. Imagine if you had to find probabilites of different winning combinations in Powerball by enumeration!

But first, some definitions:

Disjoint events or mutually exclusive events are those that cannot occur at the same time.

The complementary event of an event (call it A) is the event that A does not occur. For example, the complementary event of number of heads is less than 2 is number of heads is at least 2.

The first building block for getting probabilities of composite events is that the probability of seeing at least one of a set of disjoint events is the sum of the individual probabilities. If we want to figure out the probability that we get a 7 on rolling two dice, we have to add up the probabilities (1/36) of the elementary events that give that outcome, so {(1,6),(2,5),...,(6,1)}.

More formally,

If A and B are disjoint events, then Prob(A or B) = Prob(A) + Prob(B)

The next building block is that the probability for the entire sample space is 1. This makes sense since, in an experiment, we necessarily have to see one of the outcomes in its sample space. Formally,

Prob(entire sample space) = 1

This leads to the probability of complementary events. If we look at an event {A} and its complement {not A}, we see that these two events are disjoint and the composite event {A or not A} is the entire sample space. So,

Prob(not A) = 1 - Prob(A)

Using these building blocks, we can compute the probabilities for any composite event once we know the probability distribution. For example, Prob(total is less than 5) = Prob(total is 2) + Prob(total is 3) + Prob(total is 4), and each of these can be computed from the probabilities of the corresponding elementary events. Similarly, Prob(total is not 7) = 1 - Prob(total is 7).

Fortunately, one thing the computer is very good at is enumeration and computation.
We can easily use simulation to generate the probability distribution for the totals.

rolls <- dice(100000) # 100000 experiments
totals <- rowSums(rolls) # Total of the rolls
rel_freq <- table(totals)/length(totals)
barplot(rel_freq, ylab='Relative frequency')

Silvrback blog image

rel_freq

## totals
##       2       3       4       5       6       7       8       9      10 
## 0.02757 0.05470 0.08159 0.11241 0.13864 0.16726 0.13923 0.11193 0.08349 
##      11      12 
## 0.05564 0.02754

We can also find the theoretical distribution rather easily using R.

sums <- outer(1:6,1:6,'+')
sums

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    2    3    4    5    6    7
## [2,]    3    4    5    6    7    8
## [3,]    4    5    6    7    8    9
## [4,]    5    6    7    8    9   10
## [5,]    6    7    8    9   10   11
## [6,]    7    8    9   10   11   12

This gives the totals for all 36 elementary events as a matrix. Since each elementary
event is equally likely with fair dice, each element in the matrix has probability 1/36 to occur. So to find the probability of the total being 7, we just find the number of 7's in the matrix and divide by 36.

prob_7 <- sum(sums==7)/36
prob_7

## [1] 0.1667

We can also find the probability of not getting a 7 using the formula Prob(not 7) = 1-Prob(7). Using the computer, we can also verify this from the matrix above.

prob_not7_1 <- 1-prob_7
prob_not7_2 <- sum(sums != 7)/36
paste('Probability of not getting a 7 is',prob_not7_1)

## [1] "Probability of not getting a 7 is 0.833333333333333"

prob_not7_1 == prob_not7_2

## [1] TRUE

Challenge: Find the probabilities of other events that would be of interest to a craps
player or players of other games with two dice.

Conclusions

Probability plays a central role in many events in our lives. Understanding what probability is, how it is ascertained, and how to figure out the probabilities of different events aids our ability to make sense of many phenomena in our lives, including some of the things that appear in newspaper and TV reports. How probability affects our lives is, in itself, fascinating, and there are several elegant and comprehensible books available for further reading. A few personal favorites are listed below:

District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!

What Are the Odds?

An Intro to Probability with R

Frequentist probability

Sample spaces, events, and distributions

Probability of Events

Conclusions

Share this article with friends