10 Sampling and Simulation

10.1 Monte Carlo simulations

In 5-card poker, the probability of a flush is \(1/508 \approx 0.00197\). It is a good exercise to derive this number analytically using basic combinatorics. With a computer, we can approximate it by Monte Carlo simulation (justified by the Law of Large Numbers): we simply generate a poker hand over and over, each time checking whether it is a flush. Since the probability we are estimating is quite small, the number of Monte Carlo replicates has to be quite large for a decent approximation. We use code taken from a probability course at Duke University to generate a poker hand and determine the type of hand it is.

source("https://www2.stat.duke.edu/courses/Spring12/sta104.1/Homework/poker_simulation.R")
B = 1e5 # number of MC replicates
number_flush = 0
for (b in 1:B) {
  x = deal_hand() # generate a poker hand
  number_flush = number_flush + (what_hand(x) == "Flush")
}
cat(sprintf("fraction of flush hands (out of %i): %f", B, number_flush/B))
fraction of flush hands (out of 100000): 0.001700
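
As a sanity check, the analytic value and the standard error of the Monte Carlo estimate can be computed directly. (This small computation is ours; it assumes straight flushes are classified separately, consistent with the quoted \(1/508\).)

# choose one of the 4 suits, then 5 of its 13 cards, minus the 10 straight flushes per suit
p = 4 * (choose(13, 5) - 10) / choose(52, 5)
p # about 0.001965, roughly 1/508
sqrt(p * (1 - p) / B) # standard error of the MC estimate, about 1.4e-04

The estimate above is thus within about two standard errors of the true value.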

10.2 Monte Carlo integration

Simulations, again via the Law of Large Numbers, allow for the computation of integrals, since integrals can be interpreted as expectations: for instance, \(\int_0^1 f(x)\, dx = \mathbb{E}[f(U)]\), where \(U\) is uniform on \([0, 1]\). This avenue offers some advantages over numerical integration, such as ease of implementation (although code for numerical integration is readily available) and lower computational complexity in higher dimensions. We perform some numerical experiments in dimension one, where a Monte Carlo approach is typically less desirable, but where we can more easily quantify its accuracy relative to numerical integration.

f = function(x){sqrt(x)} # function to be integrated
exact_integral = 2/3 # integral over [0, 1]
num_integral = function() {
  integrate(f, 0, 1)$value
} # numerical integration
mc_integral = function() {
  x = runif(1e6)
  return(mean(f(x)))
} # MC integration with 1e6 replicates

First, we look at the accuracy.

num_error = num_integral() - exact_integral
num_error
[1] 7.483435e-08
S = 1e2 # number of repetitions of the MC experiment
mc_error = numeric(S)
for (s in 1:S) {
  mc_error[s] = mc_integral() - exact_integral
}
summary(mc_error)
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-6.963e-04 -1.425e-04  8.521e-06 -2.082e-05  1.245e-04  5.113e-04 
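
For calibration, these errors can be compared with the theoretical Monte Carlo standard error, \(\sqrt{\mathrm{Var}(f(U))/n}\) with \(n = 10^6\); here \(\mathrm{Var}(\sqrt{U}) = \mathbb{E}[U] - (2/3)^2 = 1/2 - 4/9 = 1/18\).

sqrt(1/18) / sqrt(1e6) # theoretical MC standard error, about 2.4e-04

This is consistent with the spread seen in the summary above.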

Next, we look at computational complexity.

require(benchr)
summary(benchmark(num_integral(), mc_integral(), times = 100, progress = FALSE), digits = 3, row.names = FALSE)
Time units : microseconds 
           expr n.eval    min  lw.qu median    mean   up.qu     max   total relative
 num_integral()    100   12.1   15.4   38.7    36.8    51.9    91.9    3680        1
  mc_integral()    100 8680.0 8910.0 9380.0 10000.0 10600.0 31900.0 1000000      243

In dimension one, at least for this simple function, numerical integration is much superior in both accuracy and speed, as expected. In higher dimensions, however, and in fact already in dimension 4 or 5, numerical integration may not even be a contender, as it is typically not computationally viable.
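
To give a flavor, here is a minimal sketch of Monte Carlo integration in dimension 5; the integrand and the number of replicates are arbitrary choices made for illustration.

d = 5 # dimension
g = function(x) {exp(-rowSums(x^2))} # example integrand
n = 1e6 # number of MC replicates
x = matrix(runif(n * d), n, d) # n uniform draws from the unit hypercube
mean(g(x)) # MC estimate of the integral of g over the unit hypercube

The exact value here is \((\int_0^1 e^{-t^2}\, dt)^5 \approx 0.2323\), which can be used to check the accuracy of the estimate.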

10.3 Rejection sampling

Here is a simple example of rejection sampling, where the goal is to sample uniformly at random from the unit disc. The approach is based on sampling the unit square (which contains the unit disc) uniformly at random, which is straightforward, and keeping only the draws that fall in the disc.

require(plotrix)
plot(0, 0, type = "n", xlim = c(-1,1), ylim = c(-1,1), xlab = "", ylab = "", asp = 1)
draw.circle(0, 0, 1, lty = 2)
rect(-1, -1, 1, 1, lty = 3)
legend("topleft", c("accepted", "rejected"), col = c("green", "red"), pch = 16, xpd = TRUE, inset = c(0, -0.15))

n = 100 # desired sample size
t = 0 # number of accepted samples so far
while (t < n) {
  y = runif(2, -1, 1)
  color = "red"
  if (y[1]^2 + y[2]^2 <= 1) {
    t = t+1
    color = "green"
  }
  points(y[1], y[2], pch = 16, col = color)
}
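
The acceptance probability is the ratio of the two areas, \(\pi/4 \approx 0.785\), so on average about \(4/\pi \approx 1.27\) proposals are needed per accepted point. For sampling (as opposed to plotting), a vectorized variant along the following lines may be preferred; this is a sketch of ours, not part of the original example.

rejection_sample_disc = function(n) {
  out = matrix(nrow = 0, ncol = 2)
  while (nrow(out) < n) {
    y = matrix(runif(2*n, -1, 1), ncol = 2) # a batch of n proposals from the square
    out = rbind(out, y[rowSums(y^2) <= 1, , drop = FALSE]) # keep those in the disc
  }
  out[1:n, ] # return exactly n accepted points
}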

10.4 Markov chain Monte Carlo

For illustration, we consider the setting of species co-occurrence, important in ecology, where a presence-absence matrix is available and we want to draw uniformly at random from the set of binary matrices of the same size and with the same row and column totals. This is one of the random models considered in that field, meant to represent a situation where the various species have populated the various geographical sites “at random”. The Markov chain proceeds by swapping \(2 \times 2\) submatrices whenever possible.

require(cooccur)
data(finches) # 13 species over 17 sites
rownames(finches) # species (corresponding to rows)
 [1] "Geospiza magnirostris"    "Geospiza fortis"         
 [3] "Geospiza fuliginosa"      "Geospiza difficilis"     
 [5] "Geospiza scandens"        "Geospiza conirostris"    
 [7] "Camarhynchus psittacula"  "Camarhynchus pauper"     
 [9] "Camarhynchus parvulus"    "Platyspiza crassirostris"
[11] "Cactospiza pallida"       "Cactospiza heliobates"   
[13] "Certhidea olivacea"      
colnames(finches) # sites (corresponding to columns)
 [1] "Seymour"       "Baltra"        "Isabella"      "Fernandina"   
 [5] "Santiago"      "Rabida"        "Pinzon"        "Santa.Cruz"   
 [9] "Santa.Fe"      "San.Cristobal" "Espanola"      "Floreana"     
[13] "Genovesa"      "Marchena"      "Pinta"         "Darwin"       
[17] "Wolf"         
write.table(finches, row.names = FALSE, col.names = FALSE)
0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0 0
1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0
0 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 1
1 1 1 0 1 1 1 1 1 1 0 1 0 1 1 0 0
0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0
0 0 1 1 1 1 1 1 1 0 0 1 0 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 0 1 1 1 1 1 1 1 1 0 1 0 0 1 0 0
0 0 1 1 1 1 1 1 1 1 0 1 0 1 1 0 0
0 0 1 1 1 0 1 1 0 1 0 0 0 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
B = 1e4 # number of steps in the Markov chain
I = diag(2) # 2x2 checkerboard pattern with 1s on the diagonal
J = matrix(c(0, 1, 1, 0), 2, 2) # the complementary checkerboard pattern
X = as.matrix(finches)
rownames(X) = NULL
colnames(X) = NULL
for (b in 1:B) {
  rows = sample(1:nrow(X), 2) # pick two distinct rows at random
  cols = sample(1:ncol(X), 2) # pick two distinct columns at random
  # swap the 2x2 submatrix if it matches either checkerboard pattern
  # (else if, so that a swap just made is not immediately undone)
  if (sum(X[rows, cols] == I) == 4) {
    X[rows, cols] = J
  } else if (sum(X[rows, cols] == J) == 4) {
    X[rows, cols] = I
  }
}
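# sanity check: the swaps preserve the row and column totals
stopifnot(all(rowSums(X) == rowSums(as.matrix(finches))))
stopifnot(all(colSums(X) == colSums(as.matrix(finches))))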
write.table(X, row.names = FALSE, col.names = FALSE)
1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1
1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 0 1
0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0
1 0 1 1 1 1 0 1 0 0 1 1 0 1 1 0 0
0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0 0
0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 1 1 1 1 1 1 0 1 0 0 1 0 0
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
0 0 1 1 1 1 1 1 1 0 0 1 1 1 0 0 0
0 0 1 1 1 1 1 1 1 1 0 1 0 1 1 0 0
0 0 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

This code produces only one approximately uniform random binary matrix (among those with the same row and column totals). Statistical inference typically requires the drawing of multiple realizations.
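
One way to obtain multiple realizations is to keep running the chain and record the state at regular intervals. Below is a sketch along those lines; the wrapper swap_chain and the gap of \(10^3\) steps between recorded states are our own illustrative choices, and consecutive recorded matrices are only approximately independent.

swap_chain = function(X, B) { # run B steps of the swap chain starting from X
  I = diag(2)
  J = matrix(c(0, 1, 1, 0), 2, 2)
  for (b in 1:B) {
    rows = sample(1:nrow(X), 2)
    cols = sample(1:ncol(X), 2)
    if (sum(X[rows, cols] == I) == 4) {
      X[rows, cols] = J
    } else if (sum(X[rows, cols] == J) == 4) {
      X[rows, cols] = I
    }
  }
  X
}
S = 100 # desired number of realizations
draws = vector("list", S)
for (s in 1:S) {
  X = swap_chain(X, 1e3) # continue the chain from the current state
  draws[[s]] = X # record the current state
}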