11 Data Collection

11.1 Survey sampling

The following package accompanies the textbook Sampling Algorithms by Yves Tillé.

require(sampling)

11.1.1 Sampling according to prescribed inclusion probabilities

We assume that the inclusion probabilities are all equal, unless otherwise specified. Note that these inclusion probabilities sum to the (expected) sample size.

N = 1e2 # population size
n = 20 # desired sample size
Pi = n*rep(1/N, N) # inclusion probabilities

Each sampling design that follows is in the form of a binary vector with 1 indicating that the individual is to be sampled.

Systematic sampling This sampling design consists in sampling at a regular interval until the desired sample size is achieved.

UPsystematic(Pi)
  [1] 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0
 [38] 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0
 [75] 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1

Poisson sampling This sampling scheme amounts to sampling each individual according to its inclusion probability, independently of everything else. This design has the property that the desired sample size is not always achieved. (Rather, the sample size is random, having the Poisson distribution with mean the target sample size.)

UPpoisson(Pi)
  [1] 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
 [38] 0 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0
 [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0

Conditional Poisson sampling (CPS) This sampling scheme is Poisson sampling conditioned on the sample being of the desired size. This can be implemented by rejection sampling (repeatedly rejecting a Poisson sample until it has the desired size), but there are faster methods. (CPS is known to maximize the entropy given the inclusion probabilities, therefore the name of the function below.)

UPmaxentropy(Pi)
  [1] 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0
 [38] 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 1 1 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0
 [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0

11.1.2 Cluster sampling

We use the following dataset regarding Swiss municipalities in the year 2003, where clusters can be naturally defined. In particular, we use the canton (CT) as clustering variable. (There are 26 cantons in total so that CT takes that many different values.)

data(swissmunicipalities)
cl = cluster(swissmunicipalities, "CT", size = 5, description = TRUE) # we select 5 cantons
Number of selected clusters: 5 
Number of units in the population and number of selected units: 2896 460 
plan = getdata(swissmunicipalities, cl) # municipalities to be surveyed

11.1.3 Stratified sampling

We use the same dataset and this time use the region (REG) as the stratification variable. (The function, as implemented, requires the ordering of the dataset according to the values of the stratification variable. This is an implementation flaw.)

Data = swissmunicipalities[order(swissmunicipalities$REG), ]
st = strata(Data, "REG", size = rep(20, 7), description = TRUE) # we select 20 municipalities in each of the 7 regions
Stratum 1 

Population total and number of selected units: 589 20 
Stratum 2 

Population total and number of selected units: 913 20 
Stratum 3 

Population total and number of selected units: 321 20 
Stratum 4 

Population total and number of selected units: 171 20 
Stratum 5 

Population total and number of selected units: 471 20 
Stratum 6 

Population total and number of selected units: 186 20 
Stratum 7 

Population total and number of selected units: 245 20 
Number of strata  7 
Total number of selected units 140 
plan = getdata(swissmunicipalities, st) # municipalities to be surveyed

11.2 Experimental design

Experimental designs were perhaps first developed in the context of agricultural experiments. So it is fitting that use the following package, which gathers a very large number of statistical tools for agricultural research

require(agricolae)
# require(knitr)

The “plots” below stand for the experimental units. (These would be individuals in clinical trials.) The “book” is short for “field book”, and specifies the treatment assignments.

Completely randomized design

treat = c("A", "B", "C") # treatments
r = c(13, 13, 14) # number of replicates per treatment
outdesign = design.crd(treat, r)
book = outdesign$book # field book (treatment assignments)
subset(book, select = c("plots", "treat"))
   plots treat
1    101     A
2    102     A
3    103     A
4    104     B
5    105     B
6    106     C
7    107     A
8    108     A
9    109     B
10   110     A
11   111     B
12   112     A
13   113     A
14   114     A
15   115     C
16   116     B
17   117     B
18   118     A
19   119     C
20   120     C
21   121     C
22   122     C
23   123     B
24   124     A
25   125     C
26   126     B
27   127     B
28   128     B
29   129     C
30   130     C
31   131     A
32   132     A
33   133     C
34   134     C
35   135     B
36   136     C
37   137     B
38   138     B
39   139     C
40   140     C

Randomized complete block design

treat = c("A", "B", "C", "D", "E")
r = 4 # number of blocks
outdesign = design.rcbd(treat, r)
book = outdesign$book
book
   plots block treat
1    101     1     B
2    102     1     D
3    103     1     E
4    104     1     C
5    105     1     A
6    201     2     A
7    202     2     C
8    203     2     E
9    204     2     B
10   205     2     D
11   301     3     B
12   302     3     A
13   303     3     E
14   304     3     D
15   305     3     C
16   401     4     B
17   402     4     D
18   403     4     E
19   404     4     C
20   405     4     A

Matched-pairs design This is a special case, with blocks of size 2 representing the pairs.

treat = c("A", "B") # two treatments
r = 10 # number of pairs
outdesign = design.rcbd(treat, r)
book = outdesign$book
names(book) = c("subject", "pair", "treat")
book
   subject pair treat
1      101    1     A
2      102    1     B
3      201    2     B
4      202    2     A
5      301    3     B
6      302    3     A
7      401    4     A
8      402    4     B
9      501    5     B
10     502    5     A
11     601    6     B
12     602    6     A
13     701    7     A
14     702    7     B
15     801    8     B
16     802    8     A
17     901    9     A
18     902    9     B
19    1001   10     A
20    1002   10     B

Balanced incomplete block design The number of blocks is determined automatically so that the design is balanced.

treat = c("A", "B", "C", "D", "E" )
k = 4 # number of units per block
outdesign = design.bib(treat, k)

Parameters BIB
==============
Lambda     : 3
treatmeans : 5
Block size : 4
Blocks     : 5
Replication: 4 

Efficiency factor 0.9375 

<<< Book >>>
book = outdesign$book
book
   plots block treat
1    101     1     E
2    102     1     B
3    103     1     C
4    104     1     A
5    201     2     C
6    202     2     D
7    203     2     B
8    204     2     A
9    301     3     A
10   302     3     D
11   303     3     E
12   304     3     C
13   401     4     E
14   402     4     A
15   403     4     D
16   404     4     B
17   501     5     C
18   502     5     E
19   503     5     D
20   504     5     B

Split plot design

treat1 = c("A", "B", "C", "D") # treatment at the whole plot level 
treat2 = c("a", "b", "c") # treatment at the subplot level 
outdesign = design.split(treat1, treat2, r = 1)
book = outdesign$book
book
   plots splots block treat1 treat2
1    101      1     1      C      b
2    101      2     1      C      c
3    101      3     1      C      a
4    102      1     1      A      b
5    102      2     1      A      a
6    102      3     1      A      c
7    103      1     1      D      b
8    103      2     1      D      a
9    103      3     1      D      c
10   104      1     1      B      c
11   104      2     1      B      b
12   104      3     1      B      a
13   105      1     2      B      a
14   105      2     2      B      b
15   105      3     2      B      c
16   106      1     2      A      b
17   106      2     2      A      a
18   106      3     2      A      c
19   107      1     2      D      b
20   107      2     2      D      c
21   107      3     2      D      a
22   108      1     2      C      c
23   108      2     2      C      a
24   108      3     2      C      b
25   109      1     1      A      a
26   109      2     1      A      c
27   109      3     1      A      b
28   110      1     1      C      a
29   110      2     1      C      c
30   110      3     1      C      b
31   111      1     1      D      c
32   111      2     1      D      b
33   111      3     1      D      a
34   112      1     1      B      b
35   112      2     1      B      c
36   112      3     1      B      a

There seems to be a bug in that even when \(r = 1\) there appears to be 3 replicates. To remedy the situation, we only keep 1 out of 3 replications.

number_splots = length(treat1) * length(treat2)
book[1:number_splots, -3]
   plots splots treat1 treat2
1    101      1      C      b
2    101      2      C      c
3    101      3      C      a
4    102      1      A      b
5    102      2      A      a
6    102      3      A      c
7    103      1      D      b
8    103      2      D      a
9    103      3      D      c
10   104      1      B      c
11   104      2      B      b
12   104      3      B      a

Latin square design This design can be used for simple repeated measures design or a crossover design. (There are packages dedicated to crossover designs.)

treat = c("A", "B", "C", "D")
outdesign = design.lsd(treat)
book = outdesign$book[, -1]
names(book) = c("subject", "period", "treat") # for a crossover design
book
   subject period treat
1        1      1     D
2        1      2     B
3        1      3     C
4        1      4     A
5        2      1     A
6        2      2     C
7        2      3     D
8        2      4     B
9        3      1     B
10       3      2     D
11       3      3     A
12       3      4     C
13       4      1     C
14       4      2     A
15       4      3     B
16       4      4     D
# or in matrix form
book_matrix = matrix(book$treat, byrow = TRUE, 4, 4)
rownames(book_matrix) = paste("subject", 1:4)
colnames(book_matrix) = paste("period", 1:4)
prmatrix(book_matrix, quote = FALSE)
          period 1 period 2 period 3 period 4
subject 1 D        B        C        A       
subject 2 A        C        D        B       
subject 3 B        D        A        C       
subject 4 C        A        B        D       

11.3 Observational studies

We look at how to perform a matching in R. Such a matching can be used in a matched-pairs experimental design or in the analysis of an observational study.

require(optmatch)

We look at the following dataset on women of Pima Indian heritage. (The Pima Indians in the US state of Arizona are known to have among the highest rates of type 2 diabetes in the world.) We merge the training and test sets into a single dataset with a total of 532 observations.

require(MASS)
Pima = rbind(Pima.tr, Pima.te)

In a case-control framework, we might want to understand what distinguishes women with diabetes (cases) and women without (controls).

cases = which(Pima$type == "Yes") # 177 cases
controls = which(Pima$type == "No") # 355 controls

We match one-to-one according to age. (This is a particularly simple problem as we are matching on a single, one-dimensional variable.)

We first build the matrix of differences in age between cases and controls

D = abs(outer(Pima$age[cases], Pima$age[controls], "-"))
rownames(D) = cases
colnames(D) = controls

We then performing the matching

M = pairmatch(D)
P = print(M, grouped = TRUE) # each row corresponds to a pair (case, control) given in the 2nd column
 Group  Members
   1.1   10, 57
  1.10 118, 413
 1.100 344, 531
 1.101 345, 532
 1.102  35, 479
 1.103 355, 499
 1.104 356, 441
 1.105 357, 474
 1.106 362, 382
 1.107 371, 296
 1.108 372, 449
 1.109 377, 408
  1.11 120, 176
 1.110 378, 192
 1.111 380, 165
 1.112 384, 410
 1.113 385, 395
 1.114 386, 448
 1.115 390, 461
 1.116 391, 456
 1.117 392, 159
 1.118 396, 393
 1.119 398, 369
  1.12 123, 308
 1.120 399, 513
 1.121  41, 212
 1.122 412, 135
 1.123 415, 476
 1.124 421, 419
 1.125  428, 22
 1.126 430, 340
 1.127 431, 324
 1.128 438, 467
 1.129 439, 132
  1.13 125, 432
 1.130 442, 277
 1.131 443, 106
 1.132 457, 163
 1.133 458, 133
 1.134 465, 492
 1.135  468, 29
 1.136 470, 347
 1.137 472, 101
 1.138 478, 333
 1.139 481, 485
  1.14  13, 411
 1.140 482, 379
 1.141 484, 440
 1.142 487, 403
 1.143  488, 82
 1.144  49, 316
 1.145 493, 354
 1.146 497, 376
 1.147 498, 202
 1.148  50, 528
 1.149 501, 447
  1.15 130, 346
 1.150 503, 112
 1.151  505, 23
 1.152 506, 312
 1.153 512, 310
 1.154 517, 274
 1.155 522, 285
 1.156   523, 3
 1.157 526, 469
 1.158 527, 420
 1.159 529, 520
  1.16 131, 226
 1.160  53, 179
 1.161   6, 220
 1.162  60, 405
 1.163  61, 264
 1.164  66, 495
 1.165   67, 15
 1.166  69, 435
 1.167  71, 169
 1.168  72, 287
 1.169    73, 9
  1.17  14, 168
 1.170  75, 404
 1.171  76, 515
 1.172  79, 471
 1.173  83, 241
 1.174  84, 182
 1.175  87, 180
 1.176  93, 417
 1.177   96, 80
  1.18 141, 215
  1.19 142, 450
   1.2 100, 315
  1.20 148, 504
  1.21  152, 46
  1.22 153, 425
  1.23 154, 294
  1.24 156, 273
  1.25 157, 358
  1.26 160, 166
  1.27 161, 359
  1.28 167, 363
  1.29 171, 406
   1.3 102, 305
  1.30  173, 63
  1.31 174, 124
  1.32 175, 288
  1.33  18, 352
  1.34 184, 261
  1.35 186, 302
  1.36 187, 227
  1.37 188, 244
  1.38  19, 185
  1.39 190, 129
   1.4 104, 209
  1.40 193, 486
  1.41 197, 426
  1.42   2, 530
  1.43 200, 509
  1.44 201, 521
  1.45 204, 518
  1.46 205, 507
  1.47 206, 491
  1.48 207, 140
  1.49 210, 373
   1.5 108, 351
  1.50 216, 150
  1.51 217, 275
  1.52 218, 116
  1.53  221, 36
  1.54 222, 466
  1.55 231, 111
  1.56 234, 172
  1.57 243, 423
  1.58 248, 145
  1.59 253, 178
   1.6  11, 290
  1.60 255, 510
  1.61 257, 242
  1.62 258, 323
  1.63  259, 88
  1.64  26, 103
  1.65 260, 381
  1.66  268, 92
  1.67 270, 284
  1.68 272, 214
  1.69 276, 251
   1.7 113, 496
  1.70 278, 266
  1.71 279, 293
  1.72  28, 508
  1.73 280, 208
  1.74 281, 524
  1.75 282, 400
  1.76 286, 292
  1.77 289, 460
  1.78 291, 110
  1.79 295, 463
   1.8 114, 490
  1.80 298, 245
  1.81 300, 151
  1.82 301, 299
  1.83 304, 230
  1.84  306, 64
  1.85 311, 342
  1.86  313, 12
  1.87 317, 366
  1.88 320, 409
  1.89 325, 329
   1.9 117, 239
  1.90 328, 433
  1.91   33, 43
  1.92 330, 444
  1.93 332, 368
  1.94 334, 489
  1.95 335, 402
  1.96 336, 360
  1.97 337, 483
  1.98 339, 427
  1.99 341, 502