Title: | Sequential Poisson Sampling |
---|---|
Description: | Sequential Poisson sampling is a variation of Poisson sampling for drawing probability-proportional-to-size samples with a given number of units, and is commonly used for price-index surveys. This package gives functions to draw stratified sequential Poisson samples according to the method by Ohlsson (1998, ISSN:0282-423X), as well as other order sample designs by Rosén (1997, <doi:10.1016/S0378-3758(96)00186-3>), and generate appropriate bootstrap replicate weights according to the generalized bootstrap method by Beaumont and Patak (2012, <doi:10.1111/j.1751-5823.2011.00166.x>). |
Authors: | Steve Martin [aut, cre, cph] , Justin Francis [ctb] |
Maintainer: | Steve Martin <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.5.4.9001 |
Built: | 2024-11-21 04:24:16 UTC |
Source: | https://github.com/marberts/sps |
Find the expected number of strata covered by ordinary Poisson sampling without stratification. As sequential and ordinary Poisson sampling have the same sample size on average, this gives an approximation for the coverage under sequential Poisson sampling.
This function can also be used to calculate, e.g., the expected number of enterprises covered within a stratum when sampling business establishments.
expected_coverage(x, n, strata, alpha = 0.001, cutoff = Inf)
expected_coverage(x, n, strata, alpha = 0.001, cutoff = Inf)
x |
A positive and finite numeric vector of sizes for units in the population (e.g., revenue for drawing a sample of businesses). |
n |
A positive integer giving the sample size. |
strata |
A factor, or something that can be coerced into one, giving the strata associated with units in the population. The default is to place all units into a single stratum. |
alpha |
A numeric vector with values between 0 and 1 for each stratum,
ordered according to the levels of |
cutoff |
A positive numeric vector of cutoffs for each stratum, ordered
according to the levels of |
The expected number of strata covered by the sample design.
prop_allocation()
for generating proportional-to-size allocations.
# Make a population with units of different size x <- c(rep(1:9, each = 3), 100, 100, 100) # ... and 10 strata s <- rep(letters[1:10], each = 3) # Should get about 7 to 8 strata in a sample on average expected_coverage(x, 15, s)
# Make a population with units of different size x <- c(rep(1:9, each = 3), 100, 100, 100) # ... and 10 strata s <- rep(letters[1:10], each = 3) # Should get about 7 to 8 strata in a sample on average expected_coverage(x, 15, s)
Calculate stratified (first-order) inclusion probabilities.
inclusion_prob(x, n, strata = gl(1, length(x)), alpha = 0.001, cutoff = Inf) becomes_ta(x, alpha = 0.001, cutoff = Inf)
inclusion_prob(x, n, strata = gl(1, length(x)), alpha = 0.001, cutoff = Inf) becomes_ta(x, alpha = 0.001, cutoff = Inf)
x |
A positive and finite numeric vector of sizes for units in the population (e.g., revenue for drawing a sample of businesses). |
n |
A positive integer vector giving the sample size for each stratum,
ordered according to the levels of |
strata |
A factor, or something that can be coerced into one, giving the strata associated with units in the population. The default is to place all units into a single stratum. |
alpha |
A numeric vector with values between 0 and 1 for each stratum,
ordered according to the levels of |
cutoff |
A positive numeric vector of cutoffs for each stratum, ordered
according to the levels of |
Within a stratum, the inclusion probability for a unit is given by
. These values can be greater
than 1 in practice, and so they are constructed iteratively by taking units
with
(from largest to smallest)
and assigning these units an inclusion probability of 1, with the remaining
inclusion probabilities recalculated at each step. If
, then
any ties among units with the same size are broken by their position.
The becomes_ta()
function reverses this operations and finds the critical
sample size at which a unit enters the take-all stratum. This value is
undefined for units that are always included in the sample (because their
size exceeds cutoff
) or never included.
inclusion_prob()
returns a numeric vector of inclusion probabilities for
each unit in the population.
becomes_ta()
returns an integer vector giving the sample size at which a
unit enters the take-all stratum.
sps()
for drawing a sequential Poisson sample.
# Make inclusion probabilities for a population with units # of different size x <- c(1:10, 100) (pi <- inclusion_prob(x, 5)) # The last unit is sufficiently large to be included in all # samples with two or more units becomes_ta(x) # Use the inclusion probabilities to calculate the variance of the # sample size for Poisson sampling sum(pi * (1 - pi))
# Make inclusion probabilities for a population with units # of different size x <- c(1:10, 100) (pi <- inclusion_prob(x, 5)) # The last unit is sufficiently large to be included in all # samples with two or more units becomes_ta(x) # Use the inclusion probabilities to calculate the variance of the # sample size for Poisson sampling sum(pi * (1 - pi))
Generate a proportional-to-size allocation for stratified sampling.
prop_allocation( x, n, strata, initial = 0L, divisor = function(a) a + 1, ties = c("largest", "first") )
prop_allocation( x, n, strata, initial = 0L, divisor = function(a) a + 1, ties = c("largest", "first") )
x |
A positive and finite numeric vector of sizes for units in the population (e.g., revenue for drawing a sample of businesses). |
n |
A positive integer giving the sample size. |
strata |
A factor, or something that can be coerced into one, giving the strata associated with units in the population. The default is to place all units into a single stratum. |
initial |
A positive integer vector giving the initial (or minimal)
allocation for each stratum, ordered according to the levels of
|
divisor |
A divisor function for the divisor (highest-averages) apportionment method. The default uses the Jefferson (D'Hondt) method. See details for other possible functions. |
ties |
Either 'largest' to break ties in favor of the stratum with the
largest size, or 'first' to break ties in favor of the ordering of
|
The prop_allocation()
function gives a sample size for each level in
strata
that is proportional to the sum of x
across strata and
adds up to n
. This is done using the divisor (highest-averages)
apportionment method (Balinksi and Young, 1982, Appendix A), for which there
are a number of different divisor functions:
\(a) a + 1
\(a) a + 0.5
\(a) a + 2
\(a) sqrt(a * (a + 1))
\(a) a + 1 / 3
\(a) a
\(a) a * (a + 1) / (a + 0.5)
Note that a divisor function with (i.e., Huntington-Hill,
Adams, Dean) should have an initial allocation of at least 1 for all strata.
In all cases, ties are broken according to the sum of
x
if
ties = 'largest'
; otherwise, if ties = 'first'
, then ties are broken
according to the levels of strata
.
In cases where the number of units with non-zero size in a stratum is
smaller than its allocation, the allocation for that stratum is set to the
number of available units, with the remaining sample size reallocated to
other strata proportional to x
. This is similar to PROC
SURVEYSELECT
in SAS with ALLOC = PROPORTIONAL
.
Passing a single integer for the initial allocation first checks that recycling this value for each stratum does not result in an allocation larger than the sample size. If it does, then the value is reduced so that recycling does not exceed the sample size. This recycled vector can be further reduced in cases where it exceeds the number of units in a stratum, the result of which is the initial allocation. This special recycling ensures that the initial allocation is feasible.
A named integer vector of sample sizes for each stratum in strata
.
Balinksi, M. L. and Young, H. P. (1982). Fair Representation: Meeting the Ideal of One Man, One Vote. Yale University Press.
sps()
for stratified sequential Poisson sampling.
expected_coverage()
to calculate the expected number of strata in a sample
without stratification.
strAlloc()
in the PracTools package for other allocation methods.
# Make a population with units of different size x <- c(rep(1:9, each = 3), 100, 100, 100) # ... and 10 strata s <- rep(letters[1:10], each = 3) # Generate an allocation prop_allocation(x, 15, s, initial = 1)
# Make a population with units of different size x <- c(rep(1:9, each = 3), 100, 100, 100) # ... and 10 strata s <- rep(letters[1:10], each = 3) # Generate an allocation prop_allocation(x, 15, s, initial = 1)
Draw a stratified probability-proportional-to-size sample using the sequential and ordinary Poisson methods, and generate other order sampling schemes.
sps(x, n, strata = gl(1, length(x)), prn = NULL, alpha = 0.001, cutoff = Inf) ps(x, n, strata = gl(1, length(x)), prn = NULL, alpha = 0.001, cutoff = Inf) order_sampling(dist)
sps(x, n, strata = gl(1, length(x)), prn = NULL, alpha = 0.001, cutoff = Inf) ps(x, n, strata = gl(1, length(x)), prn = NULL, alpha = 0.001, cutoff = Inf) order_sampling(dist)
x |
A positive and finite numeric vector of sizes for units in the population (e.g., revenue for drawing a sample of businesses). |
n |
A positive integer vector giving the sample size for each stratum,
ordered according to the levels of |
strata |
A factor, or something that can be coerced into one, giving the strata associated with units in the population. The default is to place all units into a single stratum. |
prn |
A numeric vector of permanent random numbers for units in the population, distributed uniform between 0 and 1. The default does not use permanent random numbers, instead generating a random vector when the function is called. |
alpha |
A numeric vector with values between 0 and 1 for each stratum,
ordered according to the levels of |
cutoff |
A positive numeric vector of cutoffs for each stratum, ordered
according to the levels of |
dist |
A function giving the fixed order distribution shape for an order sampling scheme. See details. |
The sps()
function draws a sample according to the sequential Poisson
procedure, the details of which are given by Ohlsson (1998). It is also
called uniform order sampling, as it is a type of order sampling; see Rosén
(1997, 2000) for a more general presentation of the method. This is the same
method used by PROC SURVEYSELECT
in SAS with METHOD =
SEQ_POISSON
.
For each stratum, the sequential Poisson procedure starts by stratifying
units in the population based on their (target) inclusion probabilities
. Units with
are placed into a take-none stratum,
units with
are placed into a take-some stratum, and units
with
are placed into a take-all stratum. As noted by
Ohlsson (1998), it can be useful to set
to a small positive
value when calculating inclusion probabilities, and this is the default
behavior.
After units are appropriately stratified, a sample of take-some units is
drawn by assigning each unit a value , where
is a
random deviate from the uniform distribution between 0 and 1. The units with
the smallest values for
are included in the sample, along with the
take-all units. (Ties in
are technically a measure-zero event—in
practice these are broken by position.) This results in a fixed sample size
at the expense of the sampling procedure being only approximately
probability-proportional-to-size (i.e., the inclusion probabilities from the
sample design are close but not exactly equal to
; see Matei and
Tillé, 2007, for details on the exact computation).
Ordinary Poisson sampling follows the same procedure as above, except that
all units with are included in the sample; consequently, while
it does not contain a fixed number of units, the procedure is strictly
probability-proportional-to-size. Despite this difference, the standard
Horvitz-Thompson estimator for the total (of the take-some stratum) is
asymptotically unbiased, normally distributed, and equally efficient under
both procedures. The
ps()
function draws a sample using the ordinary
Poisson method.
A useful feature of sequential and ordinary Poisson sampling is the ability
to coordinate samples by using permanent random numbers for . Keeping
fixed when updating a sample retains a larger number of overlapping
units, whereas switching
for
or
, for some
between 0 and
1, when drawing different samples from the same frame reduces the number of
overlapping units.
Despite the focus on sequential Poisson sampling, all order sampling
procedures follow the same approach as sequential Poisson sampling. The
order_sampling()
function can be used to generate other order
sampling functions by passing an appropriate function to make the ranking
variable :
\(x) x
\(x) log(1 - x)
\(x) x / (1 - x)
sps()
and ps()
return an object of class sps_sample
.
This is an integer vector of indices for the units in the population that
form the sample, along with a weights
attribute that gives the design
(inverse probability) weights for each unit in the sample (keeping in mind
that sequential Poisson sampling is only approximately
probability-proportional-to-size). weights()
can be used to access
the design weights attribute of an sps_sample
object, and levels()
can
be used to determine which units are in the take-all or take-some
strata. Mathematical and binary/unary operators strip
attributes, as does replacement.
order_sampling
returns a function the with the same interface as
sps()
and ps()
.
Matei, A., and Tillé, Y. (2007). Computational aspects of order
ps sampling schemes. Computational Statistics & Data Analysis,
51: 3703-3717.
Ohlsson, E. (1998). Sequential Poisson Sampling. Journal of Official Statistics, 14(2): 149-162.
Rosén, B. (1997). On sampling with probability proportional to size. Journal of Statistical Planning and Inference, 62(2): 159-191.
Rosén, B. (2000). On inclusion probabilities for order ps sampling.
Journal of Statistical Planning and Inference, 90(1): 117-143.
prop_allocation()
for generating proportional-to-size allocations.
inclusion_prob()
for calculating the inclusion probabilities.
sps_repweights()
for generating bootstrap replicate weights.
The UPpoisson()
and UPopips()
functions in the sampling
package for ordinary and sequential Poisson sampling, respectively. Note
that the algorithm for order sampling in the UPopips()
function is
currently incorrect, giving a worse approximation for the inclusion
probabilities than it should.
The UP*
functions in the sampling package, the S.*
functions in the TeachingSampling package, and the pps package
for other probability-proportional-to-size sampling methods.
The pps()
function in the prnsamplr package for Pareto order
sampling with permanent random numbers.
# Make a population with units of different size x <- c(1:10, 100) #---- Sequential Poisson sampling ---- # Draw a sequential Poisson sample (samp <- sps(x, 5)) # Get the design (inverse probability) weights weights(samp) # All units except 11 are in the take-some (TS) stratum levels(samp) # Ensure that the top 10% of units are in the sample sps(x, 5, cutoff = quantile(x, 0.9)) #---- Ordinary Poisson sampling ---- # Ordinary Poisson sampling gives a random sample size for the # take-some stratum ps(x, 5) #---- Stratified Sequential Poisson sampling ---- # Draw a stratified sample with a proportional allocation strata <- rep(letters[1:4], each = 5) (allocation <- prop_allocation(1:20, 12, strata)) (samp <- sps(1:20, allocation, strata)) # Use the Horvitz-Thompson estimator to estimate the total y <- runif(20) * 1:20 sum(weights(samp) * y[samp]) #---- Useful properties of Sequential Poisson sampling ---- # It can be useful to set 'prn' in order to extend the sample # to get a fixed net sample u <- runif(11) (samp <- sps(x, 6, prn = u)) # Removing unit 5 gives the same net sample sps(x[-samp[5]], 6, prn = u[-samp[5]]) # Also useful for topping up a sample all(samp %in% sps(x, 7, prn = u)) #---- Other order-sampling methods ---- # Generate new order-sampling functions from the parameters of # the inverse generalized Pareto distribution igpd <- function(shape, scale = 1, location = 0) { if (shape == 0) { function(x) -scale * log(1 - x) + location } else { function(x) scale * (1 - (1 - x)^shape) / shape + location } } order_sampling2 <- function(x) order_sampling(igpd(x)) order_sampling2(1)(x, 6, prn = u) # sequential Poisson order_sampling2(0)(x, 6, prn = u) # successive order_sampling2(-1)(x, 6, prn = u) # Pareto
# Make a population with units of different size x <- c(1:10, 100) #---- Sequential Poisson sampling ---- # Draw a sequential Poisson sample (samp <- sps(x, 5)) # Get the design (inverse probability) weights weights(samp) # All units except 11 are in the take-some (TS) stratum levels(samp) # Ensure that the top 10% of units are in the sample sps(x, 5, cutoff = quantile(x, 0.9)) #---- Ordinary Poisson sampling ---- # Ordinary Poisson sampling gives a random sample size for the # take-some stratum ps(x, 5) #---- Stratified Sequential Poisson sampling ---- # Draw a stratified sample with a proportional allocation strata <- rep(letters[1:4], each = 5) (allocation <- prop_allocation(1:20, 12, strata)) (samp <- sps(1:20, allocation, strata)) # Use the Horvitz-Thompson estimator to estimate the total y <- runif(20) * 1:20 sum(weights(samp) * y[samp]) #---- Useful properties of Sequential Poisson sampling ---- # It can be useful to set 'prn' in order to extend the sample # to get a fixed net sample u <- runif(11) (samp <- sps(x, 6, prn = u)) # Removing unit 5 gives the same net sample sps(x[-samp[5]], 6, prn = u[-samp[5]]) # Also useful for topping up a sample all(samp %in% sps(x, 7, prn = u)) #---- Other order-sampling methods ---- # Generate new order-sampling functions from the parameters of # the inverse generalized Pareto distribution igpd <- function(shape, scale = 1, location = 0) { if (shape == 0) { function(x) -scale * log(1 - x) + location } else { function(x) scale * (1 - (1 - x)^shape) / shape + location } } order_sampling2 <- function(x) order_sampling(igpd(x)) order_sampling2(1)(x, 6, prn = u) # sequential Poisson order_sampling2(0)(x, 6, prn = u) # successive order_sampling2(-1)(x, 6, prn = u) # Pareto
Produce bootstrap replicate weights that are appropriate for Poisson sampling, and therefore approximately correct for sequential Poisson sampling.
sps_repweights(w, replicates = 1000L, tau = 1, dist = NULL)
sps_repweights(w, replicates = 1000L, tau = 1, dist = NULL)
w |
A numeric vector of design (inverse probability) weights for a (sequential) Poisson sample. |
replicates |
A positive integer that gives the number of bootstrap replicates (1,000 by default). Non-integers are truncated towards 0. |
tau |
A number greater than or equal to 1 that gives the rescale factor for the bootstrap weights. Setting to 1 (the default) does not rescale the weights. |
dist |
A function that produces random deviates with mean 0 and
standard deviation 1, such as |
Replicate weights are constructed using the generalized bootstrap method by
Beaumont and Patak (2012). Their method takes a vector of design weights
, finds a vector of adjustments
for each bootstrap replicate,
and calculates the replicate weights as
.
There are two ways to calculate the adjustments . The default
pseudo-population method randomly rounds
for each replicate to
produce a collection of integer weights
that are used to generate a
random vector
from the binomial distribution. The vector of
adjustments is then
. Specifying a
deviates-generating function for
dist
uses this function to produce a
random vector that is then used to make an adjustment
.
The adjustments can be rescaled by a value to
prevent negative replicate weights. With this rescaling, the adjustment
becomes
. If
then the resulting
bootstrap variance estimator should be multiplied by
.
A matrix of bootstrap replicate weights with replicates
columns (one for
each replicate) and length(w)
rows (one for each unit in the sample), with
the value of tau
as an attribute.
As an alternative to the bootstrap, Ohlsson (1998, equations 2.13)
proposes an analytic estimator for the variance of the total (for the take-some units) under sequential Poisson
sampling:
See Rosén (1997, equation 3.11) for a
more general version of this estimator that can be applied to other order
sampling schemes. Replacing the left-most correction by ,
where
is the number of units in the sample, gives a similar
estimator for the total under ordinary Poisson sampling,
.
Beaumont, J.-F. and Patak, Z. (2012). On the Generalized Bootstrap for Sample Surveys with Special Attention to Poisson Sampling. International Statistical Review, 80(1): 127-148.
Ohlsson, E. (1998). Sequential Poisson Sampling. Journal of Official Statistics, 14(2): 149-162.
Rosén, B. (1997). On sampling with probability proportional to size. Journal of Statistical Planning and Inference, 62(2): 159-191.
sps()
for drawing a sequential Poisson sample.
bootstrapFP()
(with method = "wGeneralised"
) in the bootstrapFP
package for calculating the variance of Horvitz-Thompson estimators using
the generalized bootstrap.
# Make a population with units of different size x <- c(1:10, 100) # Draw a sequential Poisson sample (samp <- sps(x, 5)) # Make some bootstrap replicates dist <- list( pseudo_population = NULL, standard_normal = rnorm, exponential = \(x) rexp(x) - 1, uniform = \(x) runif(x, -sqrt(3), sqrt(3)) ) lapply(dist, sps_repweights, w = weights(samp), replicates = 5, tau = 2)
# Make a population with units of different size x <- c(1:10, 100) # Draw a sequential Poisson sample (samp <- sps(x, 5)) # Make some bootstrap replicates dist <- list( pseudo_population = NULL, standard_normal = rnorm, exponential = \(x) rexp(x) - 1, uniform = \(x) runif(x, -sqrt(3), sqrt(3)) ) lapply(dist, sps_repweights, w = weights(samp), replicates = 5, tau = 2)