9 Treatment of Non-response
Non-response is an inevitable phenomenon in surveys. We distinguish total
non-response, which affects individuals for whom no usable information has
been collected, from partial non-response, which corresponds to ‘holes’ in
the information collected for a given individual (certain variables yk are
known, but others are not). In all cases, non-response generates a bias and
increases the variance, which varies more or less explicitly as a function
of the inverse of the number of respondents. There are two broad classes of
methods for correcting non-response: reweighting and imputation.
9.1 Reweighting methods
We denote by φk the response probability of individual k. The entire
approach rests on the idea that the decision of whether or not to respond is
random and can be formalised by a probability which, to simplify, we take
here to depend only on individual k (in reality, it could well depend on the
whole set of sampled units). If φk is known, then, before any possible
calibration, we estimate the total Y without bias by:
\[
\widehat{Y}_\phi = \sum_{k \in r} \frac{y_k}{\pi_k \phi_k},
\]
where πk is the usual inclusion probability and r denotes the sample of
respondents (r ⊂ S). In practice, φk is unknown, so we try to model it in
order to estimate it. Several approaches are possible, but often we
partition the population U into sub-populations Uc within which the φk are
assumed to be constant:
φk = φc when k ∈ Uc .
This is called a homogeneous response model. We can also model φk by a
logistic function (for example) if we have available quantitative or qualitative
auxiliary information that is sufficiently reliable. Reweighting is essentially
used to treat total non-response.
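
The homogeneous response model can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `reweighted_total`, the dictionary keys and the toy data are all hypothetical, and φc is estimated here by the unweighted response rate in each cell, which is one possible choice among others.

```python
from collections import defaultdict

def reweighted_total(units):
    """Reweighting estimator of a total under a homogeneous response model.

    `units` is a list of dicts with the (hypothetical) keys:
      'cell'      : label of the response homogeneity cell c
      'pi'        : inclusion probability pi_k
      'responded' : True if unit k responded
      'y'         : value y_k (only used for respondents)
    """
    sampled = defaultdict(int)
    responded = defaultdict(int)
    for u in units:
        sampled[u["cell"]] += 1
        if u["responded"]:
            responded[u["cell"]] += 1

    # Assumption: phi_c is estimated by the response rate in cell c.
    phi_hat = {}
    for cell, n_c in sampled.items():
        if responded[cell] == 0:
            raise ValueError(f"no respondent in cell {cell!r}")
        phi_hat[cell] = responded[cell] / n_c

    # Sum of y_k / (pi_k * phi_k) over the respondents.
    total = sum(
        u["y"] / (u["pi"] * phi_hat[u["cell"]])
        for u in units
        if u["responded"]
    )
    return total, phi_hat

# Toy data (invented for illustration only).
toy = [
    {"cell": "A", "pi": 0.5, "responded": True, "y": 10.0},
    {"cell": "A", "pi": 0.5, "responded": False, "y": None},
    {"cell": "B", "pi": 0.5, "responded": True, "y": 4.0},
    {"cell": "B", "pi": 0.5, "responded": True, "y": 6.0},
]
toy_total, toy_phi = reweighted_total(toy)
```

In the toy data, the single respondent of cell A carries twice its design weight because half of that cell did not respond, while the weights in cell B are unchanged.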
9.2 Imputation methods
In contrast to reweighting, we directly model the behaviour of yk by using
a vector of auxiliary information xk . For example, we write (a so-called
‘superpopulation’ model):
yk = ψ(xk b) + zk ,
where ψ is a known function and zk is a random variable with zero expected
value and variance σ². We use the information on the respondents to estimate
b and σ², and we predict yk , for each non-respondent k, by yk∗ . Lastly, we
calculate:
\[
\widehat{Y}_I = \sum_{k \in r} \frac{y_k}{\pi_k}
  + \sum_{\substack{k \in S \\ k \notin r}} \frac{y_k^*}{\pi_k},
\]
which preserves the initial weights. If, within a given sub-population, we
believe in the model yk = b + zk , we can impute yk∗ = yℓ , where ℓ is an
identifier selected at random in the respondent sub-population:
this is a technique called ‘hot deck’. The study of the quality of YI is performed
by bringing into play the random variable zk . Imputation is essentially used
to treat partial non-response.
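
A random hot deck within sub-populations can be sketched as follows. The function name `hot_deck_total`, the dictionary keys and the toy data are hypothetical, and donors are drawn with replacement with equal probabilities, which is only one of several possible hot-deck variants.

```python
import random

def hot_deck_total(units, seed=0):
    """Imputed estimator of a total using a random hot deck within cells.

    `units` is a list of dicts with the (hypothetical) keys:
      'id'   : identifier of unit k
      'cell' : sub-population in which y_k = b + z_k is assumed to hold
      'pi'   : inclusion probability pi_k
      'y'    : observed value, or None if missing
    """
    rng = random.Random(seed)  # seeded so that the sketch is reproducible

    # Collect the observed values (donors) in each cell.
    donors = {}
    for u in units:
        if u["y"] is not None:
            donors.setdefault(u["cell"], []).append(u["y"])

    total = 0.0
    imputed = {}
    for u in units:
        y = u["y"]
        if y is None:
            if u["cell"] not in donors:
                raise ValueError(f"no donor in cell {u['cell']!r}")
            # Impute the value of a donor chosen at random in the same cell.
            y = rng.choice(donors[u["cell"]])
            imputed[u["id"]] = y
        # The initial weight 1/pi_k is kept for every sampled unit.
        total += y / u["pi"]
    return total, imputed

# Toy data (invented for illustration only).
toy = [
    {"id": 1, "cell": "A", "pi": 0.1, "y": 5.0},
    {"id": 2, "cell": "A", "pi": 0.1, "y": 5.0},
    {"id": 3, "cell": "A", "pi": 0.1, "y": None},
    {"id": 4, "cell": "B", "pi": 0.1, "y": 7.0},
    {"id": 5, "cell": "B", "pi": 0.1, "y": None},
]
toy_total, toy_imputed = hot_deck_total(toy)
```

Because the imputed value is a random draw, the estimator carries an additional imputation variance on top of the sampling and model variability.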
EXERCISES
Exercise 9.1 Weight of an aeroplane
We wish to estimate the total weight of 250 passengers on a charter flight. For
that, we select a simple random sample of 25 people, whom we intend to ask
for their height (in centimetres) and their weight (in kilograms). Five people
refuse to respond, but we can nevertheless record their gender (1: male and 2:
female). Among the others, five have given their height but did not want to
state their weight. The collected data are presented in Table 9.1.
1. What methods can we use to correct the effects of non-response? Justify
your decisions in a precise way, by explaining the models that you use.
Perform the numerical applications.
2. You learn that 130 passengers are men and 120 are women. Would you
modify your estimation method? Why?
3. Among the 10 non-responses for weight, we select a simple random sample
comprised of individuals b, g, w, x. Using a particularly persuasive inter-
viewer, we get them to admit their height and their weight. This com-
plementary information is given in Table 9.2. How can we take this into
consideration?
Table 9.1. Sample of 25 selected individuals: Exercise 9.1
(a dash marks a value that was not collected)
Individual  Gender  Height (cm)  Weight (kg)
a           1       170          60
b           1       170          -
c           1       180          70
d           1       190          80
e           1       190          80
f           1       170          70
g           1       170          -
h           1       180          80
i           1       180          80
j           1       180          80
k           1       180          -
l           1       190          -
m           1       190          90
n           2       150          40
o           2       160          50
p           2       170          60
q           2       150          50
r           2       160          60
s           2       180          70
t           2       180          -
u           1       -            -
v           1       -            -
w           2       -            -
x           2       -            -
y           2       -            -
Table 9.2. Complementary information for four individuals: Exercise 9.1
Individual Gender Weight Height
b 1 80 170
g 1 100 170
w 2 90 180
x 2 60 150
Solution
1. Two types of non-response appear: total non-response for individuals u to
y and partial non-response for b, g, k, l and t. The total non-response is
treated in general by modifying the weights of the respondents (technique
of ‘reweighting’). Since only the gender variable is known, we can con-
struct, at best, cells based on the gender variable. To justify this practice,
we can have two points of view:
• A ‘probabilistic’ point of view, which postulates that the respondents
of a given gender in fact form a simple random sub-sample of the
initially selected sample (gender by gender), whose size is equal to
the number of respondents for the gender considered. A second
approach, equivalent in terms of the estimator, relies on a
Bernoulli-type response model: all individuals of a given gender have
the same probability of response, estimated by the response rate of
that gender (maximum likelihood estimator). A third way of adhering to
this point of view, again equivalent in terms of the estimator,
consists of saying that, conditionally on gender, the weight variable
and the ‘response’ variable are independent (the decision not to
respond does not depend on the weight). With these three approaches,
the reweighting estimator of the mean weight is:
\[
\widehat{\overline{Y}}_\phi = \sum_{h=1,2} \frac{n_h}{n}\, \widehat{\overline{Y}}_{hr},
\]
where $n_h$ is the number of selected people of gender h (h = 1, 2), n is
the total sample size and $\widehat{\overline{Y}}_{hr}$ is the average
weight of the respondents of gender h. Here the partial non-responses
are treated as total non-responses; the estimator is theoretically
unbiased if the probabilistic model is exact.
• A more ‘model-based’ point of view, which is less concerned with the
process that selects the non-respondents but which postulates a
statistical model of the type:
yhi = µh + εhi ,
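where yhi is the weight of individual i of gender h, µh is the ‘mean’
weight characteristic of gender h and εhi is a random variable whose
expected value is 0 (this is a classical approach in statistics:
everything happens as if a random process had generated yhi according
to this model). The estimator is still $\widehat{\overline{Y}}_\phi$,
but this time we are interested in its expected value under the model,
denoted here by $E_M$:
\[
E_M(\widehat{\overline{Y}}_\phi)
  = \sum_{h=1,2} \frac{n_h}{n}\, E_M(\widehat{\overline{Y}}_{hr})
  = \sum_{h=1,2} \frac{n_h}{n}\, \mu_h .
\]
Therefore, taking the expectation $E_p$ under the sampling design, for
which $E_p(n_h/n) = N_h/N$ (with $N_h$ the number of passengers of
gender h among the N passengers),
\[
E_p E_M(\widehat{\overline{Y}}_\phi)
  = \sum_{h=1,2} \frac{N_h}{N}\, \mu_h
  = E_M(\overline{Y})
  = E_p E_M(\overline{Y}).
\]
We have $E_p E_M(\widehat{\overline{Y}}_\phi - \overline{Y}) = 0$, and
therefore $\widehat{\overline{Y}}_\phi$ remains ‘unbiased’ if we bring
into play the expected value under the model.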
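
As a numerical check, the reweighting estimator by gender can be applied to the data of Table 9.1. The sketch below treats the five partial non-respondents as total non-respondents, as described above; the variable and function names are hypothetical, and the code is only an illustration of the formula, not the book's own worked solution.

```python
# Weights (kg) of the individuals of Table 9.1 who reported their weight.
weights = {
    1: [60, 70, 80, 80, 70, 80, 80, 80, 90],  # men: a, c, d, e, f, h, i, j, m
    2: [40, 50, 60, 50, 60, 70],              # women: n, o, p, q, r, s
}
# Number of selected individuals per gender (respondents or not).
n_h = {1: 15, 2: 10}
N = 250  # number of passengers on the flight

def reweighted_mean(weights, n_h):
    """Sum over h of (n_h / n) times the respondent mean of gender h."""
    n = sum(n_h.values())
    return sum(
        n_h[h] / n * (sum(ys) / len(ys))
        for h, ys in weights.items()
    )

mean_hat = reweighted_mean(weights, n_h)
total_hat = N * mean_hat

print(f"estimated mean weight : {mean_hat:.2f} kg")
print(f"estimated total weight: {total_hat:.0f} kg")
```

With these data, the respondent means are 690/9 ≈ 76.7 kg for men and 330/6 = 55 kg for women, so the estimated mean is (15 × 690/9 + 10 × 55)/25 = 68 kg and the estimated total weight is 250 × 68 = 17 000 kg.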
The partial non-response is generally treated by imputation, using a
behaviour model. In every case, we use the auxiliary information given by
the variable ‘height’, which is strongly linked to weight. To treat the partial