The Malawi Integrated Household Survey (IHS) series uses complex
sampling designs rather than Simple Random Sampling (SRS). To obtain
unbiased, population-representative estimates, survey weights,
stratification, and clustering must be accounted for. This vignette
describes how to set up and use survey designs with
ihsMW.
1. Why Survey Weights Matter
To reduce fieldwork costs and improve precision, the Malawi National Statistical Office (NSO) designs the IHS as a stratified two-stage cluster sample:
- Strata: Usually defined by districts split into urban and rural areas.
- Primary Sampling Units (PSUs): The Enumeration Areas (EAs) selected in the first stage.
- Survey Weights: Inverse probabilities of selection, adjusted for non-response.
Because households have different probabilities of selection (e.g. rural households might be over- or under-sampled relative to urban ones), unweighted statistics will generally be biased. Standard errors computed without accounting for clustering and stratification will also be wrong, and are typically too small.
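The bias from ignoring weights can be illustrated with a small sketch. The data below are entirely hypothetical (not IHS data): urban households are over-sampled and tend to be smaller, so the unweighted mean is pulled towards the urban value. The design is built directly with survey::svydesign(), and the column names hh_wgt, stratum and ea_id are chosen only to mirror the ones mentioned in this vignette.

```r
library(survey)

# Hypothetical toy sample: 10 EAs per stratum, 10 households per EA.
toy <- data.frame(
  stratum = rep(c("urban", "rural"), each = 100),
  ea_id   = rep(1:20, each = 10),
  # Urban households are over-sampled, so each one represents fewer households.
  hh_wgt  = rep(c(10, 90), each = 100),
  # Household size is constant within an EA; urban EAs average 4, rural EAs 6.
  hhsize  = rep(c(2:6, 2:6, 4:8, 4:8), each = 10)
)

# Unweighted mean: treats every sampled household as equally representative.
unweighted <- mean(toy$hhsize)

# Design-based mean: uses weights, strata and PSUs.
toy_design <- svydesign(
  ids = ~ea_id, strata = ~stratum, weights = ~hh_wgt, data = toy
)
weighted <- coef(svymean(~hhsize, toy_design))[["hhsize"]]

c(unweighted = unweighted, weighted = weighted)
```

In this toy sample the unweighted mean is 5, while the weighted mean is 5.8, because the rural households stand in for nine times as many households in the population.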
2. Setting Up the Design
The ihs_svydesign() function creates a survey design
object by wrapping survey::svydesign(). It automatically
detects standard weight, strata, and PSU columns inside your harmonised
dataset:
library(ihsMW)
library(haven)
# Load and harmonise IHS5 data
raw_data <- read_dta("path/to/IHS5/hh_mod_a_filt.dta")
harmonised_data <- ihs_harmonise(raw_data, round = "IHS5")
# Create survey design object
# Automatically detects: hh_wgt/hhweight, stratum/strata, and ea_id/psu
design <- ihs_svydesign(harmonised_data)
If the standard columns are named differently in your data, you can specify them explicitly:
design <- ihs_svydesign(
  data = harmonised_data,
  weight_col = "custom_weight",
  strata_col = "custom_strata",
  psu_col = "custom_ea"
)
3. Weighted Analysis
Once you have the survey design object, you can compute
representative statistics using the survey package:
library(survey)
# Nationally representative mean of household size
svymean(~hhsize, design = design, na.rm = TRUE)
# Nationally representative total of expenditure
svytotal(~food_exp, design = design, na.rm = TRUE)
# Calculate means grouped by a factor variable (e.g., region)
svyby(~food_exp, ~region, design = design, svymean, na.rm = TRUE)
4. Using srvyr
If you prefer dplyr-like syntax, the srvyr
package works seamlessly with the survey design objects generated by
ihs_svydesign():
library(srvyr)
# Convert to srvyr design object
srvyr_design <- as_survey(design)
# Calculate summary statistics using dplyr verbs
summary_stats <- srvyr_design |>
  group_by(region) |>
  summarise(
    mean_exp = survey_mean(food_exp, na.rm = TRUE),
    total_exp = survey_total(food_exp, na.rm = TRUE)
  )
5. Summary Statistics
ihsMW provides ihs_report(), which computes
a clean summary statistics table for publication. It supports survey
weights directly:
# Generate a summary statistics table with survey weights
report_tbl <- ihs_report(
  data = harmonised_data,
  vars = c("hhsize", "food_exp", "nonfood_exp"),
  weights = "hh_wgt"
)
print(report_tbl)
You can also compute these weighted tables grouped by another variable:
# Grouped weighted summary statistics
report_grouped <- ihs_report(
  data = harmonised_data,
  vars = c("hhsize", "food_exp"),
  by = "region",
  weights = "hh_wgt"
)
6. Common Pitfalls
- Subsetting Data Naively: Never subset your data frame with standard [ ] or dplyr::filter() before creating the survey design, as this breaks the cluster/strata structure and results in incorrect standard errors. Instead, define the design on the full dataset first, and then use subset() from the survey package or srvyr::filter() on the design object.
- Ignoring Strata or PSU: Using only survey weights (e.g. weighted.mean()) without specifying clustering (PSUs) and stratification will yield correct point estimates but incorrect (usually too small) standard errors. Always use a full survey design object for inference.
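Both pitfalls can be sketched on a small, entirely hypothetical sample (not IHS data). The example builds designs directly with survey::svydesign() rather than ihs_svydesign(), and the column names are chosen only to mirror those used in this vignette. Household size is made identical within each EA, an extreme case of clustering, so that the effect on the standard error is easy to see.

```r
library(survey)

# Hypothetical toy sample: 10 EAs per stratum, 10 households per EA.
toy <- data.frame(
  stratum = rep(c("urban", "rural"), each = 100),
  ea_id   = rep(1:20, each = 10),
  hh_wgt  = rep(c(10, 90), each = 100),
  # Identical household size within each EA (strong intra-cluster correlation).
  hhsize  = rep(c(2:6, 2:6, 4:8, 4:8), each = 10)
)

# Weights only: the point estimate is fine, but there is no valid standard error.
naive_mean <- weighted.mean(toy$hhsize, toy$hh_wgt)

# A design that knows the weights but ignores strata and PSUs.
weights_only <- svydesign(ids = ~1, weights = ~hh_wgt, data = toy)

# The full design: weights, strata and PSUs.
full_design <- svydesign(
  ids = ~ea_id, strata = ~stratum, weights = ~hh_wgt, data = toy
)

est_weights_only <- svymean(~hhsize, weights_only)
est_full         <- svymean(~hhsize, full_design)

# Same point estimate, different standard errors.
se_weights_only <- as.numeric(SE(est_weights_only))
se_full         <- as.numeric(SE(est_full))
c(weights_only = se_weights_only, full_design = se_full)

# Subset the design object, not the data frame, to estimate for a subgroup.
urban_design <- subset(full_design, stratum == "urban")
urban_mean   <- svymean(~hhsize, urban_design)
urban_mean
```

In this toy sample the two designs agree on the point estimate, but the full design reports a larger standard error because households in the same EA carry largely redundant information. The size of the gap here is a product of the artificial data; in real surveys it depends on how similar households within an EA are.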