Working with survey weights

The Malawi Integrated Household Survey (IHS) series uses complex sampling designs rather than Simple Random Sampling (SRS). To obtain unbiased, population-representative estimates, survey weights, stratification, and clustering must be accounted for. This vignette describes how to set up and use survey designs with ihsMW.

1. Why Survey Weights Matter

To reduce fieldwork costs and improve accuracy, the Malawi National Statistical Office (NSO) designs the IHS as a stratified two-stage cluster sample:
- Strata: Usually defined by districts split into urban and rural areas.
- Primary Sampling Units (PSUs): The Enumeration Areas (EAs) selected in the first stage.
- Survey Weights: The inverse of each household's probability of selection, adjusted for non-response.

Because households have different probabilities of selection (e.g. rural households may be over- or under-sampled relative to urban ones), unweighted statistics will generally be biased. Standard errors computed without accounting for clustering and stratification will also be wrong, and typically too small, because households within the same EA tend to resemble one another.
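
The effect of unequal selection probabilities can be seen in a small base-R sketch. The data frame below is invented for illustration (it is not IHS data): urban households are sampled at a higher rate than rural ones, so the unweighted mean leans towards the urban values.

```r
# Toy data: 4 urban households sampled at a high rate, 4 rural at a low rate.
# All numbers are invented for illustration only.
hh <- data.frame(
  area   = rep(c("urban", "rural"), each = 4),
  hhsize = c(3, 4, 3, 4, 6, 5, 6, 7),
  prob   = rep(c(0.02, 0.005), each = 4)  # assumed selection probabilities
)

# Design weight = inverse of the selection probability
hh$weight <- 1 / hh$prob

unweighted <- mean(hh$hhsize)                      # treats every row equally
weighted   <- weighted.mean(hh$hhsize, hh$weight)  # rural rows count 4x more

c(unweighted = unweighted, weighted = weighted)
# unweighted = 4.75, weighted = 5.5
```

The weighted mean is higher because each rural household, which tends to be larger in this toy example, stands in for four times as many households as an urban one.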

2. Setting Up the Design

The ihs_svydesign() function creates a survey design object by wrapping survey::svydesign(). It automatically detects standard weight, strata, and PSU columns inside your harmonised dataset:

library(ihsMW)
library(haven)

# Load and harmonise IHS5 data
raw_data <- read_dta("path/to/IHS5/hh_mod_a_filt.dta")
harmonised_data <- ihs_harmonise(raw_data, round = "IHS5")

# Create survey design object
# Automatically detects: hh_wgt/hhweight, stratum/strata, and ea_id/psu
design <- ihs_svydesign(harmonised_data)

If the standard columns are named differently in your data, you can specify them explicitly:

design <- ihs_svydesign(
  data = harmonised_data,
  weight_col = "custom_weight",
  strata_col = "custom_strata",
  psu_col = "custom_ea"
)
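
For orientation, a design of this kind can also be declared directly with survey::svydesign(). The sketch below uses an invented toy data frame whose column names follow the defaults listed above; the exact arguments that ihs_svydesign() passes internally are an assumption here, not something this vignette documents.

```r
library(survey)

# Toy household data (invented): 2 strata, 3 EAs per stratum, 2 households per EA
toy <- data.frame(
  stratum = rep(c("urban", "rural"), each = 6),
  ea_id   = rep(1:6, each = 2),
  hh_wgt  = rep(c(50, 200), each = 6),
  hhsize  = c(3, 4, 3, 4, 5, 4, 6, 5, 6, 7, 5, 6)
)

# Declare PSUs, strata, and weights explicitly.
# nest = TRUE tells survey that PSU ids are nested within strata.
manual_design <- svydesign(
  ids     = ~ea_id,
  strata  = ~stratum,
  weights = ~hh_wgt,
  data    = toy,
  nest    = TRUE
)

toy_mean <- svymean(~hhsize, manual_design)
toy_mean
```

Printing the design object (print(manual_design)) is a quick way to check that the strata and cluster structure were picked up as intended.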

3. Weighted Analysis

Once you have the survey design object, you can compute representative statistics using the survey package:

library(survey)

# Nationally representative mean of household size
svymean(~hhsize, design = design, na.rm = TRUE)

# Estimated national total of food expenditure
svytotal(~food_exp, design = design, na.rm = TRUE)

# Calculate means grouped by a factor variable (e.g., region)
svyby(~food_exp, ~region, design = design, svymean, na.rm = TRUE)
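
Estimates from these functions carry design-based standard errors, which can be extracted with SE() and turned into confidence intervals with confint(). As a runnable sketch that does not require IHS files, the example below uses the api dataset bundled with the survey package as a stand-in; api00 and stype are columns of that dataset, not IHS variables.

```r
library(survey)

# Example data shipped with the survey package (a stand-in for IHS data)
data(api, package = "survey")

# Stratified design from the bundled apistrat sample
dstrat <- svydesign(
  ids     = ~1,
  strata  = ~stype,
  weights = ~pw,
  fpc     = ~fpc,
  data    = apistrat
)

est <- svymean(~api00, dstrat)
se  <- SE(est)       # design-based standard error
ci  <- confint(est)  # 95% confidence interval by default

# The same accessors work on grouped estimates from svyby()
by_type    <- svyby(~api00, ~stype, dstrat, svymean)
by_type_ci <- confint(by_type)
```

The same pattern should apply unchanged to a design created by ihs_svydesign(), since it wraps survey::svydesign().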

4. Using srvyr

If you prefer dplyr-like syntax, the srvyr package works seamlessly with the survey design objects generated by ihs_svydesign():

library(srvyr)

# Convert to srvyr design object
srvyr_design <- as_survey(design)

# Calculate summary statistics using dplyr verbs
summary_stats <- srvyr_design |>
  group_by(region) |>
  summarise(
    mean_exp = survey_mean(food_exp, na.rm = TRUE),
    total_exp = survey_total(food_exp, na.rm = TRUE)
  )
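
srvyr can also return standard errors and confidence intervals alongside each estimate through the vartype argument of survey_mean(). The sketch below again uses the api example data from the survey package rather than IHS data, so the column names (api00, stype, pw, fpc) are those of the example dataset.

```r
library(survey)
library(dplyr)
library(srvyr)

# Example data shipped with the survey package (a stand-in for IHS data)
data(api, package = "survey")

# Build the design with srvyr's tidy interface
strat_design <- as_survey_design(apistrat, strata = stype, weights = pw, fpc = fpc)

# Request both the standard error and a confidence interval per group
stats <- strat_design |>
  group_by(stype) |>
  summarise(
    mean_api = survey_mean(api00, vartype = c("se", "ci"))
  )

stats
```

The result has one row per group, with the estimate in mean_api and the uncertainty measures in suffixed columns (mean_api_se, mean_api_low, mean_api_upp).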

5. Summary Statistics

ihsMW provides ihs_report(), which computes a clean summary statistics table for publication. It supports survey weights directly:

# Generate a summary statistics table with survey weights
report_tbl <- ihs_report(
  data = harmonised_data,
  vars = c("hhsize", "food_exp", "nonfood_exp"),
  weights = "hh_wgt"
)
print(report_tbl)

You can also compute these weighted tables grouped by another variable:

# Grouped weighted summary statistics
report_grouped <- ihs_report(
  data = harmonised_data,
  vars = c("hhsize", "food_exp"),
  by = "region",
  weights = "hh_wgt"
)

6. Common Pitfalls

Subsetting Data Naively: Avoid subsetting your data frame with [ ] or dplyr::filter() before creating the survey design. Dropping rows can remove whole PSUs or strata from the design, which leads to incorrect standard errors for the subpopulation. Instead, define the design on the full dataset first, then use subset() on the survey design object or filter() on the srvyr design object.
Ignoring Strata or PSU: Using only the survey weights (e.g. weighted.mean()) without specifying clustering (PSUs) and stratification yields correct point estimates but no valid standard errors; standard errors computed as if the data were a simple random sample are usually too small. Always use a full survey design object for inference.
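
The two subsetting approaches can be compared side by side. This is a minimal sketch using the api example data from the survey package in place of IHS data; the subpopulation (stype == "E") is just an illustrative choice.

```r
library(survey)

# Example data shipped with the survey package (a stand-in for IHS data)
data(api, package = "survey")

# Recommended: declare the design on the full sample, then subset the design
full_design <- svydesign(ids = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)
domain_est  <- svymean(~api00, subset(full_design, stype == "E"))

# Naive: drop rows first, then declare the design on what is left
naive_design <- svydesign(
  ids     = ~dnum,
  weights = ~pw,
  fpc     = ~fpc,
  data    = apiclus1[apiclus1$stype == "E", ]
)
naive_est <- svymean(~api00, naive_design)

# Point estimates agree; compare the standard errors, which need not
SE(domain_est)
SE(naive_est)
```

Both routes use the same rows and weights, so the means match. Only the first keeps the information about the full sample that the variance calculation relies on.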