Skip to content

Linear models with weighted observations

2015 September 4
by Michał

In data analysis it happens sometimes that it is neccesary to use weights. Contexts
that come to mind include:

  • Analysis of data from complex surveys, e.g. stratified samples. Sample inclusion probabilities might have been unequal and thus observations from different strata should have different weights.
  • Application of propensity score weighting e.g. to correct data being Missing At Random (MAR).
  • Inverse-variance weighting (https://en.wikipedia.org/wiki/Inverse-variance_weighting) when different observations have been measured with different precision which is known apriori.
  • We are analyzing data in an aggregated form such that the weight variable encodes how many original observations each row in the aggregated data represents.
  • We are given survey data with post-stratification weights.

If you use, or have been using, SPSS you probably know about the possibility to define one of the variables as weights. This information is used when producing cross-tabulations (cells include sums of weights), regression models and so on. SPSS weights are frequency weights in the sense that $w_i$ is the number of observations particular case $i$ represents.

On the other hand, in R lm and glm functions have weights argument that serves a related purpose.

suppressMessages(local({
  library(dplyr)
  library(ggplot2)
  library(survey)
  library(knitr)
  library(tidyr)
  library(broom)
}))

Let’s compare different ways in which a linear model can be fitted to data with weights. We start by generating some artificial data:

set.seed(666)

N <- 30 # number of observations

# Aggregated data
aggregated <- data.frame(x=1:5) %>%
  mutate( y = round(2 * x + 2 + rnorm(length(x)) ),
          freq = as.numeric(table(sample(1:5, N, 
                replace=TRUE, prob=c(.3, .4, .5, .4, .3))))
          )
aggregated
##   x  y freq
## 1 1  5    4
## 2 2  8    5
## 3 3  8    8
## 4 4 12    8
## 5 5 10    5
# Disaggregated data
individuals <- aggregated[ rep(1:5, aggregated$freq) , c("x", "y") ]

Visually:

ggplot(aggregated, aes(x=x, y=y, size=freq)) + geom_point() + theme_bw()

data_plot-1

Let’s fit some models:

models <- list( 
               ind_lm = lm(y ~ x, data=individuals),
               raw_agg = lm( y ~ x, data=aggregated),
               ind_svy_glm = svyglm(y~x, design=svydesign(id=~1, data=individuals),
                                 family=gaussian() ),
               ind_glm = glm(y ~ x, family=gaussian(), data=individuals),
               wei_lm = lm(y ~ x, data=aggregated, weight=freq),
               wei_glm = glm(y ~ x, data=aggregated, family=gaussian(), weight=freq),
               svy_glm = svyglm(y ~ x, design=svydesign(id=~1, weights=~freq, data=aggregated),
                                family=gaussian())
               )
## Warning in svydesign.default(id = ~1, data = individuals): No weights or
## probabilities supplied, assuming equal probability

In short, we have the following linear models:

  • ind_lm is a OLS fit to individual data (the true model).
  • ind_agg is a OLS fit to aggregated data (definitely wrong).
  • ind_glm is a ML fit to individual data
  • ind_svy_glm is a ML fit to individual data using simple random sampling with replacement design.
  • wei_lm is OLS fit to aggregated data with frequencies as weights
  • wei_glm is a ML fit to aggregated data with frequencies as weights
  • svy_glm is a ML fit to aggregated using “survey” package and using frequencies as weights in the sampling design.

We would expect that models ind_lm, ind_glm, and ind_svy_glm will be identical.

Summarise and gather in long format

results <- do.call("rbind", lapply( names(models), function(n) cbind(model=n, tidy(models[[n]])) )) %>%
                                      gather(stat, value, -model, -term)

Check if point estimates of model coefficients are identical:

results %>% filter(stat=="estimate") %>% 
  select(model, term, value) %>%
  spread(term, value)
##         model (Intercept)        x
## 1      ind_lm     4.33218 1.474048
## 2     raw_agg     4.40000 1.400000
## 3 ind_svy_glm     4.33218 1.474048
## 4     ind_glm     4.33218 1.474048
## 5      wei_lm     4.33218 1.474048
## 6     wei_glm     4.33218 1.474048
## 7     svy_glm     4.33218 1.474048

Apart from the “wrong” raw_agg model, the coefficients are identical across models.

Let’s check the inference:

# Standard Errors
results %>% filter(stat=="std.error") %>%
  select(model, term, value) %>%
  spread(term, value)
##         model (Intercept)         x
## 1      ind_lm    0.652395 0.1912751
## 2     raw_agg    1.669331 0.5033223
## 3 ind_svy_glm    0.500719 0.1912161
## 4     ind_glm    0.652395 0.1912751
## 5      wei_lm    1.993100 0.5843552
## 6     wei_glm    1.993100 0.5843552
## 7     svy_glm    1.221133 0.4926638
# p-values
results %>% filter(stat=="p.value") %>%
  mutate(p=format.pval(value)) %>%
  select(model, term, p) %>%
  spread(term, p)
##         model (Intercept)          x
## 1      ind_lm  3.3265e-07 2.1458e-08
## 2     raw_agg    0.077937   0.068904
## 3 ind_svy_glm  2.1244e-09 2.1330e-08
## 4     ind_glm  3.3265e-07 2.1458e-08
## 5      wei_lm    0.118057   0.085986
## 6     wei_glm    0.118057   0.085986
## 7     svy_glm    0.038154   0.058038

Recall, that the correct model is ind_lm. Observations:

  • raw_agg is clearly wrong, as expected.
  • Should the weight argument to lm and glm implement frequency weights, the results for wei_lm and wei_glm will be identical to that from ind_lm. Only the point estimates are correct, all the inference stats are not correct.
  • The model using design with sampling weights svy_glm gives correct point estimates, but incorrect inference.
  • Suprisingly, the model fit with “survey” package to the individual data using simple random sampling design (ind_svy_glm) does not give identical inference stats to those from ind_lm. They are close though.

Functions weights lm and glm implement precision weights: inverse-variance weights that can be used to model differential precision with which the outcome variable was estimated.

Functions in the “survey” package implement sampling weights: inverse of the probability of particular observation to be selected from the population to the sample.

Frequency weights are a different animal.

However, it is possible get correct inference statistics for the model fitted to aggregated data using lm with frequency weights supplied as weights. What needs correcting is the degrees of freedom (see also http://stackoverflow.com/questions/10268689/weighted-regression-in-r).

models$wei_lm_fixed <- models$wei_lm
models$wei_lm_fixed$df.residual <- with(models$wei_lm_fixed, sum(weights) - length(coefficients))

results <- do.call("rbind", lapply( names(models), function(n) cbind(model=n, tidy(models[[n]])) )) %>%
                                      gather(stat, value, -model, -term)
## Warning in summary.lm(x): residual degrees of freedom in object suggest
## this is not an "lm" fit
# Coefficients
results %>% filter(stat=="estimate") %>% 
  select(model, term, value) %>%
  spread(term, value)
##          model (Intercept)        x
## 1       ind_lm     4.33218 1.474048
## 2      raw_agg     4.40000 1.400000
## 3  ind_svy_glm     4.33218 1.474048
## 4      ind_glm     4.33218 1.474048
## 5       wei_lm     4.33218 1.474048
## 6      wei_glm     4.33218 1.474048
## 7      svy_glm     4.33218 1.474048
## 8 wei_lm_fixed     4.33218 1.474048
# Standard Errors
results %>% filter(stat=="std.error") %>%
  select(model, term, value) %>%
  spread(term, value)
##          model (Intercept)         x
## 1       ind_lm    0.652395 0.1912751
## 2      raw_agg    1.669331 0.5033223
## 3  ind_svy_glm    0.500719 0.1912161
## 4      ind_glm    0.652395 0.1912751
## 5       wei_lm    1.993100 0.5843552
## 6      wei_glm    1.993100 0.5843552
## 7      svy_glm    1.221133 0.4926638
## 8 wei_lm_fixed    0.652395 0.1912751

See model wei_lm_fixed. Thus, correcting the degrees of freedom manually gives correct coefficient estimates as well as inference statistics.

Performance

Aggregating data and using frequency weights can save you quite some time. To illustrate it, let’s generate large data set in a disaggregated and aggregated form.

N <- 10^4 # number of observations

# Aggregated data
big_aggregated <- data.frame(x=1:5) %>%
  mutate( y = round(2 * x + 2 + rnorm(length(x)) ),
          freq = as.numeric(table(sample(1:5, N, replace=TRUE, prob=c(.3, .4, .5, .4, .3))))
          )

# Disaggregated data
big_individuals <- aggregated[ rep(1:5, big_aggregated$freq) , c("x", "y") ]

… and fit lm models weighting the model on aggregated data. Benchmarking:

library(microbenchmark)

speed <- microbenchmark(
  big_individual = lm(y ~ x, data=big_individuals),
  big_aggregated = lm(y ~ x, data=big_aggregated, weights=freq)
)

speed %>% group_by(expr) %>% summarise(median=median(time / 1000)) %>%
  mutate( ratio = median / median[1])
## Source: local data frame [2 x 3]
## 
##             expr   median     ratio
## 1 big_individual 7561.158 1.0000000
## 2 big_aggregated 1492.057 0.1973319

So quite an improvement.

The improvement is probably the bigger, the more we are able to aggregate the data.

8 Responses leave one →
  1. anspiess permalink
    September 5, 2015

    A very important topic that you touched! I think it must be conveyed more that in OLS, maximum likelihood (and minimum residual sum-of-squares) only holds when weights = 1/var…
    In my field (reproductive science), you see at least 10 new publications per month where the data is extremely heteroscedastic (mostly increasing variance with increasing magnitude) but linear regressions are based on unweighted fits…

    Cheers,
    Andrej

    • September 7, 2015

      Right. I was wondering, in your practice, are these 1/var weights specified “ex ante” (based on information known a priori) or rather “post hoc” to improve the model in a data-driven manner?

      • anspiess permalink
        September 7, 2015

        Always post-hoc! Data come from biological replicates, so in linear regressions, there are rarely more than 3-5 replicates per predictor value. The question is, of course, how reliable the estimation of var is with such small n…

        Cheers!

        • September 8, 2015

          OK. Thanks.

          When I was first thought about the use of 1/var weights it stroke me as a sort of technique of “data dredging”, i.e. improving the model by adding parameters (weights) for which we have no explanation whatsoever. I guess its OK if you only care about prediction (machine learning-like) and not so much about getting a handle on/estimating causal effects… I also recall having the same kind of doubts in the case of Ridge regression.

          Anyway, meta-analysis seems to be a somewhat different context.

  2. Maciej permalink
    December 14, 2015

    The difference between glm and svyglm for individual data is due to different way of variance-covariance matrix calculation. By default svyglm uses sandwich variance estimator which is robust (survey:::svy.varcoef).

    Examples are given here: https://cran.r-project.org/web/packages/sandwich/vignettes/sandwich-OOP.pdf

    • December 16, 2015

      Thanks. Good to know. That explains why the results for `svyglm` were slightly different. Perhaps it should be documented somewhere, because, I believe it is not.

Trackbacks and Pingbacks

  1. Linear models with weighted observations | Mubashir Qasim
  2. Distilled News | Data Analytics & R

Leave a Reply

Note: You can use basic XHTML in your comments. Your email address will never be published.

Subscribe to this comment feed via RSS