Skip to content

Some new presentations

2015 September 16
by Michał

Within last few weeks the website of the RECON project have been updated. Among other things, we have uploaded a couple of presentations that were given in 2014 in 2015. Below is a short list. See the Publications page on RECONs webpage for a complete list with abstracts.

  • Czerniawska D., Fenrich W., Bojanowski M. (2015), How does scholarly cooperation occur and how does it manifest itself? Evidence from Poland Presentation at ESA 2015 conference. PDF slides
  • Czerniawska D. (2015), Paths to interdisciplinarity: How do scholars start working on the edges of disciplines? Presentation at ‘What makes interdisciplinarity work? Crossing academic boundaries in real life’ Ustinov College, Durham University. HTML slides
  • Fenrich W., Czerniawska D., Bojanowski M. (2015) The story behind the graph: a mixed method study of scholarly collaboration networks in Poland. Presentation at Sunbelt XXXV. HTML slides

Linear models with weighted observations

2015 September 4
by Michał

In data analysis it happens sometimes that it is neccesary to use weights. Contexts
that come to mind include:

  • Analysis of data from complex surveys, e.g. stratified samples. Sample inclusion probabilities might have been unequal and thus observations from different strata should have different weights.
  • Application of propensity score weighting e.g. to correct data being Missing At Random (MAR).
  • Inverse-variance weighting ( when different observations have been measured with different precision which is known apriori.
  • We are analyzing data in an aggregated form such that the weight variable encodes how many original observations each row in the aggregated data represents.
  • We are given survey data with post-stratification weights.

If you use, or have been using, SPSS you probably know about the possibility to define one of the variables as weights. This information is used when producing cross-tabulations (cells include sums of weights), regression models and so on. SPSS weights are frequency weights in the sense that $w_i$ is the number of observations particular case $i$ represents.

On the other hand, in R lm and glm functions have weights argument that serves a related purpose.


Let’s compare different ways in which a linear model can be fitted to data with weights. We start by generating some artificial data:


N <- 30 # number of observations

# Aggregated data
aggregated <- data.frame(x=1:5) %>%
  mutate( y = round(2 * x + 2 + rnorm(length(x)) ),
          freq = as.numeric(table(sample(1:5, N, 
                replace=TRUE, prob=c(.3, .4, .5, .4, .3))))
##   x  y freq
## 1 1  5    4
## 2 2  8    5
## 3 3  8    8
## 4 4 12    8
## 5 5 10    5
# Disaggregated data
individuals <- aggregated[ rep(1:5, aggregated$freq) , c("x", "y") ]


ggplot(aggregated, aes(x=x, y=y, size=freq)) + geom_point() + theme_bw()


Let’s fit some models:

models <- list( 
               ind_lm = lm(y ~ x, data=individuals),
               raw_agg = lm( y ~ x, data=aggregated),
               ind_svy_glm = svyglm(y~x, design=svydesign(id=~1, data=individuals),
                                 family=gaussian() ),
               ind_glm = glm(y ~ x, family=gaussian(), data=individuals),
               wei_lm = lm(y ~ x, data=aggregated, weight=freq),
               wei_glm = glm(y ~ x, data=aggregated, family=gaussian(), weight=freq),
               svy_glm = svyglm(y ~ x, design=svydesign(id=~1, weights=~freq, data=aggregated),
## Warning in svydesign.default(id = ~1, data = individuals): No weights or
## probabilities supplied, assuming equal probability

In short, we have the following linear models:

  • ind_lm is a OLS fit to individual data (the true model).
  • ind_agg is a OLS fit to aggregated data (definitely wrong).
  • ind_glm is a ML fit to individual data
  • ind_svy_glm is a ML fit to individual data using simple random sampling with replacement design.
  • wei_lm is OLS fit to aggregated data with frequencies as weights
  • wei_glm is a ML fit to aggregated data with frequencies as weights
  • svy_glm is a ML fit to aggregated using “survey” package and using frequencies as weights in the sampling design.

We would expect that models ind_lm, ind_glm, and ind_svy_glm will be identical.

Summarise and gather in long format

results <-"rbind", lapply( names(models), function(n) cbind(model=n, tidy(models[[n]])) )) %>%
                                      gather(stat, value, -model, -term)

Check if point estimates of model coefficients are identical:

results %>% filter(stat=="estimate") %>% 
  select(model, term, value) %>%
  spread(term, value)
##         model (Intercept)        x
## 1      ind_lm     4.33218 1.474048
## 2     raw_agg     4.40000 1.400000
## 3 ind_svy_glm     4.33218 1.474048
## 4     ind_glm     4.33218 1.474048
## 5      wei_lm     4.33218 1.474048
## 6     wei_glm     4.33218 1.474048
## 7     svy_glm     4.33218 1.474048

Apart from the “wrong” raw_agg model, the coefficients are identical across models.

Let’s check the inference:

# Standard Errors
results %>% filter(stat=="std.error") %>%
  select(model, term, value) %>%
  spread(term, value)
##         model (Intercept)         x
## 1      ind_lm    0.652395 0.1912751
## 2     raw_agg    1.669331 0.5033223
## 3 ind_svy_glm    0.500719 0.1912161
## 4     ind_glm    0.652395 0.1912751
## 5      wei_lm    1.993100 0.5843552
## 6     wei_glm    1.993100 0.5843552
## 7     svy_glm    1.221133 0.4926638
# p-values
results %>% filter(stat=="p.value") %>%
  mutate(p=format.pval(value)) %>%
  select(model, term, p) %>%
  spread(term, p)
##         model (Intercept)          x
## 1      ind_lm  3.3265e-07 2.1458e-08
## 2     raw_agg    0.077937   0.068904
## 3 ind_svy_glm  2.1244e-09 2.1330e-08
## 4     ind_glm  3.3265e-07 2.1458e-08
## 5      wei_lm    0.118057   0.085986
## 6     wei_glm    0.118057   0.085986
## 7     svy_glm    0.038154   0.058038

Recall, that the correct model is ind_lm. Observations:

  • raw_agg is clearly wrong, as expected.
  • Should the weight argument to lm and glm implement frequency weights, the results for wei_lm and wei_glm will be identical to that from ind_lm. Only the point estimates are correct, all the inference stats are not correct.
  • The model using design with sampling weights svy_glm gives correct point estimates, but incorrect inference.
  • Suprisingly, the model fit with “survey” package to the individual data using simple random sampling design (ind_svy_glm) does not give identical inference stats to those from ind_lm. They are close though.

Functions weights lm and glm implement precision weights: inverse-variance weights that can be used to model differential precision with which the outcome variable was estimated.

Functions in the “survey” package implement sampling weights: inverse of the probability of particular observation to be selected from the population to the sample.

Frequency weights are a different animal.

However, it is possible get correct inference statistics for the model fitted to aggregated data using lm with frequency weights supplied as weights. What needs correcting is the degrees of freedom (see also

models$wei_lm_fixed <- models$wei_lm
models$wei_lm_fixed$df.residual <- with(models$wei_lm_fixed, sum(weights) - length(coefficients))

results <-"rbind", lapply( names(models), function(n) cbind(model=n, tidy(models[[n]])) )) %>%
                                      gather(stat, value, -model, -term)
## Warning in summary.lm(x): residual degrees of freedom in object suggest
## this is not an "lm" fit
# Coefficients
results %>% filter(stat=="estimate") %>% 
  select(model, term, value) %>%
  spread(term, value)
##          model (Intercept)        x
## 1       ind_lm     4.33218 1.474048
## 2      raw_agg     4.40000 1.400000
## 3  ind_svy_glm     4.33218 1.474048
## 4      ind_glm     4.33218 1.474048
## 5       wei_lm     4.33218 1.474048
## 6      wei_glm     4.33218 1.474048
## 7      svy_glm     4.33218 1.474048
## 8 wei_lm_fixed     4.33218 1.474048
# Standard Errors
results %>% filter(stat=="std.error") %>%
  select(model, term, value) %>%
  spread(term, value)
##          model (Intercept)         x
## 1       ind_lm    0.652395 0.1912751
## 2      raw_agg    1.669331 0.5033223
## 3  ind_svy_glm    0.500719 0.1912161
## 4      ind_glm    0.652395 0.1912751
## 5       wei_lm    1.993100 0.5843552
## 6      wei_glm    1.993100 0.5843552
## 7      svy_glm    1.221133 0.4926638
## 8 wei_lm_fixed    0.652395 0.1912751

See model wei_lm_fixed. Thus, correcting the degrees of freedom manually gives correct coefficient estimates as well as inference statistics.


Aggregating data and using frequency weights can save you quite some time. To illustrate it, let’s generate large data set in a disaggregated and aggregated form.

N <- 10^4 # number of observations

# Aggregated data
big_aggregated <- data.frame(x=1:5) %>%
  mutate( y = round(2 * x + 2 + rnorm(length(x)) ),
          freq = as.numeric(table(sample(1:5, N, replace=TRUE, prob=c(.3, .4, .5, .4, .3))))

# Disaggregated data
big_individuals <- aggregated[ rep(1:5, big_aggregated$freq) , c("x", "y") ]

… and fit lm models weighting the model on aggregated data. Benchmarking:


speed <- microbenchmark(
  big_individual = lm(y ~ x, data=big_individuals),
  big_aggregated = lm(y ~ x, data=big_aggregated, weights=freq)

speed %>% group_by(expr) %>% summarise(median=median(time / 1000)) %>%
  mutate( ratio = median / median[1])
## Source: local data frame [2 x 3]
##             expr   median     ratio
## 1 big_individual 7561.158 1.0000000
## 2 big_aggregated 1492.057 0.1973319

So quite an improvement.

The improvement is probably the bigger, the more we are able to aggregate the data.

P-values deemed no longer significant

2015 March 3
by Michał

Journal Basic and Applied Social Psychology (BASP) bans the use of statistical hypothesis testing.

The BASP editorial by Trafimow and Marks here.

The story have also been covered by:

And discussed in/by, among others:

Where this will go, I wonder…

Word-processing Wars

2015 January 8
by Michał

Three days ago Nature published a note commenting on an recent heated social media discussions whether MS Word is better than LaTeX for writing scientific papers. The note refers to a PLOS article by Knauf & Nejasmic reporting a study on word-processor use. The overall result of that study is that participants who used Word took less time and made less mistakes in reproducing the probe text as compared to people who used LaTeX.

I find it rather funny that Nature picked-up the topic. Such discussions always seemed rather futile to me (de gustibus non disputandum est and the fact that some solution A is better or more “efficient” than B does not necessarily lead to A becoming accepted, as is the case with QWERTY vs Dvorak keyboard layouts) and far away from anything scientific.

As it goes for myself, I do not like Word nor its Linux counterparts (LibreOffice, Abiword etc), let’s call them WYSIWYGs. First and foremost because I believe they are very poor text editors (as compared to Vim or Emacs): it is cumbersome to navigate longer texts, search. The fact that it is convenient to read a piece of text in, say, Times New Roman does not mean that it is convenient to write using it. Second, when writing in WYSIWYGs I always have an impression that I am handcrafting something: formatting, styles and so on. It is like sculpturing: if you don’t like the result you need to get another piece of wood and start from the beginning. All that seems to counter the main purpuse for which the computers were developed in the first place, which is taking over “mechanistic” tasks and leave “creative” ones to the user.

I like that the Nature note referred to Markdown as an emerging technology for writing [scientific] texts. If do not know, Markdown is a lightweight plain text format, not unlike Wikipedia markup. Texts written in Markdown can be processed to PDF, HTML, MSWord and so on. More and more people are using for writing articles or even books. It is simple (plain text) and allows to focus on writing.

Last, the note still contains a popular misconception that one of the downsides of LaTeX is a lack of spell checker…

Praktyki w ICM – oferta

2014 May 13

ICM, jak co roku, organizuje praktyki dla studentów. W tym roku poszukuję osoby, która byłaby zainteresowana pracą nad stworzeniem aplikacji umożliwiającej interaktywną wizualizację danych sieciowych.

Oferujemy pracę w młodej i dynamicznej grupie badaczy sieci oraz nawiązanie kontaktów z zagranicznym zespołem naukowym.

Wymagania (pierwsze jest warunkiem koniecznym, pozostałe będą dodatkowymi atutami):

  • Programowanie w R
  • Programowanie w JavaScript
  • Tworzenie aplikacji Shiny
  • Znajomość biblioteki D3js
  • Znajomość metod Social Network Analysis (SNA)

Jeżeli jesteś zainteresowany, wypełnij formularz na stronie ICM! Mój temat ma numer 22.

Alluvial diagrams

2014 March 27
by Michał

Parallel coordinates plot is one of the tools for visualizing multivariate data. Every observation in a dataset is represented with a polyline that crosses a set of parallel axes corresponding to variables in the dataset. You can create such plots in R using a function parcoord in package MASS. For example, we can create such plot for the built-in dataset mtcars:

k <- blue2red(100)
x <- cut( mtcars$mpg, 100)
op <- par(mar=c(3, rep(.1, 3)))
parcoord(mtcars, col=k[as.numeric(x)])

This produces the plot below. The lines are colored using a blue-to-red color ramp according to the miles-per-gallon variable.


What to do if some of the variables are categorical? One approach is to use polylines with different width. Another approach is to add some random noise (jitter) to the values. Titanic data is a crossclassification of Titanic passengers according to class, gender, age, and survival status (survived or not). Consequently, all variables are categorical. Let’s try the jittering approach. After converting the crossclassification (R table) to data frame we “blow it up” by repeating observations according to their frequency in the table.

# convert to data frame of numeric variables
titdf <-, as.numeric))
# repeat obs. according to their frequency
titdf2 <- titdf[ rep(1:nrow(titdf), titdf$Freq) , ]
# new columns with jittered values
titdf2[,6:9] <- lapply(titdf2[,1:4], jitter)
# colors according to survival status, with some transparency
k <- adjustcolor(brewer.pal(3, "Set1")[titdf2$Survived], alpha=.2)
op <- par(mar=c(3, 1, 1, 1))
parcoord(titdf2[,6:9], col=k)

This produces the following (red lines are for passengers who did not survive):


It is not so easy to read, is it. Did the majority of 1st class passengers (bottom category on leftmost axis) survived or not? Definitely most of women from that class did, but in aggregate?

At this point it would be nice to, instead of drawing a bunch of lines, to draw segments for different groups of passengers. Later I learned that such plot exists and even has a name: alluvial diagram. They seem to be related to Sankey diagrams blogged about on R-bloggers recently, e.g. here. What is more, I was not alone in thinking how to create such a thing with R, see for example here. Later I found that what I need is a “parallel set” plot, as it was called, and implemented, on CrossValidated here. Thats look terrific to me, nevertheless, I still would prefer to:

  • The axes to be vertical. If the variables correspond to measurements on different points in time, then we should have nice flows from left to right.
  • If only the segments could be smooth curves, e.g. splines or Bezier curves…

And so I wrote a prototype function alluvial (tadaaa!), now in a package alluvial on Github. I strongy relied on code by Aaron from his answer on CrossValidated (hat tip).

See the following examples of using alluvial on Titanic data:

First, just using two variables Class and Survival, and with stripes being simple polygons.


This was produced with the code below.

# load packages and prepare data
tit <-
# only two variables: class and survival status
tit2d <- aggregate( Freq ~ Class + Survived, data=tit, sum)
alluvial( tit2d[,1:2], freq=tit2d$Freq, xw=0.0, alpha=0.8,
         gap.width=0.1, col= "steelblue", border="white",
         layer = tit2d$Survived != "Yes" )

The function accepts data as (collection of) vectors or data frames. The xw argument specifies the position of the knots of xspline relative to the axes. If positive, the knot is further away from the axis, which will make the stripes go horizontal longer before turning towards the other axis. Argument gap.width specifies distances between categories on the axes.

Another example is showing the whole Titanic data. Red stripes for those who did not survive.


Now its possible to see that, e.g.:

  • A bit more than 50% of 1st class passangers survived
  • Women who did not survive come almost exclusively from 3rd class
  • etc.

The plot was produced with:

alluvial( tit[,1:4], freq=tit$Freq, border=NA,
         hide = tit$Freq < quantile(tit$Freq, .50),
         col=ifelse( tit$Survived == "No", "red", "gray") )

In this variant the stripes have no borders, color transparency is at 0.5, and for the purpose of the example the plot shows only “thickest” 50% of the stripes (argument hide).

As compared to the parallel set solution mentioned earlier, the main differences are:

  • Axes are vertical instead of horizontal
  • I used xspline to draw the “stripes”
  • with argument hide you can skip plotting of selected groups of cases

If you have suggestions or ideas for extensions/modifications, let me know on Github!

Stay tuned for more examples from panel data.

Gadka na SERze

2014 March 10
by Michał


These are slides from the very first SER meeting – an R user group in Warsaw – that took place on February 27, 2014. I talked about various “lifehacking” tricks for R and focused how to use R with GNU make effectively. I will post some detailed examples in forthcoming posts.

Oto moje slajdy z pierwszego Spotkania Entuzjastów R (SER) , które odbyło się  27 lutego w ICM. Niebawem w oddzielnym poście opiszę szerzej prezentowane rozwiązanie GNU make + R.

Slides from Sunbelt 2014 talk on collaboration in science

2014 March 9

Here are the slides from my Sunbelt 2014 talk on collaboration in science. I talked about:

  • Some general considerations regarding collaboration or the lack of it. I have an impression that we are quite good at formulating arguments able to explain why people would like to collaborate. It’s much less understood why we do not observe as much collaboration as those arguments might suggest.
  • Some general considerations about potential data sources and their utility for studying collaboration and other types of social processes among scientists. In particular, I believe this can be usefully framed as a network boundary problem (Lauman & Marsden, 1989).
  • Finally, I showed some preliminary results from studying co-authorship network of employees of the University of Warsaw. Among other things, we see quite some differences between departments in terms of propensity to co-author (also depending on the type of co-authored work) and network transitivity.

Comments welcome.

As if life after PhD was not hard enough…

2013 November 5
by Michał

And so I wrote a post on the Future of ___ PhD yesterday. Today I just learned about this shocking story about a political science PhD looking to be employed as an assistant professor at the University of Wrocław and facing shady realities of (parts of) of Polish higher education… Share and beware.


The Future of ___ PhD

2013 November 4
by Michał

Fill-in the blank in the title of this post with a name of scientific discipline of choice. Nov 1 issue of NYT features a piece “The Repurposed Ph.D. Finding Life After Academia — and Not Feeling Bad About It”. The gloomy state of affairs described in the article mostly applies to humanities and social sciences, at least in the U.S., but I’m sure it applies to other countries as well. I’m sure it does to Poland too. More and more people are entering the job market with a PhD (at least in Poland as evidence shows). At the same time, available positions are scarce and the pays are low. It is somewhat heart-warming to know that people are self-organizing into groups like “Versatile Ph.D” to support each other in such difficult situation.

The article links to several interesting pieces including the “The Future of the Humanities Ph.D. at Stanford” discussing the ways of modifying humanities PhD programs so that humanities training will remain relevant in the society and economy of today. Definitely a worthy read for higher education administrators and decision makers in Poland.