
Internships at ICM – an offer

2014 May 13

As it does every year, ICM is organizing internships for students. This year I am looking for someone interested in working on an application for interactive visualization of network data.

We offer work in a young and dynamic group of network researchers, as well as the opportunity to establish contacts with a research team abroad.

Requirements (the first is a necessary condition; the rest are additional assets):

  • Programming in R
  • Programming in JavaScript
  • Building Shiny applications
  • Familiarity with the D3.js library
  • Familiarity with Social Network Analysis (SNA) methods

If you are interested, fill in the form on the ICM website! My topic is number 22.

Alluvial diagrams

2014 March 27
by Michał

A parallel coordinates plot is one of the tools for visualizing multivariate data. Every observation in a dataset is represented by a polyline that crosses a set of parallel axes, one for each variable in the dataset. You can create such plots in R with the parcoord function from the MASS package. For example, here is such a plot for the built-in mtcars dataset:

library(MASS)        # parcoord()
library(colorRamps)  # blue2red()
 
data(mtcars)
k <- blue2red(100)         # a 100-step blue-to-red color ramp
x <- cut(mtcars$mpg, 100)  # bin mpg into 100 classes
 
op <- par(mar=c(3, rep(.1, 3)))
parcoord(mtcars, col=k[as.numeric(x)])
par(op)

This produces the plot below. The lines are colored using a blue-to-red color ramp according to the miles-per-gallon variable.

[Figure: parallel coordinates plot of mtcars, lines colored blue-to-red by mpg]

What if some of the variables are categorical? One approach is to use polylines of varying width. Another is to add some random noise (jitter) to the values. The Titanic dataset is a cross-classification of Titanic passengers according to class, gender, age, and survival status (survived or not), so all variables are categorical. Let’s try the jittering approach. After converting the cross-classification (an R table) to a data frame, we “blow it up” by repeating observations according to their frequency in the table.

data(Titanic)
library(RColorBrewer)  # brewer.pal()
# convert to a data frame of numeric variables
titdf <- as.data.frame(lapply(as.data.frame(Titanic), as.numeric))
# repeat obs. according to their frequency
titdf2 <- titdf[ rep(1:nrow(titdf), titdf$Freq) , ]
# new columns with jittered values
titdf2[,6:9] <- lapply(titdf2[,1:4], jitter)
# colors according to survival status, with some transparency
k <- adjustcolor(brewer.pal(3, "Set1")[titdf2$Survived], alpha=.2)
op <- par(mar=c(3, 1, 1, 1))
parcoord(titdf2[,6:9], col=k)
par(op)

This produces the following (red lines are for passengers who did not survive):

[Figure: jittered parallel coordinates plot of the Titanic data; red lines mark passengers who did not survive]

It is not so easy to read, is it? Did the majority of 1st class passengers (the bottom category on the leftmost axis) survive or not? Most of the women from that class clearly did, but in aggregate?

At this point it would be nice, instead of drawing a bunch of lines, to draw segments for the different groups of passengers. Later I learned that such a plot exists and even has a name: an alluvial diagram. They seem to be related to the Sankey diagrams blogged about on R-bloggers recently, e.g. here. What is more, I was not alone in wondering how to create such a thing with R, see for example here. Later I found that what I need is a “parallel set” plot, as it was called, and implemented, on CrossValidated here. That looks terrific to me; nevertheless, I would still prefer:

  • The axes to be vertical. If the variables correspond to measurements at different points in time, then we get nice flows from left to right.
  • The segments to be smooth curves, e.g. splines or Bezier curves…

And so I wrote a prototype function alluvial (tadaaa!), now in the alluvial package on GitHub. I relied heavily on the code from Aaron’s answer on CrossValidated (hat tip).

See the following examples of using alluvial on Titanic data:

First, using just the two variables Class and Survived, with the stripes being simple polygons.

[Figure: alluvial diagram of Class vs Survived]

This was produced with the code below.

# load packages and prepare data
library(alluvial)
tit <- as.data.frame(Titanic)
 
# only two variables: class and survival status
tit2d <- aggregate( Freq ~ Class + Survived, data=tit, sum)
 
alluvial( tit2d[,1:2], freq=tit2d$Freq, xw=0.0, alpha=0.8,
         gap.width=0.1, col= "steelblue", border="white",
         layer = tit2d$Survived != "Yes" )

The function accepts data as a collection of vectors or as data frames. The xw argument specifies the position of the knots of the xspline relative to the axes. If positive, the knots are placed further away from the axes, which makes the stripes run horizontally longer before turning towards the other axis. The gap.width argument specifies the distances between categories on the axes. For intuition, see the sketch below.
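
As a small illustration (my variation, not from the original post), here is the same two-variable plot with a larger xw, so that the stripes run horizontally longer near the axes:

# same call as above, but with the knots pushed away from the axes
alluvial( tit2d[,1:2], freq=tit2d$Freq, xw=0.2, alpha=0.8,
         gap.width=0.1, col= "steelblue", border="white",
         layer = tit2d$Survived != "Yes" )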

Another example shows the whole Titanic data, with red stripes for those who did not survive.

[Figure: alluvial diagram of the full Titanic data (Class, Sex, Age, Survived)]

Now it is possible to see, for example, that:

  • A bit more than 50% of the 1st class passengers survived
  • Women who did not survive came almost exclusively from the 3rd class
  • etc.

The plot was produced with:

alluvial( tit[,1:4], freq=tit$Freq, border=NA,
         hide = tit$Freq < quantile(tit$Freq, .50),
         col=ifelse( tit$Survived == "No", "red", "gray") )

In this variant the stripes have no borders, color transparency is at 0.5, and, for the purpose of the example, the plot shows only the “thickest” 50% of the stripes (the hide argument).

As compared to the parallel set solution mentioned earlier, the main differences are:

  • Axes are vertical instead of horizontal
  • I used xspline to draw the “stripes”
  • With the hide argument you can skip plotting selected groups of cases

If you have suggestions or ideas for extensions/modifications, let me know on GitHub!

Stay tuned for more examples from panel data.

A talk at SER

2014 March 10
by Michał

These are slides from the very first SER meeting – an R user group in Warsaw – which took place on February 27, 2014 at ICM. I talked about various “lifehacking” tricks for R, focusing on how to use R with GNU make effectively. I will describe the presented GNU make + R setup in more detail, with examples, in forthcoming posts.

Slides from Sunbelt 2014 talk on collaboration in science

2014 March 9

Here are the slides from my Sunbelt 2014 talk on collaboration in science. I talked about:

  • Some general considerations regarding collaboration or the lack of it. I have the impression that we are quite good at formulating arguments explaining why people would want to collaborate. It is much less well understood why we do not observe as much collaboration as those arguments might suggest.
  • Some general considerations about potential data sources and their utility for studying collaboration and other types of social processes among scientists. In particular, I believe this can be usefully framed as a network boundary problem (Laumann & Marsden, 1989).
  • Finally, I showed some preliminary results from studying the co-authorship network of employees of the University of Warsaw. Among other things, we see considerable differences between departments in the propensity to co-author (depending also on the type of co-authored work) and in network transitivity.

Comments welcome.

As if life after PhD was not hard enough…

2013 November 5
by Michał

And so I wrote a post on the Future of ___ PhD yesterday. Today I learned about this shocking story of a political science PhD looking to be employed as an assistant professor at the University of Wrocław and facing the shady realities of (parts of) Polish higher education… Share and beware.


The Future of ___ PhD

2013 November 4
by Michał

Fill in the blank in the title of this post with the name of a scientific discipline of your choice. The Nov 1 issue of the NYT features a piece, “The Repurposed Ph.D.: Finding Life After Academia — and Not Feeling Bad About It”. The gloomy state of affairs described in the article mostly applies to the humanities and social sciences, at least in the U.S., but I’m sure it applies to other countries as well, Poland included. More and more people are entering the job market with a PhD (at least in Poland, as the evidence shows). At the same time, available positions are scarce and the pay is low. It is somewhat heart-warming to know that people are self-organizing into groups like “Versatile Ph.D.” to support each other in such a difficult situation.

The article links to several interesting pieces, including “The Future of the Humanities Ph.D. at Stanford”, which discusses ways of modifying humanities PhD programs so that humanities training remains relevant in today’s society and economy. Definitely a worthy read for higher education administrators and decision makers in Poland.

Reading the internet

2013 June 21
by Michał

Google Reader was one of my main ways of reading the Internet. It was great for following news and updates from many websites. For example, I had my own “R bloggers” folder within Google Reader long before Tal Galili created R-bloggers.com. Unfortunately, Google is killing Reader on July 1. There are several alternatives, just search for “google reader alternative”. Meanwhile, I switched to Feedly. It’s pretty cool, although a couple of things annoy me a lot, e.g. too many content (feed/item) recommendations, and keyboard shortcuts that differ from Google Reader’s. The mobile app (I use Android) is also great, although a bit heavy for my Samsung Ace. Nice features include being able to (1) push feed items to Instapaper or Evernote, and (2) save selected items for later reading.

And so, I just browsed my Feedly “Saved for later” folder, and here are a couple of interesting items from the last 30 days:

A recent issue of Science brings a very cool paper by Luís M. A. Bettencourt explaining the scaling properties of cities: how things like GDP, crime, and traffic congestion depend on city size. Descriptively, the relationships seem to follow a simple power-law relation (see this presentation by Geoffrey West). However, as the paper shows, explaining it is not that simple and involves considering many types of interactions and interdependencies.
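
For reference, a rough summary of the scaling relation in question (my paraphrase, not a quote from the paper): an urban quantity Y scales with population N as Y = Y0 * N^b, where, as I recall, the exponent b is about 1.15 for socioeconomic outputs like GDP (superlinear) and about 0.85 for infrastructure (sublinear).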

To finish on a somewhat less geeky note: the National Museum in Warsaw has a temporary exhibition of Mark Rothko, featuring his works from the National Gallery of Art in Washington, DC, which is the first Polish exhibition of Rothko’s works ever. Accompanying the exhibition, there is a lovely children’s guide by Zosia Dzierżawska.

Package intergraph goes 2.0

2013 May 10
by Michał

Yesterday I submitted a new version (2.0-0) of the intergraph package to CRAN. There are some major changes and bug fixes. Here is a summary:

  • The package now supports only “igraph” objects created with igraph version 0.6-0 or newer (vertex indexing starting from 1, not 0)!
  • The main functions for converting network data between the “igraph” and “network” object classes are now called asIgraph and asNetwork.
  • There is a generic function asDF that converts a network object to a list of two data frames containing (1) an edge list with edge attributes and (2) a vertex data frame with vertex attributes.
  • Functions asNetwork and asIgraph also allow for creating network objects from data frames (edge lists with edge attributes and vertex data frames with vertex attributes). A minimal sketch of the basic conversions follows below.
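
For illustration, here is a quick sketch of the round-trip conversion (my example, not from the announcement; I am assuming the asDF result has components named edges and vertexes, as in the package documentation):

library(igraph)
library(intergraph)
 
g <- barabasi.game(10)       # an "igraph" object
V(g)$name <- letters[1:10]   # add a vertex attribute
 
net <- asNetwork(g)          # "igraph" -> "network"
g2 <- asIgraph(net)          # ... and back to "igraph"
 
d <- asDF(g)                 # list of two data frames:
head(d$edges)                # (1) edge list with edge attributes
head(d$vertexes)             # (2) vertex data frame with vertex attributes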

I have written a short tutorial on using the package. It is available on the package home page on R-Forge. Here is the direct link.

Usage experiences and bug reports are more than welcome.

R Color Reference Sheet

2013 April 17
by Michał

R has a built-in collection of 657 named colors that you can refer to in plotting functions. There are also various facilities for selecting color sequences more systematically (a small sketch follows the list):

  • Color palettes and ramps available in packages RColorBrewer and colorRamps.
  • R base functions colorRamp and colorRampPalette that you can use to create your own color sequences by interpolating a set of colors that you provide.
  • R base functions rgb, hsv, and hcl that you can use to generate (almost) any color you want.
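
For instance, a minimal sketch of these facilities (my example, not part of the sheet):

head(colors())                  # a few of the 657 built-in color names
pal <- colorRampPalette(c("steelblue", "white", "firebrick"))
pal(7)                          # 7 interpolated hex colors
rgb(0.2, 0.4, 0.8)              # "#3366CC"
hsv(0.6, 0.5, 0.9)              # a color from hue/saturation/value
hcl(h = 120, c = 35, l = 85)    # a color from hue/chroma/luminance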

When producing data visualizations, the choice of proper colors is often a compromise between the requirements dictated by the visualization itself and the overall style and color of the article/book/report that the visualization is going to be an element of. Choosing an optimal color palette is not so easy, and it’s handy to have some reference. Inspired by this sheet by Przemek Biecek, I created a variant of an R color reference sheet showing different ways in which you can use and call colors in R when creating visualizations. The sheet fits A4 paper (two pages). The first page shows a matrix of all 657 colors with their names. On the second page, on the left, all palettes from the RColorBrewer package are displayed; on the right, selected color ramps available in base R (the grDevices package) and in the contributed colorRamps package. Miniatures below:

[Figure: miniatures of the two pages of the R color reference sheet]

You can download the sheet as PDF from here.

Below is a gist with the code creating the sheet as a PDF, “rcolorsheet.pdf”. Instead of directly reusing Przemek’s code, I rewrote the parts that produce the first page (built-in color names) and the part with the ramps using the image function. I think it is much simpler, with less low-level for-looping, and a bit more extensible. For example, it is easy to extend the collection of color ramps by simply adding another function name, in the form packagename::functionname, to the funnames vector (any extra package would have to be loaded at the top of the script).
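
Here is a minimal sketch of that image-based idea (my reconstruction, not the actual gist code; the funnames vector mirrors the one mentioned above):

library(colorRamps)
 
# ramp-generating functions, given as "package::function" strings
funnames <- c("grDevices::heat.colors", "grDevices::terrain.colors",
              "colorRamps::blue2red")
 
op <- par(mfrow=c(length(funnames), 1), mar=c(1, 10, 1, 1))
for (fn in funnames) {
  f <- eval(parse(text = fn))     # resolve the string to a function
  image(matrix(1:100), col = f(100), axes = FALSE)  # 100-step color bar
  mtext(fn, side = 2, las = 1, cex = 0.8)           # label with the name
}
par(op)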

Assorted links

2013 April 4
by Michał

Some assorted links collected this week:

There are discussions in various places about the merits, pitfalls, and misunderstandings related to buzzwords like “big data” and “data science” (what a useless term it is…), and to analyses being “data-driven” or “evidence-based”. Perhaps I will write a separate post on that at some point… For now: