Alluvial diagrams

2014 March 27
by Michał

A parallel coordinates plot is one of the tools for visualizing multivariate data. Every observation in a dataset is represented by a polyline that crosses a set of parallel axes corresponding to the variables in the dataset. You can create such plots in R using the function parcoord in the package MASS. For example, we can create such a plot for the built-in dataset mtcars:

library(MASS)
library(colorRamps)
 
data(mtcars)
k <- blue2red(100)
x <- cut( mtcars$mpg, 100)
 
op <- par(mar=c(3, rep(.1, 3)))
parcoord(mtcars, col=k[as.numeric(x)])
par(op)

This produces the plot below. The lines are colored using a blue-to-red color ramp according to the miles-per-gallon variable.

cars
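
As a side note, the coloring trick above works because cut() turns a numeric vector into a factor of equal-width bins, and the bin number is then used as an index into the vector of colors. Here is a minimal sketch of that mapping on a toy vector. It uses colorRampPalette from base R instead of blue2red, and the names vals, ramp and bins are made up for illustration.

```r
# toy values standing in for mtcars$mpg (illustrative only)
vals <- c(10, 15, 20, 25, 30)

# five colors from blue to red, built with base R
ramp <- colorRampPalette(c("blue", "red"))(5)

# cut() assigns every value to one of 5 equal-width bins
bins <- cut(vals, 5)

# the integer code of the bin picks the color
cols <- ramp[as.numeric(bins)]
data.frame(vals, bin = as.numeric(bins), cols)
```

Low values end up at the blue end of the ramp and high values at the red end, which is what happens with mpg in the plot above.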

What to do if some of the variables are categorical? One approach is to use polylines of different widths. Another approach is to add some random noise (jitter) to the values. The Titanic data is a cross-classification of Titanic passengers according to class, gender, age, and survival status (survived or not). Consequently, all variables are categorical. Let’s try the jittering approach. After converting the cross-classification (an R table) to a data frame, we “blow it up” by repeating observations according to their frequency in the table.

data(Titanic)
# convert to data frame of numeric variables
titdf <- as.data.frame(lapply(as.data.frame(Titanic), as.numeric))
# repeat obs. according to their frequency
titdf2 <- titdf[ rep(1:nrow(titdf), titdf$Freq) , ]
# new columns with jittered values
titdf2[,6:9] <- lapply(titdf2[,1:4], jitter)
# colors according to survival status, with some transparency
# (brewer.pal comes from package RColorBrewer)
library(RColorBrewer)
k <- adjustcolor(brewer.pal(3, "Set1")[titdf2$Survived], alpha.f=.2)
op <- par(mar=c(3, 1, 1, 1))
parcoord(titdf2[,6:9], col=k)
par(op)

This produces the following (red lines are for passengers who did not survive):

titanic_pc

It is not so easy to read, is it? Did the majority of 1st class passengers (bottom category on the leftmost axis) survive or not? Definitely most of the women from that class did, but in aggregate?

At this point it would be nice, instead of drawing a bunch of lines, to draw segments for different groups of passengers. Later I learned that such a plot exists and even has a name: alluvial diagram. They seem to be related to Sankey diagrams blogged about on R-bloggers recently, e.g. here. What is more, I was not alone in thinking about how to create such a thing with R, see for example here. Later I found that what I need is a “parallel set” plot, as it was called, and implemented, on CrossValidated here. That looks terrific to me; nevertheless, I would still prefer:

  • The axes to be vertical. If the variables correspond to measurements on different points in time, then we should have nice flows from left to right.
  • If only the segments could be smooth curves, e.g. splines or Bezier curves…

And so I wrote a prototype function alluvial (tadaaa!), now in a package alluvial on Github. I relied heavily on code by Aaron from his answer on CrossValidated (hat tip).

See the following examples of using alluvial on Titanic data:

First, just using two variables Class and Survival, and with stripes being simple polygons.

titanic1

This was produced with the code below.

# load packages and prepare data
library(alluvial)
tit <- as.data.frame(Titanic)
 
# only two variables: class and survival status
tit2d <- aggregate( Freq ~ Class + Survived, data=tit, sum)
 
alluvial( tit2d[,1:2], freq=tit2d$Freq, xw=0.0, alpha=0.8,
         gap.width=0.1, col= "steelblue", border="white",
         layer = tit2d$Survived != "Yes" )

The function accepts data as (collection of) vectors or data frames. The xw argument specifies the position of the knots of xspline relative to the axes. If positive, the knot is further away from the axis, which will make the stripes go horizontal longer before turning towards the other axis. Argument gap.width specifies distances between categories on the axes.
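
To get a feeling for what xw does, here is a rough sketch in base R of a single stripe between two axes drawn with xspline. This is not the code from the package: the helper stripe_outline, its arguments, and the choice of shape values are my assumptions for illustration only. The axes sit at x = 1 and x = 2, and the knots are placed xw away from them.

```r
# Hypothetical helper (NOT part of the alluvial package): control points
# for one stripe running from the axis at x = 1 to the axis at x = 2.
# y_left and y_right are c(bottom, top) of the stripe on each axis.
stripe_outline <- function(y_left, y_right, xw = 0.1) {
  # lower edge left-to-right, then upper edge right-to-left
  x <- c(1, 1 + xw, 2 - xw, 2,
         2, 2 - xw, 1 + xw, 1)
  y <- c(y_left[1], y_left[1], y_right[1], y_right[1],
         y_right[2], y_right[2], y_left[2], y_left[2])
  # shape 0: sharp corner on the axes; shape 1: smooth knot in between
  shape <- c(0, 1, 1, 0, 0, 1, 1, 0)
  list(x = x, y = y, shape = shape)
}

s <- stripe_outline(y_left = c(0, 0.3), y_right = c(0.6, 0.9), xw = 0.2)

# empty canvas, then the closed spline
plot.new()
plot.window(xlim = c(1, 2), ylim = c(0, 1))
xspline(s$x, s$y, shape = s$shape, open = FALSE,
        col = "steelblue", border = "white")
```

In this sketch, a larger xw pushes the knots away from the axes, so the stripe stays at its starting height longer before it bends towards the other axis.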

Another example shows the whole Titanic data. Red stripes are for those who did not survive.

titanic4

Now it’s possible to see that, e.g.:

  • A bit more than 50% of 1st class passengers survived
  • Women who did not survive come almost exclusively from 3rd class
  • etc.
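
Readings like these can be cross-checked against the raw counts. Below is a small sketch using xtabs and prop.table from base R. The object names (by_class, surv_share, women_lost) are made up for this example.

```r
tit <- as.data.frame(Titanic)

# counts by class and survival status
by_class <- xtabs(Freq ~ Class + Survived, data = tit)

# share of survivors within each class (rows sum to 1)
surv_share <- prop.table(by_class, margin = 1)
round(surv_share, 2)

# women who did not survive, by class
women_lost <- xtabs(Freq ~ Class,
                    data = subset(tit, Sex == "Female" & Survived == "No"))
round(prop.table(women_lost), 2)
```

The first table gives the share of survivors per class, the second shows how the women who did not survive are distributed over the classes.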

The plot was produced with:

alluvial( tit[,1:4], freq=tit$Freq, border=NA,
         hide = tit$Freq < quantile(tit$Freq, .50),
         col=ifelse( tit$Survived == "No", "red", "gray") )

In this variant the stripes have no borders, color transparency is at 0.5, and for the purpose of the example the plot shows only the “thickest” 50% of the stripes (argument hide).
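
To see which groups the hide expression selects, one can evaluate it on its own. This is only a sketch of the logical vector that gets passed to the function, not of what alluvial does with it internally.

```r
tit <- as.data.frame(Titanic)

# TRUE for groups with a frequency below the median frequency
hide <- tit$Freq < quantile(tit$Freq, .50)

# how many groups are hidden and how many are kept
table(hide)

# the groups that remain visible in the plot
tit[!hide, ]
```

Because the comparison is strict, groups with a frequency exactly equal to the median are kept.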

As compared to the parallel set solution mentioned earlier, the main differences are:

  • Axes are vertical instead of horizontal
  • I used xspline to draw the “stripes”
  • With argument hide you can skip plotting of selected groups of cases

If you have suggestions or ideas for extensions/modifications, let me know on Github!

Stay tuned for more examples from panel data.

20 Responses
  1. Bozeman Bill
    June 9, 2014

    LOVE your alluvial package! I am very interested in making a graph like this: http://www.nature.com/srep/2012/120801/srep00551/fig_tab/srep00551_F7.html. This graph has a series of categories for each year and these categories may change in abundance across years with some frequency. I can’t seem to arrange my data by year and abundance in such a way that your package will work. Any suggestions?

    Here is what my data looks like.
    Class, 1991, 2003, 2005, 2009, 2011, 2013
    1, 818, 604, 601, 570, 563, 556
    2, 183, 147, 145, 143, 142, 150
    3, 40, 55, 55, 55, 55, 50
    4, 48, 70, 81, 76, 85, 99
    5, 126, 140, 142, 148, 155, 153
    6, 396, 566, 568, 566, 525, 508
    7, 158, 189, 153, 98, 118, 123
    8, 244, 206, 238, 296, 269, 247
    9, 83, 91, 85, 90, 86, 92
    10, 76, 88, 88, 89, 89, 91
    11, 28, 30, 30, 30, 30, 30
    12, 0, 14, 14, 39, 83, 101

    Apologies if this is not the right platform for such a question but with your package in development and all I wasn’t sure where else to go with my questions. THANKS!

    • August 7, 2014

      Thanks for compliments! I’m sorry for a slow reply, I did not get any notification about your comment.

      I think your data does not really fit an alluvial diagram because it is cross-sectional. We do not know all the flows between the classes. For example, we do not know in what class the 818 - 604 = 214 people who left class 1 between 1991 and 2003 ended up…

      • Bozeman Bill
        August 14, 2014

        Michal,

        Thanks for the replies. There may have been some confusion in my earlier post. Because I was looking at vegetation at fixed points across time (n=3700 points), I know how each class transitioned. I ended up using your tool to create transitions between each pair of consecutive sample years, then I merged the images in Adobe Illustrator. With a little bit of Illustrator magic I created a nice graph. Again thanks!! I do have a question about how you would like it referenced. I currently have it like this:

        “These inter-annual transition tables were developed into an alluvial diagram to visually examine the type and intensity of disturbance and recovery pathways of habitat shift between the series of sampling events. The alluvial tool is in development in R (https://github.com/mbojan/alluvial).”

        Do you have something better for a reference?

        • November 24, 2014

          Hi Bill

          The package has just been updated with a function I’ve written to handle this kind of data. Is this what you were after?

          http://imgur.com/AmJy8b7
          https://gist.github.com/geotheory/f284caa4584fa50764cf

          Robin

          • Bozeman Bill
            December 19, 2014

            That is great! (just got this update via email last week). I will use your new bit of code in the future.

            I ended up gluing the individual transitions between years in Illustrator to make my publication figure. The paper is currently in review, but I want to make sure I reference you properly before it goes final. This is all I have currently: “The alluvial graphical tool is currently in development in R (https://github.com/mbojan/alluvial).” Would you like something better?

            Keep up the great work!

            Bill

  2. Bozeman Bill
    June 11, 2014

    Sorry for the typos in my last post, it was a late night. I played with the apps at http://www.mapequation.org/. They add on one year at a time and build the alluvial charts by adding together each section, but not with frequency-type data like the example in my last post or the Titanic data set. I ended up doing the same thing with your code and my annual frequency data set by creating multiple graphs and attaching them in Photoshop. You have a great tool here and I am sure the folks at iGraph (http://igraph.org/redirect.html) would include it in their package. It would be nice if we could make multi-year graphs, though. Maybe you were trying to get there in the POLPAN example, but I could not get that to work. The POLPAN data was not included and when I found it on the GESIS site, I couldn’t make it work then either. THANKS for your efforts!!

  3. June 13, 2014

    Hi, Michał!

    That’s awesome work from you (as usual)!

    The third picture reminded me of a similar task I need to solve.

    I need to visualize the results of my project on regrouping students. In particular, I need to show the connections between “old” and “new” groups to demonstrate how many people from each of the “old” groups moved to each of the new groups. The only difference from your graph is that I want to show the numbers of people in groups (squares) and numbers on the “ties”.

    How easy is it to modify your graph for that purpose? Or should I rather use some SNA visualization package/software like Gephi or igraph?

  4. A.C. Bronner
    July 7, 2014

    Hi,
    I wanted to test your package alluvial in order to make some graphs, but as I have no real competence with R, I had problems installing it.
    I can install it with “Installer depuis le fichier .zip” and it appears in the win-library, but when I call library(alluvial) or library(), it says “‘alluvial’ n’est pas un nom correct de package installé” (‘alluvial’ is not the correct name). It seems to be present, otherwise it would say “aucun package trouvé” (no package found).
    Could the problem come from the fact that I use a French version of R?
    I tried to find an answer, but I did not find anything understandable.
    I tried to change ‘alluvial-master’ to ‘alluvial’, but even with this change, it doesn’t work.
    Best regards
    A.C. Bronner

    • August 7, 2014

      Install package ‘devtools’ and then

      library(devtools)
      install_github("mbojan/alluvial")

  5. Uriel
    November 30, 2015

    Hi, great example
    I tried to download the package and run it, but it is not working with R 3.2.2. Any chance for an update?
    thanks in advance!
    Uriel

    • November 30, 2015

      Hi,

      I will need more details from you as it works with R 3.2.2 for me.

      Can you please file an issue on Github here https://github.com/mbojan/alluvial/issues, and give more details on how you are installing the package and what errors you are getting?

      Thanks

      • Uriel
        November 30, 2015

        Hi Michal, Thanks for the fast reply.
        I added an issue on github
        BR

        Uriel

  6. Daniel
    December 4, 2015

    Just stopped by to say Great work! Many thanks.

  7. June 22, 2016

    Awesome Job, thank you! Just what I was looking for ….

    • June 23, 2016

      Thanks. If your graph gets published somewhere (paper/web etc.), can you send me a link? I’m curious about what types of problems and data people use `alluvial()` for.

  8. pattern project
    October 2, 2016

    Hi,

    Great work. Any chance you would integrate it with ggplot as an extension?

    Br / Burhan (@patternproject)
