ICM, as every year, is organizing internships for students. This year I am looking for a person interested in working on an application for interactive visualization of network data.
We offer work in a young and dynamic group of network researchers, as well as the opportunity to establish contacts with a foreign research team.
Requirements (the first is a necessary condition, the remaining ones are additional assets):
- Programming in R
- Building Shiny applications
- Familiarity with the D3.js library
- Familiarity with Social Network Analysis (SNA) methods
If you are interested, fill in the form on the ICM website! My topic is number 22.
A parallel coordinates plot is one of the tools for visualizing multivariate data. Every observation in a dataset is represented with a polyline that crosses a set of parallel axes corresponding to the variables in the dataset. You can create such plots in R with the parcoord function from the MASS package. For example, we can create such a plot for the built-in dataset mtcars:
```r
library(MASS)
library(colorRamps)

data(mtcars)
k <- blue2red(100)
x <- cut(mtcars$mpg, 100)
op <- par(mar = c(3, rep(.1, 3)))
parcoord(mtcars, col = k[as.numeric(x)])
par(op)
```
This produces the plot below. The lines are colored using a blue-to-red color ramp according to the miles-per-gallon variable.
What to do if some of the variables are categorical? One approach is to use polylines of varying width. Another is to add some random noise (jitter) to the values. The Titanic data is a cross-classification of Titanic passengers according to class, gender, age, and survival status (survived or not); consequently, all variables are categorical. Let's try the jittering approach. After converting the cross-classification (an R table) to a data frame, we "blow it up" by repeating observations according to their frequency in the table.
```r
library(MASS)
library(RColorBrewer)

data(Titanic)
# convert to a data frame of numeric variables
titdf <- as.data.frame(lapply(as.data.frame(Titanic), as.numeric))
# repeat observations according to their frequency
titdf2 <- titdf[rep(1:nrow(titdf), titdf$Freq), ]
# new columns with jittered values
titdf2[, 6:9] <- lapply(titdf2[, 1:4], jitter)
# colors according to survival status, with some transparency
k <- adjustcolor(brewer.pal(3, "Set1")[titdf2$Survived], alpha = .2)
op <- par(mar = c(3, 1, 1, 1))
parcoord(titdf2[, 6:9], col = k)
par(op)
```
This produces the following (red lines are for passengers who did not survive):
It is not so easy to read, is it? Did the majority of 1st-class passengers (the bottom category on the leftmost axis) survive or not? Most of the women from that class certainly did, but in aggregate?
At this point it would be nice, instead of drawing a bunch of lines, to draw segments for the different groups of passengers. Later I learned that such a plot exists and even has a name: an alluvial diagram. Alluvial diagrams seem to be related to the Sankey diagrams blogged about on R-bloggers recently, e.g. here. What is more, I was not alone in wondering how to create such a thing with R; see for example here. Later I found that what I need is a "parallel set" plot, as it was called, and implemented, on CrossValidated here. That looks terrific to me; nevertheless, I would still prefer:
- The axes to be vertical. If the variables correspond to measurements at different points in time, we then get nice flows from left to right.
- The segments to be smooth curves, e.g. splines or Bezier curves…
See the following examples of using alluvial on the Titanic data:
First, just using two variables Class and Survival, and with stripes being simple polygons.
This was produced with the code below.
```r
# load packages and prepare data
library(alluvial)
tit <- as.data.frame(Titanic)

# only two variables: class and survival status
tit2d <- aggregate(Freq ~ Class + Survived, data = tit, sum)

alluvial(tit2d[, 1:2], freq = tit2d$Freq,
         xw = 0.0, alpha = 0.8, gap.width = 0.1,
         col = "steelblue", border = "white",
         layer = tit2d$Survived != "Yes")
```
The function accepts data as a (collection of) vectors or data frames. The xw argument specifies the position of the knots of xspline relative to the axes; if positive, the knots are further away from the axes, which makes the stripes run horizontally longer before turning towards the other axis. The gap.width argument specifies the distances between categories on the axes.
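To see the effect of these two arguments, here is a sketch reusing the two-variable Titanic summary from above; the argument values are purely illustrative, not recommendations:

```r
# Illustrative only: the same Class x Survived summary as before, but with
# larger xw and gap.width so the effect of both arguments is visible.
library(alluvial)

tit <- as.data.frame(Titanic)
tit2d <- aggregate(Freq ~ Class + Survived, data = tit, sum)

alluvial(tit2d[, 1:2], freq = tit2d$Freq,
         xw = 0.2,         # knots further from the axes: longer horizontal runs
         gap.width = 0.3,  # wider gaps between categories on each axis
         col = "steelblue", border = "white")
```

Compared with the plot above, the stripes now leave each axis horizontally for longer, and the categories on each axis are separated more clearly.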
Another example is showing the whole Titanic data. Red stripes for those who did not survive.
Now it is possible to see, for example, that:
- A bit more than 50% of 1st-class passengers survived
- The women who did not survive came almost exclusively from the 3rd class
The plot was produced with:
```r
alluvial(tit[, 1:4], freq = tit$Freq, border = NA,
         hide = tit$Freq < quantile(tit$Freq, .50),
         col = ifelse(tit$Survived == "No", "red", "gray"))
```
In this variant the stripes have no borders, color transparency is at 0.5, and for the purpose of the example the plot shows only the "thickest" 50% of the stripes (argument hide).
As compared to the parallel set solution mentioned earlier, the main differences are:
- Axes are vertical instead of horizontal
- I used xspline to draw the "stripes"
- With the argument hide you can skip plotting selected groups of cases
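For the curious, here is a minimal, self-contained sketch of the xspline idea using base graphics only. It is a simplified illustration, not the package's actual drawing code: one stripe between two vertical axes, whose interior knots are smoothed by xspline.

```r
# A toy "stripe" between two vertical axes, smoothed with xspline().
# Simplified illustration only -- not the code used inside alluvial().
plot(NULL, xlim = c(0, 1), ylim = c(0, 1),
     axes = FALSE, xlab = "", ylab = "")
abline(v = c(0, 1))                    # the two "axes"
x      <- c(0, 0.25, 0.75, 1)          # knot x-positions
top    <- c(0.8, 0.8, 0.5, 0.5)        # upper edge of the stripe
bottom <- c(0.6, 0.6, 0.3, 0.3)        # lower edge of the stripe
xspline(c(x, rev(x)), c(top, rev(bottom)),
        shape = c(0, 1, 1, 0, 0, 1, 1, 0),  # smooth only the interior knots
        open = FALSE, col = adjustcolor("steelblue", alpha.f = 0.5),
        border = NA)
```

The shape vector controls how strongly each knot is smoothed (0 means the curve passes through the knot exactly), which is what makes the stripes bend gently instead of kinking at the axes.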
If you have suggestions or ideas for extensions/modifications, let me know on Github!
Stay tuned for more examples from panel data.
These are slides from the very first SER meeting – an R user group in Warsaw – that took place on February 27, 2014. I talked about various "lifehacking" tricks for R and focused on how to use R with GNU make effectively. I will post some detailed examples in forthcoming posts.
And so I wrote a post on the Future of ___ PhD yesterday. Today I just learned about this shocking story about a political science PhD looking to be employed as an assistant professor at the University of Wrocław and facing the shady realities of (parts of) Polish higher education… Share and beware.
Fill in the blank in the title of this post with the name of a scientific discipline of your choice. The Nov 1 issue of the NYT features a piece "The Repurposed Ph.D. Finding Life After Academia — and Not Feeling Bad About It". The gloomy state of affairs described in the article mostly applies to the humanities and social sciences, at least in the U.S., but I'm sure it applies to other countries as well, Poland included. More and more people are entering the job market with a PhD (at least in Poland, as the evidence shows). At the same time, available positions are scarce and the pay is low. It is somewhat heart-warming to know that people are self-organizing into groups like "Versatile Ph.D." to support each other in such a difficult situation.
The article links to several interesting pieces including the “The Future of the Humanities Ph.D. at Stanford” discussing the ways of modifying humanities PhD programs so that humanities training will remain relevant in the society and economy of today. Definitely a worthy read for higher education administrators and decision makers in Poland.
Google Reader was one of my main ways of reading the Internet. It was great for reading news and updates from many websites. For example, I had my own "R bloggers" folder within Google Reader long before Tal Galili created R-bloggers.com. Unfortunately, Google is killing Reader on July 1. There are several alternatives to Reader; just search for "google reader alternative". Meanwhile, I switched to Feedly. It's pretty cool, although there are a couple of things that annoy me a lot, e.g. too many content (feed/item) recommendations, and keyboard shortcuts that differ from Google Reader's. The mobile app (I use Android) is also great, although a bit heavy for my Samsung Ace. Nice features include being able to (1) push feed items to Instapaper or Evernote, and (2) save selected items for later reading.
And so, I just browsed my Feedly "Saved for later" folder, and here are a couple of interesting items from the last 30 days:
- Nice R illustrations of multicollinearity.
- Almost like self-analytics in the spirit of Stephen Wolfram, here is a great analysis of infant feeding schedule.
- If you have recently read The Signal and the Noise by Nate Silver, have a look at the TED talk by Didier Sornette on predicting financial crises.
- Computerworld published a nice short introduction to R, although it is probably not ideal if you have never programmed a computer before.
- If you own a computer with an Intel processor and are willing to buy Intel MKL, Flavio Barros shows how to integrate it with R. MKL can speed up many elementary operations, like matrix multiplication, by a factor of 3!
A recent issue of Science brings a very cool paper by Luís M. A. Bettencourt explaining the scaling properties of cities: how things like GDP, crime, traffic congestion etc. depend on city size. Descriptively, the relationships seem to follow a simple power-law relation (see this presentation by Geoffrey West). However, as the paper shows, explaining it is not that simple and involves considering many types of interactions and interdependencies.
To finish on a somewhat less geeky note, the Warsaw National Museum has a temporary exhibition of Mark Rothko featuring his works from the National Gallery of Art in Washington DC, which is the first Polish exhibition of Rothko's works ever. Accompanying the exhibition, there is a lovely children's guide by Zosia Dzierżawska.
Yesterday I submitted a new version (marked 2.0-0) of package ‘intergraph’ to CRAN. There are some major changes and bug fixes. Here is a summary:
- The package now supports only "igraph" objects created with 'igraph' version 0.6-0 or newer (vertex indexing starting from 1, not 0)!
- The main functions for converting network data between the "igraph" and "network" object classes are now called asIgraph and asNetwork.
- There is a generic function asDF that converts a network object to a list of two data frames containing (1) an edge list with edge attributes and (2) a vertex database with vertex attributes.
- asIgraph and asNetwork allow for creating network objects from data frames (edge lists with edge attributes and vertex databases with vertex attributes).
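A quick sketch of how these functions fit together (assuming the 'network' and 'igraph' packages are installed; the example network is made up for illustration):

```r
library(network)
library(intergraph)

# a small directed network from an adjacency matrix
m <- matrix(c(0, 1, 0,
              0, 0, 1,
              1, 0, 0), nrow = 3, byrow = TRUE)
net <- network(m)

g    <- asIgraph(net)    # "network" -> "igraph"
net2 <- asNetwork(g)     # "igraph"  -> "network"

# asDF: a list of two data frames, the edge list and the vertex database
d <- asDF(net)
str(d)
```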
Usage experiences and bug reports are more than welcome.
R has a built-in collection of 657 colors that you can use in plotting functions by using color names. There are also various facilities to select color sequences more systematically:
- Color palettes and ramps available in packages RColorBrewer and colorRamps.
- The R base function colorRampPalette that you can use to create your own color sequences by interpolating a set of colors that you provide.
- The R base function hcl that you can use to generate (almost) any color you want.
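For example (base R only; the particular colors below are arbitrary choices):

```r
# colorRampPalette() returns a *function* that generates n interpolated colors
pal <- colorRampPalette(c("steelblue", "white", "firebrick"))
pal(5)   # five hex colors going from blue through white to red

# hcl() builds colors directly from hue, chroma, and luminance
hcl(h = c(0, 90, 180, 270), c = 80, l = 60)  # four evenly spaced hues
```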
When producing data visualizations, the choice of proper colors is often a compromise between the requirements dictated by the data visualization itself and the overall style and color of the article/book/report that the visualization is going to be an element of. Choosing an optimal color palette is not so easy, and it's handy to have some reference. Inspired by this sheet by Przemek Biecek, I created a variant of an R color reference sheet showing different ways in which you can use and call colors in R when creating visualizations. The sheet fits A4 paper (two pages). The first page shows a matrix of all the 657 colors with their names. On the second page, on the left, all palettes from the RColorBrewer package are displayed. On the right are selected color ramps available in base R (the base package grDevices) and in the contributed package colorRamps. Miniatures below:
You can download the sheet as PDF from here.
Below is a gist with the code creating the sheet as a PDF "rcolorsheet.pdf". Instead of directly reusing Przemek's code, I have rewritten the parts that produce the first page (built-in color names) and the part with the ramps using the image function. I think it is much simpler, with less low-level for-looping, and a bit more extensible. For example, it is easy to extend the collection of color ramps by adding another function name of the form packagename::functionname to the funnames vector (any extra package would have to be loaded at the top of the script).
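A minimal sketch of that image()-based idea is below; the funnames vector and ramp choices here are illustrative, and the actual gist differs in details:

```r
# Each color ramp becomes one row of a single image() call, so adding a ramp
# is just adding one "packagename::functionname" entry to 'funnames'.
funnames <- c("grDevices::heat.colors",
              "grDevices::terrain.colors",
              "grDevices::topo.colors")
n <- 100

# resolve each name and generate n colors from each ramp (an n x k matrix)
pal <- sapply(funnames, function(nm) {
  f <- getFromNamespace(sub("^.*::", "", nm), sub("::.*$", "", nm))
  f(n)
})

# cell (i, j) gets z = (j - 1) * n + i, which indexes column j of 'pal'
image(x = 1:n, y = seq_along(funnames),
      z = matrix(seq_len(n * length(funnames)), nrow = n),
      col = as.vector(pal), axes = FALSE, xlab = "", ylab = "")
axis(2, at = seq_along(funnames), labels = funnames, las = 1, tick = FALSE)
```

The trick is that image() maps consecutive z values to consecutive entries of col, so one call draws all the ramps at once instead of looping over them.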
Some assorted links collected this week:
- A new interesting-looking book "Web Social Science" by Robert Ackland, coming out in July 2013.
- In a recent issue of Nature (March 28): a special on the future of scientific publishing.
- An interesting TED talk by Colin Camerer on neuroscience and experimental economics.
- Nice paper analyzing world email traffic, co-authored by Michael Macy. Another example of using ‘igraph’ package for network analysis.
- Gary King and Stuart Shieber on Open Access science and publishing.
There are discussions in various places about the merits, pitfalls, and misunderstandings related to buzzwords like "big data" and "data science" (what a useless term…), and about analyses being "data-driven" or "evidence-based". Perhaps I will write a separate post on that at some point… For now:
- “Let the Data Speak for themselves”, a guest post by Joseph Rickert on Revolutions blog
- Echoes and comments of Nate Silver’s acclaimed book “The Signal and the Noise”, for example:
- David Brooks at NYT
- Petr Keil on data-driven science