## Project opportunity in Natural Language Processing of historical texts in Portuguese

*Update 2017-07-13*: **We received a lot of great responses. Thank you very much for the interest in contributing to our project! This also turned out to be a great way to get in touch with people who do very interesting and relevant research. We need to review the responses now and get in touch with people individually. Thank you very much indeed.**

We are looking for a person with good skills and experience in Natural Language Processing for our project on the boundary of history and social network analysis. The project involves analysis of historical documents and written communication (e.g. letters) in Portuguese.

Skills/experience:

- Skills and experience in text mining and natural language processing, especially solving Entity Recognition problems
- Ability to work with text data in Portuguese
- Tools: preferably R or Python
- Good communication skills in English
- Experience in Social Network Analysis will be an asset

Time frame: We would like to start working within July-September 2017, but the sooner the better.

What we offer:

- Opportunity to work on an interesting topic on the boundary of history and social network analysis
- Collaboration with an interdisciplinary team
- Monetary compensation

Please send serious inquiries to m.bojanowski@uw.edu.pl . We welcome examples of your work demonstrating your skills in the relevant areas/topics.

I have just published a small package `lspline` on CRAN that implements linear splines using convenient parametrisations such that

- coefficients are slopes of consecutive segments
- coefficients capture slope change at consecutive knots

Knot locations can be specified

- manually (with `lspline()`)
- at breaks dividing the range of `x` into `q` equal-frequency intervals (with `qlspline()`)
- at breaks dividing the range of `x` into `n` equal-width intervals (with `elspline()`)

The implementation follows Greene (2003, chapter 7.5.2).

The package sources are on GitHub here.
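For intuition, the segment-slopes parametrisation can be sketched in a few lines of base R (a minimal illustration of the idea, not the package's actual code):

```
# Hypothetical linear-spline basis with knots at 5 and 10, built so that
# each column's regression coefficient is the slope of one segment:
x <- seq(0, 20, by = 0.5)
X <- cbind(
  seg1 = pmin(x, 5),                # grows until the first knot, then flat
  seg2 = pmax(pmin(x, 10) - 5, 0),  # grows between the knots
  seg3 = pmax(x - 10, 0)            # grows after the second knot
)
```

Regressing an outcome on these columns (plus an intercept) yields one slope per segment, which is what `lspline()` produces in its default parametrisation.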

# Examples

We will use the following artificial data with knots at `x=5` and `x=10`:

```
set.seed(666)
n <- 200
d <- data.frame(
  x = scales::rescale(rchisq(n, 6), c(0, 20))
)
d$interval <- findInterval(d$x, c(5, 10), rightmost.closed = TRUE) + 1
d$slope <- c(2, -3, 0)[d$interval]
d$intercept <- c(0, 25, -5)[d$interval]
d$y <- with(d, intercept + slope * x + rnorm(n, 0, 1))
```

Plotting `y` against `x`:

```
library(ggplot2)
fig <- ggplot(d, aes(x=x, y=y)) +
  geom_point(aes(shape=as.character(slope))) +
  scale_shape_discrete(name="Slope") +
  theme_bw()
fig
```

The slopes of the consecutive segments are 2, -3, and 0.

## Setting knot locations manually

We can parametrize the spline with slopes of individual segments (the default, `marginal=FALSE`):

```
library(lspline)
m1 <- lm(y ~ lspline(x, c(5, 10)), data=d)
knitr::kable(broom::tidy(m1))
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.1343204 | 0.2148116 | 0.6252941 | 0.5325054 |
lspline(x, c(5, 10))1 | 1.9435458 | 0.0597698 | 32.5171747 | 0.0000000 |
lspline(x, c(5, 10))2 | -2.9666750 | 0.0503967 | -58.8664832 | 0.0000000 |
lspline(x, c(5, 10))3 | -0.0335289 | 0.0518601 | -0.6465255 | 0.5186955 |

Or parametrize with coefficients measuring the change in slope (with `marginal=TRUE`):

```
m2 <- lm(y ~ lspline(x, c(5,10), marginal=TRUE), data=d)
knitr::kable(broom::tidy(m2))
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.1343204 | 0.2148116 | 0.6252941 | 0.5325054 |
lspline(x, c(5, 10), marginal = TRUE)1 | 1.9435458 | 0.0597698 | 32.5171747 | 0.0000000 |
lspline(x, c(5, 10), marginal = TRUE)2 | -4.9102208 | 0.0975908 | -50.3143597 | 0.0000000 |
lspline(x, c(5, 10), marginal = TRUE)3 | 2.9331462 | 0.0885445 | 33.1262479 | 0.0000000 |

The coefficients are:

- `lspline(x, c(5, 10), marginal = TRUE)1` – the slope of the first segment
- `lspline(x, c(5, 10), marginal = TRUE)2` – the change in slope at knot *x* = 5; it changes from 2 to -3, so by -5
- `lspline(x, c(5, 10), marginal = TRUE)3` – the change in slope at knot *x* = 10; it changes from -3 to 0, so by 3

The two parametrisations (obviously) give identical predicted values:

```
all.equal( fitted(m1), fitted(m2) )
## [1] TRUE
```
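The segment slopes from the first parametrisation are, in turn, cumulative sums of the coefficients from the second. A quick check using the models fitted above:

```
# slopes of consecutive segments (m1) vs cumulated slope changes (m2)
all.equal(
  unname(cumsum(coef(m2)[-1])),
  unname(coef(m1)[-1])
)
## [1] TRUE
```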

Graphically:

```
fig +
geom_smooth(method="lm", formula=formula(m1), se=FALSE) +
geom_vline(xintercept = c(5, 10), linetype=2)
```

## Knots at `n` equal-width intervals

Function `elspline()` sets the knots at points dividing the range of `x` into `n` equal-width intervals.

```
m3 <- lm(y ~ elspline(x, 3), data=d)
knitr::kable(broom::tidy(m3))
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 3.5484817 | 0.4603827 | 7.707678 | 0.00e+00 |
elspline(x, 3)1 | 0.4652507 | 0.1010200 | 4.605529 | 7.40e-06 |
elspline(x, 3)2 | -2.4908385 | 0.1167867 | -21.328105 | 0.00e+00 |
elspline(x, 3)3 | 0.9475630 | 0.2328691 | 4.069080 | 6.84e-05 |

Graphically

```
fig +
geom_smooth(aes(group=1), method="lm", formula=formula(m3), se=FALSE, n=200)
```

## Knots at `q` quantiles of `x`

Function `qlspline()` sets the knots at points dividing the range of `x` into `q` equal-frequency intervals.

```
m4 <- lm(y ~ qlspline(x, 4), data=d)
knitr::kable(broom::tidy(m4))
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.0782285 | 0.3948061 | 0.198144 | 0.8431388 |
qlspline(x, 4)1 | 2.0398804 | 0.1802724 | 11.315548 | 0.0000000 |
qlspline(x, 4)2 | 1.2675186 | 0.1471270 | 8.615132 | 0.0000000 |
qlspline(x, 4)3 | -4.5846478 | 0.1476810 | -31.044273 | 0.0000000 |
qlspline(x, 4)4 | -0.4965858 | 0.0572115 | -8.679818 | 0.0000000 |

Graphically

```
fig +
geom_smooth(method="lm", formula=formula(m4), se=FALSE, n=200)
```

# Installation

Stable version from CRAN or development version from GitHub with:

```
install.packages("lspline")                                       # stable (CRAN)
devtools::install_github("mbojan/lspline", build_vignettes=TRUE)  # development (GitHub)
```

# Acknowledgements

Inspired by the Stata command `mkspline` and the function `ares::lspline` from Junger & Ponce de Leon (2011). As such, the implementation follows Greene (2003), chapter 7.5.2.

- Greene, William H. (2003) *Econometric analysis*. Pearson Education
- Junger & Ponce de Leon (2011) *ares: Environment air pollution epidemiology: a library for timeseries analysis*. R package version 0.7.2, retrieved from CRAN archives.

*English summary: On 2016-09-28 I presented an application of methods of estimating Exponential-family Random Graph Models from ego-centrically sampled data. Conference was in Polish. See the program and my slides for further details.*

Last week (September 28-30, 2016) the conference *Metodologiczne Inspiracje 2016. Badania ilościowe w naukach społecznych: wyzwania i problemy* (Methodological Inspirations 2016. Quantitative research in the social sciences: challenges and problems) took place, organized by the Institute of Philosophy and Sociology of the Polish Academy of Sciences and the Committee on Sociology of the Polish Academy of Sciences. The conference program can be viewed here.

In my presentation (*Coś z niczego, czyli czego możemy się dowiedzieć o sieciach społecznych na podstawie danych sondażowych* – "Something out of nothing, or what we can learn about social networks from survey data") I talked about methods of estimating ERG models from data collected ego-centrically. The slides can be downloaded here (PDF).

As part of the ReCoN project we are currently conducting a series of in-depth interviews with Polish scholars about their collaborations. We decided to go for a board-and-pins technique to build respondents' collaboration ego-networks on the fly during the interviews. Here is one example network (names of alters are hidden):

Dominika wrote a short post on our experiences here.

I would like to (re)announce the workshop I will be giving on using R and 'igraph' for Social Network Analysis at the upcoming Sunbelt 2016 conference in Newport Beach. My goal is to provide a *gentle* and *practical* tour through the SNA functionality of the R package "igraph". The exact date of the Sunbelt workshop is Tuesday, April 5th, 8:00am – 2:30pm. Consult the Sunbelt workshop program for further details.

**Do note that March 7 is the deadline for Sunbelt workshop registrations!**

If you are not attending Sunbelt, I will be very happy to meet you at the workshop at the European Social Networks Conference (EUSN 2016), which will take place in Paris (June 14-17, 2016); see http://eusn2016.sciencesconf.org/84811 for further details.

See the bottom of this post for some more details on the workshop.

I hope to see you at Sunbelt or EUSN!

As a side note, at EUSN I will also be co-teaching two 'statnet' workshops, which will be announced separately.

## Using R and igraph for Social Network Analysis

The workshop introduces R and the package igraph for social network data manipulation, visualization, and analysis. Package igraph is a collection of efficient tools for storing, manipulating, visualizing, and analyzing network data. Igraph is in part an alternative, in part a complement, to other SNA-related R packages (e.g. statnet, tnet). It is an alternative when it comes to network data manipulation and visualization. It is a complement because of its large and growing collection of algorithms, including community detection methods, unavailable elsewhere. The material will cover:

- Brief introduction to R.
- Creating and manipulating network data objects.
- Working with node and tie attributes.
- Creating network visualizations.
- A tour through computing selected SNA methods including: degree distribution, centrality measures, shortest paths, connected components, quantifying homophily/segregation, network community detection.
- Connections to other R packages for SNA, e.g.: statnet, RSiena, egonetR.

The focus is on analysis of complete network data and providing prerequisites for other workshops including two on ego-network analysis: “Introduction to ego-network analysis” by Raffaele Vacca and “Simplifying advanced ego-network analysis in R with egonetR” by Till Krenz and Andreas Herz.

The workshop has been successfully organized at earlier Sunbelt conferences (since Sunbelt 2011) and at the European Social Networks conference (EUSN 2014). It has attracted a lot of attention (over 130 participants in total since 2011) and positive feedback (80% report being satisfied, 75% would recommend the workshop to a colleague). The earlier workshop title was "Introduction to Social Network Analysis with R". The content has been updated to catch up with the newest developments in igraph and related packages.

Target audience and requirements:

The workshop is designed to be accessible for people who have limited experience with R. The participants are expected to be familiar with basic R objects (e.g. matrices and data frames) and functions (e.g., reading data, computing basic statistics, basic visualization). A brief introduction to R will be provided. To be absolutely on the safe side, we recommend taking an internet course on the level of the R programming course on Coursera (https://www.coursera.org/course/rprog), which you can take every month, or skimming through a book on the level of the initial eight sections of Roger D. Peng's book "R programming" (https://leanpub.com/rprogramming). Participants are encouraged to bring their own laptops. We have prepared examples and exercises to be completed during the workshop. Detailed instructions on how to prepare will be distributed in due time.

Last Tuesday (2015-11-10) I gave a presentation on social network analysis at the Data Science Warsaw meet-up. Data Science Warsaw is a series of regular meetings for people interested in data analysis, in business as well as in academia. I recommend it to anyone interested.

Slides from my (short) talk can be found here.

**Summary in English**: We are organizing a two-day workshop on network analysis in R. The dates are 2-3 of December, 2015. The workshop will be in Polish. For more information and registration see this page.

We invite you to a training on network analysis in R on December 2-3, 2015.

Social Network Analysis (SNA) is an approach to studying communities of people or organizations by analyzing the relations or ties between them. These relations form complex networks. People or organizations can enter into various kinds of ties, e.g. kinship, friendship, collaboration, information seeking, power, but also co-participation in events, etc. The goal of SNA is to analyze the structure of such networks of relations and their significance for the functioning of the community in question.

The goal of the workshop is to introduce SNA and teach basic techniques of graph analysis and visualization using R. Topics covered include: methods of collecting network data, representations of network data, visualization, an overview of descriptive measures characterizing positions in a network, methods of identifying groups and positions as well as properties of the network as a whole, and analyses combining data on relations with node attributes.

More information and registration details at http://www.icm.edu.pl/web/guest/wprowadzenie-do-analizy-sieciowej-w-r.

A 20% discount for those who register before November 23!

Welcome!

Within the last few weeks the website of the RECON project has been updated. Among other things, we have uploaded a couple of presentations that were given in 2014 and 2015. Below is a short list. See the Publications page on RECON's webpage for a complete list with abstracts.

- Czerniawska D., Fenrich W., Bojanowski M. (2015) *How does scholarly cooperation occur and how does it manifest itself? Evidence from Poland*. Presentation at the ESA 2015 conference. PDF slides
- Czerniawska D. (2015) *Paths to interdisciplinarity: How do scholars start working on the edges of disciplines?* Presentation at 'What makes interdisciplinarity work? Crossing academic boundaries in real life', Ustinov College, Durham University. HTML slides
- Fenrich W., Czerniawska D., Bojanowski M. (2015) *The story behind the graph: a mixed method study of scholarly collaboration networks in Poland*. Presentation at Sunbelt XXXV. HTML slides

In data analysis it sometimes happens that it is necessary to use *weights*. Contexts that come to mind include:

- Analysis of data from complex surveys, e.g. stratified samples. Sample inclusion probabilities might have been unequal and thus observations from different strata should have different weights.
- Application of propensity score weighting e.g. to correct data being Missing At Random (MAR).
- Inverse-variance weighting (https://en.wikipedia.org/wiki/Inverse-variance_weighting) when different observations have been measured with different precision which is known a priori.
- We are analyzing data in an aggregated form such that the weight variable encodes how many original observations each row in the aggregated data represents.
- We are given survey data with post-stratification weights.

If you use, or have been using, SPSS you probably know about the possibility of defining one of the variables as weights. This information is used when producing cross-tabulations (cells contain sums of weights), regression models, and so on. SPSS weights are *frequency weights* in the sense that $w_i$ is the number of observations that a particular case $i$ represents.
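With frequency weights, an aggregated data set can always be expanded back into individual rows. A minimal base-R sketch with made-up data:

```
# each row of 'agg' represents w original observations
agg <- data.frame(x = c(1, 2), y = c(5, 8), w = c(3, 2))
ind <- agg[rep(seq_len(nrow(agg)), agg$w), c("x", "y")]
nrow(ind)  # 5 rows, as 3 + 2 = 5
```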

On the other hand, in R the `lm` and `glm` functions have a `weights` argument that serves a related purpose.

```
suppressMessages(local({
  library(dplyr)
  library(ggplot2)
  library(survey)
  library(knitr)
  library(tidyr)
  library(broom)
}))
```

Let’s compare different ways in which a linear model can be fitted to data with weights. We start by generating some artificial data:

```
set.seed(666)
N <- 30 # number of observations
# Aggregated data
aggregated <- data.frame(x=1:5) %>%
  mutate(
    y = round(2 * x + 2 + rnorm(length(x))),
    freq = as.numeric(table(sample(1:5, N, replace=TRUE, prob=c(.3, .4, .5, .4, .3))))
  )
aggregated
```

```
## x y freq
## 1 1 5 4
## 2 2 8 5
## 3 3 8 8
## 4 4 12 8
## 5 5 10 5
```

```
# Disaggregated data
individuals <- aggregated[ rep(1:5, aggregated$freq) , c("x", "y") ]
```

Visually:

```
ggplot(aggregated, aes(x=x, y=y, size=freq)) + geom_point() + theme_bw()
```

Let’s fit some models:

```
models <- list(
  ind_lm      = lm(y ~ x, data=individuals),
  raw_agg     = lm(y ~ x, data=aggregated),
  ind_svy_glm = svyglm(y ~ x, design=svydesign(id=~1, data=individuals),
                       family=gaussian()),
  ind_glm     = glm(y ~ x, family=gaussian(), data=individuals),
  wei_lm      = lm(y ~ x, data=aggregated, weight=freq),
  wei_glm     = glm(y ~ x, data=aggregated, family=gaussian(), weight=freq),
  svy_glm     = svyglm(y ~ x, design=svydesign(id=~1, weights=~freq, data=aggregated),
                       family=gaussian())
)
```

```
## Warning in svydesign.default(id = ~1, data = individuals): No weights or
## probabilities supplied, assuming equal probability
```

In short, we have the following linear models:

- `ind_lm` is an OLS fit to the individual data (the *true* model).
- `raw_agg` is an OLS fit to the aggregated data (definitely wrong).
- `ind_glm` is a ML fit to the individual data.
- `ind_svy_glm` is a ML fit to the individual data using a simple random sampling with replacement design.
- `wei_lm` is an OLS fit to the aggregated data with frequencies as weights.
- `wei_glm` is a ML fit to the aggregated data with frequencies as weights.
- `svy_glm` is a ML fit to the aggregated data using the "survey" package, with frequencies as weights in the sampling design.

We would expect that models `ind_lm`, `ind_glm`, and `ind_svy_glm` will be identical.

Summarise the models and gather the results in long format:

```
results <- do.call("rbind", lapply(names(models), function(n) cbind(model=n, tidy(models[[n]])))) %>%
  gather(stat, value, -model, -term)
```

Check if point estimates of model coefficients are identical:

```
results %>% filter(stat=="estimate") %>%
  select(model, term, value) %>%
  spread(term, value)
```

```
## model (Intercept) x
## 1 ind_lm 4.33218 1.474048
## 2 raw_agg 4.40000 1.400000
## 3 ind_svy_glm 4.33218 1.474048
## 4 ind_glm 4.33218 1.474048
## 5 wei_lm 4.33218 1.474048
## 6 wei_glm 4.33218 1.474048
## 7 svy_glm 4.33218 1.474048
```

Apart from the "wrong" `raw_agg` model, the coefficients are identical across the models.

Let’s check the inference:

```
# Standard Errors
results %>% filter(stat=="std.error") %>%
  select(model, term, value) %>%
  spread(term, value)
```

```
## model (Intercept) x
## 1 ind_lm 0.652395 0.1912751
## 2 raw_agg 1.669331 0.5033223
## 3 ind_svy_glm 0.500719 0.1912161
## 4 ind_glm 0.652395 0.1912751
## 5 wei_lm 1.993100 0.5843552
## 6 wei_glm 1.993100 0.5843552
## 7 svy_glm 1.221133 0.4926638
```

```
# p-values
results %>% filter(stat=="p.value") %>%
  mutate(p=format.pval(value)) %>%
  select(model, term, p) %>%
  spread(term, p)
```

```
## model (Intercept) x
## 1 ind_lm 3.3265e-07 2.1458e-08
## 2 raw_agg 0.077937 0.068904
## 3 ind_svy_glm 2.1244e-09 2.1330e-08
## 4 ind_glm 3.3265e-07 2.1458e-08
## 5 wei_lm 0.118057 0.085986
## 6 wei_glm 0.118057 0.085986
## 7 svy_glm 0.038154 0.058038
```

Recall that the correct model is `ind_lm`. Observations:

- `raw_agg` is clearly wrong, as expected.
- Should the `weights` argument to `lm` and `glm` implement frequency weights, the results for `wei_lm` and `wei_glm` would be identical to those from `ind_lm`. As it is, only the point estimates are correct; all the inference statistics are not.
- The model using a design with sampling weights (`svy_glm`) gives correct point estimates, but incorrect inference.
- Surprisingly, the model fit with the "survey" package to the individual data using a simple random sampling design (`ind_svy_glm`) does not give inference statistics identical to those from `ind_lm`. They are close, though.

Through their `weights` argument, the functions `lm` and `glm` implement *precision weights*: inverse-variance weights that can be used to model the differential precision with which the outcome variable was measured.

Functions in the "survey" package implement *sampling weights*: the inverse of the probability that a particular observation is selected from the population into the sample.

Frequency weights are a different animal.

However, it is possible to get correct inference statistics for the model fitted to aggregated data using `lm` with frequency weights supplied as `weights`. What needs correcting is the number of residual degrees of freedom (see also http://stackoverflow.com/questions/10268689/weighted-regression-in-r):

```
models$wei_lm_fixed <- models$wei_lm
models$wei_lm_fixed$df.residual <-
  with(models$wei_lm_fixed, sum(weights) - length(coefficients))
results <- do.call("rbind", lapply(names(models), function(n) cbind(model=n, tidy(models[[n]])))) %>%
  gather(stat, value, -model, -term)
```

```
## Warning in summary.lm(x): residual degrees of freedom in object suggest
## this is not an "lm" fit
```

```
# Coefficients
results %>% filter(stat=="estimate") %>%
  select(model, term, value) %>%
  spread(term, value)
```

```
## model (Intercept) x
## 1 ind_lm 4.33218 1.474048
## 2 raw_agg 4.40000 1.400000
## 3 ind_svy_glm 4.33218 1.474048
## 4 ind_glm 4.33218 1.474048
## 5 wei_lm 4.33218 1.474048
## 6 wei_glm 4.33218 1.474048
## 7 svy_glm 4.33218 1.474048
## 8 wei_lm_fixed 4.33218 1.474048
```

```
# Standard Errors
results %>% filter(stat=="std.error") %>%
  select(model, term, value) %>%
  spread(term, value)
```

```
## model (Intercept) x
## 1 ind_lm 0.652395 0.1912751
## 2 raw_agg 1.669331 0.5033223
## 3 ind_svy_glm 0.500719 0.1912161
## 4 ind_glm 0.652395 0.1912751
## 5 wei_lm 1.993100 0.5843552
## 6 wei_glm 1.993100 0.5843552
## 7 svy_glm 1.221133 0.4926638
## 8 wei_lm_fixed 0.652395 0.1912751
```

See model `wei_lm_fixed`. Thus, correcting the degrees of freedom manually gives correct coefficient estimates as well as correct inference statistics.
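As a sanity check using the objects above, the corrected residual degrees of freedom should match those of the individual-level fit, i.e. the sum of the frequency weights minus the number of coefficients (30 - 2 = 28):

```
# residual df after the manual correction vs the individual-level model
models$wei_lm_fixed$df.residual
df.residual(models$ind_lm)
```

Both should equal 28.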

# Performance

Aggregating data and using frequency weights can save quite some time. To illustrate, let's generate a large data set in disaggregated and aggregated forms.

```
N <- 10^4 # number of observations
# Aggregated data
big_aggregated <- data.frame(x=1:5) %>%
  mutate(
    y = round(2 * x + 2 + rnorm(length(x))),
    freq = as.numeric(table(sample(1:5, N, replace=TRUE, prob=c(.3, .4, .5, .4, .3))))
  )
# Disaggregated data
big_individuals <- big_aggregated[ rep(1:5, big_aggregated$freq), c("x", "y") ]
```

… and fit `lm` models, weighting the one fitted to the aggregated data by `freq`. Benchmarking:

```
library(microbenchmark)
speed <- microbenchmark(
  big_individual = lm(y ~ x, data=big_individuals),
  big_aggregated = lm(y ~ x, data=big_aggregated, weights=freq)
)
speed %>% group_by(expr) %>%
  summarise(median = median(time / 1000)) %>%
  mutate(ratio = median / median[1])
```

```
## Source: local data frame [2 x 3]
##
## expr median ratio
## 1 big_individual 7561.158 1.0000000
## 2 big_aggregated 1492.057 0.1973319
```

So quite an improvement. The improvement is likely to be even bigger the more we are able to aggregate the data.

The journal Basic and Applied Social Psychology (BASP) has banned the use of statistical hypothesis testing.

See the BASP editorial by Trafimow and Marks here.

The story has also been covered by:

And discussed in/by, among others:

Where this will go, I wonder…