Statistical Analysis of Corpus Data with R is an online course by Marco Baroni and Stefan Evert. It is based on a number of previous courses on similar topics taught together by the authors, in particular the course on R Programming for (Computational) Linguists given at the DGfS Fall School in Computational Linguistics (Potsdam, 2007).

News: The SIGIL course is currently being restructured, and a new Web page will be launched when a stable state has been reached. You can download updated versions of most of the course units below, but there may be some inconsistencies while the support packages and data sets are reorganised.

New slide sets (work in progress)

R code examples in the slide sets below make use of functions and data sets included in a supporting R package or available as separate files, depending on whether the slides have been updated yet. Please install the following software and data:

We recommend putting all data and code files in an RStudio project directory (or in your current working directory). All code examples in the slides and exercises assume that the files can be found there.

SIGIL course units

Old version of the SIGIL course

Data sets

Download ZIP archive with all data sets (2.9 MB).

* These files contain Unicode strings with accented characters. If you are running R on a Windows computer, pass the argument encoding="UTF-8" to read.delim() when loading these files so that such strings are handled correctly.
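
As a minimal sketch of the encoding note above, the following R code writes a small tab-delimited table with accented characters to a temporary file and reads it back with read.delim(). The file contents and the column names word and freq are made up for illustration and are not taken from the course data sets; with a real data file, only the read.delim() call would be needed.

```r
# Hypothetical sample data: a tiny tab-delimited table with accented characters.
# (Column names and words are illustrative, not from the SIGIL data sets.)
sample_lines <- c("word\tfreq", "caf\u00e9\t12", "na\u00efve\t3")

# Write the lines as raw UTF-8 bytes to a temporary file,
# so that the sketch does not depend on the current locale.
tmp <- tempfile(fileext = ".tbl")
writeLines(enc2utf8(sample_lines), tmp, useBytes = TRUE)

# encoding="UTF-8" tells R to treat the strings it reads as UTF-8
tab <- read.delim(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)

print(tab)
print(Encoding(tab$word))  # shows how R has marked the strings it read

unlink(tmp)  # remove the temporary file
```

If the accented characters show up garbled in the printed table, the file was most likely read without the encoding argument or is not actually stored as UTF-8.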

Exercises