Statistical Analysis of Corpus Data with R is an online course by Marco Baroni and Stefan Evert. It is based on a number of previous courses on similar topics taught together by the authors, in particular the course on R Programming for (Computational) Linguists given at the DGfS Fall School in Computational Linguistics (Potsdam, 2007).

News: The SIGIL course is currently being restructured, and a new Web page will be launched soon. You can download updated versions of some of the slides below, but they require new versions of the support packages that are not yet available on CRAN. Please follow the instructions below.

New slide sets (work in progress)

Notice: R code examples in the slide sets below use functions and data sets from two support packages, or data sets available as separate files, depending on whether the slides have been updated yet. Please install the following software and data:

SIGIL course units

Old version of the SIGIL course

Data sets

Download ZIP archive with all data sets (2.9 MB).

* These files contain Unicode strings with accented characters. If you are running R on a Windows computer, specify the option encoding="UTF-8" when loading the files with read.delim() in order to handle such strings correctly.
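As a minimal sketch of the note above, the following R code writes a small tab-delimited file containing accented characters and reads it back with encoding="UTF-8". The file contents and the column names (word, freq) are invented for illustration and are not part of the course data sets; substitute the path of the file you actually downloaded.

```r
# Hypothetical sample data: a tab-delimited table with accented characters.
# The \u escapes produce strings that R marks as UTF-8.
lines <- c("word\tfreq", "caf\u00e9\t12", "na\u00efve\t3")

# Write the bytes unchanged to a temporary file, so the file is UTF-8
# regardless of the platform's native encoding.
tmp <- tempfile(fileext = ".tbl")
writeLines(lines, tmp, useBytes = TRUE)

# Load the table, declaring the strings to be UTF-8 as recommended above.
d <- read.delim(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)

print(d)
print(Encoding(d$word))
```

Without the encoding option, R on Windows may interpret the bytes in the platform's native encoding, so accented characters can show up garbled.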