Statistical Analysis of Corpus Data with R is an online course by Marco Baroni and Stefan Evert. It is based on a number of previous courses on similar topics taught together by the authors, in particular the course on R Programming for (Computational) Linguists given at the DGfS Fall School in Computational Linguistics (Potsdam, 2007).
The SIGIL course is currently being restructured – a new Web page will be launched when a stable state has been reached. You can download updated versions of most of the course units below, but there may be some inconsistencies while the support packages and data sets are reorganised.
New slide sets (work in progress)
R code examples in the slide sets below use functions and data sets that are either included in a supporting R package or provided as separate files, depending on whether the slides have already been updated. Please install the following software and data:
- corpora package (version 0.5 or newer), which is available on CRAN and can be installed with any standard R package manager.
- Any additional data and code files required by the unit you're studying. These are listed together with the handouts and exercises below. You can also download a ZIP archive with most data sets (2.9 MiB).
Some slides may still refer to data sets in the SIGIL package, which was rejected by CRAN. Please use the corpora package instead, making sure that you have installed version 0.5 or newer.
It is recommended that you put all data and code files in an RStudio project directory (or your current working directory). All code examples in the slides and exercises will make this assumption.
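The setup described above can be sketched as a few lines of R. This is only an illustrative sketch: the package name and minimum version come from the instructions above, while the working-directory check is an assumption about how you have organised your project.

```r
# Install the corpora package from CRAN if it is not yet available,
# then verify that the installed version is at least 0.5.
if (!requireNamespace("corpora", quietly = TRUE)) {
  install.packages("corpora")
}
stopifnot(packageVersion("corpora") >= "0.5")
library(corpora)

# The code examples assume that data and code files are located in the
# current working directory (e.g. an RStudio project directory).
getwd()
```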
SIGIL course units
Unit 1: General introduction / First steps in R
(updated on 12.07.2015)
Unit 2: Corpus frequency data & statistical inference
(updated on 20.06.2016)
Unit 3: Descriptive and inferential statistics for continuous data
Unit 4: Collocations & contingency tables
Unit 5: Word frequency distributions and Zipf's law (using add-on packages)
(updated on 23.06.2016)
Unit 6: Regression and the general linear model
Unit 7: Multivariate analysis
(new on 05.09.2018)
slides (13 MiB) – handout (14 MiB) – PowerPoint (72 MiB, with embedded animations)
slides (0.8 MiB) – print version (0.8 MiB) (updated on 16.09.2018)
07_project.zip (5.0 MiB; ZIP archive including all code & data) (updated on 27.10.2018)
Unit 8: The non-randomness of corpus data & generalised linear models
(updated on 26.03.2010)
Unit 9: Inter-annotator agreement
Old version of the SIGIL course
- Hypothesis tests for corpus frequency data
- Word frequency distributions with zipfR
- Clustering and dimensionality reduction
- Using statistical association measures for collocation extraction
- Part 1: contingency tables and association scores
- Part 2: large-scale processing and evaluation
- The limitations of random sampling methods
- A short introduction to the mathematics of regression and linear models
- Statistical models
- Collected R code (ZIP archive) from handouts
- Some other sample R scripts (ZIP archive) with detailed comments
- brown.stats.txt (basic type-token statistics for the Brown corpus)
- lob.stats.txt (basic type-token statistics for the LOB corpus)
- bnc_metadata.tbl* (metadata information from the British National Corpus)
- bigrams.100k.spc (frequency spectrum of bigrams from the first 100k tokens of Brown)
- bigrams.100k.tfl (type frequency list of bigrams from the first 100k tokens of Brown)
- bigrams.vgc (vocabulary growth curve of bigrams in the Brown corpus)
- comp.stats.txt* (distributional information for different types of Italian noun-noun compounds)
- brown_bigrams.tbl (bigram collocations in the Brown corpus, with full contingency tables)
- krenn_pp_verb.tbl* (German PP-verb collocations with manual MWE annotation)
- bnc_gender_small.tbl (data set for identification of author gender in the BNC)
Download ZIP archive with all data sets (2.9 MiB).
* These files contain Unicode strings with accented characters. If you are running R on a Windows computer, specify the option encoding="UTF-8" when loading the files with read.delim() in order to handle such strings correctly.
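To illustrate the encoding option in use, the sketch below writes a small tab-separated file in UTF-8 and reads it back with read.delim(). The file name and contents are invented for the demo and are not part of the course data sets.

```r
# Write a small tab-separated file in UTF-8, then load it with
# read.delim() and the encoding="UTF-8" option, as recommended above.
tmp <- tempfile(fileext = ".tbl")          # stand-in for a starred data file
con <- file(tmp, open = "w", encoding = "UTF-8")
writeLines(c("word\tfreq", "caf\u00e9\t3", "na\u00efve\t1"), con)
close(con)

tbl <- read.delim(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)
print(tbl$word)   # the accented strings are handled correctly
```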