Statistical Analysis of Corpus Data with R is an online course by Marco Baroni and Stefan Evert. It is based on a number of previous courses on similar topics taught together by the authors, in particular the course on R Programming for (Computational) Linguists given at the DGfS Fall School in Computational Linguistics (Potsdam, 2007).
The SIGIL course is currently being restructured – a new Web page will be launched when a stable state has been reached. You can download updated versions of most of the course units below, but there may be some inconsistencies while the support packages and data sets are reorganised.
New slide sets (work in progress)
R code examples in the slide sets below use functions and data sets that are either included in a supporting R package or provided as separate files, depending on whether the slides have already been updated. Please install the following software and data:
- corpora package (version 0.5 or newer), which is available on CRAN and can be installed with any standard R package manager.
- Any additional data and code files required by the unit you're studying. These are listed together with the handouts and exercises below. You can also download a ZIP archive with most data sets (2.9 MiB).
Some slides may still refer to data sets in the SIGIL package, which was rejected by CRAN. Please use the corpora package instead, making sure that you have installed version 0.5 or newer.
It is recommended that you put all data and code files in an RStudio project directory (or your current working directory). All code examples in the slides and exercises will make this assumption.
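The setup described above can be sketched as a few lines of R. This is only an illustrative sketch: the package name and minimum version come from the instructions above, while the working-directory check is an assumption about how you have organised your project.

```r
# Install the corpora package from CRAN if it is not yet available,
# then verify that the installed version is at least 0.5.
if (!requireNamespace("corpora", quietly = TRUE)) {
  install.packages("corpora")
}
stopifnot(packageVersion("corpora") >= "0.5")
library(corpora)

# The code examples assume that data and code files are located in the
# current working directory (e.g. an RStudio project directory).
getwd()
```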
SIGIL course units
Unit 1: General introduction / First steps in R
(updated on 12.07.2015)
Unit 2: Corpus frequency data & statistical inference
(updated on 20.06.2016)
Unit 3: Descriptive and inferential statistics for continuous data
Unit 4: Collocations & contingency tables
Unit 5: Word frequency distributions and Zipf's law (using add-on packages)
(updated on 23.06.2016)
Unit 6: Regression and the general linear model
Unit 7: Multivariate analysis
(new on 05.09.2018)
slides (13 MiB) – handout (14 MiB) – PowerPoint (72 MiB, with embedded animations)
slides (0.8 MiB) – print version (0.8 MiB) (updated on 16.09.2018)
07_project.zip (5.0 MiB; ZIP archive including all code & data) (updated on 27.10.2018)
Unit 8: The non-randomness of corpus data & generalised linear models
(updated on 26.03.2010)
Unit 9: Inter-annotator agreement
Old version of the SIGIL course
- Hypothesis tests for corpus frequency data
- Word frequency distributions with zipfR
- Clustering and dimensionality reduction
- Using statistical association measures for collocation extraction
- Part 1: contingency tables and association scores
- Part 2: large-scale processing and evaluation
- The limitations of random sampling methods
- A short introduction to the mathematics of regression and linear models
- Statistical models
- Collected R code (ZIP archive) from handouts
- Some other sample R scripts (ZIP archive) with detailed comments
- brown.stats.txt (basic type-token statistics for the Brown corpus)
- lob.stats.txt (basic type-token statistics for the LOB corpus)
- bnc_metadata.tbl* (metadata information from the British National Corpus)
- bigrams.100k.spc (frequency spectrum of bigrams from the first 100k tokens of Brown)
- bigrams.100k.tfl (type frequency list of bigrams from the first 100k tokens of Brown)
- bigrams.vgc (vocabulary growth curve of bigrams in the Brown corpus)
- comp.stats.txt* (distributional information for different types of Italian noun-noun compounds)
- brown_bigrams.tbl (bigram collocations in the Brown corpus, with full contingency tables)
- krenn_pp_verb.tbl* (German PP-verb collocations with manual MWE annotation)
- bnc_gender_small.tbl (data set for identification of author gender in the BNC)
Download ZIP archive with all data sets (2.9 MiB).
* These files contain Unicode strings with accented characters. If you are running R on a Windows computer, specify the option encoding="UTF-8" when loading the files with read.delim() in order to handle such strings correctly.
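To illustrate the encoding option in use, the sketch below writes a small tab-separated file in UTF-8 and reads it back with read.delim(). The file name and contents are invented for the demo and are not part of the course data sets.

```r
# Write a small tab-separated file in UTF-8, then load it with
# read.delim() and the encoding="UTF-8" option, as recommended above.
tmp <- tempfile(fileext = ".tbl")          # stand-in for a starred data file
con <- file(tmp, open = "w", encoding = "UTF-8")
writeLines(c("word\tfreq", "caf\u00e9\t3", "na\u00efve\t1"), con)
close(con)

tbl <- read.delim(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)
print(tbl$word)   # the accented strings are handled correctly
```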