Statistical Analysis of Corpus Data with R
A Gentle Introduction for Computational Linguists and Similar Creatures
Statistical Analysis of Corpus Data with R is an online course by Marco Baroni and Stefan Evert. It is based on a number of previous courses on similar topics taught together by the authors, in particular the course on R Programming for (Computational) Linguists given at the DGfS Fall School in Computational Linguistics (Potsdam, 2007).
News:
The SIGIL course is currently being restructured – a new Web page will be launched soon. You can download updated versions of some of the slides below, but they need new versions of the support packages that are not available on CRAN yet. Please follow the instructions below.
New slide sets (work in progress)
Notice:
R code examples in the slide sets below make use of functions and data sets in two support packages as well as data sets available as separate files, depending on whether the slides have been updated yet. Please install the following software and data:
 The SIGIL package (version 0.10), which is not yet available on CRAN: Windows – Mac OS X – source code (Linux users should download and install the source code version). This package is required for all updated slide sets.
 The corpora package (version 0.43), available on CRAN for R 2.15.0 and newer. This package is required for old slide sets and provides additional functionality used in some new slide sets. It is recommended to install version 0.5 (not yet available on CRAN) in order to avoid duplicates of data sets now included in the SIGIL package: Windows – Mac OS X – source code
 If you're going to work with separate data files (as some of the old slide sets do), you may want to get the ZIP archive with most data sets (2.9 MiB) instead of downloading each file separately below.
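The installation steps above might look roughly as follows in R. This is only a sketch: the helper name install_support_packages and its argument sigil_file are made up for illustration, and sigil_file must point to wherever you saved the SIGIL package file downloaded from this page (its actual file name may differ).

```r
# Sketch of installing the two support packages.
# 'sigil_file' is a hypothetical local path to the downloaded SIGIL
# package file (Windows, Mac OS X or source version).
install_support_packages <- function(sigil_file) {
  # corpora is available on CRAN, so a regular install works;
  # this fetches whichever version CRAN currently offers
  if (!requireNamespace("corpora", quietly = TRUE)) {
    install.packages("corpora")
  }
  # SIGIL is not on CRAN yet: install from the local file instead
  # (repos = NULL tells R not to look for the package in a repository)
  install.packages(sigil_file, repos = NULL)
  invisible(TRUE)
}

# Example call (the path is an assumption, adjust it to your download):
# install_support_packages("~/Downloads/SIGIL_0.10.tar.gz")
```

A downloaded copy of corpora version 0.5 could presumably be installed the same way as the SIGIL file, by passing its local path to install.packages() with repos = NULL.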
SIGIL course units

Unit 1: General introduction / First steps in R
(updated on 12.07.2015) new

Unit 2: Corpus frequency data & statistical inference
(updated on 20.06.2016) new

Unit 3: Descriptive and inferential statistics for continuous data

Unit 4: Collocations & contingency tables

Unit 5: Word frequency distributions and Zipf's law: Using add-on packages
(updated on 23.06.2016) new

Unit 6: Regression and the general linear model

Unit 7: Exploratory data analysis: Clustering, visualisation & machine learning

Unit 8: The non-randomness of corpus data & generalised linear models
(updated on 26.03.2010)

Unit 9: Inter-annotator agreement
Old version of the SIGIL course
 Introduction (slides, handout)
 Hypothesis tests for corpus frequency data (slides, handout)
 Word frequency distributions with zipfR (slides, handout)
 Clustering and dimensionality reduction (slides, handout, data sets)
 Using statistical association measures for collocation extraction
  Part 1: contingency tables and association scores (slides, handout)
  Part 2: large-scale processing and evaluation (slides, handout)
 The limitations of random sampling methods (slides, handout)
 A short introduction to the mathematics of regression and linear models (slides, handout, R examples)
 Statistical models
 Collected R code (ZIP archive) from handouts
 Some other sample R scripts (ZIP archive) with detailed comments
Data sets
 brown.stats.txt (basic type-token statistics for the Brown corpus)
 lob.stats.txt (basic type-token statistics for the LOB corpus)
 bnc_metadata.tbl* (metadata information from the British National Corpus)
 bigrams.100k.spc (frequency spectrum of bigrams from the first 100k tokens of Brown)
 bigrams.100k.tfl (type frequency list of bigrams from the first 100k tokens of Brown)
 bigrams.vgc (vocabulary growth curve of bigrams in the Brown corpus)
 comp.stats.txt* (distributional information for different types of Italian noun-noun compounds)
 brown_bigrams.tbl (bigram collocations in the Brown corpus, with full contingency tables)
 krenn_pp_verb.tbl* (German PP-verb collocations with manual MWE annotation)
 bnc_gender_small.tbl (data set for identification of author gender in the BNC)
Download ZIP archive with all data sets (2.9 MB).
* These files contain Unicode strings with accented characters. If you are running R on a Windows computer, specify the option encoding="UTF-8" when loading the files with read.delim() in order to handle such strings correctly.
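As a small illustration of the encoding option, the sketch below writes a tiny tab-delimited file containing accented characters and reads it back with read.delim(). The file contents and column names are invented for this example and merely stand in for one of the starred data files above.

```r
# Create a small UTF-8 encoded, tab-delimited file in a temporary location.
# The two data rows are made-up examples, not taken from the course data.
tmp <- tempfile(fileext = ".tbl")
lines <- c("word\tfreq", "caff\u00e8\t12", "citt\u00e0\t7")
writeLines(enc2utf8(lines), tmp, useBytes = TRUE)

# Load the table, declaring that its strings are UTF-8 encoded
tbl <- read.delim(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)
print(tbl)

# Shows how R has marked the encoding of each string
print(Encoding(tbl$word))

unlink(tmp)  # remove the temporary file
```

For the real data sets, the temporary file would simply be replaced by the path to the downloaded file, e.g. read.delim("krenn_pp_verb.tbl", encoding="UTF-8").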
Exercises