perldoc Stefan::Evert
Computational Corpus Linguistics
An introduction to computational linguistics as well as software tools for corpus linguistics and digital humanities, with a strong practical component. This module is mainly attended by students in the Master programmes Literary and Linguistic Computing and Computer Science. It consists of three coordinated courses, ranging from lectures on theoretical background to student presentations, software demos, practical hands-on corpus work, and a student research workshop in the last week of term.
Introduction to Computational Linguistics is a compulsory lecture for first-year students in the Cognitive Science programme (2005-2006 with Graham Katz, 2008 with Peter Bosch, 2010 with Maria Cieschinger, 2011 with Stefan Hinterwimmer).
This course introduces the fundamental probabilistic techniques used in natural language processing. Topics include Markov models, weighted finite-state automata and transducers, probabilistic context-free grammars, the EM algorithm, statistical machine translation, collocations, and maximum entropy models. Video recordings of the lectures are available on the course homepage.
[video recordings: WS 2008/09, WS 2009/10 (incomplete)]
In this course, participants gain hands-on experience of grammar engineering, locating and using resources, the implementation of (statistical or symbolic) NLP algorithms, and the many practical problems involved in building a real-life system that introductory courses tend to gloss over.
Quantitative linguistic data - whether from a corpus, an eye-tracking study, some other psycholinguistic experiment, or a survey of speaker intuitions - have to be analyzed and explored with statistical tools in order to assess their significance, understand their structure, and reveal the properties and interconnections of the underlying phenomena. This seminar explores the most useful statistical methods available for this purpose, including hypothesis tests and correlation measures, clustering and classification algorithms, linear and generalized statistical models, and data visualization techniques. Participants will gain hands-on experience with real-world linguistic data, using the open-source statistical software R.
An interdisciplinary practicum organised together with the Neuroinformatics group (Martin Lauer), in which students gain hands-on experience in the application of supervised and unsupervised machine learning techniques to real-life problems (including natural language processing and time series prediction).
Interdisciplinary course held together with the Neuroinformatics group (Martin Lauer), focussing on vector space representations, data processing and dimensionality reduction techniques (SVD, PCA, LSA), which are used both in machine learning and in statistical natural language processing.
Quantifying Linguistic Experience is a hands-on introduction to statistical methods for the quantitative analysis of corpus frequency data, which can be understood as an approximative model for the linguistic experience of a human speaker. In addition to learning the necessary statistical theory, participants are taught how to apply it to real-world data using the statistical programming language R [http://www.r-project.org/].
Seminar on Word Frequency Distributions and their application to Computational Morphology (with Anke Lüdeling).
Introductory class on Formal Language Theory for 2nd year students (in German).
Introductory class on Statistical Methods for 2nd year students (in German).
One-week introductory course on the foundations of distributional semantic models (DSM), their evaluation, applications and practical implementation. This course puts special emphasis on hands-on exercises with the wordspace package for the statistical computing environment R. At the European Summer School on Logic, Language and Information in Bolzano, Italy (ESSLLI 2016) and Sofia, Bulgaria (ESSLLI 2018). An earlier version of the course was co-taught with Alessandro Lenci (U of Pisa) in Bordeaux, France (ESSLLI 2009).
Course materials: http://wordspace.collocations.de/doku.php/course:esslli2018:start
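The core idea behind a distributional semantic model can be sketched in a few lines. The example below is an illustration in plain Python, not the wordspace package or its API; the toy corpus, the window size and the function names are assumptions made for this sketch. It counts co-occurrences within a symmetric window and compares two words by the cosine of their co-occurrence vectors.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within +/- window tokens."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[target][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, already tokenized.
toy_corpus = [
    "the cat chased the mouse".split(),
    "the dog chased the cat".split(),
    "the mouse ate the cheese".split(),
]
vectors = cooccurrence_vectors(toy_corpus)
similarity = cosine(vectors["cat"], vectors["dog"])
```

Realistic models additionally weight the raw counts (for example with an association measure) and reduce dimensionality; both steps are omitted here.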
An introduction to statistical methods for the analysis of corpus data and their practical application with R [http://www.r-project.org/], co-developed with Marco Baroni (U Trento, Facebook Research).
Subsets of this course have been taught at the 5th LinC Summer School (Aachen, Germany, 2018), at the Asia-Pacific Corpus Linguistics Conference (APCLC 2018, Takamatsu, Japan), at the Autumn School on Variation in Linguistic Corpora (Berlin, Germany, 2018), at the Birmingham Summer School in Corpus Linguistics (2016), the Symposium on Methods and Linguistic Theories (Bamberg, Germany, 2015), at the University of Zurich (2010), at the 9th Summer School of the European Masters in Language and Speech Technology (Stuttgart, Germany, 2008), and at many other locations. The course was originally developed for the DGfS/CL Fall School 2007 in Potsdam, Germany.
Course materials: http://SIGIL.R-Forge.R-Project.org/
A two-week introduction to corpus-based approaches to lexical semantics with a focus on distributional semantics (using R and the new
lexical resources (WordNet),
word sense disambiguation,
sentiment analysis and corpus technology,
taught at the International Summer School in Advanced Language Engineering (ISSALE 2014),
Course materials: http://issale.ucsc.lk/?page_id=256
An overview of current research in computational lexical semantics, combining theoretical and methodological background with hands-on experience. One-week introductory course at the European Summer School on Logic, Language and Information (ESSLLI 2009), Bordeaux, France (with Gemma Boleda, UPC, Barcelona).
Course homepage: http://clseslli09.wordpress.com/
Introduction to Lexical Statistics and mathematical models of word frequency distributions (one-week introductory course) at the European Summer School on Logic, Language and Information (ESSLLI 2006), Malaga, Spain (with Marco Baroni, U of Bologna, Forlì). Slides can be downloaded from http://purl.org/stefan.evert/zipfR/.
One-week introductory course on Computational Approaches to Collocations at the European Summer School on Logic, Language and Information (ESSLLI 2003), Vienna, Austria (with Brigitte Krenn, OFAI). PowerPoint slides can be downloaded from http://www.collocations.de/EK/.
A tutorial on the statistical modelling of type-token distributions, combining mathematical background with practical exercises using the zipfR package for R. This tutorial has been taught at the 11th Language Resources and Evaluation Conference (LREC 2018, Miyazaki, Japan) and the Birmingham Summer School in Corpus Linguistics (2018).
Course materials: http://zipfr.r-forge.r-project.org/lrec2018.html
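As a rough illustration of the quantities a type-token analysis starts from, the sketch below computes the sample size N, the vocabulary size V and the frequency spectrum (the number of types V_m that occur exactly m times) for a token list. It is written in plain Python for illustration only and does not use zipfR; the function name and the sample text are assumptions.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Return (N, V, spectrum), where spectrum[m] is the number of types with frequency m."""
    type_freqs = Counter(tokens)                # frequency of each type
    spectrum = Counter(type_freqs.values())     # how many types share each frequency
    return len(tokens), len(type_freqs), dict(sorted(spectrum.items()))


# Hypothetical sample: "a" and "rose" occur 3 times each, "is" occurs twice.
sample = "a rose is a rose is a rose".split()
N, V, spectrum = frequency_spectrum(sample)
```

The spectrum is a compact summary: summing m * V_m over all m recovers N, and summing V_m recovers V.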
A tutorial introduction to statistical inference for corpus data taught in German at the Annual Meeting of the German Linguistics Association (DGfS) in Frankfurt, Germany (2012) and Potsdam, Germany (2013).
Course materials: http://wordspace.collocations.de/doku.php/corpus_tutorial:dgfs2013
A tutorial introduction to corpus processing, indexing and linguistic search taught in German at the Annual Meeting of the German Linguistics Association (DGfS) in Göttingen, Germany (2011) and Berlin, Germany (2010) as well as for the Doctoral Programme in Linguistics (LIPP) at LMU Munich, Germany (2011).
Course materials: http://wordspace.collocations.de/doku.php/corpus_tutorial
A block course taught for the Doctorate Programme in Linguistics at the University of Zürich (2010) and at the University of Saarbrücken (2009). It forms the basis for a restructured version of the SIGIL course (see above).
A 6-hour crash course on linear algebra and vector space models held at the Rovereto campus of the University of Trento, Italy, in March 2007. This course was part of an exchange funded by the Erasmus teacher mobility programme.
Handouts: http://purl.org/stefan.evert/PUB/Handout_LA_Trento_1.pdf (vector spaces), http://purl.org/stefan.evert/PUB/Handout_LA_Trento_2.pdf (distance, norm, kernel), http://purl.org/stefan.evert/PUB/Handout_LA_Trento_3.pdf (dimensions & PCA)
A novel architecture for text alignment (ATLAS) performs alignment at multiple levels in parallel (e.g. paragraph, sentence and word alignment) and is designed for easy integration of various linguistic knowledge sources.
Supervisors: Peter Bosch, Helmar Gust, Stefan Evert
A description of the KrdWrd Project, developed in cooperation with Johannes Steger and other students at the Institute of Cognitive Science. The goals of the project are (i) to provide tools and infrastructure for the acquisition, visual annotation, merging and storage of Web pages for the purpose of corpus building and content mining; (ii) to develop a classification engine that learns to annotate and clean Web pages automatically based on visual renderings of the pages; and (iii) to provide graphical tools for the manual inspection of annotation/cleaning results.
Coordinated with the MSc thesis Web Attention Technology: JAMF and KrdWrd by Johannes Steger (supervised by Peter König & Stefan Evert). The KrdWrd technology will form the basis of the second CLEANEVAL contest on boilerplate removal for the Web as Corpus, to be held in 2010.
KrdWrd project homepage: https://krdwrd.org/
Supervisors: Stefan Evert, Peter König
Part-of-speech (POS) tagging is often considered a "solved task" in computational linguistics, with state-of-the-art taggers reporting accuracies around 97%. However, this performance is achieved only for texts that are sufficiently similar to the training data, and may drop markedly when the tagger is applied to a different genre. This thesis evaluates three widely-used statistical taggers (TreeTagger, Stanford Tagger and Apache UIMA Tagger) on manually annotated samples from a German Web corpus. The results show the expected loss of accuracy, large differences between genres (such as online newspapers vs. discussion forums), and the importance of good probabilistic models for unknown words.
PDF version of the thesis: http://www.cogsci.uos.de/~CL/download/MSc_Giesbrecht2008.pdf
Supervisors: Stefan Evert, Marco Baroni
Experiments on the automatic detection of e-mail spam (UBE = unsolicited bulk e-mails) with various machine-learning algorithms (Naive Bayes, Maximum Entropy and Decision Trees). In contrast to the bag-of-words approach of most standard spam filters, these experiments focus on linguistic properties such as part-of-speech tags and phrase patterns, achieving good performance with comparatively low-dimensional feature spaces.
Supervisors: Veit Reuer, Stefan Evert
A study on unsupervised learning of German inflectional morphology, using readily available linguistic knowledge about regular inflectional paradigms (as provided by SMOR FST). Stem/class hypotheses are generated by a modified FST (Adolphs 2008) and then ranked and filtered (i) with an MDL algorithm based on corpus frequency data and (ii) with a heuristic method that uses Google queries to find out how many surface forms predicted by a hypothesis are attested on the Web.
PDF version of the thesis, software and data sets can be obtained here: http://www-lehre.inf.uos.de/~thkruege/downloads.html
Design of a game-theoretical model for the typological paradigms of person marking in pronoun systems, as an adaptation of Jäger's (2007) use of evolutionary game theory to explain case marking. Contains excellent concise summaries of evolutionary game theory and person marking (based on Cysouw 2003).
Supervisors: Stefan Evert, Helmar Gust
A thorough reanalysis of surprising findings from a corpus-based study of nation + noun expressions in English (contrasting the Adj-N and N-Prep-N constructions), which was carried out during an internship at the UPC Barcelona. The original observation is explained as a mathematical artefact, and the underlying core phenomenon is revealed (as a basis for linguistic interpretation).
PDF version of the thesis: http://www.cogsci.uos.de/~CL/download/BSc_Berndt_2009.pdf
Supervisors: Stefan Evert, Louise McNally
Development of a Java GUI toolkit for interactive exploration of "word spaces", i.e. distributional representations of the usage and meaning of a word. The toolkit is demonstrated and tested in a number of case studies.
Supervisor: Stefan Evert
Preliminary experiments on the completely unsupervised acquisition of patterns in natural and artificial languages by training recurrent neural networks (rNN) as auto-encoders. Results show that NN in general, and the standard training algorithms for rNN in particular, are not very well suited for the representation of symbolic structures. Better performance is obtained with hand-crafted networks using a special "Cantor encoding".
Supervisors: Helmar Gust, Stefan Evert
Experiments on the automatic identification of semantic relations between Wikipedia "concepts" (i.e. articles) based on information from hyperlinks between these articles. An XML-annotated corpus is compiled from the Wikipedia database, and training data are obtained by a semi-automatic mapping of Wikipedia articles to WordNet synsets.
PDF version of the thesis: http://www.cogsci.uos.de/~CL/download/BSc_Bauer2007.pdf
Supervisor: Stefan Evert
Several prototype-based supervised learning algorithms from the Learning Vector Quantization (LVQ) family are implemented in a Java library and evaluated with respect to their suitability for text classification tasks in computational linguistics. They are found to achieve high accuracy (comparable to support vector machines) on the task of genre classification for the British National Corpus.
PDF version of the thesis, software and data sets can be downloaded here: http://www.cogsci.uos.de/~CL/download/BSc_Gasthaus2007/
Supervisors: Stefan Evert, Martin Lauer
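For readers unfamiliar with the LVQ family, its simplest member (LVQ1) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Java library developed in the thesis: the constant learning rate, the toy data and all function names are hypothetical. Each training item pulls its nearest prototype closer if the labels agree and pushes it away otherwise; classification returns the label of the nearest prototype.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_prototype(prototypes, x):
    """Index of the prototype whose vector is closest to x."""
    return min(range(len(prototypes)),
               key=lambda i: squared_distance(prototypes[i][0], x))


def train_lvq1(data, prototypes, learning_rate=0.1, epochs=20):
    """LVQ1 with a constant learning rate (a simplification; rates are often decayed)."""
    protos = [(list(vec), label) for vec, label in prototypes]  # copy the inputs
    for _ in range(epochs):
        for x, y in data:
            vec, label = protos[nearest_prototype(protos, x)]
            # Attract the winner if its label matches, repel it otherwise.
            sign = 1.0 if label == y else -1.0
            for d in range(len(vec)):
                vec[d] += sign * learning_rate * (x[d] - vec[d])
    return protos


def classify(prototypes, x):
    """Label of the nearest prototype."""
    return prototypes[nearest_prototype(prototypes, x)][1]


# Hypothetical two-class toy data in two dimensions.
data = [((0.0, 0.1), "A"), ((0.2, 0.0), "A"), ((1.0, 0.9), "B"), ((0.9, 1.1), "B")]
initial = [((0.4, 0.4), "A"), ((0.6, 0.6), "B")]
trained = train_lvq1(data, initial)
```

For text classification the vectors would be document feature vectors rather than two-dimensional points; the update rule stays the same.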
Suffix trees, an efficient text-indexing data structure used in bioinformatics, are adapted to develop a more fine-grained similarity measure for texts (and other formal strings) than simple n-gram overlap. The new algorithm achieves a remarkable accuracy of 92% for gender classification in the British National Corpus and, to our knowledge, marks the first application of suffix trees in computational linguistics.
Supervisors: Stefan Evert, Volker Sperschneider
Wikipedia, the free encyclopedia, uses a category system to establish a hierarchical order for its articles. This thesis explores the possibility of automatically assigning such categories to new articles, using a k-Nearest Neighbour algorithm based on textual features that are unique to Wikipedia articles.
Supervisors: Stefan Evert, Peter Geibel
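The general k-Nearest Neighbour scheme can be sketched as below. This is an illustration only: the thesis's actual features and similarity measure are not described here, so the feature sets, the Jaccard overlap and the function names are all assumptions. A new article receives the category that is most frequent among its k most similar labelled articles.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard overlap of two feature sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_category(labelled, features, k=3):
    """Majority category among the k labelled items most similar to `features`."""
    ranked = sorted(labelled, key=lambda item: jaccard(item[0], features), reverse=True)
    votes = Counter(category for _, category in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled articles, each reduced to a small set of features.
labelled = [
    ({"goal", "league", "striker"}, "Sport"),
    ({"league", "coach", "stadium"}, "Sport"),
    ({"symphony", "composer", "opera"}, "Music"),
    ({"opera", "tenor", "aria"}, "Music"),
    ({"stadium", "concert", "tour"}, "Music"),
]
sport_guess = knn_category(labelled, {"league", "stadium", "goal"})
music_guess = knn_category(labelled, {"opera", "composer", "aria"})
```

Since Wikipedia articles usually belong to several categories, a realistic system would return every category above a vote threshold rather than a single winner.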
Tokenization is a first and essential step in most natural language processing (NLP) pipelines for written text. Although it is often considered a "trivial" problem, the accuracy of tokenizers is frequently unsatisfactory, especially on less formal genres such as personal Web pages. This study specifies requirements for a high-quality tokenization system, analyzes different types of problematic cases, evaluates a number of commonly used tokenizers, and outlines the architecture of a more flexible tokenization system.
PDF version of the thesis: http://www.cogsci.uos.de/~CL/download/BSc_Aulbach_2006.pdf
Supervisors: Bettina Schrader, Stefan Evert
Experiments on using spam filters and support vector machines (SVM) for the automatic detection of unwanted pages in Web corpora (termed "WaC spam"). Results on an existing collection of WaC spam and additional manually classified Web pages are encouraging, especially for the SVM classifiers.
PDF version of the thesis, software and data sets can be downloaded here: http://wacky.sslmit.unibo.it/old_wiki/doku.php?id=cite:gross_2006
Supervisors: Stefan Evert, Peter Geibel