1 Preparation

1.1 R packages

For this hands-on session, you will need the tidyverse packages and corpora (version 0.6). We start by loading these packages (ignoring warnings about name conflicts from tidyverse).

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corpora)

If you did not manage to install corpora v0.6, you can load an implementation of the keyness() function from the file keyness_function.R. The code block below does this automatically if it detects an outdated version.

corpora.version <- packageVersion("corpora")
if (corpora.version < "0.6") {
  cat(sprintf("Package 'corpora' v%s is outdated - loading stand-alone implementation of keyness()\n", corpora.version))
  source("keyness_function.R")
}

For some parts of the session, you will also need packages Rtsne and ggrepel, as well as fastTextR to apply the visualisation to your own data sets. Since not everyone may have installed these packages, we will only load them later on when they are needed.

1.2 A small example

The corpora package includes a very small example data set for keyword analysis, containing the frequencies of 60 selected noun lemmas in the written and spoken parts of the British National Corpus. For illustration, we show every 10th line of this table.

BNCcomparison[seq(1, 61, 10), ]
##            noun  written  spoken
## 1          time   156300   21720
## 11        woman    56417    3824
## 21   commentary      966      46
## 31        agony      973      31
## 41 polymorphism      100       0
## 51    patrimony      100       0
## 61        OTHER 17813315 1107974

Did you notice the last entry “OTHER”? This is the total frequency of all other nouns, allowing us to compute the sample sizes \(n_1\) and \(n_2\) by summing over the columns. Let us separate the information into a frequency table bnc without the “OTHER” row (converted into a tibble for better tidyverse support) and the sample sizes bnc_nW and bnc_nS.

bnc <- as_tibble(filter(BNCcomparison, noun != "OTHER"))
bnc_nW <- sum(BNCcomparison$written)
bnc_nS <- sum(BNCcomparison$spoken)

The sample sizes are \(n_1 =\) 19277930 and \(n_2 =\) 1274747.

2 Measuring keyness

2.1 Contingency tables

The mathematical basis for computing keyness measures is the contingency table, as shown in the lecture. Let us construct a contingency table for the noun time in the first line of bnc.

Think break: Assume you want to find out whether time is characteristic of spoken English. What are the frequency counts \(f_1, f_2\) and sample sizes \(n_1, n_2\) you need to provide? How do you combine them into a suitable contingency table?

The utility function cont.table() from corpora makes it easy to build the contingency table from \(f_1, n_1, f_2, n_2\):

fS <- bnc$spoken[1]
fW <- bnc$written[1]
ct <- cont.table(fS, bnc_nS, fW, bnc_nW)
rownames(ct) <- c("time", "OTHER")     # just for better readability
colnames(ct) <- c("spoken", "written")
ct
##        spoken  written
## time    21720   156300
## OTHER 1253027 19121630

Relative frequencies in the two subcorpora indicate that time is indeed more frequent in spoken English.

prop.table(ct, margin=2)
##           spoken     written
## time  0.01703868 0.008107717
## OTHER 0.98296132 0.991892283

We can easily apply a significance test to such a contingency table, e.g. Pearson’s \(\chi^2\) test or Fisher’s exact test (but note that the popular likelihood-ratio test is not included in R).

chisq.test(ct)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  ct
## X-squared = 11106, df = 1, p-value < 2.2e-16

Note that Fisher’s test also provides a confidence interval for the odds ratio \(\theta\), which you could use as an alternative to the LRC measure (this has only become possible through recent improvements in the R implementation, though, and hasn’t found widespread use in corpus linguistics).

fisher.test(ct)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  ct
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  2.090496 2.151104
## sample estimates:
## odds ratio 
##   2.120544

Other keyness measures have to be computed directly using the equations from the lecture.

Task: Can you compute the LogRatio score? (It is more convenient to start from \(f_s, n_s\) and \(f_w, n_w\) than from the contingency table.) Do you think you can also implement the equation for the log-likelihood measure \(G^2\)?

log2((fS + .5) / (bnc_nS + .5)) - log2((fW + .5) / (bnc_nW + .5))
## [1] 1.071474

For \(G^2\), we need to determine the contingency table of expected frequencies \(E_{ij}\) first, which we will call ctE. The observed frequencies \(O_{ij}\) are already available in the table ct. Compare the code below with the formula in the lecture slides.

p0 <- (fS + fW) / (bnc_nS + bnc_nW) # estimated relative frequency under H0
ctE <- cont.table(p0 * bnc_nS, bnc_nS, p0 * bnc_nW, bnc_nW)
round(ctE, 1)
##           [,1]       [,2]
## [1,]   11041.4   166978.6
## [2,] 1263705.6 19110951.4

The \(G^2\) formula requires us to compute the term \(O_{ij}\cdot \log\frac{O_{ij}}{E_{ij}}\) for each cell of the two contingency tables, then sum over the results. This is easy in R because all operations are vectorised and operate on the full contingency tables at once.

2 * sum(ct * log(ct / ctE))
## [1] 8827.91

Compare this to the similar statistic \(X^2 = 11106\) computed by the \(\chi^2\) test above. In order to verify our implementation, we can compare it to the implementation in corpora, i.e. to the output of the keyness() function.

keyness(fS, bnc_nS, fW, bnc_nW, measure="G2")
## [1] 8827.91

Note that our \(G^2\) value doesn’t show whether time is a positive or negative keyword for spoken English, just that the difference is highly significant. By contrast, keyness() returns a signed score, which would have been negative if time were a negative keyword of spoken English.

2.2 Keyword analysis

In order to perform a keyword analysis, we need to compute a keyness score for each candidate item. In our example, there are only 60 noun lemmas and we could in principle write a loop that goes through each of them, builds a contingency table, and computes e.g. the \(G^2\) score. However, in real-life applications there will be tens or hundreds of thousands of candidate items, so we need a more efficient approach.
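
Just for contrast, such a loop might look like the sketch below, computing one \(G^2\) score per iteration with the keyness() function used above; it works fine for 60 items but does not scale to large candidate lists.

# naive approach (a sketch): loop over candidate items and compute G2 one at a time
G2_loop <- numeric(nrow(bnc))
for (i in seq_len(nrow(bnc))) {
  G2_loop[i] <- keyness(bnc$spoken[i], bnc_nS, bnc$written[i], bnc_nW, measure="G2")
}
head(G2_loop)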

The solution is to treat the list of contingency tables as tabular data, i.e. construct a table where each row represents a candidate item and there are four columns for the four cells of the contingency table. Equivalently, we can also use columns with the raw data \(f_1, n_1, f_2, n_2\).

The table bnc already has almost the required format. Let us just relabel spoken as \(f_1\) and written as \(f_2\), and add columns for \(n_1\) and \(n_2\). We will use tidyverse pipes to work with tabular data throughout this session, which is a convenient and recommended approach.

bnc %>%
  select(noun, f1=spoken, f2=written) %>%
  mutate(
    n1 = bnc_nS, 
    n2 = bnc_nW
  ) ->
  bncTab

print(bncTab, n=6)
## # A tibble: 60 × 5
##   noun      f1     f2      n1       n2
##   <chr>  <int>  <int>   <int>    <int>
## 1 time   21720 156300 1274747 19277930
## 2 year   15593 146308 1274747 19277930
## 3 people 20939  94568 1274747 19277930
## 4 way    14037  93325 1274747 19277930
## 5 man     6460  88370 1274747 19277930
## 6 day    11011  80062 1274747 19277930
## # ℹ 54 more rows

It is now easy to add further columns with scores of the different keyness measures. Thanks to vectorisation, scores for all candidate items are computed efficiently with a single statement.

bncTab %>%
  mutate(
    LogRatio = log2((f1 + .5) / (n1 + .5)) - log2((f2 + .5) / (n2 + .5))
  ) ->
  bncTab

In keyword analysis, the columns n1 and n2 usually contain the same value for all candidate items. It is therefore common not to include this value redundantly in the table, but rather store it as a scalar in a separate variable and use it directly in the computations.
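
For example, the LogRatio scores above could have been computed just as well straight from bnc, with the sample sizes used as scalars (a sketch equivalent to the mutate() call above):

bnc %>%
  mutate(
    LogRatio = log2((spoken + .5) / (bnc_nS + .5)) - log2((written + .5) / (bnc_nW + .5))
  )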

In order to see e.g. the top-10 keywords, we just have to order the table rows by decreasing LogRatio score:

bncTab %>%
  arrange(desc(LogRatio)) %>%
  print(n = 10)
## # A tibble: 60 × 6
##    noun         f1     f2      n1       n2 LogRatio
##    <chr>     <int>  <int>   <int>    <int>    <dbl>
##  1 thing     23171  51670 1274747 19277930     2.76
##  2 banger       23     77 1274747 19277930     2.20
##  3 sultana      22     78 1274747 19277930     2.12
##  4 paymaster    18     82 1274747 19277930     1.76
##  5 people    20939  94568 1274747 19277930     1.74
##  6 way       14037  93325 1274747 19277930     1.19
##  7 bulb        131    875 1274747 19277930     1.18
##  8 spider      123    884 1274747 19277930     1.08
##  9 time      21720 156300 1274747 19277930     1.07
## 10 day       11011  80062 1274747 19277930     1.06
## # ℹ 50 more rows

Task: Compute log-likelihood scores \(G^2\) in the same way. First, add four columns for the observed contingency table \(O_{ij}\) and four columns for the expected contingency table \(E_{ij}\). Then, add a further column with the \(G^2\) score, explicitly computing each of the four terms. (Experienced R users might find a trick to compute all four terms with a single expression.)

bncTab %>%
  mutate(
    O11 = f1, 
    O12 = f2,
    O21 = n1 - f1, 
    O22 = n2 - f2,
    E11 = n1 * ((f1 + f2) / (n1 + n2)),
    E12 = n2 * ((f1 + f2) / (n1 + n2)),
    E21 = n1 - E11,
    E22 = n2 - E12
  ) %>%
  mutate(
    G2 = 2 * (
      O11 * log(O11 / E11) 
      + O12 * log(O12 / E12)
      + O21 * log(O21 / E21)
      + O22 * log(O22 / E22)
    )
  ) ->
  bncG2
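
One version of the “trick” mentioned in the task collects the observed and expected cells into row-wise matrices, so that a single vectorised expression handles all four terms (a sketch for comparison; the explicit sum above is easier to read):

bncG2 %>%
  mutate(
    # cbind() turns the four cell columns into an n x 4 matrix, so one rowSums() covers all terms
    G2alt = 2 * rowSums(
      cbind(O11, O12, O21, O22) * log(cbind(O11, O12, O21, O22) / cbind(E11, E12, E21, E22))
    )
  ) %>%
  select(noun, G2, G2alt) %>%
  head()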

We can take a look at the top-10 keywords according to log-likelihood. In order to keep the table readable, we will only show the first row of the contingency tables.

bncG2 %>%
  arrange(desc(G2)) %>%
  select(noun, O11, E11, O12, E12, G2) %>%
  print(n = 10)
## # A tibble: 60 × 6
##    noun     O11    E11    O12     E12     G2
##    <chr>  <int>  <dbl>  <int>   <dbl>  <dbl>
##  1 thing  23171  4642.  51670  70199. 43128.
##  2 people 20939  7164.  94568 108343. 19356.
##  3 time   21720 11041. 156300 166979.  8828.
##  4 way    14037  6659.  93325 100703.  6780.
##  5 day    11011  5649.  80062  85424.  4343.
##  6 year   15593 10042. 146308 151859.  2853.
##  7 number  6551  3706.  53194  56039.  1928.
##  8 house   5587  3487.  50634  52734.  1156.
##  9 system  2327  3805.  59022  57544.   707.
## 10 area    5156  3604.  52957  54509.   635.
## # ℹ 50 more rows

Let us now compute \(G^2\) scores more conveniently with the keyness() function and compare the results. See ?keyness for detailed information about supported keyness measures and other options.

bncG2 %>%
  mutate(
    keynessG2 = keyness(f1, n1, f2, n2, measure="G2")
  ) %>%
  arrange(desc(G2)) %>%
  select(noun, LogRatio, G2, keynessG2) %>%
  print(n = 15)
## # A tibble: 60 × 4
##    noun       LogRatio     G2 keynessG2
##    <chr>         <dbl>  <dbl>     <dbl>
##  1 thing         2.76  43128.    43128.
##  2 people        1.74  19356.    19356.
##  3 time          1.07   8828.     8828.
##  4 way           1.19   6780.     6780.
##  5 day           1.06   4343.     4343.
##  6 year          0.689  2853.     2853.
##  7 number        0.897  1928.     1928.
##  8 house         0.739  1156.     1156.
##  9 system       -0.746   707.     -707.
## 10 area          0.558   635.      635.
## 11 work          0.518   543.      543.
## 12 company      -0.604   461.     -461.
## 13 world        -0.526   367.     -367.
## 14 government   -0.378   229.     -229.
## 15 child         0.276   166.      166.
## # ℹ 45 more rows

Think break: Can you spot differences between the two \(G^2\) columns? If so, can you explain why the results are different? (In an interactive notebook, make sure to page through all 60 candidate items; or remove n = 15 if this doesn’t work.)

2.3 Significance filter & FWER

We can also use \(G^2\) scores to weed out non-significant candidate items by comparing them against a suitable threshold value (obtained e.g. from a statistics textbook). But it is more convenient to use the keyness() function for this purpose, selecting the desired significance level with option alpha.
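
As an aside, the threshold itself is easy to obtain in R: for one degree of freedom and \(\alpha = .05\), the critical value of \(G^2\) (and \(X^2\)) is

qchisq(0.95, df=1)   # critical value for alpha = .05, approx. 3.84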

bncTab %>%
  mutate(
    G2sig = keyness(f1, n1, f2, n2, "G2", alpha=.05, p.adjust=FALSE)
  ) ->
  bncSig

filter(bncSig, G2sig == 0) # not significant
## # A tibble: 10 × 7
##    noun        f1    f2      n1       n2 LogRatio G2sig
##    <chr>    <int> <int>   <int>    <int>    <dbl> <dbl>
##  1 woman     3824 56417 1274747 19277930   0.0359     0
##  2 query       70   938 1274747 19277930   0.184      0
##  3 drinking    53   952 1274747 19277930  -0.235      0
##  4 brewery     65   938 1274747 19277930   0.0779     0
##  5 voltage     75   927 1274747 19277930   0.300      0
##  6 draper       7    93 1274747 19277930   0.279      0
##  7 sundial      4    96 1274747 19277930  -0.504      0
##  8 iodine       4    96 1274747 19277930  -0.504      0
##  9 dynamite     6    94 1274747 19277930   0.0569     0
## 10 tracer      10    90 1274747 19277930   0.811      0

Keep in mind that this simple significance filter only controls the error rate for each individual candidate item. At our significance level \(\alpha = .05\), three of our 60 candidates (i.e. 1 in 20) may well be false positives. It is therefore recommended to control the family-wise error rate instead, which is the default setting in keyness():

bncSig %>%
  mutate(
    G2fwer = keyness(f1, n1, f2, n2, "G2", alpha=.05)
  ) ->
  bncSig

filter(bncSig, G2sig != G2fwer) # where FWER makes a difference
## # A tibble: 9 × 8
##   noun          f1    f2      n1       n2 LogRatio G2sig G2fwer
##   <chr>      <int> <int>   <int>    <int>    <dbl> <dbl>  <dbl>
## 1 commentary    46   966 1274747 19277930   -0.459 -5.24      0
## 2 lobby         43   959 1274747 19277930   -0.545 -7.01      0
## 3 manual        83   918 1274747 19277930    0.459  6.84      0
## 4 blaster        2    98 1274747 19277930   -1.38  -4.06      0
## 5 saxophone      2    98 1274747 19277930   -1.38  -4.06      0
## 6 sitcom         2    98 1274747 19277930   -1.38  -4.06      0
## 7 propulsion     1    99 1274747 19277930   -2.13  -7.04      0
## 8 ascendant      1    99 1274747 19277930   -2.13  -7.04      0
## 9 barbarism      2    98 1274747 19277930   -1.38  -4.06      0

The Bonferroni correction used by keyness() simply divides the significance level by the number of tests carried out, i.e. the number of candidate items. In our case, the adjusted level is thus \(\alpha' = \alpha / 60\).

bncSig %>%
  mutate(
    G2adjust = keyness(f1, n1, f2, n2, "G2", alpha=.05 / 60, p.adjust=FALSE)
  ) %>%
  filter(abs(G2fwer) < 20) %>%
  select(noun, starts_with("G2"))
## # A tibble: 29 × 4
##    noun        G2sig G2fwer G2adjust
##    <chr>       <dbl>  <dbl>    <dbl>
##  1 woman        0       0        0  
##  2 group      -19.8   -19.8    -19.8
##  3 commentary  -5.24    0        0  
##  4 query        0       0        0  
##  5 drinking     0       0        0  
##  6 cargo      -12.6   -12.6    -12.6
##  7 brewery      0       0        0  
##  8 voltage      0       0        0  
##  9 jazz       -12.6   -12.6    -12.6
## 10 lobby       -7.01    0        0  
## # ℹ 19 more rows

Now you know everything you need to carry out a few real-world keyword analyses.

3 Case studies

As explained in the lecture, most corpus tools can export word frequency lists in some tabular format, but the details of this format differ between tools. It is therefore important to read documentation and inspect samples of the tabular files in order to work out how to read frequency lists correctly into R. In order to practise this, we will look at four data sets from different tools in this session. Only AntConc supports the MTSV format proposed by Anthony & Evert (2019), but you will find that all data sets can easily be loaded with a little care.

3.1 Donald Trump (CWB)

Let us start by replicating the initial example from the lecture and carry out a keyword analysis based on Donald Trump’s tweets (from the Trump Twitter Archive = TTA). You will find a complete word list of verbs, nouns, adjectives, proper nouns and hashtags in the file tta_target.tsv, and a reference list from a general Twitter sample in tta_reference.tsv. These word lists were created efficiently with the cwb-scan-corpus tool of the IMS Corpus Workbench (https://cwb.sourceforge.io/), so they have a very simple format quite different from MTSV.

Task: A good starting point is to inspect the files manually, which you can do by clicking on tta_target.tsv in the RStudio Files tab (usually located in the bottom right pane). You can also try opening the file in a text editor. Can you figure out the file format and what data are found in each column? Do you notice anything unusual?
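
If you prefer to stay within R, you can also peek at the first few lines of the raw file (a quick sketch; output not shown here):

writeLines(readLines("tta_target.tsv", n=5))   # print the first 5 lines verbatim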

Unlike most TSV files, the output of cwb-scan-corpus doesn’t include column headers, so we will have to make sure to specify suitable column names when reading the data. The items are POS-disambiguated lemmas; it should be easy to work out the meaning of the POS codes appended after an underscore separator. The frequency values are in fact document frequencies, i.e. the number of tweets containing each lemma (so that repetitions within the same tweet are ignored).

The tidyverse provides a powerful and convenient function read_delim() to load tabular files. It will try to work out the file format by itself, though usually a few settings are still required to get a correct result. If you know that an input file is in CSV or TSV format, you can also use read_csv() or read_tsv(), respectively.

One important feature of TSV files is that they don’t quote the values in cells of the table, assuming that they won’t contain TAB or newline characters. It is therefore important to specify quote="" so that read_delim() doesn’t get confused.

tta_target <- read_delim("tta_target.tsv", quote="", col_names=c("f", "lemma"))
## Rows: 21780 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): lemma
## dbl (1): f
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tta_ref <- read_delim("tta_reference.tsv", quote="", col_names=c("f", "lemma"))
## Rows: 174239 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): lemma
## dbl (1): f
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Think break: An important bit of information for the keyword analysis is still missing. Do you know what it is?

The cwb-scan-corpus tool doesn’t include the sample size of the corpus in its output (because it doesn’t fit into the simple tabular format, requiring something like MTSV). In fact, I had to determine the sample sizes \(n_1, n_2\) separately (also with cwb-scan-corpus) and note them down.

tta_n1 <- 43878     # number of tweets in TTA
tta_n2 <- 13904798  # number of tweets in the reference corpus

We now have two separate frequency lists from TTA and from the reference corpus. However, for the keyword analysis, we need a table that combines the values \(f_1\) and \(f_2\) for each candidate item. In database parlance, this is known as a join operation between the two frequency tables. In this case, we need a left join because we want to include items that don’t occur in the reference corpus (with \(f_2 = 0\)).

left_join(tta_target, tta_ref, by="lemma", suffix=c("1", "2")) %>%
  mutate(f2 = coalesce(f2, 0)) %>%
  select(lemma, f1, f2) ->
  tta

head(tta)
## # A tibble: 6 × 3
##   lemma                 f1    f2
##   <chr>              <dbl> <dbl>
## 1 jaime_Z                1  1306
## 2 os_N                   1   540
## 3 extreme_J             24  5812
## 4 cryptocurrencies_Z     1   219
## 5 sullen_J               1    27
## 6 bettis_Z               1    80

Think break: Can you work out why we needed to specify the suffix option above? And can you imagine what coalesce() does?
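
As a hint for the second question, it may help to try coalesce() on a toy vector first:

coalesce(c(4, NA, 7), 0)   # missing values are replaced by 0, giving 4 0 7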

We can now easily add several keyness measures to the table, using mutate() and keyness(). We combine LogRatio with a significance filter (corrected for FWER), as recommended by Hardie (2014).

tta %>% mutate(
  LR = keyness(f1, tta_n1, f2, tta_n2, "LogRatio", alpha=.05),
  G2 = keyness(f1, tta_n1, f2, tta_n2, "G2"),
  LRC = keyness(f1, tta_n1, f2, tta_n2, "LRC")
) ->
  tta

Let’s have a look at top-10 keywords from each measure:

tta %>% arrange(desc(LR)) %>% print(n=10)
## # A tibble: 21,780 × 6
##    lemma                   f1    f2    LR    G2   LRC
##    <chr>                <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 #celebapprentice_#     256     0  17.3 2951. 12.5 
##  2 coronavirus_Z           82     0  15.7  945. 10.8 
##  3 #timetogettough_#       81     0  15.7  934. 10.8 
##  4 #trumpforpresident_#    76     0  15.6  876. 10.7 
##  5 #trump2016https_#       66     0  15.4  761. 10.4 
##  6 #kag2020_#              58     0  15.2  668. 10.2 
##  7 usmca_Z                 45     0  14.8  519.  9.80
##  8 #celebapprentice_Z      41     0  14.7  472.  9.64
##  9 covid_Z                 34     0  14.4  392.  9.32
## 10 #trumpvlog_#            76     1  14.0  865. 10.4 
## # ℹ 21,770 more rows
tta %>% arrange(desc(G2)) %>% print(n=10)
## # A tibble: 21,780 × 6
##    lemma           f1      f2     LR     G2    LRC
##    <chr>        <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
##  1 amp_N         3315     633 10.7   34984. 10.4  
##  2 great_J       6441  184737  3.47  19876.  3.38 
##  3 trump_Z       4794  108753  3.80  16671.  3.70 
##  4 be_V         22842 3673663  0.978 12915.  0.933
##  5 will_V        6411  343738  2.56  12737.  2.48 
##  6 donald_Z      1761   14032  5.31   9405.  5.14 
##  7 country_N     2339   42064  4.14   9009.  3.99 
##  8 #trump2016_#   797     177 10.5    8276.  9.93 
##  9 president_N   1842   29656  4.30   7450.  4.13 
## 10 democrat_N    1312    9183  5.50   7306.  5.30 
## # ℹ 21,770 more rows
tta %>% arrange(desc(LRC)) %>% print(n=10)
## # A tibble: 21,780 × 6
##    lemma                     f1    f2    LR     G2   LRC
##    <chr>                  <dbl> <dbl> <dbl>  <dbl> <dbl>
##  1 #celebapprentice_#       256     0  17.3  2951.  12.5
##  2 coronavirus_Z             82     0  15.7   945.  10.8
##  3 #timetogettough_#         81     0  15.7   934.  10.8
##  4 #trumpforpresident_#      76     0  15.6   876.  10.7
##  5 #celebrityapprentice_#   101     2  13.7  1144.  10.6
##  6 #trump2016https_#         66     0  15.4   761.  10.4
##  7 amp_N                   3315   633  10.7 34984.  10.4
##  8 #trumpvlog_#              76     1  14.0   865.  10.4
##  9 #kag2020_#                58     0  15.2   668.  10.2
## 10 donaldtrump_Z            134    10  12.0  1472.  10.1
## # ℹ 21,770 more rows

Think break: What is your first impression of the three keyness measures? How many significant keywords are there in total? Does it make sense to look at top-10 keywords? What else could you do to bring out more interesting keywords?

sum(tta$LR != 0) # LR = 0 if not significant
## [1] 3492

Unsurprisingly, many hashtags and proper names are (almost) unique to Trump’s tweets. It might make sense to filter out these parts of speech in order to focus on nouns, verbs and adjectives. Moreover, 10 keywords are far too few to obtain interesting insights and judge the usefulness of a keyness measure (instead leading researchers towards default settings that make the top-10 keywords look good, as with SimpleMaths). The DT package allows us to view a larger candidate table interactively in the knitted HTML output.

tta %>%
  filter( !grepl("_[#Z]$", lemma, perl=TRUE) ) %>%
  arrange(desc(LRC)) %>%
  head(200) %>%
  DT::datatable()

3.2 AmE vs BrE (AntConc)

AntConc conveniently provides word frequency lists in MTSV format (though with individual files as CSV rather than TSV). Each word frequency list is a collection of three tables combined into a ZIP archive (saved with Save Current Tab Database Tables). Since I don’t use AntConc myself, I can only provide a boring example comparing keywords of AmE vs. BrE, based on the sample corpora included in AntConc. The MTSV ZIP archives are named antconc_AmE.zip and antconc_BrE.zip.

We can use R to list the contents of each ZIP file:

unzip("antconc_AmE.zip", list=TRUE)
##                Name Length                Date
## 1   corpus_info.csv    521 2023-09-10 09:45:00
## 2 wordlist_info.csv    407 2023-09-10 09:45:00
## 3      wordlist.csv 715499 2023-09-10 09:45:00
unzip("antconc_BrE.zip", list=TRUE)
##                Name Length                Date
## 1   corpus_info.csv    488 2023-09-10 09:39:00
## 2 wordlist_info.csv    407 2023-09-10 09:39:00
## 3      wordlist.csv 695654 2023-09-10 09:39:00

We will need the main table (wordlist.csv) and the corpus metadata with sample size information (corpus_info.csv). These can be extracted directly from the ZIP archive using unz().

ant_ae <- read_csv(unz("antconc_AmE.zip", "wordlist.csv"))
## Rows: 44434 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): type
## dbl (2): freq, range
## lgl (2): pos, headword
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ant_ae_meta <- read_csv(unz("antconc_AmE.zip", "corpus_info.csv"))
## Rows: 1 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): full_name, short_name, language, mode, license, reference, summary...
## dbl  (4): file_count, token_count, type_count, date
## lgl  (1): number_replace
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ant_be <- read_csv(unz("antconc_BrE.zip", "wordlist.csv"), show_col_types=FALSE)
ant_be_meta <- read_csv(unz("antconc_BrE.zip", "corpus_info.csv"), show_col_types=FALSE)

Task: Take a look at the MTSV tables with View(ant_ae) and View(ant_ae_meta), or by clicking the little table icons in the Environment tab (usually in the top right pane). Can you work out what the different columns mean and which columns we need?

Let us rename columns so they match our standard notation and discard the empty columns.

ant_ae %>% select(type, f=freq, df=range) -> ant_ae
ant_be %>% select(type, f=freq, df=range) -> ant_be
ant_ae_n  <- ant_ae_meta$token_count
ant_ae_dn <- ant_ae_meta$file_count
ant_be_n  <- ant_be_meta$token_count
ant_be_dn <- ant_be_meta$file_count

In this case of a “symmetric” keyword analysis between two comparable corpora, we should be interested both in positive keywords (characteristic of AmE) and in negative keywords (characteristic of BrE). Therefore we need an outer (or full) join between the two frequency tables. Keep in mind that each of the four frequency columns \(f_1, df_1, f_2, df_2\) might contain missing values that need to be filled in with 0.

full_join(ant_ae, ant_be, by="type", suffix=c("1", "2")) %>%
  mutate(
    f1  = coalesce(f1, 0),
    df1 = coalesce(df1, 0),
    f2  = coalesce(f2, 0),
    df2 = coalesce(df2, 0)
  ) ->
  ant

The rest of the keyword analysis is straightforward. Let us look at top-200 positive and negative keywords, combining them in a single ranking as e.g. done by CQPweb.

ant %>%
  mutate(
    LRC = keyness(f1, ant_ae_n, f2, ant_be_n, "LRC")
  ) %>%
  arrange(desc(abs(LRC))) %>% # note the abs()!
  head(200) %>%
  DT::datatable()

Task: What do you think of these keywords? Try other keyness measures. How many keywords are significant? What changes if you measure keyness based on document frequency rather than token frequency? Are there different effects for different keyness measures?
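
As a starting point for the document-frequency variant, one possible sketch simply replaces token frequencies and token counts by document frequencies and file counts (the column LRC_df below is just an illustrative name):

ant %>%
  mutate(
    LRC_df = keyness(df1, ant_ae_dn, df2, ant_be_dn, "LRC")   # keyness based on document frequencies
  ) %>%
  arrange(desc(abs(LRC_df))) %>%
  head(200) %>%
  DT::datatable()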

3.3 Online newspaper comments (CQPweb)

CQPweb saves its word frequency lists in a more unusual format. The two samples provided here as ap_guardian.txt and ap_telegraph.txt are from two small corpora of online newspaper comments available on the CQPweb server at Lancaster U.

Task: Open the files in RStudio or a text editor. Can you figure out the format? Do you think that read_delim() will read them successfully? Also make sure to scroll down to the end of the file: what about the last line of the table?

The informative header added by CQPweb and the blank lines break the standard TSV format. There is also an extra row at the end with the sample size; though it looks quite different, it does fit into the TSV table (with Total: in the first column and the sample size in the last).

Fortunately, we can tell read_delim() to skip the header lines and ignore blank lines, allowing us to read the CQPweb format without further pre-processing.

ap_guardian <- read_delim("ap_guardian.txt", quote="", skip=2, skip_empty_rows=TRUE)
## Rows: 30589 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): Number, Lemma
## dbl (1): Frequency
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ap_telegraph <- read_delim("ap_telegraph.txt", quote="", skip=2, skip_empty_rows=TRUE)
## Rows: 35754 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): Number, Lemma
## dbl (1): Frequency
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Think break: Can you figure out why R has read the first column (Number) as strings (type character) even though it seems to contain numbers?

As a first step, we need to extract the sample sizes from the tables and remove these lines. We can also rename columns to our standard notation and drop the irrelevant Number column.

ap_guardian  %>% filter(Number == "Total:") %>% pull(Frequency) -> ap_n1
ap_telegraph %>% filter(Number == "Total:") %>% pull(Frequency) -> ap_n2

ap_guardian %>%
  filter(Number != "Total:") %>%
  select(lemma=Lemma, f=Frequency) ->
  ap_guardian

ap_telegraph %>%
  filter(Number != "Total:") %>%
  select(lemma=Lemma, f=Frequency) ->
  ap_telegraph

The remainder of the analysis should be familiar and boring by now. Since we’re looking at contrastive keywords for the liberal vs. conservative newspapers, we will be interested in both positive and negative keywords, so we need a full outer join.

full_join(ap_guardian, ap_telegraph, by="lemma", suffix=c("1", "2")) %>%
  mutate(
    f1 = coalesce(f1, 0),
    f2 = coalesce(f2, 0)
  ) ->
  ap

Here we take a look at all significant \(G^2\) keywords.

ap %>%
  mutate( 
    G2 = keyness(f1, ap_n1, f2, ap_n2, "G2", alpha=.05)
  ) %>%
  filter( G2 != 0 ) %>%
  arrange( desc(abs(G2)) ) %>%
  DT::datatable()

Think break: Can you make sense of the positive and negative keywords? Does a clear pattern emerge? Can you think of better ways to display these keywords than as a list ranked by keyness? (Of course, you can also try some other keyness measures or look only at positive / negative keywords.)

3.4 Austerity in Britain (tidy CSV)

As a final example, you can look at two word lists from the AuBriN corpus (Austerity in British Newspapers, see Griebel et al. 2020), which contains newspaper articles on austerity policies in the UK from the Guardian and the Telegraph. This data set is provided in a tidy CSV format, though not full MTSV.

aubrin_guardian <- read_csv("austerity_guardian.csv", show_col_types=FALSE)
aubrin_telegraph <- read_csv("austerity_telegraph.csv", show_col_types=FALSE)
aubrin_meta <- read_csv("austerity_meta.csv", show_col_types=FALSE)

Task: Now it’s your turn!

  • Work out how the relevant data are provided in these three tables
  • Perform a join operation to combine frequencies \(f_1\) and \(f_2\) for each candidate item
  • Compute different keyness measures
  • Look at top-100 or top-200 positive and/or negative keywords for each measure
  • Can you find ways of combining information from multiple keyness measures? How could you compare top-200 lists from different measures?
  • Try all of this without peeking at the examples above!

4 Visualisation

Long keyword lists can be difficult to interpret, especially if they are simply ranked by keyness scores. Linguistic interpretation often involves grouping related keywords in order to discover general topics, framings, stances or communicative patterns. Suitable visualisations can be enormously helpful for this step, as long as you don’t resort to word clouds.

I will demonstrate some visualisation options based on the small CQPweb data set introduced above. It is quite easy to generate plots in R using the ggplot2 package together with tidyverse pipes to rearrange the tables in a suitable format. Fine-tuning such plots has a steeper learning curve, but many textbooks and online resources are available to help you.

4.1 Multidimensional keyness

Some authors (including Stefan Th. Gries) have suggested that keyness is a multidimensional phenomenon, while ranked keyword lists focus on a single aspect (in the form of the selected keyness measure). They propose to visualise keywords in two or more dimensions to represent e.g. both LogRatio and \(G^2\) scores.

ap %>% 
  mutate(
    LR = keyness(f1, ap_n1, f2, ap_n2, "LogRatio", alpha=.05),
    G2 = keyness(f1, ap_n1, f2, ap_n2, "G2"),
    LRC = keyness(f1, ap_n1, f2, ap_n2, "LRC")
  ) %>%
  filter( LRC != 0 ) -> # discard non-significant keywords
  ap_viz

Here is a visualisation of all significant keywords with \(G^2\) scores on the x-axis and LogRatio on the y-axis. Since LogRatio distinguishes positive and negative keywords, it makes sense to display \(|G^2|\) as a measure of how much evidence we have for keyness. We also need to use a logarithmic scale here because of the skewed distribution of log-likelihood scores.

ap_viz %>%
  ggplot(aes(x=abs(G2), y=LR, label=lemma)) +
  scale_x_log10() +
  geom_text()

Think break: Many of the keywords seem to be arranged on two sloped lines at the top and bottom of the plot. Do you have any idea why this is the case?

We can enrich such visualisations by conveying further information with visual variables such as (font) size and colour. For example, font size could indicate frequency (across both corpora, and on a logarithmic scale) and colour could be used to highlight lemmas that are exclusive to one of the two corpora.

ap_viz %>%
  mutate(
    unique = fct(case_when(
      f1 > 0 & f2 > 0 ~ "both",
      f1 > 0 ~ "A",
      f2 > 0 ~ "B"
    ), levels=c("both", "A", "B"))
  ) %>%
  ggplot(aes(x=abs(G2), y=LR, label=lemma, size=log10(f1 + f2), col=unique)) +
  scale_x_log10() +
  geom_text() +
  scale_colour_viridis_d() +
  theme_light(base_size=8)

4.2 Semantic maps

While the visualisation above conveys rich information in a single plot, it doesn’t really help us group related keywords together. For this purpose, a much better approach is a semantic map, which arranges keywords by their similarity in meaning. The semantic distances required for this visualisation are usually obtained from neural word embeddings, i.e. high-dimensional vector representations of the word meanings.

You will find embeddings for the CQPweb data set in file ap_embeddings.rda. They cover all candidate items with frequency \(f_i \geq 5\) in either A or B (which happens to include everything in ap_viz, but could easily be guaranteed with a suitable filter(), as sketched below).
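
If you ever need to enforce such a threshold yourself, the corresponding filter might look like this (not actually required for ap_viz here):

ap_viz <- filter(ap_viz, pmax(f1, f2) >= 5)   # keep items occurring at least 5 times in A or in B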

load("ap_embeddings.rda", verbose=TRUE)
## Loading objects:
##   ap_embeddings

The embeddings are provided as a matrix of row vectors where rows are labelled with the respective lemma. If you wish to explore these embeddings by looking for nearest neighbours or visualising the network structure of similarities, you might want to give the wordspace package a try.
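
For example, a nearest-neighbour query with wordspace might look like the sketch below (assuming the package is installed and that the chosen lemma, here government, is among the embedded items):

library(wordspace)
nearest.neighbours(ap_embeddings, "government", n=10)   # 10 most similar lemmas by cosine similarity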

In order to use the word embeddings for a semantic map visualisation, we need to project the high-dimensional vectors into two dimensions, preserving the semantic distances between lemmas (or, more precisely, the topological structure of the embedding space) as faithfully as possible. A suitable state-of-the-art algorithm for this purpose is t-SNE, implemented in the Rtsne package.

library(Rtsne)

vocab <- ap_viz$lemma  # only project keywords to be visualised
tSNE <- Rtsne(ap_embeddings[vocab, ], perplexity=50)
coord <- tibble(lemma=vocab, x=tSNE$Y[, 1], y=tSNE$Y[, 2])

The parameter perplexity is a magic number that will have a substantial impact on the resulting visualisation. Now all we have to do is to add the visualisation coordinates to ap_viz and call ggplot() again.

inner_join(ap_viz, coord, by="lemma") %>%
  ggplot(aes(x=x, y=y, label=lemma)) +
  geom_text() +
  theme_void()
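
Keep in mind that t-SNE is a stochastic algorithm, so every run produces a slightly different map. If you want reproducible coordinates, or want to compare different perplexity settings, a sketch along these lines helps:

set.seed(42)   # fix the random seed so that the t-SNE layout is reproducible
tSNE30 <- Rtsne(ap_embeddings[vocab, ], perplexity=30)
coord30 <- tibble(lemma=vocab, x=tSNE30$Y[, 1], y=tSNE30$Y[, 2])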

Again, we improve the visualisation with additional visual variables. In this case it makes sense to indicate (unsigned) keyness by font size and distinguish positive and negative keywords by their colour. Let us add the relevant information to a new table first.

inner_join(ap_viz, coord, by="lemma") %>%
  mutate(
    keyness = log10(abs(G2)),
    source = fct(case_when(
      LRC > 0 ~ "guardian",
      LRC < 0 ~ "telegraph"
    ), levels=c("guardian", "telegraph"))
  ) ->
  ap_smap
ap_smap %>%
  ggplot(aes(x=x, y=y, label=lemma, size=keyness, col=source)) +
  geom_text() +
  scale_colour_discrete(type=c("darkgreen", "darkred")) +
  scale_size(range=c(2, 7)) +
  theme_void()

The only remaining problem is that the overlapping keywords can be difficult to read. Fortunately, there’s the ggrepel package to take care of this issue (some fiddling with its parameters will be needed, though). The result is a meaningful and useful version of a word cloud!

library(ggrepel)

ap_smap %>%
  ggplot(aes(x=x, y=y, label=lemma, size=keyness, col=source)) +
  geom_text_repel(segment.size=0, force=.1, box.padding=0, max.overlaps=40) +
  scale_colour_discrete(type=c("darkgreen", "darkred")) +
  scale_size(range=c(1, 5)) +
  theme_void()

Task: Can you further improve this semantic map? Can you get rid of what appear to be usernames from the online forum? What property might identify them in the ap_smap table? You could also try adding more keywords even if they are not statistically significant, or visualising LRC as a keyness measure.

4.3 Creating your own maps

If you want to create semantic maps for your own data, you will need appropriate word embedding vectors. Though not quite state-of-the-art, FastText embeddings are popular because they are convenient to use and are available for 157 different languages. First, you have to download a pre-trained FastText model from https://fasttext.cc/docs/en/crawl-vectors.html#models (be sure to pick the binary version, e.g. cc.en.300.bin) and save it in the current project directory. You may also need to decompress the downloaded file so that it has the extension .bin. NB: These models are very large (usually around 6.7 GiB)!

Then install the fastTextR package to access the embeddings from within R:

library(fastTextR)
model <- ft_load("cc.en.300.bin")
my_embeddings <- ft_word_vectors(model, vocab)
dim(my_embeddings) # you can now pass them on to Rtsne