13 Computational Text Analysis

13.1 Slides

13.2 Code and output from the lecture

Loading the relevant packages

library(quanteda) # Popular package for classic bag-of-words (BoW) analyses

Package version: 4.3.1
Unicode version: 14.0
ICU version: 71.1

Parallel computing: disabled

See https://quanteda.io for tutorials and examples.

library(rollama) # Package for communicating with LLMs via https://ollama.com/
library(proxyC) # Package for fast computation of distance measures (here: for word embeddings)


Attaching package: 'proxyC'

The following object is masked from 'package:stats':

    dist

The following objects are masked from 'package:base':

    crossprod, tcrossprod

library(tidyverse) # Data management and visualization: https://www.tidyverse.org/

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Corpus

data_corpus_inaugural <- data_corpus_inaugural |> rev.default() # Reverse the order so that the most recent speeches come first
data_corpus_inaugural |>
  _[1] |>
  as.character() |>
  str_sub(end = 3000) |>
  str_wrap(width = 80) |>
  cat()

Thank you. Thank you very much, everybody. Wow. Thank you very, very much.
Vice President Vance, Speaker Johnson, Senator Thune, Chief Justice Roberts,
justices of the Supreme Court of the United States, President Clinton, President
Bush, President Obama, President Biden, Vice President Harris, and my fellow
citizens, the golden age of America begins right now. From this day forward, our
country will flourish and be respected again all over the world. We will be the
envy of every nation, and we will not allow ourselves to be taken advantage of
any longer. During every single day of the Trump administration, I will, very
simply, put America first. Our sovereignty will be reclaimed. Our safety will
be restored. The scales of justice will be rebalanced. The vicious, violent,
and unfair weaponization of the Justice Department and our government will end.
And our top priority will be to create a nation that is proud, prosperous, and
free. America will soon be greater, stronger, and far more exceptional than ever
before. I return to the presidency confident and optimistic that we are at the
start of a thrilling new era of national success. A tide of change is sweeping
the country, sunlight is pouring over the entire world, and America has the
chance to seize this opportunity like never before. But first, we must be honest
about the challenges we face. While they are plentiful, they will be annihilated
by this great momentum that the world is now witnessing in the United States
of America. As we gather today, our government confronts a crisis of trust. For
many years, a radical and corrupt establishment has extracted power and wealth
from our citizens while the pillars of our society lay broken and seemingly in
complete disrepair. We now have a government that cannot manage even a simple
crisis at home while, at the same time, stumbling into a continuing catalogue
of catastrophic events abroad. It fails to protect our magnificent, law-abiding
American citizens but provides sanctuary and protection for dangerous criminals,
many from prisons and mental institutions, that have illegally entered our
country from all over the world. We have a government that has given unlimited
funding to the defense of foreign borders but refuses to defend American borders
or, more importantly, its own people. Our country can no longer deliver basic
services in times of emergency, as recently shown by the wonderful people of
North Carolina — who have been treated so badly — (applause) — and other states
who are still suffering from a hurricane that took place many months ago or,
more recently, Los Angeles, where we are watching fires still tragically burn
from weeks ago without even a token of defense. They’re raging through the
houses and communities, even affecting some of the wealthiest and most powerful
individuals in our country — some of whom are sitting here right now. They don’t
have a home any longer. That’s interesting. But we can’t let

Tokenization

data_corpus_inaugural |>
  _[1] |>
  tokens(
    what = "word",
    remove_punct = TRUE,
    remove_symbols = FALSE,
    remove_numbers = TRUE,
    remove_url = FALSE,
    remove_separators = TRUE,
    split_hyphens = FALSE,
    split_tags = FALSE
  ) |>
  as.character() |>
  head(100)

  [1] "Thank"          "you"            "Thank"          "you"           
  [5] "very"           "much"           "everybody"      "Wow"           
  [9] "Thank"          "you"            "very"           "very"          
 [13] "much"           "Vice"           "President"      "Vance"         
 [17] "Speaker"        "Johnson"        "Senator"        "Thune"         
 [21] "Chief"          "Justice"        "Roberts"        "justices"      
 [25] "of"             "the"            "Supreme"        "Court"         
 [29] "of"             "the"            "United"         "States"        
 [33] "President"      "Clinton"        "President"      "Bush"          
 [37] "President"      "Obama"          "President"      "Biden"         
 [41] "Vice"           "President"      "Harris"         "and"           
 [45] "my"             "fellow"         "citizens"       "the"           
 [49] "golden"         "age"            "of"             "America"       
 [53] "begins"         "right"          "now"            "From"          
 [57] "this"           "day"            "forward"        "our"           
 [61] "country"        "will"           "flourish"       "and"           
 [65] "be"             "respected"      "again"          "all"           
 [69] "over"           "the"            "world"          "We"            
 [73] "will"           "be"             "the"            "envy"          
 [77] "of"             "every"          "nation"         "and"           
 [81] "we"             "will"           "not"            "allow"         
 [85] "ourselves"      "to"             "be"             "taken"         
 [89] "advantage"      "of"             "any"            "longer"        
 [93] "During"         "every"          "single"         "day"           
 [97] "of"             "the"            "Trump"          "administration"

Lowercasing

data_corpus_inaugural |>
  _[1] |>
  tokens(
    what = "word",
    remove_punct = TRUE,
    remove_symbols = FALSE,
    remove_numbers = TRUE,
    remove_url = FALSE,
    remove_separators = TRUE,
    split_hyphens = FALSE,
    split_tags = FALSE
  ) |>
  tokens_tolower() |>
  as.character() |>
  head(100)

  [1] "thank"          "you"            "thank"          "you"           
  [5] "very"           "much"           "everybody"      "wow"           
  [9] "thank"          "you"            "very"           "very"          
 [13] "much"           "vice"           "president"      "vance"         
 [17] "speaker"        "johnson"        "senator"        "thune"         
 [21] "chief"          "justice"        "roberts"        "justices"      
 [25] "of"             "the"            "supreme"        "court"         
 [29] "of"             "the"            "united"         "states"        
 [33] "president"      "clinton"        "president"      "bush"          
 [37] "president"      "obama"          "president"      "biden"         
 [41] "vice"           "president"      "harris"         "and"           
 [45] "my"             "fellow"         "citizens"       "the"           
 [49] "golden"         "age"            "of"             "america"       
 [53] "begins"         "right"          "now"            "from"          
 [57] "this"           "day"            "forward"        "our"           
 [61] "country"        "will"           "flourish"       "and"           
 [65] "be"             "respected"      "again"          "all"           
 [69] "over"           "the"            "world"          "we"            
 [73] "will"           "be"             "the"            "envy"          
 [77] "of"             "every"          "nation"         "and"           
 [81] "we"             "will"           "not"            "allow"         
 [85] "ourselves"      "to"             "be"             "taken"         
 [89] "advantage"      "of"             "any"            "longer"        
 [93] "during"         "every"          "single"         "day"           
 [97] "of"             "the"            "trump"          "administration"

Stopwords

stopwords()

  [1] "i"          "me"         "my"         "myself"     "we"        
  [6] "our"        "ours"       "ourselves"  "you"        "your"      
 [11] "yours"      "yourself"   "yourselves" "he"         "him"       
 [16] "his"        "himself"    "she"        "her"        "hers"      
 [21] "herself"    "it"         "its"        "itself"     "they"      
 [26] "them"       "their"      "theirs"     "themselves" "what"      
 [31] "which"      "who"        "whom"       "this"       "that"      
 [36] "these"      "those"      "am"         "is"         "are"       
 [41] "was"        "were"       "be"         "been"       "being"     
 [46] "have"       "has"        "had"        "having"     "do"        
 [51] "does"       "did"        "doing"      "would"      "should"    
 [56] "could"      "ought"      "i'm"        "you're"     "he's"      
 [61] "she's"      "it's"       "we're"      "they're"    "i've"      
 [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
 [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
 [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
 [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
 [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
 [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
 [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
[101] "who's"      "what's"     "here's"     "there's"    "when's"    
[106] "where's"    "why's"      "how's"      "a"          "an"        
[111] "the"        "and"        "but"        "if"         "or"        
[116] "because"    "as"         "until"      "while"      "of"        
[121] "at"         "by"         "for"        "with"       "about"     
[126] "against"    "between"    "into"       "through"    "during"    
[131] "before"     "after"      "above"      "below"      "to"        
[136] "from"       "up"         "down"       "in"         "out"       
[141] "on"         "off"        "over"       "under"      "again"     
[146] "further"    "then"       "once"       "here"       "there"     
[151] "when"       "where"      "why"        "how"        "all"       
[156] "any"        "both"       "each"       "few"        "more"      
[161] "most"       "other"      "some"       "such"       "no"        
[166] "nor"        "not"        "only"       "own"        "same"      
[171] "so"         "than"       "too"        "very"       "will"

Removing stopwords

data_corpus_inaugural |>
  _[1] |>
  tokens(
    what = "word",
    remove_punct = TRUE,
    remove_symbols = FALSE,
    remove_numbers = TRUE,
    remove_url = FALSE,
    remove_separators = TRUE,
    split_hyphens = FALSE,
    split_tags = FALSE
  ) |>
  tokens_tolower() |>
  tokens_remove(stopwords()) |>
  as.character() |>
  head(100)

  [1] "thank"          "thank"          "much"           "everybody"     
  [5] "wow"            "thank"          "much"           "vice"          
  [9] "president"      "vance"          "speaker"        "johnson"       
 [13] "senator"        "thune"          "chief"          "justice"       
 [17] "roberts"        "justices"       "supreme"        "court"         
 [21] "united"         "states"         "president"      "clinton"       
 [25] "president"      "bush"           "president"      "obama"         
 [29] "president"      "biden"          "vice"           "president"     
 [33] "harris"         "fellow"         "citizens"       "golden"        
 [37] "age"            "america"        "begins"         "right"         
 [41] "now"            "day"            "forward"        "country"       
 [45] "flourish"       "respected"      "world"          "envy"          
 [49] "every"          "nation"         "allow"          "taken"         
 [53] "advantage"      "longer"         "every"          "single"        
 [57] "day"            "trump"          "administration" "simply"        
 [61] "put"            "america"        "first"          "sovereignty"   
 [65] "reclaimed"      "safety"         "restored"       "scales"        
 [69] "justice"        "rebalanced"     "vicious"        "violent"       
 [73] "unfair"         "weaponization"  "justice"        "department"    
 [77] "government"     "end"            "top"            "priority"      
 [81] "create"         "nation"         "proud"          "prosperous"    
 [85] "free"           "america"        "soon"           "greater"       
 [89] "stronger"       "far"            "exceptional"    "ever"          
 [93] "return"         "presidency"     "confident"      "optimistic"    
 [97] "start"          "thrilling"      "new"            "era"

Stemming

data_corpus_inaugural |>
  _[1] |>
  tokens(
    what = "word",
    remove_punct = TRUE,
    remove_symbols = FALSE,
    remove_numbers = TRUE,
    remove_url = FALSE,
    remove_separators = TRUE,
    split_hyphens = FALSE,
    split_tags = FALSE
  ) |>
  tokens_tolower() |>
  tokens_remove(stopwords()) |>
  tokens_wordstem() |>
  as.character() |>
  head(100)

  [1] "thank"       "thank"       "much"        "everybodi"   "wow"        
  [6] "thank"       "much"        "vice"        "presid"      "vanc"       
 [11] "speaker"     "johnson"     "senat"       "thune"       "chief"      
 [16] "justic"      "robert"      "justic"      "suprem"      "court"      
 [21] "unit"        "state"       "presid"      "clinton"     "presid"     
 [26] "bush"        "presid"      "obama"       "presid"      "biden"      
 [31] "vice"        "presid"      "harri"       "fellow"      "citizen"    
 [36] "golden"      "age"         "america"     "begin"       "right"      
 [41] "now"         "day"         "forward"     "countri"     "flourish"   
 [46] "respect"     "world"       "envi"        "everi"       "nation"     
 [51] "allow"       "taken"       "advantag"    "longer"      "everi"      
 [56] "singl"       "day"         "trump"       "administr"   "simpli"     
 [61] "put"         "america"     "first"       "sovereignti" "reclaim"    
 [66] "safeti"      "restor"      "scale"       "justic"      "rebalanc"   
 [71] "vicious"     "violent"     "unfair"      "weapon"      "justic"     
 [76] "depart"      "govern"      "end"         "top"         "prioriti"   
 [81] "creat"       "nation"      "proud"       "prosper"     "free"       
 [86] "america"     "soon"        "greater"     "stronger"    "far"        
 [91] "except"      "ever"        "return"      "presid"      "confid"     
 [96] "optimist"    "start"       "thrill"      "new"         "era"

Document-term matrix

data_corpus_inaugural |>
  tokens(
    what = "word",
    remove_punct = TRUE,
    remove_symbols = FALSE,
    remove_numbers = TRUE,
    remove_url = FALSE,
    remove_separators = TRUE,
    split_hyphens = FALSE,
    split_tags = FALSE
  ) |>
  tokens_tolower() |>
  tokens_remove(stopwords()) |>
  tokens_wordstem() |>
  dfm() |>
  print(max_ndoc = 20, max_nfeat = 9)

Document-feature matrix of: 60 documents, 5,478 features (89.35% sparse) and 4 docvars.
                 features
docs              thank much everybodi wow vice presid vanc speaker johnson
  2025-Trump         23    6         1   1    2     10    1       1       1
  2021-Biden          3    8         0   0    3      7    0       1       0
  2017-Trump          3    1         0   0    0      5    0       0       0
  2013-Obama          1    0         0   0    1      2    0       0       0
  2009-Obama          2    1         0   0    0      1    0       0       0
  2005-Bush           0    0         0   0    1      4    0       0       0
  2001-Bush           2    2         0   0    1      3    0       0       0
  1997-Clinton        0    1         0   0    0      1    0       0       0
  1993-Clinton        2    2         0   0    0      3    0       0       0
  1989-Bush           5    3         0   0    1      7    0       3       0
  1985-Reagan         1    4         0   0    1      3    0       1       0
  1981-Reagan         2    4         0   0    2      6    0       1       0
  1977-Carter         1    0         0   0    0      3    0       0       0
  1973-Nixon          0    1         0   0    1      1    0       1       0
  1969-Nixon          1    0         0   0    2      3    0       0       1
  1965-Johnson        0    1         0   0    0      0    0       0       0
  1961-Kennedy        0    1         0   0    2      4    0       1       1
  1957-Eisenhower     0    2         0   0    1      1    0       1       0
  1953-Eisenhower     0    0         0   0    0      0    0       0       0
  1949-Truman         0    0         0   0    1      1    0       0       0
[ reached max_ndoc ... 40 more documents, reached max_nfeat ... 5,469 more features ]
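
To make the structure of such a matrix concrete, the following base R sketch builds a tiny document-term matrix by hand. The three toy documents and the names docs, toks, vocab, count_terms, and dtm are made up for illustration and are not part of the lecture. quanteda's dfm() holds the same kind of counts, but in a sparse format and after a more careful tokenization.

```r
# Toy documents (hypothetical, not taken from the inaugural corpus)
docs <- c(
  d1 = "we the people",
  d2 = "the people and the nation",
  d3 = "a new nation"
)

# Naive tokenization: lowercase, then split on whitespace
toks <- strsplit(tolower(docs), "\\s+")

# Vocabulary: every distinct token, in order of first appearance
vocab <- unique(unlist(toks))

# Count how often each vocabulary term occurs in one document
count_terms <- function(x) tabulate(match(x, vocab), nbins = length(vocab))

# One row per document, one column per term, cells hold term counts
dtm <- t(vapply(toks, count_terms, integer(length(vocab))))
colnames(dtm) <- vocab
print(dtm)

# Share of zero cells, the quantity that dfm() reports as "sparse"
sparsity <- mean(dtm == 0)
print(round(100 * sparsity, 2))

# Column sums give overall term frequencies, the basis for a "top n" list
print(sort(colSums(dtm), decreasing = TRUE))
```

Even in this toy case more than half of the cells are zero, which is why real document-term matrices are stored sparsely.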

100 most frequent words

top100 <- data_corpus_inaugural |>
  tokens(
    what = "word",
    remove_punct = TRUE,
    remove_symbols = FALSE,
    remove_numbers = TRUE,
    remove_url = FALSE,
    remove_separators = TRUE,
    split_hyphens = FALSE,
    split_tags = FALSE
  ) |>
  tokens_remove(stopwords()) |>
  dfm() |>
  topfeatures(n = 100) |>
  names()

Word Embeddings

top100_emb <- top100 |>
  embed_text(model = "qwen3-embedding:0.6b") |>
  mutate(feature = top100) |>
  relocate(feature)
top100_emb |>
  mutate(across(where(is.numeric), \(x) round(x, digits = 3))) |>
  print(n = 50)

# A tibble: 100 × 1,025
   feature  dim_1  dim_2  dim_3  dim_4  dim_5  dim_6  dim_7  dim_8  dim_9 dim_10
   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 people  -0.045 -0.022 -0.013 -0.007  0.036 -0.068  0.047 -0.034 -0.027  0.024
 2 govern… -0.024 -0.062 -0.011 -0.022  0.053 -0.112 -0.009  0.034 -0.021  0.049
 3 us       0.009 -0.037 -0.014 -0.066  0.053  0.004  0.001 -0.003 -0.037  0.006
 4 can     -0.006 -0.065 -0.014 -0.062  0.025 -0.009  0.036  0.015 -0.007  0.045
 5 must    -0.029 -0.052 -0.013 -0.058  0.039 -0.013  0.051  0.003 -0.02   0.058
 6 upon     0.009 -0.062 -0.014 -0.055  0.038 -0.004  0.026  0.03  -0.035  0.014
 7 great   -0.022 -0.029 -0.014  0.017  0.055  0.009 -0.029  0.049 -0.017  0.025
 8 states   0.006 -0.067 -0.014 -0.067  0.039  0.041  0.039 -0.02  -0.005  0.04 
 9 may     -0.057 -0.009 -0.014 -0.045  0.043 -0.011  0.035  0.034 -0.001  0.013
10 world   -0.015  0.017 -0.012 -0.024  0.057 -0.058  0.009  0.024 -0.067 -0.014
11 nation   0.009 -0.033 -0.013 -0.022  0.065 -0.05  -0.022  0.027 -0.03   0.014
12 country -0.017  0     -0.012 -0.008  0.084 -0.069 -0.013 -0.005 -0.007  0.032
13 shall   -0.002 -0.074 -0.014 -0.043  0.003 -0.028  0.021  0.033 -0.058  0    
14 every   -0.038 -0.05  -0.015 -0.046  0.041 -0.035  0.043  0.035 -0.028 -0.012
15 one      0.007 -0.068 -0.017 -0.057  0.028  0.035  0.006  0.001 -0.039  0.019
16 peace   -0.007 -0.003 -0.013  0.016  0.026 -0.053 -0.019  0.035 -0.018  0.034
17 new     -0.017 -0.054 -0.016 -0.056  0.035 -0.031  0.01   0.083 -0.058 -0.001
18 power    0.015 -0.036 -0.013 -0.004  0.05   0.031  0.02  -0.018 -0.021  0.062
19 now     -0.012 -0.065 -0.015 -0.059  0.05  -0.018  0.029  0.041 -0.046  0.021
20 public  -0.024 -0.071 -0.012 -0.035  0.023 -0.083 -0.005  0.01  -0.031  0.058
21 time    -0.048 -0.013 -0.014 -0.064  0.067  0.023  0.009  0.028 -0.044 -0.007
22 america -0.011  0     -0.012 -0.038  0.039 -0.023 -0.012 -0.004 -0.016  0.018
23 citize… -0.008 -0.043 -0.013 -0.007  0.055 -0.126  0.066  0.003 -0.019  0.017
24 united   0.009 -0.025 -0.014 -0.039  0.053 -0.029  0.004  0.008 -0.055  0.022
25 consti…  0.02  -0.091 -0.013 -0.044  0.034 -0.021 -0.01   0.012 -0.042  0.018
26 nations  0.006 -0.036 -0.013 -0.06   0.041 -0.07   0.018  0.02  -0.039  0.006
27 union    0.009 -0.063 -0.016 -0.062  0.054  0.015  0.008  0.022 -0.026  0.022
28 freedom  0.005 -0.043 -0.013 -0.029  0.053 -0.106 -0.015  0.056 -0.024  0.02 
29 free    -0.02  -0.042 -0.015 -0.086  0.062 -0.052 -0.021  0.072 -0.04   0.042
30 americ… -0.01  -0.009 -0.015 -0.019  0.051 -0.04  -0.033  0.015  0.007  0.016
31 war      0.029 -0.05  -0.012  0.003  0.053 -0.056 -0.03   0.019 -0.047  0.007
32 nation…  0.003 -0.052 -0.015 -0.009  0.056 -0.062 -0.026  0.038 -0.025  0.02 
33 let     -0.004 -0.046 -0.013 -0.069 -0.015  0.012 -0.029  0.058 -0.05   0.031
34 made    -0.034 -0.044 -0.014 -0.07   0.018 -0.012  0.048  0.022 -0.031  0.064
35 years   -0.025  0.004 -0.015 -0.009  0.078 -0.041  0.057  0.022 -0.051 -0.02 
36 make    -0.032 -0.041 -0.014 -0.08   0.031 -0.046  0.039  0.037 -0.055  0.054
37 good    -0.021 -0.045 -0.015 -0.01   0.038 -0.006 -0.039  0.065 -0.008  0.066
38 justice  0.017 -0.047 -0.012  0      0.034 -0.069 -0.009  0.062 -0.015 -0.021
39 spirit   0.011  0.035 -0.011  0.002  0.049 -0.025  0.015  0.013 -0.021  0.019
40 never   -0.018 -0.055 -0.015 -0.051  0.029 -0.012 -0.01   0.05  -0.047  0.019
41 without -0.026 -0.064 -0.016 -0.077  0.045 -0.019  0.001  0.034 -0.047  0.027
42 life    -0.045  0.007 -0.013  0.004  0.047 -0.028  0.02   0.033 -0.028  0.021
43 men     -0.023 -0.049 -0.015 -0.014  0.009 -0.057  0.038 -0.015 -0.014  0.013
44 rights  -0.009 -0.021 -0.015 -0.075  0.037 -0.098 -0.029  0.047 -0.031  0.026
45 law     -0.014 -0.047 -0.011 -0.016  0.024 -0.078 -0.019  0.074 -0.009 -0.011
46 just     0.031 -0.042 -0.012 -0.038  0.025  0.017  0.02   0.043 -0.04   0.036
47 congre… -0.019 -0.068 -0.011 -0.008  0.035 -0.051 -0.009  0.003 -0.044  0.034
48 laws    -0.023 -0.037 -0.013 -0.024  0.024 -0.076 -0.02   0.067 -0.013 -0.015
49 right   -0.015 -0.032 -0.014 -0.015  0.041 -0.015 -0.038  0.056 -0.026  0.027
50 work    -0.015 -0.048 -0.012 -0.046  0.037 -0.03   0.051  0.03  -0.008  0.049
# ℹ 50 more rows
# ℹ 1,014 more variables: dim_11 <dbl>, dim_12 <dbl>, dim_13 <dbl>,
#   dim_14 <dbl>, dim_15 <dbl>, dim_16 <dbl>, dim_17 <dbl>, dim_18 <dbl>,
#   dim_19 <dbl>, dim_20 <dbl>, dim_21 <dbl>, dim_22 <dbl>, dim_23 <dbl>,
#   dim_24 <dbl>, dim_25 <dbl>, dim_26 <dbl>, dim_27 <dbl>, dim_28 <dbl>,
#   dim_29 <dbl>, dim_30 <dbl>, dim_31 <dbl>, dim_32 <dbl>, dim_33 <dbl>,
#   dim_34 <dbl>, dim_35 <dbl>, dim_36 <dbl>, dim_37 <dbl>, dim_38 <dbl>, …

Word Embeddings: Similarity

top100_sim_matrix <- simil(as.matrix(select(top100_emb, -feature)), method = "cosine")
dimnames(top100_sim_matrix) <- list(top100, top100)
top100_sim_matrix |>
  as.matrix() |>
  data.frame() |>
  rownames_to_column(var = "feature") |>
  as_tibble() |>
  gather("feature2", "similarity", -feature) |>
  filter(feature > feature2) |>
  arrange(desc(similarity)) |>
  print(n = 30)

# A tibble: 4,950 × 3
   feature      feature2 similarity
   <chr>        <chr>         <dbl>
 1 laws         law           0.970
 2 states       state         0.964
 3 national     nation        0.951
 4 nation       country       0.933
 5 men          man           0.924
 6 duty         duties        0.906
 7 people       men           0.906
 8 national     country       0.897
 9 americans    america       0.895
10 political    policy        0.892
11 nation       america       0.892
12 law          justice       0.891
13 country      america       0.890
14 one          first         0.884
15 without      within        0.877
16 nations      nation        0.877
17 without      now           0.877
18 without      upon          0.872
19 good         best          0.871
20 constitution congress      0.870
21 americans    american      0.869
22 now          new           0.867
23 today        day           0.867
24 much         many          0.865
25 laws         justice       0.865
26 make         made          0.864
27 now          never         0.863
28 government   congress      0.861
29 still        now           0.861
30 never        ever          0.860
# ℹ 4,920 more rows
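
The method = "cosine" argument refers to cosine similarity: the dot product of two vectors divided by the product of their lengths. The base R sketch below computes it by hand. The helper cosine_sim() and the three-dimensional toy vectors are illustrative assumptions, not real embeddings from the model.

```r
# Cosine similarity of two numeric vectors
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Made-up "embeddings" with three dimensions (hypothetical values)
emb <- rbind(
  law   = c(1, 2, 0),
  laws  = c(2, 4, 0),
  peace = c(0, 0, 3)
)

# Same direction, different length: similarity is 1 (up to floating point)
print(cosine_sim(emb["law", ], emb["laws", ]))

# Orthogonal vectors: similarity is 0
print(cosine_sim(emb["law", ], emb["peace", ]))

# All pairwise similarities at once: scale each row to unit length,
# then multiply the matrix by its own transpose
emb_unit <- emb / sqrt(rowSums(emb^2))
sim <- emb_unit %*% t(emb_unit)
print(round(sim, 3))
```

Because the measure ignores vector length, it only compares the direction of two embeddings, which is the usual choice for embedding vectors.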

Sentence Embeddings

set.seed(1)
sentence_sample <- data_corpus_inaugural |>
  tokens(what = "sentence") |>
  as.character() |>
  sample(300) |>
  unique()
sentence_emb <- sentence_sample |>
  embed_text(model = "qwen3-embedding:0.6b") |>
  mutate(sentence = sentence_sample) |>
  relocate(sentence)
print(sentence_emb, n = 50)

# A tibble: 300 × 1,025
   sentence          dim_1    dim_2    dim_3    dim_4    dim_5    dim_6    dim_7
   <chr>             <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1 "To renew Ame… -2.25e-2 -0.0336  -0.0105   2.37e-2  0.0603  -1.12e-1  9.42e-3
 2 "Actual event…  1.67e-2 -0.0587  -0.00838  6.88e-2  0.0559  -9.04e-2 -3.42e-2
 3 "We can gain … -1.17e-2 -0.0244  -0.00678  4.06e-2 -0.0129  -3.32e-2 -2.70e-2
 4 "As early as … -1.45e-2  0.0235  -0.00560  4.22e-2 -0.0105  -8.61e-2  3.67e-2
 5 "Just as Amer… -4.32e-3 -0.0319  -0.00575 -3.59e-3  0.00715 -8.32e-2  3.65e-5
 6 "If this is t…  3.53e-2 -0.0804  -0.0102  -2.30e-3 -0.0433  -1.20e-1  4.58e-2
 7 "I am certain…  1.18e-2 -0.0690  -0.00791  1.57e-2 -0.0134  -1.07e-1  1.70e-2
 8 "And, I belie… -3.41e-2 -0.0132  -0.00555  1.30e-2  0.00736 -2.51e-2 -5.87e-2
 9 "It is safe t… -4.17e-2 -0.0704  -0.00512  1.63e-3  0.0233  -6.04e-2 -4.09e-2
10 "Friends and …  3.86e-2 -0.0758  -0.00910 -4.15e-3 -0.0486  -1.01e-1  4.90e-2
11 "The Governme… -2.02e-2 -0.0154  -0.00691 -2.68e-2  0.0515  -6.24e-2 -1.96e-2
12 "In doing thi…  1.89e-2 -0.100   -0.00635 -5.79e-3  0.0120  -7.62e-2 -1.32e-2
13 "Rather, it h… -1.19e-2 -0.0370  -0.0129   5.96e-2 -0.00316 -1.13e-1 -1.66e-2
14 "I have appro…  3.14e-2 -0.0557  -0.00991 -1.11e-2  0.0578  -3.22e-2 -2.34e-2
15 "We will repa…  7.85e-2 -0.0361  -0.0111  -3.51e-2  0.0818  -9.00e-2  3.13e-3
16 "Finally, to …  3.59e-2 -0.00509 -0.0110   1.55e-3  0.00858 -9.71e-2 -9.65e-3
17 "My best effo… -1.18e-2 -0.0647  -0.0116  -4.49e-3  0.0310   6.54e-4  5.29e-2
18 "And we have …  8.67e-3 -0.0619  -0.00824 -2.13e-2  0.0371  -6.51e-2 -3.48e-2
19 "And we will … -1.34e-2 -0.0580  -0.00585 -1.78e-2  0.0334  -7.92e-2 -1.78e-2
20 "And in any c… -2.45e-2 -0.0793  -0.00348 -3.17e-2  0.0811  -8.45e-3 -4.05e-2
21 "Today I also…  2.14e-2 -0.0480  -0.00588 -2.26e-2  0.0523  -1.41e-1  9.61e-2
22 "We must keep… -1.12e-2 -0.0462  -0.00956 -1.11e-4  0.0505  -8.39e-2  4.17e-2
23 "What makes u…  5.86e-3 -0.0482  -0.00943  1.50e-2  0.0267  -9.91e-2  9.89e-4
24 "The prayers …  2.14e-2 -0.0116  -0.00460 -2.24e-2  0.0174   1.77e-4  3.41e-2
25 "The whole sy…  5.67e-4 -0.139   -0.0122   5.36e-3 -0.00749 -7.54e-2  4.23e-2
26 "War never le… -2.68e-2 -0.0443  -0.00852  1.92e-2  0.0485  -9.41e-2 -2.44e-2
27 "I would like… -7.08e-2  0.0102  -0.0102   4.41e-2  0.0500  -1.11e-1  1.65e-2
28 "Small wonder… -8.62e-3 -0.0385  -0.00951  2.99e-2  0.0197  -8.60e-2  4.91e-2
29 "We are told …  3.23e-2 -0.0597  -0.00792  8.33e-2  0.0175  -9.01e-2 -4.42e-2
30 "Whoever woul… -5.41e-2 -0.00349 -0.00527  4.77e-2 -0.0681  -1.14e-2 -3.52e-2
31 "If we meet t…  1.49e-2 -0.0626  -0.0102   1.43e-2 -0.00148 -8.34e-2  6.12e-3
32 "The North an… -2.56e-2 -0.0886  -0.00970  3.97e-2  0.0424  -8.56e-2  1.74e-2
33 "This trial c…  2.23e-2 -0.0138  -0.00806  1.32e-1  0.0343  -3.52e-2 -1.37e-2
34 "Ours is a la… -5.39e-2 -0.0437  -0.00798  1.44e-2  0.0395  -8.75e-2  3.89e-2
35 "But we shall…  3.04e-3 -0.0521  -0.00774 -3.27e-3  0.0128  -1.83e-2  4.81e-2
36 "Experience h…  6.17e-3 -0.102   -0.00788  5.72e-2 -0.0193  -9.25e-2  6.29e-2
37 "In our prese…  1.59e-2 -0.0554  -0.00385 -9.71e-3  0.0357  -3.56e-2  9.67e-3
38 "Now the very… -4.03e-2 -0.116   -0.00688 -2.93e-3  0.00555  1.12e-2  3.17e-2
39 "Wise counsel…  1.13e-4 -0.0791  -0.00624  2.66e-2 -0.0249  -6.19e-2  5.10e-2
40 "So much has … -7.68e-2 -0.0377  -0.0110   3.19e-2  0.0662  -3.03e-2  1.97e-2
41 "They have as…  4.38e-2 -0.0985  -0.00940  7.07e-3  0.0153  -4.62e-2  4.15e-2
42 "We have beat…  7.35e-2 -0.0288  -0.0126   5.14e-2  0.0444  -3.66e-2 -7.72e-3
43 "Fortunately … -5.41e-2 -0.0927  -0.00480  3.58e-2  0.0426  -1.31e-2  3.39e-2
44 "We contempla… -1.86e-2 -0.0711  -0.01000 -1.17e-3  0.0277  -7.34e-2  6.52e-2
45 "They made a …  1.10e-2 -0.0527  -0.00426 -3.91e-2  0.00186 -9.29e-2  3.01e-2
46 "Some may sti… -5.47e-2  0.0630  -0.00754  8.45e-2  0.0419  -6.70e-2  4.23e-2
47 "President Re… -3.50e-2  0.0179  -0.00685 -4.03e-2  0.00825 -4.26e-2 -1.05e-2
48 "It fails to …  9.40e-3 -0.0660  -0.00717  4.52e-2  0.0375  -1.52e-1 -3.48e-2
49 "But we have … -2.42e-2 -0.0749  -0.00827  1.92e-2  0.0213   7.21e-2  4.12e-4
50 "A bridge wid…  1.20e-2 -0.0154  -0.00821  1.84e-2  0.0260  -8.72e-2  6.55e-2
# ℹ 250 more rows
# ℹ 1,017 more variables: dim_8 <dbl>, dim_9 <dbl>, dim_10 <dbl>, dim_11 <dbl>,
#   dim_12 <dbl>, dim_13 <dbl>, dim_14 <dbl>, dim_15 <dbl>, dim_16 <dbl>,
#   dim_17 <dbl>, dim_18 <dbl>, dim_19 <dbl>, dim_20 <dbl>, dim_21 <dbl>,
#   dim_22 <dbl>, dim_23 <dbl>, dim_24 <dbl>, dim_25 <dbl>, dim_26 <dbl>,
#   dim_27 <dbl>, dim_28 <dbl>, dim_29 <dbl>, dim_30 <dbl>, dim_31 <dbl>,
#   dim_32 <dbl>, dim_33 <dbl>, dim_34 <dbl>, dim_35 <dbl>, dim_36 <dbl>, …

Sentence Embeddings: Similarity

sentence_sim_matrix <- simil(as.matrix(select(sentence_emb, -sentence)), method = "cosine")
dimnames(sentence_sim_matrix) <- list(sentence_sample, sentence_sample) # Name rows and columns so that as_tibble() receives unique column names
sentence_sim_matrix |>
  as.matrix() |>
  as_tibble() |>
  mutate(sentence = sentence_sample) |>
  gather("sentence2", "similarity", -sentence) |>
  filter(sentence > sentence2) |>
  arrange(desc(similarity)) |>
  slice_head(n = 20) |>
  knitr::kable()

sentence	sentence2	similarity
Just as America’s role is indispensable in preserving the world’s peace, so is each nation’s role indispensable in preserving its own peace.	It is important that we understand both the necessity and the limitations of America’s role in maintaining that peace.	0.8076691
There is so much to be done.	For everywhere we look, there is work to be done.	0.7877597
This faith is the abiding creed of our fathers.	That’s what will lend meaning to the creed our fathers once declared.	0.7758326
Yet we endured and we prevailed.	We have beaten back despair and defeatism.	0.7722788
We must act, knowing that our work will be imperfect.	We must act on what we know.	0.7379628
It is our glory that whilst other nations have extended their dominions by the sword we have never acquired any territory except by fair purchase or, as in the case of Texas, by the voluntary determination of a brave, kindred, and independent people to blend their destinies with our own.	Foreign powers should therefore look on the annexation of Texas to the United States not as the conquest of a nation seeking to extend her dominions by arms and violence, but as the peaceful acquisition of a territory once her own, by adding another member to our confederation, with the consent of that member, thereby diminishing the chances of war and opening to them new and ever-increasing markets for their products.	0.7312595
Now more than ever, we must do these things together, as one nation and one people.	For any one of us to succeed, we must succeed as one America.	0.7287994
That’s America.	And, I believe America is better than this.	0.7244906
To renew America we must be bold.	A spring reborn in the world’s oldest democracy, that brings forth the vision and courage to reinvent America.	0.7093138
We will carry on.	We sing it still.	0.7087671
We will repair our alliances and engage with the world once again.	To the world, too, we offer new engagement and a renewed vow: We will stay strong to protect the peace.	0.7057480
I will fight for you with every breath in my body - and I will never, ever let you down.	I will always level with you.	0.7056759
We must act, knowing that our work will be imperfect.	For everywhere we look, there is work to be done.	0.7034697
To renew America we must be bold.	And by our dreams and labors we will redeem the promise of America in the 21st century.	0.6984393
To renew America we must be bold.	Above all, my message to Americans today is that it is time for us to once again act with courage, vigor, and the vitality of history’s greatest civilization.	0.6948836
This is our summons to greatness.	Now we must step up.	0.6942487
That’s America.	America stands alone as the world’s indispensable nation.	0.6926186
Yet we endured and we prevailed.	We will carry on.	0.6907114
We have been carried in safety through a perilous crisis.	Our crisis today is the reverse.	0.6886360
Texas was once a part of our country - was unwisely ceded away to a foreign power - is now independent, and possesses an undoubted right to dispose of a part or the whole of her territory and to merge her sovereignty as a separate and independent state in ours.	It is our glory that whilst other nations have extended their dominions by the sword we have never acquired any territory except by fair purchase or, as in the case of Texas, by the voluntary determination of a brave, kindred, and independent people to blend their destinies with our own.	0.6858266

Lexicoder Sentiments

data_dictionary_LSD2015

Dictionary object with 4 key entries.
- [negative]:
  - a lie, abandon*, abas*, abattoir*, abdicat*, aberra*, abhor*, abject*, abnormal*, abolish*, abominab*, abominat*, abrasiv*, absent*, abstrus*, absurd*, abus*, accident*, accost*, accursed* [ ... and 2,838 more ]
- [positive]:
  - ability*, abound*, absolv*, absorbent*, absorption*, abundanc*, abundant*, acced*, accentuat*, accept*, accessib*, acclaim*, acclamation*, accolad*, accommodat*, accomplish*, accord, accordan*, accorded*, accords [ ... and 1,689 more ]
- [neg_positive]:
  - best not, better not, no damag*, no no, not ability*, not able, not abound*, not absolv*, not absorbent*, not absorption*, not abundanc*, not abundant*, not acced*, not accentuat*, not accept*, not accessib*, not acclaim*, not acclamation*, not accolad*, not accommodat* [ ... and 1,701 more ]
- [neg_negative]:
  - not a lie, not abandon*, not abas*, not abattoir*, not abdicat*, not aberra*, not abhor*, not abject*, not abnormal*, not abolish*, not abominab*, not abominat*, not abrasiv*, not absent*, not abstrus*, not absurd*, not abus*, not accident*, not accost*, not accursed* [ ... and 2,840 more ]

13.3 Homework

1) Read Bachl & Scharkow (2024).

13.4 Transcript

Notes on the automatically generated transcript

The following transcript was produced from the recording of the lecture. The complete recordings, including the screen content, are available in Blackboard🔒. The audio track was first transcribed verbatim using the tools of the Oral-History.Digital project. This verbatim transcription was then condensed, together with the lecture slides, into a readable transcript with the help of language models (mainly Claude Sonnet 4.5 and GPT 5.2). Afterwards, a student assistant checked, smoothed and, where necessary, adjusted the transcript. Errors can creep in at several points in this process. In case of doubt, the spoken word applies, and I make mistakes when lecturing, too.

I am providing the transcript here as experimental, supplementary material documenting the lecture. I am not yet sure whether it is a useful addition, and I reserve the right to revise or delete it.

Computational Text Analysis

This lecture treats computational text analysis as an umbrella term for the automated analysis of texts with computers and algorithms. It focuses on the historical development of the field, the conversion of texts into data, and the most important families of methods: unsupervised methods, dictionaries, embeddings, topic modeling, supervised machine learning and modern applications of large language models.

Development of the field

The lecture traces the development from the 1950s and 1960s to today's language models. In the early phases, the focus was mainly on word counts, because computers were becoming available but were expensive, and media content often had to be digitized laboriously first. Later, from the 1970s to the 2000s, the digitization of media and the spread of PCs made working with texts considerably easier and more scalable. Since the 2010s, deep learning and large language models have shaped the field, because they are trained on very large amounts of text and provide semantically richer representations.

Early phase: counting words, simple dictionaries, high technical hurdles.
Middle phase: digital media, better software, wider use in communication research and political science.
Present: embeddings, deep learning, large language models and broad access to such methods.

Texts as data

For texts to be analyzed by computer, they have to be converted into a numerical form. The lecture distinguishes three basic representations for this purpose: bag-of-words, semantic networks and embeddings. In practice, bag-of-words and embeddings are the most relevant today. The central idea is that texts are no longer treated as something to be read but as data sets. This makes it possible to apply classical statistical methods, but only after certain methodological decisions have been made.

Texts have to be converted into numbers or vectors.
The choice of representation strongly influences the results.
Depending on the method, different aspects of language are preserved or discarded.

Bag-of-Words

In the bag-of-words approach, a text is treated as a "bag" of words. The order of the words plays no role; what matters is how often certain words or text components occur. This requires several preprocessing steps: tokenization, lowercasing, removal of stop words and often also stemming or lemmatization. The end result is a document-term matrix in which rows represent documents and columns represent words.

Tokenization: the text is split into individual words or tokens.
Removing stop words: very frequent function words are often deleted because they contribute little to the content.
Stemming/lemmatization: word forms are reduced to a common base form.
Document-term matrix: documents in rows, words in columns, values = frequencies.
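
The steps listed above can be sketched in a few lines of base R. This is only a minimal illustration, not the lecture's code: the three example sentences and the tiny stop word list are hypothetical, and stemming is left out to keep the sketch short.

```r
# Toy corpus (hypothetical example sentences, not from the lecture)
docs <- c(
  d1 = "The nation needs new laws.",
  d2 = "New laws protect the nation and the people.",
  d3 = "The people love the country."
)

# Small illustrative stop word list (real lists are much longer)
stop_words <- c("the", "and", "a", "of")

# Steps 1 and 2: lowercase, then split on anything that is not a letter
tokenize <- function(x) {
  tk <- strsplit(tolower(x), "[^a-z]+")[[1]]
  tk[nzchar(tk)]                          # drop empty strings
}
tokens <- lapply(docs, tokenize)

# Step 3: remove stop words
tokens <- lapply(tokens, function(tk) tk[!tk %in% stop_words])

# Step 4: document-term matrix with documents in rows and terms in columns
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
print(dtm)
```

Even in this toy case most cells are zero, because each document uses only a few of the terms in the vocabulary.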

The lecture emphasizes that these decisions are not neutral. Even small changes in preprocessing can change the results considerably, for example depending on whether stop words are removed or kept. Bag-of-words tends to work well for longer texts in standard language, but is often too uninformative for short social media posts.

Example: inaugural addresses

A corpus of inaugural addresses by US presidents was used as an example. The lecture uses it to show how running text is tokenized step by step, lowercased, cleaned of stop words and finally converted into a document-feature matrix. In the end, the 60 speeches yield around 5,478 features, that is, a very large number of columns for comparatively few cases. This illustrates a typical problem of text analysis: language is highly variable, so even longer texts overlap only to a limited extent in the words they use.

The text is first split into tokens.
Then lowercasing and further normalizations are applied.
After stop word removal and stemming, heavily condensed word forms remain.
The result is a very wide, sparse matrix with many features.

Embeddings

Embeddings are a more modern form of text representation. Words are represented as vectors in a high-dimensional space; in the lecture's example, these vectors have 1024 dimensions. The basic idea is that the model learns from large amounts of text which words occur in similar contexts. As a result, semantically similar words lie close to each other in this space, for example "laws" and "law" or "nation" and "country".

Words are represented as numerical vectors.
Proximity in the vector space stands for similarity in meaning.
Embeddings can be computed for words, sentences and longer texts.
Preprocessing is often much less important than for bag-of-words.

The lecture distinguishes static word embeddings from contextual embeddings. Static embeddings assign each word one fixed position in the space, whereas contextual models take into account that words can have different meanings depending on their surroundings. This is particularly important for ambiguous words such as "bank" or "date".

Similarity and validity

A central advantage of embeddings is that similarities can be computed directly. The lecture uses cosine similarity for this, a measure of how closely two vectors point in the same direction. The results shown are intuitively plausible: "laws" and "law" are very similar, as are "national" and "nation" or "nation" and "country". This shows that embeddings can make many forms of normalization partly unnecessary.
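
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The following sketch computes it in base R; the four-dimensional vectors are hypothetical stand-ins for real embeddings, which would come from a trained model and have far more dimensions.

```r
# Cosine similarity: dot product divided by the product of the vector lengths
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical 4-dimensional "embeddings" (illustrative values only)
emb <- rbind(
  law     = c(0.9, 0.1, 0.0, 0.2),
  laws    = c(0.8, 0.2, 0.1, 0.2),
  country = c(0.1, 0.9, 0.3, 0.0),
  nation  = c(0.2, 0.8, 0.4, 0.1)
)

sim_law_laws    <- cosine_sim(emb["law", ], emb["laws", ])     # close to 1
sim_law_country <- cosine_sim(emb["law", ], emb["country", ])  # clearly lower

# All pairwise similarities at once: scale rows to unit length, then multiply
unit <- emb / sqrt(rowSums(emb^2))
sim_matrix <- unit %*% t(unit)
round(sim_matrix, 2)
```

For large matrices a dedicated package is usually faster; the lecture loads proxyC for this purpose, while the sketch above deliberately stays in base R.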

The lecture also points out limitations. Embeddings absorb bias from their training data, that is, societal distortions such as sexism or racism, and these can affect later analyses. It is therefore always necessary to check how well an embedding fits one's own data and research questions.

Similarity is determined statistically via vector distances or cosine similarity.
Linguistic normalization such as stemming is often less necessary.
Bias in the training data can enter the analysis.
Validation includes comparison with human judgments.

Unsupervised methods

Unsupervised methods work without predefined categories. Large amounts of text are fed into a procedure, and one examines which structures or groups emerge. Typical goals are exploration and description: Which topics does a corpus contain? Which documents resemble each other? Which patterns appear without having been specified in advance?

No predefined target categories.
Suitable for exploratory research questions.
Particularly relevant: topic modeling.
Simple text statistics also belong to this category.

Text statistics

The simplest form of unsupervised analysis is computing text statistics per document. These include word frequencies, lexical diversity and formal readability indices that work with sentence lengths and word lengths, for example. Such measures are practical because they can be interpreted very directly. At the same time, their theoretical content often remains limited, because they are initially just numerical summaries of texts.

Word frequencies often serve as the basis for word clouds or comparative analyses.
Readability indices combine formal features such as sentence length and syllable count.
Such measures are descriptively useful, but not always theoretically deep.
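
A few of these statistics can be computed with base R alone. The sketch below uses a hypothetical example text and reports the number of tokens, the type-token ratio as a simple measure of lexical diversity, and the mean sentence and word length. Readability indices would additionally need syllable counts, which are not implemented here.

```r
# Hypothetical example text
txt <- "We will act. We will act together, and we will not fail."

# Split into sentences at ., ! or ? and into lowercase word tokens
sentences <- trimws(unlist(strsplit(txt, "[.!?]+")))
sentences <- sentences[nzchar(sentences)]
words <- unlist(strsplit(tolower(txt), "[^a-z]+"))
words <- words[nzchar(words)]

n_tokens <- length(words)                  # total number of words
n_types  <- length(unique(words))          # number of distinct words
ttr      <- n_types / n_tokens             # type-token ratio
mean_sentence_length <- n_tokens / length(sentences)
mean_word_length     <- mean(nchar(words))

# Word frequencies, most frequent first
head(sort(table(words), decreasing = TRUE), 5)
```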

Topic modeling

Topic modeling is meant to make topics visible in large amounts of text. The lecture emphasizes that "topic" is an attractive term because it sounds immediately understandable, even though its theoretical definition often remains fuzzy. With classical topic models, one usually only has to specify roughly how many topics there should be, not what they are. The output then shows the typical words for each topic and allows a descriptive interpretation.

Large amounts of text go in, topic structures come out.
Usually only the number of topics has to be fixed in advance.
Particularly useful for newspaper articles, social media or other large corpora.
The theoretical added value remains partly open, because "topic" is not always clearly defined.
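
To make the input and output of such a model concrete, here is a small base R sketch. It is an assumption-laden stand-in: instead of a classical probabilistic topic model it uses non-negative matrix factorization (NMF) with multiplicative updates, which likewise splits a document-term matrix into document-topic and topic-term weights once the number of topics k is given. The counts and terms are hypothetical.

```r
set.seed(1)

# Toy document-term matrix (hypothetical counts): documents in rows, terms in columns
dtm <- rbind(
  d1 = c(3, 2, 0, 0),
  d2 = c(2, 3, 0, 0),
  d3 = c(0, 0, 3, 2),
  d4 = c(0, 0, 2, 3)
)
colnames(dtm) <- c("election", "party", "goal", "match")

k <- 2                                           # number of topics, set by the researcher
W <- matrix(runif(nrow(dtm) * k), nrow(dtm), k)  # document-topic weights
H <- matrix(runif(k * ncol(dtm)), k, ncol(dtm))  # topic-term weights
eps <- 1e-9                                      # avoids division by zero

# Multiplicative updates that reduce the squared reconstruction error of W %*% H
for (i in 1:500) {
  H <- H * (t(W) %*% dtm) / (t(W) %*% W %*% H + eps)
  W <- W * (dtm %*% t(H)) / (W %*% H %*% t(H) + eps)
}
colnames(H) <- colnames(dtm)
rownames(W) <- rownames(dtm)

# Two highest-weighted terms per topic and the dominant topic per document
top_terms <- apply(H, 1, function(h) names(sort(h, decreasing = TRUE))[1:2])
topic <- apply(W, 1, which.max)
top_terms
topic
```

The numbering of the topics is arbitrary, and labeling them (here roughly "politics" and "sports") remains an interpretive step for the researcher.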

Supervised methods

Supervised methods work with target categories that are defined in advance. The researchers thus specify what the model is supposed to learn, for example positive/negative sentiment, politics/sports/business or particular frames. The machine then learns from training data how texts can be assigned to these categories and applies what it has learned to new texts.

Target categories are fixed in advance.
Researchers bring in more prior knowledge than with unsupervised methods.
Suitable for classification and prediction.
Typical approaches: dictionaries, supervised machine learning, transfer learning.

Dictionaries

Dictionaries are a classic approach in content analysis. One defines a list of words that stand for a construct and then counts how often they occur in texts. If a text contains many words from a negative dictionary, it is interpreted as more negative. This method is transparent and easy to follow, but limited in substance, because ambiguity and context are hard to take into account.

A dictionary stands for a theoretical construct.
The more matches, the more strongly the construct is assumed to be present.
Very easy to follow, but methodologically limited when words are ambiguous.
Still useful today, for example for archives or simple search tasks.
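
The counting logic can be sketched in base R. The mini dictionary below reuses a handful of glob patterns from the LSD2015 printout above (the asterisk matches any word ending); the two example sentences and the helper functions are hypothetical. In quanteda, functions such as dfm_lookup() apply a full dictionary to a corpus.

```r
# Mini dictionary with a few glob patterns in the style of LSD2015
dict <- list(
  negative = c("abandon*", "abus*", "absurd*"),
  positive = c("abundant*", "accept*", "accomplish*")
)

# Hypothetical example documents
docs <- c(
  d1 = "They accepted the offer and accomplished a lot.",
  d2 = "It was absurd to abandon the plan; the abuse continued."
)

tokenize <- function(x) {
  tk <- strsplit(tolower(x), "[^a-z]+")[[1]]
  tk[nzchar(tk)]
}

# Count the tokens that match any glob pattern of one dictionary key
count_matches <- function(patterns, tokens) {
  regex <- paste(utils::glob2rx(patterns), collapse = "|")
  sum(grepl(regex, tokens))
}

# One row per document, one column per dictionary key
counts <- t(sapply(docs, function(d) {
  tk <- tokenize(d)
  sapply(dict, count_matches, tokens = tk)
}))
counts

# Simple net score: positive minus negative matches
net_sentiment <- counts[, "positive"] - counts[, "negative"]
net_sentiment
```

The sketch counts "abuse" as negative regardless of context, which is exactly the limitation with ambiguity described above.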

Distributed dictionaries

A more modern variant is Distributed Dictionary Representations. Here one no longer searches only for exact word matches; instead, the dictionary is translated into the embedding space. Rather than only counting how often words occur exactly, one can measure how close a text as a whole is to a semantic group of words. This makes the method more robust to spelling variants and differences in context.

The dictionary idea is combined with embeddings.
Exact matches are no longer required.
Continuous measurement instead of only binary or count-based measurement.
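
One simple way to implement this idea, sketched here under simplifying assumptions, is to average the embedding vectors of the dictionary words, average the vectors of the words in a document, and take the cosine similarity of the two averages. The three-dimensional embeddings and the documents below are hypothetical.

```r
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hypothetical 3-dimensional word embeddings (real ones come from a trained model)
emb <- rbind(
  good   = c(0.90, 0.10, 0.10),
  great  = c(0.80, 0.20, 0.00),
  superb = c(0.85, 0.15, 0.05),
  bad    = c(0.10, 0.90, 0.10),
  movie  = c(0.10, 0.10, 0.90),
  was    = c(0.30, 0.30, 0.30)
)

# Represent a set of words by the average of their vectors
avg_vector <- function(words) colMeans(emb[words, , drop = FALSE])

# The dictionary becomes a single point in the embedding space
dict_positive <- avg_vector(c("good", "great"))

doc_a <- c("movie", "was", "superb")   # contains no dictionary word
doc_b <- c("movie", "was", "bad")

sim_a <- cosine_sim(avg_vector(doc_a), dict_positive)
sim_b <- cosine_sim(avg_vector(doc_b), dict_positive)
c(doc_a = sim_a, doc_b = sim_b)
```

Although doc_a contains no word from the dictionary, it scores higher than doc_b because "superb" lies close to "good" and "great" in this toy space. An exact-match count would have given both documents zero.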

Supervised machine learning

In supervised machine learning, manually coded training data are used to train a classification model. The workflow is: human coding, training the model, checking its quality, and then prediction for new texts. The lecture emphasizes that this fits content analysis in communication research particularly well, because many students are already familiar with codebooks and manual coding.

Manually coded data serve as the basis for training.
The model learns rules from the labeled examples.
It can then classify new texts.
Modern variants often use embeddings instead of bag-of-words features.
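
The workflow can be illustrated with a small multinomial naive Bayes classifier on bag-of-words counts, written in base R. This is just one possible classifier, chosen here because it is short; the labeled texts are hypothetical, and a real application would need far more training data and a quality check on held-out coded texts.

```r
# Hypothetical hand-coded training data from a manual content analysis
train <- data.frame(
  text  = c("great win for the team", "the team lost the match",
            "parliament passed the law", "the minister proposed a new law",
            "a late goal decided the match", "the party won the election"),
  label = c("sports", "sports", "politics", "politics", "sports", "politics")
)
new_texts <- c("the law was debated in parliament", "the team scored a goal")

tokenize <- function(x) {
  tk <- strsplit(tolower(x), "[^a-z]+")[[1]]
  tk[nzchar(tk)]
}

vocab   <- sort(unique(unlist(lapply(train$text, tokenize))))
classes <- sort(unique(train$label))

# Training: class priors and word probabilities with add-one smoothing
log_prior <- log(as.vector(table(train$label)[classes]) / nrow(train))
log_lik <- sapply(classes, function(cl) {
  tk <- unlist(lapply(train$text[train$label == cl], tokenize))
  counts <- table(factor(tk, levels = vocab))
  log((counts + 1) / (sum(counts) + length(vocab)))
})  # matrix: vocabulary terms in rows, classes in columns

# Prediction: sum the log probabilities of known words, pick the most likely class
predict_nb <- function(text) {
  tk <- tokenize(text)
  tk <- tk[tk %in% vocab]                    # ignore words unseen in training
  scores <- log_prior + colSums(log_lik[tk, , drop = FALSE])
  classes[which.max(scores)]
}

predictions <- vapply(new_texts, predict_nb, character(1), USE.NAMES = FALSE)
predictions
```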

Transfer learning and LLMs

Transfer learning means that pretrained models are fine-tuned for a new purpose. The lecture describes this as using large language models that already bring a great deal of contextual knowledge and are then adapted to specific classification tasks. This considerably reduces the effort needed for training data. The lecture points out that in many cases far less manual annotation is needed today than in the past, for example only a few hundred instead of many thousands of examples.

Pretrained models are adapted to new tasks.
Less training data is needed than with older methods.
Large language models are thus becoming increasingly practical for content analysis.
The lecture sees this as a particularly important development for the future.

Conclusion

The central development traced in the lecture is the path from simple word counting to increasingly semantically informed methods. This makes the analysis more powerful, because not only frequencies but increasingly also relations of meaning and context are taken into account. At the same time, it remains important that every method contains assumptions and requires methodological decisions. Especially with preprocessing, embeddings, dictionaries and machine classifiers, one must always check what is being measured and which distortions can arise.