SR2 News Mining

Analysing news broadcast by the German radio station SR2

Analysis

Broadcasting stations set the agenda and frame public opinion. This project analyses the daily news broadcast by the German radio station SR2.

Show the code
# Setup
# Load packages
library(knitr)
library(tidytext)
library(tidyverse)
library(RColorBrewer)
library(wordcloud)


# Load data
folder <- "../../../3_SR2 News Mining/data"
files <- list.files(folder, pattern = "\\.Rdata$", full.names = TRUE)

loaded_data <- vector("list")
for (file in files) {
  load(file)
  loaded_data[[file]] <- news  # Assuming the data frames are named "news"
}

news_raw <- loaded_data %>%
  imap(~ mutate(.x, source_file = .y)) %>%  # tag each data frame with its file path
  list_rbind() %>%
  as_tibble()
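Since load() drops objects into the calling environment by name, the loop above silently assumes every file stores its data frame as news. A slightly more defensive variant (a sketch, assuming each file contains exactly one object) reads each file into a fresh environment instead:

Show the code
# Load each .Rdata file into its own environment and grab whatever it contains
read_rdata <- function(file) {
  e <- new.env()
  obj_names <- load(file, envir = e)  # load() returns the names it created
  get(obj_names[1], envir = e)
}

news_raw2 <- files %>%
  set_names() %>%                     # name each element by its file path
  map(read_rdata) %>%
  list_rbind(names_to = "source_file") %>%
  as_tibble()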

Overview

The data collected from the webpage covers the period from 2017-08-31 to 2023-12-01.
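This can be confirmed directly from the raw data (a one-liner, assuming the scraped date column is Datum, as used throughout below):

Show the code
# Date range covered by the scraped articles
range(news_raw$Datum)
#> [1] "2017-08-31" "2023-12-01"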

If we take a closer look at the URLs, we can see that each article has an associated identification number, which follows the id= parameter at the same position in every URL.

Show the code
# Extract the article id that follows the id= parameter
news <- news_raw %>% 
  mutate(id = str_replace(Links, ".+id=(\\d+).*", "\\1"))
news %>% select(id, Links) %>% head(3) %>% kable()
id Links
22060 https://dev2.sr-mediathek.sr-multimedia.de/index.php?seite=7&id=22060&pnr=&tbl=pf
22045 https://dev2.sr-mediathek.sr-multimedia.de/index.php?seite=7&id=22045&pnr=&tbl=pf
22030 https://dev2.sr-mediathek.sr-multimedia.de/index.php?seite=7&id=22030&pnr=&tbl=pf

By clicking on a link, we also note that many pages have already gone offline. And because we scraped the page repeatedly over time, we can observe that a few articles modified their news message after publication. However, these were only minor changes.

Show the code
# Self-join on id to compare different scraped versions of the same article
news %>%
  left_join(news, join_by(id), suffix = c("_Version_A", "_Version_B")) %>%
  filter(Themen_Version_A != Themen_Version_B) %>% 
  select(id, starts_with("Themen")) %>% 
  distinct(id, .keep_all = TRUE) %>% 
  kable()
# Keep one version per article
news_distinct <- news %>% distinct(id, .keep_all = TRUE)
id Themen_Version_A Themen_Version_B
22060 “Planer” der IS-Miliz in Afghanistan im Visier - USA fliegen Drohnenangriff / Trotz Terrorwarnung - Tausende Menschen versuchen Kabul zu verlassen und das Interview der Woche mit Jens Spahn, Bundesgesundheitsminister “Planer” der IS-Miliz in Afghanistan im Visier - USA fliegen Drohnenangriff / Trotz Terrorwarnung - Tausende Menschen versuchen Kabul zu verlassen / Interview der Woche mit Jens Spahn, Bundesgesundheitsminister (CDU)
22080 Gewagtes Schutzversprechen - Bisher nur 138 deutsche Ortskräfte ausgeflogen / Nach britischem Abzug - Kritik an Regierung Johnson / Die Preise steigen - wirklich nur vorübergehend? - Hohe Inflationsrate befürchtet Gewagtes Schutzversprechen - Bisher nur 138 deutsche Ortskräfte ausgeflogen / Nach britischem Abzug - Kritik an Regierung Johnson / Die Preise steigen - wirklich nur vorübergehend? Hohe Inflationsrate befürchtet
22586 Kommentar zum Ende der Sondierungsgespräche / Ab heute Kita-Lockerungen: Fluch und Segen / Pandora Papers: Warum funktionieren Briefkastenfirmen trotz Regulierung? / EU-Parlament verurteilt Belarus / Abholzung in Kongo Kommentar zum Ende der Sondierungsgespräche / Ab heute Kita-Lockerungen - Fluch und Segen / Pandora Papers - Warum funktionieren Briefkastenfirmen trotz Regulierung? / EU-Parlament verurteilt Belarus / Abholzung in Kongo

Let’s examine the time frame covered by the articles.

Show the code
# Articles by month
news_distinct %>%
  count(Month = floor_date(Datum, "month"),
        name = "Number of Articles") %>%
  ggplot(aes(x = Month, y = `Number of Articles`)) +
  geom_col() +
  scale_x_date(date_breaks = "6 months") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Our data shows two uncovered time periods. The first is before August 2020. Unfortunately, SR2 seems to have deleted this data, or they simply did not upload their editions consistently before that date. To avoid biasing our analysis, the 80 articles from before August 2020 are therefore dropped (listwise deletion, since these are just a few cases). Moreover, we identify a significant gap between February and October 2022. You see, behind this code there is a human, and humans aren't robots. Sometimes life throws in its own surprises, and a unique blend of personal events distracted me from continuing this analysis. But I'm back in action now! 😊

Show the code
# Listwise deletion
news_filtered <- news_distinct %>% filter(Datum >= "2020-08-01")

We then observe that, while Bilanz am Abend is published on weekdays only, Bilanz am Mittag also appears on Saturdays. Sundays remain news-free.

Show the code
# Articles by day of week
news_filtered %>%
  count(Format,
        Weekday = wday(Datum, locale = "German", label = TRUE),
        name = "Count") %>% 
  ggplot(aes(x = Weekday, y = Count, fill = Format)) +
  geom_col()

Authors

When focusing on the narrators, it is interesting to note that the SR webpage content managers do not seem to know the names of their own colleagues. Or how else can we explain the many different spellings of the same name? (A fuzzy-matching sketch after the listing shows how these variants could be consolidated.)

Show the code
# Distinct authors/narrators
news_filtered %>% distinct(Autor) %>% arrange(Autor) %>% print(n = Inf)
#> # A tibble: 37 × 1
#>    Autor                              
#>    <chr>                              
#>  1 Böffel, Janek                      
#>  2 Florian Mayer                      
#>  3 Florian mayer                      
#>  4 Folrian Mayer                      
#>  5 Frank Hofmann                      
#>  6 Gallmeyer, Kerstin                 
#>  7 Isabel Tentrup                     
#>  8 Isabell Tentrupp                   
#>  9 Isabelle Tentrup                   
#> 10 Isabelle Tentrupp                  
#> 11 Janek Böffel                       
#> 12 Jochem Marmit                      
#> 13 Jochen Marmit                      
#> 14 Karin Mayer                        
#> 15 Kathrin Aue                        
#> 16 Katrin Aue                         
#> 17 Katrin Aue, Janek Böffel           
#> 18 Katrin AueFrankreich streikt weiter
#> 19 Kerstin Gallmeyer                  
#> 20 Lisa Krauser                       
#> 21 Mayer, Florian                     
#> 22 Michael Thieser                    
#> 23 Peter Weitzmann                    
#> 24 SR 2 KulturRadio                   
#> 25 SR 2 Kulturradio                   
#> 26 Sarah Sassou                       
#> 27 Staphan Deppen                     
#> 28 Stefan Deppen                      
#> 29 Stephan Deppen                     
#> 30 Stephan Deppenh                    
#> 31 Thomas SHihabi                     
#> 32 Thomas Shihabi                     
#> 33 Thomas Shihabi et al.              
#> 34 Yvonne Scheinhege                  
#> 35 Yvonne Schleinhege                 
#> 36 Yvonne Schleinhege-Böffel          
#> 37 Îsabelle Tentrup
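As promised, here is a hypothetical cleanup sketch that folds each raw author string onto its closest match from a hand-picked list of canonical names, using plain edit distance (base R's adist()). Both the canonical list and the distance threshold are illustrative assumptions; a real cleanup would still review the matches manually.

Show the code
# Hypothetical canonical list, hand-picked for illustration
canonical_authors <- c(
  "Janek Böffel", "Florian Mayer", "Stephan Deppen", "Isabel Tentrup",
  "Katrin Aue", "Thomas Shihabi", "Yvonne Schleinhege",
  "Kerstin Gallmeyer", "Jochen Marmit"
)

normalize_author <- function(x, pool = canonical_authors, max_dist = 4) {
  d <- adist(x, pool, ignore.case = TRUE)           # edit distance to every candidate
  nearest <- pool[apply(d, 1, which.min)]           # closest canonical name per input
  ifelse(apply(d, 1, min) <= max_dist, nearest, x)  # keep original if nothing is close
}

normalize_author(c("Folrian Mayer", "Staphan Deppen", "Sarah Sassou"))
#> [1] "Florian Mayer"  "Stephan Deppen" "Sarah Sassou"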

Let’s move on to analysing the content and check which topics appear within the daily news blocks.

Keywords by number of appearances

First, we load a dictionary of German stop words so we can remove uninformative words from the text corpus.

Show the code
# Load dictionary
stop_words_german <-
  data.frame("Wort" = stopwords::stopwords("de", source = "snowball"))
# Delete stopwords
news_unnested <- news_filtered %>%
  unnest_tokens(output = "Wort", input = Themen) %>% 
  anti_join(stop_words_german, by = "Wort")

Now, let’s list the top keywords by number of appearances.

Show the code
# Keyword frequency
top_n_keywords <- news_unnested %>%
  count(Wort, name = "Count", sort = TRUE) %>% 
  rename(Word = Wort)

top_n_keywords %>% filter(Count >= 55) %>% print(n = Inf)
#> # A tibble: 33 × 2
#>    Word        Count
#>    <chr>       <int>
#>  1 corona        384
#>  2 eu            263
#>  3 lage          207
#>  4 saarland      179
#>  5 bundestag     127
#>  6 interview     126
#>  7 neue          112
#>  8 ukraine       112
#>  9 deutschland   101
#> 10 china          99
#> 11 reaktionen     98
#> 12 woche          89
#> 13 kommentar      86
#> 14 debatte        84
#> 15 israel         84
#> 16 mehr           82
#> 17 jahre          77
#> 18 frankreich     76
#> 19 afghanistan    74
#> 20 usa            73
#> 21 us             72
#> 22 wahl           71
#> 23 berlin         69
#> 24 gipfel         69
#> 25 cdu            67
#> 26 russland       65
#> 27 treffen        64
#> 28 geht           63
#> 29 aktuelle       61
#> 30 afd            58
#> 31 biden          58
#> 32 trump          58
#> 33 urteil         57

The keyword frequency is visually represented by the following word cloud.

Show the code
wordcloud(top_n_keywords$Word, top_n_keywords$Count, min.freq = 55, colors = brewer.pal(6, "Dark2"))

We see that Corona clearly dominated the news, and EU-related topics were also discussed frequently. Since SR2 is a regional broadcasting station, it is no surprise that we also observe the keyword Saarland. If we sum up all US-related keywords (Biden, Trump, USA, US), we notice that the United States is another very dominant topic.

Show the code
# Frequency of US-related keywords
usa_keywords <- c("biden", "trump", "usa", "us")
news_unnested %>% 
  mutate(Word = ifelse(Wort %in% usa_keywords,
                       "US_keywords_summary", Wort)) %>% 
  count(Word, name = "Count", sort = TRUE) %>% 
  head(3) %>% 
  kable()
Word Count
corona 384
eu 263
US_keywords_summary 261

It is also interesting to observe the distribution of these keywords within the week.

Show the code
news_unnested %>% 
  mutate(Word = ifelse(Wort %in% usa_keywords,
                       "US_keywords_summary", Wort),
         Weekday = wday(Datum, label = TRUE)) %>% 
  count(Weekday, Word, name = "Count", sort = TRUE) %>% 
  arrange(Weekday) %>% 
  pivot_wider(names_from = Weekday, values_from = Count, values_fill = 0) %>% 
  rowwise() %>% 
  mutate(`Total Count` = sum(c_across(Mon:Sat))) %>% 
  arrange(desc(`Total Count`)) %>% 
  head(10) %>% 
  kable()
Word Mon Tue Wed Thu Fri Sat Total Count
corona 84 74 77 59 72 18 384
eu 53 60 42 54 47 7 263
US_keywords_summary 31 48 51 44 49 38 261
lage 40 35 41 40 40 11 207
saarland 33 33 37 40 26 10 179
bundestag 1 8 29 38 51 0 127
interview 15 12 5 8 13 73 126
neue 30 20 16 15 19 12 112
ukraine 21 25 19 16 25 6 112
deutschland 10 22 21 23 21 4 101

Corona and Ukraine dominate the news during the week, perhaps indicating an avoidance of such pressing topics on weekends. Saturdays seem reserved for more background information, as implied by the prominence of the word “interview”. US-related topics persist from Monday to Saturday, and I wonder why a similar pattern does not hold for other countries, such as China. Political topics concerning the Bundestag (the German parliament), Saarland, or the EU take center stage on weekdays.
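To put a number on this weekday/weekend skew, here is a quick sketch that computes each keyword's Saturday share. The threshold of 50 total mentions is an arbitrary choice to filter out rare words, and keywords that never appear on a Saturday drop out of this quick version.

Show the code
# Share of each keyword's mentions that falls on a Saturday
news_unnested %>%
  mutate(Weekday = wday(Datum, label = TRUE)) %>%
  count(Wort, Saturday = Weekday == "Sat") %>%
  group_by(Wort) %>%
  filter(sum(n) >= 50) %>%              # keep reasonably frequent keywords only
  mutate(Saturday_share = n / sum(n)) %>%
  filter(Saturday) %>%                  # one row per keyword: its Saturday share
  arrange(desc(Saturday_share))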

Keywords over time

Let’s visualize the changing patterns of specific keyword appearances over time.

Show the code
# Select keywords
keywords <- c("lockdown", "corona", "afghanistan", "kabul", "ukraine", "russland", "gaza", "israel", "china", "türkei", "eu", "selenskyj")

# Selected keywords over time
news_unnested %>% 
  count(Wort, Month = floor_date(Datum, "month"), name = "Count") %>% 
  filter(Wort %in% keywords) %>% 
  # If a keyword does not appear in a given month, insert a row with an explicit 0
  complete(Month = seq(
    from = floor_date(min(news_unnested$Datum), "month"),
    to = max(news_unnested$Datum),
    by = "month"),
    Wort,
    fill = list(Count = 0)) %>% 
  ggplot(aes(x = Month, y = Count, color = Wort)) +
  geom_line(linewidth = 1) +
  # Mark the coverage gap once per facet
  annotate("label", label = "No data available", x = as_date("2022-06-15"), y = 0, size = 3, label.padding = unit(.15, "lines")) +
  facet_wrap(~ Wort, ncol = 3, scales = "free_y") +
  scale_x_date(date_breaks = "6 months", limits = c(NA_Date_, floor_date(today() - months(1), "month"))) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 75, vjust = 0.58))

It’s interesting to see how news themes develop over time. Gaza and Israel peak from October 2023 onwards, but note that Israel appears twice as often as Gaza. Ukraine, by contrast, loses importance by the day and may soon disappear from the news, even though the war is still going on. Corona and lockdown are no longer present, although at the end of each year it feels like Corona celebrates its comeback in the news.

NB: This project was a work in progress from September 2020 until December 2023. Moving forward, I won’t be posting regular updates. However, I am considering using this experience to extend the analysis to another media website, driving it towards a more professional level and maybe a broader audience. More to come…
