Scraping Game of Thrones Data

PUBLISHED ON APR 1, 2018 — DATA SCRAPING

WARNING: This post contains spoilers for Game of Thrones and some foul language reproduced from the show script

April may as well be Game of Thrones (GoT) month. For 7 loyal years, fans eagerly await April as the premiere of the new season. The final season is no different and will yet again return in April…2019 bypassing a season in 2018.

Late Season Meme

As a loyal fan I’m determined to not let my April 2018 be any different and am dedicating a few blog posts to GoT. The project will involve some analysis of GoT data available on the internet such as scripts, HBO main characters lists, and some other character lists. In this specific post, I will not cover the basics of scraping but rather provide some resources to scrape the material. I’ll be using the data obtained using this code in future blog posts and wanted to provide a way for any user to reproduce my analysis. Up until a few weeks ago, I myself didn’t know what web scraping was so if you weren’t aware web scraping it is simply extracting data from websites. Essentially, I can pull the html source code used to generate and host the website and then harvest or extract the information to be used as data. All the data scraping and analysis will be done in the R environment.

Shall we begin?

Late Season Meme

All of the code provided below uses the filepaths specific to my computer. To reproduce just be sure to change these filepaths to match that of your computer and where you’ll be saving and loading data. Use the button below to download code to scrape and clean all the data I will be using. Code provided in the download is more thorough than that used in the tutorial.

Scraping GoT Scripts

I want to eventually do some analysis on the GoT scripts. Since HBO does not provide these scripts, I’m going to rely on www.genius.com. I chose this website for two main reasons:

  1. An R package geniusr already exists with functions to easily scrape the data
  2. These scripts contain the speaker information which I’ll need later for my eventual analysis

To use the geniusr package you must have an API token. Go here to create an account or login and obtain an API token. For more information on the API creation you can look at the genius documentation here.

Once you’ve created an account and obtained an API they’ll assign you a token. Be sure to copy the token. With the token copied you can add it to the R environment with some code.

Running the following code will open your .Renviron

user_renviron = path.expand(file.path("~", ".Renviron"))
if(!file.exists(user_renviron)) # check to see if the file already exists
  file.create(user_renviron)
file.edit(user_renviron) # open with another text editor if this fails

Once this file is open paste the following into the script:

GENIUS_API_TOKEN="your-token-goes-here"

If done properly then this code will return your API token.

Sys.getenv("GENIUS_API_TOKEN")

Now you can install and load geniusr to scrape some data. You can find more information about the geniusr package through CRAN or the package maintainers GitHub.

# Install the stable version from CRAN
install.packages('geniusr')
# Install the developers version from GitHub directly
devtools::install_github('ewenme/geniusr')

Recently, I’ve had Amy Winehouse’s Back to Black album on repeat so to show a simple example I’ll use this artist and album to show a quick demo of a few geniusr functions. If you want to explore the website I will be scraping here you can start here.

Warning: Before running any of the functions below be aware that you are pinging the genius server every time you request data. If done quickly or too often it can trigger the server to set up security measures and block your IP from using the site. It is always best to ping the server once and then save the data so that you don’t have to constantly be pulling the data.

library(geniusr)
# Search for Amy Winehouse
geniusr::search_artist("Amy Winehouse")
# Use artist_id to obtain song_id
geniusr::get_artist_songs(artist_id = 1525)
# Scrape lyrics using an ID
geniusr::scrape_lyrics_id(song_id = 247618)
# Get song lyrics to Rehab using the URL directly
geniusr::scrape_lyrics_url("https://genius.com/Amy-winehouse-rehab-lyrics")

Genius was originally set up to host lyrics of songs as well as artist and album information. Now they host TV show and movie scripts as well. The GoT scripts can be accessed here. Notice, the seasons are recorded as an “album” on the site and within each album the script for an episode is characterized as a song. To scrape Game of Thrones data, I created an R package that wraps geniusr or does the scraping within the function. You can find out more about the package on my GitHub. To download the development version from my GitHub run:

devtools::install_github("avalcarcel9/GoT")
library(GoT)

Now let’s scrape the scripts and then save them so you don’t need to continuously ping the genius server. Be sure to change the directory in save() when you run this code.

# Obtain all the scripts
geniusr_scripts = scrape_GoT(base_url = "https://genius.com/albums/Game-of-thrones", season = 7)

# Visualize the data
head(geniusr_scripts)
tail(geniusr_scripts)

# Save data - Change filepath to match where you'd like to save the data
save(geniusr_scripts, file = "/Users/alval/Box/Coursework/IndStudy/Data/geniusr_scripts.RData")
## # A tibble: 6 x 6
##   song_number album_name      
##         <dbl> <chr>           
## 1           1 Season 1 Scripts
## 2           1 Season 1 Scripts
## 3           1 Season 1 Scripts
## 4           1 Season 1 Scripts
## 5           1 Season 1 Scripts
## 6           1 Season 1 Scripts
##   line                                                                     
##   <chr>                                                                    
## 1 WAYMAR ROYCE: What d’you expect? They’re savages. One lot steals a goat …
## 2 WILL: I’ve never seen wildlings do a thing like this. I’ve never seen a …
## 3 WAYMAR ROYCE: How close did you get?                                     
## 4 WILL: Close as any man would.                                            
## 5 GARED: We should head back to the wall.                                  
## 6 ROYCE: Do the dead frighten you?                                         
##   song_lyrics_url                                              
##   <chr>                                                        
## 1 https://genius.com/Game-of-thrones-winter-is-coming-annotated
## 2 https://genius.com/Game-of-thrones-winter-is-coming-annotated
## 3 https://genius.com/Game-of-thrones-winter-is-coming-annotated
## 4 https://genius.com/Game-of-thrones-winter-is-coming-annotated
## 5 https://genius.com/Game-of-thrones-winter-is-coming-annotated
## 6 https://genius.com/Game-of-thrones-winter-is-coming-annotated
##   song_name        artist_name    
##   <chr>            <chr>          
## 1 Winter is Coming Game of Thrones
## 2 Winter is Coming Game of Thrones
## 3 Winter is Coming Game of Thrones
## 4 Winter is Coming Game of Thrones
## 5 Winter is Coming Game of Thrones
## 6 Winter is Coming Game of Thrones
## # A tibble: 6 x 6
##   song_number album_name      
##         <dbl> <chr>           
## 1          10 Season 6 Scripts
## 2          10 Season 6 Scripts
## 3          10 Season 6 Scripts
## 4          10 Season 6 Scripts
## 5          10 Season 6 Scripts
## 6          10 Season 6 Scripts
##   line                                                                     
##   <chr>                                                                    
## 1 CERSEI enters, escorted by GREGOR and the Kingsguard. The room is full o…
## 2 QYBURN: I now proclaim Cersei of the House Lannister First of Her Name, …
## 3 QYBURN places the crown on CERSEI’s head. CERSEI sits in the Iron Throne…
## 4 QYBURN: Long may she reign.                                              
## 5 ALL: Long may she reign.                                                 
## 6 THEON and YARA are standing on their ship. They look out at the sea. A v…
##   song_lyrics_url                                                        
##   <chr>                                                                  
## 1 https://genius.com/Game-of-thrones-the-winds-of-winter-script-annotated
## 2 https://genius.com/Game-of-thrones-the-winds-of-winter-script-annotated
## 3 https://genius.com/Game-of-thrones-the-winds-of-winter-script-annotated
## 4 https://genius.com/Game-of-thrones-the-winds-of-winter-script-annotated
## 5 https://genius.com/Game-of-thrones-the-winds-of-winter-script-annotated
## 6 https://genius.com/Game-of-thrones-the-winds-of-winter-script-annotated
##   song_name                    artist_name    
##   <chr>                        <chr>          
## 1 The Winds of Winter (Script) Game of Thrones
## 2 The Winds of Winter (Script) Game of Thrones
## 3 The Winds of Winter (Script) Game of Thrones
## 4 The Winds of Winter (Script) Game of Thrones
## 5 The Winds of Winter (Script) Game of Thrones
## 6 The Winds of Winter (Script) Game of Thrones

Again, I scrape the website once and save the results as an R object so I do not get blocked for scraping too much from genius. After I’ve saved the file then whenever I want to start working with the data I simply load my scraped file rather than re-scrape from the internet.

We now have every episode script with season indicators!

The GoT package also comes with a simple function to clean the www.genius.com scripts. It extracts the speaker information, tries to clean the speaking character name, adds actions in a column if indicated, and removes some of the narration.

clean_data = GoT::cleanGoT(geniusr_scripts)
head(clean_data)
## # A tibble: 6 x 6
##   episode season
##     <dbl>  <dbl>
## 1       1      1
## 2       1      1
## 3       1      1
## 4       1      1
## 5       1      1
## 6       1      1
##   line                                                                     
##   <chr>                                                                    
## 1 What d’you expect? They’re savages. One lot steals a goat from another l…
## 2 I’ve never seen wildlings do a thing like this. I’ve never seen a thing …
## 3 How close did you get?                                                   
## 4 Close as any man would.                                                  
## 5 We should head back to the wall.                                         
## 6 Do the dead frighten you?                                                
##   speaker      action
##   <chr>        <chr> 
## 1 Waymar Royce ""    
## 2 Will         ""    
## 3 Waymar Royce ""    
## 4 Will         ""    
## 5 Gared        ""    
## 6 Royce        ""    
##   orig_line                                                                
##   <chr>                                                                    
## 1 WAYMAR ROYCE: What d’you expect? They’re savages. One lot steals a goat …
## 2 WILL: I’ve never seen wildlings do a thing like this. I’ve never seen a …
## 3 WAYMAR ROYCE: How close did you get?                                     
## 4 WILL: Close as any man would.                                            
## 5 GARED: We should head back to the wall.                                  
## 6 ROYCE: Do the dead frighten you?

Scrape Episode Lengths

To add temporal information to the season and episode, I need each episode length. This scraping process will be very similar to the scripts above. We will use an R package omdbapi which is borrowing information from IMDB. Similar to geniusr you need to create an account at http://www.omdbapi.com/. After creating an account you can obtain and API token. Follow the steps in Scraping GoT Scripts section and put the API token in your .Renviron as OMDB_API_KEY="your_key_goes_here". Be sure to activate this key or the omdbapi package won’t work.

# Install omdbapi
devtools::install_github("hrbrmstr/omdbapi")

The package developer has a lot of documentation on their GitHub https://github.com/hrbrmstr/omdbapi if you’d like to explore the package more. Conveniently, they have an example that obtains specific Game of Thrones data.

library(omdbapi)
library(dplyr)

# Example from github
imdb_data = omdbapi::find_by_title("Game of Thrones", type="series", season=1, episode=1)

# Remove first row to create an empty tibble
imdb_data = imdb_data %>% 
  dplyr::slice(-1)

# Get each episode
# Seaon 1
for(i in 1:10){
  imdb_data = dplyr::bind_rows(imdb_data, omdbapi::find_by_title("Game of Thrones", type="series", season=1, episode=i))
  Sys.sleep(90)
}
Sys.sleep(120)
# Season 2
for(i in 1:10){
  imdb_data = dplyr::bind_rows(imdb_data, omdbapi::find_by_title("Game of Thrones", type="series", season=2, episode=i))
  Sys.sleep(90)
}
Sys.sleep(120)
# Season 3
for(i in 1:10){
  imdb_data = dplyr::bind_rows(imdb_data, omdbapi::find_by_title("Game of Thrones", type="series", season=3, episode=i))
  Sys.sleep(90)
}
Sys.sleep(120)
# Season 4
for(i in 1:10){
  imdb_data = dplyr::bind_rows(imdb_data, omdbapi::find_by_title("Game of Thrones", type="series", season=4, episode=i))
  Sys.sleep(90)
}
Sys.sleep(120)
# Season 5
for(i in 1:10){
  imdb_data = dplyr::bind_rows(imdb_data, omdbapi::find_by_title("Game of Thrones", type="series", season=5, episode=i))
  Sys.sleep(90)
}
Sys.sleep(120)
# Season 6
for(i in 1:10){
  imdb_data = dplyr::bind_rows(imdb_data, omdbapi::find_by_title("Game of Thrones", type="series", season=6, episode=i))
  Sys.sleep(90)
}
Sys.sleep(120)
# Season 7
for(i in 1:7){
  imdb_data = dplyr::bind_rows(imdb_data, omdbapi::find_by_title("Game of Thrones", type="series", season=7, episode=i))
  Sys.sleep(90)
}

save(imdb_data, file = 'Users/alval/Box/Coursework/IndStudy/Data/imdb_data.RData')

Normally, I would create a nested for loop but the omdbapi package gives errors when data is scraped too quickly so instead I just created individual for loops for each season and waited a minute before running each one.

This returns a lot of data from IMDB. While I only need the episode length (runtime) I’m not going to remove any column as it may be interesting to use some of this data later.

Scrape Character Names

To scrape the character names I used “http://awoiaf.westeros.org/index.php/List_of_characters”. This website contains all characters in GoT (book and show). The character list will be used as a reference for data cleaning which is to come in future blog posts.

I created a scraping function to pull the characters from the website and clean the data into a tibble. I’ll save the character vector so scraping only has to be done once. This list includes over 2000 characters. Many of these characters on this website contain the same first names. This is going to make data cleaning later more difficult. Due to this, we will additionally scrape HBO’s main character list. I created two functions in the GoT package to help us do this GoT::scrape_characters and GoT::scrape_HBO_characters which both scrape each respective website for the character list.

Now we can apply these functions to scrape and save the character data.

# Scrape wiki characters and save
url = "http://awoiaf.westeros.org/index.php/List_of_characters"
wiki_characters = GoT::scrape_characters(url)

# Change filepath to match where you'd like to save the data
save(wiki_characters, file = "/Users/alval/Box/Coursework/IndStudy/Data/wiki_characters.RData")

head(wiki_characters)

# Scrape HBO characters and save
url = 'https://www.hbo.com/game-of-thrones/cast-and-crew'
hbo_characters = GoT::scrape_HBO_characters(url)

# Change filepath to match where you'd like to save the data
save(hbo_characters, file = "/Users/alval/Box/Coursework/IndStudy/Data/hbo_characters.RData")

head(hbo_characters)
##                                                                                                      raw
## 1                                         Abelar Hightower, challenger at the tourney at Ashford Meadow.
## 2                                                                                       Addam, a knight.
## 3                             Addam Frey, a knight at the Whitewalls tourney, cousin to Lady Butterwell.
## 4                          Addam Marbrand, a knight in the service of House Lannister. Heir to Ashemark.
## 5 Addam Osgrey, son of Eustace Osgrey, slain on the Redgrass Field during the First Blackfyre Rebellion.
## 6                                         Addam Velaryon, a dragonrider during the Dance of the Dragons.
##              names
## 1 Abelar Hightower
## 2            Addam
## 3       Addam Frey
## 4   Addam Marbrand
## 5     Addam Osgrey
## 6   Addam Velaryon
##                                                      description
## 1                   challenger at the tourney at Ashford Meadow.
## 2                                                      a knight.
## 3                             a knight at the Whitewalls tourney
## 4  a knight in the service of House Lannister. Heir to Ashemark.
## 5                                          son of Eustace Osgrey
## 6                 a dragonrider during the Dance of the Dragons.
##         scraped_name first_name last_name nickname             joined_name
## 1 Eddard ?Ned? Stark     Eddard     Stark      Ned Eddard Stark|Ned|Eddard
## 2   Robert Baratheon     Robert Baratheon          Robert Baratheon|Robert
## 3   Tyrion Lannister     Tyrion Lannister          Tyrion Lannister|Tyrion
## 4   Cersei Lannister     Cersei Lannister          Cersei Lannister|Cersei
## 5      Catelyn Stark    Catelyn     Stark            Catelyn Stark|Catelyn
## 6    Jaime Lannister      Jaime Lannister            Jaime Lannister|Jaime
##          full_name
## 1     Eddard Stark
## 2 Robert Baratheon
## 3 Tyrion Lannister
## 4 Cersei Lannister
## 5    Catelyn Stark
## 6  Jaime Lannister

The wiki_characters contains a lot of information about character names, nicknames, and even description and will be useful later as a reference. The hbo_characters object contains information only pertaining to the main characters on the show. I’ll be using this information later to clean the script data.

Scrape Character Death Timeline

To scrape the character deaths and at what time in a particular season and episode they die I use the website “https://deathtimeline.com/”. This website contains dynamic content that allows the user to interact with and needs some extra TLC to scrape.

From this, I obtain who is killed, how they were killed, the time they were killed, season, episode, and specifically how they were killed. The specific kill is to describe who killed that person. There is some bias to this data. Mostly the bias is that it is taken from https://deathtimeline.com/ and relies on their information. From that data I can only collect that they deemed as major and minor. Again, we will use a function in the GoT package.

death_times = GoT::get_death_times()

head(death_times)
## # A tibble: 6 x 6
##   who          how                    times    season episode specific_how
##   <chr>        <chr>                  <chr>     <dbl>   <dbl> <chr>       
## 1 Waymar Royce killed by white walker 00:05:52      1       1 White Walker
## 2 Gared        killed by white walker 00:06:58      1       1 White Walker
## 3 Will         executed by ned stark  00:13:44      1       1 Ned Stark   
## 4 Jon Arryn    poisoned by lysa       00:18:34      1       1 Lysa        
## 5 Assassin     killed by bran's wolf  00:31:27      1       2 Bran's Wolf 
## 6 Mycah        killed by the hound    00:04:37      1       2 The Hound
save(death_times, file = "/Users/alval/Box/Coursework/IndStudy/Data/death_times.RData")
## # A tibble: 6 x 6
##   who          how                    times    season episode specific_how
##   <chr>        <chr>                  <chr>     <dbl>   <dbl> <chr>       
## 1 Waymar Royce killed by white walker 00:05:52      1       1 White Walker
## 2 Gared        killed by white walker 00:06:58      1       1 White Walker
## 3 Will         executed by ned stark  00:13:44      1       1 Ned Stark   
## 4 Jon Arryn    poisoned by lysa       00:18:34      1       1 Lysa        
## 5 Assassin     killed by bran's wolf  00:31:27      1       2 Bran's Wolf 
## 6 Mycah        killed by the hound    00:04:37      1       2 The Hound

Notice that season 7 data is not available using this source. Instead, I will scrape this data from http://time.com/3924852/every-game-of-thrones-death/. I am going to scrape the full data just as a reference.

time.com_death_data = GoT::get_time.com_deaths(url = 'http://time.com/3924852/every-game-of-thrones-death/')
head(time.com_death_data)

save(time.com_death_data, file = "/Users/alval/Box/Coursework/IndStudy/Data/time.com_death_data.RData")
## # A tibble: 6 x 6
##   who               season episode times
##   <chr>              <dbl>   <dbl> <lgl>
## 1 Will                   1       1 NA   
## 2 Jon Arryn              1       1 NA   
## 3 Jory Cassel            1       5 NA   
## 4 Viserys Targaryen      1       6 NA   
## 5 Benjen Stark           1       7 NA   
## 6 Robert Baratheon       1       7 NA   
##   how                                                                      
##   <chr>                                                                    
## 1 Beheaded for desertion by Ned Stark                                      
## 2 Poisoned by Lysa Arryn and Littlefinger                                  
## 3 Stabbed by Jaime Lannister through the eye                               
## 4 Khal Drogo pours molten gold on his head at Daenerys’ command            
## 5 Killed by White Walkers                                                  
## 6 Mortally wounded by a wild boar after drinking wine given to him by Lanc…
##   specific_how                                                             
##   <chr>                                                                    
## 1 Beheaded for desertion by Ned Stark                                      
## 2 Poisoned by Lysa Arryn and Littlefinger                                  
## 3 Stabbed by Jaime Lannister through the eye                               
## 4 Khal Drogo pours molten gold on his head at Daenerys’ command            
## 5 Killed by White Walkers                                                  
## 6 Mortally wounded by a wild boar after drinking wine given to him by Lanc…

Biases

There are some biases introduced that are immediatly obvious. First, we scrape the GoT scripts from www.genius.com and not every episode was annotated. Therefore, there are a few episodes that we will be missing in our analysis later. Most of season 2 and 3 were not available when I wrote this post. I’ll continually check if someone annotates and update this blog post and future analysis if they appear. Second, the content on this site is supplied by users and is not verified by HBO. The content does contain some typo’s and may not exactly match the true script. Third, the character vector scraped from “http://awoiaf.westeros.org/index.php/List_of_characters” is similarly a fan generated page and is susceptible to inaccurate information.

Future Work

In my next blog post, I’ll be covering how to clean this data using dplyr and tidyr tools. While geniusr and my functions to scrape character data returned dataframes that are relatively clean, before I analyze I will need to extract more information and format the data more specifically.