Project 1
- Let’s Get Started: OMDb API Key
- Build URL for One Movie Title
- Build URL for One Movie Title and One Date
- Build URL for One IMDb ID
- Build URL to Search for Movies in a Series
- Build URL to Search for One or More Titles or Series
- Get the Data for One Series
- Get the Data for One or More Series
- Let’s make a data set!
- Exploratory Data Analysis
- That’s All Folks!
Let’s Get Started: OMDb API Key
In order to access the OMDb API, you need to get a free API key. In the rest of this document, “mykey” refers to your OMDb API key.
You should also “turn on” these packages by running the code below. If you don’t have them installed yet, run install.packages() with the package name in quotes. For example, to install tidyverse, you would run install.packages("tidyverse").
library(httr) #this package will help us use the URL we built to get information from the OMDb API
library(jsonlite) #this package will help us convert the data we get from the OMDb API to a more usable format
library(tidyverse) #this package will help us work with our nicely formatted data.
library(lubridate) #this package will help us create dates
library(ggplot2) #this package will help us make graphs
In order to get information from the OMDb API, we have to build a URL with our search criteria. It’s similar to doing a Google search. There are two ways to build a URL: “By ID or Title” or “By Search”.
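As a rough sketch of what “building a URL” means here, the helper below assembles a request URL from named parameters. The name build_omdb_url is hypothetical and is not used elsewhere in this document. It relies only on base R’s utils::URLencode() to percent-encode spaces and punctuation, and the parameter names t, s, and type are the same ones used by the functions later in this document.

```r
# Hypothetical helper for illustration: builds an OMDb request URL
# from an API key and a named list of query parameters.
build_omdb_url <- function(mykey, params) {
  # Percent-encode each value so spaces and punctuation are safe in a URL
  encoded <- vapply(
    params,
    function(value) utils::URLencode(as.character(value), reserved = TRUE),
    character(1)
  )
  query <- paste0(names(params), "=", encoded, collapse = "&")
  paste0("http://www.omdbapi.com/?apikey=", mykey, "&", query)
}

# "By ID or Title" uses t= (or i=); "By Search" uses s=
by_title_url  <- build_omdb_url("mykey", list(t = "star wars", type = "movie"))
by_search_url <- build_omdb_url("mykey", list(s = "star wars", type = "movie"))
print(by_title_url)
# [1] "http://www.omdbapi.com/?apikey=mykey&t=star%20wars&type=movie"
```

The only difference between the two ways of building a URL is the first parameter: t (or i) asks for one movie, while s asks for a list of matching titles.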
Build URL for One Movie Title
Let’s say you have a movie title in mind, like Star Wars (1977). Here’s a function you can use to get data from the OMDb API about Star Wars (1977):
Note: the parameter “type” has three options: movie, series, and episode. If “type” is not specified, it will give everything (including movies, series, and episodes). I’m making the default for “type” be “movie”, but you can change this if you want.
search_by_title <- function(mykey,title,type="movie"){
#build URL:
base_url <- paste0("http://www.omdbapi.com/?apikey=",mykey)
info_url <- paste0("&t=",title,"&type=",type)
full_url <- paste0(base_url, info_url)
full_url <- gsub(full_url, pattern = " ", replacement = "%20")
#use URL to get data from the OMDb API:
movie_api_call <- GET(full_url)
movie_api_call_char <- rawToChar(movie_api_call$content)
movie_JSON <- jsonlite::fromJSON(movie_api_call_char, flatten = TRUE)
movie_JSON <- as.data.frame(movie_JSON)
tibble_movie_JSON <- as_tibble(movie_JSON)
return(tibble_movie_JSON)
}
You should run the function like this:
search_by_title("mykey","star_wars",type="movie")
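You should get a tibble that looks like this. There is one row per rating source (Internet Movie Database, Rotten Tomatoes, and Metacritic), so the movie’s other details are repeated on each row: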
## # A tibble: 3 × 26
## Title Year Rated Released Runtime Genre Director Writer Actors Plot Language Country Awards Poster Ratings.Source
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Star Wars 1977 PG 25 May 1… 121 min Acti… George … Georg… Mark … Luke… English United… Won 6… https… Internet Movi…
## 2 Star Wars 1977 PG 25 May 1… 121 min Acti… George … Georg… Mark … Luke… English United… Won 6… https… Rotten Tomato…
## 3 Star Wars 1977 PG 25 May 1… 121 min Acti… George … Georg… Mark … Luke… English United… Won 6… https… Metacritic
## # … with 11 more variables: Ratings.Value <chr>, Metascore <chr>, imdbRating <chr>, imdbVotes <chr>, imdbID <chr>,
## # Type <chr>, DVD <chr>, BoxOffice <chr>, Production <chr>, Website <chr>, Response <chr>
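The three rows come from the way the nested ratings table is flattened: as.data.frame() recycles the single-valued fields to match the number of rating sources. The sketch below reproduces the effect in base R with a hand-written list standing in for the parsed JSON. The title, sources, and rating values are placeholders, not real OMDb output.

```r
# Hand-written stand-in for a parsed OMDb response (placeholder values)
parsed <- list(
  Title = "Example Movie",
  Year = "2000",
  Ratings = data.frame(
    Source = c("Source A", "Source B", "Source C"),
    Value = c("7.0/10", "70%", "70/100"),
    stringsAsFactors = FALSE
  )
)

# as.data.frame() recycles the single Title and Year to match the
# three rows of the nested Ratings table
flat <- as.data.frame(parsed, stringsAsFactors = FALSE)
print(names(flat))  # "Title" "Year" "Ratings.Source" "Ratings.Value"
print(nrow(flat))   # 3
```

This is worth keeping in mind later: counting rows is not the same as counting movies.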
Build URL for One Movie Title and One Date
If you don’t specify a year, the OMDb API returns only its first match for the title, which for this search is Star Wars (1977). But what if you wanted a different Star Wars movie, like Star Wars: Episode VII - The Force Awakens (2015)? You can use this function, where date is the movie’s release year:
search_by_title_and_date <- function(mykey,title,type="movie",date){
#build URL:
base_url <- paste0("http://www.omdbapi.com/?apikey=",mykey)
info_url <- paste0("&t=",title,"&type=",type,"&y=",date)
full_url <- paste0(base_url, info_url)
full_url <- gsub(full_url, pattern = " ", replacement = "%20")
#use URL to get data from the OMDb API:
movie_api_call <- GET(full_url)
movie_api_call_char <- rawToChar(movie_api_call$content)
movie_JSON <- jsonlite::fromJSON(movie_api_call_char, flatten = TRUE)
movie_JSON <- as.data.frame(movie_JSON)
tibble_movie_JSON <- as_tibble(movie_JSON)
return(tibble_movie_JSON)
}
You should run the function like this:
search_by_title_and_date("mykey","star_wars",type="movie",date=2015)
You should get a tibble that looks like this:
## # A tibble: 3 × 26
## Title Year Rated Released Runtime Genre Director Writer Actors Plot Language Country Awards Poster Ratings.Source
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Star Wars… 2015 PG-13 18 Dec … 138 min Acti… J.J. Ab… Lawre… Daisy… As a… English United… Nomin… https… Internet Movi…
## 2 Star Wars… 2015 PG-13 18 Dec … 138 min Acti… J.J. Ab… Lawre… Daisy… As a… English United… Nomin… https… Rotten Tomato…
## 3 Star Wars… 2015 PG-13 18 Dec … 138 min Acti… J.J. Ab… Lawre… Daisy… As a… English United… Nomin… https… Metacritic
## # … with 11 more variables: Ratings.Value <chr>, Metascore <chr>, imdbRating <chr>, imdbVotes <chr>, imdbID <chr>,
## # Type <chr>, DVD <chr>, BoxOffice <chr>, Production <chr>, Website <chr>, Response <chr>
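If no movie matches the title and year, OMDb answers with a short JSON object whose Response field is "False" and whose Error field holds a message, instead of the columns shown above. Here is a minimal sketch of a check for that case. The helper name check_omdb_response is hypothetical, and the list below is a hand-written stand-in for a parsed response, not real API output.

```r
# Hypothetical helper: stop with a readable message when OMDb reports a failure
check_omdb_response <- function(parsed) {
  if (identical(parsed$Response, "False")) {
    stop("OMDb request failed: ", parsed$Error, call. = FALSE)
  }
  parsed
}

# Hand-written stand-in for a failed lookup (not real API output)
failed_lookup <- list(Response = "False", Error = "Movie not found!")
message_text <- tryCatch(
  check_omdb_response(failed_lookup),
  error = function(e) conditionMessage(e)
)
print(message_text)
# [1] "OMDb request failed: Movie not found!"
```

In the functions above, a check like this could run on the result of fromJSON() before it is converted with as.data.frame().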
Build URL for One IMDb ID
Let’s say you have a valid IMDb ID. You can find an IMDb ID by searching for a title on the IMDb website. After you find a movie you like, the IMDb ID will be in the URL. For example, the URL for the IMDb page for Star Wars: Episode V - The Empire Strikes Back (1980) is https://www.imdb.com/title/tt0080684/?ref_=nv_sr_srsg_0. Therefore, its IMDb ID is tt0080684.
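If you would rather not copy the ID by hand, a regular expression can pull it out of the URL. This is only a sketch: extract_imdb_id is a hypothetical helper, and it assumes the ID is the first “tt” followed by digits in the URL.

```r
# Pull the first "tt" followed by digits out of an IMDb URL
extract_imdb_id <- function(url) {
  regmatches(url, regexpr("tt[0-9]+", url))
}

imdb_url <- "https://www.imdb.com/title/tt0080684/?ref_=nv_sr_srsg_0"
print(extract_imdb_id(imdb_url))
# [1] "tt0080684"
```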
Here’s a function you can use if you have a valid IMDb ID:
search_by_IMDb_ID <- function(mykey,IMDb_ID,type="movie"){
#build URL:
base_url <- paste0("http://www.omdbapi.com/?apikey=",mykey)
info_url <- paste0("&i=",IMDb_ID,"&type=",type)
full_url <- paste0(base_url, info_url)
full_url <- gsub(full_url, pattern = " ", replacement = "%20")
#use URL to get data from the OMDb API:
movie_api_call <- GET(full_url)
movie_api_call_char <- rawToChar(movie_api_call$content)
movie_JSON <- jsonlite::fromJSON(movie_api_call_char, flatten = TRUE)
movie_JSON <- as.data.frame(movie_JSON)
tibble_movie_JSON <- as_tibble(movie_JSON)
return(tibble_movie_JSON)
}
You should run the function like this:
search_by_IMDb_ID("mykey","tt0080684",type="movie")
You should get a tibble that looks like this:
## # A tibble: 3 × 26
## Title Year Rated Released Runtime Genre Director Writer Actors Plot Language Country Awards Poster Ratings.Source
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Star Wars… 1980 PG 20 Jun … 124 min Acti… Irvin K… Leigh… Mark … Afte… English United… Won 1… https… Internet Movi…
## 2 Star Wars… 1980 PG 20 Jun … 124 min Acti… Irvin K… Leigh… Mark … Afte… English United… Won 1… https… Rotten Tomato…
## 3 Star Wars… 1980 PG 20 Jun … 124 min Acti… Irvin K… Leigh… Mark … Afte… English United… Won 1… https… Metacritic
## # … with 11 more variables: Ratings.Value <chr>, Metascore <chr>, imdbRating <chr>, imdbVotes <chr>, imdbID <chr>,
## # Type <chr>, DVD <chr>, BoxOffice <chr>, Production <chr>, Website <chr>, Response <chr>
Build URL to Search for Movies in a Series
Let’s say you wanted to get all of the titles for all of the Star Wars movies. You would then need to build your URL “By Search” instead. Here’s a function you can use if you wanted to search for multiple movie titles:
by_search_series <- function(mykey,title,type="movie"){
#build URL:
base_url <- paste0("http://www.omdbapi.com/?apikey=",mykey)
info_url <- paste0("&s=",title,"&type=",type)
full_url <- paste0(base_url, info_url)
full_url <- gsub(full_url, pattern = " ", replacement = "%20")
#use URL to get a data frame with a list of titles from the OMDb API:
movie_api_call <- GET(full_url)
movie_api_call_char <- rawToChar(movie_api_call$content)
movie_JSON <- jsonlite::fromJSON(movie_api_call_char, flatten = TRUE)
movie_JSON <- as.data.frame(movie_JSON)
tibble_movie_JSON <- as_tibble(movie_JSON)
return(tibble_movie_JSON)
}
You should run the function like this:
by_search_series("mykey","star_wars",type="movie")
You should get a tibble that looks like this:
## # A tibble: 10 × 7
## Search.Title Search.Year Search.imdbID Search.Type Search.Poster totalResults Response
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Star Wars 1977 tt0076759 movie https://m.me… 540 True
## 2 Star Wars: Episode V - The Empire Strikes Ba… 1980 tt0080684 movie https://m.me… 540 True
## 3 Star Wars: Episode VI - Return of the Jedi 1983 tt0086190 movie https://m.me… 540 True
## 4 Star Wars: Episode VII - The Force Awakens 2015 tt2488496 movie https://m.me… 540 True
## 5 Star Wars: Episode I - The Phantom Menace 1999 tt0120915 movie https://m.me… 540 True
## 6 Star Wars: Episode III - Revenge of the Sith 2005 tt0121766 movie https://m.me… 540 True
## 7 Star Wars: Episode II - Attack of the Clones 2002 tt0121765 movie https://m.me… 540 True
## 8 Star Wars: Episode VIII - The Last Jedi 2017 tt2527336 movie https://m.me… 540 True
## 9 Rogue One: A Star Wars Story 2016 tt3748528 movie https://m.me… 540 True
## 10 Star Wars: Episode IX - The Rise of Skywalker 2019 tt2527338 movie https://m.me… 540 True
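Note that totalResults is 540 but only 10 rows came back: a search request returns one page of results at a time. The OMDb documentation lists a page parameter for requesting later pages. The sketch below only builds the page URLs and makes no requests. The helper search_page_urls is hypothetical, and the page size of 10 is an assumption read off the output above.

```r
# Hypothetical helper: build one search URL per page of results
search_page_urls <- function(mykey, title, total_results, type = "movie", page_size = 10) {
  # totalResults arrives as a character string, so convert it first
  n_pages <- ceiling(as.numeric(total_results) / page_size)
  base_url <- paste0("http://www.omdbapi.com/?apikey=", mykey)
  info_url <- paste0("&s=", gsub(" ", "%20", title), "&type=", type)
  paste0(base_url, info_url, "&page=", seq_len(n_pages))
}

urls <- search_page_urls("mykey", "star wars", total_results = "540")
print(length(urls))  # 54
print(urls[1])
# [1] "http://www.omdbapi.com/?apikey=mykey&s=star%20wars&type=movie&page=1"
```

Each URL could then be passed to GET() in the same way as in by_search_series().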
Build URL to Search for One or More Titles or Series
Now, what if you want to get the search results for all of the Star Wars movies and all of the Indiana Jones movies? The function below can handle either one title or a vector of several titles.
by_search_one_or_more_titles <- function(mykey,title,type="movie"){
#start with an empty result inside the function so it does not depend on a global variable:
mat <- NULL
#if you only give one title, this part will run:
if(length(title)<=1){
#build URL:
base_url <- paste0("http://www.omdbapi.com/?apikey=",mykey)
info_url <- paste0("&s=",title,"&type=",type)
full_url <- paste0(base_url, info_url)
full_url <- gsub(full_url, pattern = " ", replacement = "%20")
#use URL to get a data frame with a list of titles from the OMDb API:
movie_api_call <- GET(full_url)
movie_api_call_char <- rawToChar(movie_api_call$content)
movie_JSON <- jsonlite::fromJSON(movie_api_call_char, flatten = TRUE)
movie_JSON <- as.data.frame(movie_JSON)
movie_JSON <- as_tibble(movie_JSON)
return(movie_JSON)
}
if(length(title)>1){
#if you give more than one title, this part will run:
for(i in title){
#build URL:
base_url <- paste0("http://www.omdbapi.com/?apikey=",mykey)
info_url <- paste0("&s=",i,"&type=",type)
full_url <- paste0(base_url, info_url)
full_url <- gsub(full_url, pattern = " ", replacement = "%20")
#use URL to get a data frame with a list of titles from the OMDb API:
movie_api_call <- GET(full_url)
movie_api_call_char <- rawToChar(movie_api_call$content)
movie_JSON <- jsonlite::fromJSON(movie_api_call_char, flatten = TRUE)
movie_JSON <- as.data.frame(movie_JSON)
mat=rbind(mat,movie_JSON)
mat=as_tibble(mat)
}
}
return(mat)
}
You should run the function like this:
by_search_one_or_more_titles("mykey",c("star_wars","indiana_jones"),type="movie")
You would get the tibble below:
## # A tibble: 20 × 7
## Search.Title Search.Year Search.imdbID Search.Type Search.Poster totalResults Response
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Star Wars 1977 tt0076759 movie https://m.me… 540 True
## 2 Star Wars: Episode V - The Empire Strikes Ba… 1980 tt0080684 movie https://m.me… 540 True
## 3 Star Wars: Episode VI - Return of the Jedi 1983 tt0086190 movie https://m.me… 540 True
## 4 Star Wars: Episode VII - The Force Awakens 2015 tt2488496 movie https://m.me… 540 True
## 5 Star Wars: Episode I - The Phantom Menace 1999 tt0120915 movie https://m.me… 540 True
## 6 Star Wars: Episode III - Revenge of the Sith 2005 tt0121766 movie https://m.me… 540 True
## 7 Star Wars: Episode II - Attack of the Clones 2002 tt0121765 movie https://m.me… 540 True
## 8 Star Wars: Episode VIII - The Last Jedi 2017 tt2527336 movie https://m.me… 540 True
## 9 Rogue One: A Star Wars Story 2016 tt3748528 movie https://m.me… 540 True
## 10 Star Wars: Episode IX - The Rise of Skywalker 2019 tt2527338 movie https://m.me… 540 True
## 11 Indiana Jones and the Raiders of the Lost Ark 1981 tt0082971 movie https://m.me… 83 True
## 12 Indiana Jones and the Last Crusade 1989 tt0097576 movie https://m.me… 83 True
## 13 Indiana Jones and the Temple of Doom 1984 tt0087469 movie https://m.me… 83 True
## 14 Indiana Jones and the Kingdom of the Crystal… 2008 tt0367882 movie https://m.me… 83 True
## 15 Indiana Jones and the Temple of the Forbidde… 1995 tt0764648 movie https://m.me… 83 True
## 16 The Adventures of Young Indiana Jones: Treas… 1995 tt0115031 movie https://m.me… 83 True
## 17 The Adventures of Young Indiana Jones: Trave… 1996 tt0154003 movie https://m.me… 83 True
## 18 The Adventures of Young Indiana Jones: Attac… 1995 tt0154004 movie https://m.me… 83 True
## 19 Mr. Plinkett's Indiana Jones and the Kingdom… 2011 tt6330122 movie https://m.me… 83 True
## 20 The Adventures of Young Indiana Jones: Holly… 1994 tt0111806 movie https://m.me… 83 True
If you wanted to search for one title or series, like Indiana Jones, you would run the function like this:
by_search_one_or_more_titles("mykey","indiana_jones",type="movie")
You would get this tibble:
## # A tibble: 10 × 7
## Search.Title Search.Year Search.imdbID Search.Type Search.Poster totalResults Response
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Indiana Jones and the Raiders of the Lost Ark 1981 tt0082971 movie https://m.me… 83 True
## 2 Indiana Jones and the Last Crusade 1989 tt0097576 movie https://m.me… 83 True
## 3 Indiana Jones and the Temple of Doom 1984 tt0087469 movie https://m.me… 83 True
## 4 Indiana Jones and the Kingdom of the Crystal… 2008 tt0367882 movie https://m.me… 83 True
## 5 Indiana Jones and the Temple of the Forbidde… 1995 tt0764648 movie https://m.me… 83 True
## 6 The Adventures of Young Indiana Jones: Treas… 1995 tt0115031 movie https://m.me… 83 True
## 7 The Adventures of Young Indiana Jones: Trave… 1996 tt0154003 movie https://m.me… 83 True
## 8 The Adventures of Young Indiana Jones: Attac… 1995 tt0154004 movie https://m.me… 83 True
## 9 Mr. Plinkett's Indiana Jones and the Kingdom… 2011 tt6330122 movie https://m.me… 83 True
## 10 The Adventures of Young Indiana Jones: Holly… 1994 tt0111806 movie https://m.me… 83 True
Get the Data for One Series
That’s great! Now, let’s get the data for all of the Star Wars movies:
get_data_series <- function(mykey,title){
#start with an empty result inside the function so it does not depend on a global variable:
mat <- NULL
#this part gets the titles in the series given:
temp_table <- by_search_series(mykey,title,type="movie")
list_of_titles <- unique(temp_table$Search.Title)
#this part cycles through each title and gets the data:
for(movie_title in list_of_titles){
table <- search_by_title(mykey,movie_title,type="movie")
mat=rbind(mat,table)
}
return(mat)
}
You should run the function like this:
get_data_series("mykey","star_wars")
You should get a tibble that looks like this:
## # A tibble: 30 × 26
## Title Year Rated Released Runtime Genre Director Writer Actors Plot Language Country Awards Poster Ratings.Source
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Star Wars 1977 PG 25 May … 121 min Acti… George … Georg… Mark … Luke… English United… Won 6… https… Internet Movi…
## 2 Star Wars 1977 PG 25 May … 121 min Acti… George … Georg… Mark … Luke… English United… Won 6… https… Rotten Tomato…
## 3 Star Wars 1977 PG 25 May … 121 min Acti… George … Georg… Mark … Luke… English United… Won 6… https… Metacritic
## 4 Star War… 1980 PG 20 Jun … 124 min Acti… Irvin K… Leigh… Mark … Afte… English United… Won 1… https… Internet Movi…
## 5 Star War… 1980 PG 20 Jun … 124 min Acti… Irvin K… Leigh… Mark … Afte… English United… Won 1… https… Rotten Tomato…
## 6 Star War… 1980 PG 20 Jun … 124 min Acti… Irvin K… Leigh… Mark … Afte… English United… Won 1… https… Metacritic
## 7 Star War… 1983 PG 25 May … 131 min Acti… Richard… Lawre… Mark … Afte… English United… Nomin… https… Internet Movi…
## 8 Star War… 1983 PG 25 May … 131 min Acti… Richard… Lawre… Mark … Afte… English United… Nomin… https… Rotten Tomato…
## 9 Star War… 1983 PG 25 May … 131 min Acti… Richard… Lawre… Mark … Afte… English United… Nomin… https… Metacritic
## 10 Star War… 2015 PG-13 18 Dec … 138 min Acti… J.J. Ab… Lawre… Daisy… As a… English United… Nomin… https… Internet Movi…
## # … with 20 more rows, and 11 more variables: Ratings.Value <chr>, Metascore <chr>, imdbRating <chr>, imdbVotes <chr>,
## # imdbID <chr>, Type <chr>, DVD <chr>, BoxOffice <chr>, Production <chr>, Website <chr>, Response <chr>
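One caveat with this approach: looking each movie up again by title returns the first match for that title, which may not be the same film that appeared in the search results when several films share a name. Since the search results already include Search.imdbID, a variant could look up each ID instead. Here is a minimal sketch. The helper get_data_by_ids is hypothetical, and fake_fetch stands in for a call such as search_by_IMDb_ID(mykey, id) so that the example runs without an API key.

```r
# Hypothetical helper: fetch details once per unique IMDb ID
get_data_by_ids <- function(search_results, fetch_by_id) {
  ids <- unique(search_results$Search.imdbID)
  do.call(rbind, lapply(ids, fetch_by_id))
}

# Stand-ins so the sketch runs offline (IDs taken from the search output above)
search_results <- data.frame(
  Search.Title = c("Star Wars", "Star Wars: Episode V - The Empire Strikes Back"),
  Search.imdbID = c("tt0076759", "tt0080684"),
  stringsAsFactors = FALSE
)
fake_fetch <- function(id) data.frame(imdbID = id, stringsAsFactors = FALSE)

result <- get_data_by_ids(search_results, fake_fetch)
print(result$imdbID)
# [1] "tt0076759" "tt0080684"
```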
Get the Data for One or More Series
Now, let’s get all of the data for both Star Wars and Indiana Jones:
get_data_one_or_more_titles <- function(mykey,title){
#start with an empty result inside the function so it does not depend on a global variable:
mat <- NULL
#this part gets the titles in all series given:
temp_table <- by_search_one_or_more_titles(mykey,title,type="movie")
list_of_titles <- unique(temp_table$Search.Title)
#this part cycles through each title and gets the data:
for(movie_title in list_of_titles){
table <- search_by_title(mykey,movie_title,type="movie")
mat=rbind(mat,table)
}
return(mat)
}
You would run the function like this:
get_data_one_or_more_titles("mykey",c("star_wars","indiana_jones"))
You should get a tibble that looks like this:
## # A tibble: 48 × 26
## Title Year Rated Released Runtime Genre Director Writer Actors Plot Language Country Awards Poster Ratings.Source
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Star Wars 1977 PG 25 May … 121 min Acti… George … Georg… Mark … Luke… English United… Won 6… https… Internet Movi…
## 2 Star Wars 1977 PG 25 May … 121 min Acti… George … Georg… Mark … Luke… English United… Won 6… https… Rotten Tomato…
## 3 Star Wars 1977 PG 25 May … 121 min Acti… George … Georg… Mark … Luke… English United… Won 6… https… Metacritic
## 4 Star War… 1980 PG 20 Jun … 124 min Acti… Irvin K… Leigh… Mark … Afte… English United… Won 1… https… Internet Movi…
## 5 Star War… 1980 PG 20 Jun … 124 min Acti… Irvin K… Leigh… Mark … Afte… English United… Won 1… https… Rotten Tomato…
## 6 Star War… 1980 PG 20 Jun … 124 min Acti… Irvin K… Leigh… Mark … Afte… English United… Won 1… https… Metacritic
## 7 Star War… 1983 PG 25 May … 131 min Acti… Richard… Lawre… Mark … Afte… English United… Nomin… https… Internet Movi…
## 8 Star War… 1983 PG 25 May … 131 min Acti… Richard… Lawre… Mark … Afte… English United… Nomin… https… Rotten Tomato…
## 9 Star War… 1983 PG 25 May … 131 min Acti… Richard… Lawre… Mark … Afte… English United… Nomin… https… Metacritic
## 10 Star War… 2015 PG-13 18 Dec … 138 min Acti… J.J. Ab… Lawre… Daisy… As a… English United… Nomin… https… Internet Movi…
## # … with 38 more rows, and 11 more variables: Ratings.Value <chr>, Metascore <chr>, imdbRating <chr>, imdbVotes <chr>,
## # imdbID <chr>, Type <chr>, DVD <chr>, BoxOffice <chr>, Production <chr>, Website <chr>, Response <chr>
Let’s make a data set!
First, I’m going to make two lists of movies:
#for these movies, I just want the first result it gives me because they are not a series:
titles <- c("casablanca",
"the_wizard_of_oz",
"it's_a_wonderful_life",
"goodfellas",
"taxi_driver",
"psycho",
"singin_in_the_rain",
"2001:_a_space_odyssey",
"vertigo")
#for these movies, I want all of them in each series:
series <- c("the_godfather",
"star_wars",
"alien",
"fast_and_furious",
"final_destination",
"friday_the_13th")
Here is the function I’m going to use to get all of the data for all of my movies:
get_data_titles_and_series <- function(mykey,titles,series){
#start with empty results inside the function so it does not depend on global variables:
mat1 <- NULL
mat2 <- NULL
mat3 <- NULL
#this part gets the data for all of my stand-alone titles provided:
for(i in titles){
temp_table <- search_by_title(mykey,i,type="movie")
mat1=rbind(mat1,temp_table)
}
#this part gets the data by cycling through each movie from each series provided:
for(j in series){
temp_table <- by_search_series(mykey,j,type="movie")
list_of_titles <- unique(temp_table$Search.Title)
for(movie_title in list_of_titles){
table2 <- search_by_title(mykey,movie_title,type="movie")
mat2=rbind(mat2,table2)
}
}
#this part combines data for both the results for the stand-alone titles and series
mat3=rbind(mat3,mat1,mat2)
return(mat3)
}
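I’m going to run the function like this, saving the result as unformatted_data so I can refer to it later:
unformatted_data <- get_data_titles_and_series("mykey",titles,series)
unformatted_data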
Here is the tibble I get:
## # A tibble: 152 × 26
## Title Year Rated Released Runtime Genre Director Writer Actors Plot Language Country Awards Poster Ratings.Source
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Casablan… 1942 PG 23 Jan … 102 min Dram… Michael… Juliu… Humph… A cy… English… United… Won 3… https… Internet Movi…
## 2 Casablan… 1942 PG 23 Jan … 102 min Dram… Michael… Juliu… Humph… A cy… English… United… Won 3… https… Rotten Tomato…
## 3 Casablan… 1942 PG 23 Jan … 102 min Dram… Michael… Juliu… Humph… A cy… English… United… Won 3… https… Metacritic
## 4 The Wiza… 1939 G 25 Aug … 102 min Adve… Victor … Noel … Judy … Youn… English United… Won 2… https… Internet Movi…
## 5 The Wiza… 1939 G 25 Aug … 102 min Adve… Victor … Noel … Judy … Youn… English United… Won 2… https… Rotten Tomato…
## 6 The Wiza… 1939 G 25 Aug … 102 min Adve… Victor … Noel … Judy … Youn… English United… Won 2… https… Metacritic
## 7 It's a W… 1946 PG 07 Jan … 130 min Dram… Frank C… Franc… James… An a… English… United… Nomin… https… Internet Movi…
## 8 It's a W… 1946 PG 07 Jan … 130 min Dram… Frank C… Franc… James… An a… English… United… Nomin… https… Rotten Tomato…
## 9 It's a W… 1946 PG 07 Jan … 130 min Dram… Frank C… Franc… James… An a… English… United… Nomin… https… Metacritic
## 10 Goodfell… 1990 R 21 Sep … 145 min Biog… Martin … Nicho… Rober… The … English… United… Won 1… https… Internet Movi…
## # … with 142 more rows, and 11 more variables: Ratings.Value <chr>, Metascore <chr>, imdbRating <chr>, imdbVotes <chr>,
## # imdbID <chr>, Type <chr>, DVD <chr>, BoxOffice <chr>, Production <chr>, Website <chr>, Response <chr>
Here is the number of unique movie titles in my data set:
length(unique(unformatted_data$Title))
## [1] 61
There are 61 unique titles. Here is the full list:
unique(unformatted_data$Title)
## [1] "Casablanca"
## [2] "The Wizard of Oz"
## [3] "It's a Wonderful Life"
## [4] "Goodfellas"
## [5] "Taxi Driver"
## [6] "Psycho"
## [7] "Singin' in the Rain"
## [8] "2001: A Space Odyssey"
## [9] "Vertigo"
## [10] "The Godfather"
## [11] "The Godfather: Part II"
## [12] "The Godfather: Part III"
## [13] "The Godfather Trilogy: 1901-1980"
## [14] "The Godfather Family: A Look Inside"
## [15] "The Godfather Legacy"
## [16] "Herschell Gordon Lewis: The Godfather of Gore"
## [17] "The Godfather of Green Bay"
## [18] "Paul Mooney: The Godfather of Comedy"
## [19] "The Godfather: Behind the Scenes"
## [20] "Star Wars"
## [21] "Star Wars: Episode V - The Empire Strikes Back"
## [22] "Star Wars: Episode VI - Return of the Jedi"
## [23] "Star Wars: Episode VII - The Force Awakens"
## [24] "Star Wars: Episode I - The Phantom Menace"
## [25] "Star Wars: Episode III - Revenge of the Sith"
## [26] "Star Wars: Episode II - Attack of the Clones"
## [27] "Star Wars: Episode VIII - The Last Jedi"
## [28] "Rogue One: A Star Wars Story"
## [29] "Star Wars: Episode IX - The Rise of Skywalker"
## [30] "Alien"
## [31] "Alien³"
## [32] "Alien: Covenant"
## [33] "Alien: Resurrection"
## [34] "Alien vs. Predator"
## [35] "My Stepmother Is an Alien"
## [36] "Alien Nation"
## [37] "Alien Raiders"
## [38] "Alien Abduction"
## [39] "Alien Autopsy"
## [40] "Fast and Furious"
## [41] "Tasmanian Devil: The Fast and Furious Life of Errol Flynn"
## [42] "Christine: Fast and Furious"
## [43] "Fast and Furious: Fast Cars!"
## [44] "Final Destination"
## [45] "Final Destination 2"
## [46] "Final Destination 3"
## [47] "Final Destination 5"
## [48] "The Final Destination"
## [49] "The City of Your Final Destination"
## [50] "Death's Design: Making 'Final Destination 3'"
## [51] "Final Destination Unknown"
## [52] "Final Destination 5: Circle of Death"
## [53] "Friday the 13th"
## [54] "Friday the 13th Part 2"
## [55] "Friday the 13th Part III"
## [56] "Friday the 13th: The Final Chapter"
## [57] "Friday the 13th Part VI: Jason Lives"
## [58] "Friday the 13th: A New Beginning"
## [59] "Friday the 13th Part VIII: Jason Takes Manhattan"
## [60] "Friday the 13th Part VII: The New Blood"
## [61] "His Name Was Jason: 30 Years of Friday the 13th"
Great, but we can’t use this data until we format it. Here’s what I’m going to do in the next function:
- Convert the following columns from character to numeric:
  - Year
  - Runtime
  - Ratings.Value
  - Metascore
  - imdbRating
  - imdbVotes
  - BoxOffice
- Convert the values in these columns to dates (year-month-day):
  - Released
  - DVD
- Create two new columns:
  - average_rating is the average of the Ratings.Value, Metascore, and imdbRating
  - Summary_Awards shows whether a movie:
    - won and was nominated for an award
    - won an award
    - was nominated for an award
    - did not win and was not nominated for an award
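To make the character-to-numeric step concrete, here is a small sketch of the idea on single values. The helper strip_to_number is hypothetical and is not the one used below. "121 min" is the Runtime shown for Star Wars above, and the other two strings are placeholders in the style of imdbVotes and BoxOffice.

```r
# Hypothetical helper: keep only digits and the decimal point, then convert
strip_to_number <- function(x) {
  as.numeric(gsub("[^0-9.]", "", x))
}

print(strip_to_number("121 min"))     # 121
print(strip_to_number("1,234,567"))   # 1234567
print(strip_to_number("$1,234,567"))  # 1234567
```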
Here are my helper functions:
#this will help us convert the Ratings.Value column to numeric
parse_number <- function(S){
if(grepl("/", S)){
A<-str_split(S, "/")
A<-as.numeric(unlist(A))
A<-A[[1]]/A[[2]]
A<-A*100
} else {
A<-as.numeric(gsub("%","",S))
}
return(A)
}
#this will help us make the Summary_Awards column:
award <- function(S){
if(is.na(S)){
A<-"none"
return(A)
}
S=tolower(S)
#won and nominated:
if((grepl("won", S) | grepl("win", S)) && (grepl("nomina", S))){
A<-"won and nominated"
}
#only won:
else if (grepl("won", S) | grepl("win", S)){
A<-"won"
}
#only nominated:
else if (grepl("nomina", S)){
A<-"nomination"
}
#no awards or nominations:
else {
A<-"none"
}
return(A)
}
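The parse_number() helper above handles one value at a time, which is why it has to be called through sapply(). As an alternative sketch, the function below does the same conversion on a whole vector at once and puts the three rating formats on a common 0 to 100 scale. The name rating_to_percent is hypothetical, and a different name also avoids masking readr’s parse_number(), which tidyverse attaches. The input strings are placeholders.

```r
# Hypothetical vectorized variant: convert rating strings to a 0-100 scale
rating_to_percent <- function(x) {
  is_fraction <- grepl("/", x, fixed = TRUE)
  # Percent strings such as "93%": drop the sign and convert
  out <- suppressWarnings(as.numeric(sub("%", "", x, fixed = TRUE)))
  # Fractions such as "8.6/10": divide and rescale to 0-100
  parts <- strsplit(x[is_fraction], "/", fixed = TRUE)
  out[is_fraction] <- vapply(
    parts,
    function(p) 100 * as.numeric(p[1]) / as.numeric(p[2]),
    numeric(1)
  )
  out
}

print(rating_to_percent(c("8.6/10", "93%", "90/100")))
# [1] 86 93 90
```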
Here is my complete formatting function:
format_data <- function(mykey,titles,series){
#start with an empty result inside the function so it does not depend on a global variable:
mat1 <- NULL
data <- get_data_titles_and_series(mykey,titles,series)
data$Year <- as.numeric(data$Year)
data$Released <- dmy(data$Released)
data$Runtime <- as.numeric(gsub(" min","",data$Runtime))
data$Ratings.Value <- sapply(data$Ratings.Value, FUN=parse_number)
data$Summary_Awards <- as.factor(sapply(data$Awards, FUN=award))
data$Metascore <- as.numeric(data$Metascore)
data$imdbRating <- as.numeric(data$imdbRating)*10
data$imdbVotes <- as.numeric(gsub(",","",data$imdbVotes))
data$DVD <- dmy(data$DVD)
data$BoxOffice <- gsub("\\$","",data$BoxOffice)
data$BoxOffice <- as.numeric(gsub(",","",data$BoxOffice))
movie_list<-unique(data$Title)
for (i in movie_list){
temp=data[is.element(data$Title,i),]
Ratings.Value_mean<-mean(temp$Ratings.Value)
Metascore<-unique(temp$Metascore)
imdbRating<-unique(temp$imdbRating)
#some of the values in the Metascore column have NAs, so these if statements accommodate for this:
if(is.na(Metascore)==TRUE){
temp$average_rating=(Ratings.Value_mean+imdbRating)/2
}
if(is.na(Metascore)==FALSE){
temp$average_rating=(Ratings.Value_mean+Metascore+imdbRating)/3
}
mat1=rbind(mat1,temp)
}
return(mat1)
}
Here’s how I ran it:
formatted_data<-format_data("mykey",titles,series)
Here’s the tibble I got:
## # A tibble: 152 × 28
## Title Year Rated Released Runtime Genre Director Writer Actors Plot Language Country Awards Poster Ratings.Source
## <chr> <dbl> <chr> <date> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Casabl… 1942 PG 1943-01-23 102 Dram… Michael… Juliu… Humph… A cy… English… United… Won 3… https… Internet Movi…
## 2 Casabl… 1942 PG 1943-01-23 102 Dram… Michael… Juliu… Humph… A cy… English… United… Won 3… https… Rotten Tomato…
## 3 Casabl… 1942 PG 1943-01-23 102 Dram… Michael… Juliu… Humph… A cy… English… United… Won 3… https… Metacritic
## 4 The Wi… 1939 G 1939-08-25 102 Adve… Victor … Noel … Judy … Youn… English United… Won 2… https… Internet Movi…
## 5 The Wi… 1939 G 1939-08-25 102 Adve… Victor … Noel … Judy … Youn… English United… Won 2… https… Rotten Tomato…
## 6 The Wi… 1939 G 1939-08-25 102 Adve… Victor … Noel … Judy … Youn… English United… Won 2… https… Metacritic
## 7 It's a… 1946 PG 1947-01-07 130 Dram… Frank C… Franc… James… An a… English… United… Nomin… https… Internet Movi…
## 8 It's a… 1946 PG 1947-01-07 130 Dram… Frank C… Franc… James… An a… English… United… Nomin… https… Rotten Tomato…
## 9 It's a… 1946 PG 1947-01-07 130 Dram… Frank C… Franc… James… An a… English… United… Nomin… https… Metacritic
## 10 Goodfe… 1990 R 1990-09-21 145 Biog… Martin … Nicho… Rober… The … English… United… Won 1… https… Internet Movi…
## # … with 142 more rows, and 13 more variables: Ratings.Value <dbl>, Metascore <dbl>, imdbRating <dbl>, imdbVotes <dbl>,
## # imdbID <chr>, Type <chr>, DVD <date>, BoxOffice <dbl>, Production <chr>, Website <chr>, Response <chr>,
## # Summary_Awards <fct>, average_rating <dbl>
When I ran this, I got the following warnings:
Warning: 2 failed to parse.
Warning in format_data("mykey", titles, series): NAs introduced by coercion
Warning: 12 failed to parse.
Warning in format_data("mykey", titles, series): NAs introduced by coercion
Let’s see where these warnings are coming from by looking at the unformatted data:
test<-unformatted_data[grep("N/A", unformatted_data$Released), ]
test %>%
select(Released,everything())
## # A tibble: 2 × 26
## Released Title Year Rated Runtime Genre Director Writer Actors Plot Language Country Awards Poster Ratings.Source
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 N/A The Godfa… 1971 N/A 9 min Docu… Fredric… N/A James… N/A English United… N/A N/A Internet Movi…
## 2 N/A Final Des… 2011 Not … 5 min Docu… N/A N/A Emma … A lo… English USA N/A N/A Internet Movi…
## # … with 11 more variables: Ratings.Value <chr>, Metascore <chr>, imdbRating <chr>, imdbVotes <chr>, imdbID <chr>,
## # Type <chr>, DVD <chr>, BoxOffice <chr>, Production <chr>, Website <chr>, Response <chr>
The first warning (Warning: 2 failed to parse.) occurred because the function dmy() from lubridate cannot parse the two rows in the Released column that contain “N/A” (see tibble above). These two titles do not have a release date, so R puts an NA there instead. They might not have been released to theaters. To remove this warning, you would need to replace the “N/A” with a value that dmy() can work with, such as a real date or a missing value (NA).
Let’s look at the second warning (Warning: 12 failed to parse.):
test<-unformatted_data[grep("N/A", unformatted_data$DVD), ]
test %>%
select(DVD,everything())
## # A tibble: 12 × 26
## DVD Title Year Rated Released Runtime Genre Director Writer Actors Plot Language Country Awards Poster
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 N/A The Godfather Tri… 1992 R 30 Oct … 583 min Crim… Francis… Franc… Al Pa… The … English… USA N/A https…
## 2 N/A The Godfather Fam… 1990 N/A 12 Jul … 73 min Docu… Jeff We… David… Franc… A do… English… United… N/A https…
## 3 N/A The Godfather Leg… 2012 TV-14 24 Jul … 95 min Docu… Kevin B… Kevin… Peter… THE … English United… N/A https…
## 4 N/A Paul Mooney: The … 2012 N/A 03 Feb … 86 min Come… N/A Paul … Paul … The … English United… N/A https…
## 5 N/A The Godfather: Be… 1971 N/A N/A 9 min Docu… Fredric… N/A James… N/A English United… N/A N/A
## 6 N/A Fast and Furious 1939 Pass… 06 Oct … 73 min Come… Busby B… Harry… Franc… Rare… English United… N/A https…
## 7 N/A Tasmanian Devil: … 2007 N/A 09 Apr … 60 min Docu… Simon N… Simon… Chris… The … English Austra… N/A https…
## 8 N/A Christine: Fast a… 2004 N/A 28 Sep … 29 min Docu… Laurent… Laure… John … N/A English USA N/A https…
## 9 N/A Fast and Furious:… 2015 N/A 24 Mar … 40 min Docu… N/A N/A Vin D… Divi… English USA N/A N/A
## 10 N/A Death's Design: M… 2006 N/A 06 Feb … 15 min Docu… Katy Le… Carol… Texas… N/A English United… N/A N/A
## 11 N/A Final Destination… 1987 N/A 13 Mar … 27 min Shor… Paul Bu… Paul … Andre… Drif… English United… N/A N/A
## 12 N/A Final Destination… 2011 Not … N/A 5 min Docu… N/A N/A Emma … A lo… English USA N/A N/A
## # … with 11 more variables: Ratings.Source <chr>, Ratings.Value <chr>, Metascore <chr>, imdbRating <chr>,
## # imdbVotes <chr>, imdbID <chr>, Type <chr>, BoxOffice <chr>, Production <chr>, Website <chr>, Response <chr>
Again, the function dmy() cannot parse the twelve rows in the DVD column that contain “N/A” (see tibble above), so R puts an NA in these rows instead. They might not have been released to DVD. To remove this warning, you would need to replace the “N/A” with a value that dmy() can work with, such as a real date or a missing value (NA).
I don’t know what date to put in place of the NA values, so I am going
to leave them in the data. These warnings will not affect our analysis
going forward.
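If you would rather not see the warning at all, one option is to turn the “N/A” strings into real missing values before parsing. The sketch below uses a small made-up vector, dvd_raw, standing in for the DVD column. It is an illustration under that assumption, not the code used to build formatted_data. The rows still end up as NA either way.

```r
library(lubridate)

# Hypothetical stand-in for unformatted_data$DVD (not the real column).
dvd_raw <- c("17 Nov 1998", "N/A", "12 Aug 2003")

# Turn the placeholder string into a real missing value first...
dvd_clean <- ifelse(dvd_raw == "N/A", NA_character_, dvd_raw)

# ...then parse. quiet = TRUE asks dmy() not to report parsing problems.
dvd_dates <- dmy(dvd_clean, quiet = TRUE)
dvd_dates
```

The same idea should carry over to the Released column, which also contains “N/A” for a few movies.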
Here is the structure of my formatted data:
str(formatted_data)
## tibble [152 × 28] (S3: tbl_df/tbl/data.frame)
## $ Title : chr [1:152] "Casablanca" "Casablanca" "Casablanca" "The Wizard of Oz" ...
## $ Year : num [1:152] 1942 1942 1942 1939 1939 ...
## $ Rated : chr [1:152] "PG" "PG" "PG" "G" ...
## $ Released : Date[1:152], format: "1943-01-23" "1943-01-23" "1943-01-23" "1939-08-25" ...
## $ Runtime : num [1:152] 102 102 102 102 102 102 130 130 130 145 ...
## $ Genre : chr [1:152] "Drama, Romance, War" "Drama, Romance, War" "Drama, Romance, War" "Adventure, Family, Fantasy" ...
## $ Director : chr [1:152] "Michael Curtiz" "Michael Curtiz" "Michael Curtiz" "Victor Fleming, George Cukor, Mervyn LeRoy" ...
## $ Writer : chr [1:152] "Julius J. Epstein, Philip G. Epstein, Howard Koch" "Julius J. Epstein, Philip G. Epstein, Howard Koch" "Julius J. Epstein, Philip G. Epstein, Howard Koch" "Noel Langley, Florence Ryerson, Edgar Allan Woolf" ...
## $ Actors : chr [1:152] "Humphrey Bogart, Ingrid Bergman, Paul Henreid" "Humphrey Bogart, Ingrid Bergman, Paul Henreid" "Humphrey Bogart, Ingrid Bergman, Paul Henreid" "Judy Garland, Frank Morgan, Ray Bolger" ...
## $ Plot : chr [1:152] "A cynical expatriate American cafe owner struggles to decide whether or not to help his former lover and her fu"| __truncated__ "A cynical expatriate American cafe owner struggles to decide whether or not to help his former lover and her fu"| __truncated__ "A cynical expatriate American cafe owner struggles to decide whether or not to help his former lover and her fu"| __truncated__ "Young Dorothy Gale and her dog are swept away by a tornado from their Kansas farm to the magical Land of Oz, an"| __truncated__ ...
## $ Language : chr [1:152] "English, French, German, Italian" "English, French, German, Italian" "English, French, German, Italian" "English" ...
## $ Country : chr [1:152] "United States" "United States" "United States" "United States" ...
## $ Awards : chr [1:152] "Won 3 Oscars. 10 wins & 9 nominations total" "Won 3 Oscars. 10 wins & 9 nominations total" "Won 3 Oscars. 10 wins & 9 nominations total" "Won 2 Oscars. 13 wins & 16 nominations total" ...
## $ Poster : chr [1:152] "https://m.media-amazon.com/images/M/MV5BY2IzZGY2YmEtYzljNS00NTM5LTgwMzUtMzM1NjQ4NGI0OTk0XkEyXkFqcGdeQXVyNDYyMDk"| __truncated__ "https://m.media-amazon.com/images/M/MV5BY2IzZGY2YmEtYzljNS00NTM5LTgwMzUtMzM1NjQ4NGI0OTk0XkEyXkFqcGdeQXVyNDYyMDk"| __truncated__ "https://m.media-amazon.com/images/M/MV5BY2IzZGY2YmEtYzljNS00NTM5LTgwMzUtMzM1NjQ4NGI0OTk0XkEyXkFqcGdeQXVyNDYyMDk"| __truncated__ "https://m.media-amazon.com/images/M/MV5BNjUyMTc4MDExMV5BMl5BanBnXkFtZTgwNDg0NDIwMjE@._V1_SX300.jpg" ...
## $ Ratings.Source: chr [1:152] "Internet Movie Database" "Rotten Tomatoes" "Metacritic" "Internet Movie Database" ...
## $ Ratings.Value : Named num [1:152] 85 99 100 81 98 92 86 94 89 87 ...
## ..- attr(*, "names")= chr [1:152] "8.5/10" "99%" "100/100" "8.1/10" ...
## $ Metascore : num [1:152] 100 100 100 92 92 92 89 89 89 90 ...
## $ imdbRating : num [1:152] 85 85 85 81 81 81 86 86 86 87 ...
## $ imdbVotes : num [1:152] 561509 561509 561509 391833 391833 ...
## $ imdbID : chr [1:152] "tt0034583" "tt0034583" "tt0034583" "tt0032138" ...
## $ Type : chr [1:152] "movie" "movie" "movie" "movie" ...
## $ DVD : Date[1:152], format: "1998-11-17" "1998-11-17" "1998-11-17" "2003-08-12" ...
## $ BoxOffice : num [1:152] 4219709 4219709 4219709 24668669 24668669 ...
## $ Production : chr [1:152] "N/A" "N/A" "N/A" "N/A" ...
## $ Website : chr [1:152] "N/A" "N/A" "N/A" "N/A" ...
## $ Response : chr [1:152] "True" "True" "True" "True" ...
## $ Summary_Awards: Factor w/ 4 levels "nomination","none",..: 4 4 4 4 4 4 4 4 4 4 ...
## ..- attr(*, "names")= chr [1:152] "Won 3 Oscars. 10 wins & 9 nominations total" "Won 3 Oscars. 10 wins & 9 nominations total" "Won 3 Oscars. 10 wins & 9 nominations total" "Won 2 Oscars. 13 wins & 16 nominations total" ...
## $ average_rating: num [1:152] 93.2 93.2 93.2 87.8 87.8 ...
Exploratory Data Analysis
Contingency Tables
We can use contingency tables to summarize up to three categorical variables.
Let’s say I wanted to know how many movies are rated “PG-13” in my data set. I would make a contingency table summarizing the ratings for all of the movies:
A <- formatted_data %>%
select(Title,Rated)
B<-unique(A)
table(B$Rated)
##
## G N/A Not Rated Passed PG PG-13 R TV-14
## 3 8 4 1 8 9 27 1
As you can see, 9 out of the 61 movies from my data set are “PG-13”.
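To turn counts like these into percentages, prop.table() can be applied to the table. The sketch below re-enters the counts printed above by hand (the names rated_counts and rated_pct are made up for this example); with the real data, prop.table(table(B$Rated)) would be the equivalent call.

```r
# Counts copied from the table(B$Rated) output above.
rated_counts <- c(G = 3, "N/A" = 8, "Not Rated" = 4, Passed = 1,
                  PG = 8, "PG-13" = 9, R = 27, "TV-14" = 1)

# Total number of movies in the table.
total_movies <- sum(rated_counts)

# Share of each rating, as a percentage rounded to one decimal place.
rated_pct <- round(100 * prop.table(rated_counts), 1)
rated_pct
```

Here total_movies is 61, and “PG-13” works out to roughly 14.8% of the movies.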
Let’s say I wanted to summarize how many movies fell under each genre listed. I would make the contingency table below:
C <- formatted_data %>%
select(Title,Genre)
D<-unique(C)
table(D$Genre)
##
## Action, Adventure, Fantasy Action, Adventure, Horror Action, Adventure, Sci-Fi
## 8 2 2
## Action, Horror, Sci-Fi Action, Horror, Thriller Action, Sci-Fi
## 3 1 1
## Adventure, Family, Fantasy Adventure, Horror, Thriller Adventure, Sci-Fi
## 1 1 1
## Biography, Comedy, Sci-Fi Biography, Crime, Drama Comedy
## 1 1 2
## Comedy, Crime, Mystery Comedy, Musical, Romance Comedy, Sci-Fi
## 1 1 1
## Crime, Drama Crime, Drama, Thriller Documentary
## 4 1 3
## Documentary, Biography, Crime, History Documentary, Horror Documentary, Short
## 1 1 5
## Drama, Family, Fantasy Drama, Romance Drama, Romance, War
## 1 1 1
## Horror, Mystery, Thriller Horror, Sci-Fi Horror, Sci-Fi, Thriller
## 5 1 1
## Horror, Thriller Mystery, Romance, Thriller Short, Fantasy
## 7 1 1
So, for example, it looks like 7 out of the 61 movies from my data set are classified under the “Horror, Thriller” genre.
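Note that the table treats each full Genre string as its own category, so “Horror, Thriller” and “Horror, Sci-Fi” are counted separately. If you wanted to count individual genres instead, one possible approach is to split the strings first. This sketch uses a few made-up Genre values (genres is a hypothetical vector, not the real column):

```r
# Made-up Genre strings in the same comma-separated style as the data.
genres <- c("Horror, Thriller", "Horror, Sci-Fi", "Crime, Drama", "Horror, Thriller")

# Split each string on ", " and flatten the pieces into one vector.
single_genres <- unlist(strsplit(genres, ", ", fixed = TRUE))

# Count every individual genre label.
genre_counts <- table(single_genres)
genre_counts
```

With this approach a movie is counted once for every genre it lists, so the counts add up to more than the number of movies.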
What if we wanted to see both the rating and the genre in the same table? We could make a two-way contingency table:
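E <- unique(select(formatted_data, Title, Rated, Genre)) #one data frame keeps each movie's Rated and Genre paired
table(E$Rated, E$Genre)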
##
## Action, Adventure, Fantasy Action, Adventure, Horror Action, Adventure, Sci-Fi Action, Horror, Sci-Fi
## G 0 0 0 0
## N/A 0 0 0 0
## Not Rated 0 1 0 0
## Passed 0 0 0 0
## PG 5 0 0 0
## PG-13 3 1 2 0
## R 0 0 0 3
## TV-14 0 0 0 0
##
## Action, Horror, Thriller Action, Sci-Fi Adventure, Family, Fantasy Adventure, Horror, Thriller
## G 0 0 1 0
## N/A 0 0 0 0
## Not Rated 0 0 0 0
## Passed 0 0 0 0
## PG 0 0 0 0
## PG-13 0 0 0 0
## R 1 1 0 1
## TV-14 0 0 0 0
##
## Adventure, Sci-Fi Biography, Comedy, Sci-Fi Biography, Crime, Drama Comedy Comedy, Crime, Mystery
## G 1 0 0 0 0
## N/A 0 0 0 1 0
## Not Rated 0 0 0 0 0
## Passed 0 0 0 0 1
## PG 0 0 0 0 0
## PG-13 0 1 0 0 0
## R 0 0 1 1 0
## TV-14 0 0 0 0 0
##
## Comedy, Musical, Romance Comedy, Sci-Fi Crime, Drama Crime, Drama, Thriller Documentary
## G 1 0 0 0 0
## N/A 0 0 0 0 1
## Not Rated 0 0 0 0 1
## Passed 0 0 0 0 0
## PG 0 0 0 0 0
## PG-13 0 1 0 0 0
## R 0 0 4 1 0
## TV-14 0 0 0 0 1
##
## Documentary, Biography, Crime, History Documentary, Horror Documentary, Short Drama, Family, Fantasy
## G 0 0 0 0
## N/A 1 0 4 0
## Not Rated 0 1 1 0
## Passed 0 0 0 0
## PG 0 0 0 1
## PG-13 0 0 0 0
## R 0 0 0 0
## TV-14 0 0 0 0
##
## Drama, Romance Drama, Romance, War Horror, Mystery, Thriller Horror, Sci-Fi Horror, Sci-Fi, Thriller
## G 0 0 0 0 0
## N/A 0 0 0 0 0
## Not Rated 0 0 0 0 0
## Passed 0 0 0 0 0
## PG 0 1 0 0 0
## PG-13 1 0 0 0 0
## R 0 0 5 1 1
## TV-14 0 0 0 0 0
##
## Horror, Thriller Mystery, Romance, Thriller Short, Fantasy
## G 0 0 0
## N/A 0 0 1
## Not Rated 0 0 0
## Passed 0 0 0
## PG 0 1 0
## PG-13 0 0 0
## R 7 0 0
## TV-14 0 0 0
So, for example, it looks like 5 out of the 61 movies from my data set are rated PG and fall under the “Action, Adventure, Fantasy” genre.
Bar Plots
We can use bar plots to visually summarize many categorical variables.
Let’s say we wanted to find out how many movies each director made in my data set. We could make the bar graph below:
C<- formatted_data %>%
select(Title,Director)
D<-unique(C)
g<-ggplot(data = D, aes(x = Director ))
g + geom_bar(fill="lightblue") +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Director", title = "Number of Movies Each Director Made", y="Number of Movies") +
coord_flip()
It looks like George Lucas and Francis Ford Coppola are tied for the most movies.
Now, what if we wanted to see, for each director, how many movies won or were nominated for awards? We could make the bar graph below.
C<- formatted_data %>%
select(Title,Director,Summary_Awards)
D<-unique(C)
g<-ggplot(data = D, aes(x = Director ))
g + geom_bar(aes(fill = as.factor(Summary_Awards))) +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Director", title = "Number of Movies Each Director Made", y="Number of Movies") +
scale_fill_discrete(name = NULL) +
coord_flip()
It looks like all 4 of George Lucas’ movies both won and were nominated for awards. Of the 4 movies Francis Ford Coppola made, 3 both won and were nominated, and 1 neither won nor was nominated.
Measures of Center and Spread (Histogram and Box Plot)
For quantitative data with one variable, we can calculate measures of center (e.g., mean) and spread (e.g., variance). We can summarize multiple columns like this:
data_summary <- formatted_data %>% #select some of the numerical columns
select(Year, Runtime, Metascore, imdbRating, imdbVotes, BoxOffice, average_rating)
data_summary <- unique(data_summary)
summary(data_summary)
## Year Runtime Metascore imdbRating imdbVotes BoxOffice average_rating
## Min. :1939 Min. : 5.0 Min. : 13.00 Min. :45.00 Min. : 13 Min. : 12897 Min. :27.56
## 1st Qu.:1980 1st Qu.: 88.0 1st Qu.: 37.50 1st Qu.:59.00 1st Qu.: 7478 1st Qu.: 21878508 1st Qu.:53.00
## Median :1992 Median :101.0 Median : 58.50 Median :67.00 Median : 136213 Median : 47378436 Median :67.00
## Mean :1991 Mean :109.7 Mean : 60.32 Mean :69.21 Mean : 332324 Mean :139516767 Mean :65.89
## 3rd Qu.:2008 3rd Qu.:128.0 3rd Qu.: 89.00 3rd Qu.:81.00 3rd Qu.: 607401 3rd Qu.: 95520612 3rd Qu.:81.00
## Max. :2019 Max. :583.0 Max. :100.00 Max. :93.00 Max. :1786257 Max. :936662225 Max. :96.11
## NA's :17 NA's :17
It looks like my data set contains movies from 1939 to 2019.
Let’s explore this by making a histogram showing the distribution of the “Year” column:
A <- formatted_data %>%
select(Title,Year,Summary_Awards)
B<-unique(A)
g <- ggplot(B, aes(x = Year))
g + geom_histogram(color = "blue", fill = "red",
size = 2, binwidth = 3) +
labs(x = "Year", title = "Distribution of Movies by Year", y="Number of Movies")
It looks like most of the movies in my data set were made after 1975, with a peak in the late 1980s.
Let’s smooth it out with a kernel density estimate.
ggplot(B, aes(x = Year)) + geom_histogram(aes(y = ..density..),fill = "lightgrey") +
geom_density(adjust = 0.25, size = 1) +
labs(x = "Year", title = "Distribution of Movies by Year")
The line helps us see that there are two peaks, one at around 1988 and one around 2006.
We can also calculate the mean and standard deviation for one column. Let’s calculate the average and standard deviation of the imdbVotes column:
imdbVotes <- unique(formatted_data$imdbVotes)
avg_imdbVotes <- mean(imdbVotes)
avg_imdbVotes
## [1] 332324.4
sd_imdbVotes <- sd(imdbVotes)
sd_imdbVotes
## [1] 423035.6
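One way to put these two numbers side by side is the coefficient of variation, which is the standard deviation divided by the mean. The sketch below simply re-enters the two values printed above (mean_votes, sd_votes, and cv_votes are names made up for this example):

```r
# Values copied from the mean() and sd() output above.
mean_votes <- 332324.4
sd_votes   <- 423035.6

# Coefficient of variation: standard deviation relative to the mean.
cv_votes <- sd_votes / mean_votes
round(cv_votes, 2)
```

This comes out to about 1.27, so the standard deviation is larger than the mean itself.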
On average, a movie in my data set received about 332324 imdbVotes. The standard deviation is 423035.6, which is larger than the mean, meaning that the number of imdbVotes varies a lot between the movies. To explore this, let’s calculate the average number of imdbVotes for each genre:
A<-formatted_data %>%
select(Genre,imdbVotes) %>%
group_by(Genre) %>%
mutate(avg_genre = mean(imdbVotes))
A<-A %>%
select(Genre,avg_genre)
A<-unique(A)
A %>% as_tibble %>% print(n=30)
## # A tibble: 30 × 2
## Genre avg_genre
## <chr> <dbl>
## 1 Drama, Romance, War 561509
## 2 Adventure, Family, Fantasy 391833
## 3 Drama, Family, Fantasy 445520
## 4 Biography, Crime, Drama 1117933
## 5 Crime, Drama 1051420.
## 6 Horror, Mystery, Thriller 207197.
## 7 Comedy, Musical, Romance 238235
## 8 Adventure, Sci-Fi 652541
## 9 Mystery, Romance, Thriller 395723
## 10 Crime, Drama, Thriller 14694
## 11 Documentary 551.
## 12 Comedy 222
## 13 Documentary, Short 47.4
## 14 Action, Adventure, Fantasy 861926.
## 15 Action, Adventure, Sci-Fi 758986.
## 16 Horror, Sci-Fi 855565
## 17 Action, Horror, Sci-Fi 204364.
## 18 Horror, Sci-Fi, Thriller 275908
## 19 Action, Adventure, Horror 101779
## 20 Comedy, Sci-Fi 26382
## 21 Action, Sci-Fi 15543
## 22 Biography, Comedy, Sci-Fi 7478
## 23 Comedy, Crime, Mystery 832
## 24 Documentary, Biography, Crime, History 82
## 25 Horror, Thriller 102074.
## 26 Action, Horror, Thriller 104371
## 27 Drama, Romance 3087
## 28 Short, Fantasy 51
## 29 Adventure, Horror, Thriller 37819
## 30 Documentary, Horror 3083
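The same kind of table can also be built with summarise(), which returns one row per group directly. The sketch below uses a tiny made-up data frame (toy, with hypothetical titles and vote counts) that mimics the layout of formatted_data, where a movie is repeated once per rating source. It de-duplicates with distinct() first, because averaging over repeated rows would weight movies with more rating rows more heavily.

```r
library(dplyr)

# Toy stand-in for formatted_data: one row per movie per rating source.
toy <- data.frame(
  Title     = c("A", "A", "A", "B", "C", "C"),
  Genre     = c("Crime, Drama", "Crime, Drama", "Crime, Drama",
                "Crime, Drama", "Comedy", "Comedy"),
  imdbVotes = c(900, 900, 900, 100, 50, 50)
)

avg_by_genre <- toy %>%
  distinct(Title, Genre, imdbVotes) %>%   # keep one row per movie
  group_by(Genre) %>%
  summarise(avg_genre = mean(imdbVotes),  # mean over movies, not rows
            n_movies  = n())
avg_by_genre
```

In this toy example, “Crime, Drama” averages 500 votes over its two movies. Averaging over the repeated rows instead would give 700, because movie A appears three times.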
Now, let’s find the five-number summary (plus the mean) of these genre averages.
summary(A$avg_genre)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 47.4 4184.8 103222.4 281225.2 433070.8 1117933.0
So, comparing the minimum and maximum with the table above, the genre with the most votes on average is “Biography, Crime, Drama” and the genre with the fewest votes on average is “Documentary, Short”.
Let’s make a box plot showing the spread of the number of imdbVotes for every genre:
A<-formatted_data %>%
select(Genre,imdbVotes)
B<-unique(A)
B %>%
ggplot(aes(x = Genre, y = imdbVotes, fill = Genre)) +
geom_boxplot() +
theme(legend.position = "none",axis.text.x = element_text(angle = 90)) +
labs(x = "Genre", title = "imdbVotes By Genre", y="Number of imdbVotes") +
coord_flip()
It looks like the genre with the widest spread in imdbVotes is “Crime, Drama”.
We can also calculate the average number of imdbVotes for each award scenario (won and nominated, won, nominated, or no wins or nominations):
A<-formatted_data %>%
select(Summary_Awards,imdbVotes) %>%
group_by(Summary_Awards) %>%
mutate(avg_Summary_Awards = mean(imdbVotes))
A<-A %>%
select(Summary_Awards,avg_Summary_Awards)
A<-unique(A)
A<-as_tibble(A)
A
## # A tibble: 4 × 2
## Summary_Awards avg_Summary_Awards
## <fct> <dbl>
## 1 won and nominated 626673.
## 2 none 2246.
## 3 won 4584.
## 4 nomination 62099.
It looks like the movies that both won awards and were nominated for awards received the most imdbVotes on average. Let’s make a box plot showing the spread of the number of imdbVotes for every award scenario:
A<-formatted_data %>%
select(Summary_Awards,imdbVotes)
B<-unique(A)
B %>%
ggplot(aes(x = Summary_Awards, y = imdbVotes, fill = Summary_Awards)) +
geom_boxplot() +
theme(legend.position = "none",axis.text.x = element_text(angle = 90)) +
labs(x = "Award Status", title = "imdbVotes By Award Status", y="Number of imdbVotes") +
coord_flip()
It looks like the award status with the widest spread in imdbVotes is “won and nominated”.
Covariance and Correlation
For quantitative data with two variables, we can measure the linear relationship between them (e.g., with covariance and correlation).
Let’s see if there is a linear relationship between the number of imdbVotes and the average rating.
A <- formatted_data %>%
select(imdbVotes,average_rating)
B<-unique(A)
cov(B$imdbVotes,B$average_rating) #Covariance
## [1] 4539881
cor(B$imdbVotes,B$average_rating) #Correlation
## [1] 0.5827318
The covariance is 4539881. Its positive sign means that imdbVotes and average_rating tend to increase together, but its size depends on the units of the two variables.
The correlation coefficient is 0.5827318, meaning that imdbVotes and average_rating have a moderate positive linear relationship.
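The two measures are closely related: the correlation is the covariance divided by the product of the two standard deviations, which is why it always falls between -1 and 1. The sketch below checks this on two short made-up vectors (x and y are not columns from the data set):

```r
# Two small made-up vectors for illustration.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Correlation computed by hand from the covariance.
cov_xy   <- cov(x, y)
r_manual <- cov_xy / (sd(x) * sd(y))

# Should agree with R's built-in cor().
all.equal(r_manual, cor(x, y))

# Rescaling x changes the covariance but not the correlation.
cov_scaled <- cov(1000 * x, y)
cor_scaled <- cor(1000 * x, y)
```

This is why the correlation is usually easier to interpret than the covariance when the variables are on very different scales, such as vote counts and ratings.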
Scatterplot
Now, let’s make a scatterplot comparing the amount of money a movie makes at the Box Office to the average rating:
A <- formatted_data %>%
select(BoxOffice, average_rating,Year)
A <- A %>% drop_na(BoxOffice) #removing some NAs from BoxOffice column
B<-unique(A)
correlation <- cor(B$average_rating, B$BoxOffice)
g <- ggplot(B, aes(x = average_rating, y = BoxOffice))
g + geom_text(aes(label = Year)) +
geom_smooth(method = lm, col = "Red") +
labs(x = "Average Rating", title = "Average Rating vs. Box Office", y="Box Office") +
geom_text(x = 40, y = 7.5e+08, size = 5, label = paste0("Correlation = ", round(correlation, 3)))
As you can see, as average_rating increases, BoxOffice tends to increase too. In addition, there was one movie made in 2015 that did quite well at the Box Office.
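The code above drops the rows with a missing BoxOffice before calling cor(), because cor() returns NA by default when either input contains a missing value. An alternative is to let cor() skip incomplete pairs itself with use = "complete.obs". Here is a small sketch with made-up numbers (box_office and rating are hypothetical vectors):

```r
# Made-up values; the third movie has no box office figure.
box_office <- c(100, 200, NA, 400)
rating     <- c(20, 40, 60, 80)

# By default, a missing value makes the result NA.
cor_default <- cor(rating, box_office)

# use = "complete.obs" keeps only the pairs where both values are present.
cor_complete <- cor(rating, box_office, use = "complete.obs")
cor_complete
```

Either approach should give the same correlation; dropping the rows first has the advantage that the plot and the number are based on exactly the same movies.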
That’s All Folks!
I hope this vignette will help you get data from the OMDb API and do an exploratory data analysis. Now, I think it’s time for me to get some popcorn!