Predicting Opening Box Office Gross with Internet Search Activity

Output

The goal of this project was to estimate the opening box office gross of popular movies before release by quantifying internet “hype.” I originally wanted to look at twitter volume/content for movies in the months leading up to release, but discovered that it’s difficult to scrape twitter data more than two weeks old. A great website for scraping data, however, is wikipedia. A lot of people tend to check wikipedia for a movie synopsis before they decide to watch it, so I assumed that total visits to a movie’s wiki page would be a strong indicator of the popular interest. In summary, I found a fairly strong correlation between wikipedia page visits for a given movie in the month leading up the release and total opening weekend revenue.

First, I wrote a python script to pull a list of movie names, release dates, and wiki links for a few different years from here. Then, I gathered wiki views data from stats.grok using the pyhton requests module. Here’s a few popular movies:

Output

Expectedly, wiki views increase in the week before the movies release and peak around the day of release. For every movie in the data set, I summed up all the views in the month prior to release and plotted them with opening box office gross. See below.

There’s clearly a strong correlation here. (Which you might rightly say was to be expected)

However, Wikipedia only seemed useful if you wanted to forecast a week out; many months before a movie release, total views did not vary much between movies and did not provide any predictive power. So, In addition to wikipedia data, I used google keywords to find the total google search volume for the trailer of each movie over time. This provided a way to forecast much further into the future, as trailers are release many months before the movie itself. In the graph below, you can see spikes in search volume which correspond to the month of the trailers release (derp)

As a final feature of my model, I included the seasonal trends of profits:

With these 3 predictors I achieved an R^2 of around .70 with linear regression.

The internet is a great resource for movie studios to track the effectiveness of their advertisements. Total search volume of google trailers is a powerful early predictor of future revenue (derp), and might allow studios to adjust their marketing plans accordingly.

Github Code

Written on May 1, 2016