
Author: Xiao Li, an analyst at a foreign company. I work mainly in the IT industry, but I am personally very fond of film-market analysis, so I often write articles in that field.

Blog: blog.sina.com.cn/leonmovie

Another year has slipped by before we knew it. Say goodbye to 2019 and embrace the new 2020. Happy New Year to you all!

Recently, while handling some movie-related work, I needed some North American box office data, and the most authoritative website for this kind of data is Box Office Mojo (hereinafter referred to as BOM), so I went to check it. Readers who follow the site will know that it has just been redesigned: the page layout was upgraded, and the pages are now optimized for mobile devices (previously there was only a desktop version). The pages look much better, but a lot of data is gone. Almost all of the data used to be freely queryable; now only part of it is, and some data is only available in the paid BOM Pro. To make better use of the data without spending money, I had to do it myself, so I wrote a Python crawler and scraped many years of box office data. Below I use the "North American daily box office data" as an example to explain how to scrape it; other box office data is similar and needs only small code changes.

Figure 1: Screenshot of part of the web page to be scraped

This crawler is written entirely in Python, using Anaconda 2019.10 (the latest release at the time of writing; in theory the bundled Python libraries are also the latest or close to it, so the code below may run into problems on older installations, and you should update if it does). The program consists of two parts: crawling and storing the data, and drawing a simple chart from the data. Let's look at them one by one.

1. Crawling and storing the data

Import all the packages you need first.

import requests
import pandas as pd
import time
import matplotlib.pyplot as plt
import matplotlib.dates as mdate
from pylab import mpl  # used to set a Chinese font, to avoid garbled labels

This is the URL template for the daily box office pages we will use; the %s in the middle is a placeholder for the year, which we substitute later.

urltemplate = r'https://www.boxofficemojo.com/daily/%s/?view=year' 

This is where the data is stored: an Excel file on the desktop. There is very little data, so no database is needed and Excel is enough, although you could of course use CSV format here as well. My path contains Chinese characters and caused no problems in use; if you run into problems, an English-only path is safest.
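If you do go the CSV route, a minimal sketch (the DataFrame contents and the file name here are made up purely for illustration):

import pandas as pd

# Toy one-year table, just to illustrate writing CSV instead of Excel;
# in the real crawler the DataFrame holds one year's scraped daily data.
df = pd.DataFrame({'Date': ['Jan 1, 2019'], 'Top 10 Gross': ['$40,000,000']})
df.to_csv('daily_2019.csv', index=False)  # one CSV file per year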

fileLoc = r'C:\Users\Leon\Desktop\BoxOffice\Box Office Mojo\Daily-daily-data.xlsx'

These are the request headers, which help keep the site's anti-crawler measures from blocking us.

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}

Two notes about the crawler function. First, pandas' ExcelWriter is opened with mode='a' so that each year's table is appended to the file as its own sheet. Second, I don't know whether it was a problem with my network, but the connection occasionally dropped during scraping, so I catch requests.ConnectionError to deal with disconnections. We could have let pd.read_html(url) fetch the page by itself, but here we fetch the page with requests and pass the page source to pd.read_html instead, which lets us send our own headers and avoid the anti-crawler mechanism. It also speeds things up, because pd.read_html is very slow at reading web pages directly.
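As a side note, and purely my own variation rather than part of the original script: instead of giving up on the first dropped connection, as the scraper below does, you could retry a few times before aborting. A minimal sketch, with the helper name get_with_retry being my own invention:

import time
import requests

def get_with_retry(url, headers, tries=3, pause=5):
    # Try the request a few times, pausing after each dropped connection
    for attempt in range(tries):
        try:
            return requests.get(url, headers=headers)
        except requests.ConnectionError:
            time.sleep(pause)
    return None  # every attempt failed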

def scraper(file, headers, urltemp, year_start, year_end):
    # The file is .xlsx, so specify engine='openpyxl'; not needed for .xls
    writer = pd.ExcelWriter(file, engine='openpyxl', mode='a')
    for i in range(year_start, year_end+1):
        url = urltemp % i
        try:
            r = requests.get(url, headers=headers)
            if r.status_code == 200:
                source_code = r.text
                df = pd.read_html(source_code)
                df = df[0]
                df.to_excel(writer, sheet_name=str(i), index=False)
                time.sleep(3)  # slow down a bit; don't hammer someone else's site
        except requests.ConnectionError:
            print('Can not get access to the %s year daily data now' % i)
            return
    writer.save()
    writer.close()


scraper(fileLoc, headers, urltemplate, 1977, 2019)
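One practical caveat, based on my understanding of this pandas/openpyxl combination (worth verifying on your own versions): opening an ExcelWriter with mode='a' loads an existing workbook, so it fails if the file does not exist yet. A small guard:

import os
import openpyxl

# pd.ExcelWriter(..., mode='a') loads the target workbook, so create an
# empty one first if the file is not there yet (my own defensive addition)
if not os.path.exists(fileLoc):
    openpyxl.Workbook().save(fileLoc)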

Since the site only provides data going back to 1977 at the earliest, the call above captures data from 1977 to 2019.

Figure 2: Screenshot of part of the scraped data

2. Drawing a simple chart from the data

The str_to_datetime function cleans up the "Date" column: some cells have extra text appended after the date (a holiday name, for example), so everything after the year is cut off.

def str_to_datetime(x):
    # A plain date string is short; a longer cell has extra text appended,
    # so keep only the part up to and including the year
    if len(x) > 14:
        temp = x.split('2019')
        x = temp[0] + '2019'
    return x
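For illustration, here is how the function behaves (the strings are made up; the exact contents depend on the scraped table):

# A date cell with a note appended gets truncated at the year;
# a plain date passes through unchanged (illustrative strings)
print(str_to_datetime('Dec 25, 2019Christmas Day'))  # -> 'Dec 25, 2019'
print(str_to_datetime('Dec 20, 2019'))               # -> 'Dec 20, 2019'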

The str_to_num function converts the "Top 10 Gross" column to numeric values; the column is stored as strings with a dollar sign and thousands separators.

def str_to_num(x):
    x = x.replace('$', '')  # strip the dollar sign
    x = x.replace(',', '')  # strip the thousands separators
    x = int(x)
    return x
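A small aside: the same conversion can also be done in one vectorized pass with pandas string methods, which is usually faster than apply on large tables. A self-contained sketch with toy values:

import pandas as pd

# Strip '$' and ',' with a single regex pass, then cast to int
s = pd.Series(['$40,000,000', '$1,234,567'])
print(s.str.replace('[$,]', '', regex=True).astype(int))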

Here we are going to draw a "line chart of 2019 daily box office data", so read the corresponding sheet from the file we just scraped and do some simple processing.

table = pd.read_excel(fileLoc, sheet_name='2019')
data = table[['Date', 'Top 10 Gross']]
data['Date'] = data['Date'].apply(str_to_datetime)
data['Top 10 Gross'] = data['Top 10 Gross'].apply(str_to_num)
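One pandas nicety: assigning into a two-column slice like this can trigger a SettingWithCopyWarning; taking an explicit copy sidesteps it. A minor variation on the second line above:

# An explicit copy makes 'data' an independent DataFrame, so the later
# column assignments do not write into a view of 'table'
data = table[['Date', 'Top 10 Gross']].copy()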

Set the data for the X-axis and the Y-axis: X is the time data and Y is the box office data. The box office values are very large, so scale them down (to millions) for convenient plotting.

x = pd.to_datetime(data['Date'])
y = data['Top 10 Gross'] / 1000000  # convert dollars to millions

Find the largest value in the box office series y and its position in the series, then look up the corresponding position in series x, which gives the day it occurred.

max_loc = y.idxmax()
max_y = y.max()
max_date = x.loc[max_loc]

Next, set the related plotting parameters.

mpl.rcParams['font.sans-serif'] = ['SimHei']  # a font that can display Chinese
fig = plt.figure(figsize=(16, 6.5))
ax = fig.add_subplot(111)  # this figure contains only one plot
ax.set_ylim([0, 200])
plt.tick_params(labelsize=13)

The following line of code sets the X-axis to a date format. This is important; otherwise the X-axis would display raw internal numbers like '796366'.

ax.xaxis.set_major_formatter(mdate.DateFormatter('%Y-%m-%d'))
plt.xticks(pd.date_range(x[len(x)-1], x[0], freq='M'), rotation=90)
text = 'The highest box office day was %s, which took in %.2f million.' % (max_date.date(), max_y)
plt.annotate(text, xy=(max_date, max_y), fontsize=14,
             xytext=(max_date+pd.Timedelta(days=10), max_y+10),
             arrowprops=dict(arrowstyle="->", connectionstyle="arc3"),
             xycoords='data')
plt.ylabel('Box Office/million', fontdict={'size': 14})
plt.plot(x, y)
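If you also want to save the chart to disk rather than only display it, a short addition (the file name and DPI are my own choices):

plt.savefig('daily_box_office_2019.png', dpi=150, bbox_inches='tight')  # illustrative file name
plt.show()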

The finished image looks like this

Figure 3: Chart of 2019 North American daily box office data

3. Endnotes

The crawler above is fairly simple and uses no complex techniques such as databases or multithreading. What deserves more effort is mining value from the data we obtained. I will next use these data to analyze how the Hollywood film industry has developed over the past year, and I will share the results with everyone, so stay tuned.
