This is the second day of my participation in the November writing challenge. Event details: the last writing challenge of 2021.
The Python urllib library is used to work with URLs and to crawl web pages.
Request header
headers = {
    'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 11_0 like Mac OS X) AppleWebKit/604.1.34 (KHTML, like Gecko) Version/11.0 Mobile/15A5341f Safari/604.1',
    'Referer': 'https://*******'
}
The request headers disguise the crawler as a real device, so the requests look like normal browsing and are less easily detected. User-Agent, Referer, and Cookie can all be found by right-clicking the page, opening the developer tools, and inspecting a request under the Network tab.
You can also collect a pool of User-Agents or generate them randomly, and sleep after crawling a few pages. I usually sleep for 2 seconds.
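For instance, a minimal sketch of that idea (the User-Agent strings below are truncated placeholders, not real ones):
import random
import time

# a small pool of placeholder User-Agent strings; swap in ones you have actually collected
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Mozilla/5.0 (iPad; CPU OS 11_0 like Mac OS X) ...',
]

headers = {
    'User-Agent': random.choice(USER_AGENTS),  # rotate the User-Agent per request
    'Referer': 'https://*******',
}
time.sleep(2)  # rest for a couple of seconds between pages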
Crawl
I won't write out the path in detail; you can substitute your own. I need to crawl movie data to do a simple analysis, and I picked three categories:
DICT = {"Comedy": "https://s/1---comedy--------.html", "Love": "https://s/1---love--------.html", "Action": "https://s/1---action--------.html"}
Save the data to data.csv, which has six fields: movie name, movie score, movie category, movie director, region and year
import csv

f = open("data.csv", "w", encoding='utf-8', newline='')
writer = csv.writer(f)
writer.writerow(['movie_title', 'movie_score', 'movie_type', 'movie_director', 'movie_region', 'movie_year'])
When finished, close the file with f.close().
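A with statement is a sketch of an alternative that closes and flushes the file automatically; the rest of this post keeps the explicit f.close():
import csv

# the with block closes data.csv automatically once writing is done
with open("data.csv", "w", encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['movie_title', 'movie_score', 'movie_type', 'movie_director', 'movie_region', 'movie_year'])
    # ... write the data rows inside this block ...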
Let me paste the full code first:
# assumes: from urllib.request import Request, urlopen; from urllib.parse import quote;
# import string, csv, time; from bs4 import BeautifulSoup
for kind in DICT:
    url = DICT[kind]
    url = quote(url, safe=string.printable)  # percent-encode any non-ASCII characters in the URL
    request = Request(url, headers=headers)
    res = urlopen(request)
    # save the raw page to disk, then read it back for parsing
    with open("./tempFile/" + kind + ".html", 'wb') as page:  # use a new name so the CSV handle f is not shadowed
        page.write(res.read())
    with open("./tempFile/" + kind + ".html", 'r', encoding='utf-8') as page:
        data = page.read()
    soup = BeautifulSoup(data, "html.parser")
    node_list = soup.find("ul", class_="myui-vodlist").findAll("li")
    for li in node_list:
        movie_detailUrl = "https://s/" + li.find("h4", class_="title").find("a")['href']
        movie_title = li.find("h4", class_="title").find("a").get_text()
        movie_score = li.find("span", class_="pic-tag").get_text()
        movie_director, movie_region, movie_year = get_movieInfo(movie_detailUrl)
        writer.writerow([movie_title, movie_score, kind, movie_director, movie_region, movie_year])
        print(movie_title, movie_detailUrl, kind, movie_director, movie_region, movie_year)
        # time.sleep(2)
f.close()
url = quote(url, safe=string.printable) percent-encodes the non-ASCII characters in the URL. Writing the page to "./tempFile/" + kind + ".html" and reading it back from that path is a bit safer, but you can also skip the intermediate file.
soup = BeautifulSoup(data, "html.parser") — BeautifulSoup is pretty easy to use; Scrapy's selectors still confuse me a bit. soup.find("ul", class_="myui-vodlist").findAll("li") grabs every list item, and find("a").get_text() pulls the link text out of each one.
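Just to illustrate those two calls in isolation, here is a tiny sketch on a made-up snippet (the HTML below is not from the real site):
from bs4 import BeautifulSoup

# a made-up list item that mimics the structure parsed above
sample = '<ul class="myui-vodlist"><li><h4 class="title"><a href="/id/1.html">Movie A</a></h4></li></ul>'
demo = BeautifulSoup(sample, "html.parser")
for item in demo.find("ul", class_="myui-vodlist").findAll("li"):
    a = item.find("h4", class_="title").find("a")
    print(a['href'], a.get_text())  # -> /id/1.html Movie A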
Decompression read
get_movieInfo fetches what we need from another page. That page works the same way, except its response comes back gzip-compressed, which, as a beginner, I ran into for the first time. So there is one extra step: after getting the response, read it as a byte stream and decompress it.
# requires: from io import BytesIO; import gzip
res = urlopen(request)
htmls = res.read()                # raw gzip-compressed bytes
buff = BytesIO(htmls)             # wrap the bytes in an in-memory stream
ff = gzip.GzipFile(fileobj=buff)  # treat the stream as a gzip file
htmls = ff.read()                 # decompressed HTML
Get to the point! You have to add "Accept-Encoding": "gzip" to the request headers. This is very important. How important? If you leave it out, the code throws an error.
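Here is a hedged sketch of that request step (not the exact get_movieInfo from this post; fetch_html, its parameters, and the header merging are my own illustration). It only decompresses when the server actually reports gzip:
import gzip
from urllib.request import Request, urlopen

def fetch_html(url, headers):
    # ask for gzip explicitly, on top of whatever headers we already use
    req = Request(url, headers={**headers, "Accept-Encoding": "gzip"})
    res = urlopen(req)
    body = res.read()
    if res.headers.get("Content-Encoding") == "gzip":  # decompress only if the reply is gzipped
        body = gzip.decompress(body)
    return body.decode("utf-8")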
This step of the data collection gave me a real headache.
Originally I took the last couple of <a> tags with get_text(), but if the year is unknown, the last <a> may actually be the region. The solution is to extract all of the text instead of the content of individual elements. Silly me.
# text of the detail page's <p class="data"> block, e.g. "Director:... Region:... Year:..."
aaa = soup.find("p", class_="data").text
words = str(aaa).replace('\xa0', ' ').split(" ")  # normalize non-breaking spaces, then split
index = -1
li = []
for i in words:
    if i != '':  # drop the empty strings left by consecutive spaces
        li.append(i)
movie_year = li[index].split(':')[1]  # the last chunk looks like "Year:2021"
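For completeness, here is a sketch of how the rest of get_movieInfo might turn those chunks into the three fields. parse_movie_info is a hypothetical helper, and the 'Director'/'Region'/'Year' labels are placeholders for whatever text the real site puts before each colon:
def parse_movie_info(chunks):
    # chunks is the cleaned li list from above, e.g. ["Director:XXX", "Region:China", "Year:2021"]
    info = {}
    for chunk in chunks:
        if ':' in chunk:
            key, value = chunk.split(':', 1)
            info[key] = value
    # the label names below are placeholders for the site's actual labels
    return info.get('Director'), info.get('Region'), info.get('Year')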
Then save the data into the CSV:
writer.writerow([movie_title, movie_score, kind, movie_director, movie_region, movie_year])