Photo: By Jinovich from Instagram
Reading this article takes about 10 minutes.
When I first learned Python, I was immediately fascinated by it. What drew me to Python was not just its ability to power web crawlers, but its ability to analyze data: I can present large amounts of data graphically and interpret it far more intuitively.
The premise of data analysis is having data to analyze. What if there is none? One option is to download datasets from data-sharing sites, but their content may not be what you want. The other is to crawl the data from websites yourself.
Today, I will crawl the information of all Pizza Hut restaurants across China for later data analysis.
Our crawl target is Pizza Hut China. Open the Pizza Hut China home page and enter the “Restaurant Inquiry” page.
The data we need to crawl includes the city, the restaurant name, the restaurant address, and the restaurant’s contact number. Since there is a map on the page, the page must also contain the latitude and longitude of each restaurant’s address, so the coordinates are data we need to crawl as well.
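Concretely, each restaurant record we collect will look like the dictionary below. The field names are the ones used in the code later in this article, the example values come from the sample data shown there, and the city itself becomes the key in the final results dictionary:

{
    'coordinate': '22.538912,114.09803',  # latitude, longitude
    'restaurant_name': 'CITIC City Plaza Restaurant',
    'address': '2F, CITIC City Plaza, Shennan Road',
    'phone': '0755-25942012'
}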
As for the list of Chinese cities with Pizza Hut restaurants, we can get it through the “Switch city” option on the page.
Before writing a crawler, I always do a simple analysis of the page first and then work out the crawling approach. Analyzing the structure of a page often yields some unexpected gains.
We did a brief analysis of the page structure using the browser’s developer tools.
The Response content of the StoreList page is relatively long. Before closing the page, let’s scroll down and see whether there is anything else useful. At the bottom, we find the code that calls the JavaScript function that fetches the restaurant list.
Let’s search for the GetStoreList function to see how the browser gets the restaurant list.
From the code, we can see that the page uses Ajax to fetch data: a POST request to http://www.pizzahut.com.cn/StoreList/Index, carrying the parameters pageIndex and pageSize.
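To confirm this, we can replay that Ajax call with the requests library. Here is a minimal sketch, sent without cookies, so the server answers for whichever city it locates us to:

import requests

# Replay the Ajax call made by GetStoreList
data = {
    'pageIndex': 1,    # page number of the restaurant list
    'pageSize': '50',  # restaurants per page, as used later in the crawler
}
response = requests.post('http://www.pizzahut.com.cn/StoreList/Index', data=data)
print(response.status_code)
print(response.text[:300])  # preview of the returned HTML fragment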
After this analysis of the page structure, we can work out the crawling approach. First, get the city list. Then, taking each city as a parameter, build an HTTP request to the Pizza Hut server to fetch all the restaurant data for that city.
To make the crawling easier, I have written all the cities to cities.txt in advance. When crawling, we read the city information back from the file.
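For reference, cities.txt simply stores one city name per line. The names below are illustrative:

shenzhen
guangzhou
beijing
shanghai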
The crawling idea seems right, but there is still a problem. Every time we open the official Pizza Hut website, the page automatically locates to our own city. If we cannot get around this city-location mechanism, we can only capture one city’s data.
So we go back to the home page to see whether we can find some useful information. Eventually, we find an iplocation field in the page’s cookies. After URL-decoding it, we get a value like shenzhen|0|0.
When I saw that value, it clicked: it turns out the Pizza Hut website sets the initial city based on our IP address. If we can forge the iplocation field, then we can switch to any city at will.
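As a quick check of this idea, here is a minimal sketch of building the forged iplocation value, assuming the server accepts any city name in the same city|0|0 format:

from urllib.parse import quote

city = 'beijing'
# The cookie stores "city|0|0", URL-encoded the same way the browser does
iplocation_value = quote(city + '|0|0')
print(iplocation_value)  # beijing%7C0%7C0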
The first step is to read the city information from the file.
def get_cities():
    """Read the city list from cities.txt."""
    cities = []
    file_name = 'cities.txt'
    with open(file_name, 'r', encoding='UTF-8-sig') as file:
        for line in file:
            city = line.replace('\n', '')
            cities.append(city)
    return cities
The second step is to iterate over the cities list and, taking each city as a parameter, construct the iplocation cookie field.
import time

# Walk through all the cities; get_stores() is defined in the next step
cities = get_cities()
results = {}
count = 1
for city in cities:
    restaurants = get_stores(city, count)
    results[city] = restaurants
    count += 1
    time.sleep(2)
Then we send POST requests to the Pizza Hut server with the forged cookies attached, and finally extract the restaurant data from the returned pages.
import time
import requests
from urllib.parse import quote
from lxml import etree

def get_stores(city, count):
    """Fetch the restaurant information for the given city."""
    session = requests.Session()
    # URL-encode "city|0|0" for the iplocation cookie
    city_urlencode = quote(city + '|0|0')
    # Cookie jar used to store the cookies from the home page
    cookies = requests.cookies.RequestsCookieJar()
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Host': 'www.pizzahut.com.cn',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
    }
    print('============ City No.', count, ':', city, '============')
    # Visit the home page first to obtain the AlteonP session cookie
    resp_from_index = session.get('http://www.pizzahut.com.cn/', headers=headers)
    cookies.set('AlteonP', resp_from_index.cookies['AlteonP'], domain='www.pizzahut.com.cn')
    # Forge the iplocation cookie so the server believes we are in `city`
    cookies.set('iplocation', city_urlencode, domain='www.pizzahut.com.cn')

    page = 1
    restaurants = []
    while True:
        data = {
            'pageIndex': page,
            'pageSize': '50',
        }
        response = session.post('http://www.pizzahut.com.cn/StoreList/Index',
                                headers=headers, data=data, cookies=cookies)
        html = etree.HTML(response.text)
        divs = html.xpath("//div[@class='re_RNew']")
        temp_items = []
        for div in divs:
            item = {}
            content = div.xpath('./@onclick')[0]
            # e.g. ClickStore('22.538912,114.09803|CITIC City Plaza|2F, CITIC
            # City Plaza, Shennan Road|0755-25942012','GZH519')
            # Keep only the first quoted argument
            content = content.split('(\'')[1].split(')')[0].split('\',\'')[0]
            if len(content.split('|')) == 4:
                item['coordinate'] = content.split('|')[0]
                item['restaurant_name'] = content.split('|')[1] + ' Restaurant'
                item['address'] = content.split('|')[2]
                item['phone'] = content.split('|')[3]
            else:
                item['restaurant_name'] = content.split('|')[0] + ' Restaurant'
                item['address'] = content.split('|')[1]
                item['phone'] = content.split('|')[2]
            print(item)
            temp_items.append(item)
        if not temp_items:
            break
        restaurants += temp_items
        page += 1
        time.sleep(5)
    return restaurants
The third step is to write each city, together with all of its restaurants, into a JSON file.
import json

with open('results.json', 'w', encoding='UTF-8') as file:
    file.write(json.dumps(results, indent=4, ensure_ascii=False))
When the program is finished running, a file named results.json is generated in the current directory.
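As a quick sanity check, and as a starting point for the data analysis mentioned at the beginning, we can load results.json back in and count the restaurants per city:

import json

with open('results.json', 'r', encoding='UTF-8') as file:
    results = json.load(file)

# Each city maps to a list of restaurant dicts
# ('coordinate', 'restaurant_name', 'address', 'phone')
for city, restaurants in results.items():
    print(city, len(restaurants))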
I have uploaded the complete code to the public account’s backend; readers who need it can get it by replying “Pizza Hut” there.
If you think this article is good, please like and share it. Your support is my biggest encouragement.