Preface
I originally wanted to write about Scrapy, but it is not that hard to pick up; you can learn most of it from the official documentation alone. Without a solid crawler foundation you will not get far with it anyway, and even installing Scrapy trips up a lot of people because of some historical baggage. It is, after all, an old framework dating back to the Python 2 era. There is another reason, too: if I were going to build a serious crawler with Scrapy, it would be a distributed one, but what I am building here is just a single client, a simple spider, so Scrapy is overkill.
The target
Today we are going to fetch the weather using the API of the China Weather network:
BaseUrl = "http://wthrcdn.etouch.cn/weather_mini?city={}"
There are plenty of weather crawlers online, including ones that scrape the China Weather network pages directly, but I just don't understand why you would load the web page and then pick the data out with XPath or regular expressions. The data clearly comes from this same API. Why reverse-parse the rendered results when you can just take the data directly?
The request format
Back to our interface: it is a GET request, you simply put the city name (or city code) in the city field, and the result is JSON. Here is what it looks like converted into a Python dictionary:
{'data':
    {'yesterday':
        {'date': 'Saturday the 5th', 'high': 'high temperature 16 ℃', 'fx': 'Northeast wind',
         'low': 'low temperature 9 ℃', 'fl': '…', 'type': 'cloudy'},
     'city': 'jiujiang',
     'forecast': [
        {'date': 'Sunday the 6th', 'high': 'high temperature 12 ℃', 'fengli': '…',
         'low': 'low temperature 7 ℃', 'fengxiang': 'Northeast wind', 'type': 'rain'},
        {'date': 'Monday the 7th', 'high': 'high temperature 14 ℃', 'fengli': '…',
         'low': 'low temperature 7 ℃', 'fengxiang': 'North wind', 'type': 'cloudy'},
        {'date': 'Tuesday the 8th', 'high': 'high temperature 19 ℃', 'fengli': '…',
         'low': 'low temperature 8 ℃', 'fengxiang': 'Southeast wind', 'type': 'or'},
        {'date': 'Wednesday the 9th', 'high': 'high temperature 21 ℃', 'fengli': '…',
         'low': 'low temperature 11 ℃', 'fengxiang': 'Southeast wind', 'type': 'or'},
        {'date': 'Thursday the 10th', 'high': 'high temperature 23 ℃', 'fengli': '…',
         'low': 'low temperature 11 ℃', 'fengxiang': 'South wind', 'type': 'cloudy'}],
     'ganmao': 'Common cold season; reduce outings, stay hydrated, and dress appropriately.',
     'wendu': '8'},
 'status': 1000, 'desc': 'OK'}
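For reference, here is a minimal sketch (mine, not from the original post) of making that GET request with requests; it assumes the API accepts a city name in the city query parameter, as described above.

import requests

BaseUrl = "http://wthrcdn.etouch.cn/weather_mini?city={}"

# one GET request; .json() parses the response body into the dictionary shown above
resp = requests.get(BaseUrl.format("Jiujiang"), timeout=3)
data = resp.json()

print(data.get("desc"))                   # 'OK' when status is 1000
print(data.get("data", {}).get("wendu"))  # current temperature ('wendu' in the sample above)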
Request limits
It has to be said that this China Weather network interface is completely unthrottled (YYDS). Why does that matter? What I need to do is fetch weather information for the whole country, down to the county seats, and there are thousands of county seats in China; I also have to fetch at several points in the day, so that is at least 20,000 requests per day to start with. If there were limits we would have to fight anti-crawling countermeasures, but by my testing, it is fine.
Non-async crawling with requests
So let's do a comparison; no comparison, no harm, right? Because it's so simple, I'll go straight to the code.
import time

import requests
from datetime import datetime


class GetWeather(object):
    urlWheather = "http://wthrcdn.etouch.cn/weather_mini?city={}"
    requests = requests
    error = {}
    today = datetime.today().day
    weekday = datetime.today().weekday()
    week = {0: "Monday", 1: "Tuesday", 2: "Wednesday", 3: "Thursday",
            4: "Friday", 5: "Saturday", 6: "Sunday"}

    def __getday(self) -> str:
        # build a date string matching the format of the API's forecast "date" field
        day = str(self.today) + "Day" + self.week.get(self.weekday)
        return day

    def get_today_wheather(self, city: str) -> dict:
        data = self.getweather(city)
        data = data.get("data").get("forecast")
        today = self.__getday()
        for today_w in data:
            if today_w.get("date") == today:
                return today_w

    def getweather(self, city: str, timeout: int = 3) -> dict:
        url = self.urlWheather.format(city)
        try:
            resp = self.requests.get(url, timeout=timeout)
            jsondata = resp.json()
            return jsondata
        except Exception:
            self.error['error'] = "Weather acquisition anomaly"
            return self.error

    def getweathers(self, citys: list, timeout: int = 3):
        wheathers_data = {}
        for city in citys:
            url = self.urlWheather.format(city)
            try:
                resp = self.requests.get(url=url, timeout=timeout)
                wheather_data = resp.json()
                wheathers_data[city] = wheather_data
            except Exception:
                self.error['error'] = "Weather acquisition anomaly"
                return self.error
        return wheathers_data


if __name__ == '__main__':
    getwheather = GetWeather()
    start = time.time()
    times = 0
    for i in range(5000):
        data = getwheather.get_today_wheather("Jiujiang")
        times += 1
        if times % 100 == 0:
            print(data, "visit number", times)
    print("Visited", times, "times in", time.time() - start, "seconds")
This code is just a simple wrapper. Let's see how long 5,000 requests took; here I queried the same city, Jiujiang, 5,000 times.
Asynchronous access
I didn’t wrap this code, so it looks messy. There are a couple of caveats here
Ceiling system
And because of that, it’s kind of an underlying operating system that you’re using asynchronously, there’s a limit to the concurrency, because the coroutine asynchronously has to be switched. It looks a bit like Python’s own multithreading, except that this “multithreading” switches only when I/O is used, or else it doesn’t switch. So yo, limit it
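To make that concrete, here is a tiny self-contained sketch (not from the original post) showing both behaviours: a coroutine yields to the event loop only at an await point, and a semaphore caps how many coroutines can be inside a block at once.

import asyncio

async def worker(n: int, semaphore: asyncio.Semaphore):
    async with semaphore:          # at most 2 workers inside this block at a time
        print(f"worker {n} start")
        await asyncio.sleep(0.1)   # the event loop switches tasks only at awaits like this
        print(f"worker {n} done")

async def main():
    semaphore = asyncio.Semaphore(2)
    await asyncio.gather(*(worker(i, semaphore) for i in range(5)))

asyncio.run(main())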
Coding
import time
import asyncio
from datetime import datetime

import aiohttp

BaseUrl = "http://wthrcdn.etouch.cn/weather_mini?city={}"
WeekIndex = {0: "Monday", 1: "Tuesday", 2: "Wednesday", 3: "Thursday",
             4: "Friday", 5: "Saturday", 6: "Sunday"}
today = datetime.today().day
day = str(today) + "Day" + WeekIndex.get(datetime.today().weekday())
TIMES = 0


async def request(city: str, semaphore: asyncio.Semaphore, timeout: int = 3):
    url = BaseUrl.format(city)
    try:
        async with semaphore:  # cap how many requests are in flight at once
            async with aiohttp.request("GET", url) as resp:
                # content_type=None skips aiohttp's JSON content-type check,
                # since this API does not declare application/json
                data = await resp.json(content_type=None)
                return data
    except Exception as e:
        raise e


def getwheater(task):
    data = task.result()
    return data


def get_today_weather(task):
    global TIMES
    data = task.result()  # get the result of the finished task
    data = data.get("data").get("forecast")
    for today_w in data:
        if today_w.get("date") == day:
            TIMES += 1  # still atomic here: coroutines only switch at await points
            if TIMES % 100 == 0:
                print(today_w, "visit number", TIMES)
            return today_w


if __name__ == '__main__':
    # the OS caps how much a single loop can watch at a time:
    # roughly 509 concurrent on Windows, 1024 on Linux
    semaphore = asyncio.Semaphore(500)
    start = time.time()
    tasks = []
    for i in range(5000):
        c = request("Jiujiang", semaphore, 3)
        task = asyncio.ensure_future(c)
        task.add_done_callback(get_today_weather)
        tasks.append(task)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))
    print("Took", time.time() - start, "seconds")
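As an aside, aiohttp.request() builds a throwaway session for each call; the aiohttp documentation recommends reusing a single ClientSession across many requests. Here is a hedged sketch (my variation, not the original author's code) of what the request coroutine above could look like with a shared session:

import asyncio
import aiohttp

BaseUrl = "http://wthrcdn.etouch.cn/weather_mini?city={}"

async def request(session: aiohttp.ClientSession, city: str,
                  semaphore: asyncio.Semaphore) -> dict:
    async with semaphore:
        async with session.get(BaseUrl.format(city)) as resp:
            # content_type=None: skip the JSON content-type check, as above
            return await resp.json(content_type=None)

async def main():
    semaphore = asyncio.Semaphore(500)
    async with aiohttp.ClientSession() as session:
        tasks = [request(session, "Jiujiang", semaphore) for _ in range(5000)]
        results = await asyncio.gather(*tasks)
    print(len(results), "responses")

asyncio.run(main())

A shared session reuses the underlying connection pool instead of reconnecting every time, which matters when you are firing off 5,000 requests.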