Written up front
Today's target is https://500px.me/, a photography community. The obvious thing to crawl in a photography community is the photo data, but that didn't seem very interesting, and it suddenly felt like crawling the photographers on this site would be more fun. Hence this article.
With that goal in mind, a promising page turns up: https://500px.me/community/search/user
After a closer look, though, this page can't give us many users: after scrolling down for a while it simply stops loading more. Very annoying. Should I stop there? Obviously not. After some effort (roughly one wasted minute) I found the breakthrough point: open any user's personal page, click on any user's avatar, and the following shows up.
On the user's profile page there is a followee list. Nice, this is exactly what we need. Open F12 and analyze it.
Tap, tap, tap. The URL is https://500px.me/community/res/relation/4f7fe110d4e0b8a1fae0632b2358c8898/follow?startTime=&page=1&size=10&type=json
The parameters are as follows; testing shows that size can be pushed up to 100.
https://500px.me/community/res/relation/{user ID}/follow?startTime=&page={page number}&size={records per page}&type=json
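Before writing anything serious, a quick throwaway check that the endpoint really answers with JSON. This is only a sketch: the user ID is the one from the URL above, size is bumped to 100 as found by testing, and the headers are the minimal set that turns out to be enough (see the HEADERS dict later on).

import requests

# Throwaway check of the followee endpoint (a sketch; the user ID comes from
# the URL captured above, and size=100 is the maximum the API accepts).
url = ("https://500px.me/community/res/relation/"
       "4f7fe110d4e0b8a1fae0632b2358c8898/follow"
       "?startTime=&page=1&size=100&type=json")
headers = {"Accept": "application/json", "X-Requested-With": "XMLHttpRequest"}

resp = requests.get(url, headers=headers, timeout=3)
print(resp.status_code)   # expect 200
print(len(resp.json()))   # up to 100 followee records per page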
So all we have to do is:
- Get the total number of followees.
- Divide that total by 100 and loop over the pages to collect every followee (why followees rather than followers? Because the accounts a photographer chooses to follow are generally the more valuable ones).

Once the goal is clear, we can start writing code; the paging math for the second step is sketched right below.
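A quick worked example of that paging math, with a made-up followee count (the producer class later on uses the same `totle // 100 + 2` upper bound):

totle = 1234                    # made-up followee count
size = 100                      # records per page, the maximum the API accepts

# 1234 followees at 100 per page -> pages 1 through 13 (the last page is partial)
for page in range(1, totle // size + 2):
    print(page)                 # prints 1, 2, ..., 13

When the count is an exact multiple of 100 this asks for one extra, empty page, which is harmless here.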
Writing the code
Basic stuff: make the network request, then parse the response to get the total number of followees.
The seed user has the id 5769e51a04209a9b9b6a8c1e656ff9566. I picked one at random; any user with a followee list will do. Now import the modules. This post uses Redis and MongoDB, so I suggest you get the basics of both ready ahead of time, otherwise the rest will feel heavy going.
import requests
import threading
from redis import StrictRedis
import pymongo
######### mongo section #########################
DATABASE_IP = '127.0.0.1'
DATABASE_PORT = 27017
DATABASE_NAME = 'sun'
client = pymongo.MongoClient(DATABASE_IP, DATABASE_PORT)
db = client.sun
db.authenticate("dba", "dba")
collection = db.px500  # collection we will insert data into
######### mongo section #########################
######### redis section #########################
redis = StrictRedis(host="localhost", port=6379, db=1, decode_responses=True)
######### redis section #########################
######### global parameters section #########################
START_URL = "https://500px.me/community/v2/user/indexInfo?queriedUserId={}"  # entry link
COMMENT = "https://500px.me/community/res/relation/{}/follow?startTime=&page={}&size=100&type=json"
HEADERS = {
    "Accept": "application/json",
    "User-Agent": "just pick any working browser UA string",
    "X-Requested-With": "XMLHttpRequest"
}
need_crawlids = []  # user IDs waiting to be crawled
lock = threading.Lock()  # thread lock
######### global parameters section #########################
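Before going further, it doesn't hurt to confirm that both stores configured above actually answer. A quick sanity-check sketch, reusing the `redis` and `db` objects just defined and assuming local default installs:

# Sanity check (a sketch, not part of the crawler itself): make sure both
# stores respond before starting any threads.
assert redis.ping()        # redis-py returns True when the server answers
print(db.command("ping"))  # pymongo returns {'ok': 1.0} when MongoDB is reachable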
def get_followee():
    try:
        res = requests.get(START_URL.format("5769e51a04209a9b9b6a8c1e656ff9566"),
                           headers=HEADERS, timeout=3)
        data = res.json()
        if data:
            totle = int(data["data"]["userFolloweeCount"])  # total followee count
            userid = data["data"]["id"]  # the user's ID
            return {
                "userid": userid,
                "totle": totle
            }  # return both values
    except Exception as e:
        print("Data acquisition error")
        print(e)


if __name__ == '__main__':
    start = get_followee()  # get the entry point
    need_crawlids.append(start)
A key point in the logic above: the seed request has to give us both the [followee count] and the [user ID], because those two values are what get joined into the URL https://500px.me/community/res/relation/{}/follow?startTime=&page={}&size=100&type=json. As you already know, the first placeholder is the user ID and the second is the page number. If the arithmetic doesn't click right away, work it out on paper ~
With the seed user's followee list in hand we can keep crawling outward, so let's flesh out the producer code. The key lines are annotated.
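To make that concrete, joining the two values into a request URL is a single `format` call on the COMMENT template from the globals section. The values below are just examples:

userid = "5769e51a04209a9b9b6a8c1e656ff9566"  # the seed user from earlier
page = 3                                      # an arbitrary page number

url = COMMENT.format(userid, page)
print(url)
# https://500px.me/community/res/relation/5769e51a04209a9b9b6a8c1e656ff9566/follow?startTime=&page=3&size=100&type=json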
The idea is as follows:
- Run an infinite loop: take one user from the `need_crawlids` variable, then fetch that user's followee list.
- Write everything we crawl into `redis`, which makes duplicate checking easy and storage fast.
class Product(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self._headers = HEADERS

    def get_follows(self, userid, page):
        try:
            res = requests.get(COMMENT.format(userid, page), headers=HEADERS, timeout=3)
            data = res.json()
            if data:
                for item in data:
                    yield {
                        "userid": item["id"],
                        "totle": item["userFolloweeCount"]
                    }
        except Exception as e:
            print("Error message")
            print(e)
            yield from self.get_follows(userid, page)  # retry the same page after an error

    def run(self):
        while 1:
            global need_crawlids  # the global list of users waiting to be crawled
            if lock.acquire():
                if len(need_crawlids) == 0:  # nothing to take yet, release and retry
                    lock.release()
                    continue
                data = need_crawlids[0]  # take the first one
                del need_crawlids[0]  # delete it once taken
                lock.release()
            if data["totle"] == 0:
                continue
            for page in range(1, data["totle"] // 100 + 2):
                for i in self.get_follows(data["userid"], page):
                    if lock.acquire():
                        need_crawlids.append(i)  # newly found users join the waiting list
                        lock.release()
                    self.save_redis(i)  # store in redis

    def save_redis(self, data):
        redis.setnx(data["userid"], data["totle"])
        # print(data, "inserted successfully")
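A side note on the design: the shared list plus an explicit `threading.Lock` works, but it is the easiest place to slip up (forgetting a `release`, or busy-looping while the list is empty). The standard library's `queue.Queue` does the locking and the waiting for you. A minimal sketch of that alternative, not what this script uses:

import queue
import threading

# Hypothetical drop-in for the need_crawlids list + lock pair.
task_q = queue.Queue()
task_q.put({"userid": "5769e51a04209a9b9b6a8c1e656ff9566", "totle": 0})

def worker():
    while True:
        data = task_q.get()  # blocks until an item is available, no busy loop
        # ... fetch data["userid"]'s followee pages here, task_q.put() new users ...
        task_q.task_done()

threading.Thread(target=worker, daemon=True).start()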
Because 500px has no anti-crawling measures, the crawler runs very fast; after a short while it had collected a large amount of data, roughly 40,000 users by my eyeball estimate. Since this is a tutorial, I stopped it there.
This data shouldn't just sprawl in redis; we want to use it to fetch each user's full profile. The user-info endpoint is actually one we already used above: https://500px.me/community/v2/user/indexInfo?queriedUserId={}.
class Consumer(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        while 1:
            key = redis.randomkey()  # grab a random key
            if key:
                redis.delete(key)  # delete the key once taken
                self.get_info(key)

    def get_info(self, key):
        try:
            res = requests.get(START_URL.format(key), headers=HEADERS, timeout=3)
            data = res.json()
            if data['status'] == "200":
                collection.insert(data["data"])  # insert into mongodb
        except Exception as e:
            print(e)
            return


if __name__ == '__main__':
    start = get_followee()  # get the entry point
    need_crawlids.append(start)

    p = Product()
    p.start()

    for i in range(1, 5):
        c = Consumer()
        c.start()
There is nothing in the code that needs special attention; it is about as simple as it gets, and only a small slice of Redis is used:
Redis.randomkey () # Delete key() # Delete keyCopy the code
And that's it... After a few minutes of waiting, a good pile of user information had landed in my local database.
Leave a comment if you want the complete code and I'll send it over.
Written at the end
emmmmmm... I write a blog post on CSDN every day, so tomorrow let's crawl CSDN blogs ~~~