Let's go straight to the concrete steps and the points to note:

Notes on crawling Instagram

  • Instagram's home page is rendered on the server, so the first 11 or 12 posts shown on it are embedded in the HTML as a JSON structure (additionalData); subsequent posts are loaded by Ajax requests
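
As a minimal sketch of reading that embedded JSON, assuming the page assigns it to `window._sharedData` inside a `<script>` tag (the name used by the code later in this article). The `extract_shared_data` helper and the sample HTML are made up for illustration:

```python
import json
import re

# Assumed page layout: <script>window._sharedData = {...};</script>
SHARED_DATA_RE = re.compile(r"window\._sharedData\s*=\s*(\{.*?\});\s*</script>", re.S)


def extract_shared_data(html):
    """Return the embedded sharedData JSON as a dict, or None if it is absent."""
    match = SHARED_DATA_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


# Hypothetical, heavily trimmed page used only to exercise the helper
sample_html = (
    '<html><body><script type="text/javascript">'
    'window._sharedData = {"country_code": "US", "entry_data": {}};'
    '</script></body></html>'
)

shared = extract_shared_data(sample_html)
print(shared["country_code"])
```

Returning None instead of raising lets the caller decide what a missing structure means, for example a login wall or a changed page layout.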

  • Before 2019/06, Instagram had an anti-crawling measure: every request needed an 'X-Instagram-GIS' field in the request header. The algorithm was as follows:

    1. Concatenate rhx_gis and queryVariables, joined by ":". rhx_gis is available in the sharedData JSON structure on the home page.

    2. Take the MD5 hash of the result.

        import hashlib

        # user_id, cursor, headers and GIS_rhx_gis (the rhx_gis value from sharedData) are defined elsewhere
        queryVariables = '{"id":"' + user_id + '","first":12,"after":"' + cursor + '"}'
        print(queryVariables)
        headers['X-Instagram-GIS'] = hashlib.md5((GIS_rhx_gis + ":" + queryVariables).encode('utf-8')).hexdigest()
  • However, since 2019/06 Instagram has removed the X-Instagram-GIS validation, so there is no longer any need to generate X-Instagram-GIS. The preceding content can be read as history

  • Some cookies are set when the Instagram home page is first accessed. The response headers look like this:

        set-cookie: rur=PRN; Domain=.instagram.com; HttpOnly; Path=/; Secure
        set-cookie: ds_user_id=11859524403; Domain=.instagram.com; expires=Mon, 15-Jul-2019 09:22:48 GMT; Max-Age=7776000; Path=/; Secure
        set-cookie: urlgen="{\"45.63.123.251\": 20473}:1hgkii:7bh3meau4gmvhrzwrtvtjs9hj2q"; Domain=.instagram.com; HttpOnly; Path=/; Secure
        set-cookie: csrftoken=Or4nQ1T3xidf6CYyTE7vueF46B73JmAd; Domain=.instagram.com; expires=Tue, 14-Apr-2020 09:22:48 GMT; Max-Age=31449600; Path=/; Secure
  • As for query_hash, you don't need to worry about it; you can simply hard-code it

  • Pay special attention: every request must carry a custom header that includes user-agent; only then can rhx_gis be used for signed access and the data be retrieved. Remember, this applies to every request! For example:

    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
  • Most APIs require a sessionid in the cookie of the request header before they return data. A normal request header looks like this:

    :authority: www.instagram.com
    :method: GET
    :path: /graphql/query/?query_hash=ae21d996d1918b725a934c0ed7f59a74&variables=%7B%22fetch_media_count%22%3A0%2C%22fetch_suggested_count%22%3A30%2C%22ignore_cache%22%3Atrue%2C%22filter_followed_friends%22%3Atrue%2C%22seen_ids%22%3A%5B%5D%2C%22include_reel%22%3Atrue%7D
    :scheme: https
    accept: */*
    accept-encoding: gzip, deflate, br
    accept-language: zh-CN,zh;q=0.9,en;q=0.8,la;q=0.7
    cache-control: no-cache
    cookie: mid=XI-joQAEAAHpP4H2WkiI0kcY3sxg; csrftoken=Or4nQ1T3xidf6CYyTE7vueF46B73JmAd; ds_user_id=11859524403; sessionid=11859524403%3Al965tcIRCjXmVp%3A25; rur=PRN; urlgen="{\"45.63.123.251\": 20473}:1hgkij:JvyKtYz_nHgBsLZnKrbSq0FEfeg"
    pragma: no-cache
    referer: https://www.instagram.com/
    user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36
    x-ig-app-id: 936619743392459
    x-instagram-gis: 8f382d24b07524ad90b4f5ed5d6fccdb
    x-requested-with: XMLHttpRequest

    Note the user-agent, x-ig-app-id (obtained from sharedData in the HTML), x-instagram-gis, and the sessionid in the cookie
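
To see what such a request actually sends, the `:path` value can be decoded back into its query_hash and variables. This is a small sketch using only the standard library; the `decode_graphql_path` helper name is made up for illustration, and the path is the one from the header shown above:

```python
import json
from urllib.parse import parse_qs, urlsplit

# :path value taken from the request header shown above
path = (
    "/graphql/query/?query_hash=ae21d996d1918b725a934c0ed7f59a74"
    "&variables=%7B%22fetch_media_count%22%3A0%2C%22fetch_suggested_count%22%3A30"
    "%2C%22ignore_cache%22%3Atrue%2C%22filter_followed_friends%22%3Atrue"
    "%2C%22seen_ids%22%3A%5B%5D%2C%22include_reel%22%3Atrue%7D"
)


def decode_graphql_path(path):
    """Split a /graphql/query/ path into its query_hash and decoded variables dict."""
    params = parse_qs(urlsplit(path).query)  # parse_qs also undoes the percent-encoding
    return params["query_hash"][0], json.loads(params["variables"][0])


query_hash, variables = decode_graphql_path(path)
print(query_hash)
print(variables)
```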

  • Pagination of an API (requesting the next page of data), such as the paginated Ajax request for a user's post list on Instagram, typically takes parameters like the following:

    query_hash: a5164aed103f24b03e7b7747a2d94e3c
    variables: {"id":"1664922478","first":12,"after":"AQBJ8AGqCb5c9rO-dl2Z8ojZW12jrFbYZHxJKC1hP-nJKLtedNJ6VHzKAZtAd0oeUfgJqw8DmusHbQTa5DcoqQ5E3urx0BH9NkqZFePTP1Ie7A"}

    - id is the user ID, which can be obtained from sharedData in the HTML
    - first is the number of records to fetch per request, apparently at most 50
    - after is the paging cursor, which records the position the fetch has reached

    Of course, the parameters in variables may differ (and there may be more of them) depending on the API being requested; only those related to paging are listed here.

    The paging request parameters are first obtained from sharedData in HTML:

        # paging information
        page_info = js_data["entry_data"]["ProfilePage"][0]["graphql"]["user"]["edge_owner_to_timeline_media"]["page_info"]
        # cursor for the next page, e.g. AQCSnXw1JsoV6LPOD2Of6qQUY7HWyXRc_CBSMWB6WvKlseCibkho3em0peg7_ep8vwoxw5zwzsav_mnmr8yx2ugfz5j6yxdyoffdbhc6942w
        cursor = page_info['end_cursor']
        # whether there is a next page
        flag = page_info['has_next_page']

    end_cursor is the value to use for after, and has_next_page tells you whether there is a next page. If there is one, send the first paging request with after set to end_cursor. For each later request, take end_cursor from the page_info of the response data as the new after, build variables, and send them together with query_hash; the has_next_page value in the response's page_info tells you whether to continue. Keep looping until has_next_page is false and you have all the data. If you don't want to fetch everything, the count value in edge_owner_to_timeline_media in the response data tells you how many media the user has in total, so you can decide where to stop
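
The loop described above can be sketched roughly as follows. This is an illustration, not a definitive implementation: `build_page_url`, `iter_media`, `fake_fetch` and `FAKE_PAGES` are made-up names, the HTTP call is passed in as a function so the paging logic can run offline, and the canned pages only mimic the shape of the response used elsewhere in this article.

```python
import json
from urllib.parse import parse_qs, quote, urlsplit

# query_hash value copied from the paging parameters shown above
QUERY_HASH = "a5164aed103f24b03e7b7747a2d94e3c"
QUERY_URL = "https://www.instagram.com/graphql/query/?query_hash=%s&variables=%s"


def build_page_url(user_id, after, first=12):
    """Build the paging URL; variables is compact JSON, then URL-encoded."""
    variables = json.dumps(
        {"id": str(user_id), "first": first, "after": after},
        separators=(",", ":"),
    )
    return QUERY_URL % (QUERY_HASH, quote(variables))


def iter_media(user_id, first_cursor, fetch_json, limit=None):
    """Yield media nodes until has_next_page is false or limit is reached.

    first_cursor is the end_cursor read from sharedData; fetch_json is a
    caller-supplied function (hypothetical here) mapping a URL to a dict.
    """
    after, seen = first_cursor, 0
    while True:
        data = fetch_json(build_page_url(user_id, after))
        timeline = data["data"]["user"]["edge_owner_to_timeline_media"]
        for edge in timeline["edges"]:
            if limit is not None and seen >= limit:
                return
            seen += 1
            yield edge["node"]
        page_info = timeline["page_info"]
        if not page_info["has_next_page"] or (limit is not None and seen >= limit):
            return
        after = page_info["end_cursor"]


# Canned pages keyed by cursor: (nodes, has_next_page, end_cursor)
FAKE_PAGES = {
    "c1": ([{"id": "1"}, {"id": "2"}], True, "c2"),
    "c2": ([{"id": "3"}], False, None),
}


def fake_fetch(url):
    """Stand-in for a real HTTP call: answers from FAKE_PAGES by cursor."""
    variables = json.loads(parse_qs(urlsplit(url).query)["variables"][0])
    nodes, has_next, end_cursor = FAKE_PAGES[variables["after"]]
    return {"data": {"user": {"edge_owner_to_timeline_media": {
        "edges": [{"node": node} for node in nodes],
        "page_info": {"has_next_page": has_next, "end_cursor": end_cursor},
    }}}}


print([node["id"] for node in iter_media("1664922478", "c1", fake_fetch)])
```

With a real client, `fake_fetch` would be replaced by a function that sends the request with the headers and cookie described in these notes and returns the decoded JSON.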

  • Video posts and image posts have different data structures; check the is_video field in the response data
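
A minimal sketch of branching on that field. The `pick_media_url` helper and the sample nodes are hypothetical; the `display_url` and `video_url` field names are the ones the code later in this article reads, and treating `video_url` as possibly missing is an assumption:

```python
def pick_media_url(node):
    """Return (kind, url) for a post node, branching on is_video."""
    if node.get("is_video"):
        # Assumption: video nodes carry video_url; fall back to the cover image if not
        return "video", node.get("video_url") or node["display_url"]
    return "image", node["display_url"]


# Hypothetical, trimmed nodes for illustration
image_node = {"is_video": False, "display_url": "https://example.com/a.jpg"}
video_node = {
    "is_video": True,
    "display_url": "https://example.com/b.jpg",
    "video_url": "https://example.com/b.mp4",
}

print(pick_media_url(image_node))
print(pick_media_url(video_node))
```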

  • As long as the cookie in the request header carries a valid, unexpired sessionid, you can access the interface directly without computing a signature. The most direct way: open a browser, log in to Instagram, press F12 to inspect the XHR requests, copy the cookie from a request header, and use it:

    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
        'cookie': 'mid=XLaW9QAEAAH0WaPDCeY490qeeNlA; csrftoken=IgcP8rj0Ish5e9uHNXhVEsTId22tw8VE; ds_user_id=11859524403; sessionid=11859524403%3A74mdddCfCqXS7I%3A15; rur=PRN; urlgen="{\"45.63.123.251\": 20473}:1hgxr6:Phc4hR68jNts4Ig9FbrZRglG4YA"'
    }

    Send a request header like the one above with every request
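
Since everything here depends on the copied cookie actually containing a sessionid, a quick check before sending anything can save a confusing failure later. This is a rough sketch: `parse_cookie_header` and `build_headers` are made-up helper names, the cookie values are placeholders, and the check cannot tell whether the session has expired.

```python
def parse_cookie_header(cookie_header):
    """Turn a 'k1=v1; k2=v2' cookie header string into a dict."""
    cookies = {}
    for pair in cookie_header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


def build_headers(user_agent, cookie_header):
    """Build request headers, refusing a cookie that lacks sessionid."""
    if "sessionid" not in parse_cookie_header(cookie_header):
        raise ValueError("cookie has no sessionid; log in and copy it again")
    return {"user-agent": user_agent, "cookie": cookie_header}


# Placeholder values, not real credentials
cookie = "csrftoken=PLACEHOLDER; ds_user_id=123; sessionid=123%3APLACEHOLDER%3A15"
headers = build_headers("Mozilla/5.0", cookie)
print(sorted(parse_cookie_header(cookie)))
```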

  • Errors are logged to the ins_error_log table of the zk_flock database on 192.168.1.57. At the moment there are many "unknown SSL protocol" errors. I suspect the crawl is too fast, and that switching between proxies is needed
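
If the suspicion above is right, one possible mitigation is to retry a failed request through a different proxy after a short pause. The sketch below is only an illustration under that assumption: `fetch_with_rotation` and `flaky_fetch` are made-up names, the second proxy address is hypothetical, and the HTTP call is injected so the retry logic can be run without a network.

```python
import itertools
import time


def fetch_with_rotation(url, fetch, proxy_pool, max_attempts=3, delay=1.0):
    """Call fetch(url, proxy), switching proxy and pausing after each failure.

    fetch is a caller-supplied function (hypothetical here); with requests it
    could wrap requests.get(url, proxies=proxy, headers=...).
    """
    rotation = itertools.cycle(proxy_pool)
    last_error = None
    for attempt in range(max_attempts):
        proxy = next(rotation)
        try:
            return fetch(url, proxy)
        except OSError as error:  # requests' exceptions, SSLError included, derive from OSError
            last_error = error
            time.sleep(delay * (attempt + 1))  # wait a little longer each time
    raise last_error


calls = []


def flaky_fetch(url, proxy):
    """Stand-in that fails on the first call and succeeds on the second."""
    calls.append(proxy["https"])
    if len(calls) < 2:
        raise OSError("unknown SSL protocol")
    return "ok"


# The first address is the local proxy used in this article; the second is hypothetical
pool = [{"https": "http://127.0.0.1:1087"}, {"https": "http://127.0.0.1:1088"}]
print(fetch_with_rotation("https://www.instagram.com/", flaky_fetch, pool, delay=0))
```

Lowering the request rate itself, for example by sleeping between pages, is the simpler first thing to try; proxy rotation only helps if the errors are tied to one address.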

Want code that works? Here it is (it sets a proxy for getting past the firewall; remove it if you don't need one):

# -*- coding:utf-8 -*-
import requests
import re
import json
import urllib.parse
import hashlib
import sys

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'

BASE_URL = 'https://www.instagram.com'
ACCOUNT_MEDIAS = "https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables=%s"
ACCOUNT_PAGE = 'https://www.instagram.com/%s'

proxies = {
    'http': 'http://127.0.0.1:1087',
    'https': 'http://127.0.0.1:1087',
}

# Proxies can also be set once on a session, so they need not be passed on every requests call
# s = requests.session()
# s.proxies = {'http': '121.193.143.249:80'}

def get_shared_data(html=''):
    """get window._sharedData from page, return the dict loaded by window._sharedData str"""
    if html:
        target_text = html
    else:
        header = generate_header()
        response = requests.get(BASE_URL, proxies=proxies, headers=header)
        target_text = response.text
    # the page data is embedded as: window._sharedData = {...};</script>
    regx = r"_sharedData\s*=\s*(.*?);\s*</script>"
    match_result = re.search(regx, target_text, re.S)
    if match_result is None:
        raise ValueError('window._sharedData not found in page')
    data = json.loads(match_result.group(1))

    return data

# def get_rhx_gis():
#     """get the rhx_gis value from sharedData"""
#     share_data = get_shared_data()
#     return share_data['rhx_gis']

def get_account(user_name):
    """get the account info by username :param user_name: :return: """
    url = get_account_link(user_name)
    header = generate_header()
    response = requests.get(url, headers=header, proxies=proxies)
    data = get_shared_data(response.text)
    account = resolve_account_data(data)
    return account

def get_media_by_user_id(user_id, count=50, max_id=''):
    """get media info by user id :param user_id: :param count: :param max_id: :return: """
    index = 0
    medias = []
    has_next_page = True
    while index <= count and has_next_page:
        varibles = json.dumps({
            'id': str(user_id),
            'first': count,
            'after': str(max_id)
        }, separators=(',', ':'))  # compact JSON: no spaces after ',' or ':'
        url = get_account_media_link(varibles)
        header = generate_header()
        response = requests.get(url, headers=header, proxies=proxies)

        media_json_data = json.loads(response.text)
        media_raw_data = media_json_data['data']['user']['edge_owner_to_timeline_media']['edges']

        if not media_raw_data:
            return medias

        for item in media_raw_data:
            if index == count:
                return medias
            index += 1
            medias.append(general_resolve_media(item['node']))
        max_id = media_json_data['data']['user']['edge_owner_to_timeline_media']['page_info']['end_cursor']
        has_next_page = media_json_data['data']['user']['edge_owner_to_timeline_media']['page_info']['has_next_page']
    return medias

def get_media_by_url(media_url):
    response = requests.get(get_media_url(media_url), proxies=proxies, headers=generate_header())
    media_json = json.loads(response.text)
    return general_resolve_media(media_json['graphql']['shortcode_media'])

def get_account_media_link(varibles):
    return ACCOUNT_MEDIAS % urllib.parse.quote(varibles)

def get_account_link(user_name):
    return ACCOUNT_PAGE % user_name

def get_media_url(media_url):
    return media_url.rstrip('/') + '/?__a=1'

# def generate_instagram_gis(varibles):
#     rhx_gis = get_rhx_gis()
#     gis_token = rhx_gis + ':' + varibles
#     x_instagram_token = hashlib.md5(gis_token.encode('utf-8')).hexdigest()
#     return x_instagram_token

def generate_header(gis_token=''):
    # todo: if have session, add the session key:value to header
    header = {
        'user-agent': USER_AGENT,
    }
    if gis_token:
        header['x-instagram-gis'] = gis_token
    return header

def general_resolve_media(media):
    # posts without a caption have an empty edges list
    caption_edges = media['edge_media_to_caption']['edges']
    res = {
        'id': media['id'],
        'type': media['__typename'][5:].lower(),
        'content': caption_edges[0]['node']['text'] if caption_edges else '',
        'title': media.get('title') or '',
        'shortcode': media['shortcode'],
        'preview_url': BASE_URL + '/p/' + media['shortcode'],
        'comments_count': media['edge_media_to_comment']['count'],
        'likes_count': media['edge_media_preview_like']['count'],
        'dimensions': media.get('dimensions') or {},
        'display_url': media['display_url'],
        'owner_id': media['owner']['id'],
        'thumbnail_src': media.get('thumbnail_src') or '',
        'is_video': media['is_video'],
        'video_url': media.get('video_url') or '',
    }
    return res

def resolve_account_data(account_data):
    # all profile fields live under the same user node
    user = account_data['entry_data']['ProfilePage'][0]['graphql']['user']
    account = {
        'country': account_data['country_code'],
        'language': account_data['language_code'],
        'biography': user['biography'],
        'followers_count': user['edge_followed_by']['count'],
        'follow_count': user['edge_follow']['count'],
        'full_name': user['full_name'],
        'id': user['id'],
        'is_private': user['is_private'],
        'is_verified': user['is_verified'],
        'profile_pic_url': user['profile_pic_url_hd'],
        'username': user['username'],
    }
    return account

account = get_account('shaq')

result = get_media_by_user_id(account['id'], 56)

media = get_media_by_url('https://www.instagram.com/p/Bw3-Q2XhDMf/')

print(len(result))
print(result)

Packaged as a library!

In addition, for convenience I wrote a library on GitHub that wraps many of these operations. I hope you will take a look and give some suggestions. If it is useful to you, stars and PRs are welcome. Thank you all!