This article was originally published in the WeChat public account "Geek Monkey". Follow it to get new original posts first.

(1)

The little yellow flower from the story

Has been drifting since the year it was born

The childhood swing

Has been swaying with the memories up to now

Re So So Si Do Si La

So La Si Si Si Si La Si La So

Humming the prelude, looking up at the sky

I think of petals trying to fall

Your editor Sun Wukong has a hobby: listening to familiar melodies while reading the comments under NetEase Cloud Music songs, especially the hot comments.

Among the comments, the tear-jerking stories are the ones that make you think.

(2)

One day, Sun Wukong had a whim: he wanted to crawl the hot comments of the songs he usually listens to, so he could read them directly without opening a web page.

No sooner said than done. Sun Wukong opened a browser, visited NetEase Cloud Music, and clicked into a random song page. Most websites today use Ajax to fetch data, so the first step is to determine whether this page uses that technique.

There is a Google Chrome extension called Toggle JavaScript that enables or disables JavaScript on a page.

With JavaScript enabled, the page displays normally. When JavaScript is disabled, the page that normally shows the data becomes blank.

From this we can conclude that NetEase Cloud Music loads its data with Ajax.
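The same conclusion can be sanity-checked from code. Here is a minimal sketch, assuming the song-page URL from the list further below: fetch the raw HTML and observe that the comment data is not in it.

import requests

# Fetch the static HTML of a song page. The page shell comes back, but the
# comment data does not: it is loaded afterwards by JavaScript (Ajax).
html = requests.get('http://music.163.com/song?id=186016',
                    headers={'User-Agent': 'Mozilla/5.0'}).text
print(len(html))              # a non-trivial page shell is returned...
print('hotComments' in html)  # ...but the comment payload is expected to be absent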

With Ajax, JavaScript embedded in the HTML document requests data from the server and updates the page without a full refresh. To pin down the data source, we need to find the request URL and the parameters it carries.

This calls for the browser's developer tools (usually opened with the F12 key) to capture and analyze the network requests. Since we already know the site uses Ajax, we select the XHR filter to show only those requests.

The HTTP request behind each URL is then inspected in turn, focusing on the Headers and Preview tabs. Eventually, the Monkey King discovers that the request R_SO_4_186001?csrf_token= contains the information we need: the Preview tab shows fields that match the hot-comment usernames displayed on the page.

Switching to the Headers tab confirms the request domain and the parameters the request needs to carry.

The crawling approach is therefore: send a POST request to http://music.163.com/weapi/v1/resource/comments/R_SO_4_{song_id}?csrf_token= carrying the form parameters params and encSecKey. The JSON in the response contains the user comment data.

(3)

Now that the idea is clear, writing the code is much easier.

Here, Sun Wukong uses a list to hold the songs whose hot comments he wants to crawl.

songs_url_list = [
    'http://music.163.com/#/song?id=186016',     # Sunny
    'http://music.163.com/#/song?id=186001',     # To indicate
    'http://music.163.com/#/song?id=27876900',   # Here We Are Again
    'http://music.163.com/#/song?id=439915614',  # Meet you
    'http://music.163.com/#/song?id=139774',     # The truth that you leave
    'http://music.163.com/#/song?id=29567189',   # Ideal
    'http://music.163.com/#/song?id=308353',     # HD
    'http://music.163.com/#/song?id=31445772',   # Ideal thirty
    'http://music.163.com/#/song?id=439915614',  # Meet you (duplicated in the original list)
    'http://music.163.com/#/song?id=28815250',   # The road to ordinariness
    'http://music.163.com/#/song?id=25706282',   # Brightest star in the night sky
    'http://music.163.com/#/song?id=436514312',  # Chengdu
]

Next, extract the value of the id field from each link.

def get_song_id(url):
    """Extract the song ID from the URL."""
    song_id = url.split('=')[1]
    return song_id
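Each URL in the list contains exactly one '=', so everything after it is the ID. A quick check of the helper:

# The '#/' fragment does not affect split(); only the '=' position matters.
print(get_song_id('http://music.163.com/#/song?id=186016'))  # -> '186016'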

Then use Requests to send a POST request to the comment URL built by splicing in the song ID.

import random
import time

import requests


def start_spider(song_id):
    """The comment data is obtained with Ajax. Below is the request
    address for the comments."""
    url = 'http://music.163.com/weapi/v1/resource/comments/R_SO_4_{}?csrf_token='.format(song_id)

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36',
        'Origin': 'http://music.163.com',
        'Referer': 'http://music.163.com/song?id={}'.format(song_id),
    }

    # These two values were captured from the browser request; they are the
    # encrypted form parameters the weapi endpoint expects.
    formdata = {
        'params': '57Wh2mgebLOOPQVBc+B2wz4sCCH/nXZFEoTc/XNySiqT0V7ZxUADzDNgTXXhYgAJ5BNMryMgxhdwNzF1GyxDZo3iR9/YYbWgCAQHC5DCDuObqvxNcOcnQDaRqJCrqQcrEABW1SwKitfbD3wMEyB4tJu+rU8goSwg2FP/PBBLs9DVs1iWdWGjV6CdrocA36Rs',
        'encSecKey': '63774137ba4f5cc60d1b6a3bc14985a9563a7bfdec4f3e74297ffc07514adf18f90620933a01c2db4ca989cc4e1dfc49789981424c294a34e48c2cbe7aa51533a5cc5b5776a9e499cd08770bc596655dbe8e001d1ed5fd47a27dd195128480820cc67a799d341f95d447e3522851f2b64ad1cb8350e2015b265b9e684179351c',
    }

    response = requests.post(url, headers=headers, data=formdata)
    print('Request [' + url + '], status code is')
    print(response.status_code)
    # Write the hot comments to a CSV file
    write_to_file(get_hot_comments(response.text))


# Driver loop: crawl each song, pausing 5-8 seconds between requests to be
# polite. (Run this after get_hot_comments and write_to_file, defined below.)
for each in songs_url_list:
    start_spider(get_song_id(each))
    time.sleep(random.randint(5, 8))
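Before looping over the whole list, one might smoke-test a single song. A hypothetical quick check, assuming all the functions in this article are already defined:

# One-off check with a single song ID taken from the list above.
start_spider('186016')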

The request returns JSON, and we only need the hotComments field, so the data needs some processing.
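For reference, the relevant part of the response looks roughly like this (abridged; the shape is inferred from the fields the code below reads):

# Abridged sketch of the JSON returned by the comments endpoint:
# {
#     "hotComments": [
#         {
#             "user": {"userId": 12345, "nickname": "someone"},
#             "content": "the comment text",
#             "likedCount": 678
#         },
#         ...
#     ],
#     ...
# }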

import json


def get_hot_comments(response):
    """The response is JSON; json.loads() turns it into a dict so that
    values can be read by key."""
    data_list = []

    for comment in json.loads(response)['hotComments']:
        data = {}
        data['userId'] = comment['user']['userId']
        data['nickname'] = comment['user']['nickname']
        data['content'] = comment['content']
        data['likedCount'] = comment['likedCount']
        data_list.append(data)
    # print(data_list)
    return data_list
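A tiny self-contained check of the parsing logic, using hypothetical sample data:

sample = ('{"hotComments": [{"user": {"userId": 1, "nickname": "demo"},'
          ' "content": "hello", "likedCount": 42}]}')
print(get_hot_comments(sample))
# -> [{'userId': 1, 'nickname': 'demo', 'content': 'hello', 'likedCount': 42}]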

Finally, the data is saved to a CSV file.

import codecs
import csv
import os


def write_to_file(datalist):
    print('Start persisting data... ')
    file_name = 'NetEase Cloud Music hot comments.csv'

    # With mode 'a+', check first whether the file already exists so the
    # header row is only written once, not on every call.
    file_exists = os.path.exists(file_name)

    with codecs.open(file_name, 'a+', 'GBK') as csvfile:
        fieldnames = ['user Id', 'nickname', 'comment content', 'likes']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        if not file_exists:
            writer.writeheader()
        for data in datalist:
            print(data)
            try:
                writer.writerow({fieldnames[0]: data['userId'],
                                 fieldnames[1]: data['nickname'],
                                 fieldnames[2]: data['content'],
                                 fieldnames[3]: data['likedCount']})
            except UnicodeEncodeError:
                print("Encoding error. This row cannot be written to the file; skipping it.")
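The GBK encoding is what triggers the UnicodeEncodeError above: emoji and other characters outside GBK cannot be encoded. If GBK output is not a hard requirement, a variant sketch using utf-8-sig avoids most of these errors while still opening cleanly in Excel (function name and file name are hypothetical):

import codecs
import csv

def write_to_file_utf8(datalist, file_name='NetEase Cloud Music hot comments (utf8).csv'):
    # Variant sketch: utf-8-sig writes a BOM so Excel detects the encoding,
    # and UTF-8 can represent emoji that GBK cannot.
    with codecs.open(file_name, 'a+', 'utf-8-sig') as csvfile:
        writer = csv.DictWriter(csvfile,
                                fieldnames=['userId', 'nickname', 'content', 'likedCount'])
        writer.writeheader()
        writer.writerows(datalist)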

At this point, you should know how to crawl websites that load their data with Ajax. Note that on some sites the captured request parameters only work once, because they are further encrypted in the JS code; in that case you have to work out the encryption yourself and reproduce it in code.

Ha ha, please allow me to post the crawl results here.

Github repository address: 163MusicCrawler


This article was first published on WeChat under the title "Crawling NetEase Cloud Music hot comments". You are welcome to reprint it; please contact the account first to be added to the whitelist, and respect the author's original work. On my WeChat account "Geek Monkey" I share original Python posts every week, covering web crawlers, data analysis, web development, and more.