Preface

Recently I shared some web-crawling basics, so today let's put them into practice and use Python crawler techniques to grab audio data from a real platform.

As we all know, most data lives in the page source returned for a URL, as shown in the figure below.

Web site source code

But it can be a pain to crawl data that is not in the source of the current URL. In that case the data is either added to the page dynamically by JavaScript or loaded via Ajax, and either one is a headache for beginners. Today we'll look at the characteristics of Ajax-loaded data through a very simple example (every site is different, so different scenarios need to be treated differently).

Hands-on practice

What is Ajax?

Let's start with the question: what is Ajax? Here is the explanation from Baidu Baike:

Ajax (Asynchronous JavaScript And XML) is a web development technique for creating interactive, fast, dynamic web applications; it can update parts of a web page without reloading the entire page.

The key point in this definition is the ability to update part of a page without reloading the whole page. For example, here is the train list page on 12306.

In this screenshot we can see what 12306's page looks like. When we click the "Query" button, the page changes to this:

Notice that the page was not refreshed, yet part of its content changed, which means the train list is loaded via Ajax. Once you see it in action, "updating parts of a web page without reloading the entire page" is not hard to understand.
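To make this concrete: an Ajax endpoint typically returns a small JSON payload, which the page's JavaScript merges into the existing page, so nothing reloads. Below is a minimal sketch in Python; the response body is a made-up example, not 12306's real format.

```python
import json

# Hypothetical Ajax response body -- real endpoints return JSON like
# this instead of a full HTML page, so only part of the page updates.
ajax_body = '{"data": {"trains": [{"code": "G101", "depart": "08:00"}]}}'

payload = json.loads(ajax_body)
for train in payload["data"]["trains"]:
    print(train["code"], train["depart"])  # G101 08:00
```

This is exactly why Ajax data is often easier to crawl than HTML: once you find the endpoint, you get clean structured data.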

Case presentation

Let's use a concrete case to analyze and summarize how to crawl Ajax data.

The figure below shows an audio site whose content is mainly audio. How do we crawl the audio data on this site?

Let's narrow the goal further: the site has an audio album called "Piano of the Night", and today we will crawl the data inside this album.

After clicking into the album, we find plenty of audio tracks. Following our usual approach, as long as we can find the URL of each audio file, crawling is easy.

But is the data actually in the source of the current URL? Checking is simple: open the page source and search for audio formats such as mp3 or m4a. As the image below shows, neither format appears anywhere in the source, so we can guess that the data is loaded via Ajax.
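That manual search can also be scripted. Here is a small sketch using only the standard library; fetching the album page's HTML itself would be done with requests, as in the code later in the article:

```python
def find_audio_formats(html, formats=("mp3", "m4a")):
    """Report which audio formats appear anywhere in the page source."""
    html = html.lower()
    return {fmt: fmt in html for fmt in formats}

# If every value is False, the audio URLs are not in the static source,
# which suggests they are loaded via Ajax.
print(find_audio_formats("<html><body>no audio links here</body></html>"))
# {'mp3': False, 'm4a': False}
```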

The next step is to analyze the data interface through the Network panel of the browser's developer tools. When we click a track, several requests appear in the Network panel. Let's look at the first one.

The data is actually in this interface

Let's take a closer look and see whether it contains an audio URL in a format like mp3 or m4a. To our disappointment, it doesn't. So what now?

There's another request just below this interface; what's in it?

When we click on this request, we find that it returns audio data.

To verify that this is the data we want to crawl, we can play it back and check. Sure enough, it is exactly the data we want.
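Extracting the audio URL from this second interface is a one-liner once we know its JSON shape. The field names below (`data`, `src`) come from the response we just inspected; the URL value is a made-up placeholder:

```python
import json

# Placeholder body mimicking the shape of the second interface's response.
body = '{"data": {"src": "https://example.com/placeholder.m4a"}}'

src = json.loads(body)["data"]["src"]
print(src)  # https://example.com/placeholder.m4a
```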

But this is just one track. What about the others? Comparing requests, we find that only the id differs from track to track. That makes things easy: first we request the first interface to get the id and name of each track.

Then we request the second interface, substituting each track's id into its URL.
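The id substitution described here is just string formatting against the second interface's URL (the endpoint path is the one used in the code that follows):

```python
AUDIO_API = 'https://www.ximalaya.com/revision/play/v1/audio?id=%d&ptype=1'

def audio_api_url(track_id):
    # Substitute a numeric track id into the second interface's URL.
    return AUDIO_API % track_id

print(audio_api_url(123456))
# https://www.ximalaya.com/revision/play/v1/audio?id=123456&ptype=1
```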

Code demo

import requests
import json

url = 'https://www.ximalaya.com/revision/play/v1/show?id=231012348&sort=1&size=30&ptype=1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}

r = requests.get(url, headers=headers)
result = json.loads(r.text)
content_list = result['data']['tracksAudioPlay']

for content in content_list:
    t_id = content['trackId']    # id of the audio
    name = content['trackName']  # name of the audio
    url = 'https://www.ximalaya.com/revision/play/v1/audio?id=%d&ptype=1' % t_id
    mus_response = requests.get(url, headers=headers)
    mus_result = json.loads(mus_response.text)
    mus_src = mus_result['data']['src']  # playable audio URL
    with open('mp3/%s.m4a' % name, 'wb') as f:
        mus = requests.get(mus_src)
        f.write(mus.content)


Running result

Only part of the core code is shown here; you can upgrade and extend it on this basis. The case is not very difficult, so everyone can give it a try. That said, please be considerate: don't crawl the site's data maliciously.
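One easy upgrade, for example: track names can contain characters that are illegal in file paths (such as `:` or `?`), which would crash the `open()` call above. Sanitizing the name first avoids that. A small sketch:

```python
import re

def safe_filename(name, ext='m4a'):
    """Replace characters that are illegal in Windows/Unix filenames."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', name).strip()
    return '%s.%s' % (cleaned, ext)

print(safe_filename('Piano of the Night: Track 1?'))
# Piano of the Night_ Track 1_.m4a
```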