Hot live streaming

Live streaming has become very popular in recent years, but when rooms show popularity figures in the millions, one can't help but wonder: how can the numbers be that high? Is the whole country watching?

To settle this question, which had been bothering me for a long time, I learned how to write a crawler in Node, and I am sharing the result here.

Douyu

We start with Douyu, the most common platform, which also turned out to be the easiest.

Analyzing the Douyu website shows that it has a page listing all categories: https://www.douyu.com/directory

The cheerio package is used here. It is roughly jQuery for the server side: it handles DOM manipulation in Node, and its API is similar to jQuery's.

Because the Douyu page is served gzip-compressed, the zlib module is also needed to decompress it.
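
As a standalone sketch of the decompression step, the helper below only gunzips when the response's content-encoding header says gzip, and otherwise treats the body as plain text. The name decodeBody and its parameters are made up for illustration, and the synchronous zlib call is used only to keep the example short.

```javascript
const zlib = require('zlib')

// Hypothetical helper: turn a response body buffer into text.
// contentEncoding is the value of the content-encoding response header (may be undefined).
function decodeBody(buffer, contentEncoding) {
    const isGzip = (contentEncoding || '').toLowerCase().includes('gzip')
    // Decompress only when the server reports gzip; otherwise use the bytes as they are
    const raw = isGzip ? zlib.gunzipSync(buffer) : buffer
    return raw.toString('utf-8')
}

// Round trip with locally compressed data, so no network access is needed
const compressed = zlib.gzipSync('<html>hello</html>')
console.log(decodeBody(compressed, 'gzip'))
console.log(decodeBody(Buffer.from('plain text'), undefined))
```

In a real request handler, the second argument would come from res.headers['content-encoding'].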

// Import the HTTPS module
const https = require('https')
// zlib package for decompression
const zlib = require('zlib')
// Cheerio package, which provides jquery-like functionality
const cheerio = require("cheerio");

function douyu () {
    // Create the request object
    let req = https.request('https://www.douyu.com/directory', res => {
        // Receive data
        let chunks = []
        // When the data is detected, it is stored
        res.on('data', chunk => {
            chunks.push(chunk)
        })
        // Data transfer is complete
        res.on('end', () => {
            // Splice data
            var buffer = Buffer.concat(chunks)
            // Use zlib to decompress
            zlib.gunzip(buffer, function (err, decoded) {
                // Gzip decompressed HTML text
                let html = decoded.toString()
                // Use cheerio to parse HTML
                let $ = cheerio.load(html)
                // Get a list of elements that contain live data
                let list = $('#allCate .layout-Module-container .layout-Classify-list .layout-Classify-item .layout-Classify-card')
                // Parse the DOM to retrieve the data in the tag
                const dataList = {}
                Array.prototype.map.call(list, item => {
                    let key = '', value = ''
                    item.children.forEach(childrenItem => {
                        if (childrenItem.name === 'strong') {
                            key = childrenItem.children[0]? childrenItem.children[0].data : 'empty'
                        } else if (childrenItem.name === 'div') {
                            value = $(childrenItem).find('span').html()
                            // Turn HTML entities such as &#x4E07; into %u4E07 so unescape can decode them
                            value = unescape((value || '').replace(/&#x/g, '%u').replace(/;/g, ''))
                        }
                    })
                    dataList[key] = value
                })
                // Add up the total number of people
                let total = 0
                for (let key in dataList) {
                    let value = dataList[key]
                    // Process numbers in ten thousand units
                    if (value.indexOf('万') !== -1) value = Number.parseFloat(value) * 10000
                    total += Number.parseFloat(value) ? Number.parseFloat(value) : 0
                }
                console.log(`Douyu total: ${total}`)
            })
        })
    })
    // Send the request
    req.end()
}
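
The handling of counts written in units of ten thousand (万) can also be pulled out into a small helper. This is a minimal sketch: the names parseViewerCount and sumViewerCounts are hypothetical, and the sample values are made up.

```javascript
// Hypothetical helper: convert a display string such as "1.2万" or "3456" to a number
function parseViewerCount(text) {
    const value = Number.parseFloat(text)
    // Anything that does not start with a number (empty string, null) counts as zero
    if (Number.isNaN(value)) return 0
    // "万" means ten thousand; round to avoid floating-point residue
    return String(text).includes('万') ? Math.round(value * 10000) : value
}

// Sum the counts of an object shaped like dataList: { categoryName: countText }
function sumViewerCounts(dataList) {
    return Object.values(dataList).reduce((sum, text) => sum + parseViewerCount(text), 0)
}

// Made-up sample data
console.log(sumViewerCounts({ games: '1.2万', music: '3456', other: '' })) // 15456
```

Keeping the parsing in one place makes it easier to adjust if the site changes how it formats the numbers.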

Huya

Huya's website has no category page with viewer statistics like Douyu's, but it does expose an interface that returns the list of live rooms: https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&tagAll=0&page=1. This interface returns information on every live room, including its viewer count, so the platform total can be obtained by adding up the counts of all rooms.

// Import the HTTPS module
const https = require('https')

function huya() {
    // Initialize the total number
    let total = 0
    // Initialize the total number of pages
    let totalPage = 1
    // Initializes the current page count
    let currentPage = 1
    // Start fetching data, as with Douyu
    huyaGetData(currentPage)
    function huyaGetData(currentPage) {
        let req = https.request(`https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&tagAll=0&page=${currentPage}`, res => {
            const chunks = []
            res.on('data', chunk => {
                chunks.push(chunk)
            })
            res.on('end', () => {
                const data = JSON.parse(Buffer.concat(chunks).toString('utf-8')).data
                const dataList = data.datas
                // Get the total page count
                totalPage = data.totalPage
                // Add up the number of people in the studio
                total = dataList.reduce((total, item) => {
                    return total + Number.parseInt(item.totalCount)
                }, total)
                // Get the data for the next page
                currentPage += 1
                // Use <= so that the last page is fetched too
                if (currentPage <= totalPage) {
                    huyaGetData(currentPage)
                } else {
                    console.log(`Huya total: ${total}`)
                }
            })
        })
        req.end()
    }
}
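
The same page-by-page accumulation can be written as a loop with async/await. The sketch below is not tied to Huya's real interface: fetchPage is a hypothetical function that resolves to an object with the totalPage and datas fields used above, and here it is replaced by made-up in-memory data so the loop can be tried offline.

```javascript
// fetchPage(page) is hypothetical: it should resolve to
// { totalPage: number, datas: [{ totalCount: string }] }
async function sumAllPages(fetchPage) {
    let total = 0
    let totalPage = 1
    // Pages are numbered from 1 up to and including totalPage
    for (let page = 1; page <= totalPage; page++) {
        const data = await fetchPage(page)
        // The page count is only known after the first response
        totalPage = data.totalPage
        for (const room of data.datas) {
            total += Number.parseInt(room.totalCount, 10) || 0
        }
    }
    return total
}

// Made-up stand-in for the HTTP call
const fakePages = {
    1: { totalPage: 2, datas: [{ totalCount: '10' }, { totalCount: '20' }] },
    2: { totalPage: 2, datas: [{ totalCount: '30' }] }
}
sumAllPages(page => Promise.resolve(fakePages[page])).then(total => console.log(total)) // 60
```

Passing the fetch function in as a parameter keeps the counting logic separate from the network code, which also makes it easy to reuse for other platforms with a similar paged interface.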

Bilibili

Bilibili also has an interface for querying all live rooms, similar to Huya's. The counting method is much the same, so it is not repeated here.

YY

YY's website does not provide an interface for querying all live rooms. The interface at https://www.yy.com/more/page.action?biz=sing&subBiz=idx&page=3&moduleId=308&pageSize=60 only returns a single category, and the parameter values differ for each category with no discernible pattern. So we first need to obtain the parameters for each category's query; after that, the counting works the same way as for Huya.

What remains is to get the parameters the interface requires. Analyzing the pages shows that each category's list page has a global variable, pageInfo, that contains everything the query needs. We could fetch the HTML of these category pages and parse the pageInfo variable out of it, but for demonstration purposes we use another tool, Selenium, to solve this problem.
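
For comparison, the HTML-parsing alternative could look roughly like the sketch below. It rests on an assumption that has not been checked against the real pages: that pageInfo is written as `var pageInfo = {...};` with a JSON-compatible object literal. The function name extractPageInfo and the sample HTML are made up; the field names are borrowed from the query URL above.

```javascript
// Hypothetical helper: pull the pageInfo object out of a category page's HTML.
// Assumes the form `var pageInfo = {...};` with a JSON-compatible literal.
function extractPageInfo(html) {
    const match = html.match(/var\s+pageInfo\s*=\s*(\{[\s\S]*?\})\s*;/)
    if (!match) return null
    try {
        return JSON.parse(match[1])
    } catch (err) {
        // The literal was not valid JSON (for example, unquoted keys)
        return null
    }
}

// Made-up sample page
const sampleHtml = '<script>var pageInfo = {"biz":"sing","subBiz":"idx","moduleId":308};</script>'
console.log(extractPageInfo(sampleHtml))
```

A regular expression like this breaks if a string inside the object contains `};`, which is one reason driving a real browser and reading the variable directly can be more robust.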

Selenium is an automated testing framework for web applications. In a crawler, it can open a browser and use code to simulate what a real person would do in order to collect the required information, which gets around some anti-crawler measures.

To use Selenium, you need to download the WebDriver that matches your platform. For Chrome, download the ChromeDriver that matches the Chrome version on your computer and copy it to the root directory of your project. Drivers for other browsers can be found and downloaded in the same way.

Then install the selenium-webdriver package as a dependency of the project.

const { Builder, By } = require('selenium-webdriver')

async function getYYPageInfoList() {
    // Build the WebDriver object
    let driver = await new Builder().forBrowser('chrome').build();
    // Open the page
    await driver.get('https://www.yy.com/catalog');
    // Get the list of category tags
    let aList = await driver.findElements(By.css('.w-video-module-cataloglist a'))
    // Get the category page address list
    let hrefList = []
    for (let i = 0; i < aList.length; i++) {
        let href = await aList[i].getAttribute('href')
        hrefList.push(href)
    }
    // Open each category page and collect its pageInfo
    const pageInfoList = []
    for (let i = 0; i < hrefList.length; i++) {
        await driver.get(hrefList[i])
        // Read the page's global pageInfo variable; await so it is stored before we return
        const obj = await driver.executeScript('return pageInfo')
        // Store pageInfo
        pageInfoList.push(obj)
    }
    // Exit the browser
    await driver.quit()
    return pageInfoList
}

Results

Results page: http://liupenglong.com/live/index.html (the statistics are refreshed every five minutes)

As you can see, with only Douyu, Huya, Bilibili, and YY counted so far, the total already exceeds the population of the whole country. Add in the platforms not yet counted and the result is easy to imagine: apparently people all over the world are watching China's live streams.

When I have time I will add statistics for other platforms, and eventually I hope to work out how many people are actually watching live streams. If you have any good ideas, please leave a comment.