Hot live streaming
Live streaming has become hugely popular in recent years, but when a single room shows a popularity figure in the millions, you can't help wondering how the number can be that high. Is the whole country watching live streams?
To settle this question, which has bothered me for a long time, I learned how to write crawlers in Node, and I'm sharing what I found here.
Douyu
We start with the best-known platform, Douyu, which also turned out to be the easiest.
Analyzing the Douyu website shows that there is a page listing all categories: https://www.douyu.com/directory
The cheerio package is used here. It is roughly a server-side equivalent of jQuery: it manipulates the DOM on the server, and its API is similar to jQuery's.
Because Douyu's pages are gzip-compressed, the zlib module is also needed to decompress them.
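The decompression step can be tried in isolation. The following is a minimal sketch, assuming the response body may or may not be gzip-compressed. The helper name decodeBody is hypothetical and not part of the original code. It checks the content-encoding value before calling zlib.gunzipSync, so an uncompressed response is not passed through the decompressor by mistake.

```javascript
const zlib = require('zlib')

// Hypothetical helper: decode a response body according to its
// Content-Encoding value (as found in res.headers['content-encoding']).
function decodeBody (buffer, contentEncoding) {
  if (contentEncoding === 'gzip') {
    // Compressed: gunzip first, then decode the bytes as UTF-8
    return zlib.gunzipSync(buffer).toString('utf-8')
  }
  // Not compressed: decode the bytes directly
  return buffer.toString('utf-8')
}

// Simulate a gzip-compressed page so the sketch runs offline
const compressed = zlib.gzipSync(Buffer.from('<html>douyu</html>'))
console.log(decodeBody(compressed, 'gzip'))
```

In a real request handler, the second argument would come from res.headers['content-encoding'].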
// Import the https module
const https = require('https')
// zlib module, used for decompression
const zlib = require('zlib')
// cheerio package, which provides jQuery-like functionality
const cheerio = require('cheerio')
function douyu () {
  // Create the request object
  let req = https.request('https://www.douyu.com/directory', res => {
    // Received data chunks
    let chunks = []
    // Store each chunk as it arrives
    res.on('data', chunk => {
      chunks.push(chunk)
    })
    // Data transfer is complete
    res.on('end', () => {
      // Concatenate the chunks
      let buffer = Buffer.concat(chunks)
      // Use zlib to decompress
      zlib.gunzip(buffer, function (err, decoded) {
        if (err) {
          console.error(err)
          return
        }
        // HTML text after gzip decompression
        let html = decoded.toString()
        // Use cheerio to parse the HTML
        let $ = cheerio.load(html)
        // Get the list of elements that contain live data
        let list = $('#allCate .layout-Module-container .layout-Classify-list .layout-Classify-item .layout-Classify-card')
        // Walk the DOM to retrieve the data in each tag
        const dataList = {}
        list.each((index, item) => {
          let key = '', value = ''
          item.children.forEach(childrenItem => {
            if (childrenItem.name === 'strong') {
              key = childrenItem.children[0] ? childrenItem.children[0].data : 'empty'
            } else if (childrenItem.name === 'div') {
              value = $(childrenItem).find('span').html() || ''
              // Turn hex entities such as &#x4E07; into %u4E07 and decode them
              value = unescape(value.replace(/&#x/g, '%u').replace(/;/g, ''))
            }
          })
          dataList[key] = value
        })
        // Add up the total number of viewers
        let total = 0
        for (let key in dataList) {
          let value = dataList[key]
          // Handle counts given in units of 万 (ten thousand)
          if (value.indexOf('万') !== -1) value = Number.parseFloat(value) * 10000
          total += Number.parseFloat(value) ? Number.parseFloat(value) : 0
        }
        console.log(`Douyu total viewers: ${total}`)
      })
    })
  })
  // Send the request
  req.end()
}
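The count-parsing part of the Douyu code can be pulled out into small functions that are easy to check on their own. This is a sketch under stated assumptions: the names decodeHexEntities and parseViewerCount are hypothetical, and the input is assumed to look like the span contents handled above (a plain number, or a number followed by 万, possibly written as a hex entity). The legacy unescape call is swapped here for String.fromCodePoint.

```javascript
// Hypothetical helper: decode hexadecimal HTML entities such as "&#x4E07;"
function decodeHexEntities (text) {
  return text.replace(/&#x([0-9a-fA-F]+);/g, (match, hex) =>
    String.fromCodePoint(Number.parseInt(hex, 16)))
}

// Hypothetical helper: convert "12.5万" or "3200" to a number of viewers
function parseViewerCount (text) {
  const decoded = decodeHexEntities(text).trim()
  const number = Number.parseFloat(decoded)
  // Empty or non-numeric text counts as zero viewers
  if (Number.isNaN(number)) return 0
  // 万 means ten thousand; round to avoid floating-point residue
  return decoded.includes('万') ? Math.round(number * 10000) : number
}

console.log(parseViewerCount('12.5&#x4E07;'))
console.log(parseViewerCount('3200'))
```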
Huya
A look at Huya's website shows that it has no category listing with statistics like Douyu's. However, there is an interface that returns the list of live rooms: https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&tagAll=0&page=1. Through this interface you can get information on every live room, including its viewer count. Adding up the viewer counts of all rooms then gives the total for the platform.
// Import the https module
const https = require('https')
function huya () {
  // Initialize the total viewer count
  let total = 0
  // Initialize the total number of pages
  let totalPage = 1
  // Initialize the current page number
  let currentPage = 1
  // Start fetching data, as with Douyu
  huyaGetData(currentPage)
  function huyaGetData (currentPage) {
    let req = https.request(`https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&tagAll=0&page=${currentPage}`, res => {
      const chunks = []
      res.on('data', chunk => {
        chunks.push(chunk)
      })
      res.on('end', () => {
        const data = JSON.parse(Buffer.concat(chunks).toString('utf-8')).data
        const dataList = data.datas
        // Get the total page count
        totalPage = data.totalPage
        // Add up the number of viewers in each live room
        total = dataList.reduce((sum, item) => {
          return sum + Number.parseInt(item.totalCount, 10)
        }, total)
        // Fetch the next page, up to and including the last one
        currentPage += 1
        if (currentPage <= totalPage) {
          huyaGetData(currentPage)
        } else {
          console.log(`Huya total viewers: ${total}`)
        }
      })
    })
    req.end()
  }
}
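The same page-by-page summation can also be written as a loop with async/await. The following is a minimal sketch, not the original implementation: sumAllPages and fetchPage are hypothetical names, and fetchPage is assumed to resolve to an object shaped like the data field read above ({ totalPage, datas: [{ totalCount }] }). A stand-in fetcher replaces the real interface so the sketch runs offline.

```javascript
// Hypothetical: fetchPage(page) resolves to { totalPage, datas: [{ totalCount }] },
// mirroring the `data` object returned by the list interface.
async function sumAllPages (fetchPage) {
  let total = 0
  let totalPage = 1
  // totalPage is updated from each response, so the loop bound grows as needed
  for (let page = 1; page <= totalPage; page++) {
    const data = await fetchPage(page)
    totalPage = data.totalPage
    total += data.datas.reduce(
      (sum, item) => sum + (Number.parseInt(item.totalCount, 10) || 0), 0)
  }
  return total
}

// Stand-in for the real interface (made-up numbers, for illustration only)
const fakePages = {
  1: { totalPage: 2, datas: [{ totalCount: '10' }, { totalCount: '20' }] },
  2: { totalPage: 2, datas: [{ totalCount: '30' }] }
}
const fakeFetchPage = async page => fakePages[page]

sumAllPages(fakeFetchPage).then(total => console.log(total))
```

Passing the fetcher in as a parameter keeps the summation logic separate from the network code, which makes it possible to check the paging logic without hitting the site.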
Bilibili
Bilibili also has an interface for querying all live rooms, similar to Huya's. The counting method is much the same, so it is not repeated here.
YY
YY's website does not provide an interface for querying all live rooms. The interface https://www.yy.com/more/page.action?biz=sing&subBiz=idx&page=3&moduleId=308&pageSize=60 only covers a single category, and the parameter values differ for each category with no discernible pattern. So the information needed for each category's query has to be obtained first; after that, the query works the same way as for Huya.
What we need now is the information the interface requires. Analyzing the pages shows that every category list page has a global variable named pageInfo, which contains everything the query needs. We could fetch the HTML of these category pages and parse the pageInfo variable out of it. For demonstration purposes, though, we use a different approach here: Selenium.
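For comparison, the HTML-parsing route might look like the following. This is only a sketch under strong assumptions: that the page declares the variable as "var pageInfo = {...};" inside a script tag, and that the object literal is valid JSON. Neither is confirmed by the source. The helper name extractPageInfo and the sample HTML are hypothetical, and the field names in the sample are borrowed from the query URL above.

```javascript
// Hypothetical helper: pull a JSON-compatible `pageInfo` object literal out of HTML.
// Assumes the declaration has the form: var pageInfo = { ... };
function extractPageInfo (html) {
  const match = html.match(/var\s+pageInfo\s*=\s*(\{[\s\S]*?\})\s*;/)
  if (!match) return null
  try {
    return JSON.parse(match[1])
  } catch (err) {
    // The literal was not valid JSON (for example, unquoted keys)
    return null
  }
}

// Made-up sample page, for illustration only
const sampleHtml = '<script>var pageInfo = {"biz":"sing","subBiz":"idx","moduleId":308};</script>'
console.log(extractPageInfo(sampleHtml))
```

If the real pages use a plain JavaScript object literal rather than JSON, this approach would fail, which is one reason running the page in a browser can be simpler.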
Selenium is an automated testing framework for web applications. In a crawler it can open a real browser and drive it from code to simulate a human user, which helps get past some anti-crawler measures.
To use Selenium, you need to download the WebDriver for your browser and platform. For Chrome, download the ChromeDriver that matches the version of Chrome on your computer and copy it to the root directory of your project. Drivers for other browsers can be found and downloaded in the same way.
Then install the selenium-webdriver package as a dependency of the project.
const { Builder, By } = require('selenium-webdriver')
async function getYYPageInfoList () {
  // Collected pageInfo objects, one per category page
  const pageInfoList = []
  // Build the WebDriver object
  let driver = await new Builder().forBrowser('chrome').build()
  // Open the page
  await driver.get('https://www.yy.com/catalog')
  // Get the list of category links
  let aList = await driver.findElements(By.css('.w-video-module-cataloglist a'))
  // Get the list of category page addresses
  let hrefList = []
  for (let i = 0; i < aList.length - 1; i++) {
    let href = await aList[i].getAttribute('href')
    hrefList.push(href)
  }
  // Open each category page
  for (let i = 0; i < hrefList.length - 1; i++) {
    await driver.get(hrefList[i])
    // Read the page's global pageInfo variable and store it
    const obj = await driver.executeScript('return pageInfo')
    pageInfoList.push(obj)
  }
  // Exit the browser
  await driver.quit()
  return pageInfoList
}
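Once the pageInfo objects are collected, each category can be queried page by page in the same way as Huya. A minimal sketch of building the query URL follows. The parameter names (biz, subBiz, page, moduleId, pageSize) come from the interface URL shown earlier, but the assumption that pageInfo exposes fields with exactly these names is mine, and buildYYListUrl is a hypothetical helper.

```javascript
// Hypothetical helper: build the list-query URL for one category.
// Assumes pageInfo has biz, subBiz and moduleId fields (not confirmed by the source).
function buildYYListUrl (pageInfo, page, pageSize = 60) {
  const url = new URL('https://www.yy.com/more/page.action')
  url.searchParams.set('biz', pageInfo.biz)
  url.searchParams.set('subBiz', pageInfo.subBiz)
  url.searchParams.set('page', String(page))
  url.searchParams.set('moduleId', String(pageInfo.moduleId))
  url.searchParams.set('pageSize', String(pageSize))
  return url.toString()
}

// Values taken from the example URL in the text
console.log(buildYYListUrl({ biz: 'sing', subBiz: 'idx', moduleId: 308 }, 3))
```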
Results
The results are published at http://liupenglong.com/live/index.html (updated every five minutes).
As you can see, with only Douyu, Huya, Bilibili, and YY counted so far, the total already exceeds the country's entire population. Add the platforms that are not yet counted and you can imagine the result: apparently the whole world is watching China's live streams.
When I have time I will add statistics for other platforms, and eventually I hope to work out how many people are actually watching. If you have any good ideas, please leave a comment.