Everything can be crawled: Puppeteer
Before we start
The previous article explained how to get started quickly with Puppeteer by crawling a simple video website. The logic there was fairly simple, though: it took a shortcut and ran the detail page logic directly, so it could only crawl one TV show or movie at a time. That gets annoying, and it is still not enough if you want to build your own video site. There is also the trickier question of what to do when the detail page uses token authentication.
This article is for learning and exchange only. If anything here infringes your rights, contact me and I will revise it.
Content preview
This article covers the following:
- Crawling all movies/TV series on a video website in one go
- Automatically filling in a basic form
- page.click()
- Solving a slider CAPTCHA
Video site
Here is what you need to do to crawl a video site:
1. Find a website you like
2. Crawl the list page: get the detail page address of each video (note: list pages are usually paginated)
3. Go to the detail page
4. Get the video playback address (navigate again if you also need to select a resource)
5. Close the detail page
Repeat steps 3 to 5. Once everything on the current list page has been crawled, go to the second list page, and so on. After all list pages have been crawled, close the browser.
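The loop described above can be sketched without any browser code. This is only a control-flow outline: fetchDetailUrls and fetchVideo are hypothetical callbacks that stand in for the Puppeteer logic, and pageCount is an assumed parameter for the number of list pages.

```javascript
// Control-flow sketch only. fetchDetailUrls and fetchVideo are hypothetical
// callbacks, not part of Puppeteer or of the original code.
async function crawlSite(pageCount, fetchDetailUrls, fetchVideo) {
  const videos = [];
  for (let pageNo = 1; pageNo <= pageCount; pageNo++) {
    // Collect the detail page addresses from one list page
    const detailUrls = await fetchDetailUrls(pageNo);
    for (const url of detailUrls) {
      // Visit each detail page in turn and keep what it yields
      videos.push(await fetchVideo(url));
    }
  }
  return videos;
}
```

Keeping the two callbacks separate mirrors the split between a list page function and a detail page function.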
Once you know the steps, let’s go
Find the website and analyze it
To avoid any suspicion of advertising, the video URL in the demo is a fake address. If you want the real address, leave a comment.
The list page is paginated. There are two ways to handle this:
- Analyze its paging rule
- Use page.click()
Start with the approach that analyzes the paging rule:
http://www.xxxxxxx.xx
http://www.xxxxxxx.xx/page/2
http://www.xxxxxxx.xx/page/3
...
I don’t need to tell you what the rule is
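For completeness, the rule can be written as a small function. This is a minimal sketch that assumes page 1 is the bare address and every later page appends /page/n, as in the listing above. BASE_URL is the article's placeholder address and listPageUrl is a hypothetical helper name.

```javascript
// BASE_URL is the placeholder address used in this article, not a real site.
const BASE_URL = 'http://www.xxxxxxx.xx';

// Page 1 is the bare address; page n (n >= 2) appends /page/n.
function listPageUrl(pageNo) {
  return pageNo === 1 ? BASE_URL : `${BASE_URL}/page/${pageNo}`;
}

// Build the addresses of the first three list pages
const urls = [1, 2, 3].map(listPageUrl);
console.log(urls);
```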
Start crawling
List pages
If you are not familiar with the Puppeteer API, see the previous article, "Everything can be crawled".
const puppeteer = require('puppeteer');

// pageSize: how many list pages you want to crawl
const findAllMovie = async (pageSize) => {
  console.log('Start visiting this site');
  const browser = await puppeteer.launch({
    executablePath: puppeteer.executablePath(),
    headless: false
  });
  for (let i = 1; i <= pageSize; i++) {
    // Holds the detail page addresses found on this list page
    const arr = [];
    const targetUrl = `https://www.xxxxxxx.xx/page/${i}`;
    const page = await browser.newPage();
    // Open the list page
    await page.goto(targetUrl, { timeout: 0, waitUntil: 'domcontentloaded' });
    // Root node of the list
    const baseNode = 'ul#post_container';
    const movieList = await page.evaluate(sel => {
      // $ is assumed to be the jQuery already loaded by the target page
      const movieBox = Array.from($(sel).find('li'));
      return movieBox.map(v => {
        const url = $(v).find('.article h2 a').attr('href');
        return { url: url };
      });
    }, baseNode);
    arr.push(...movieList);
    // Crawl the detail pages
    await detailMovie(arr, page);
  }
  await browser.close();
  console.log('Visit Over');
  return { msg: 'Sync done' };
};
Details page
const detailMovie = async (arr, page) => {
  const detailArr = [];
  console.log('Number of movies per page: ' + arr.length);
  for (let i = 0; i < arr.length; i++) {
    await page.goto(arr[i].url, {
      timeout: 0,
      waitUntil: 'domcontentloaded'
    });
    const baseNode = '.article_container.row.box';
    const movieList = await page.evaluate(sel => {
      // $ is assumed to be the jQuery already loaded by the target page
      const movieBox = Array.from($(sel).find('#post_content').find('p'));
      // '#蓝光高清' ("Blu-ray HD") is assumed to be the id of the download table on the target site
      const urlBox = $(sel).find('#蓝光高清 td a').attr('href');
      const imgUrl = $(movieBox[0]).find('a').attr('href');
      const info = $(movieBox[1]).text();
      // Return a one-element array so the caller can spread it
      return [{ imgUrl: imgUrl, name: info, urlBox: urlBox }];
    }, baseNode);
    console.log(movieList);
    detailArr.push(...movieList);
    console.log('Crawled detail page ' + detailArr.length);
  }
  console.log('Start adding data to database');
  // addMovie is the author's database helper; it is not shown in this article
  await addMovie(detailArr);
  await page.close();
  return detailArr;
};
Process
Feels good, doesn't it?!
Results
Your own video website is now built.
Form
What makes Puppeteer so powerful is that it can simulate user actions, such as running a Baidu search on its own.
Implementation
It's simple, and one method is enough:
page.type(selector, text[, options])
- selector: the selector of the element to type into. If several elements match, the text goes into the first one
- text: the text to type
- options
  - delay: the delay between key presses, in milliseconds. The default is 0
Note: keydown, keypress/input, and keyup events are fired for each character typed.
await page.type('#mytextarea', 'Hello'); // Types instantly
await page.type('#mytextarea', 'World', {delay: 100}); // Types slowly, like a user
Next, try it on Baidu:
// Puppeteer form demo
const getForm = async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://baidu.com');
  // After opening Baidu, type "puppeteer" slowly into the search box
  await page.type('#kw', 'puppeteer', { delay: 100 });
  // Then click Search
  await page.click('#su');
  // Wait until the first result link is in the DOM
  await page.waitForSelector('.result a');
  const targetLink = await page.evaluate(() => {
    return document.querySelector('.result a').href;
  });
  console.log(targetLink);
  await page.goto(targetLink);
  await browser.close();
};
In the example above, we also use the page.click method.
page.click(selector[, options])
- selector: the selector of the element to click. If several elements match, the first one is clicked
- options
  - button: left, right, or middle. The default is left
  - clickCount: the default is 1
  - delay: the time to wait between mousedown and mouseup, in milliseconds. The default is 0
This method finds an element matching the selector, scrolls it into view if needed, and then uses page.mouse to click it. It throws an error if no element matches the selector.
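Because page.click() throws when nothing matches, it can help to check for the element first. The following is a minimal sketch, not part of the original code: clickIfPresent is a hypothetical helper name, and it relies only on page.$() resolving to null when the selector matches nothing.

```javascript
// Hypothetical helper: click only when the selector matches an element.
// `page` is a Puppeteer Page; page.$ resolves to null when nothing matches.
async function clickIfPresent(page, selector, options = {}) {
  const handle = await page.$(selector);
  if (handle === null) {
    return false; // nothing to click, and no error is thrown
  }
  await page.click(selector, options);
  return true;
}
```

The boolean return value lets the caller decide whether a missing element is a real error or just the end of the work.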
Note that if click() triggers a navigation, there is a separate page.waitForNavigation() promise to wait on. The correct pattern for waiting for the navigation is:
const [response] = await Promise.all([
page.waitForNavigation(waitOptions),
page.click(selector, clickOptions),
]);
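The same pattern can drive the second pagination approach mentioned earlier, page.click(). This is a minimal sketch that assumes the site has a "next page" link. nextSelector (the selector of that link), collect (a callback that scrapes the current list page), and maxPages are all hypothetical names introduced for illustration.

```javascript
// Sketch of paginating by clicking a "next page" link.
// nextSelector, collect and maxPages are hypothetical parameters.
async function crawlByClicking(page, nextSelector, collect, maxPages) {
  const results = [];
  for (let n = 1; n <= maxPages; n++) {
    results.push(...(await collect(page)));
    // Stop at the page limit or when there is no "next page" link left
    if (n === maxPages || (await page.$(nextSelector)) === null) break;
    // Start waiting for the navigation before the click resolves
    await Promise.all([
      page.waitForNavigation({ waitUntil: 'domcontentloaded' }),
      page.click(nextSelector),
    ]);
  }
  return results;
}
```

Compared with building the URLs from the paging rule, this variant keeps working when the address pattern is not obvious, at the cost of loading the pages strictly one after another.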
Simple slider CAPTCHA solving + mobile device emulation
Steps
- Find the slider
- Calculate the position of the slider
- Dispatch the events
- Drag it
- Release it
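Calculating the position comes down to a little arithmetic on the slider's bounding rectangle. The sketch below assumes the slider is grabbed at its center and dragged horizontally. dragPoints is a hypothetical helper, and trackWidth (how far the slider must travel, in pixels) is an assumed value that depends on the page.

```javascript
// rect has the shape returned by getBoundingClientRect(): top, left, bottom, right.
// trackWidth is an assumed value: how far the slider must travel, in pixels.
function dragPoints(rect, trackWidth) {
  const startX = (rect.left + rect.right) / 2;
  const startY = (rect.top + rect.bottom) / 2;
  // Horizontal drag: only x changes between start and end
  return { startX, startY, endX: startX + trackWidth, endY: startY };
}

const points = dragPoints({ top: 300, left: 20, bottom: 340, right: 60 }, 260);
console.log(points);
```

With Puppeteer, these points would feed mouse.move() to the start, mouse.down(), mouse.move() to the end with a steps option, and finally mouse.up().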
Implementation
Analyzing the difficulties
Emulating a mobile device
const devices = require('puppeteer/DeviceDescriptors');
const iPhone6 = devices['iPhone 6'];
await page.emulate(iPhone6)
The full version
const getYzm = async () => {
  const devices = require('puppeteer/DeviceDescriptors');
  const iPhone6 = devices['iPhone 6'];
  const conf = {
    headless: false,
    defaultViewport: {
      width: 1300,
      height: 900
    },
    slowMo: 30
  };
  const browser = await puppeteer.launch(conf);
  const page = await browser.newPage();
  await page.emulate(iPhone6);
  await page.goto('https://www.dingtalk.com/oasite/register_h5_new.htm');
  // The slider CAPTCHA checks navigator.webdriver, so set it to false before sliding.
  // navigator.webdriver is a read-only property that indicates whether the user agent is controlled by automation.
  await page.evaluate(async () => {
    Object.defineProperty(navigator, 'webdriver', { get: () => false });
  });
  // Incorrect input triggers the CAPTCHA
  await page.type('#mobileReal', '15724564118');
  await page.click('.am-button');
  await page.type('#mobileReal', ' ');
  await page.keyboard.press('Backspace');
  await page.click('._2q5FIy80');
  // Wait for the slider to appear
  const slide_btn = await page.waitForSelector('#nc_1_n1t', { timeout: 30000 });
  // Calculate the slider position
  const rect = await page.evaluate((slide_btn) => {
    // Returns the size of the element and its position relative to the viewport
    const { top, left, bottom, right } = slide_btn.getBoundingClientRect();
    return { top, left, bottom, right };
  }, slide_btn);
  console.log(rect);
  rect.left = rect.left + 10;
  rect.top = rect.top + 10;
  const mouse = page.mouse;
  await mouse.move(rect.left, rect.top);
  // The page expects touch events, but page.mouse only sends mouse events, so a tap is dispatched before sliding.
  await page.touchscreen.tap(rect.left, rect.top); // H5 pages need the event dispatched manually, mimicking an app's event dispatch
  await mouse.down();
  const start_time = new Date().getTime();
  await mouse.move(rect.left + 800, rect.top, { steps: 25 });
  await page.touchscreen.tap(rect.left + 800, rect.top);
  console.log(new Date().getTime() - start_time);
  await mouse.up();
  console.log(await page.evaluate('navigator.webdriver'));
  console.log('end');
  // await page.close()
};
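The code above overrides navigator.webdriver after page.goto(), so the site's own scripts may already have read the real value by then. One possible variation, which is an assumption and not part of the original code, registers the override with page.evaluateOnNewDocument() so that it runs before any page script. hideWebdriver is a hypothetical helper name, and whether this is enough for a given CAPTCHA is not guaranteed.

```javascript
// Sketch: register the override before any page script runs.
// Call this after browser.newPage() and before page.goto().
async function hideWebdriver(page) {
  await page.evaluateOnNewDocument(() => {
    // Runs in the browser for every new document, ahead of the site's scripts
    Object.defineProperty(navigator, 'webdriver', { get: () => false });
  });
}
```

Usage would be a single extra line, await hideWebdriver(page), placed just before the page.goto() call.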