Preface
I recently got into TypeScript, learned the basics, and had some free time over the weekend to see what I could build with it. Seeing friends use it to write crawlers got me interested in the topic, so I tried writing a simple crawler myself, and today I'm sharing it with you.
Preparation
Before writing a crawler, we need to do some preparatory work. Let's walk through it:
The target
The following figure shows the content we will crawl this time. We need to extract the information for each course and finally write it into a JSON file. Target URL: https://coding.imooc.com/?c=fe&sort=0&unlearn=0&page=1
Environment setup
We start by creating a new folder and initializing the project by executing the following commands in sequence:

npm init -y

Install typescript:

npm install typescript -D

Generate the TypeScript configuration file tsconfig.json (npx runs the locally installed compiler):

npx tsc --init
Let's install a few more modules that the project requires:
- ts-node (runs TS files directly, saving a compile step and making debugging easier)
- superagent (for sending requests)
- cheerio (for parsing the fetched HTML structure)
Modules like superagent are written in JS, so TypeScript cannot understand their types on its own; each one needs a "translation file", called a type declaration module here. Let's install these two type declaration modules (installation commands for everything above follow the list):
- @types/superagent
- @types/cheerio
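For reference, the installation might look like this; whether you save the runtime modules as regular or dev dependencies makes little difference for a one-off script like ours (you may also need @types/node for the fs import used later):

npm install superagent cheerio
npm install -D ts-node @types/superagent @types/cheerio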
Once the dependencies listed above are installed, our environment setup is almost complete
Project launch test
Let’s modify the package.json scripts configuration:
"scripts": {
"dev": "ts-node ./src/app.ts"
}
Copy the code
Create an app.ts file in the src directory and write:
console.log('I want to write a crawler.');
Then run npm run dev; if you see the output "I want to write a crawler." in the console, everything is working.
Analyze web pages and capture data
Analyze page tag structure
Before starting to write the crawler, we need to analyze the structure of the target web page and observe which tags hold the information we need; this lets us crawl more accurately. In our example, we pick a random course and right-click → Inspect, as shown below. We can see that each course is placed under an li tag (a sketch of the assumed markup follows the list):
- The course name is stored in the data-name attribute of the corresponding li tag
- Course enrollment information is stored in the p tag with class="one"
- Course price information is stored in the span tag with class="price" inside the p tag with class="two"
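For concreteness, here is a hypothetical sketch of that markup as a TypeScript string; the class names and attribute come from the analysis above, but the course name, count, and price are made up:

// Hypothetical markup sketch; the real page will differ in detail
const sampleHtml = `
<div class="course-list">
  <li data-name="TypeScript in Action">
    <p class="one">1000 people enrolled</p>
    <p class="two"><span class="price">¥299</span></p>
  </li>
</div>`;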
Fetching the data
After the above analysis, we can start writing our code. First we import the modules we need:
import superagent from 'superagent';
import cheerio from 'cheerio';
import fs from 'fs'; // will be used to write files later
- Let's start by defining a class with a method that fetches the target page. We use superagent to initiate the request and async/await to handle the asynchronous operation. If we print the result after the request completes, we'll find that the HTML structure is stored in the text property of the returned object, so we simply return the value stored in text.
class Grabcourse {
  // Store the target web page address
  private url: string = 'https://coding.imooc.com/?c=fe&sort=0&unlearn=0&page=1';
  // Get the HTML of the page to be crawled
  async getHtml() {
    const courseHtml = await superagent.get(this.url);
    return courseHtml.text;
  }
  constructor() {
    this.getHtml();
  }
}

new Grabcourse();
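One thing the snippet above doesn't handle is a failed request. A minimal sketch of a more defensive getHtml, assuming we just want to log the error and rethrow (this try/catch is my addition, not part of the original code):

// Drop-in replacement for getHtml with basic error handling (my addition)
async getHtml() {
  try {
    const courseHtml = await superagent.get(this.url);
    return courseHtml.text;
  } catch (err) {
    console.error('Failed to fetch the page:', err);
    throw err;
  }
}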
- Next we need to parse the HTML we just fetched. The cheerio module makes it easy to grab the tags we want from the HTML, because it supports jQuery-style syntax; anyone familiar with jQuery will pick it up very quickly.
We define another method on the class that parses the retrieved HTML; it accepts a string parameter and returns the parsed result.
async loadhtml(html: string) {
  return cheerio.load(html);
}
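To see what cheerio.load gives us, here is a tiny standalone demo; the HTML string is made up:

// cheerio is already imported above; the markup here is hypothetical
const $ = cheerio.load('<ul><li data-name="demo-course">hello</li></ul>');
console.log($('li').attr('data-name')); // 'demo-course'
console.log($('li').text()); // 'hello'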
- At this point we need to extract the information for each course, so we define another method on the class. The '.course-list li' selector matches all of the courses; we then walk through the matches, pull out the other values we need, and store them in an array. I set the function's parameter type to any, because otherwise TS throws a warning. We also need to define the array that stores the course information; since each item in this array has only a course name, course enrollment information, and a price, we can define an interface:
interface Course {
  courseName: string;
  courseType: string;
  coursePrice: string;
}
Then add the following line to the class to declare that each item in the array must be of the Course type:
private courseItems: Course[] = [];
The code for collecting all of the course information:
// Get course information
async getCourseInfo($element: any) {
  $element('.course-list li').each((idx: any, ele: any) => {
    const courseName = $element(ele).attr('data-name');
    // Strip whitespace with replace
    const courseType = $element(ele).find('.one').text().replace(/\s/g, '');
    const coursePrice = $element(ele).find('.two .price').text();
    this.courseItems.push({
      courseName,
      courseType,
      coursePrice,
    });
  });
  return this.courseItems;
}
Next we need a method that saves the course information we've collected. Again, we define it on the class:
// Save the courses we fetched
async saveCourseItems(result: Course[]) {
  const data = {
    course: result
  };
  // Save logic
  fs.writeFile('./course.json', JSON.stringify(data), (err: any) => {
    if (err) {
      console.error(err);
      return;
    }
    console.log('File write succeeded.');
  });
}
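Since the method is already async, an alternative is the promise-based fs API, which lets us await the write instead of passing a callback. A sketch of that variant (my alternative, not the article's code):

import { promises as fsp } from 'fs';

// Alternative save step using fs.promises instead of the callback API
async function saveCourseItemsAsync(result: Course[]) {
  const data = { course: result };
  await fsp.writeFile('./course.json', JSON.stringify(data));
  console.log('File write succeeded.');
}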
Following the principle of high cohesion and low coupling, we defined each step as a separate method. Now we tie those methods together in a single method and call it from the constructor, so that when we new Grabcourse() the crawling steps execute one after another. It also makes the code easier to read.
async initSpider() {
  const html = await this.getHtml();
  const $element = await this.loadhtml(html);
  const courseItems = await this.getCourseInfo($element);
  this.saveCourseItems(courseItems);
}
constructor() {
  this.initSpider();
}
Finally, run npm run dev and you'll see that a course.json file has been generated. Open it and you'll see the data, as shown in the figure below, which means our crawler has succeeded.
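The generated file should look roughly like this; the values here are made up for illustration, and your actual data will come from the page:

{
  "course": [
    {
      "courseName": "TypeScript in Action",
      "courseType": "1000 people enrolled",
      "coursePrice": "¥299"
    }
  ]
}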
The complete code
import superagent from 'superagent';
import cheerio from 'cheerio';
import fs from 'fs';

interface Course {
  courseName: string;
  courseType: string;
  coursePrice: string;
}

class Grabcourse {
  // Store the target web page address
  private url: string = 'https://coding.imooc.com/?c=fe&sort=0&unlearn=0&page=1';
  private courseItems: Course[] = [];
  // Get the HTML of the page to be crawled
  async getHtml() {
    const courseHtml = await superagent.get(this.url);
    return courseHtml.text;
  }
  // Parse the HTML
  async loadhtml(html: string) {
    return cheerio.load(html);
  }
  // Get course information
  async getCourseInfo($element: any) {
    $element('.course-list li').each((idx: any, ele: any) => {
      const courseName = $element(ele).attr('data-name');
      const courseType = $element(ele).find('.one').text().replace(/\s/g, '');
      const coursePrice = $element(ele).find('.two .price').text();
      this.courseItems.push({
        courseName,
        courseType,
        coursePrice,
      });
    });
    return this.courseItems;
  }
  // Save the courses we fetched
  async saveCourseItems(result: Course[]) {
    const data = {
      course: result
    };
    fs.writeFile('./course.json', JSON.stringify(data), (err: any) => {
      if (err) {
        console.error(err);
        return;
      }
      console.log('File write succeeded.');
    });
  }
  async initSpider() {
    const html = await this.getHtml();
    const $element = await this.loadhtml(html);
    const courseItems = await this.getCourseInfo($element);
    this.saveCourseItems(courseItems);
  }
  constructor() {
    this.initSpider();
  }
}

new Grabcourse();