Introduction

This article is a practical tutorial on crawling the Douban Movie Top 250 list and storing the data in a MySQL database.

The main points and difficulties are: implementing concurrent HTTP requests in Go, designing regular expressions to match and extract the data, and storing the crawled data in a MySQL database.

  1. Identify the target URL to crawl

The official address of the Top 250 list is movie.douban.com/top250. Each page displays 25 movies; after clicking the second-page button at the bottom, the URL parameters change to ?start=25&filter=, and on the third page to ?start=50&filter=, and so on.

So the 250 movies are spread across 10 pages. To capture all of them, we only need to generate the URLs of these 10 pages in a loop. Here are the links for each page:

```
https://movie.douban.com/top250?start=0&filter=     # page 1, equivalent to https://movie.douban.com/top250
https://movie.douban.com/top250?start=25&filter=    # page 2
https://movie.douban.com/top250?start=50&filter=    # page 3
...
https://movie.douban.com/top250?start=225&filter=   # page 10
```

Analyzing the link addresses gives us the rule: start=0 shows movies 1 to 25, start=25 shows movies 26 to 50, and so on. To capture the next page, only the value of start needs to change.
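
As a quick illustration of that rule, the 10 links can be generated with a simple loop, where start = (page - 1) * 25:

```go
package main

import "fmt"

func main() {
	// Page 1 starts at offset 0, page 2 at 25, ..., page 10 at 225
	for page := 1; page <= 10; page++ {
		fmt.Printf("https://movie.douban.com/top250?start=%d&filter=\n", (page-1)*25)
	}
}
```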

  2. Send the request and get the response data

Create a new main.go file and a main() function that controls the loop used to generate the URL parameters:

```go
func main() {
	start := 1
	end := 10
	channel := make(chan int)
	for i := start; i <= end; i++ {
		SpiderDouBan(i, channel)
	}
}
```

The next step is to create a SpiderDouBan() function that acts as the crawl controller for a single page.

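The full implementation is in the repository linked at the end of this article; the following is only a minimal sketch of the controller's overall shape. It assumes the Request() helper defined in the next step and a hypothetical parseMovies() wrapper around the regular-expression extraction from step 3:

```go
func SpiderDouBan(index int, ch chan int) {
	// Page index (1..10) starts at offset (index-1)*25
	rawUrl := fmt.Sprintf("https://movie.douban.com/top250?start=%d&filter=", (index-1)*25)
	headers := map[string]string{
		"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
	}

	result, err := Request(rawUrl, "GET", nil, headers, 10)
	if err != nil {
		log.Println("page", index, "request failed:", err)
		ch <- index
		return
	}

	movies := parseMovies(result) // hypothetical helper: the regex extraction from step 3
	insert(movies)                // store into MySQL, see step 4

	ch <- index // report completion (used by the concurrent version of main() in step 4)
}
```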

Next, encapsulate the HTTP request in a Request() method. Many websites have anti-crawler measures: a request without header information may be treated as a crawler request and blocked, so every time we crawl a page we add some headers.

For frequent requests, the client's IP may also be banned by the server. Setting up proxies can make requests appear to come from different IP addresses, provided the proxy IPs are actually valid.

```go
// rawUrl   request URL
// method   request method, GET/POST
// bodyMaps request body content
// headers  request header information
// timeout  timeout (in seconds)
func Request(rawUrl, method string, bodyMaps, headers map[string]string, timeout time.Duration) (result string, err error) {
	if timeout <= 0 {
		timeout = 5
	}
	client := &http.Client{
		Timeout: timeout * time.Second, // timeout is interpreted as seconds
	}

	// URL-encoded payload
	data := url.Values{}
	for key, value := range bodyMaps {
		data.Set(key, value)
	}
	request, err1 := http.NewRequest(method, rawUrl, strings.NewReader(data.Encode()))
	if err1 != nil {
		err = err1
		return
	}

	// Add header information
	for key, val := range headers {
		request.Header.Set(key, val)
	}

	response, err3 := client.Do(request)
	if err3 != nil {
		err = err3
		return
	}
	defer response.Body.Close()
	if response.StatusCode != http.StatusOK {
		return "", fmt.Errorf("get content failed status code is %d ", response.StatusCode)
	}

	res, err2 := ioutil.ReadAll(response.Body)
	if err2 != nil {
		err = err2
		return
	}
	return string(res), nil
}
```
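
A minimal usage sketch follows; the header values are illustrative rather than required, and the timeout argument is in seconds (Request() multiplies it by time.Second):

```go
headers := map[string]string{
	"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
	"Referer":    "https://movie.douban.com/top250",
}
result, err := Request("https://movie.douban.com/top250?start=0&filter=", "GET", nil, headers, 10)
if err != nil {
	log.Fatal(err)
}
fmt.Println(len(result)) // size of the returned HTML
```

If a proxy is needed as well, the http.Client created inside Request() can be given an http.Transport whose Proxy field is set with http.ProxyURL(), as long as the proxy address is actually reachable.
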
  3. Filter the tags and extract each movie's useful information

Through the function encapsulated above we can obtain all of the page data for the Douban Movie Top 250. Looking at the page source, we can see that the tags for each movie in the HTML are fixed; we only need to find the rules of this fixed format, write regular expressions, and extract the movie name, starring actors, image, score, and other information.

Here is the page source for one of the movies:

```html
<li>
    <div class="item">
        <div class="pic">
            <em class="">1</em>
            <a href="https://movie.douban.com/subject/1292052/">
                <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.webp" class="">
            </a>
        </div>
        <div class="info">
            <div class="hd">
                <a href="https://movie.douban.com/subject/1292052/" class="">
                    <span class="title">The Shawshank Redemption</span>
                    <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
                    <span class="other">&nbsp;/&nbsp;Black Fly (Hong Kong) / Exciting 1995 (Taiwan)</span>
                </a>
                <span class="playable">[playable]</span>
            </div>
            <div class="bd">
                <p class="">
                    Director: Frank Darabont&nbsp;&nbsp;&nbsp;Starring: Tim Robbins /...<br>
                    1994&nbsp;/&nbsp;United States&nbsp;/&nbsp;Crime Drama
                </p>
                <div class="star">
                    <span class="rating5-t"></span>
                    <span class="rating_num" property="v:average">9.7</span>
                    <span property="v:best" content="10.0"></span>
                    <span></span>
                </div>
                <p class="quote">
                    <span class="inq">Hope makes a man free.</span>
                </p>
            </div>
        </div>
    </div>
</li>
```

You can see that the tags for every movie are inside an `<li>` node. You can first use the online webmaster tool at tool.oschina.net/regex/ to test whether the data extracted by the regular expressions you write is correct:

The `div` node with class `pic` contains the movie's ranking number and poster image; the regular expression to extract them is:

```go
regExp := `<div class="item">[\s\S]*?<div class="pic">[\s\S]*?<em class="">(.*?)<\/em>[\s\S]*?<a href=".*?">[\s\S]*?<img.*?src="(.*?)"`
```

The `hd` node inside `<div class="info">` contains the movie's title and its other (alias) names; extending the regular expression gives:

```go
regExp := `<div class="item">[\s\S]*?<div class="pic">[\s\S]*?<em class="">(.*?)<\/em>[\s\S]*?<a href=".*?">[\s\S]*?<img.*?src="(.*?)"[\s\S]*?div class="info[\s\S]*?class="hd"[\s\S]*?class="title">(.*?)<\/span>[\s\S]*?class="other">(.*?)<\/span>`
```

The `<p>` tag inside `<div class="bd">` contains the director and starring-actor information as well as the year, country, and genre. The two parts are separated by a `<br>` line break, so they can be captured as two separate groups:

```go
regExp := `<div class="item">[\s\S]*?<div class="pic">[\s\S]*?<em class="">(.*?)<\/em>[\s\S]*?<a href=".*?">[\s\S]*?<img.*?src="(.*?)"[\s\S]*?div class="info[\s\S]*?class="hd"[\s\S]*?class="title">(.*?)<\/span>[\s\S]*?class="other">(.*?)<\/span>[\s\S]*?<div class="bd">[\s\S]*?<p class=".*?">([\s\S]*?)<br>([\s\S]*?)<\/p>`
```

The `star` node contains the movie's star rating and score data. Extracting them follows the same pattern analyzed above, so the final regular expression is:

```go
regExp := `<div class="item">[\s\S]*?<div class="pic">[\s\S]*?<em class="">(.*?)<\/em>[\s\S]*?<a href=".*?">[\s\S]*?<img.*?src="(.*?)"[\s\S]*?div class="info[\s\S]*?class="hd"[\s\S]*?class="title">(.*?)<\/span>[\s\S]*?class="other">(.*?)<\/span>[\s\S]*?<div class="bd">[\s\S]*?<p class=".*?">([\s\S]*?)<br>([\s\S]*?)<\/p>[\s\S]*?span class="rating_num".*?average">(.*?)<\/span>`
```

The complete code to extract all the movie data on each page is as follows:

```go
regExp := `<div class="item">[\s\S]*?<div class="pic">[\s\S]*?<em class="">(.*?)<\/em>[\s\S]*?<a href=".*?">[\s\S]*?<img.*?src="(.*?)"[\s\S]*?div class="info[\s\S]*?class="hd"[\s\S]*?class="title">(.*?)<\/span>[\s\S]*?class="other">(.*?)<\/span>[\s\S]*?<div class="bd">[\s\S]*?<p class=".*?">([\s\S]*?)<br>([\s\S]*?)<\/p>[\s\S]*?span class="rating_num".*?average">(.*?)<\/span>`

// Compile the regular expression
find := regexp.MustCompile(regExp)

// result is the HTML data obtained in the previous step;
// -1 means return every match, as a two-dimensional slice of submatches
content := find.FindAllStringSubmatch(result, -1)
```

The `content` variable is a `[][]string` two-dimensional slice, so we can traverse it with a `for` statement to obtain the useful information.
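
As a quick check, such a loop might print each captured field. The indexes below assume the seven capture groups of the final regular expression above, with index 0 being the full match:

```go
for _, val := range content {
	fmt.Println("rank:  ", val[1])
	fmt.Println("image: ", val[2])
	fmt.Println("title: ", val[3])
	fmt.Println("other: ", val[4])
	fmt.Println("crew:  ", strings.TrimSpace(val[5])) // director / starring, before <br>
	fmt.Println("info:  ", strings.TrimSpace(val[6])) // year / country / genre, after <br>
	fmt.Println("score: ", val[7])
}
```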

  4. Store the valid data obtained from the analysis

Create a MySQL table to store the crawled data:

```sql
CREATE TABLE `top250` (
  `id` int(20) NOT NULL AUTO_INCREMENT,
  `title` varchar(20) DEFAULT '',
  `image` varchar(100) DEFAULT '',
  `subtitle` varchar(255) DEFAULT '',
  `other` varchar(255) DEFAULT NULL,
  `personnel` varchar(255) DEFAULT '',
  `info` varchar(255) DEFAULT '',
  `score` varchar(10) DEFAULT '',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4;
```

We use the standard database/sql package together with a MySQL driver (commonly github.com/go-sql-driver/mysql).
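
The article does not show the import block; assuming the go-sql-driver/mysql driver, it has to be blank-imported so that it registers itself with database/sql under the name "mysql":

```go
import (
	"database/sql"
	"fmt"
	"log"
	"strings"

	_ "github.com/go-sql-driver/mysql" // registers the "mysql" driver used by sql.Open below
)
```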

```go
func insert(movies [][]string) {
	// Placeholder groups "(?,?,...)" and the flattened list of values
	valueStrings := make([]string, 0)
	valueArgs := make([]interface{}, 0)
	for _, val := range movies {
		// One placeholder group per movie; the number of ? must match the values appended below
		valueStrings = append(valueStrings, "(?,?,?,?,?,?,?,?)")
		// val[0] is the full regex match, so the captured fields start at index 1
		for i := 1; i < len(val); i++ {
			valueArgs = append(valueArgs, val[i])
		}
	}

	db, err := sql.Open("mysql", "root:root@tcp(127.0.0.1:3306)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Batch INSERT statement
	sqlName := fmt.Sprintf("INSERT INTO `top250` (id,title,image,subtitle,other,personnel,info,score) VALUES %s", strings.Join(valueStrings, ","))
	fmt.Println(sqlName)

	stmt, err := db.Prepare(sqlName)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(valueArgs)
	res, err := stmt.Exec(valueArgs...)
	if err != nil {
		log.Fatal(err)
	}
	lastId, err := res.LastInsertId()
	if err != nil {
		log.Fatal(err)
	}
	rowCnt, err := res.RowsAffected()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("ID = %d, affected = %d\n", lastId, rowCnt)
}
```

We only need to create a chan channel in main() to call SpiderDouBan() concurrently:

```go
channel := make(chan int)
for i := start; i <= end; i++ {
	go SpiderDouBan(i, channel)
}
for i := start; i <= end; i++ {
	fmt.Println("Task " + strconv.Itoa(<-channel) + " finished")
}
```

Here, at the end of the SpiderDouBan() function, we send data to the chan channel:

```go
func SpiderDouBan(index int, ch chan int) {
	// ... omitted code ...
	ch <- index
}
```

The complete Go code for crawling the Douban Top 250 movies can be found at gitee.com/lisgroup/go…


Original link: www.sdk.cn/details/djg…
