Preface

Some time ago, I received a project to develop a website data crawler. As a PHP developer, I immediately thought of using PHP for the crawler. Although Python crawlers are convenient, PHP is not weak in this area either, because PHP is the best language in the world! Here I recommend a PHP crawler framework, PHPSpider. Writing a crawler entirely from scratch is not recommended, because it is too inefficient; using a framework is much more productive.

Official documentation:

https://doc.phpspider.org/


1. Download

Official GitHub download address:

https://github.com/owner888/phpspider

The GitHub address may not be accessible everywhere, so here is a web disk download address as a backup:

https://pan.baidu.com/s/10n9ZOUQBlrJzOQx0ShOmMQ

Extraction code: B2ZC


2. File structure

After downloading and decompressing, the phpSpider file structure looks like this:
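A sketch of the layout, reconstructed from the paths used in the examples below (not an exhaustive listing):

phpspider/
├── autoloader.php    // class autoloader required by every crawler script
├── core/             // framework classes (phpspider, requests, selector, ...)
├── demo/             // example crawlers; our own scripts go here
└── data/             // default export target for crawled data (e.g. abc.csv)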


The demo folder contains a number of phpspider examples.


3. Create a crawler and run it

Create a crawler file in the demo folder. Note that phpspider has two ways to run crawlers: one is on the command line; the other is visual operation (running in a browser).


3.1 Running a crawler file on the command line

Target link to crawl:

https://www.douban.com/photos/album/1616649448/

The content to crawl is everything contained in the div whose id is wrapper.


3.1.1 Create a file called spider.php in the demo folder. The code is as follows:


<?php
require_once __DIR__ . '/../autoloader.php';
use phpspider\core\phpspider;

/* Do NOT delete this comment */

$configs = array(
    'name' => 'douban',          // Name of the current crawler
    'log_show' => true,          // Display log debugging information
    'input_encoding' => 'UTF-8', // Input encoding

    // Domains the crawler is allowed to crawl; URLs outside these
    // domains are ignored, which improves crawling speed
    'domains' => array(
        'www.douban.com'
    ),

    // Entry links from which the crawler starts crawling
    'scan_urls' => array(
        'https://www.douban.com/photos/album/1616649448/'
    ),

    // Export settings for the crawled data
    'export' => array(
        'type' => 'csv',             // Export type: csv, sql, or db
        'file' => '../data/abc.csv', // Path of the CSV or SQL file; created automatically if it does not exist
    ),

    // Extraction rules for content pages
    'fields' => array(
        array(
            'name' => "wrapper",
            'selector' => "//div[@id='wrapper']"
        )
    )
);

$spider = new phpspider($configs);
$spider->start();



3.1.2 Open a CMD window directly in the demo folder and run the crawler from the command line, as shown below.
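Assuming PHP is installed and available on your PATH (an assumption about your environment), the command is:

php -f spider.php

The crawler then prints its log output directly in the terminal, since 'log_show' is set to true in the config above.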



3.1.3 Viewing the Crawled Data

Find the abc.csv file in the data folder of the phpspider directory and open it to see the extracted data.
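If you prefer to inspect the export from a script instead of opening the file by hand, here is a minimal sketch that assumes the '../data/abc.csv' export path configured above and that the script lives in the demo folder:

<?php
// Minimal sketch: print every row of the exported CSV file
$handle = fopen(__DIR__ . '/../data/abc.csv', 'r');
if ($handle === false) {
    die("Could not open the export file\n");
}
while (($row = fgetcsv($handle)) !== false) {
    print_r($row); // each row is an array of field values
}
fclose($handle);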


3.2 Visual operation (running a crawler file in the browser)

Target link to crawl:

https://movie.douban.com/subject/26588308/?from=showing

The content to crawl is everything contained in the div whose class is nav-items.


3.2.1 Create another file called test.php in the demo folder as follows:

<?php
header("Content-Type: text/html; charset=utf-8");
date_default_timezone_set("Asia/Shanghai");
ini_set("memory_limit", "10240M");

require_once __DIR__ . '/../autoloader.php';
use phpspider\core\phpspider;
use phpspider\core\requests;
use phpspider\core\selector;

/* Do NOT delete this comment */

// Fetch the page and extract the div whose class is nav-items
$html = requests::get('https://movie.douban.com/subject/26588308/?from=showing');
$data = selector::select($html, "//div[@class='nav-items']");
echo $data;


3.2.2 Open a browser and visit the file's address
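For example, if the decompressed phpspider folder sits directly in your web server's document root (an assumption; adjust the path to your own setup), the address would look something like:

http://localhost/phpspider/demo/test.php

The browser should then display the contents of the extracted nav-items div.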


Conclusion

The above are just simple crawler examples; phpspider also supports multi-process crawling, proxy crawling, and plenty more fun features (a hypothetical sketch follows below the link). For more operations, refer to the official documentation:

https://doc.phpspider.org/
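As a small taste of the multi-process support, here is a hypothetical configuration sketch; 'tasknum' is the option name I recall from the official documentation, so verify it there before relying on it:

$configs = array(
    'name' => 'douban',
    'tasknum' => 8, // number of concurrent worker tasks (verify the option name in the docs)
    // ... the remaining options as in the spider.php example above
);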


Finally

If you found this article helpful, please follow and give it a thumbs-up!

If you have questions about the article or want to exchange ideas on technology, you can follow the WeChat public account [GitWeb] and explore and learn with me!