“This is the 26th day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021.”

Besides server-side development, PHP can also be used to crawl data, just like Python.

I think there are two ways to crawl data:

The first is to request a data interface and parse the JSON it returns.

The other is to fetch the web page, parse its DOM tree, and extract the data from it.

The PHP library simple_html_dom is used to parse the DOM tree.

Parsing JSON

Parsing JSON is easy to understand; any development language that can make network requests supports it.

The idea is to request the target link and get back the data it returns (JSON or any other format; JSON is the most common).

Once we have the JSON data, we parse it and process it according to our own needs.

This is fairly general, so I won’t dwell on it.
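Still, for completeness, here is a minimal sketch of the idea. The `parseList` helper and the payload shape are made up for illustration; in real use the JSON string would come from a request wrapper like the curl utility shown later.

```php
<?php
// Hypothetical helper: decode a list-style JSON payload and keep only
// the fields we care about. The payload shape is an assumption.
function parseList(string $json): array
{
    $data = json_decode($json, true);            // decode into an associative array
    if (json_last_error() !== JSON_ERROR_NONE) {
        return array();                          // bail out on malformed JSON
    }
    $items = array();
    foreach (isset($data['list']) ? $data['list'] : array() as $row) {
        $items[] = array(
            'name'  => isset($row['name']) ? $row['name'] : '',
            'price' => isset($row['price']) ? $row['price'] : 0,
        );
    }
    return $items;
}

// Simulated response, standing in for the body returned by the data interface
$sample = '{"list":[{"name":"foo","price":12},{"name":"bar","price":34}]}';
var_dump(parseList($sample));
```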

Parsing the DOM tree

Let’s focus on how PHP crawls the target site and parses its DOM tree.

Here’s an example:

Utility class code:

Caching is used: in my scenario the data returned for a given URL never changes, so responses are cached.

A URL that has already been requested is not fetched again; the content is read from the cache instead.

The sleep() call guards against network problems on the target site and keeps the request frequency under control.

```php
public static function crawlContent($url, $encode = true)
{
    $file_name = '../cache/' . md5($url);
    if (!file_exists($file_name)) {
        @touch($file_name);
    }
    $content = file_get_contents($file_name);
    if (empty($content)) {
        $content = Request::curl($url);
        if (empty($content)) {
            // Retry once after a random pause to ride out transient network issues
            sleep(rand(3, 10));
            $content = Request::curl($url);
        }
        $encode && $content = iconv("GBK", "UTF-8//IGNORE", $content);
        file_put_contents($file_name, $content);
    }
    return $content;
}
```

The wrapped curl utility:

```php
public static function curl($url, $configs = array())
{
    $b = microtime(true);
    $new_ch = curl_init();
    self::_setopt($new_ch, $url, $configs);
    $result = curl_exec($new_ch);
    $e = microtime(true);
    if (curl_errno($new_ch)) {
        Logger::log("CURL_BAD\t" . "curl_errno:" . curl_errno($new_ch) . "\t"
            . curl_error($new_ch) . "\t" . ($e - $b) . "\t" . $url);
    }
    curl_close($new_ch);
    return $result;
}
```
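The `self::_setopt` helper called by the wrapper isn’t shown in the article; here is a hypothetical sketch of what it might configure, written as a standalone function. All option values and the `$configs` keys are assumptions.

```php
<?php
// Hypothetical stand-in for the article's Request::_setopt helper;
// the exact options and defaults are assumptions, tune them per target.
function setopt_defaults($ch, string $url, array $configs = array()): void
{
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_USERAGENT,
        isset($configs['user_agent']) ? $configs['user_agent'] : 'Mozilla/5.0');
    // Let callers override or extend any option via $configs
    if (isset($configs['curl_opts']) && is_array($configs['curl_opts'])) {
        curl_setopt_array($ch, $configs['curl_opts']);
    }
}
```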

Sample code for parsing a DOM tree

Using the utility class above, we fetch the target page’s HTML.

Pass that string to the str_get_html function to get a simple_html_dom object.

Then extract values the same way simple_html_dom normally queries elements.

For example:

```php
<?php
// Find by id
$ret = $html->find('#container');
// Find by class
$ret = $html->find('.foo');
// Multiple selectors also work
$ret = $html->find('a, img');
// As do attribute selectors
$ret = $html->find('a[title], img[title]');
?>
```

More query methods are covered in the official documentation; the library is simple and easy to use, so I won’t repeat them here.

My sample code

Here is my sample code, along with some issues to be aware of when crawling data:

  1. If capturing images is involved, it’s best to upload them to your own cloud storage

  2. Control the crawl frequency reasonably (handled in the utility class)

  3. Use proxies appropriately (handled in the utility class)
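The `Utils::curlProxy` helper used in the sample isn’t included in the article; here is a hypothetical sketch, assuming a hardcoded proxy pool picked from at random. The proxy addresses are placeholders, and random rotation is just one possible strategy.

```php
<?php
// Pick a proxy at random from a pool; one possible rotation strategy.
function pickProxy(array $proxies): string
{
    return $proxies[array_rand($proxies)];
}

// Hypothetical stand-in for Utils::curlProxy; the proxy list is a placeholder.
function curlProxy(string $url, bool $encode = true)
{
    $proxies = array('127.0.0.1:8001', '127.0.0.1:8002');
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_PROXY, pickProxy($proxies)); // route the request through a proxy
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $result = curl_exec($ch);
    curl_close($ch);
    if ($encode && $result !== false) {
        $result = iconv("GBK", "UTF-8//IGNORE", $result); // match crawlContent's GBK handling
    }
    return $result;
}
```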

```php
<?php
include '../include/Config.php';
include '../include/Db.php';
include '../include/Logger.php';
include '../include/Request.php';
include '../include/simple_html_dom.php';
include '../include/Utils.php';

$db_aliExpress = new Db($db_xxxx);
parseContent($db_aliExpress);

function parseContent($spider)
{
    for ($p = 9; $p < 50; $p++) {
        $url = 'https://www.xxxxx.com/store/5435064/search/' . $p . '.html?SortType=bestmatch_sort';
        var_dump($url);
        $m_content = Utils::curlProxy($url, false);
        $detail_html = str_get_html($m_content);
        if ($detail_html) {
            $content = $detail_html->find('.m-o-large-all-detail .ui-box .ui-box-body ul li');
            if ($content) {
                $counts = count($content);
                for ($i = 0; $i < $counts; $i++) {
                    $name = trim($detail_html->find('.m-o-large-all-detail .ui-box .ui-box-body ul li .detail h3 a', $i)->plaintext);
                    // url
                    $href = 'https:' . trim($detail_html->find('.m-o-large-all-detail .ui-box .ui-box-body ul li .detail h3 a', $i)->href);
                    preg_match_all('/[1-9]\d*/', $href, $matches);
                    $source_id = $matches[0][0];
                    // price
                    $price = trim($detail_html->find('.m-o-large-all-detail .ui-box .ui-box-body ul li .cost b', $i)->plaintext);
                    $data = array(
                        'name'       => $name,
                        'source_url' => $href,
                        'source_id'  => $source_id,
                        'price'      => $price,
                        'created_at' => date('Y-m-d H:i:s'),
                        'date'       => date('m-d'),
                    );
                    $id = $spider->insert('products', $data);
                    var_dump("products_" . $id);
                    if ($id) {
                        catchThumbs($id, $href, $spider);
                    }
                }
            }
        }
    }
}

function catchThumbs($id, $detail_url, $spider)
{
    $m_content = Utils::curlProxy($detail_url, false);
    // Cut the imagePathList JSON array out of the page source
    $before = strpos($m_content, 'imagePathList');
    $after = strpos($m_content, 'ImageModule', 1);
    $count = $after - $before - 24;
    $thumbs = substr($m_content, $before + 15, $count);
    $data = array(
        'thumbs'     => $thumbs,
        'updated_at' => date('Y-m-d H:i:s'),
    );
    $res = $spider->update('products', $data, 'id = ' . $id);
    var_dump($res . " id=" . $id);
    if ($res) {
        // Download the images
        downloadThumbs($id, $thumbs, $spider);
    }
}

function downloadThumbs($id, $thumbs, $spider)
{
    $Ymd = date('md');
    $thumbs = json_decode($thumbs, true);
    foreach ($thumbs as $key => $thumb) {
        $image_name = $Ymd . '-' . $id . '-' . ($key + 1) . '.jpg';
        $image_name = './pics/' . $image_name;
        saveImage($thumb, $image_name);
    }
}

/**
 * Download an image from the Internet and save it to the server
 * @param $path        image url
 * @param $image_name  path to save it under, e.g. './public/upload/users_avatar/' . time()
 */
function saveImage($path, $image_name)
{
    $ch = curl_init($path);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    $img = curl_exec($ch);
    curl_close($ch);
    $fp = fopen($image_name, 'w');
    fwrite($fp, $img);
    fclose($fp);
}
```

Summary and attention

That’s how to crawl data with PHP.

Note: It is illegal to crawl data without authorization!

This article only shows how to implement crawling in PHP. The internet is not a lawless place; use crawling technology with caution.

Feel free to share your thoughts in the comments section.

Recommended hardcore articles

PHP to Go mid 2021 summary

How to be the first to know when an interface errors out, so QA doesn’t have to ask whether your interface is down.

Git in practice: a Git operation guide for multi-person collaborative development and emergency online bug fixes.

Performance tuning reflection: Do not manipulate DB in a for loop

Performance tuning reflection: Do not manipulate DB in a for loop (advanced version)

Finally

๐Ÿ‘๐Ÿป : feel the harvest please point a praise to encourage!

๐ŸŒŸ : Collect articles, easy to look back!

๐Ÿ’ฌ : Comment exchange, mutual progress!