“This is the 26th day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021.”
In addition to server-side development, PHP can also be used to crawl data, just like Python.
I see two ways to crawl data:
The first is to request a data API and parse the JSON it returns.
The other is to fetch the web page's DOM tree, parse it, and extract the data from it.
A PHP library, simple_html_dom, is used to parse the DOM tree.
Parsing JSON
Parsing JSON is easy to understand, and any development language that can make network requests supports it.
The idea is to request the crawl target and get back JSON (or some other format; JSON is just the most common), then parse it and process the data according to your own needs.
This approach is fairly universal.
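For concreteness, here is a minimal sketch of the JSON approach (the endpoint and field names are hypothetical, and the response is simulated so the sketch runs offline):

```php
<?php
// Pull the fields we need out of a JSON response.
// In real use the input would come from a request, e.g.:
//   $json = file_get_contents('https://api.example.com/items');  // hypothetical endpoint
function extractNames($json)
{
    $data = json_decode($json, true);
    if ($data === null || !isset($data['items'])) {
        return array(); // bad JSON or unexpected shape
    }
    $names = array();
    foreach ($data['items'] as $item) {
        $names[] = $item['name'];
    }
    return $names;
}

// Simulated API response, in place of a real network call
$json = '{"items":[{"name":"foo"},{"name":"bar"}]}';
var_dump(extractNames($json)); // array("foo", "bar")
```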
Parsing the DOM tree
The focus here is on how PHP crawls the target site and parses its DOM tree.
Here’s an example:
Utility class code:
Caching is used: in my scenario the data returned for a given URL never changes, so results are cached by URL.
A URL that has already been requested is not fetched again; its content is read from the cache instead.
The purpose of sleep() is to smooth over transient network problems on the target site and to throttle the request frequency.
```php
public static function crawlContent($url, $encode = true)
{
    // Cache file keyed by the URL's md5
    $file_name = '../cache/' . md5($url);
    if (!file_exists($file_name)) {
        @touch($file_name);
    }
    $content = file_get_contents($file_name);
    if (empty($content)) {
        $content = Request::curl($url);
        if (empty($content)) {
            // Back off and retry once in case of a transient network problem
            sleep(rand(3, 10));
            $content = Request::curl($url);
        }
        // Optionally convert the page from GBK to UTF-8
        $encode && $content = iconv("GBK", "UTF-8//IGNORE", $content);
        file_put_contents($file_name, $content);
    }
    return $content;
}
```
The wrapped curl tool
```php
public static function curl($url, $configs = array())
{
    $b = microtime(true);
    $new_ch = curl_init();
    self::_setopt($new_ch, $url, $configs);
    $result = curl_exec($new_ch);
    $e = microtime(true);
    if (curl_errno($new_ch)) {
        // Log the error code, message, elapsed time, and URL
        Logger::log("CURL_BAD\tcurl_errno:" . curl_errno($new_ch) . "\t" . curl_error($new_ch) . "\t" . ($e - $b) . "\t" . $url);
    }
    curl_close($new_ch);
    return $result;
}
```
Sample code for parsing a DOM tree
With the utility class above we fetch the target page's content.
Pass that string to the str_get_html function to get a DOM object.
Then read values the same way simple_html_dom reads elements.
For example:
```php
<?php
// Find by id
$ret = $html->find('#container');
// Find by class
$ret = $html->find('.foo');
// Multiple selectors at once also work
$ret = $html->find('a, img');
// As do attribute selectors
$ret = $html->find('a[title], img[title]');
?>
```
More accessor methods can be found in the official documentation; they are simple and easy to use, so I won't repeat them here.
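Once find() has matched elements, values are read off them as properties. A short sketch (assuming simple_html_dom is included and $html holds a page parsed with str_get_html; not runnable standalone):

```php
// Iterate all matched elements
foreach ($html->find('a') as $a) {
    echo $a->plaintext . "\n"; // inner text of the element
    echo $a->href . "\n";      // attributes are exposed as properties
}

// Or grab the i-th match directly via the second argument
$first = $html->find('div.item', 0);
if ($first) {
    echo $first->innertext;    // inner HTML
}
```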
My sample code
Here is my sample code, along with some of the issues to watch out for when crawling data:
- If capturing images is involved, it's best to upload them to your own cloud storage
- Control the crawl frequency sensibly (wrapped in the utility class)
- Use proxies properly (wrapped in the utility class)
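The article's Utils::curlProxy() itself is not shown; here is a minimal sketch of what a proxied curl request might look like (the function name and proxy address are placeholders, not from the original):

```php
// curl request routed through a proxy
function curlWithProxy($url, $proxy)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);      // e.g. '127.0.0.1:8888' (placeholder)
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // fail fast on dead proxies
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
```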
```php
<?php
include '../include/Config.php';
include '../include/Db.php';
include '../include/Logger.php';
include '../include/Request.php';
include '../include/simple_html_dom.php';
include '../include/Utils.php';

$db_aliExpress = new Db($db_xxxx);
parseContent($db_aliExpress);

function parseContent($spider)
{
    for ($p = 9; $p < 50; $p++) {
        $url = 'https://www.xxxxx.com/store/5435064/search/' . $p . '.html?SortType=bestmatch_sort';
        var_dump($url);
        $m_content = Utils::curlProxy($url, false);
        $detail_html = str_get_html($m_content);
        if ($detail_html) {
            $content = $detail_html->find('.m-o-large-all-detail .ui-box .ui-box-body ul li');
            if ($content) {
                $counts = count($content);
                for ($i = 0; $i < $counts; $i++) {
                    $name = trim($detail_html->find('.m-o-large-all-detail .ui-box .ui-box-body ul li .detail h3 a', $i)->plaintext);
                    // Product url
                    $href = 'https:' . trim($detail_html->find('.m-o-large-all-detail .ui-box .ui-box-body ul li .detail h3 a', $i)->href);
                    preg_match_all('/[1-9]\d*/', $href, $matches);
                    $source_id = $matches[0][0];
                    // Price
                    $price = trim($detail_html->find('.m-o-large-all-detail .ui-box .ui-box-body ul li .cost b', $i)->plaintext);
                    $data = array(
                        'name'       => $name,
                        'source_url' => $href,
                        'source_id'  => $source_id,
                        'price'      => $price,
                        'created_at' => date('Y-m-d H:i:s'),
                        'date'       => date('m-d'),
                    );
                    $id = $spider->insert('products', $data);
                    var_dump("products_" . $id);
                    if ($id) {
                        catchThumbs($id, $href, $spider);
                    }
                }
            }
        }
    }
}

function catchThumbs($id, $detail_url, $spider)
{
    $m_content = Utils::curlProxy($detail_url, false);
    // Cut the imagePathList value out of the raw page source by offsets
    $before = strpos($m_content, 'imagePathList');
    $after = strpos($m_content, 'ImageModule', 1);
    $count = $after - $before - 24;
    $thumbs = substr($m_content, $before + 15, $count);
    $data = array(
        'thumbs'     => $thumbs,
        'updated_at' => date('Y-m-d H:i:s'),
    );
    $res = $spider->update('products', $data, 'id = ' . $id);
    var_dump($res . " id=" . $id);
    if ($res) {
        // Download the images
        downloadThumbs($id, $thumbs, $spider);
    }
}

function downloadThumbs($id, $thumbs, $spider)
{
    $Ymd = date('md');
    $thumbs = json_decode($thumbs, true);
    foreach ($thumbs as $key => $thumb) {
        $image_name = $Ymd . '-' . $id . '-' . ($key + 1) . '.jpg';
        $image_name = './pics/' . $image_name;
        saveImage($thumb, $image_name);
    }
}

/**
 * Download an image from the internet and save it to the server
 * @param $path       image url
 * @param $image_name path to save on the server, e.g. './public/upload/users_avatar/' . time()
 */
function saveImage($path, $image_name)
{
    $ch = curl_init($path);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    $img = curl_exec($ch);
    curl_close($ch);
    $fp = fopen($image_name, 'w');
    fwrite($fp, $img);
    fclose($fp);
}
```
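The offset arithmetic in catchThumbs() is brittle: any change in the page markup around `imagePathList` breaks it. A regex-based sketch is one more robust alternative (the pattern assumes the page embeds something like `imagePathList: ["...","..."]`; adjust it to the real page source):

```php
<?php
// Extract the JSON array that follows "imagePathList" in the raw page source.
// The surrounding markup is an assumption, not taken from the real site.
function extractImagePathList($page)
{
    if (preg_match('/imagePathList["\']?\s*[:=]\s*(\[[^\]]*\])/', $page, $m)) {
        return json_decode($m[1], true);
    }
    return null; // marker not found
}

// Simulated page fragment
$page = 'var x = { imagePathList: ["https://img.example.com/1.jpg","https://img.example.com/2.jpg"], ImageModule: {} };';
var_dump(extractImagePathList($page));
```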
Summary and attention
That's the gist of crawling data with PHP.
Note: It is illegal to crawl data without authorization!
This article only presents ideas for implementing a crawler in PHP. The online world is not beyond the law, so use crawling techniques with caution.
Feel free to share your thoughts in the comments section.
Hardcore articles recommended
PHP to Go mid 2021 summary
How to be the first to know about an interface error, so QA doesn't have to ask whether your interface is down.
Git use actual combat: collaborative development of many people, emergency repair online bug Git operation guide.
Performance tuning reflection: Do not manipulate DB in a for loop
Performance tuning reflection: Do not operate on the DB in a for loop (advanced version)
The last
Like: if you got something out of this, a like is great encouragement!
Bookmark: save the article so it's easy to come back to!
Comment: exchanges in the comments help us all improve!