Collection target

The title, content, publish time, author, and other information from a WeChat article page.

Sample URL

https://mp.weixin.qq.com/s?src=11&timestamp=1523173327&ver=803&signature=6PCxJ*3ojH2ZM8pm56Lquward0mQMwSkPnqCvYlrDkQmL2kAEjGcFJMj2lzvpHyuyT30lczb2Ld0npUWmp*2Gj7bPJY3SCWrpRKlXJA0p4eQWPpAzMPJVmxPcRV5TtLS&new=1

Illustration of the content area to collect

Analyze the content selector

Use the browser's developer tools to work out the selectors for the regions to collect. The process is not covered in detail here; if it is unfamiliar, read up on jQuery selectors and CSS selectors first. The analysis is shown in the figure below:

Analysis results:

  • Title selector: .rich_media_title
  • Publish time selector: #post-date
  • Author selector: #meta_content>.rich_media_meta:eq(2)
  • Content selector: .rich_media_content

There is no single correct way to write a selector; any selector works as long as it selects the intended content.
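As a rough illustration of what these selectors pick out, the sketch below runs equivalent XPath queries with PHP's built-in DOM extension instead of QueryList. The HTML fragment is hypothetical and only mimics the structure the selectors assume; the real page is larger and may differ. Note that :eq(2) is a jQuery-style, zero-based index, so it corresponds to the third match.

```php
<?php
// Hypothetical markup that mimics the structure the selectors assume.
$html = <<<HTML
<html><body>
<h2 class="rich_media_title">Sample title</h2>
<div id="meta_content">
  <em class="rich_media_meta">Original</em>
  <em id="post-date" class="rich_media_meta">2018-04-08</em>
  <em class="rich_media_meta">Sample author</em>
</div>
<div class="rich_media_content"><p>Body text</p></div>
</body></html>
HTML;

// Build an XPath test for "element carries this class name"
// (a class attribute may hold several space-separated names).
function hasClass(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings about imperfect HTML
$xpath = new DOMXPath($dom);

// .rich_media_title
$title = trim($xpath->query('//*[' . hasClass('rich_media_title') . ']')->item(0)->textContent);

// #post-date
$date = trim($xpath->query('//*[@id="post-date"]')->item(0)->textContent);

// #meta_content>.rich_media_meta:eq(2)
// :eq(2) is zero-based, XPath positions are one-based, hence [3].
$author = trim($xpath->query(
    '(//*[@id="meta_content"]/*[' . hasClass('rich_media_meta') . '])[3]'
)->item(0)->textContent);

echo "$title | $date | $author\n";
```

With QueryList itself the CSS-style selectors are passed directly in the rules array, so this translation to XPath is only a way to see what each selector matches.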

Code

Once the selectors have been worked out, the code is easy to write.

Install QueryList

composer require jaeger/querylist

WeChat collection code


      

<?php

require 'vendor/autoload.php';

use QL\QueryList;

$url = 'https://mp.weixin.qq.com/s?src=11&timestamp=1523173327&ver=803&signature=6PCxJ*3ojH2ZM8pm56Lquward0mQMwSkPnqCvYlrDkQmL2kAEjGcFJMj2lzvpHyuyT30lczb2Ld0npUWmp*2Gj7bPJY3SCWrpRKlXJA0p4eQWPpAzMPJVmxPcRV5TtLS&new=1';

// Collection rules: field name => [selector, what to extract]
$rules = [
    'title' => ['.rich_media_title', 'text'],
    'date' => ['#post-date', 'text'],
    'author' => ['#meta_content>.rich_media_meta:eq(2)', 'text'],
    'content' => ['.rich_media_content', 'html']
];

// Fetch the page, apply the rules, and collect the data
$data = QueryList::get($url)->rules($rules)->query()->getData();

print_r($data->all());

Running this collection code produces the following result:

Array ([0] => Array ([title] => a € œ e ´ ¸ æ ˜ “ æ ˆ ˜ a €  æ ˆ ˜  « a ° c š æ œ ª c ‡ ƒ e (including · I fell Œ c ‰ creates æ œ — æ ™ ® a. squared e ¢« a  „ a problem § a ª ’ a 1/2 level “ c ‚ ® e 1/2 level ° a ¸ ¨ a problem – a ª ’ e ¯ ´ [date] => 2018-04-08 [author] => a  Œ e ¯  › [content] => <section class="xmteditor" style="display:none;" data-tools="æ – ° a ª ’ a 1/2 level “ c ® ¡ A ® ¶" data-label="powered by xmt.cn"></section><p style="white-space: normal;"><img class="" data-ratio="0.134375" data-s="300640" data-src="https://mmbiz.qpic.cn/mmbiz_png/IqicDAGdXNibs5wrrbmbVJW8HZB9Qv5ajtuR4C4kIQI43GjtM0ZDsDWzFSCZ7UcthQ1bbPqBSENxEdvRyzzaBavg/640?wx_fmt=png" data-type="png" data-w="640" style="color: rgb(62, 62, 62); text-align: justify; line-height: 28.4444px; background-color: rgb(255, 255, 255); box-sizing: border-box !important; word-wrap: break-word !important; visibility: visible !important; width: auto !important;" width="auto"></p>
<p style="white-space: normal;"><br></p>
<p style="white-space: normal;"><span style="font-size: 15px;"> ‡ e ª c ‰ creates æ œ — æ ™ ® æ Œ ‘ e, including a € œ e ´ ¸ æ ˜ “ æ ˆ ˜ a €  a » selections æ  selections I fell Œ a ¸ € æ Š Š a ˆ © a ‰ ‘ a ° + æ ‚ such a œ ¨ a ¨ c  ƒ a ¸ ‚ a œ DHS c š „ a problem ´ a ¸ Š a € ‚</span></p> <p style="white-space: normal;"><br></p>
<p style="white-space: normal;"><span style="font-size: 15px;"> a ¸ c delighted many customers and Ž a ¸ because a › 1/2 level such a ¸ ƒ c š „ a delighted many customers and  ¨ c Ž a • † a “  e § „ æ ¨ ¡ E ¶ Š a  ˜ e ¶ Š a problem § I fell Œ e ‚ ¡ A ¸ ‚ e œ ‡ e  ¡ I fell Œ a ¸ – c • Œ c »  æ (including Ž c š „ a ¸  ¡ c ® ® a š æ € § a ¸ Ž æ — selections ¿ a Plus or minus a ¢ž a € ‚</span></p> <p style="white-space: normal;"><br></p>
......
)

The content was collected as expected, but the text is garbled.
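The pattern in the garbled output resembles UTF-8 text whose bytes were decoded as a single-byte encoding. That cause is an assumption here, not something the original article confirms. A minimal sketch of the effect using PHP's mbstring extension, with a sample string chosen only for illustration:

```php
<?php
// Requires the mbstring extension.
// Sample text "贸易战" written with escapes so the file encoding does not matter.
$original = "\u{8D38}\u{6613}\u{6218}";

// Misread: treat every UTF-8 byte as one ISO-8859-1 character,
// then re-encode the result as UTF-8. Each byte becomes its own character.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Undo the misreading: map each character back to a single byte.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

echo 'original characters: ', mb_strlen($original, 'UTF-8'), "\n";
echo 'garbled characters:  ', mb_strlen($garbled, 'UTF-8'), "\n";
echo 'round trip matches:  ', var_export($restored === $original, true), "\n";
```

If this is indeed the cause, the data itself is intact and only the decoding step is wrong, which would explain why changing how the HTML is handed to the parser can be enough to fix it.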

Fixing the garbled text

QueryList has built-in methods for dealing with garbled text, encoding() and removeHead(), but neither works in this scenario, so I used a different approach to fix the WeChat garbling. The modified code is as follows:


      

<?php

require 'vendor/autoload.php';

use Jaeger\GHttp;
use QL\QueryList;

$url = 'https://mp.weixin.qq.com/s?src=11&timestamp=1523173327&ver=803&signature=6PCxJ*3ojH2ZM8pm56Lquward0mQMwSkPnqCvYlrDkQmL2kAEjGcFJMj2lzvpHyuyT30lczb2Ld0npUWmp*2Gj7bPJY3SCWrpRKlXJA0p4eQWPpAzMPJVmxPcRV5TtLS&new=1';

// Collection rules: field name => [selector, what to extract]
$rules = [
    'title' => ['.rich_media_title', 'text'],
    'date' => ['#post-date', 'text'],
    'author' => ['#meta_content>.rich_media_meta:eq(2)', 'text'],
    'content' => ['.rich_media_content', 'html']
];

// Fetch the raw page source
$html = GHttp::get($url);

// Match the body content directly
preg_match('/<body[^>]+>(.+)\s+<\/body>/s', $html, $arr);
$html = $arr[0];

// Parse the extracted HTML instead of letting QueryList fetch the URL
$data = QueryList::html($html)->rules($rules)->query()->getData();

print_r($data->all());

Running results:

Array ([0] => Array ([title] => "trade war", U.S. President Donald Trump has been criticized by the media for his role in the election"xmteditor" style="display:none;" data-tools="New Media Steward" data-label="powered by xmt.cn"></section><p style="white-space: normal;"><img class="" data-ratio="0.134375" data-s="300640" data-src="https://mmbiz.qpic.cn/mmbiz_png/IqicDAGdXNibs5wrrbmbVJW8HZB9Qv5ajtuR4C4kIQI43GjtM0ZDsDWzFSCZ7UcthQ1bbPqBSENxEdvRyzzaBavg/640?wx_fmt=png" data-type="png" data-w="640" style="color: rgb(62, 62, 62); text-align: justify; line-height: 28.4444px; background-color: rgb(255, 255, 255); box-sizing: border-box !important; word-wrap: break-word !important; visibility: visible !important; width: auto !important;" width="auto"></p>
<p style="white-space: normal;"><br></p>
<p style="white-space: normal;"><span style="font-size: 15px;""> < p style =" max-width: 100%; clear: both; min-height: 1em; </span></p> <p style="white-space: normal;"><br></p>
<p style="white-space: normal;"><span style="font-size: 15px;""> < p style =" max-width: 100%; clear: both; </span></p> <p style="white-space: normal;"><br></p>
....
)
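The body-extraction step in the modified code can be tried in isolation. In the sketch below the page source is a hypothetical stand-in for what GHttp::get($url) returns, and the fallback branch is an addition for illustration, not part of the original code.

```php
<?php
// Hypothetical page source standing in for the response of GHttp::get($url).
$html = "<html><head><meta charset=\"utf-8\"><title>t</title></head>\n"
      . "<body id=\"activity-detail\">\n"
      . "<div class=\"rich_media_content\">Hello</div>\n"
      . "</body></html>";

// Same pattern as in the article; the /s flag lets "." match newlines.
if (preg_match('/<body[^>]+>(.+)\s+<\/body>/s', $html, $arr) === 1) {
    $whole = $arr[0]; // full match, including the <body ...> and </body> tags
    $inner = $arr[1]; // first capture group: only what sits between the tags
} else {
    // The pattern needs at least one attribute on <body> and whitespace
    // before </body>; fall back to the unmodified source if it does not match.
    $whole = $inner = $html;
}

echo $whole, "\n---\n", $inner, "\n";
```

Because the original code keeps $arr[0], the body tags stay in the string passed to QueryList::html(), while everything in the head section is dropped; $arr[1] would hold only the inner content.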

Original: http://study.querylist.cc/archives/13/