Collection target
The title, content, publication time, author, and other information from a WeChat article page.
Sample URL
https://mp.weixin.qq.com/s?src=11&timestamp=1523173327&ver=803&signature=6PCxJ*3ojH2ZM8pm56Lquward0mQMwSkPnqCvYlrDkQmL2kAEjGcFJMj2lzvpHyuyT30lczb2Ld0npUWmp*2Gj7bPJY3SCWrpRKlXJA0p4eQWPpAzMPJVmxPcRV5TtLS&new=1
Illustration of the content areas to collect
Analyze the content selectors
Use the browser's developer tools to work out selectors for the regions to collect. The details are not covered here; if this is unfamiliar, read up on jQuery selectors and CSS selectors first. The analysis is shown in the figure below:
Analysis results:
- The title selector is:
.rich_media_title
- The publication time selector is:
#post-date
- The author selector is:
#meta_content>.rich_media_meta:eq(2)
- The content selector is:
.rich_media_content
There is no single correct way to write a selector; any selector works as long as it matches the intended content.
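One detail worth spelling out is `:eq(2)` in the author selector: it is a jQuery-style extension rather than standard CSS, and its index is zero-based, so it picks the third matching element. The sketch below illustrates this with PHP's bundled DOM extension instead of QueryList. The HTML fragment and the helper name `nthMetaText` are made up for illustration and are much simpler than a real WeChat page.

```php
<?php
// Hypothetical, heavily simplified fragment; a real WeChat page has more markup.
$html = <<<HTML
<div id="meta_content">
  <em class="rich_media_meta">Original</em>
  <em id="post-date" class="rich_media_meta">2018-04-08</em>
  <em class="rich_media_meta">Example Author</em>
</div>
HTML;

/**
 * Emulates `#meta_content>.rich_media_meta:eq($index)`: returns the text of
 * the matching direct child at zero-based position $index, or null if none.
 */
function nthMetaText(string $html, int $index): ?string
{
    $doc = new DOMDocument();
    // Suppress libxml warnings about the bare fragment.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    // Direct children of #meta_content whose class list contains rich_media_meta.
    $nodes = $xpath->query(
        '//*[@id="meta_content"]/*[contains(concat(" ", normalize-space(@class), " "), " rich_media_meta ")]'
    );
    $node = $nodes->item($index);
    return $node === null ? null : trim($node->textContent);
}

echo nthMetaText($html, 2), PHP_EOL; // prints: Example Author
```

In this fragment, index 1 would return the date instead, which is why the article can address the date either by position or by its `#post-date` id.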
Code
Once the selectors are worked out, the code is easy to write.
Install QueryList
composer require jaeger/querylist
WeChat collection code
require 'vendor/autoload.php';
use QL\QueryList;
$url = 'https://mp.weixin.qq.com/s?src=11&timestamp=1523173327&ver=803&signature=6PCxJ*3ojH2ZM8pm56Lquward0mQMwSkPnqCvYlrDkQmL2kAEjGcFJMj2lzvpHyuyT30lczb2Ld0npUWmp*2Gj7bPJY3SCWrpRKlXJA0p4eQWPpAzMPJVmxPcRV5TtLS&new=1';
// Collection rules: field name => [selector, what to extract]
$rules = [
    'title' => ['.rich_media_title', 'text'],
    'date' => ['#post-date', 'text'],
    'author' => ['#meta_content>.rich_media_meta:eq(2)', 'text'],
    'content' => ['.rich_media_content', 'html'],
];
// Fetch the page, apply the rules, and print the collected data
$data = QueryList::get($url)->rules($rules)->query()->getData();
print_r($data->all());
Running this collection code prints the following:
Array ([0] => Array (
[title] => a e ´ ¸ æ æ a æ « a ° c æ ª c e (including · I fell c creates æ æ ® a. squared e ¢« a a problem § a ª a 1/2 level c ® e 1/2 level ° a ¸ ¨ a problem a ª e ¯ ´
[date] => 2018-04-08
[author] => a e ¯
[content] => <section class="xmteditor" style="display:none;" data-tools="æ ° a ª a 1/2 level c ® ¡ A ® ¶" data-label="powered by xmt.cn"></section><p style="white-space: normal;"><img class="" data-ratio="0.134375" data-s="300640" data-src="https://mmbiz.qpic.cn/mmbiz_png/IqicDAGdXNibs5wrrbmbVJW8HZB9Qv5ajtuR4C4kIQI43GjtM0ZDsDWzFSCZ7UcthQ1bbPqBSENxEdvRyzzaBavg/640?wx_fmt=png" data-type="png" data-w="640" style="color: rgb(62, 62, 62); text-align: justify; line-height: 28.4444px; background-color: rgb(255, 255, 255); box-sizing: border-box !important; word-wrap: break-word !important; visibility: visible !important; width: auto !important;" width="auto"></p>
<p style="white-space: normal;"><br></p>
<p style="white-space: normal;"><span style="font-size: 15px;"> e ª c creates æ æ ® æ e, including a e ´ ¸ æ æ a a » selections æ selections I fell a ¸ æ a © a a ° + æ such a ¨ a ¨ c a ¸ a DHS c a problem ´ a ¸ a</span></p>
<p style="white-space: normal;"><br></p>
<p style="white-space: normal;"><span style="font-size: 15px;"> a ¸ c delighted many customers and a ¸ because a 1/2 level such a ¸ c a delighted many customers and ¨ c a a e § æ ¨ ¡ E ¶ a e ¶ a problem § I fell e ¡ A ¸ e e ¡ I fell a ¸ c c » æ (including c a ¸ ¡ c ® ® a æ § a ¸ æ selections ¿ a Plus or minus a ¢ a</span></p>
<p style="white-space: normal;"><br></p>
......
)
The fields were collected as expected, but the text is garbled.
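The article does not say what causes the garbling. Characters such as æ, ¸ and ® in the output are typical of UTF-8 bytes being read as a single-byte encoding, so the sketch below reproduces that kind of garbling on a short string. It is only an illustration of the symptom under that assumption, not a diagnosis of what QueryList does internally; the function names are hypothetical.

```php
<?php
/**
 * Simulates one common garbling: UTF-8 bytes misread as ISO-8859-1 and
 * then re-encoded as UTF-8.
 */
function misreadAsLatin1(string $utf8): string
{
    return iconv('ISO-8859-1', 'UTF-8', $utf8);
}

/** Reverses misreadAsLatin1(), recovering the original UTF-8 bytes. */
function undoLatin1Misread(string $garbled): string
{
    return iconv('UTF-8', 'ISO-8859-1', $garbled);
}

$original = '微信';
$garbled  = misreadAsLatin1($original);

echo $garbled, PHP_EOL;                    // prints: å¾®ä¿¡
echo undoLatin1Misread($garbled), PHP_EOL; // prints: 微信
```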
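Fixing the garbled text
QueryList's built-in fixes for garbled text are encoding() and removeHead(), but neither works in this scenario, so I used another approach for the WeChat garbling: fetch the page with GHttp, extract the body with a regular expression, and pass only that HTML to QueryList. The modified code is as follows: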
require 'vendor/autoload.php';
use Jaeger\GHttp;
use QL\QueryList;
$url = 'https://mp.weixin.qq.com/s?src=11&timestamp=1523173327&ver=803&signature=6PCxJ*3ojH2ZM8pm56Lquward0mQMwSkPnqCvYlrDkQmL2kAEjGcFJMj2lzvpHyuyT30lczb2Ld0npUWmp*2Gj7bPJY3SCWrpRKlXJA0p4eQWPpAzMPJVmxPcRV5TtLS&new=1';
// Collection rules: field name => [selector, what to extract]
$rules = [
    'title' => ['.rich_media_title', 'text'],
    'date' => ['#post-date', 'text'],
    'author' => ['#meta_content>.rich_media_meta:eq(2)', 'text'],
    'content' => ['.rich_media_content', 'html'],
];
// Fetch the raw HTML without letting QueryList parse it yet
$html = GHttp::get($url);
// Match the <body> element directly, leaving out the <head>
preg_match('/<body[^>]+>(.+)\s+<\/body>/s',$html,$arr);
$html = $arr[0];
$data = QueryList::html($html)->rules($rules)->query()->getData();
print_r($data->all());
Running results:
Array ([0] => Array ([title] => "trade war", U.S. President Donald Trump has been criticized by the media for his role in the election"xmteditor" style="display:none;" data-tools="New Media Steward" data-label="powered by xmt.cn"></section><p style="white-space: normal;"><img class="" data-ratio="0.134375" data-s="300640" data-src="https://mmbiz.qpic.cn/mmbiz_png/IqicDAGdXNibs5wrrbmbVJW8HZB9Qv5ajtuR4C4kIQI43GjtM0ZDsDWzFSCZ7UcthQ1bbPqBSENxEdvRyzzaBavg/640?wx_fmt=png" data-type="png" data-w="640" style="color: rgb(62, 62, 62); text-align: justify; line-height: 28.4444px; background-color: rgb(255, 255, 255); box-sizing: border-box !important; word-wrap: break-word !important; visibility: visible !important; width: auto !important;" width="auto"></p>
<p style="white-space: normal;"><br></p>
<p style="white-space: normal;"><span style="font-size: 15px;""> < p style =" max-width: 100%; clear: both; min-height: 1em; </span></p> <p style="white-space: normal;"><br></p>
<p style="white-space: normal;"><span style="font-size: 15px;""> < p style =" max-width: 100%; clear: both; </span></p> <p style="white-space: normal;"><br></p>
....
)
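The body-extraction step can be tried on its own, without QueryList or network access. The sketch below wraps the article's regular expression in a hypothetical helper, `extractBody`, and adds a check for the case where the pattern does not match, which the code above does not handle. The sample page is invented for illustration.

```php
<?php
/**
 * Returns the <body> element of $html (tags included), or null when the
 * pattern does not match. Uses the same regular expression as the article.
 */
function extractBody(string $html): ?string
{
    if (preg_match('/<body[^>]+>(.+)\s+<\/body>/s', $html, $matches) !== 1) {
        return null;
    }
    // $matches[0] is the whole match; $matches[1] would be only the inner markup.
    return $matches[0];
}

// Hypothetical page, used only to exercise the function.
$page = "<html><head><title>t</title></head>\n"
      . "<body id=\"activity-detail\">\n<h2 class=\"rich_media_title\">Hello</h2>\n</body></html>";

var_dump(extractBody($page) !== null);          // bool(true)
var_dump(extractBody('<body><p>x</p></body>')); // NULL
```

The second call returns null because the pattern requires at least one attribute on the `<body>` tag and whitespace before `</body>`. A page that does not meet both conditions would need a looser pattern.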
Original: http://study.querylist.cc/archives/13/