Writing a web crawler in PHP

What is PHP?

PHP (PHP: Hypertext Preprocessor) is a widely used open source scripting language. Its syntax borrows from C, Java, and Perl alongside features of its own, which makes it easy to learn, and it is used mainly for web development. According to the description, PHP can serve dynamic pages faster than CGI or Perl: PHP code is embedded in the HTML document and executed there, which is much more efficient than a CGI program that generates all of the HTML markup itself. PHP can also run compiled code; compilation can encrypt the code and optimize its execution so that it runs faster. (Description from Baidu Baike.)

What are crawlers used for?

What are crawlers used for? First, what is a crawler? As I understand it, a crawler is a program that collects information from the network. I may be wrong about that, so please correct me if so. Since a crawler is an information collection program, its use is collecting information, and that information lives on the network. If that is still too abstract, here are a couple of applications: search engines need crawlers to collect web pages for people to search, and much of the data behind "big data" is crawled (collected) from the web.

When people hear "crawler" they usually think of Python. So why do I use PHP instead of Python?

Python: to be honest, I don't know Python. (I really don't. If you want to know more about it, try searching Baidu.)
I've always felt that writing PHP only requires coming up with the algorithm; you don't have to worry much about data types.
I used to believe that PHP's syntax is close enough to other programming languages that you can get started right away even if you have never used it. That turned out to be wrong.
I am actually a PHP beginner and want to improve by writing something. (Some of the code below may not follow best practice; corrections are welcome, thank you.)

PHP crawler step 1

Step 1 of a PHP crawler is, of course, setting up an environment that can run PHP. PHP cannot run without one, just as a fish cannot live out of water. (I'm not widely read, so forgive me if the fish example isn't a great one.) On Windows I use WAMP, and on Linux I use LNMP or LAMP.

WAMP: Windows + Apache + MySQL + PHP

LAMP: Linux + Apache + MySQL + PHP

LNMP: Linux + Nginx + MySQL + PHP

Apache and Nginx are Web server software.

Apache or Nginx, MySQL, and PHP make up the basic environment for a PHP website. Integrated installation packages for this environment are available online; they are easy to use and spare you from installing and configuring each component yourself. If you are worried about the security of these integrated packages, you can download each component from its official website and follow a configuration tutorial. (To be honest, I never set them up individually. I find it too much hassle.)

PHP crawler step 2

(That feels like a lot of talk. Time for some code!)

<?php
// Core crawler function: fetch the page source
$html = file_get_contents("https://www.baidu.com/index.html");
// Print the page source
echo $html;
?>

The core of the crawler is now written. Why do a few lines of code count as the core? As some readers will have guessed, a crawler is a data collection program, and these few lines already fetch data, so the core function is done. Some people may say, "That's pretty weak! What use is it?" I know I'm a novice, but please don't say it out loud; let me show off a little. (Two more lines of filler, sorry.)
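The fetch above prints a warning and continues with `false` when the request fails. A minimal sketch of a safer wrapper is shown here; the name `fetchPage` is my own and not part of the original code.

```php
<?php
// Hypothetical helper: fetch a URL (or a local file path) and return its
// contents, or null when the request fails.
function fetchPage(string $url): ?string
{
    // @ suppresses the PHP warning; failure is reported through the return value
    $html = @file_get_contents($url);
    return $html === false ? null : $html;
}
```

Calling `fetchPage("https://www.baidu.com/index.html")` and checking for `null` before echoing keeps a failed request from turning into an empty page further down the pipeline.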

What a crawler is good for really depends on what you want it to do. For example, a while ago I wrote a search engine website for fun. The site is admittedly crude: the results are not ranked sensibly and many things cannot be found. The crawler I wrote was built to serve that search engine, so for convenience the rest of this article also uses a search engine crawler as its target. My crawler is still far from perfect; the imperfect parts are left for you to build on and improve.

Six, limitations of search engine crawlers

Sometimes a search engine crawler skips a site's pages not because it cannot fetch them, but because the site has a robots.txt file. Through that file the site owner states which pages crawlers should not crawl. (If you insist on fetching them, though, nothing technically stops you.)
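To respect robots.txt, a crawler can check a path against the file before fetching it. The sketch below is a deliberately simplified reading of the format: it only looks at the `User-agent: *` group, treats `Disallow` values as plain path prefixes, and ignores wildcards and `Allow` lines. The function name `isDisallowed` is hypothetical.

```php
<?php
// Hypothetical helper: returns true if $path is blocked by a "Disallow" rule
// in the "User-agent: *" group of the given robots.txt text.
// Simplification: prefix rules only, no wildcards, no "Allow" lines.
function isDisallowed(string $robotsTxt, string $path): bool
{
    $applies = false;      // are we inside a group that applies to every crawler?
    $inAgentLines = false; // was the previous rule line a User-agent line?
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        // Drop comments and surrounding whitespace
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            if (!$inAgentLines) {
                $applies = false; // a new group starts here
            }
            $applies = $applies || $value === '*';
            $inAgentLines = true;
            continue;
        }
        $inAgentLines = false;
        if ($field === 'disallow' && $applies && $value !== '') {
            if (strpos($path, $value) === 0) {
                return true;
            }
        }
    }
    return false;
}
```

The robots.txt text itself can be fetched like any other page, from the `/robots.txt` path at the root of the site.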

My search engine crawler also has many limitations of its own. For example, it cannot run JavaScript, so it may fail to get the real source of pages that are built by scripts. A site may also have anti-crawler measures that keep the crawler from getting the page source. Zhihu is one example of a site with such measures.

Seven, what you need before writing a crawler, using the search engine crawler as the example

PHP basics
Regular expressions (you can also use XPath; sorry, I don't know how to use it)
Database usage (this article uses MySQL)
Runtime environment (any environment and database that can run your PHP site will do)
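For readers curious about the XPath route mentioned in the list, here is a small sketch using PHP's built-in DOM extension. It is not the method used in the rest of this article, and the function name `extractTitleWithXPath` is made up for the example.

```php
<?php
// Hypothetical helper: return the text of the first <title> element, or null.
function extractTitleWithXPath(string $html): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//title');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}
```

Compared with a regular expression, the parser copes with line breaks and attributes inside tags on its own, at the cost of needing the DOM extension to be enabled.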

Eight, fetching the page source and extracting the page title

<?php
// Fetch the Baidu page source with file_get_contents
$html = file_get_contents("https://www.baidu.com/index.html");
// Use preg_replace to collapse the multi-line source into a single line
$htmlOneLine = preg_replace("/\r|\n|\t/", "", $html);
// Use preg_match to extract the page title
preg_match("/<title>(.*)<\/title>/iU", $htmlOneLine, $titleArr);
// The captured title is stored in element 1 of the array
$title = $titleArr[1];
echo $title;
?>

Example error message:

Warning: file_get_contents("https://https://127.0.0.1/index.php") [function.file-get-contents]: failed to open stream: Invalid argument in E:\website\blog\test.php on line 25

HTTPS is HTTP over SSL/TLS encryption. If this error appears, your PHP installation is probably missing the OpenSSL extension. Solutions are easy to find online.
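Before searching for a fix, it can help to confirm that missing HTTPS support really is the cause. This short check is a sketch; `httpsSupported` is a name I made up, while `extension_loaded` and `stream_get_wrappers` are standard PHP functions.

```php
<?php
// Hypothetical helper: true when PHP can open https:// URLs with file_get_contents
function httpsSupported(): bool
{
    return extension_loaded('openssl')
        && in_array('https', stream_get_wrappers(), true);
}

if (!httpsSupported()) {
    echo "HTTPS is not available: enable the openssl extension in php.ini\n";
}
```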

Nine, characteristics of search engine crawlers

I have never seen the crawlers behind Baidu or Google, but based on my own guesses and on the problems I ran into while crawling, I have summed up a few characteristics. (Some points may be wrong or missing; corrections are welcome, thank you.)

Generality

By generality I mean two things. First, a search engine crawler is not written for any particular website, so it should be able to crawl as many sites as possible. Second, the set of fields extracted from each page is fixed up front and is not dropped because of a few unusual small sites. For example, if a page on a small site has no description or keywords in its meta tags, the crawler does not simply give up on those fields; when a page lacks that information, I extract text from the page body as filler. In short, crawl as many pages as possible, and keep the same fields for every page. This is what I consider the generality of a search engine crawler, although I could be wrong. (I may not be explaining it well; I'm still learning.)
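The "same fields for every page" idea can be sketched as a description extractor with a fallback. The function name, the 200-byte cut-off, and the assumption that `name` appears before `content` inside the meta tag are all mine, not taken from the original crawler.

```php
<?php
// Hypothetical helper: return the meta description, or the start of the
// visible page text when the page has no description.
function extractDescription(string $html, int $maxLength = 200): string
{
    // Try <meta name="description" content="..."> first
    $pattern = '/<meta\s[^>]*name\s*=\s*["\']description["\'][^>]*content\s*=\s*["\']([^"\']*)["\']/i';
    if (preg_match($pattern, $html, $m) && trim($m[1]) !== '') {
        return trim($m[1]);
    }
    // Fallback: remove script, style and head blocks, then strip the remaining tags
    $text = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    $text = preg_replace('#<head\b[^>]*>.*?</head>#is', ' ', $text);
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($text)));
    // substr counts bytes; mb_substr would be safer for multi-byte text
    return substr($text, 0, $maxLength);
}
```

Either way the caller gets a non-null description field, so every stored page has the same shape.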

Uncertainty

By uncertainty I mean that I do not have full control over which pages my crawler visits; I only control what I can foresee. This follows from the algorithm I wrote: get all the links in a page, then crawl those links in turn. A search engine is not looking for one specific thing; it collects as much as possible, because only with more information can it find the best answer to what a user needs. So I think a search engine crawler has to have this uncertainty. (Reading this back, I feel I only half understand my own explanation. Please forgive me; corrections and questions are welcome, thank you!)
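The "get all links, then crawl those links" algorithm can be sketched as a breadth-first loop over a queue. This is my own illustration rather than the crawler's actual code: the names `extractLinks` and `crawl` are hypothetical, relative links are ignored instead of resolved, and the page limit is an assumption added so the loop always ends.

```php
<?php
// Hypothetical helper: collect the absolute http(s) links found in a page.
// Simplification: relative links are skipped rather than resolved.
function extractLinks(string $html): array
{
    preg_match_all('/<a\s[^>]*href\s*=\s*["\']([^"\']+)["\']/i', $html, $matches);
    $links = [];
    foreach ($matches[1] as $href) {
        if (preg_match('#^https?://#i', $href)) {
            $links[$href] = true; // array keys de-duplicate
        }
    }
    return array_keys($links);
}

// Breadth-first crawl. $fetch is any callable that returns the page source
// for a URL (or null on failure), so the loop can be tried without a network.
function crawl(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $queue = [$startUrl];
    $visited = [];
    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;
        }
        $visited[$url] = true;
        $html = $fetch($url);
        if ($html === null) {
            continue;
        }
        foreach (extractLinks($html) as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($visited);
}
```

Because the queue is filled from whatever links the pages happen to contain, the set of visited pages is not known in advance, which is exactly the uncertainty described above.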

The following video shows my search site in use; the information in it was collected by the PHP crawler I wrote. (I no longer maintain the site, so I apologize for any shortcomings.)

Ten, problems you may have run into so far

1. The fetched page source is garbled

<?php
// Fix garbled text by converting the source to UTF-8
$html = mb_convert_encoding($html, 'UTF-8', 'UTF-8,GBK,GB2312,BIG5');
// Another kind of garbled text is caused by gzip; I will talk about that later.
?>
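The gzip case is deferred in the original, so what follows is only my assumption about what it refers to: some servers send a gzip-compressed body, and file_get_contents returns those bytes without decompressing them. A minimal sketch that detects the gzip signature and decompresses, using a made-up function name:

```php
<?php
// Hypothetical helper: decompress the body if it starts with the gzip
// magic bytes (0x1f 0x8b); otherwise return it unchanged.
function decodeIfGzipped(string $body): string
{
    $isGzip = strncmp($body, "\x1f\x8b", 2) === 0;
    if ($isGzip && function_exists('gzdecode')) {
        $decoded = @gzdecode($body);
        if ($decoded !== false) {
            return $decoded;
        }
    }
    return $body;
}
```

Running this before the encoding conversion keeps mb_convert_encoding from being applied to compressed bytes.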

2. The title information cannot be obtained

<?php
// If you can get the page source but still cannot get the page title,
// I suspect the cause is this: the title is extracted with a regular
// expression, and matching can fail when the source has not been
// collapsed into a single line.
$htmlOneLine = preg_replace("/\r|\n|\t/", "", $html);
?>

3. The page source code cannot be obtained

<?php
// Request options: GET method, 20-second timeout, and a User-Agent header
$opts = array(
    'http' => array(
        'method' => "GET",
        'timeout' => 20,
        'header' => "User-Agent: Spider \r\n",
    )
);
$context = stream_context_create($opts);
// Read at most 150000 bytes, starting at offset 0
$html = file_get_contents($domain, false, $context, 0, 150000);
// With this I could get the Sina Weibo page
?>

Eleven, the approach to processing a single web page

Don't think about many pages for now; handling many pages is just the same steps in a loop.

Get the page source
Decide what information to extract from the source
Decide how to process the extracted information
Decide whether to store it in the database after processing

Twelve, code following the approach in section Eleven

<?php
// 1. Get the page source
$html = file_get_contents("https://www.taobao.com");
// 2 and 3. Extract the title and text, then process them
// Collapse the multi-line page source into a single line
$htmlOneLine = preg_replace("/\r|\n|\t/", "", $html);
// Get the title
preg_match("/<title>(.*)<\/title>/iU", $htmlOneLine, $titleArr);
// Keep the title
$titleOK = $titleArr[1];
// Get the page text: remove everything from the <html> tag to </head>
$htmlText = preg_replace("/<html(.*)<\/head>/i", "", $htmlOneLine);
// Remove style and script tags together with their content
$htmlText = preg_replace("/<style(.*)>(.*)<\/style>|<script(.*)>(.*)<\/script>/iU", "", $htmlText);
// Remove the remaining HTML tags (U makes the match non-greedy, one tag at a time)
$htmlText = preg_replace("/<(\/)?(.+)>/U", "", $htmlText);
// 4. Save to the database
?>
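The final "save to the database" step is left open above. One possible shape for it, using PDO with a prepared statement, is sketched below. Everything specific here is an assumption for illustration: the function name `savePage`, the table `pages`, its columns, and the connection details in the comments.

```php
<?php
// Hypothetical helper: store one crawled page. The table name "pages" and
// its columns (url, title, content) are assumptions made for this sketch.
function savePage(PDO $pdo, string $url, string $title, string $text): void
{
    // A prepared statement keeps page content from being interpreted as SQL
    $stmt = $pdo->prepare('INSERT INTO pages (url, title, content) VALUES (?, ?, ?)');
    $stmt->execute([$url, $title, $text]);
}

// Example connection for MySQL (host, database name and credentials are placeholders):
// $pdo = new PDO('mysql:host=127.0.0.1;dbname=crawler;charset=utf8mb4', 'user', 'password');
// $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// savePage($pdo, 'https://www.taobao.com', $titleOK, $htmlText);
```

Binding the values as parameters matters for a crawler in particular, since page titles and text come from sites you do not control.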

Thirteen, the approach to saving images from a page

Get the page source
Get the image links in the page
Use functions to save the images
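The second step, getting the image links, can be sketched with a regular expression in the same style as the title extraction. The function name `extractImageUrls` is hypothetical, and the pattern is a simplification: it only handles quoted `src` values and would also pick up attributes such as `data-src`.

```php
<?php
// Hypothetical helper: return the unique src values of the <img> tags in a page.
function extractImageUrls(string $html): array
{
    preg_match_all('/<img\s[^>]*src\s*=\s*["\']([^"\']+)["\']/i', $html, $matches);
    return array_values(array_unique($matches[1]));
}
```

Relative links such as `/img/b.png` are returned as they appear in the page, so they still need to be joined with the site address before they can be fetched.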

Fourteen, sample code for saving an image

<?php
// Use file_get_contents() to fetch the image
$img = file_get_contents("http://127.0.0.1/photo.jpg");
// Use file_put_contents() to save it
file_put_contents("photo.jpg", $img);
?>
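When saving many images, the file name has to come from somewhere and failed downloads have to be skipped. A sketch of one way to do that, with a made-up function name and a fallback naming scheme that is purely my assumption:

```php
<?php
// Hypothetical helper: download $url into $targetDir and return the saved
// path, or null if the download or the write fails.
function saveImage(string $url, string $targetDir): ?string
{
    $data = @file_get_contents($url);
    if ($data === false) {
        return null;
    }
    // Derive the file name from the URL path; fall back to a hash-based name
    $name = basename((string) parse_url($url, PHP_URL_PATH));
    if ($name === '') {
        $name = 'image_' . md5($url);
    }
    $target = rtrim($targetDir, '/\\') . DIRECTORY_SEPARATOR . $name;
    return file_put_contents($target, $data) === false ? null : $target;
}
```

Using basename() also keeps a crafted URL path from writing outside the target directory. Images that share a file name will overwrite each other in this sketch.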

To be continued…
