How to rent a house with a web crawler
Attention! The work done in this tutorial is intended only to support geographic information data analysis.
Web crawling should be a required course if you need to analyze geographic data: we can pull data from different websites and analyze it. A crawler can in fact be implemented in any language; as long as the language can make network requests and parse pages, you can write a crawler. We will learn step by step how to crawl and how to use the data for geographic information analysis. Here we go!
1 Crawler foundation
1.1 HTML
HTML is the foundation of web pages; the basic introduction below is excerpted from W3School. 1) HTML is not a programming language, but a markup language. 2) HTML tags usually come in pairs, such as `<b>` and `</b>`. The first tag of a pair is the start tag, and the second is the end tag.
We will explain the role of each tag as we encounter it. If you are interested, write a page yourself to try it out; it helps in understanding HTML. www.w3school.com.cn/
<html><!-- Root tag -->
<head><!-- Header tag, holds meta information, etc. -->
</head>
<body><!-- Body tag, holds page content, js, etc. -->
<h1>My first headline</h1><!-- h1 tag, holds a heading -->
<p>My first paragraph.</p><!-- p tag, holds text -->
</body>
</html>
The above page is a simple HTML example, explaining the basic structure and a few tags.
1.2 CSS selectors
A CSS selector, simply put, selects node elements. In a crawler it is mainly used to pick out the nodes from which we extract data.
Selector | Example | Description |
---|---|---|
.class | .intro | Selects all elements with class="intro". |
#id | #firstname | Selects all elements with id="firstname". |
element | p | Selects all `<p>` elements. |
element,element | div,p | Selects all `<div>` elements and all `<p>` elements. |
element element | div p | Selects all `<p>` elements inside `<div>` elements. |
element>element | div>p | Selects all `<p>` elements whose parent is a `<div>` element. |
element+element | div+p | Selects all `<p>` elements placed immediately after a `<div>` element. |
[attribute] | [target] | Selects all elements with a target attribute. |
[attribute=value] | [target=_blank] | Selects all elements with target="_blank". |
[attribute~=value] | [title~=flower] | Selects all elements whose title attribute contains the word "flower". |
[attribute\|=value] | [lang\|=en] | Selects all elements whose lang attribute value begins with "en". |
[attribute^=value] | a[src^="https"] | Selects every `<a>` element whose src attribute value begins with "https". |
[attribute$=value] | a[src$=".pdf"] | Selects every `<a>` element whose src attribute value ends with ".pdf". |
[attribute*=value] | a[src*="abc"] | Selects every `<a>` element whose src attribute value contains the substring "abc". |
:nth-child(n) | p:nth-child(2) | Selects every `<p>` element that is the second child of its parent. |
:nth-last-child(n) | p:nth-last-child(2) | Same as above, counting from the last child. |
:nth-of-type(n) | p:nth-of-type(2) | Selects every `<p>` element that is the second `<p>` element of its parent. |
:nth-last-of-type(n) | p:nth-last-of-type(2) | Same as above, counting from the last child. |
The above is a sample of several major CSS selectors from W3School, followed by practical examples.
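Before we get to the crawler, here is a quick sketch of a few of these selectors in action, using the standard DOM API in a browser console. This is illustrative only; in the crawler we will feed the same selector strings to Cheerio instead of the browser:

// A few of the selectors above, exercised with the standard DOM API
document.querySelectorAll('p');               // every <p> element
document.querySelectorAll('div p');           // every <p> inside a <div>
document.querySelectorAll('div > p');         // every <p> whose parent is a <div>
document.querySelectorAll('a[href$=".pdf"]'); // every <a> whose href ends with ".pdf"
document.querySelectorAll('p:nth-child(2)');  // every <p> that is the 2nd child of its parent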
2 Tools
A crawler must have tools, and the tools must be simple and quick to use.
2.1 Node.js
Node.js is a backend JavaScript runtime. Why use it? JS has many convenient tool libraries for parsing; it is flexible enough to write a crawler without a crawler framework; and it also helps you learn web-related development technology. Installation details: nodejs.org/en/
Test the installation by running `node -v` in a terminal; it should print the installed version.
2.2 Node.js Crawler Library: Cheerio
www.npmjs.com/package/che… This is an HTML parsing library: it parses the pages you crawl and extracts the useful information with a jQuery-like syntax.
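A minimal sketch of Cheerio in action, assuming you have installed it with npm install cheerio:

const cheerio = require('cheerio');
// Load an HTML string into a queryable object
const $ = cheerio.load('<h2 class="title">Hello world</h2>');
// Query it with a CSS selector, just like jQuery
console.log($('.title').text()); // prints "Hello world"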
3 The crawler process
In fact the crawler process is very simple! 1) Download the page. 2) Parse the page. 3) Store the parsed information. Every crawler is developed from these three steps!!! Think about it yourself, and see the skeleton sketch below!
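Here is the skeleton every crawler boils down to; download, parse and store are just placeholder names for the three steps we implement in section 4:

// The three-step skeleton of every crawler (function names are illustrative)
async function crawl(url) {
    const html = await download(url); // 1. download the page
    const data = parse(html);         // 2. parse the page
    store(data);                      // 3. store the parsed information
}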
4 Practice
wh.zu.anjuke.com/ **This is the URL we want to crawl: Wuhan Anjuke rentals, because I happen to be looking for a place to rent recently…**
4.1 Viewing the Page
Press F12 in the browser to open the developer tools for debugging (Firefox or Chrome is recommended).
4.2 Analyzing the Page
<div class="zu-itemmod" link="https://wh.zu.anjuke.com/fangyuan/1314369360?shangquan_id=17888" _soj="Filter_3& hfilter=filterlist">
<a data-company="" class="img" _soj="Filter_3& hfilter=filterlist" data-sign="true" href="https://wh.zu.anjuke.com/fangyuan/1314369360?shangquan_id=17888" title="Kangzhuo New Town without intermediary sBI Chuangye Street, Optics Valley Avenue, south lighting good monthly payment" alt="Kangzhuo New Town without intermediary sBI Chuangye Street, Optics Valley Avenue, south lighting good monthly payment" target="_blank" hidefocus="true">
<img class="thumbnail" src="https://pic1.ajkimg.com/display/b5fdc13f19381f593b46baada1237197/220x164.jpg" alt="Kangzhuo New Town without intermediary sBI Chuangye Street, Optics Valley Avenue, south lighting good monthly payment" width="180" height="135">
<span class="many-icons iconfont"></span></a>
<div class="zu-info">
<h3>
<a target="_blank" title="Kangzhuo New Town without intermediary sBI Chuangye Street, Optics Valley Avenue, south lighting good monthly payment" _soj="Filter_3& hfilter=filterlist" href="https://wh.zu.anjuke.com/fangyuan/1314369360?shangquan_id=17888">Kangzhuo New City without intermediaries Optics Valley Avenue SBI entrepreneurship street south lighting good monthly payment</a></h3>
<p class="details-item tag">1 room 1 hall<span>|</span>47 square meters<span>|</span>Lower level (24 floors in total)<i class="iconfont jjr-icon"></i>Of irreverent comedy</p>
<address class="details-item">
<a target="_blank" href="https://wuhan.anjuke.com/community/view/1039494">Kang Zhuo xincheng</a> Hongshan - No. 694 Xiongchu Avenue, Optics Valley</address>
<p class="details-item bot-tag">
<span class="cls-1">The whole rent</span>
<span class="cls-2">In the southeast</span>
<span class="cls-3">Have elevator</span>
<span class="cls-4">Line 2</span>
</p>
</div>
<div class="zu-side">
<p>
<strong>1600</strong>Yuan/month</p></div>
</div>
From the picture above and the HTML structure, we can see that the rental listings on each page are organized as a list. We take the code of one listing for analysis, as shown above: one listing block contains all the relevant information for one house. Now we start looking for the relevant nodes; this is where we do battle with the page. The question to think about is: how do we extract the nodes we want? This is where the magical CSS selectors come in! Look at the code below and analyze it together with the comments. (If you would like each selector discussed in detail, leave a comment and I will cover it in the next tutorial.)
const { URL } = require('url');
const cheerio = require('cheerio');

function dealhtml(text){
    const $ = cheerio.load(text); // Parse the downloaded HTML into a queryable object
    let finalresult = []; // Array holding the data of every listing
    // $('.zu-itemmod') matches every node selected by .zu-itemmod;
    // Cheerio iterates with each(), passing the matched node to the callback as el
    $('.zu-itemmod').each(function (index, el) {
        let result = {}; // Object storing the data of one listing
        let $el = $(el); // Wrap one .zu-itemmod node so we can query inside it
        let url = $el.attr("link"); // Extract the URL from the link attribute of the node
        let idurl = new URL(url); // Node's built-in URL object parses it for us
        let pathname = idurl.pathname;
        // e.g. https://wh.zu.anjuke.com/fangyuan/1314369360?shangquan_id=17888
        // Extract the ID from the URL. The ID matters: without it we cannot deduplicate later
        let paths = pathname.split("/");
        result.uid = paths[2]; // Unique ID resolved from the URL
        result.url = url; // Keep the link for more detailed parsing later
        let info = $el.find('.details-item.tag').text();
        // <p class="details-item tag">
        // find() locates the node carrying both classes; Cheerio's text()
        // returns the text content with the tags stripped
        let infoarr = info.replace("\n", "").trim().split(" ");
        // Debug to see what was parsed, then break it apart
        infoarr[0] = infoarr[0].trim();
        result.owner = infoarr[1]; // Owner
        infoarr = infoarr[0].split("|");
        result.apartment = infoarr[0]; // Layout
        result.totalarea = infoarr[1]; // Total area
        result.floortype = infoarr[2]; // Floor type
        let address = $el.find("address[class='details-item']").text().trim();
        // Parse! Select the <address> element whose class attribute is details-item,
        // take its text and strip the whitespace on both sides
        address = address.replace("\n", "").split(" ");
        result.address0 = address[0]; // Community name
        result.address1 = address[address.length - 2];
        result.address2 = address[address.length - 1];
        result.rent = $el.find('.cls-1').text(); // Whole rent or shared
        // Parse!! Find the node matching the .cls-1 class selector and take its text;
        // the same pattern applies to .cls-2 through .cls-4 below
        result.orientation = $el.find('.cls-2').text(); // Orientation
        result.elevator = $el.find('.cls-3').text(); // Elevator or not
        result.metro = $el.find('.cls-4').text(); // Metro line
        let price = $el.find('.zu-side').text();
        result.price = price.trim(); // Price
        finalresult.push(result);
    });
    return finalresult;
}
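For reference, one element of the finalresult array comes out with roughly this shape (values taken from the sample listing above):

// {
//   uid: '1314369360',
//   url: 'https://wh.zu.anjuke.com/fangyuan/1314369360?shangquan_id=17888',
//   apartment: '1 room 1 hall',
//   totalarea: '47 square meters',
//   floortype: 'Lower level (24 floors in total)',
//   rent: 'The whole rent',
//   orientation: 'In the southeast',
//   elevator: 'Have elevator',
//   metro: 'Line 2',
//   price: '1600Yuan/month',
//   ...
// }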
!!! Tip !!!
Find a primary key (here, the listing ID) while crawling the data; it makes deduplication convenient later.
4.3 Requesting Data
Now we need to think about requesting the data. For a statically generated page like Anjuke's, this is actually very simple: a direct request is enough, and no fancy tricks are needed for now. We use the HTTPS toolkit built into Node.js nodejs.org/api/https.h… and its get function.
nodejs.org/api/http.ht…
const https = require('https'); // Built-in module, no npm install needed

/**
 * HTTPS request headers
 */
const options = {
    headers: {
        "Accept": "text/html",
        "Accept-Encoding": "utf-8",
        "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        // Copy your own Cookie from a real request in the browser's dev tools (truncated here)
        "Cookie": "aQQ_ajkguid=51C5A875-EDB6-391F-B593-B898AB796AC7; 58tj_uuid=c6041809-eb8d-452c-9d70-4d62b1801031; ...",
        "Host": "wh.zu.anjuke.com",
        "Upgrade-Insecure-Requests": 1,
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:66.0) Gecko/20100101 Firefox/66.0"
    }
};

/**
 * @param {*} url
 */
function getdata(url){
    url = new URL(url);
    console.log(url);
    url.headers = options.headers; // Attach the request headers
    https.get(url, (res) => {
        console.log('Status code:', res.statusCode);
        res.setEncoding('utf8');
        let rawData = '';
        res.on('data', (chunk) => {
            rawData += chunk; // Accumulate the response body chunk by chunk
        });
        res.on('end', () => {
            try {
                let result = dealhtml(rawData); // Parse the page
                exporttoFile(result, url.href); // Store the result
            } catch (e) {
                console.error(e.message);
            }
        });
    }).on('error', (e) => {
        console.error(e);
    });
}
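A quick usage sketch; the URL follows the pagination pattern we generate in section 4.6:

getdata("https://wh.zu.anjuke.com/fangyuan/wuchanga/x1-p1/");
// On success, the page body is parsed by dealhtml and written to ./result by exporttoFile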
!!! Tip !!! The User-Agent in the request header is what servers use to identify whether you are a crawler, so we modify this parameter ourselves. The method: copy one from any real network request in the browser's developer tools, as shown in the figure below.
4.4 Storing Data
const fs = require('fs'); // Built-in file-system module

function exporttoFile(obj, filename){
    // Turn the page URL into a safe file name
    filename = filename.replace("https://wh.zu.anjuke.com/", "").split("/").join('_');
    let data = JSON.stringify(obj);
    fs.writeFileSync(`./result/${filename}.json`, data);
}
Of course, once we have collected the data we need to store it so we can use it later, so we convert the result above into text and save it to a JSON file; name it whatever you like.
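As a worked example of the file-name transformation in exporttoFile:

// "https://wh.zu.anjuke.com/fangyuan/wuchanga/x1-p1/"
//   -> "fangyuan/wuchanga/x1-p1/"            (host prefix stripped)
//   -> ["fangyuan", "wuchanga", "x1-p1", ""] (split on "/")
//   -> "fangyuan_wuchanga_x1-p1_"            (joined with "_")
// so the data lands in ./result/fangyuan_wuchanga_x1-p1_.json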
4.5 Speed Control
Speed control is very important: if you do not control your speed, the server will detect you and block your IP. So find a rhythm and make a request every so often. We use the ES6 async/await syntax to control this, as in await sleep(2000); the sleep function returns a Promise object, and the async/await syntax forces execution to wait two seconds, effectively running synchronously. !!! Tip !!! Rate limiting here is a basic crawler skill; there are more advanced techniques later on.
async function sleep(time = 0) {
    return new Promise((resolve, reject) => {
        setTimeout(() => { resolve(); }, time);
    });
}

async function start(){
    // let url = "https://wh.zu.anjuke.com/fangyuan/wuchanga/x1/";
    let urls = getUrl();
    for (let i = 0; i < urls.length; i++){
        await sleep(2000); // Wait two seconds between requests
        console.log(urls[i]);
        getdata(urls[i]);
    }
}
4.6 Generating URLs
Next comes URL construction. Splicing URLs together is mainly about handling pagination: observe how the URL changes after you turn the page and adjust accordingly. The point is to observe the pattern.
let position = ['wuchanga']; // District abbreviation used in the URL path
let rent = "x1"; // Rental-type filter

function getUrl(){
    let result = [];
    let url = "https://wh.zu.anjuke.com/fangyuan";
    for (let i = 0; i < 30; i++){
        // e.g. https://wh.zu.anjuke.com/fangyuan/wuchanga/x1-p1/
        let surl = url + "/" + position + "/" + rent + "-" + `p${i}` + "/";
        result.push(surl);
    }
    return result;
}
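Putting it all together, a minimal entry point; note that the ./result directory must exist beforehand, since writeFileSync will not create missing directories:

start();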
5 Final project
In fact, based on the code details above, we can now write the whole crawler project. We have put the crawler project on Git for you to study: github.com/yatsov/craw… Welcome to star it, and to fork it. If anything is unclear, we can discuss it, or leave a message on the official account. Note: the entry file is index.
6 Contribution Team Profile:
Author:
Zhang Jian, Theory and Method Laboratory, Resource and Environmental Sciences, Wuhan University. Main directions: geographic information software engineering and spatiotemporal data visualization.
Review:
Tong Ying: Theory and Method Laboratory, Resource and Environmental Sciences, Wuhan University. Pan Yucheng: major in Physical Geography, Resource and Environmental Sciences, Wuhan University.