Original link: tecdat.cn/?p=4559

 

Because e-commerce data changes in real time, analysis usually has to work directly from the web pages themselves, which makes web crawling an essential technique. As data analysis software, R can take crawled data straight into follow-up processing, and its fast loading makes it a good tool for crawling and analysing e-commerce website data.

The following takes an e-commerce platform as an example to walk through crawling website data with RCurl.

 

First, the required packages need to be installed and loaded in RGui.
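If the packages are not yet installed, a one-off install.packages() call takes care of that (a minimal sketch; run it once before loading the packages):

# Install the packages once if they are missing
install.packages(c("RCurl", "rjson", "stringr", "XML"))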

Require ("RCurl") require("rjson") require(stringr) require(XML) # Url = "http://cn.shopbop.com/" doc = getURL(url) TXT = htmlParse(doc, asText = TRUE) print(TXT)Copy the code

 

Extracting product data from the page requires analysing the HTML source structure, which can be inspected directly in the browser before writing the corresponding R code.

 

In the source code it is easy to find the sub-site URLs in the site navigation.

 

The nodes holding the corresponding sub-site links can therefore be located with an XPath expression.

A < -getNodeset (TXT, path = "//a[@class = 'parent topnav-logged-in ']")#Copy the code

 

 

If the extracted Chinese characters come out garbled, the encoding needs to be converted:

b <- sapply(a, xmlValue)        # extract the text of each node

c <- iconv(b, "utf-8", "gbk")   # convert the encoding

c

The xmlGetAttr function is then used to pull out the attributes you want, in this case the href links:

? D < -sapply (a,xmlGetAttr, "href")#Copy the code

 

paste() is used to join each relative link onto the site's root URL:

d1 <- paste(url, d[1], sep = "")
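Since paste() is vectorised, the same call also builds absolute URLs for every navigation link at once (a small sketch; the name suburls is only for illustration):

# Build absolute URLs for all sub-site links in one step
suburls <- paste(url, d, sep = "")
head(suburls)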

To collect the product information, the page numbers of the product display sub-pages are also needed.

Each page shows 40 items, out of 1200 items in total.

 

The address pattern of the product display pages is easy to work out from the URL.

 

Therefore, the addresses of all the display pages can be generated with a simple loop, and the product information on each page collected from them (see the sketch after the code below).

A < -getNodeset (TXT, path = "//span[@class = 'page-number']")# D < -sapply (a,xmlGetAttr," data-number-link")# fetch subpage directory pagenum=strsplit(d,"=") maxPagenum =0; for(i in 1:length(pagenum)){ maxpagenum[i]= pagenum[[i]][3] } maxpagenum=max(as.numeric(maxpagenum)) #[1] 1200Copy the code

Once all page addresses are available, the information for every product can be obtained by iterating over the pages and crawling each one with the same XML keyword approach.

For each page, the name information, image information, and price information are extracted. Through text processing and output, the results can be saved for subsequent data analysis.
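Putting the pieces together, a loop along the following lines visits each page, parses it, and pulls out the three fields. This is only a sketch: the XPath class names (product-name, product-image, product-price) are assumptions for illustration, and the real ones have to be taken from the page source in the same way as above.

# Sketch: crawl every display page and collect name / image / price
# (the XPath class names below are assumed; read the real ones from the HTML source)
products <- data.frame()
for (p in pageurls) {
  page  <- htmlParse(getURL(p), asText = TRUE)
  name  <- sapply(getNodeSet(page, "//div[@class = 'product-name']"),  xmlValue)
  img   <- sapply(getNodeSet(page, "//img[@class = 'product-image']"), xmlGetAttr, "src")
  price <- sapply(getNodeSet(page, "//div[@class = 'product-price']"), xmlValue)
  products <- rbind(products, data.frame(name, img, price, stringsAsFactors = FALSE))
}

# Convert the encoding if needed and save the table for follow-up analysis
products$name <- iconv(products$name, "utf-8", "gbk")
write.csv(products, "shopbop_products.csv", row.names = FALSE)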