Original link: tecdat.cn/?p=4559

 

Because e-commerce data changes in real time, analysis usually has to work directly from the web pages themselves, which makes web crawling an essential technique. As data analysis software, R can take crawled data straight into follow-up processing, and its fast loading makes it a good tool for crawling and analysing e-commerce website data.

The following takes an e-commerce platform as an example to walk through crawling website data with RCurl.

 

First, the required packages need to be installed and loaded in RGui.
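If the packages are not yet installed, a one-off install.packages() call takes care of that (a minimal sketch; run it once before loading the packages):

# Install the packages once if they are missing
install.packages(c("RCurl", "rjson", "stringr", "XML"))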

Require ("RCurl") require("rjson") require(stringr) require(XML) # Url = "http://cn.shopbop.com/" doc = getURL(url) TXT = htmlParse(doc, asText = TRUE) print(TXT)Copy the code

 

Extracting product data from the page requires analysing the HTML source structure, which can be inspected directly in the browser before writing the corresponding R code.

 

In the source code it is easy to find the sub-site URLs in the site navigation.

 

The nodes holding the corresponding sub-site links can therefore be located with an XPath expression.

A < -getNodeset (TXT, path = "//a[@class = 'parent topnav-logged-in ']")#Copy the code

 

 

If the extracted Chinese characters come out garbled, the encoding needs to be converted:

b <- sapply(a, xmlValue)        # extract the text of each node

c <- iconv(b, "utf-8", "gbk")   # convert the encoding

c

The xmlGetAttr function is then used to pull out the attributes you want, in this case the href links:

? D < -sapply (a,xmlGetAttr, "href")#Copy the code

 

paste() is used to join each relative link onto the site's root URL:

d1 <- paste(url, d[1], sep = "")
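Since paste() is vectorised, the same call also builds absolute URLs for every navigation link at once (a small sketch; the name suburls is only for illustration):

# Build absolute URLs for all sub-site links in one step
suburls <- paste(url, d, sep = "")
head(suburls)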

To collect the product information, the page numbers of the product display sub-pages are also needed.

Each page shows 40 items, out of 1200 items in total.

 

The address pattern of the product display pages is easy to work out from the URL.

 

Therefore, the addresses of all the display pages can be generated with a simple loop, and the product information on each page collected from them (see the sketch after the code below).

A < -getNodeset (TXT, path = "//span[@class = 'page-number']")# D < -sapply (a,xmlGetAttr," data-number-link")# fetch subpage directory pagenum=strsplit(d,"=") maxPagenum =0; for(i in 1:length(pagenum)){ maxpagenum[i]= pagenum[[i]][3] } maxpagenum=max(as.numeric(maxpagenum)) #[1] 1200Copy the code

Once all page addresses are available, the information for every product can be obtained by iterating over the pages and crawling each one with the same XML keyword approach.

For each page, the name information, image information, and price information are extracted. Through text processing and output, the results can be saved for subsequent data analysis.
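Putting the pieces together, a loop along the following lines visits each page, parses it, and pulls out the three fields. This is only a sketch: the XPath class names (product-name, product-image, product-price) are assumptions for illustration, and the real ones have to be taken from the page source in the same way as above.

# Sketch: crawl every display page and collect name / image / price
# (the XPath class names below are assumed; read the real ones from the HTML source)
products <- data.frame()
for (p in pageurls) {
  page  <- htmlParse(getURL(p), asText = TRUE)
  name  <- sapply(getNodeSet(page, "//div[@class = 'product-name']"),  xmlValue)
  img   <- sapply(getNodeSet(page, "//img[@class = 'product-image']"), xmlGetAttr, "src")
  price <- sapply(getNodeSet(page, "//div[@class = 'product-price']"), xmlValue)
  products <- rbind(products, data.frame(name, img, price, stringsAsFactors = FALSE))
}

# Convert the encoding if needed and save the table for follow-up analysis
products$name <- iconv(products$name, "utf-8", "gbk")
write.csv(products, "shopbop_products.csv", row.names = FALSE)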