Like first, then read: make it a habit 👏👏
Hello everyone, I'm Llamy. Today's post is a longer one, and it is packed with practical content.
Preface
When it comes to crawling web data, most people's first reaction is Python, and Python is indeed a natural fit for the job. But even developers with many years of Java experience may not know that Java can do crawling too. The best-known tool for it is the Jsoup web extraction framework.
How I came across it
Many years ago I built a precious-metals information website that needed to display the latest real-time prices of gold, silver, and so on from various exchanges. Back then, API access to that kind of data came from third-party service providers and had to be paid for. Later I found out through Baidu that Jsoup could scrape web pages, so I picked a suitable website, scraped the data from it, and displayed it on our pages in our own format, which saved a sum of money.
Last night, on a whim, I thought I could write an article about it, so I went back over the official website and decided to use Beike (Shell housing) as the target 😏
A Guide to Using Jsoup
Jsoup is really simple, so simple that I don't feel the need to walk through the development process. Just go to the official website and you can pick up the API in 10 minutes.
The official website has a guide and examples, so go and take a look.
Official website: jsoup.org/
The practical part
I will explain this part in detail through a hands-on project. I believe that by working through this example, you can pick up the technique easily.
1. Choose the web page to crawl
Beike (Shell housing), Shenzhen site, new homes: sz.fang.ke.com/loupan/pg
2. Create a Maven project
3. Add the dependencies to the POM file
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>30.0-jre</version>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>1.18.18</version>
</dependency>
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.14.3</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>easyexcel</artifactId>
<version>2.2.10</version>
</dependency>
4. Create a POJO class to map the data to an Excel sheet
Because the data captured from the web page is ultimately saved to an Excel file, we use Alibaba's EasyExcel. That means adding its dependency to the POM file and creating a POJO class that maps to the columns of the exported file.
import com.alibaba.excel.annotation.ExcelProperty;
import lombok.Data;
import lombok.experimental.Accessors;

// One row of the exported sheet; each @ExcelProperty value is the column header.
@Data
@Accessors(chain = true)
public class House {
    @ExcelProperty("Property name")
    private String title;
    @ExcelProperty("Detail page URL")
    private String detailPageUrl;
    @ExcelProperty("Property image")
    private String imageUrl;
    @ExcelProperty("Address")
    private String address;
    @ExcelProperty("Layout")
    private String houseType;
    @ExcelProperty("Property type")
    private String propertyType;
    @ExcelProperty("Status")
    private String status;
    @ExcelProperty("Gross floor area")
    private String buildingArea;
    @ExcelProperty("Total price")
    private String totalPrice;
    @ExcelProperty("Unit price (yuan/㎡)")
    private String singlePrice;
    @ExcelProperty("Tags")
    private String tag;
}
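For readers unfamiliar with Lombok, the sketch below shows by hand roughly what `@Accessors(chain = true)` changes: each setter returns `this`, so the calls can be chained when building a record. `ChainedHouse` is a hypothetical, trimmed-down stand-in for `House` with only two fields, and the sample values are made up; it is not code from the original project.

```java
// Hypothetical hand-written equivalent of a two-field class using
// Lombok's @Data together with @Accessors(chain = true).
public class ChainedHouse {
    private String title;
    private String address;

    // A chained setter returns the object itself instead of void.
    public ChainedHouse setTitle(String title) {
        this.title = title;
        return this;
    }

    public ChainedHouse setAddress(String address) {
        this.address = address;
        return this;
    }

    public String getTitle() {
        return title;
    }

    public String getAddress() {
        return address;
    }

    public static void main(String[] args) {
        // Sample values are invented for the demonstration.
        ChainedHouse house = new ChainedHouse()
                .setTitle("Sample Garden")
                .setAddress("Sample Road 1");
        System.out.println(house.getTitle() + " @ " + house.getAddress());
    }
}
```

This is why the crawler can build a `House` in a single expression with one setter call after another.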
5. The main method runs the business code
A few things to note here:
- Beike triggers human verification when the same IP address visits frequently within a short period, so the thread needs to sleep for a while on every page turn.
- Jsoup crawls by HTML element, so if Beike changes its page structure, the program may no longer be able to extract the data.
- The program does not crawl each property's detailed information; interested readers can extend it using the detail page URL.
- If the site you need to crawl requires login information or other special request parameters, Jsoup supports setting those too; see the API on the official website.
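The pause between page requests mentioned in the first note can be sketched as a small helper. This is an illustration under assumptions, not the article's code: `PoliteDelay`, its method names, and the idea of adding a random extra wait on top of the fixed pause are all hypothetical additions. The main method in this article simply sleeps for a fixed 10 seconds.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: a fixed base pause plus a random extra wait.
public class PoliteDelay {

    /** Returns baseMillis plus a random extra in the range [0, jitterMillis]. */
    static long nextDelayMillis(long baseMillis, long jitterMillis) {
        if (baseMillis < 0 || jitterMillis < 0) {
            throw new IllegalArgumentException("delays must be non-negative");
        }
        // nextLong(bound) is exclusive of bound, so add 1 to include jitterMillis.
        return baseMillis + ThreadLocalRandom.current().nextLong(jitterMillis + 1);
    }

    /** Sleeps the current thread for one computed delay. */
    static void pause(long baseMillis, long jitterMillis) throws InterruptedException {
        Thread.sleep(nextDelayMillis(baseMillis, jitterMillis));
    }

    public static void main(String[] args) throws InterruptedException {
        for (int page = 1; page <= 3; page++) {
            System.out.println("fetching page " + page);
            // Short values keep the demo fast; a real crawl would use seconds.
            pause(50, 20);
        }
    }
}
```

Whether a varying pause helps against a given site's verification is not something this sketch establishes; the fixed sleep is the approach the article actually uses.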
// Assumes the enclosing class carries Lombok's @Slf4j, which provides the `log` field used below.
@SneakyThrows
public static void main(String[] args) {
    AtomicInteger pageIndex = new AtomicInteger(1);
    // Number of listings per page
    int pageSize = 10;
    List<House> dataList = Lists.newArrayList();

    // Beike (Shell housing) site for the Shenzhen region
    String beikeUrl = "https://sz.fang.ke.com";
    // Shenzhen new-property listing URL; the page number is appended to it
    String loupanUrl = "https://sz.fang.ke.com/loupan/pg";

    // Use Jsoup to fetch the complete first page
    Document doc = Jsoup.connect(loupanUrl + pageIndex.get()).get();
    // Page title
    String pageTitle = doc.title();
    // Paging container
    Element pageContainer = doc.select("div.page-box").first();
    if (pageContainer == null) {
        return;
    }

    // Total number of properties
    int totalCount = Integer.parseInt(pageContainer.attr("data-total-count"));
    // Round up so that a final, partially filled page is not skipped
    int totalPages = (totalCount + pageSize - 1) / pageSize;

    // Crawl page by page
    for (int i = 0; i < totalPages; i++) {
        log.info("running get data, the current page is {}", pageIndex.get());
        // Beike has human verification, so it cannot be hit frequently in a short time.
        // Sleep for 10 seconds on every page turn.
        Thread.sleep(10000);
        doc = Jsoup.connect(loupanUrl + pageIndex.getAndIncrement()).get();

        // The ul element that holds the listings
        Element list = doc.select("ul.resblock-list-wrapper").first();
        if (list == null) {
            continue;
        }
        // The li element of each listing
        Elements elements = list.select("li.resblock-list");
        elements.forEach(el -> {
            // Property introduction (the first child of the li)
            Element introduce = el.child(0);
            // Detail page URL
            String detailPageUrl = beikeUrl + introduce.attr("href");
            // Property image
            String imageUrl = introduce.select("img").attr("data-original");

            // Property description block
            Element childDesc = el.select("div.resblock-desc-wrapper").first();
            if (childDesc == null) {
                // Skip listings that do not match the expected structure
                return;
            }
            Element childName = childDesc.child(0);
            // Property name
            String title = childName.child(0).text();
            // Sale status
            String status = childName.child(1).text();
            // Property type
            String propertyType = childName.child(2).text();
            // Property address
            String address = childDesc.child(1).text();

            // Room attributes
            Element room = childDesc.child(2);
            // Layout
            String houseType = "";
            Elements houseTypeSpans = room.getElementsByTag("span");
            // At least two spans are needed, otherwise the second remove would fail
            if (houseTypeSpans.size() >= 2) {
                // Drop the leading label span
                houseTypeSpans.remove(0);
                // Drop the trailing span that holds the floor-area text
                houseTypeSpans.remove(houseTypeSpans.size() - 1);
                houseType = StringUtil.join(houseTypeSpans.stream().map(Element::text).collect(Collectors.toList()), "/");
            }
            // Floor area
            String buildingArea = room.select("span.area").text();

            // Tags
            String tag = "";
            Element descTag = childDesc.select("div.resblock-tag").first();
            if (descTag != null) {
                Elements tagSpans = descTag.getElementsByTag("span");
                if (CollectionUtils.isNotEmpty(tagSpans)) {
                    tag = StringUtil.join(tagSpans.stream().map(Element::text).collect(Collectors.toList()), "");
                }
            }

            // Prices
            Element descPrice = childDesc.select("div.resblock-price").first();
            String singlePrice = descPrice == null ? "" : descPrice.select("span.number").text();
            String totalPrice = descPrice == null ? "" : descPrice.select("div.second").text();

            dataList.add(new House().setTitle(title)
                    .setDetailPageUrl(detailPageUrl)
                    .setImageUrl(imageUrl)
                    .setSinglePrice(singlePrice)
                    .setTotalPrice(totalPrice)
                    .setStatus(status)
                    .setPropertyType(propertyType)
                    .setAddress(address)
                    .setHouseType(houseType)
                    .setBuildingArea(buildingArea)
                    .setTag(tag)
            );
        });
    }

    if (CollectionUtils.isEmpty(dataList)) {
        log.info("dataList is empty returned.");
        return;
    }
    log.info("dataList prepare finished, size = {}", dataList.size());
    // Call the export logic to write the data to an Excel file
    export(pageTitle, dataList);
}
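The paging loop depends on how many pages the listing has. Because Java integer division truncates, `totalCount / pageSize` on its own would leave out a final, partially filled page, so the sketch below uses ceiling division. `PageMath` and its method names are hypothetical helpers for illustration, and the sample totals are made up; only the base URL comes from the article.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for working out which listing pages to request.
public class PageMath {

    /** Number of pages needed to show totalCount items, pageSize per page. */
    static int pageCount(int totalCount, int pageSize) {
        if (totalCount < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("invalid totalCount or pageSize");
        }
        // Ceiling division: adding (pageSize - 1) rounds any remainder up.
        return (totalCount + pageSize - 1) / pageSize;
    }

    /** Builds one URL per page by appending the 1-based page number. */
    static List<String> pageUrls(String baseUrl, int totalCount, int pageSize) {
        List<String> urls = new ArrayList<>();
        int pages = pageCount(totalCount, pageSize);
        for (int page = 1; page <= pages; page++) {
            urls.add(baseUrl + page);
        }
        return urls;
    }

    public static void main(String[] args) {
        // 95 listings at 10 per page need 10 pages; plain 95 / 10 gives 9.
        System.out.println(pageCount(95, 10));
        System.out.println(pageUrls("https://sz.fang.ke.com/loupan/pg", 25, 10));
    }
}
```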
6. EasyExcel export logic
/**
 * Writes the crawled data to an Excel file.
 *
 * @param pageTitle page title, used as the sheet name
 * @param dataList  crawled property records
 */
private static void export(String pageTitle, List<House> dataList) {
    // Header style: center horizontally
    WriteCellStyle headWriteCellStyle = new WriteCellStyle();
    headWriteCellStyle.setHorizontalAlignment(HorizontalAlignment.CENTER);
    // Content style: left-align
    WriteCellStyle contentWriteCellStyle = new WriteCellStyle();
    contentWriteCellStyle.setHorizontalAlignment(HorizontalAlignment.LEFT);
    HorizontalCellStyleStrategy horizontalCellStyleStrategy =
            new HorizontalCellStyleStrategy(headWriteCellStyle, contentWriteCellStyle);
    // Here the stream is set not to close automatically
    EasyExcelFactory.write("D:\\Shenzhen Real Estate summary.xlsx", House.class)
            .autoCloseStream(Boolean.FALSE)
            .registerWriteHandler(horizontalCellStyleStrategy)
            .sheet(pageTitle)
            .doWrite(dataList);
}
7. Results
Interested readers can try it out for themselves!
Source: GitHub
Writing original content takes effort, so please give it a like. Thank you! 🙏🙏