The words written in the front
In 2017, there were many sudden deaths and suicides in THE IT industry, near-death experiences in entrepreneurship, liquidation and layoffs of major companies, etc. The end of the year is always full of activities, year-end reviews, bills, etc. 30 year old programming ape or something. Unfortunately, we might be one of them. To change the status quo, money and investment are inseparable, and when it comes to investment, housing has to be mentioned. Housing prices in the past few decades have upended the idea that hard work changes your life. The advantage of being a programmer is that he can make use of Internet data for data analysis and decision-making. Next, let’s look at the simple data analysis based on the statistics of second-hand houses in Yubei District of Chongqing and the sales area of national commercial houses in China.
The data collection
Lianjia public data
The url of Lianjia is very standard, and the data we want to obtain is as follows (Chongqing Lianjia). On January 27, 2018, the data is 472, a total of 16 pages. Url for https://cq.lianjia.com/ershoufang/renhe/pg1/… https://cq.lianjia.com/ershoufang/renhe/pg16/
Java Jsoup crawler
Now working full time in Java, and for a quick implementation using Jsoup to get Html data, the code is very simple
Document document = Jsoup.connect(domain + uri)
.userAgent("Mozilla / 5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")
.timeout(30000).get();
Copy the code
Save Html data to a local file
The virtuous Lianjia is not doing anti-crawler, or it may be to be friendly to search engines and increase exposure. After data acquisition to save to a local file is simple, after the merge code is as follows
public String saveHtml(String domain, String uri) {
FileOutputStream out = null;
File dest = new File("src/temp_html/" + (uri.endsWith("/")? uri.substring(0, uri.length() - 1) : uri) +".html");
try {
Document document = Jsoup.connect(domain + uri)
.userAgent("Mozilla / 5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")
.timeout(30000).get();
if(! dest.getParentFile().exists()) { dest.getParentFile().mkdirs(); }if(! dest.exists()) { dest.createNewFile(); } out = new FileOutputStream(dest,false);
out.write(document.toString().getBytes("UTF-8"));
out.close();
} catch (IOException e) {
e.printStackTrace();
return null;
} finally {
IOUtils.closeQuietly(out);
}
return dest.getPath();
}
Copy the code
Parse data to Excel through Jsoup
It is relatively easy to parse the data in HTML and write it into Excel using POI, mainly in rule matching
Analytical data
public List<House> parseHtml(String fileName) { List<House> houses = null; File File = new File(fileName); // write a loop to read the names of these files try {if (file.isFile()) {
logger.info("File {} begins parsing", fileName); Parse (file, Jsoup)"UTF-8");
Elements sellListContents = doc.select("ul.sellListContent");
if(! ObjectUtils.isEmpty(sellListContents)) { Element sellListContent = sellListContents.first(); Elements sellContents = sellListContent.select("li.clear");
if(! ObjectUtils.isEmpty(sellListContents)) { houses = new ArrayList<>(sellContents.size());for (int i = 0; i < sellContents.size(); i++) {
Element sellList = sellContents.get(i);
Elements sellElements = sellList.select("div.clear");
if(null ! = sellElements && sellElements.size() == 1) { House house = new House(); Element infoElement = sellElements.first(); Element titleElement = infoElement.select("div.title").first().select("a[href]").first();
house.setTitle(titleElement.text());
house.setUrl(titleElement.attr("href"));
Element houseInfoElement = infoElement.selectFirst("div.houseInfo");
house.setVillage(houseInfoElement.select("a").first().text());
String houseInfo = houseInfoElement.text();
String[] houseInfos = houseInfo.split("\ \ |");
for (int j = 1; j < houseInfos.length; j++
) {
switch (j) {
case 1:
house.setStructure(houseInfos[j].trim());
break;
case 2:
house.setAcreage(houseInfos[j].replace("Square"."").trim());
break;
case 3:
house.setOrientation(houseInfos[j].trim());
break;
case 4:
house.setDecoration(houseInfos[j].trim());
break;
case 5:
house.setElevator(houseInfos[j].trim());
break;
}
}
Element positionInfoElement = infoElement.selectFirst("div.positionInfo");
house.setRegion(positionInfoElement.select("a").first().text());
String position = positionInfoElement.text().split(house.getRegion())[0].trim();
house
.setPosition(position.lastIndexOf("-") == position.length() - 1 ? position.substring(0, position.length() - 1).trim() : position);
house.setTag(infoElement.selectFirst("div.tag").text());
Element priceInfoElement = infoElement.selectFirst("div.priceInfo");
house.setTotalPrice(priceInfoElement.selectFirst("div.totalPrice").text().replace("万".""));
house.setUnitPrice(priceInfoElement.selectFirst("div.unitPrice").text().replace("The price"."").replace("Yuan per square meter"."").trim());
logger.info("Parse the {} element, resulting in: {}", i, house.toString());
houses.add(house);
}
}
}
}
}
} catch (Exception e) {
logger.error("File {} parsing error", fileName, e);
}
return houses;
}
Copy the code
Write to Excel
public String writeExcel(String fileName, List<House> houses) {
logger.info("File {} starts writing", fileName);
try (POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream(fileName))) {
Workbook workbook = new HSSFWorkbook(fs);
HSSFSheet sheet = (HSSFSheet) workbook.getSheetAt(0);
final int preLastRowNum = sheet.getLastRowNum();
logger.info("File {}, last line number {}", fileName, preLastRowNum);
for (int i = 0; i < houses.size(); i++) {
Row row = sheet.createRow(preLastRowNum + i + 1);
House house = houses.get(i);
row.createCell(0).setCellValue(preLastRowNum + i + 1);
row.createCell(1).setCellValue(house.getTitle());
row.createCell(2).setCellValue(house.getTotalPrice());
row.createCell(3).setCellValue(house.getUnitPrice());
row.createCell(4).setCellValue(house.getVillage());
row.createCell(5).setCellValue(house.getStructure());
row.createCell(6).setCellValue(house.getAcreage());
row.createCell(7).setCellValue(house.getOrientation());
row.createCell(8).setCellValue(house.getDecoration());
row.createCell(9).setCellValue(house.getElevator());
row.createCell(10).setCellValue(house.getPosition());
row.createCell(11).setCellValue(house.getRegion());
row.createCell(12).setCellValue(house.getTag());
row.createCell(13).setCellValue(house.getUrl());
}
FileOutputStream fileOut = new FileOutputStream(fileName);
workbook.write(fileOut);
IOUtils.closeQuietly(fileOut);
logger.info("File {} complete,{}-{} line", fileName, preLastRowNum + 1, preLastRowNum + houses.size());
} catch (IOException e) {
logger.error("Excel -{} processing error", fileName, e);
}
return fileName;
}
Copy the code
The data processing
There is nothing wrong with checking the data after writing, and then you need to do some simple processing of the data.
Delete parking Spaces and villas
It is observed that there are parking Spaces and villas in the data, which are not in our investment scope, so we need to screen out and delete them from the categories
Duplicate data
There are words like Palm Springs PHASE 1, Palm Springs Phase 2 and Palm Springs Phase 3 in the name of the community. We do not need such statistics for the time being in our preliminary analysis. In the later specific decision-making, we need to refer to whether a certain phase is closer to the light rail and the school, and the parking ratio is higher. So you have to replace all of them with Palm Springs.
Data analysis and presentation
The data analysis was completed under the guidance of my wife, who has many years of experience in risk control and compliance of Lakala, Zhonghui Payment, and made extensive use of perspective
Analysis of the average unit price of the community
After the preliminary data processing, there will be no more unpleasant content of 500W total price. At this time, we need to make an intuitive bar chart of average unit price of the community to shock
type | Average Unit Price (YUAN) | community |
---|---|---|
The highest | 20000 | Bebiddle Hall |
The minimum | 6480 | And the home |
On average, | 14151.46 |
Of course, this price is not accurate, including the age of the house, transportation, decoration, school indicators and other factors, but at least we can take the first step to let ourselves have a certain understanding of the market price, calculate their affordable price
Housing quantity analysis
Next, let’s take a look at some residential areas with lots of housing supply. After our previous processing, we only take the top 20 regardless of the number of periods, and the distribution is as follows
Analysis of the proportion of house type
As an investment house is extremely important, which house is popular in the market, the new property will also be analyzed to the crowd, the property crowd positioning.
National commercial housing sales area analysis
February 3 to the national commercial housing sales area grab, make a simple line chart
In fact, there are a lot of analysis, presumably the popularity of the house and the price is a certain relationship, how long can recover the cost. Programmers always think too much, turning a simple requirement into a price monitoring and forecasting system. So this article is only a simple analysis, the code is very simple, about 2 hours.
The source code
Github:house