Series catalog:
Java Weibo Crawler Basics – a simple Weibo crawler (manual cookie)
Java Weibo Crawler Intermediate – the business interface (no cookie required)
Java Weibo Crawler Advanced – fetching Weibo cookies automatically (no account, millions of requests per day)
1. Foreword
Writing articles is a real pain for me. My writing isn't great: odd wording, sentences that don't quite connect, that sort of thing happens a lot, so please don't mind it (minding won't help anyway). This was also my first time writing in Markdown and I was just testing the waters, so the layout is a bit messy.
I can't do language and I can't do typography, and as I wrote these words I kept thinking, "Why are you writing this? Wouldn't it be easier to just paste the code?" But people always have to try. How do you know you can't do it until you try?
Finally, this article covers a basic Weibo crawler implementation; it is suitable for small-scale use.
2. Principle
Weibo has a 'Sina Visitor System' that intercepts any request that does not carry a valid Weibo cookie. Conversely, once you have that cookie, you can crawl Weibo's pages.
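You can see this for yourself with a tiny request. This is a sketch of mine, not something from the original article: as of writing, the interception page is titled "Sina Visitor System", but that marker string is an assumption that may change with Weibo's frontend.

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class VisitorSystemDemo {
    public static void main(String[] args) throws Exception {
        // Fetch the Weibo home page without any cookie
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet("https://weibo.com/"))) {
            String html = EntityUtils.toString(response.getEntity(), "UTF-8");
            // Without a valid cookie we land on the interception page, not the real content
            System.out.println(html.contains("Sina Visitor System")
                    ? "Intercepted by the visitor system"
                    : "Got real page content");
        }
    }
}
```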
There are a lot of tutorials on this out there already, but I feel most of them skip over the details, so I'm going to tidy it all up here.
3. Implementation
I recommend Python for crawlers, but I'm going to demonstrate in Java, because not everyone knows Python.
Besides, I have implemented this in Java before, so Ctrl+C, Ctrl+V will do; I really don't want to write it all again.
Now for the real thing. This time I chose Weibo's list page as the target.
1. Getting the cookie
Getting the cookie: log in to Weibo in Chrome, press F12 to open developer tools, press Ctrl+R to refresh, find the request for the current page, and copy its cookie.
2. Maven dependencies
This time, requests are sent with HttpClient and pages are parsed with Jsoup.
```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.1</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.5</version>
</dependency>
```
3. Java code implementation
- In fact, the request headers for most Weibo pages only need "User-Agent", "Host" and "Cookie"; the rest are optional.
- I'm not familiar with Markdown, and some of the code is long enough that it displayed badly, so parts of the original code were screenshots.
- This code was written back in 2018; honestly it feels like a bit of a dark history, and the complete classes are too embarrassing to show, so I'll only cut out some key code.
- Weibo's page data is filled in via FM.view(), so it needs special handling when parsing (see the example right after this list).
- If there is any mistake, corrections are welcome.
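For reference, those data-filling scripts look roughly like this. This is a simplified illustration, not copied from a live page; the `ns` field name is my assumption, while the `html` field is what the parsing code below actually reads.

```html
<script>FM.view({"ns":"pl.content.signInPeople.index","domid":"Pl_Core_...","html":"<div class=\"mod_info S_line1\">...</div>"})</script>
```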
Step 1: create the entity class
Define an entity class with the fields you want to crawl.
```java
/**
 * @ClassName: entity
 * @Description: TODO(here is a one-sentence description of what this class does)
 * @date February 23, 2018, 2:37:27 PM
 */
@Data // Lombok generates the getters/setters; add the Lombok dependency if you copy this
public class WeiboDomain {
    private String uid;
    private String name;
    private String url;
    private String gender;
    private String location;
    private String description;
    private String tag;
    private String followers_count;
    private Integer friends_count;
    private Integer statuses_count;
    private boolean isVip;
    private String updateTime;
}
```
Code from 2018... suddenly I miss those days.
Step 2: create a GET request to fetch the page data
```java
/**
 * @Title: getHtml
 * @Description: TODO(return page data via URL)
 * @param url
 * @return String
 */
public String getHtml(String url, String cookie) {
    HttpGet httpGet = createHttpGet(url, cookie);
    return get(httpGet);
}
```
The createHttpGet method sets the request headers to simulate a real user; we simply copy in the request headers captured above.
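In the original article createHttpGet only appeared as a screenshot, so here is a minimal sketch of what it plausibly looked like. The User-Agent value is a placeholder to copy from your own browser, and the Host value should match the page you request:

```java
// A sketch, not the original (screenshot-only) code
private HttpGet createHttpGet(String url, String cookie) {
    HttpGet httpGet = new HttpGet(url);
    // The three headers that matter for most Weibo pages
    httpGet.setHeader("User-Agent", "Mozilla/5.0 ..."); // copy the full string from your browser
    httpGet.setHeader("Host", "weibo.com");             // match the host of the page you request
    httpGet.setHeader("Cookie", cookie);
    httpGet.setConfig(getRequestConfig()); // timeouts and proxy, shown below
    return httpGet;
}
```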
setProxy(httpHost) is used to set the proxy:
```java
private RequestConfig getRequestConfig() {
    return RequestConfig.custom()
            .setSocketTimeout(3000)
            .setConnectTimeout(3000)
            .setConnectionRequestTimeout(3000)
            .setProxy(httpHost)
            .build();
}
```
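getHtml also calls a get(HttpGet) helper that actually executes the request. That method isn't shown in the article either, so here is a plausible sketch using CloseableHttpClient and EntityUtils from the httpclient dependency:

```java
// A plausible sketch of the get(...) helper; the original code isn't shown
private String get(HttpGet httpGet) {
    try (CloseableHttpClient client = HttpClients.createDefault();
         CloseableHttpResponse response = client.execute(httpGet)) {
        return EntityUtils.toString(response.getEntity(), "UTF-8");
    } catch (IOException e) {
        e.printStackTrace();
        return "";
    }
}
```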
Step 3: parse the page data
With the getHtml method above we can fetch all the data on the page, but the raw HTML can't be used directly, so we need Jsoup to parse and convert it.
```java
/**
 * @Title: parseData
 * @Description: TODO(parses page data into a collection)
 * @param html
 * @return List<WeiboDomain>
 */
public List<WeiboDomain> parseData(String html) {
    List<WeiboDomain> result = new ArrayList<>();
    Document doc = Jsoup.parse(html);
    // Extract the FM.view() filled data
    String str = "";
    Elements scripts = doc.getElementsByTag("script");
    // Find the script containing the data
    for (int i = 0; i < scripts.size(); i++) {
        String script = scripts.get(i).html();
        if (script.contains("pl.content.signInPeople.index")) {
            str = getHtml(script);
            break;
        }
    }
    // Parse the page data
    doc = Jsoup.parse(str);
    Elements user = doc.getElementsByTag("dd");
    for (Element element : user) {
        if (element.attr("class").equals("mod_info S_line1")) {
            WeiboDomain weiboDomainGroup = new WeiboDomain();
            String uid = "";
            Elements elements = element.getElementsByTag("div");
            for (Element div : elements) {
                if (div.attr("class").equals("info_name W_fb W_f14")) {
                    Element S_txt1 = div.getElementsByClass("S_txt1").get(0);
                    uid = S_txt1.attr("usercard").split("&")[0].replaceAll("id=", "");
                    weiboDomainGroup.setUid(uid);
                    weiboDomainGroup.setUrl(S_txt1.attr("href"));
                    weiboDomainGroup.setName(S_txt1.attr("title"));
                    Elements i = div.getElementsByTag("i");
                    for (Element ele : i) {
                        if (ele.attr("class").equals("W_icon icon_member")) {
                            weiboDomainGroup.setVip(true);
                        } else if (ele.attr("class").equals("W_icon icon_male")) {
                            weiboDomainGroup.setGender("m");
                        } else {
                            // any other icon is treated as female
                            weiboDomainGroup.setGender("f");
                        }
                    }
                }
                if (div.attr("class").equals("info_connect")) {
                    // follows / followers / statuses counts
                    Elements em = div.getElementsByTag("em");
                    weiboDomainGroup.setFriends_count(Integer.parseInt(em.get(0).text()));
                    weiboDomainGroup.setFollowers_count(em.get(1).text());
                    weiboDomainGroup.setStatuses_count(Integer.parseInt(em.get(2).text()));
                }
                if (div.attr("class").equals("info_add")) {
                    Elements span = div.getElementsByTag("span");
                    weiboDomainGroup.setLocation(span.get(0).text());
                }
                if (div.attr("class").equals("info_intro")) {
                    Elements span = div.getElementsByTag("span");
                    weiboDomainGroup.setDescription(span.get(0).text());
                }
                if (div.attr("class").equals("info_relation")) {
                    String tag = div.text().split(":")[1];
                    weiboDomainGroup.setTag(tag);
                }
            }
            weiboDomainGroup.setUpdateTime(LocalDateTime.now()
                    .format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")));
            result.add(weiboDomainGroup);
        }
    }
    return result;
}

/**
 * @Title: getHtml
 * @Description: TODO(the Weibo page data is wrapped in FM.view(), so it needs preprocessing)
 * @return String
 */
private String getHtml(String str) {
    // Strip the FM.view( ... ) wrapper, then read the "html" field of the JSON
    str = str.replaceAll("FM.view\\(", "").replaceAll("\\)", "");
    // JSONObject.fromObject comes from json-lib (net.sf.json)
    JSONObject json = JSONObject.fromObject(str);
    return json.getString("html");
}
```
Step 4: Test
Finally, let's run a quick test to make sure everything works, and that's it for the implementation.
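The test code itself wasn't included, so here is a rough sketch of how the pieces fit together. WeiboCrawler is a made-up name for whatever class holds getHtml and parseData, and the URL and cookie are placeholders:

```java
public static void main(String[] args) {
    // Placeholders: the list page you want to crawl and the cookie from step 1
    String url = "https://weibo.com/...";
    String cookie = "<cookie copied from the browser>";

    WeiboCrawler crawler = new WeiboCrawler(); // hypothetical class holding the methods above
    String html = crawler.getHtml(url, cookie);
    List<WeiboDomain> users = crawler.parseData(html);
    for (WeiboDomain user : users) {
        System.out.println(user.getUid() + "\t" + user.getName());
    }
}
```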
4. Simulated login
The method above requires manually copying the cookie from the page every time, which is cumbersome and impossible to automate. So for crawling we generally obtain cookies automatically through a simulated login.
I originally planned to cover it here, but while testing today I found that the simulated login I wrote before no longer works, and I'll be showing a different way to obtain cookies automatically (see the advanced article in this series). So please find your own material on simulated login; I won't be fixing mine.
5. Code
If you need all the code, please contact me