Series catalog:

Java Weibo Crawler, Basics – a simple Weibo crawler (manual cookie)

Java Weibo Crawler, Intermediate – the commercial interface (no cookies)

Java Weibo Crawler, Advanced – fetching Weibo cookies automatically (no account, millions of records per day)

1. Foreword

Articles are a real pain for me to write. My wording is clumsy and my sentences often don't quite hang together, so please bear with me. This is also my first time writing in Markdown, so the layout may be a bit messy.

I can't write, I can't do typography, and while typing these words I kept asking myself, "Why are you writing this? Wouldn't it be easier to just paste the code?" But you have to try things. How do you know you can't until you do?

Finally: this article covers a basic implementation of a Weibo crawler, suitable for small-scale use.

2. Some background

There are many big-data and public-opinion-analysis companies on the market now. They inevitably rely on crawlers, which naturally includes Weibo crawlers. Weibo does offer a commercial interface for its data, but it has many limitations: rate limits, missing fields, requirements it simply can't meet. The rate limit is the worst. My previous company had several commercial interface accounts sharing one rate limit, and during large collection jobs the interface was completely unusable.

That's when you need a crawler, but Weibo's anti-crawler arsenal is genuinely impressive: all page content is filled in via FM.view(), cookies are dynamically encrypted, there are captchas, IP bans, and above all the Sina Visitor System. Without a Weibo cookie, every page the crawler fetches comes back as the Sina Visitor System page. A Weibo cookie requires logging in, and if you crawl too much the account gets frozen.

Crawling with account cookies gets accounts frozen, so you need a pile of throwaway accounts; without a cookie you only get the Sina Visitor System page and no content. I didn't have throwaway accounts, so I had to figure out how to get past this mechanism.

3. A salute to the original author

A while back I took on a requirement at my company that needed Weibo pages crawled: about 20 million records, with only a week to do it. A commercial interface simply couldn't keep up. Luckily, the almighty search engine pointed me to an article describing exactly this approach.

4. The basic principle

The Sina Visitor System checks whether the request for a Weibo page carries a Weibo cookie. If it does, the request is redirected through to the page; if it doesn't, and the client doesn't look like a crawler, the system creates a visitor cookie for it and lets it through.

5. The analysis

1. The redirect flow

Take Weibo's "Find people – field" page (d.weibo.com/1087030002_…) as an example; "1087030002_2975_2017_0" is the pathname of the page. The first request for it is answered with a 302 redirect, and after a series of further requests it finally comes back 200.

Compare the two "1087030002_2975_2017_0" requests. The request headers are nearly identical; the big difference is that the first request carries no cookie (the response sets one via Set-Cookie), while the second request carries three cookie values: YF-Page-G0, SUB and SUBP. In other words, this cookie is exactly the visitor cookie we need to obtain.

Take Chrome as an example. To see these requests, clear the cache (cookies above all), press F12 to open the developer tools, tick Preserve log under the Network tab, and finally reload with Ctrl+R.

2. Analyzing the Sina Visitor System

Now let's look at how it actually sets the cookie. Copy the second request and send it in Postman: what comes back is the Sina Visitor System page itself.

One glance shows that the incarnate() method is what grants the user a visitor identity. It sends a GET request which, on comparison, turns out to be the sixth request above. In other words, if we can send that sixth request successfully, we can get the cookie.

From the JS we can see that the sixth request is sent with the following parameters:

a, t, w, c, gc, cb, from, _rand. Of these, a, cb and from are fixed values, and _rand is a random number.

In the sixth request gc is empty, so we only need to work out t (tid), w (where) and c (confidence).

(I tested it: whether or not gc is included in the request makes no difference.)
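As a quick sketch, the parameter list above can be assembled into the sixth request's URL like this. The tid, where and confidence arguments are placeholders, and `buildIncarnateUrl` is a name I made up for illustration, not anything from the original project:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class VisitorUrlBuilder {

    // Assembles the incarnate URL from the three variable parameters.
    // a, cb and from are fixed, gc stays empty, _rand is a random number.
    public static String buildIncarnateUrl(String tid, String where, String confidence) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("a", "incarnate");                       // fixed
        params.put("t", tid);                               // from genvisitor
        params.put("w", where);                             // "3" or "2", from new_tid
        params.put("c", confidence);                        // confidence value
        params.put("gc", "");                               // empty in the sixth request
        params.put("cb", "cross_domain");                   // fixed
        params.put("from", "weibo");                        // fixed
        params.put("_rand", String.valueOf(Math.random())); // random number
        StringBuilder sb = new StringBuilder("https://passport.weibo.com/visitor/visitor?");
        boolean first = true;
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (!first) sb.append('&');
            sb.append(e.getKey()).append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
            first = false;
        }
        return sb.toString();
    }
}
```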

3. mini_original.js

So what is tid, and where are those three parameters assigned? After a careful search, I found that the page imports a JS file in its body tag, which on comparison turns out to be our third request:

```html
<script type="text/javascript" src="/js/visitor/mini_original.js?v=20161116"></script>
```

Since tid and the three parameters aren't assigned there, they must come from this JS file: 1,984 lines of source. Searching it for tid turns them up.

As shown in the figure: w (where) -> recover, so w (where) appears to be equivalent to recover.

Next comes sorting out the source. To make the screenshots manageable I moved some of the JS code around, but the logic is unchanged.

Scrolling down, we find this method, which is obviously the one that obtains the tid. Comparing it with the requests in the Network tab shows that it is the fifth request.

It is a POST request passing two values: cb, a fixed value, gen_callback; and fp, generated by the getFp() method. I looked into getFp(): it basically collects constants such as browser type, window size and fonts, presumably to decide whether the client is a crawler. In other words, these values won't change as long as you don't change your browser configuration, so for testing you can simply copy them in.
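For reference, the body of that POST is ordinary form encoding. A minimal sketch, under the assumption that the fingerprint JSON is whatever constant string you copied from your own browser (the `buildGenvisitorBody` helper name is mine, not the project's):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class GenvisitorBody {

    // Builds the x-www-form-urlencoded body for the genvisitor POST:
    // cb is the fixed value gen_callback, fp is the browser-fingerprint
    // JSON produced by getFp() -- constant for a fixed browser config.
    public static String buildGenvisitorBody(String fingerprintJson) {
        return "cb=gen_callback&fp="
                + URLEncoder.encode(fingerprintJson, StandardCharsets.UTF_8);
    }
}
```

Since fp only changes when the browser configuration changes, hard-coding the copied value is fine for testing.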

Then we use Postman to simulate the request to https://passport.weibo.com/visitor/genvisitor (the fifth request). Its return value is:

```json
{
    "retcode": 20000000,
    "msg": "succ",
    "data": {
        "tid": "O8DdOkekzzLgrDM2e0HhvBRePB8ZVty6FeowFyc7IR0=",
        "new_tid": true
    }
}
```

With tid in hand, w (where) and c (confidence) follow: w is 3 when "new_tid" is true, and 2 when it is false.

c (confidence) may or may not be present; when it is absent, default it to 100.

It isn't present in my response here, but sometimes there is a "confidence" field under data, which was always 95 whenever I tested.

At this point we have everything needed to send request 6 (passport.weibo.com/visitor/vis…) and simulate it. Look at the result: sub and subp are both there.

```js
window.cross_domain && cross_domain({
    "retcode": 20000000,
    "msg": "succ",
    "data": {
        "sub": "_2AkMsBM0Wf8NxqwJRmfgQzm_laoR-yg3EieKaWDzNJRMxHRl-yT83qn04tRB6B4Tj-ZvOcFzfsmjrLJjxv39RkzOyvMzE",
        "subp": "0033WrSXqPxfM72-Ws9jqgMF55529P9D9Whhkx2zn2ycSbRz3ZvmBTfm"
    }
});
```

Add these two values, plus the earlier YF-Page-G0, and the visitor cookie is complete. Put this cookie in the request header and request https://d.weibo.com/1087030002_2975_2017_0 again. Complete success.

6. Java implementation

That's the principle, so now the code. Let me show a version implemented in Java. Crawlers are usually written in Python, but this one is Java. No excuse, just laziness: a Java version already existed before this requirement came along, so Ctrl+C/Ctrl+V was enough, whereas Python would have meant starting from scratch. Too lazy.

That said, if crawling is all you're doing, I still recommend Python.

All the code below is split out of the original project. I changed it a little and left some redundancy unoptimized. Don't copy it blindly; some parts of the implementation are useless on their own.

Step 1: Maven dependencies

Maven imports the jar packages. The HTTP client I use is HttpClient, and jsoup parses the HTML pages. The dependencies are as follows:

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.1</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.5</version>
</dependency>
```

Step 2: Create the connection utility class

SSL configuration, proxy IPs, cookies and other connection settings are all added here, which makes it a bit involved. Add or remove pieces as you need.

```java
/**
 * @Title: generateClient
 * @Description: TODO(add proxy)
 * @param httpHost
 * @return CloseableHttpClient
 */
public static CloseableHttpClient generateClient(HttpHost httpHost, CookieStore cookieStore) {
	SSLContext sslcontext = SSLContexts.createSystemDefault();
	Registry<ConnectionSocketFactory> socketFactoryRegistry = RegistryBuilder.<ConnectionSocketFactory>create()
			.register("http", PlainConnectionSocketFactory.INSTANCE)
			.register("https", new SSLConnectionSocketFactory(sslcontext)).build();
	// HTTP connection pool manager, serving connection requests from multiple threads
	PoolingHttpClientConnectionManager connectionManager = new PoolingHttpClientConnectionManager(
			socketFactoryRegistry);
	connectionManager.setMaxTotal(200);
	connectionManager.setDefaultMaxPerRoute(20);

	RequestConfig requestConfig = RequestConfig.custom().setProxy(httpHost).build();

	HttpClientBuilder httpClientBuilder = HttpClients.custom().setUserAgent(randomUserAgent())
			.setConnectionManager(connectionManager).setDefaultRequestConfig(requestConfig)
			.setDefaultCookieStore(cookieStore);
	return httpClientBuilder.build();
}
```

Step 3: Obtain the values of t, w, and c

```java
private JSONObject getTidAndC() throws IOException {
	String url = "https://passport.weibo.com/visitor/genvisitor";
	HttpPost httpPost = createHttpPost(url);
	CloseableHttpResponse response = httpclient.execute(httpPost);
	HttpEntity entity = response.getEntity();
	if (entity != null) {
		// Convert the entity to a String with the given encoding
		String body = EntityUtils.toString(entity, "utf-8");
		body = body.replaceAll("window.gen_callback && gen_callback\\(", "");
		body = body.replaceAll("\\);", "");
		JSONObject json = JSONObject.fromObject(body).getJSONObject("data");
		System.out.println(body);
		return json;
	}
	return null;
}
```

Step 4: Obtain the cookie

From the result returned by getTidAndC() we get the three parameters, then use them to fetch the cookie.

```java
public String getCookie() throws IOException {
	JSONObject json = getTidAndC();
	String t = "";
	String w = "";

	String c = json.containsKey("confidence") ? json.getString("confidence") : "100";
	if (json.containsKey("new_tid")) {
		w = json.getBoolean("new_tid") ? "3" : "2";
	}
	if (json.containsKey("tid")) {
		t = json.getString("tid");
	}
	System.out.println(c);
	String url = "https://passport.weibo.com/visitor/visitor?a=incarnate&t=" + t + "&w=" + w
			+ "&c=0" + c + "&gc=&cb=cross_domain&from=weibo&_rand=" + Math.random();
	HttpGet httpGet = createCookieGet(url, "tid=" + t + "__" + c);
	CloseableHttpResponse response = httpclient.execute(httpGet);
	HttpEntity httpEntity = response.getEntity();
	String body = EntityUtils.toString(httpEntity, "utf-8");
	System.out.println(body);
	body = body.replaceAll("window.cross_domain && cross_domain\\(", "");
	body = body.replaceAll("\\);", "");

	JSONObject obj = JSONObject.fromObject(body).getJSONObject("data");
	System.out.println(obj.toString());
	String cookie = "YF-Page-G0=" + getYF() + "; SUB=" + obj.getString("sub")
			+ "; SUBP=" + obj.getString("subp");
	System.out.println("cookie: " + cookie);
	httpclient.close();
	return cookie;
}
```

One parameter remains: YF-Page-G0, whose value we get from Set-Cookie by sending a plain request. That completes the visitor cookie.

```java
public String getYF() throws IOException {
	String domain = "1087030002_2975_5012_0";
	String url = "https://d.weibo.com/" + domain;
	HttpGet httpGet = createHttpGet(url, null);
	CloseableHttpResponse response = httpclient.execute(httpGet);

	List<Cookie> cookies = cookieStore.getCookies();
	String str = "";
	for (Cookie cookie : cookies) {
		str = cookie.getValue();
	}
	return str;
}
```
Copy the code

Step 5: Test

Finally, it's time to call everything and test.

Just call getCookie() to obtain the complete visitor cookie; then you can fetch page data.

As you've probably noticed, these methods are leftovers from my own test files, so I was too lazy to tidy them up. The output of the code below has already been shown above, so I won't repeat it.

```java
@Test
public void test() {
	String domain = "1087030002_2975_2013_0";
	try {
		// Obtain a cookie
		String cookie = getCookie();
		// Put the cookie in the request header and build a GET request
		String url = "https://d.weibo.com/" + domain;
		HttpGet httpGet = createHttpGet(url, cookie);
		// Execute the request and print the response
		CloseableHttpClient httpclient = HttpClients.custom().build();
		CloseableHttpResponse response = httpclient.execute(httpGet);
		HttpEntity httpEntity = response.getEntity();
		String body = EntityUtils.toString(httpEntity, "utf-8");
		// html
		System.out.println(body);
	} catch (Exception e) {
		e.printStackTrace();
	}
}
```

7. Notes

  • Most Weibo pages can be crawled this way, but search and comments won't give you everything; the limit seems to be about 20 pages.
  • This method gets IP addresses blocked very easily; for large-scale use you must set up proxy IPs.
  • From a single IP, fetching the cookie may fail roughly one time in five; handle that yourself.
  • The first few attempts from each IP usually fail; just retry.
  • As of 14:10:13, March 8, 2020, the method still worked. After that, who knows.
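For the proxy note above, a minimal round-robin pool is usually enough. This is only a sketch: `ProxyPool` and its "host:port" strings are illustrative, and in practice you would hand the chosen entry to generateClient as an HttpHost.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Thread-safe round-robin rotation over a fixed list of proxies, so that
// consecutive cookie fetches go out through different IPs.
public class ProxyPool {
    private final List<String> proxies;                 // "host:port" entries
    private final AtomicInteger cursor = new AtomicInteger();

    public ProxyPool(List<String> proxies) {
        this.proxies = proxies;
    }

    // Returns the next proxy, wrapping around; floorMod keeps the index
    // non-negative even after the int counter overflows.
    public String next() {
        int i = Math.floorMod(cursor.getAndIncrement(), proxies.size());
        return proxies.get(i);
    }
}
```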

8. A multi-threaded, automated Weibo crawler

I can't say much about the concrete implementation here: the company project is already in production, the non-disclosure agreement is in effect, and this blogger isn't going to make trouble for himself. The general idea is the usual set of operations: start multiple threads, give them different proxy IPs, add error alerting and exception handling, and set a sensible request frequency so each IP lives longer.

I used to run eight threads and could crawl around 2 million records a day. I'll write this part up once the NDA expires, if I still remember it by then.
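Without giving away any project code, the general shape just described looks roughly like this. The commented-out fetch calls are placeholders for the getCookie()/page-fetching code from the implementation above, and the per-task sleep stands in for a real rate limiter:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CrawlerRunner {

    // Submits one task per domain to a fixed pool of worker threads. Each
    // task would fetch a cookie via its own proxy, pull the page, and sleep
    // to keep the request frequency sane. Returns the number of tasks that
    // completed without an exception.
    public static int run(List<String> domains, int threads, long delayMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger done = new AtomicInteger();
        for (String domain : domains) {
            pool.submit(() -> {
                try {
                    // String cookie = getCookie();             // implementation above
                    // String html = fetchPage(domain, cookie); // hypothetical fetch
                    Thread.sleep(delayMillis);                  // crude rate limit
                    done.incrementAndGet();
                } catch (Exception e) {
                    e.printStackTrace();                        // plug alerting in here
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }
}
```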

9. The end

Repost if you like, but leave the body intact: credit the source and the author, and the rest is up to you. If you run into problems, or spot any mistakes, feel free to contact me. Corrections welcome. And a belated happy Women's Day.

10. Frequently asked questions

1. I followed the steps but can't see the redirect; there's no 302, it goes straight to 200.

You probably didn't clear your cookies. Try F12 -> Application -> expand Cookies under Storage -> right-click -> Clear.

2. Sometimes the tid comes back containing characters such as + or /, and stripping those characters produces a tid error.

In that case the only fix is to fetch the cookie again.
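That "fetch again" fix can be wrapped in a tiny retry loop. A sketch under the assumption that the supplier wraps a real call such as the getTidAndC() method shown earlier (both `fetchCleanTid` and the Supplier are illustrative):

```java
import java.util.function.Supplier;

public class TidRetry {

    // Keeps fetching until the tid is free of '+' and '/', up to maxAttempts.
    // The Supplier stands in for a real call that returns a fresh tid.
    public static String fetchCleanTid(Supplier<String> fetchTid, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            String tid = fetchTid.get();
            if (tid != null && !tid.contains("+") && !tid.contains("/")) {
                return tid; // usable tid
            }
        }
        return null; // every attempt produced a bad tid; give up
    }
}
```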

3. The cookie works even without YF-Page-G0.

When I tested it, I couldn't get the value without YF-Page-G0. Use your own judgment.

4. Why do I keep getting errors?

I don't know.