(1) Implementation ideas

1. Locate the danmaku file

Danmaku (bullet-screen comments) are usually stored in JSON or XML format, so we can locate the danmaku file by finding the XML or JSON file among the resources loaded by the video page.

2. Parse the danmaku file

Then parse the file with Jsoup to extract the text content of each danmaku.
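The idea in step 2 can be sketched without any third-party library. The example below is only an illustration under stated assumptions: it uses the JDK's built-in DOM parser instead of Jsoup, the class name DanmakuTextSketch and the method name extractText are hypothetical, and the sample XML is made up rather than taken from a real danmaku file. It collects the text of every d element, which is the same extraction the Jsoup code later in this article performs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; the name is not from the original project.
class DanmakuTextSketch {

    // Returns the text of every <d> element found in the given XML string.
    static List<String> extractText(String xml) {
        List<String> texts = new ArrayList<>();
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("d");
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers need no throws clause.
            throw new IllegalArgumentException("Could not parse danmaku XML", e);
        }
        return texts;
    }

    public static void main(String[] args) {
        // Made-up sample input, not real danmaku data.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<i><d>first comment</d><d>second comment</d></i>";
        for (String text : extractText(xml)) {
            System.out.println(text);
        }
    }
}
```

The XML declaration at the start of the sample is left in place on purpose: only the d elements are read, so it does not affect the result.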

(2) First implementation: parsing a local file

1. Locate the danmaku file

Let's say we want to crawl the danmaku file of the video below.

Open the Network panel in Chrome DevTools, refresh the page, and type xml in the filter box to show only XML resources:

Hover over the file to see its full address:

Right-click the file and choose Open in new tab to view the contents of the danmaku file in a new browser tab:

2. Parse the danmaku file

2.1 Create a basic Maven project

Enter GroupId and ArtifactId

This project uses the Jsoup JAR, so create a lib directory in the project root, copy the JAR into it, and then add the JAR to the project as follows:

Select the JAR and click OK.

2.2 Create the danmaku file in the project root directory

In the project root directory, create a file named 3232417.xml, copy the contents of the danmaku page https://comment.bilibili.com/3232417.xml, and save them to that file. Only the useful text is extracted when the file is parsed later, so the first line does not need to be removed, as follows:

2.3 Code Implementation

The code for parsing the local danmaku XML file is as follows:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.util.ArrayList;

/**
 * Created by shuhu on 2018/1/20.
 */
public class LocalFile {

    public static ArrayList<String> getData(String fileName){
        ArrayList<String> list = new ArrayList<String>();
        try{
            File input = new File(fileName);
            Document doc = Jsoup.parse(input, "UTF-8");
            // Each danmaku is stored in a <d> element
            Elements contents = doc.getElementsByTag("d");

            for (Element content : contents) {
                list.add(content.text());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return list;
    }
}

Call the getData method of LocalFile from the entry class Main.java, passing in the XML file name, then print each parsed danmaku:

import java.util.ArrayList;

/**
 * Created by shuhu on 2018/1/20.
 */
public class Main {

    public static void main(String[] args) {
        ArrayList<String> items = new ArrayList<String>();

        // 1. Get all danmaku from the local file
        items = LocalFile.getData("3232417.xml");

        // Iterate over and print each danmaku
        for (String item : items) {
            System.out.println(item);
        }
    }
}

The following output is displayed:

(3) Second implementation: parsing a file on a remote server

1. Add the HttpClient dependency

Because we need to access a remote server, we use a dependency that provides HTTP client functionality. Add the following to the pom.xml file:

<dependencies>
    <!-- Provides the functionality for accessing an HTTP server -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.3.3</version>
    </dependency>
</dependencies>

2. Implementation code

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;

/**
 * Created by shuhu on 2018/1/20.
 */
public class RemoteFile {

    public static ArrayList<String> getData(String fileName) throws IOException {
        ArrayList<String> list = new ArrayList<String>();

        // 1. Create the HttpClient object (an Apache CloseableHttpClient instance)
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 2. Create the GET request for the danmaku file URL
        HttpGet httpGet = new HttpGet(fileName);
        // 3. Execute the GET request
        CloseableHttpResponse httpResponse = httpClient.execute(httpGet);
        try {
            // 4. Get the content of the danmaku file
            HttpEntity httpEntity = httpResponse.getEntity();
            // UTF-8 is used only if the response does not declare a charset
            String httpHtml = EntityUtils.toString(httpEntity, "UTF-8");
            // 5. Parse the content and collect the text of every <d> element
            Document doc = Jsoup.parse(httpHtml);
            Elements contents = doc.getElementsByTag("d");
            for (Element content : contents) {
                list.add(content.text());
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            httpResponse.close();
            httpClient.close();
        }
        return list;
    }
}

CloseableHttpResponse httpResponse = httpClient.execute(httpGet); executes the GET request and stores the response in httpResponse. httpResponse.getEntity() returns the message entity of the response. The response also contains other content, such as headers, as shown in the following figure, but we only need the message entity returned by getEntity().

EntityUtils.toString(httpResponse.getEntity()); reads the entity's content stream and returns the server's response body as a string. For example, if the method called on the server ends with:
responseWriter.write("just do it");
then EntityUtils.toString(httpResponse.getEntity()); returns the string just do it. Here it can be understood simply as the source of the page, that is, everything you see when you right-click the page and view its source. This is the content we need to parse.
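To make the stream-to-string step concrete, here is a minimal JDK-only sketch of the idea. It is not HttpClient's actual implementation: EntityUtils.toString also takes the charset declared by the response into account, whereas this sketch always assumes UTF-8. The class name EntityToStringSketch and the method name readAll are hypothetical, and a ByteArrayInputStream stands in for the body a real server would send.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name is not from the original project.
class EntityToStringSketch {

    // Drains the stream and decodes the collected bytes as UTF-8 (an assumption).
    static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the body written by responseWriter.write("just do it")
        InputStream body = new ByteArrayInputStream(
                "just do it".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(body)); // prints "just do it"
    }
}
```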

Call the getData method of RemoteFile from the entry class Main.java, passing in the URL of the XML file, then print each parsed danmaku:

import java.io.IOException;
import java.util.ArrayList;

/**
 * Created by shuhu on 2018/1/20.
 */
public class Main {

    public static void main(String[] args) throws IOException {
        ArrayList<String> items = new ArrayList<String>();

        // 1. Get all danmaku from the local file
        // items = LocalFile.getData("3232417.xml");

        // 2. Get all danmaku from the remote server
        items = RemoteFile.getData("https://comment.bilibili.com/3232417.xml");

        // Iterate over and print each danmaku
        for (String item : items) {
            System.out.println(item);
        }
    }
}

The following output is displayed:

Project code

Code repository: Complete project code