Background

During the epidemic, everyone was stuck at home and bored. A colleague at my company started a thread asking for movie recommendations, and the original poster collected all the replies into a Douban doulist. I happen to be writing a series of articles on web crawlers, so let's use this list as a concrete case to introduce another handy tool, Jsoup.

What is Jsoup

Jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.

Jsoup selects nodes with a selector syntax very close to jQuery's. To learn that syntax, see https://www.cnblogs.com/zhangziqiu/archive/2009/05/03/jQuery-Learn-2.html; it is not covered here.
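
As a rough illustration of the selector-based API, here is a minimal sketch that parses an HTML string and reads the text of the matching nodes. It assumes the jsoup library is on the classpath; the class name, the method name, and the sample HTML are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class JsoupSelectorSketch {
    // Returns the text of every <li class="title"> in the given HTML (hypothetical markup).
    static List<String> titles(String html) {
        // Jsoup.parse works on an HTML string; Jsoup.connect(url).get() would fetch a page instead.
        Document document = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element li : document.select("li.title")) {
            result.add(li.text());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<ul><li class=\"title\">A</li><li>B</li><li class=\"title\">C</li></ul>";
        System.out.println(titles(html)); // prints [A, C]
    }
}
```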

Page structure analysis

Paging data analysis

Press F12 to open Chrome Developer Tools, scroll to the bottom of the page, and inspect the pagination node. The pagination block is styled with its own CSS class, paginator. Each page link is an a node under that div, so a CSS selector can pick out the a nodes directly and read their link addresses.

In the example code below, note that the pagination block also contains the previous-page and next-page links, which we do not need. To keep only the numbered links, filter the link text with a regular expression.

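// Requires java.util.regex.Pattern, org.jsoup.nodes.Element and org.jsoup.select.Elements.
// Matches link text that is a page number: an optional "+" followed by digits.
Pattern pattern = Pattern.compile("^\\+?\\d+$");
Elements page = document.select(".paginator a");
for (Element p : page) {
	String href = p.attr("href");
	String text = p.text();
	// Skip the previous/next links; keep only the numbered ones.
	if (pattern.matcher(text).matches()) {
		System.out.println(href + " " + text);
	}
}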

This will fetch all the page links and save them for later use.
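
The regular-expression filter can be tried on its own, without fetching a page. The sketch below uses only the JDK; the class name and the sample labels are hypothetical stand-ins for the link texts found in the pagination block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class PageLabelFilter {
    // An optional leading '+', then one or more digits, and nothing else.
    private static final Pattern PAGE_NUMBER = Pattern.compile("^\\+?\\d+$");

    static boolean isPageNumber(String label) {
        return PAGE_NUMBER.matcher(label.trim()).matches();
    }

    // Keeps only the labels that look like page numbers, in their original order.
    static List<String> keepPageNumbers(List<String> labels) {
        List<String> kept = new ArrayList<>();
        for (String label : labels) {
            if (isPageNumber(label)) {
                kept.add(label.trim());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical link texts: previous/next links plus numbered pages.
        List<String> labels = Arrays.asList("<Previous", "1", "2", "3", "Next>");
        System.out.println(keepPageNumbers(labels)); // prints [1, 2, 3]
    }
}
```

Requiring at least one digit means an empty link text is rejected rather than treated as a page number.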

Detailed analysis of individual films

Watch the full movie

A closer look at each movie entry shows three cases for the "watch the full movie" links:

  • No playlist
  • A playlist of up to 3 links
  • More than 3 links, in which case a "More" link is displayed

As before, use a CSS selector to pick the nodes and then read the corresponding text and link.

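// Requires java.net.URLDecoder, java.nio.charset.StandardCharsets and java.util.Objects.
Element videoItem = item.select(".doulist-video-items").first();
if (videoItem != null) {
	Elements videoAtags = videoItem.getElementsByTag("a");
	for (Element e : videoAtags) {
		String href = e.attr("href");
		String text = e.text();
		// Skip the "More" link; the literal must match the label as rendered on the page.
		if (Objects.equals("More", text)) {
			continue;
		}
		// Links through www.douban.com/link2 are redirects; decode them and take the target URL.
		if (href.contains("www.douban.com/link2")) {
			// The Charset overload needs Java 10+; on older JDKs use URLDecoder.decode(href, "UTF-8").
			String urlDecode = URLDecoder.decode(href, StandardCharsets.UTF_8);
			href = urlDecode.split("=")[1];
		}
		System.out.println(text + " " + href);
	}
}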
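
Decoding the whole link and then splitting on "=" can cut the target short when the target URL has its own query string. A slightly more defensive sketch is shown here, using only the JDK. It assumes the redirect carries the target in a query parameter named url; that parameter name, the class name, and the sample link are assumptions for illustration, not something confirmed by the page.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class RedirectTarget {
    // Returns the decoded value of the "url" query parameter, or the href itself if there is none.
    static String extractTarget(String href) {
        int queryStart = href.indexOf('?');
        if (queryStart < 0) {
            return href;
        }
        // Split the still-encoded query first, so '&' and '=' inside the target survive.
        for (String pair : href.substring(queryStart + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("url")) {
                try {
                    return URLDecoder.decode(pair.substring(eq + 1), "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    throw new IllegalStateException("UTF-8 is always supported", e);
                }
            }
        }
        return href;
    }

    public static void main(String[] args) {
        // Hypothetical redirect link pointing at example.com.
        String href = "https://www.douban.com/link2/?url=https%3A%2F%2Fexample.com%2Fwatch%3Fid%3D42&source=demo";
        System.out.println(extractTarget(href)); // prints https://example.com/watch?id=42
    }
}
```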

Movie details

The data we need is the movie name, the score, the number of reviewers, and the key information shown in the figure.

Again, CSS selectors do the job.

Elements itemElements = document.select(".article .doulist-item");
int size = itemElements.size();
for (int i = 0; i < size; i++) {
	Element item = itemElements.get(i);
	Element title = item.selectFirst("div.title");
	Element ratingNums = item.selectFirst(".rating_nums");
	// The last span inside .rating holds the reviewer count.
	Element rating = item.select(".rating").get(0).getElementsByTag("span").last();

	String titleText = title.text();
	String ratingNumsText = ratingNums.text();
	// Keep only the digits of the reviewer-count text, dropping the parentheses and the label.
	String ratingText = rating.text().replaceAll("\\D", "");
	System.out.println(titleText);
	System.out.println(ratingNumsText);
	System.out.println(ratingText);
}

Top 10 movie recommendations

Movie name | Score | Number of ratings
Blue Planet II | 9.8 | 31789
The Shawshank Redemption | 9.7 | 1902833
Farewell My Concubine | 9.6 | 1398550
Forrest Gump | 9.5 | 1447658
La Vita e Bella | 9.5 | 919234
One Piece (ワンピース) | 9.5 | 111412
Attract infestation チ rounding off the building | 9.4 | 361096
The winner is justice. ガ to the list | 9.4 | 224077
Inception | 9.3 | 1396540
La Leggenda del Pianista sull'Oceano | 9.3 | 1159683
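
The list above appears to be ordered by score, with ties broken by the number of ratings. Assuming that rule, a ranking like this could be produced from the scraped values roughly as follows. The Movie class and the top method are hypothetical helpers; the sample rows are taken from the list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class TopMovies {
    // Minimal holder for the three scraped fields (hypothetical type).
    static final class Movie {
        final String name;
        final double score;
        final int ratingCount;

        Movie(String name, double score, int ratingCount) {
            this.name = name;
            this.score = score;
            this.ratingCount = ratingCount;
        }
    }

    // Sorts by score (highest first), then by rating count (highest first), and keeps the first n.
    static List<Movie> top(List<Movie> movies, int n) {
        List<Movie> sorted = new ArrayList<>(movies);
        sorted.sort(Comparator.comparingDouble((Movie m) -> m.score).reversed()
                .thenComparing(Comparator.comparingInt((Movie m) -> m.ratingCount).reversed()));
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        List<Movie> movies = Arrays.asList(
                new Movie("Inception", 9.3, 1396540),
                new Movie("La Vita e Bella", 9.5, 919234),
                new Movie("The Shawshank Redemption", 9.7, 1902833),
                new Movie("Forrest Gump", 9.5, 1447658));
        for (Movie m : top(movies, 3)) {
            System.out.println(m.name + " " + m.score + " " + m.ratingCount);
        }
    }
}
```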

Get the source code and all movies

The movie 1024