Background
During the epidemic, everyone was stuck at home with little to do. A colleague started a thread asking for movie recommendations, and the thread's author gathered all the replies into a Douban doulist. I happen to be writing a series of articles on web crawlers, so let's use this list as a concrete case to introduce another handy tool, Jsoup.
What is Jsoup
Jsoup is a Java HTML parser that can parse a URL directly as well as raw HTML text. It provides a very convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.
Jsoup uses the same selector syntax as jQuery for node operations. To learn the jQuery selector syntax, see https://www.cnblogs.com/zhangziqiu/archive/2009/05/03/jQuery-Learn-2.html; it is not covered here.
Page structure analysis
Paging data analysis
Press F12 to open Chrome Developer Tools, scroll to the bottom of the page, and select the pagination node. You can see that the pagination block is styled with its own CSS class, paginator. The page links are the a nodes under that div, so a CSS selector can select them directly and read their link addresses.
Example code follows. The pagination block also contains the previous-page and next-page links, which we do not need, so a regular expression keeps only the links whose text is a number.
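// needs: java.util.regex.Pattern, org.jsoup.nodes.Element, org.jsoup.select.Elements
// an optional "+" followed by digits
Pattern pattern = Pattern.compile("^[\\+]?[\\d]*$");
Elements page = document.select(".paginator a");
for (Element p : page) {
    String href = p.attr("href");
    String text = p.text();
    // keep only the links whose text is a page number
    if (pattern.matcher(text).matches()) {
        System.out.println(href + " " + text);
    }
}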
This yields all the page links, which can be saved for later use.
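To see the number filter on its own, here is a minimal sketch that runs the same kind of pattern over a handful of link texts without touching the network. The class name PageLabelFilter and the sample labels are made up for illustration; the real page's previous/next labels will differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class PageLabelFilter {
    // An optional "+" followed by digits, as in the filter above.
    // Because of "*", an empty label also matches; use "+" instead
    // if at least one digit should be required.
    private static final Pattern NUMERIC = Pattern.compile("^[\\+]?[\\d]*$");

    // Returns only the labels that look like page numbers, in input order.
    static List<String> numericLabels(List<String> labels) {
        List<String> result = new ArrayList<>();
        for (String label : labels) {
            if (NUMERIC.matcher(label).matches()) {
                result.add(label);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical link texts from a pagination block.
        List<String> labels = Arrays.asList("<Prev", "1", "2", "3", "Next>");
        System.out.println(numericLabels(labels)); // prints [1, 2, 3]
    }
}
```

Keeping the filter in a small static method makes it easy to check the pattern against edge cases before pointing the crawler at the real page.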
Analyzing individual movie entries
Playback sources
A closer look at each movie entry shows three cases for the full-movie playback sources:
- No playback list
- A playback list with at most 3 entries
- More than 3 entries, with the extra ones behind a "More" link
As before, use a CSS selector to select the nodes, then read the text and link of each one.
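// needs: java.net.URLDecoder, java.nio.charset.StandardCharsets, java.util.Objects
Element videoItem = item.select(".doulist-video-items").first();
if (videoItem != null) {
    Elements videoAtags = videoItem.getElementsByTag("a");
    for (Element e : videoAtags) {
        String href = e.attr("href");
        String text = e.text();
        // skip the "More" expand link (labelled "更多" on the page)
        if (Objects.equals("更多", text)) {
            continue;
        }
        // redirect links wrap the real target; decode and take the part after the first "="
        if (href.contains("www.douban.com/link2")) {
            String urlDecode = URLDecoder.decode(href, StandardCharsets.UTF_8); // Charset overload needs Java 10+
            href = urlDecode.split("=")[1];
        }
        System.out.println(text + " " + href);
    }
}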
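One thing to watch with the redirect links: decoding the whole link first and then splitting on "=" cuts the target short whenever the target itself contains a "=". A possible alternative, sketched below, splits the still-encoded query string first and decodes only the value of the first parameter. The class name RedirectTarget and the sample link are hypothetical, and the sketch assumes the target is carried by the first query parameter, which is what the split above already implies.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

final class RedirectTarget {
    // Returns the decoded value of the first query parameter,
    // or the link unchanged if it has no usable query string.
    static String firstParamValue(String href) {
        int q = href.indexOf('?');
        if (q < 0) {
            return href;
        }
        // Work on the raw (still percent-encoded) query so that "=" and "&"
        // inside the target cannot be mistaken for separators.
        String firstPair = href.substring(q + 1).split("&")[0];
        String[] keyValue = firstPair.split("=", 2);
        if (keyValue.length < 2) {
            return href;
        }
        // The Charset overload of decode requires Java 10 or later.
        return URLDecoder.decode(keyValue[1], StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical redirect link; the parameter names are assumptions.
        String href = "https://www.douban.com/link2/?url=https%3A%2F%2Fexample.com%2Fplay%3Fid%3D42&source=x";
        System.out.println(firstParamValue(href)); // prints https://example.com/play?id=42
    }
}
```

With the decode-then-split approach the same link would come out as https://example.com/play?id, so splitting before decoding is the safer order when targets carry their own query parameters.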
Movie details
The data we need includes the movie title, the score, the number of ratings, and the key information shown in the figure.
Again, CSS selectors pick out the nodes.
Elements itemElements = document.select(".article .doulist-item");
int size = itemElements.size();
for (int i = 0; i < size; i++) {
    Element item = itemElements.get(i);
    Element title = item.selectFirst("div.title");
    Element ratingNums = item.selectFirst(".rating_nums");
    // the last <span> inside .rating holds the rating count, e.g. "(1902833人评价)"
    Element rating = item.select(".rating").get(0).getElementsByTag("span").last();
    String titleText = title.text();
    String ratingNumsText = ratingNums.text();
    // strip "(" and "人评价)" ("people rated") so that only the number remains
    String ratingText = rating.text().replaceAll("\\(", "").replaceAll("人评价\\)", "");
    System.out.println(titleText);
    System.out.println(ratingNumsText);
    System.out.println(ratingText);
}
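Stripping the brackets and the label with two replaceAll calls depends on the exact wording of the label. As a rough alternative, the sketch below simply pulls the first run of digits out of the rating text and turns it into a number, which also makes the count usable for sorting. The class name RatingCount and the sample strings are made up for illustration; the actual label text on the page is different.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RatingCount {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Returns the first run of digits in the text as a number,
    // or an empty OptionalLong if the text contains no digits.
    static OptionalLong parse(String text) {
        Matcher m = DIGITS.matcher(text);
        if (m.find()) {
            return OptionalLong.of(Long.parseLong(m.group()));
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        // Hypothetical rating text in the "(count label)" shape.
        System.out.println(parse("(1902833 ratings)").getAsLong()); // prints 1902833
        System.out.println(parse("no ratings yet").isPresent());    // prints false
    }
}
```

Returning an OptionalLong instead of a string forces the caller to decide what to do with entries that have no rating count at all.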
Top 10 movie recommendations
Movie | Score | Number of ratings |
---|---|---|
Blue Planet II | 9.8 | 31789 |
The Shawshank Redemption | 9.7 | 1902833 |
Farewell My Concubine | 9.6 | 1398550 |
Forrest Gump | 9.5 | 1447658 |
La vita è bella | 9.5 | 919234 |
One Piece (ワンピース) | 9.5 | 111412 |
Attract infestation チ rounding off the building | 9.4 | 361096 |
Legal High (リーガル・ハイ) | 9.4 | 224077 |
Inception | 9.3 | 1396540 |
La leggenda del pianista sull'oceano | 9.3 | 1159683 |
Get the source code and all movies
The movie 1024