Series
- Developing a blog project with ABP vNext and .NET Core – creating the project with the ABP CLI
- Developing a blog project with ABP vNext and .NET Core – slimming the project down and getting it running
- Developing a blog project with ABP vNext and .NET Core – refinement and beautification, integrating Swagger
- Developing a blog project with ABP vNext and .NET Core – data access and Code First
- Developing a blog project with ABP vNext and .NET Core – CRUD with a custom repository
- Developing a blog project with ABP vNext and .NET Core – unifying the API specification and wrapping the return model
- Developing a blog project with ABP vNext and .NET Core – more on Swagger: grouping, descriptions, and the little green lock
- Developing a blog project with ABP vNext and .NET Core – integrating GitHub and protecting the API with JWT
- Developing a blog project with ABP vNext and .NET Core – exception handling and logging
- Developing a blog project with ABP vNext and .NET Core – caching data with Redis
- Developing a blog project with ABP vNext and .NET Core – integrating Hangfire for scheduled tasks
- Developing a blog project with ABP vNext and .NET Core – mapping objects with AutoMapper
- Developing a blog project with ABP vNext and .NET Core – best practices for scheduled tasks (Part 1)
- Developing a blog project with ABP vNext and .NET Core – best practices for scheduled tasks (Part 2)
- Developing a blog project with ABP vNext and .NET Core – best practices for scheduled tasks (Part 3)
- Developing a blog project with ABP vNext and .NET Core
- Developing a blog project with ABP vNext and .NET Core
- Developing a blog project with ABP vNext and .NET Core
- Developing a blog project with ABP vNext and .NET Core
- Developing a blog project with ABP vNext and .NET Core
- Developing a blog project with ABP vNext and .NET Core – Blazor
- Developing a blog project with ABP vNext and .NET Core – Blazor (Part 2)
- Developing a blog project with ABP vNext and .NET Core – Blazor
- Developing a blog project with ABP vNext and .NET Core – Blazor
- Developing a blog project with ABP vNext and .NET Core – Blazor
- Developing a blog project with ABP vNext and .NET Core – Blazor (Part 6)
- Developing a blog project with ABP vNext and .NET Core – Blazor
- Developing a blog project with ABP vNext and .NET Core – Blazor (Part 8)
- Developing a blog project with ABP vNext and .NET Core – Blazor (Part 9)
- Developing a blog project with ABP vNext and .NET Core – final project release
The previous article (juejin.cn/post/684490…) used AutoMapper to handle the mapping between objects. This article focuses on scheduled tasks and data crawling, and combines the two in practice: running a crawler inside a recurring job to capture data on a schedule.
Before starting, you can delete the HelloWorld jobs used for testing in the previous articles; they have no practical value, so just get rid of them. For data crawling I mainly use HtmlAgilityPack and PuppeteerSharp. In general, HtmlAgilityPack can handle most scraping needs, while PuppeteerSharp is useful for crawling dynamic pages; it can also do impressive things like saving pages as images or PDFs.
I won't go into these two libraries in detail here; if you're not familiar with them, please look into them on your own.
Install the two packages in the .BackgroundJobs layer: Install-Package HtmlAgilityPack and Install-Package PuppeteerSharp. I generally don't specify a version number when installing with the Package Manager, since it installs the latest version for me by default.
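If you prefer the command line, the dotnet CLI equivalents (run from the .BackgroundJobs project directory) are:
dotnet add package HtmlAgilityPack
dotnet add package PuppeteerSharp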
I happened to notice that the web version of the Aisi Assistant (i4.cn) has lots of phone wallpapers (www.i4.cn/wper_4_0_1_…), so I had an idea: grab all of the phone wallpapers from it. You can see the finished result on my personal blog: meowv.com/wallpaper 😝😝😝
I originally did this in Python; now let's do the same crawl in .NET.
I counted 20 categories, so add an enum, WallpaperEnum.cs, directly in the .Domain.Shared layer.
//WallpaperEnum.cs
using System.ComponentModel;

namespace Meowv.Blog.Domain.Shared.Enum
{
    public enum WallpaperEnum
    {
        [Description("Beauty")]
        Beauty = 1,
        [Description("Stylish men")]
        Sportsman = 2,
        [Description("Cute baby")]
        CuteBaby = 3,
        [Description("Emotion")]
        Emotion = 4,
        [Description("Landscape")]
        Landscape = 5,
        [Description("Animal")]
        Animal = 6,
        [Description("Plant")]
        Plant = 7,
        [Description("Food")]
        Food = 8,
        [Description("Movies & TV")]
        Movie = 9,
        [Description("Anime")]
        Anime = 10,
        [Description("Hand-painted")]
        HandPainted = 11,
        [Description("Text")]
        Text = 12,
        [Description("Creative")]
        Creative = 13,
        [Description("Cars")]
        Car = 14,
        [Description("Sports")]
        PhysicalEducation = 15,
        [Description("Military")]
        Military = 16,
        [Description("Holiday")]
        Festival = 17,
        [Description("Game")]
        Game = 18,
        [Description("Apple")]
        Apple = 19,
        [Description("Other")]
        Other = 20
    }
}
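The Description text is the display name for each category. It isn't consumed in this article, but for completeness, here is one common way to read it later via reflection (a generic helper sketch, not code from the project itself):

//EnumExtensions.cs (generic helper sketch – not part of the project's source)
using System.ComponentModel;
using System.Reflection;

public static class EnumExtensions
{
    // Returns the [Description] text of an enum value, falling back to the member name.
    public static string GetDescription(this System.Enum value)
    {
        var field = value.GetType().GetField(value.ToString());
        var attribute = field?.GetCustomAttribute<DescriptionAttribute>();
        return attribute?.Description ?? value.ToString();
    }
}

For example, WallpaperEnum.CuteBaby.GetDescription() would return "Cute baby".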
Looking at the original web page, it's clear that each category corresponds to a different URL, so I manually build a crawl list whose entries hold a URL and its category, then fetch those URLs on multiple threads and collect the results. To standardize this, and for the later wallpaper query API, create a new generic class, WallpaperJobItem.cs, in the .Application.Contracts layer.
//WallpaperJobItem.cs
using Meowv.Blog.Domain.Shared.Enum;

namespace Meowv.Blog.Application.Contracts.Wallpaper
{
    public class WallpaperJobItem<T>
    {
        /// <summary>
        /// <see cref="Result"/>
        /// </summary>
        public T Result { get; set; }

        /// <summary>
        /// Type
        /// </summary>
        public WallpaperEnum Type { get; set; }
    }
}
WallpaperJobItem<T> accepts a type parameter T, which determines the type of Result. In the Jobs folder of the .BackgroundJobs layer, create a new job named WallpaperJob.cs. Same as before, it implements IBackgroundJob.
//WallpaperJob.cs
using Meowv.Blog.Application.Contracts.Wallpaper;
using Meowv.Blog.Domain.Shared.Enum;
using System.Collections.Generic;
using System.Threading.Tasks;

namespace Meowv.Blog.BackgroundJobs.Jobs.Wallpaper
{
    public class WallpaperJob : IBackgroundJob
    {
        public async Task ExecuteAsync()
        {
            var wallpaperUrls = new List<WallpaperJobItem<string>>
            {
                new WallpaperJobItem<string> { Result = "https://www.i4.cn/wper_4_19_1_1.html", Type = WallpaperEnum.Beauty },
                new WallpaperJobItem<string> { Result = "https://www.i4.cn/wper_4_19_58_1.html", Type = WallpaperEnum.Sportsman },
                new WallpaperJobItem<string> { Result = "https://www.i4.cn/wper_4_19_66_1.html", Type = WallpaperEnum.CuteBaby },
                new WallpaperJobItem<string> { Result = "https://www.i4.cn/wper_4_19_4_1.html", Type = WallpaperEnum.Emotion },
                new WallpaperJobItem<string> { Result = "https://www.i4.cn/wper_4_19_3_1.html", Type = WallpaperEnum.Landscape },
                new WallpaperJobItem<string> { Result = "https://www.i4.cn/wper_4_19_9_1.html", Type = WallpaperEnum.Animal },
                new WallpaperJobItem<string> { Result = "https://www.i4.cn/wper_4_19_13_1.html", Type = WallpaperEnum.Plant },
                new WallpaperJobItem<string> { Result = "https://www.i4.cn/wper_4_19_64_1.html", Type = WallpaperEnum.Food },
                new WallpaperJobItem<string> { Result = "https://www.i4.cn/wper_4_19_11_1.html", Type = WallpaperEnum.Movie },
                new WallpaperJobItem<string> { Result = "https://www.i4.cn/wper_4_19_5_1.html", Type = WallpaperEnum.Anime },
                new WallpaperJobItem<string> { Result = "https://www.i4.cn/wper_4_19_34_1.html", Type = WallpaperEnum.HandPainted },
                new WallpaperJobItem<string> { Result = "https://www.i4.cn/wper_4_19_65_1.html", Type = WallpaperEnum.Text },
                new WallpaperJobItem<string> { Result = "https://www.i4.cn/wper_4_19_2_1.html", Type = WallpaperEnum.Creative },
                new WallpaperJobItem<string> { Result = "https://www.i4.cn/wper_4_19_10_1.html", Type = WallpaperEnum.Car },
                new WallpaperJobItem<string> { Result = "https://www.i4.cn/wper_4_19_14_1.html", Type = WallpaperEnum.PhysicalEducation },
                new WallpaperJobItem<string> { Result = "https://www.i4.cn/wper_4_19_63_1.html", Type = WallpaperEnum.Military },
                new WallpaperJobItem<string> { Result = "https://www.i4.cn/wper_4_19_17_1.html", Type = WallpaperEnum.Festival },
                new WallpaperJobItem<string> { Result = "https://www.i4.cn/wper_4_19_15_1.html", Type = WallpaperEnum.Game },
                new WallpaperJobItem<string> { Result = "https://www.i4.cn/wper_4_19_12_1.html", Type = WallpaperEnum.Apple },
                new WallpaperJobItem<string> { Result = "https://www.i4.cn/wper_4_19_7_1.html", Type = WallpaperEnum.Other }
            };
        }
    }
}
Start by building a list of wallpaperUrls to crawl. Here we are going to use HtmlAgilityPack to crawl only the latest data on the first page.
public async Task ExecuteAsync()
{
    ...
    var web = new HtmlWeb();
    var list_task = new List<Task<WallpaperJobItem<HtmlDocument>>>();

    wallpaperUrls.ForEach(item =>
    {
        var task = Task.Run(async () =>
        {
            var htmlDocument = await web.LoadFromWebAsync(item.Result);
            return new WallpaperJobItem<HtmlDocument>
            {
                Result = htmlDocument,
                Type = item.Type
            };
        });
        list_task.Add(task);
    });

    Task.WaitAll(list_task.ToArray());
}
In this code we first create an HtmlWeb object, which is what we use to load each URL.
web.LoadFromWebAsync(…) returns an HtmlDocument, which matches the element type of list_task above; this is exactly why WallpaperJobItem was made generic, so it can hold whatever we capture.
We loop through wallpaperUrls and wait for all the requests to complete. With 20 HtmlDocument objects and their categories in hand, we can now process list_task.
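One small design note: Task.WaitAll blocks the calling thread. Since ExecuteAsync is already async, a non-blocking alternative (optional, not what the code above does) would be:

// await all downloads without blocking the thread; items is a WallpaperJobItem<HtmlDocument>[]
var items = await Task.WhenAll(list_task);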
Before processing, think about where to store the captured image data. I choose the database again; with the custom-repository CRUD work from the earlier articles, this is quick to set up.
Adding the entity class, the custom repository, the DbSet, the Code First migration and so on, I won't cover in detail; anyone who has read the earlier articles can handle it.
The Wallpaper entity class contains a Guid primary key, a Title, the image Url, a Type, and a CreateTime.
The custom repository contains one method for bulk inserts: BulkInsertAsync(…).
I'll only post a screenshot of the finished code here, no listing; if you need it, grab it from GitHub.
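For reference, a rough sketch of what the entity and the custom repository contract might look like, pieced together from the description above and the calls used later (GetListAsync and BulkInsertAsync). The names and namespaces here are assumptions; the real code is on GitHub.

//Wallpaper.cs / IWallpaperRepository.cs (sketch only – see GitHub for the actual implementation)
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

namespace Meowv.Blog.Domain.Wallpaper
{
    public class Wallpaper
    {
        public Guid Id { get; set; }            // primary key
        public string Title { get; set; }       // image title
        public string Url { get; set; }         // image address
        public int Type { get; set; }           // category, cast from WallpaperEnum
        public DateTime CreateTime { get; set; }
    }

    // assumed shape of the custom repository described above
    public interface IWallpaperRepository
    {
        Task<List<Wallpaper>> GetListAsync();                     // all stored wallpapers (used later for de-duplication)
        Task BulkInsertAsync(IEnumerable<Wallpaper> wallpapers);  // bulk insert
    }
}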
Going back to WallpaperJob: since we're grabbing images, we only need the img tags in the HTML.
The XPath for the wallpaper list is //article[@id='wper']/div[@class='jbox']/div[@class='kbox']. If you're not familiar with XPath syntax, there's an earlier quick-start article: www.cnblogs.com/meowv/p/113… .
Using the XPath Helper tool, we can test the node selection directly in the browser.
//article[@id='wper']/div[@class='jbox']/div[@class='kbox']/div/a/img highlights the expected nodes, which shows our syntax is correct.
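If you'd rather verify the XPath in code than in the browser, a quick HtmlAgilityPack check against a tiny hand-written sample of the page structure (sample markup only, not the real page) could look like this:

// Quick XPath sanity check with HtmlAgilityPack
// using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml("<article id='wper'><div class='jbox'><div class='kbox'><div><a><img title='demo' data-big='/demo.jpg'/><img/></a></div></div></div></article>");
var imgs = doc.DocumentNode.SelectNodes("//article[@id='wper']/div[@class='jbox']/div[@class='kbox']/div/a/img[1]");
// imgs.Count == 1 – only the first img under each a tag is selected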
public async Task ExecuteAsync()
{
    ...
    var wallpapers = new List<Wallpaper>();

    foreach (var list in list_task)
    {
        var item = await list;
        var imgs = item.Result.DocumentNode.SelectNodes("//article[@id='wper']/div[@class='jbox']/div[@class='kbox']/div/a/img[1]").ToList();
        imgs.ForEach(x =>
        {
            wallpapers.Add(new Wallpaper
            {
                Url = x.GetAttributeValue("data-big", ""),
                Title = x.GetAttributeValue("title", ""),
                Type = (int)item.Type,
                CreateTime = x.Attributes["data-big"].Value.Split("/").Last().Split("_").First().TryToDateTime()
            });
        });
    }
    ...
}
In the foreach loop we await each task to get the current item, a WallpaperJobItem<HtmlDocument>.
DocumentNode.SelectNodes() with the XPath gives us the image nodes; since each a tag contains two img tags, we take only the first one (img[1]).
GetAttributeValue() is an HtmlAgilityPack method that reads an attribute value directly, with a default if the attribute is missing.
Looking at the images, I noticed that the image addresses are generated from a timestamp, so I use the TryToDateTime() extension method to convert that part of the file name into a DateTime.
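TryToDateTime() is one of the project's own string extensions and isn't shown in this article. Purely as an illustration of the idea, and assuming the file-name prefix is a millisecond Unix timestamp (an assumption on my part; check the real extension on GitHub), a sketch might look like:

// Hypothetical sketch – the real TryToDateTime() lives in the project's extension library on GitHub.
using System;

public static class StringExtensionsSketch
{
    // Assumes the parsed segment is a millisecond Unix timestamp such as "1577808000000" (assumption).
    public static DateTime TryToDateTime(this string value)
    {
        return long.TryParse(value, out var ms)
            ? DateTimeOffset.FromUnixTimeMilliseconds(ms).LocalDateTime
            : DateTime.MinValue;   // fall back when parsing fails
    }
}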
So we have all the images sorted into the list, and then we call the batch insert method.
Inject the custom repository IWallpaperRepository in the constructor.
...
private readonly IWallpaperRepository _wallpaperRepository;

public WallpaperJob(IWallpaperRepository wallpaperRepository)
{
    _wallpaperRepository = wallpaperRepository;
}
...
...
var urls = (await _wallpaperRepository.GetListAsync()).Select(x => x.Url);

wallpapers = wallpapers.Where(x => !urls.Contains(x.Url)).ToList();

if (wallpapers.Any())
{
    await _wallpaperRepository.BulkInsertAsync(wallpapers);
}
...
Because the crawl may pick up duplicate images, we need to de-duplicate: first query all the URLs already in the database, filter out any captured wallpaper whose URL already exists, and finally call BulkInsertAsync(…) to insert the rest in bulk.
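A small efficiency note: urls above is a lazily evaluated IEnumerable, so Contains re-scans it for every captured wallpaper. Materializing it into a HashSet (an optional tweak, not in the original code) makes each lookup O(1):

// optional: O(1) duplicate checks
var urls = (await _wallpaperRepository.GetListAsync()).Select(x => x.Url).ToHashSet();
wallpapers = wallpapers.Where(x => !urls.Contains(x.Url)).ToList();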
That completes the data-capture logic. Once the data is saved to the database we can take it further, for example writing a log entry or sending an email notification; feel free to experiment.
Next, write an extension method that schedules the job to run every three hours.
...
public static void UseWallpaperJob(this IServiceProvider service)
{
    var job = service.GetService<WallpaperJob>();

    RecurringJob.AddOrUpdate("Wallpaper data capture", () => job.ExecuteAsync(), CronType.Hour(1, 3));
}
...
Finally, call it in the module.
...
public override void OnApplicationInitialization(ApplicationInitializationContext context)
{
    ...
    // 'service' is the IServiceProvider obtained in the code omitted above
    service.UseWallpaperJob();
}
...
Compile and run, open the Hangfire dashboard, and trigger the job manually to see the effect.
Perfect, the database is now storing plenty of data. One word of caution, though: crawling carries risks, so proceed carefully.
In this article we used Hangfire together with HtmlAgilityPack to grab data and store it in the database. 😁😁😁
Open source: github.com/Meowv/Blog/…