series

  1. Developing a blog project based on ABP vNext and .NET Core – build the project with the ABP CLI
  2. Developing a blog project based on ABP vNext and .NET Core – slim the project down and make it run
  3. Developing a blog project based on ABP vNext and .NET Core – refinement and beautification, introducing Swagger
  4. Developing a blog project based on ABP vNext and .NET Core – data access and Code First
  5. Developing a blog project based on ABP vNext and .NET Core – CRUD with a custom repository
  6. Developing a blog project based on ABP vNext and .NET Core – a uniform API specification and wrapped response model
  7. Developing a blog project based on ABP vNext and .NET Core – more on Swagger: grouping, descriptions, the little green lock
  8. Developing a blog project based on ABP vNext and .NET Core – integrating GitHub and protecting the API with JWT
  9. Developing a blog project based on ABP vNext and .NET Core – exception handling and logging
  10. Developing a blog project based on ABP vNext and .NET Core – caching data with Redis
  11. Developing a blog project based on ABP vNext and .NET Core – integrating Hangfire for scheduled task processing
  12. Developing a blog project based on ABP vNext and .NET Core – mapping objects with AutoMapper
  13. Developing a blog project based on ABP vNext and .NET Core – best practices for scheduled tasks (Part 1)
  14. Developing a blog project based on ABP vNext and .NET Core – best practices for scheduled tasks (Part 2)
  15. Developing a blog project based on ABP vNext and .NET Core – best practices for scheduled tasks (Part 3)
  16. Developing a blog project based on ABP vNext and .NET Core
  17. Developing a blog project based on ABP vNext and .NET Core
  18. Developing a blog project based on ABP vNext and .NET Core
  19. Developing a blog project based on ABP vNext and .NET Core
  20. Developing a blog project based on ABP vNext and .NET Core
  21. Developing a blog project based on ABP vNext and .NET Core – Blazor (Part 1)
  22. Developing a blog project based on ABP vNext and .NET Core – Blazor (Part 2)
  23. Developing a blog project based on ABP vNext and .NET Core – Blazor (Part 3)
  24. Developing a blog project based on ABP vNext and .NET Core – Blazor (Part 4)
  25. Developing a blog project based on ABP vNext and .NET Core – Blazor (Part 5)
  26. Developing a blog project based on ABP vNext and .NET Core – Blazor (Part 6)
  27. Developing a blog project based on ABP vNext and .NET Core – Blazor (Part 7)
  28. Developing a blog project based on ABP vNext and .NET Core – Blazor (Part 8)
  29. Developing a blog project based on ABP vNext and .NET Core – Blazor (Part 9)
  30. Developing a blog project based on ABP vNext and .NET Core – final release of the project

In the previous post (juejin.cn/post/684490…), HtmlAgilityPack was used to scrape wallpaper data and store the images in the database. This post continues with scraping trending news from platforms across the web.

As before, you can preview the finished result on my personal blog: meowv.com/hot 😝😝😝. It works the same way as the wallpaper scraper.

This time there are 18 sources to scrape: cnblogs (Blog Park), V2EX, SegmentFault, Juejin, WeChat hot articles, Douban Select, IT Home, 36Kr, Baidu Tieba, Baidu hot search, Weibo hot search, the Zhihu hot list, Zhihu Daily, NetEase News, GitHub Trending, Douyin hot search, Douyin videos, and Douyin positive energy.

Again the data goes into the database, so create the entity class and custom repository step by step. The entity is named HotNews. Here is the code:

//HotNews.cs
using System;
using Volo.Abp.Domain.Entities;

namespace Meowv.Blog.Domain.HotNews
{
    public class HotNews : Entity<Guid>
    {
        /// <summary>
        /// Title
        /// </summary>
        public string Title { get; set; }

        /// <summary>
        /// Link
        /// </summary>
        public string Url { get; set; }

        /// <summary>
        /// Source ID
        /// </summary>
        public int SourceId { get; set; }

        /// <summary>
        /// Creation time
        /// </summary>
        public DateTime CreateTime { get; set; }
    }
}

Complete the remaining steps yourself; in the end the database contains an empty table, meowv_hotnews.
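For reference, the custom repository mirrors the wallpaper one from the previous post. A minimal sketch, assuming the conventions used earlier in the series (the BulkInsertAsync signature is inferred from how the repository is called later in this post):

```csharp
//IHotNewsRepository.cs (sketch, assumed shape)
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Volo.Abp.Domain.Repositories;

namespace Meowv.Blog.Domain.HotNews.Repositories
{
    public interface IHotNewsRepository : IRepository<HotNews, Guid>
    {
        /// <summary>
        /// Insert a batch of hot news rows in one round trip.
        /// </summary>
        Task BulkInsertAsync(IEnumerable<HotNews> hotNews);
    }
}
```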

Next, put the platforms into an enumeration class, HotNewsEnum.cs.

//HotNewsEnum.cs
using System.ComponentModel;

namespace Meowv.Blog.Domain.Shared.Enum
{
    public enum HotNewsEnum
    {
        [Description("Blog Park")]
        cnblogs = 1,

        [Description("V2EX")]
        v2ex = 2,

        [Description("SegmentFault")]
        segmentfault = 3,

        [Description("Juejin")]
        juejin = 4,

        [Description("WeChat Hot")]
        weixin = 5,

        [Description("Douban Select")]
        douban = 6,

        [Description("IT Home")]
        ithome = 7,

        [Description("36Kr")]
        kr36 = 8,

        [Description("Baidu Tieba")]
        tieba = 9,

        [Description("Baidu Hot Search")]
        baidu = 10,

        [Description("Weibo Hot Search")]
        weibo = 11,

        [Description("Zhihu Hot List")]
        zhihu = 12,

        [Description("Zhihu Daily")]
        zhihudaily = 13,

        [Description("NetEase News")]
        news163 = 14,

        [Description("GitHub")]
        github = 15,

        [Description("Douyin Hot")]
        douyin_hot = 16,

        [Description("Douyin Video")]
        douyin_video = 17,

        [Description("Douyin Positive Energy")]
        douyin_positive = 18
    }
}
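The Description strings are what consumers of the API eventually see. As an illustrative aside (not code from this project), they can be read back off an enum member via reflection; the sketch below uses a two-member sample enum mirroring HotNewsEnum:

```csharp
// Illustrative sketch: resolving [Description] text via reflection.
using System;
using System.ComponentModel;
using System.Reflection;

public enum SampleSource
{
    [Description("Blog Park")]
    cnblogs = 1,

    [Description("GitHub")]
    github = 15
}

public static class EnumExtensions
{
    public static string GetDescription(this Enum value)
    {
        // Find the enum member's field, then its [Description] attribute.
        var field = value.GetType().GetField(value.ToString());
        var attr = field?.GetCustomAttribute<DescriptionAttribute>();
        return attr?.Description ?? value.ToString();
    }
}

// Usage: SampleSource.github.GetDescription() returns "GitHub".
```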

As with the wallpaper scraper, start with some preparatory work.

Add HotNewsJobItem to the .Application.Contracts layer, and add HotNewsJob to the .BackgroundJobs layer to host the crawler logic, injecting the repository IHotNewsRepository through the constructor.

//HotNewsJobItem.cs
using Meowv.Blog.Domain.Shared.Enum;

namespace Meowv.Blog.Application.Contracts.HotNews
{
    public class HotNewsJobItem<T>
    {
        /// <summary>
        /// <see cref="Result"/>
        /// </summary>
        public T Result { get; set; }

        /// <summary>
        /// Source
        /// </summary>
        public HotNewsEnum Source { get; set; }
    }
}
//HotNewsJob.cs
using Meowv.Blog.Domain.HotNews.Repositories;
using System;
using System.Net.Http;
using System.Threading.Tasks;

namespace Meowv.Blog.BackgroundJobs.Jobs.HotNews
{
    public class HotNewsJob : IBackgroundJob
    {
        private readonly IHttpClientFactory _httpClient;
        private readonly IHotNewsRepository _hotNewsRepository;

        public HotNewsJob(IHttpClientFactory httpClient, IHotNewsRepository hotNewsRepository)
        {
            _httpClient = httpClient;
            _hotNewsRepository = hotNewsRepository;
        }

        public async Task ExecuteAsync()
        {
            throw new NotImplementedException();
        }
    }
}

Next, specify the address of each data source. Some of the sources return HTML while others return JSON directly. IHttpClientFactory is injected here for convenient HTTP calls.

The assembled list of data sources to scrape looks like this.

...
var hotnewsUrls = new List<HotNewsJobItem<string>>
{
    new HotNewsJobItem<string> { Result = "https://www.cnblogs.com", Source = HotNewsEnum.cnblogs },
    new HotNewsJobItem<string> { Result = "https://www.v2ex.com/?tab=hot", Source = HotNewsEnum.v2ex },
    new HotNewsJobItem<string> { Result = "https://segmentfault.com/hottest", Source = HotNewsEnum.segmentfault },
    new HotNewsJobItem<string> { Result = "https://web-api.juejin.im/query", Source = HotNewsEnum.juejin },
    new HotNewsJobItem<string> { Result = "https://weixin.sogou.com", Source = HotNewsEnum.weixin },
    new HotNewsJobItem<string> { Result = "https://www.douban.com/group/explore", Source = HotNewsEnum.douban },
    new HotNewsJobItem<string> { Result = "https://www.ithome.com", Source = HotNewsEnum.ithome },
    new HotNewsJobItem<string> { Result = "https://36kr.com/newsflashes", Source = HotNewsEnum.kr36 },
    new HotNewsJobItem<string> { Result = "http://tieba.baidu.com/hottopic/browse/topicList", Source = HotNewsEnum.tieba },
    new HotNewsJobItem<string> { Result = "http://top.baidu.com/buzz?b=341", Source = HotNewsEnum.baidu },
    new HotNewsJobItem<string> { Result = "https://s.weibo.com/top/summary/summary", Source = HotNewsEnum.weibo },
    new HotNewsJobItem<string> { Result = "https://www.zhihu.com/api/v3/feed/topstory/hot-lists/total?limit=50&desktop=true", Source = HotNewsEnum.zhihu },
    new HotNewsJobItem<string> { Result = "https://daily.zhihu.com", Source = HotNewsEnum.zhihudaily },
    new HotNewsJobItem<string> { Result = "http://news.163.com/special/0001386F/rank_whole.html", Source = HotNewsEnum.news163 },
    new HotNewsJobItem<string> { Result = "https://github.com/trending", Source = HotNewsEnum.github },
    new HotNewsJobItem<string> { Result = "https://www.iesdouyin.com/web/api/v2/hotsearch/billboard/word", Source = HotNewsEnum.douyin_hot },
    new HotNewsJobItem<string> { Result = "https://www.iesdouyin.com/web/api/v2/hotsearch/billboard/aweme", Source = HotNewsEnum.douyin_video },
    new HotNewsJobItem<string> { Result = "https://www.iesdouyin.com/web/api/v2/hotsearch/billboard/aweme/?type=positive", Source = HotNewsEnum.douyin_positive },
};
...

A few of them are special: Juejin, Baidu hot search, and NetEase News.

Juejin requires a POST request that returns JSON, with specific request headers and a request body; the HttpClient object is created with IHttpClientFactory.

Baidu hot search and NetEase News play a different game: their pages are encoded in GB2312, so the encoding must be specified explicitly or the scraped data comes back garbled.
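The key detail: .NET Core does not ship legacy code pages by default, so GB2312 must be registered through the System.Text.Encoding.CodePages package before Encoding.GetEncoding can resolve it. A minimal standalone sketch of just that step:

```csharp
// Requires the System.Text.Encoding.CodePages NuGet package on .NET Core.
using System;
using System.Text;

class Gb2312Demo
{
    static void Main()
    {
        // Without this call, Encoding.GetEncoding("GB2312") throws on .NET Core.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var gb2312 = Encoding.GetEncoding("GB2312");
        Console.WriteLine(gb2312.WebName); // the legacy code page is now resolvable
    }
}
```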

...
var web = new HtmlWeb();
var list_task = new List<Task<HotNewsJobItem<object>>>();

hotnewsUrls.ForEach(item =>
{
    var task = Task.Run(async () =>
    {
        var obj = new object();
        if (item.Source == HotNewsEnum.juejin)
        {
            using var client = _httpClient.CreateClient();
            client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.14 Safari/537.36 Edg/83.0.478.13");
            client.DefaultRequestHeaders.Add("X-Agent", "Juejin/Web");
            var data = "{\"extensions\":{\"query\":{\"id\":\"21207e9ddb1de777adeaca7a2fb38030\"}},\"operationName\":\"\",\"query\":\"\",\"variables\":{\"first\":20,\"after\":\"\",\"order\":\"THREE_DAYS_HOTTEST\"}}";
            var buffer = data.SerializeUtf8();
            var byteContent = new ByteArrayContent(buffer);
            byteContent.Headers.ContentType = new MediaTypeHeaderValue("application/json");

            var httpResponse = await client.PostAsync(item.Result, byteContent);
            obj = await httpResponse.Content.ReadAsStringAsync();
        }
        else
        {
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            obj = await web.LoadFromWebAsync(item.Result, (item.Source == HotNewsEnum.baidu || item.Source == HotNewsEnum.news163) ? Encoding.GetEncoding("GB2312") : Encoding.UTF8);
        }

        return new HotNewsJobItem<object>
        {
            Result = obj,
            Source = item.Source
        };
    });
    list_task.Add(task);
});
Task.WaitAll(list_task.ToArray());
...

Looping over hotnewsUrls, note that HotNewsJobItem now carries object as its Result: some sources yield a JSON string while others yield an HtmlDocument, so object lets us receive both uniformly.

Juejin is handled separately: an HttpClient sends the POST request, and the response comes back as a JSON string.
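SerializeUtf8() is not a BCL method; it is a small string extension carried over from earlier posts in the series. A minimal sketch of what it needs to do:

```csharp
using System.Text;

public static class StringExtensions
{
    /// <summary>
    /// Serialize a string to its UTF-8 byte representation,
    /// ready to wrap in a ByteArrayContent for an HTTP POST body.
    /// </summary>
    public static byte[] SerializeUtf8(this string str)
    {
        return str == null ? null : Encoding.UTF8.GetBytes(str);
    }
}
```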

For Baidu hot search and NetEase News, Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) registers the code-page provider, and a ternary expression inside web.LoadFromWebAsync(...) picks the page encoding when loading the page data.

With that done, loop through list_task and extract the data, using XPath syntax for the HTML sources and JSON parsing for the rest.

...
var hotNews = new List<HotNews>();
foreach (var list in list_task)
{
    var item = await list;
    var sourceId = (int)item.Source;

    // ... per-source parsing, shown below ...
}

if (hotNews.Any())
{
    await _hotNewsRepository.DeleteAsync(x => true);
    await _hotNewsRepository.BulkInsertAsync(hotNews);
}
...

The scraping itself is simple: only the title and link are needed, so the main task is locating the list of a tags on each page. There is no need to analyze every source one by one; here is the code.
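Every HTML source follows the same pattern: select the a nodes with an XPath expression, then read InnerText for the title and the href attribute for the link. A self-contained sketch of that pattern against an inline HTML fragment:

```csharp
using System;
using HtmlAgilityPack; // NuGet: HtmlAgilityPack

class XPathDemo
{
    static void Main()
    {
        var html = "<div class='post_item_body'><h3><a href='/p/1'>Hello</a></h3></div>";
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Same XPath shape as the cnblogs source: filter by class, then h3/a.
        var node = doc.DocumentNode.SelectSingleNode("//div[@class='post_item_body']/h3/a");
        Console.WriteLine(node.InnerText);                     // Hello
        Console.WriteLine(node.GetAttributeValue("href", "")); // /p/1
    }
}
```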

// cnblogs
if (item.Source == HotNewsEnum.cnblogs)
{
    var nodes = ((HtmlDocument)item.Result).DocumentNode.SelectNodes("//div[@class='post_item_body']/h3/a").ToList();
    nodes.ForEach(x =>
    {
        hotNews.Add(new HotNews
        {
            Title = x.InnerText,
            Url = x.GetAttributeValue("href", ""),
            SourceId = sourceId,
            CreateTime = DateTime.Now
        });
    });
}

// V2EX
if (item.Source == HotNewsEnum.v2ex)
{
    var nodes = ((HtmlDocument)item.Result).DocumentNode.SelectNodes("//span[@class='item_title']/a").ToList();
    nodes.ForEach(x =>
    {
        hotNews.Add(new HotNews
        {
            Title = x.InnerText,
            Url = $"https://www.v2ex.com{x.GetAttributeValue("href", "")}",
            SourceId = sourceId,
            CreateTime = DateTime.Now
        });
    });
}

// SegmentFault
if (item.Source == HotNewsEnum.segmentfault)
{
    var nodes = ((HtmlDocument)item.Result).DocumentNode.SelectNodes("//div[@class='news__item-info clearfix']/a").Where(x => x.InnerText.IsNotNullOrEmpty()).ToList();
    nodes.ForEach(x =>
    {
        hotNews.Add(new HotNews
        {
            Title = x.SelectSingleNode(".//h4").InnerText,
            Url = $"https://segmentfault.com{x.GetAttributeValue("href", "")}",
            SourceId = sourceId,
            CreateTime = DateTime.Now
        });
    });
}

// Juejin
if (item.Source == HotNewsEnum.juejin)
{
    var obj = JObject.Parse((string)item.Result);
    var nodes = obj["data"]["articleFeed"]["items"]["edges"];
    foreach (var node in nodes)
    {
        hotNews.Add(new HotNews
        {
            Title = node["node"]["title"].ToString(),
            Url = node["node"]["originalUrl"].ToString(),
            SourceId = sourceId,
            CreateTime = DateTime.Now
        });
    }
}

// WeChat hot articles
if (item.Source == HotNewsEnum.weixin)
{
    var nodes = ((HtmlDocument)item.Result).DocumentNode.SelectNodes("//ul[@class='news-list']/li/div[@class='txt-box']/h3/a").ToList();
    nodes.ForEach(x =>
    {
        hotNews.Add(new HotNews
        {
            Title = x.InnerText,
            Url = x.GetAttributeValue("href", ""),
            SourceId = sourceId,
            CreateTime = DateTime.Now
        });
    });
}

// Douban Select
if (item.Source == HotNewsEnum.douban)
{
    var nodes = ((HtmlDocument)item.Result).DocumentNode.SelectNodes("//div[@class='channel-item']/div[@class='bd']/h3/a").ToList();
    nodes.ForEach(x =>
    {
        hotNews.Add(new HotNews
        {
            Title = x.InnerText,
            Url = x.GetAttributeValue("href", ""),
            SourceId = sourceId,
            CreateTime = DateTime.Now
        });
    });
}

// IT Home
if (item.Source == HotNewsEnum.ithome)
{
    var nodes = ((HtmlDocument)item.Result).DocumentNode.SelectNodes("//div[@class='lst lst-2 hot-list']/div[1]/ul/li/a").ToList();
    nodes.ForEach(x =>
    {
        hotNews.Add(new HotNews
        {
            Title = x.InnerText,
            Url = x.GetAttributeValue("href", ""),
            SourceId = sourceId,
            CreateTime = DateTime.Now
        });
    });
}

// 36Kr
if (item.Source == HotNewsEnum.kr36)
{
    var nodes = ((HtmlDocument)item.Result).DocumentNode.SelectNodes("//div[@class='hotlist-main']/div[@class='hotlist-item-toptwo']/a[2]|//div[@class='hotlist-main']/div[@class='hotlist-item-other clearfloat']/div[@class='hotlist-item-other-info']/a").ToList();
    nodes.ForEach(x =>
    {
        hotNews.Add(new HotNews
        {
            Title = x.InnerText,
            Url = $"https://36kr.com{x.GetAttributeValue("href", "")}",
            SourceId = sourceId,
            CreateTime = DateTime.Now
        });
    });
}

// Baidu Tieba
if (item.Source == HotNewsEnum.tieba)
{
    var obj = JObject.Parse(((HtmlDocument)item.Result).ParsedText);
    var nodes = obj["data"]["bang_topic"]["topic_list"];
    foreach (var node in nodes)
    {
        hotNews.Add(new HotNews
        {
            Title = node["topic_name"].ToString(),
            Url = node["topic_url"].ToString().Replace("amp;", ""),
            SourceId = sourceId,
            CreateTime = DateTime.Now
        });
    }
}

// Baidu hot search
if (item.Source == HotNewsEnum.baidu)
{
    var nodes = ((HtmlDocument)item.Result).DocumentNode.SelectNodes("//table[@class='list-table']//tr/td[@class='keyword']/a[@class='list-title']").ToList();
    nodes.ForEach(x =>
    {
        hotNews.Add(new HotNews
        {
            Title = x.InnerText,
            Url = x.GetAttributeValue("href", ""),
            SourceId = sourceId,
            CreateTime = DateTime.Now
        });
    });
}

// Weibo hot search
if (item.Source == HotNewsEnum.weibo)
{
    var nodes = ((HtmlDocument)item.Result).DocumentNode.SelectNodes("//table/tbody/tr/td[2]/a").ToList();
    nodes.ForEach(x =>
    {
        hotNews.Add(new HotNews
        {
            Title = x.InnerText,
            Url = $"https://s.weibo.com{x.GetAttributeValue("href", "").Replace("#", "%23")}",
            SourceId = sourceId,
            CreateTime = DateTime.Now
        });
    });
}

// Zhihu hot list
if (item.Source == HotNewsEnum.zhihu)
{
    var obj = JObject.Parse(((HtmlDocument)item.Result).ParsedText);
    var nodes = obj["data"];
    foreach (var node in nodes)
    {
        hotNews.Add(new HotNews
        {
            Title = node["target"]["title"].ToString(),
            Url = $"https://www.zhihu.com/question/{node["target"]["id"]}",
            SourceId = sourceId,
            CreateTime = DateTime.Now
        });
    }
}

// Zhihu Daily
if (item.Source == HotNewsEnum.zhihudaily)
{
    var nodes = ((HtmlDocument)item.Result).DocumentNode.SelectNodes("//div[@class='box']/a").ToList();
    nodes.ForEach(x =>
    {
        hotNews.Add(new HotNews
        {
            Title = x.InnerText,
            Url = $"https://daily.zhihu.com{x.GetAttributeValue("href", "")}",
            SourceId = sourceId,
            CreateTime = DateTime.Now
        });
    });
}

// NetEase News
if (item.Source == HotNewsEnum.news163)
{
    var nodes = ((HtmlDocument)item.Result).DocumentNode.SelectNodes("//div[@class='area-half left']/div[@class='tabBox']/div[@class='tabContents active']/table//tr/td[1]/a").ToList();
    nodes.ForEach(x =>
    {
        hotNews.Add(new HotNews
        {
            Title = x.InnerText,
            Url = x.GetAttributeValue("href", ""),
            SourceId = sourceId,
            CreateTime = DateTime.Now
        });
    });
}

// GitHub
if (item.Source == HotNewsEnum.github)
{
    var nodes = ((HtmlDocument)item.Result).DocumentNode.SelectNodes("//article[@class='Box-row']/h1/a").ToList();
    nodes.ForEach(x =>
    {
        hotNews.Add(new HotNews
        {
            Title = x.InnerText.Trim().Replace("\n", "").Replace(" ", ""),
            Url = $"https://github.com{x.GetAttributeValue("href", "")}",
            SourceId = sourceId,
            CreateTime = DateTime.Now
        });
    });
}

// Douyin hot search
if (item.Source == HotNewsEnum.douyin_hot)
{
    var obj = JObject.Parse(((HtmlDocument)item.Result).ParsedText);
    var nodes = obj["word_list"];
    foreach (var node in nodes)
    {
        hotNews.Add(new HotNews
        {
            Title = node["word"].ToString(),
            Url = $"#{node["hot_value"]}",
            SourceId = sourceId,
            CreateTime = DateTime.Now
        });
    }
}

// Douyin videos & Douyin positive energy
if (item.Source == HotNewsEnum.douyin_video || item.Source == HotNewsEnum.douyin_positive)
{
    var obj = JObject.Parse(((HtmlDocument)item.Result).ParsedText);
    var nodes = obj["aweme_list"];
    foreach (var node in nodes)
    {
        hotNews.Add(new HotNews
        {
            Title = node["aweme_info"]["desc"].ToString(),
            Url = node["aweme_info"]["share_url"].ToString(),
            SourceId = sourceId,
            CreateTime = DateTime.Now
        });
    }
}


Here item.Result is cast to the appropriate type for each source. Once all the data has been collected, the existing records are deleted before the bulk insert.

Then add a new extension method, UseHotNewsJob(), and call it in the module class.

//MeowvBlogBackgroundJobsExtensions.cs
...
        /// <summary>
        /// Daily capture of hot news data
        /// </summary>
        /// <param name="service"></param>
        public static void UseHotNewsJob(this IServiceProvider service)
        {
            var job = service.GetService<HotNewsJob>();

            RecurringJob.AddOrUpdate("Daily hot data capture", () => job.ExecuteAsync(), CronType.Hour(1, 2));
        }
...

This schedules the task to run every 2 hours.
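CronType.Hour(1, 2) is a helper from the earlier timed-task posts in this series. With plain Hangfire, the same every-2-hours schedule can be written as a standard cron expression; a sketch, assuming Hangfire is configured as in those posts and using an illustrative job name:

```csharp
using Hangfire;
using Meowv.Blog.BackgroundJobs.Jobs.HotNews;

public static class ScheduleSketch
{
    // Equivalent schedule expressed as a standard 5-field cron string
    // instead of the series' CronType.Hour helper.
    public static void Register(HotNewsJob job)
    {
        RecurringJob.AddOrUpdate("Daily hot data capture",
            () => job.ExecuteAsync(),
            "0 */2 * * *"); // at minute 0 of every 2nd hour
    }
}
```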

...
public override void OnApplicationInitialization(ApplicationInitializationContext context)
{
    ...
    var service = context.ServiceProvider;
    ...
    service.UseHotNewsJob();
}
...

Compile and run; our scheduled task now appears under Hangfire's recurring jobs.

It will not run until its scheduled time arrives, so trigger it manually and wait a moment to see the result.

The data has been scraped successfully, and Hangfire will keep executing the job on the given schedule. 😁😁😁

Open source: github.com/Meowv/Blog/…