1024 is a good site
First of all, this hands-on series assumes that you can already reach the 1024 website. I won't give out the address here; for the record, this article is only about analyzing a problem with a computer-science attitude and method, and has no affiliation with the 1024 website.
On the 1024 site, I don't know whether you're like me: I usually love hanging around the technical discussion boards and reading the daily information round-up posts. Have you ever been annoyed that the posts in a board are sorted by reply time, so you can never find the posts you actually want to read? Do you find yourself flipping through the board from the very beginning, over and over, just to check which posts you haven't seen today?
Don't worry, I've had all of these problems! The community has a lot of users and posts scroll by quickly, so to catch every post published each day the board layout forces me to start from the top every single time and work out which posts I haven't read yet, scanning titles on the left and posting times on the right. It's exhausting. I don't like it. It's a waste of time.
As a programmer, I feel these are all problems that can be solved by hand-writing a Python crawler.
I feel a small program like this is the most convenient, most direct way to solve a real problem. Learn something, then put it to use: making code genuinely convenient for my own life is exactly why I write programs.
## The problem we face now

Forum posts are sorted by the time of the latest reply, so to catch the new posts published each day we always have to read the whole board from the beginning. It's annoying and a waste of time.

## The way we want it to be

Posts are arranged in the order they were published, so it's easy to see what's new each day.
If we were to write a crawler to solve the problem, the general structure would be as follows:
It is made up of a few parts:
- config.json: the configuration file. The information that currently needs to be filled in: 1. the 1024 site URL; 2. the crawler's output file location; 3. the maximum number of pages the crawler should fetch; 4. the section information, i.e. each forum section's name (this can be customized) and its fid.
- Url_manager: manages the URLs waiting to be crawled.
- Html_downloader: downloads the web pages for the crawler.
- Html_parser: the crawler's web page parser.
- Html_outputer: outputs the crawler's results.
The structure above is simple, and so is the workflow: first fill in the local config.json file, then start the program. The crawler automatically grabs the first few pages of each section, filters the posts by posting time, and writes the result to a local HTML file. Open that file in a browser and you can see each post's id, title, and publish time; clicking a title jumps to the corresponding post in the community.
In this way, the rich content of the site is boiled down into the simplest possible HTML file sitting on our own machine.
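Put together, the whole flow might look roughly like the sketch below. The class names, method names, and page-URL format here are assumptions pieced together from the description above (and from the per-module sketches later in this post), not the project's actual code.

```python
# Rough sketch of the overall flow; names and URL format are assumptions,
# and the helpers used here are the per-module sketches shown further down.
def main():
    config = ConfigReader("config.json")        # base URL, sections, max pages, output path
    manager = UrlManager()

    # Queue up the first few pages of every configured section.
    for section in config.sections:             # e.g. {"name": "tech", "fid": 7}
        pages = [f"{config.base_url}/forum?fid={section['fid']}&page={n}"  # assumed URL format
                 for n in range(1, config.max_pages + 1)]
        manager.add_urls(section["name"], pages)

    # Download, parse, and write one HTML page per section.
    for section in config.sections:
        name = section["name"]
        items = []
        while manager.has_next(name):
            page_html = download(manager.get_next(name))   # requests + rotating User-Agent
            items.extend(parse(page_html, config.base_url))
        output_html(items, f"{config.output_path}/{name}.html")


if __name__ == "__main__":
    main()
```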
The home page of our local site after the crawl:
A section page looks like this:
This is much easier and more comfortable: no more hunting through posts one by one like before. Posts we have already read are shown in a different color, which saves a lot of time. Below is a brief run-through of the technical points used in the project.
## Sorting out the technology
There are plenty of mature crawler frameworks out there, such as Scrapy, and I've used Scrapy before. Scrapy is great, but it doesn't feel as much fun. So this time I built the crawler from scratch, to experience writing a crawler from zero and to feel the fun of Python.
### Overall technology

- Python 3.6
- requests
- BeautifulSoup4
- webbrowser
- json
### config.json

This is the configuration file; a few basic parameters need to be written into this JSON file. The class that reads it is ConfigReader in config_Utils.
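As a rough idea only, a reader for such a file might look like the sketch below; the field names and defaults are assumptions, not the project's actual schema.

```python
import json


class ConfigReader:
    """Minimal sketch of a config reader; the field names are assumptions, not the real schema."""

    # Hypothetical config.json:
    # {
    #   "base_url": "http://example-1024-site.invalid",
    #   "output_path": "./output",
    #   "max_pages": 3,
    #   "sections": [{"name": "tech", "fid": 7}]
    # }
    def __init__(self, path="config.json"):
        with open(path, encoding="utf-8") as f:
            self._data = json.load(f)

    @property
    def base_url(self):
        return self._data["base_url"]

    @property
    def output_path(self):
        return self._data.get("output_path", ".")

    @property
    def max_pages(self):
        return self._data.get("max_pages", 1)

    @property
    def sections(self):
        return self._data.get("sections", [])
```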
### Url_manager

A dict stores the section names and their corresponding section URLs, plus a few simple methods for manipulating those URLs.
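A minimal sketch of what that could look like, assuming a simple queue-per-section interface (the real url_manager may differ):

```python
class UrlManager:
    """Minimal sketch: section name -> list of page URLs still waiting to be crawled."""

    def __init__(self):
        self.urls = {}

    def add_urls(self, section, page_urls):
        # Append new page URLs to the section's queue, creating it if needed.
        self.urls.setdefault(section, []).extend(page_urls)

    def has_next(self, section):
        return bool(self.urls.get(section))

    def get_next(self, section):
        # Hand out the oldest queued URL for this section.
        return self.urls[section].pop(0)
```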
### Html_downloader

Uses the requests module to fetch the web pages, providing the raw page data for the parsing step that follows. Because the 1024 site has anti-crawling measures, I attach a different HTTP header to each request; so far this works reasonably well. The header information lives in the user_agents file.
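A minimal sketch of that download step, with a rotating User-Agent; the placeholder strings and the timeout are assumptions, and the real list lives in the user_agents file.

```python
import random

import requests

# Placeholder User-Agent strings; the real ones come from the user_agents file.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6)",
]


def download(url):
    """Fetch one page with a randomly chosen User-Agent header."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    return resp.text
```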
### Html_parser

Uses BeautifulSoup to parse the HTML. Each post has a unique id; each post is wrapped into a CaoliuItem, and the result is handed to Html_outputer. The parsing relies on HTML tags rather than regular expressions, which may be a little rigid.
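A minimal sketch of that parsing step. The CaoliuItem fields and the tag/class selectors below are assumptions for illustration; the real ones depend on the site's actual markup.

```python
from collections import namedtuple

from bs4 import BeautifulSoup

# Assumed item shape: the post's unique id, its title, its link, and its posting time.
CaoliuItem = namedtuple("CaoliuItem", ["post_id", "title", "url", "pub_time"])


def parse(page_html, base_url):
    soup = BeautifulSoup(page_html, "html.parser")
    items = []
    for row in soup.find_all("tr", class_="post-row"):      # assumed row marker
        link = row.find("a", href=True)
        time_cell = row.find("td", class_="post-date")      # assumed timestamp cell
        if link is None or time_cell is None:
            continue
        href = link["href"]
        items.append(CaoliuItem(
            post_id=href.split("/")[-1].split(".")[0],      # unique id taken from the link
            title=link.get_text(strip=True),
            url=base_url + "/" + href.lstrip("/"),
            pub_time=time_cell.get_text(strip=True),
        ))
    return items
```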
### Html_outputer

This class collects the crawler's parsed results into HTML files. The end result is an index page, with each section having its own page, all linked to one another. Clicking around is smooth and extremely convenient.
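As a rough sketch of the output step (the markup is an assumption, not the project's actual template), taking the CaoliuItem objects from the parser sketch above and popping the result open with the webbrowser module:

```python
import os
import webbrowser


def output_html(items, path="index.html"):
    """Write parsed items into a simple HTML table and open it in the local browser."""
    rows = "\n".join(
        f'<tr><td>{item.post_id}</td>'
        f'<td><a href="{item.url}">{item.title}</a></td>'
        f'<td>{item.pub_time}</td></tr>'
        for item in items
    )
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"<html><body><table>\n{rows}\n</table></body></html>")
    webbrowser.open("file://" + os.path.abspath(path))
```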
## Areas for improvement (TODO)
- Although the overall structure is clear, it still needs optimization. I want it to end up like Scrapy; with a framework that powerful as the benchmark, it has to be done one step at a time.
- At present the crawling ability is relatively weak and multithreading is not used. Multithreading will be added in the next version to improve both speed and quality (see the sketch after this list).
- The parser still depends too heavily on the site's layout: if the layout changes, the parser has to change with it. This problem is common to all crawlers; I'm still thinking about how to make this part more flexible and less rigid.
- The output HTML file is not pretty enough.
- In the next version, whatever is parsed out should also be hooked up with MongoDB and kept as a local copy, so that earlier post information can still be looked up.
- A further step would be to go one level deeper into each post and automatically download the images or seed files. I've written a crawler that downloads images and seeds with Scrapy before; I still need to combine it with the code here.
- It would be nice to extend the crawler to other sites, such as Twitter, V2ex, etc. Bouncing between these sites, opening this one and then that one, sometimes feels like a waste of time; it would be better to aggregate their daily updates into one page and read everything in one sitting. That would be really cool.
- The final version would turn this program into a background service deployed on a server, so that one visit a day shows the day's updates from every site: an "access one, access all" effect.
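For the multithreading item above, one possible shape (an assumption about a future change, not the project's current code) is to fan out the download step with concurrent.futures, reusing the download() sketch from the Html_downloader section:

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, max_workers=4):
    """Fetch a batch of page URLs concurrently and return their HTML in the original order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download, urls))
```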
The source code of this project is available; a STAR is welcome.
Finally, a little bonus: follow the public account "pique pa shovel excrement officer" and reply "1024" to find what you need. Follow and reply 1024 for a surprise!