Machine learning has accumulated a huge body of papers by now, and with the rise of deep learning many new ones appear every year. If you want to study a particular field, you not only need to read its classic papers but also keep a close eye on the latest academic progress. Take GANs, which have been very popular over the last two years: you need to know the original work, "Generative Adversarial Nets," and also follow the latest research in the field.
Besides a direct Google search, the usual place to look for papers is the arXiv website, which lists the latest submissions in each category, such as computer vision and pattern recognition.
However, arXiv's listing only shows the title, authors, publication date, and similar metadata. If you recognize an author, some well-known researcher say, the paper is obviously worth a look; if not, you can only judge from the title, click through to the abstract, and then decide from the abstract whether the paper deserves downloading the PDF for intensive or extensive reading.
Ideally, the abstract would be shown directly on the listing page, saving that extra step.
So today I recommend a site, built on the API provided by arXiv, that makes machine learning papers easier to browse:
Website: www.arxiv-sanity.com/
The site currently indexes approximately 62,820 papers from the past few years, all on machine learning, and offers several tabs:
most recent
Shows the latest papers. For each paper it displays the title, authors, publication date, a figure from the paper, and the abstract. From there you can download the PDF, search for similar papers, or open the discussion area.
As for the discussion area, though, there do not seem to be many users leaving comments: the papers listed here typically have none, and you have to click the discussions tab to find papers with comments at all, usually just a single one each.
top recent
Ranks papers by how many logged-in users have saved them to their libraries. You can choose the time range to display: the last day, three days, a week, a month, a year, or all.
top hype
Shows papers that have been mentioned on Twitter. You can check which users mentioned them and the contents of the tweets, though these appear to be mostly direct retweets of arXiv's official tweets.
The remaining tabs, except discussions, require logging in. friends shows the papers your friends have saved, and recommended suggests papers based on what you have stored in your library.
Building the project
The code for this site is open source on Github:
Github.com/karpathy/ar…
The code file that fetches papers through the arXiv API is fetch_papers.py, where you can change the category of papers to fetch, beyond just machine learning. For the arXiv API itself, see the documentation at:
Arxiv.org/help/api/us…
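As a quick illustration (not part of the project's code), here is a minimal sketch of querying the arXiv API with the feedparser library; the category cs.CV and the query parameters are just example values:

```python
import feedparser

# Fetch the 10 most recent papers in an example category (cs.CV).
# Any other arXiv category string can be substituted here.
url = (
    "http://export.arxiv.org/api/query?"
    "search_query=cat:cs.CV&start=0&max_results=10"
    "&sortBy=submittedDate&sortOrder=descending"
)

feed = feedparser.parse(url)
for entry in feed.entries:
    print(entry.published, "|", entry.title)
```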
The code structure
According to the author, the code is divided into two parts:
The query code
Downloads the latest papers in the specified categories through the arXiv API, extracts the text of each paper, and creates TF-IDF vectors from it. This part covers the backend crawling and computation:
- Build a database of arxiv papers
- Compute content vector
- Generate thumbnails
- Calculate SVMs for users
- etc.
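To make the "compute content vector" step concrete, here is a rough sketch of building TF-IDF vectors with scikit-learn; it is a minimal stand-in, not the project's actual analyze code, and the sample abstracts are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up abstracts standing in for the extracted paper text.
docs = [
    "generative adversarial networks for image synthesis",
    "support vector machines for text classification",
]

# Unigram + bigram TF-IDF features, loosely mirroring the analyze step.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: documents x features
print(tfidf.shape)
```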
The user interface
This part is a web server (based on Flask/Tornado/SQLite) that queries papers from the database and ranks them for each user by similarity, etc.
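To make the division of labor concrete, here is a minimal Flask-plus-SQLite sketch in the same spirit; the table and column names are assumptions for illustration, not the project's actual schema:

```python
from flask import Flask, jsonify
import sqlite3

app = Flask(__name__)

@app.route("/papers")
def papers():
    # Illustrative only: the 'papers' table and its columns are assumed,
    # not taken from the project's real schema.
    con = sqlite3.connect("as.db")
    rows = con.execute(
        "SELECT title, authors FROM papers ORDER BY updated DESC LIMIT 20"
    ).fetchall()
    con.close()
    return jsonify(rows)

if __name__ == "__main__":
    app.run(port=5000)
```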
Dependent libraries
The required dependency libraries include:
- numpy
- feedparser: parses the XML feeds returned by the API
- scikit-learn: handles the TF-IDF vectors and implements the SVM algorithm
- Flask: serves the results
- flask_limiter
- tornado
- dateutil
- scipy
- sqlite3
The dependencies above can be installed with the following commands:

```bash
$ virtualenv env              # optional: use virtualenv
$ source env/bin/activate     # optional: use virtualenv
$ pip install -r requirements.txt
```
In addition, you also need ImageMagick and pdftotext, which can be installed on Ubuntu with:

```bash
sudo apt-get install imagemagick poppler-utils
```

Note that this command will pull in further dependencies of its own.
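For context, pdftotext is a command-line tool, and the text-extraction step essentially shells out to it once per paper; a minimal Python sketch (with purely illustrative filenames) might look like this:

```python
import subprocess

# Convert one downloaded PDF to plain text using pdftotext (poppler-utils).
# The input and output paths here are made-up examples.
subprocess.run(
    ["pdftotext", "pdf/example_paper.pdf", "txt/example_paper.txt"],
    check=True,
)
```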
Running the project
Running the whole project means executing several script files in turn, and it is a good idea to read each script closely first: they contain a number of settings you may want to change. Execute the scripts in the following order:

- fetch_papers.py: queries the arXiv API and creates the file db.p, which contains all the information for each paper. You can modify the query here, for example to fetch categories other than machine learning, such as databases. Note that querying too many papers at once is rate-limited by arXiv, so it is best to run the code in batches and use the --start-index parameter to set the starting position of each rerun.
- download_pdfs.py: downloads the papers and saves them to the pdf folder.
- parse_pdf_to_text.py: extracts all the text from the PDFs and saves it to the txt folder.
- thumb_pdf.py: generates thumbnail images of the PDFs and saves them to the thumb folder.
- analyze.py: computes the TF-IDF vectors of all documents based on bigrams and generates the files tfidf.p, tfidf_meta.p, and sim_dict.p.
- buildsvm.py: trains an SVM for every user and outputs the file user_sim.p (a toy sketch of this idea follows the list).
- make_cache.py: mainly preprocessing to speed up server startup. If this is the first run, make sure to execute sqlite3 as.db < schema.sql first to initialize an empty database.
- Start a mongodb daemon in the background. MongoDB can be installed following this tutorial: Docs.mongodb.com/tutorials/i… Start the service with the sudo service mongod start command, and make sure it is running in the background: the last line of the file /var/log/mongodb/mongod.log should read [initandlisten] waiting for connections on port <port>.
- Run serve.py to start the flask server, then visit localhost:5000 to see the final result!
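As promised above, here is a toy sketch of the per-user SVM idea: treat the papers a user has saved as positives and everything else as negatives over the TF-IDF features, then rank papers by the classifier's score. The shapes and data are made up; this is not the project's buildsvm.py code:

```python
import numpy as np
from sklearn import svm

# Made-up data: TF-IDF vectors for 100 papers with 20 features each.
X = np.random.rand(100, 20)
y = np.zeros(100)
y[:5] = 1  # pretend the user saved the first 5 papers

# One linear SVM per user; 'balanced' compensates for the few positives.
clf = svm.LinearSVC(class_weight="balanced")
clf.fit(X, y)

scores = clf.decision_function(X)  # higher = more similar to the library
print(np.argsort(-scores)[:10])    # indices of the top-10 recommendations
```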
Alternatively, you can run twitter_daemon.py in a screen session; it uses your Twitter API credentials (saved in the file twitter.txt) to find papers from the database that are mentioned on Twitter and saves the results to the file twitter.p.
The author wrote a simple command-line script that executes the above code in turn and runs it daily. It fetches the new papers, adds them to the existing database, and then recomputes all the TF-IDF vectors and classifiers.
Note: analyze.py does a lot of its computation with numpy, so it is recommended to install a BLAS library (such as OpenBLAS) to speed things up; with that, the computation takes only a few hours for roughly 25,000 papers and 5,000+ users.
Running online
If you want to run the Flask server online, for example on AWS, run the command python serve.py --prod.
You also need to create a secret file, secret_key.txt, and add random text to it (see the serve.py code for details).
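The serve.py code describes exactly what it expects; one simple way to generate such a file (an assumption on my part, not the project's documented method) is:

```python
import secrets

# Write 64 random hex characters to secret_key.txt.
# Whether serve.py wants this exact format is an assumption; check its code.
with open("secret_key.txt", "w") as f:
    f.write(secrets.token_hex(32))
```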
Current workflow
At present the site is not fully automated, so part of the code has to be run manually every day to fetch the latest papers. The author's script, mentioned above, contains:
```bash
python fetch_papers.py
python download_pdfs.py
python parse_pdf_to_text.py
python thumb_pdf.py
python analyze.py
python buildsvm.py
python make_cache.py
```
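If you prefer Python to a shell script, a minimal stand-in for that daily script could run the stages in order and stop at the first failure:

```python
import subprocess

# Run each pipeline stage in order; check=True aborts on the first error.
steps = [
    "fetch_papers.py", "download_pdfs.py", "parse_pdf_to_text.py",
    "thumb_pdf.py", "analyze.py", "buildsvm.py", "make_cache.py",
]
for script in steps:
    subprocess.run(["python", script], check=True)
```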
The service is then run in a screen session: execute screen -S serve to create the session (or use the -r parameter to reconnect), then run:

```bash
python serve.py --prod --port 80
```
The server will load the new files and display them on the site. However, some systems require sudo to use port 80. There are two workarounds: use iptables to redirect the port, or use setcap to elevate the permissions of your Python interpreter:
Stackoverflow.com/questions/4…
However, the setcap approach should be used with caution, and is best applied to the Python of a virtualenv rather than to the system interpreter.
Summary
Finally, here are the site and project addresses once more:
www.arxiv-sanity.com/
Github.com/karpathy/ar…