Machine learning has accumulated a huge body of papers by now, and with the rise of deep learning many new ones appear every year. If you want to study a particular field, you not only need to read its classic papers but also keep a close eye on the latest academic progress. Take GANs, which have been very popular over the last two years: you need to know the original work, "Generative Adversarial Nets," and also follow the latest research in the field.
Besides a direct Google search, the usual place to look for papers is the arXiv website, which lists the latest submissions in each category, such as computer vision and pattern recognition.
However, arXiv's listing only shows the title, authors, publication date, and similar metadata. If you recognize an author, some well-known researcher say, the paper is obviously worth a look; if not, you can only judge from the title, click through to the abstract, and then decide from the abstract whether the paper deserves downloading the PDF for intensive or extensive reading.
Ideally, the abstract would be shown directly on the listing page, saving that extra step.
So today I recommend a site, built on the API provided by arXiv, that makes machine learning papers easier to browse:
Website: www.arxiv-sanity.com/
The site currently indexes approximately 62,820 papers from the past few years, all on machine learning, and offers several tabs:
most recent
Shows the latest papers. For each paper it displays the title, authors, publication date, a figure from the paper, and the abstract. From there you can download the PDF, search for similar papers, or open the discussion area.
As for the discussion area, though, there do not seem to be many users leaving comments: the papers listed here typically have none, and you have to click the discussions tab to find papers with comments at all, usually just a single one each.
top recent
Ranks papers by how many logged-in users have saved them to their libraries. You can choose the time range to display: the last day, three days, a week, a month, a year, or all.
top hype
Shows papers that have been mentioned on Twitter. You can check which users mentioned them and the contents of the tweets, though these appear to be mostly direct retweets of arXiv's official tweets.
The remaining tabs, except discussions, require logging in. friends shows the papers your friends have saved, and recommended suggests papers based on what you have stored in your library.
Building the project
The code for this site is open source on Github:
Github.com/karpathy/ar…
The code file that fetches papers through the arXiv API is fetch_papers.py, where you can change the category of papers to fetch, beyond just machine learning. For the arXiv API itself, see the documentation at:
Arxiv.org/help/api/us…
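As a quick illustration (not part of the project's code), here is a minimal sketch of querying the arXiv API with the feedparser library; the category cs.CV and the query parameters are just example values:

```python
import feedparser

# Fetch the 10 most recent papers in an example category (cs.CV).
# Any other arXiv category string can be substituted here.
url = (
    "http://export.arxiv.org/api/query?"
    "search_query=cat:cs.CV&start=0&max_results=10"
    "&sortBy=submittedDate&sortOrder=descending"
)

feed = feedparser.parse(url)
for entry in feed.entries:
    print(entry.published, "|", entry.title)
```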
The code structure
According to the author, the code is divided into two parts:
The query code
Downloads the latest papers in the specified categories through the arXiv API, extracts the text of each paper, and creates TF-IDF vectors from it. This part covers the backend crawling and computation:
- Build a database of arxiv papers
- Compute content vector
- Generate thumbnails
- Calculate SVMs for users
- etc.
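To make the "compute content vector" step concrete, here is a rough sketch of building TF-IDF vectors with scikit-learn; it is a minimal stand-in, not the project's actual analyze code, and the sample abstracts are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up abstracts standing in for the extracted paper text.
docs = [
    "generative adversarial networks for image synthesis",
    "support vector machines for text classification",
]

# Unigram + bigram TF-IDF features, loosely mirroring the analyze step.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: documents x features
print(tfidf.shape)
```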
The user interface
This part is a web server (based on Flask/Tornado/SQLite) that queries papers from the database and ranks them for each user by similarity, etc.
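To make the division of labor concrete, here is a minimal Flask-plus-SQLite sketch in the same spirit; the table and column names are assumptions for illustration, not the project's actual schema:

```python
from flask import Flask, jsonify
import sqlite3

app = Flask(__name__)

@app.route("/papers")
def papers():
    # Illustrative only: the 'papers' table and its columns are assumed,
    # not taken from the project's real schema.
    con = sqlite3.connect("as.db")
    rows = con.execute(
        "SELECT title, authors FROM papers ORDER BY updated DESC LIMIT 20"
    ).fetchall()
    con.close()
    return jsonify(rows)

if __name__ == "__main__":
    app.run(port=5000)
```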
Dependent libraries
The required dependency libraries include:
- numpy
- feedparser: parses the XML feeds returned by the API
- scikit-learn: handles the TF-IDF vectors and implements the SVM algorithm
- Flask: serves the results
- flask_limiter
- tornado
- dateutil
- scipy
- sqlite3
The dependencies above can be installed with the following commands:

```bash
$ virtualenv env              # optional: use virtualenv
$ source env/bin/activate     # optional: use virtualenv
$ pip install -r requirements.txt
```
In addition, you also need ImageMagick and pdftotext, which can be installed on Ubuntu with:

```bash
sudo apt-get install imagemagick poppler-utils
```

Note that this command will pull in further dependencies of its own.
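For context, pdftotext is a command-line tool, and the text-extraction step essentially shells out to it once per paper; a minimal Python sketch (with purely illustrative filenames) might look like this:

```python
import subprocess

# Convert one downloaded PDF to plain text using pdftotext (poppler-utils).
# The input and output paths here are made-up examples.
subprocess.run(
    ["pdftotext", "pdf/example_paper.pdf", "txt/example_paper.txt"],
    check=True,
)
```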
Running the project
Running the whole project means executing several script files in turn, and it is a good idea to read each script closely first: they contain a number of settings you may want to change. Execute the scripts in the following order:

- fetch_papers.py: queries the arXiv API and creates the file db.p, which contains all the information for each paper. You can modify the query here, for example to fetch categories other than machine learning, such as databases. Note that querying too many papers at once is rate-limited by arXiv, so it is best to run the code in batches and use the --start-index parameter to set the starting position of each rerun.
- download_pdfs.py: downloads the papers and saves them to the pdf folder.
- parse_pdf_to_text.py: extracts all the text from the PDFs and saves it to the txt folder.
- thumb_pdf.py: generates thumbnail images of the PDFs and saves them to the thumb folder.
- analyze.py: computes the TF-IDF vectors of all documents based on bigrams and generates the files tfidf.p, tfidf_meta.p, and sim_dict.p.
- buildsvm.py: trains an SVM for every user and outputs the file user_sim.p (a toy sketch of this idea follows the list).
- make_cache.py: mainly preprocessing to speed up server startup. If this is the first run, make sure to execute sqlite3 as.db < schema.sql first to initialize an empty database.
- Start a mongodb daemon in the background. MongoDB can be installed following this tutorial: Docs.mongodb.com/tutorials/i… Start the service with the sudo service mongod start command, and make sure it is running in the background: the last line of the file /var/log/mongodb/mongod.log should read [initandlisten] waiting for connections on port <port>.
- Run serve.py to start the flask server, then visit localhost:5000 to see the final result!
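As promised above, here is a toy sketch of the per-user SVM idea: treat the papers a user has saved as positives and everything else as negatives over the TF-IDF features, then rank papers by the classifier's score. The shapes and data are made up; this is not the project's buildsvm.py code:

```python
import numpy as np
from sklearn import svm

# Made-up data: TF-IDF vectors for 100 papers with 20 features each.
X = np.random.rand(100, 20)
y = np.zeros(100)
y[:5] = 1  # pretend the user saved the first 5 papers

# One linear SVM per user; 'balanced' compensates for the few positives.
clf = svm.LinearSVC(class_weight="balanced")
clf.fit(X, y)

scores = clf.decision_function(X)  # higher = more similar to the library
print(np.argsort(-scores)[:10])    # indices of the top-10 recommendations
```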
Alternatively, you can run twitter_daemon.py in a screen session; it uses your Twitter API credentials (saved in the file twitter.txt) to find papers from the database that are mentioned on Twitter and saves the results to the file twitter.p.
The author wrote a simple command-line script that executes the above code in turn and runs it daily. It fetches the new papers, adds them to the existing database, and then recomputes all the TF-IDF vectors and classifiers.
Note: analyze.py does a lot of its computation with numpy, so it is recommended to install a BLAS library (such as OpenBLAS) to speed things up; with that, the computation takes only a few hours for roughly 25,000 papers and 5,000+ users.
Running online
If you want to run the Flask server online, for example on AWS, run the command python serve.py --prod.
You also need to create a secret file, secret_key.txt, and add random text to it (see the serve.py code for details).
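The serve.py code describes exactly what it expects; one simple way to generate such a file (an assumption on my part, not the project's documented method) is:

```python
import secrets

# Write 64 random hex characters to secret_key.txt.
# Whether serve.py wants this exact format is an assumption; check its code.
with open("secret_key.txt", "w") as f:
    f.write(secrets.token_hex(32))
```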
Current workflow
At present the site is not fully automated, so part of the code has to be run manually every day to fetch the latest papers. The author's script, mentioned above, contains:
```bash
python fetch_papers.py
python download_pdfs.py
python parse_pdf_to_text.py
python thumb_pdf.py
python analyze.py
python buildsvm.py
python make_cache.py
```
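If you prefer Python to a shell script, a minimal stand-in for that daily script could run the stages in order and stop at the first failure:

```python
import subprocess

# Run each pipeline stage in order; check=True aborts on the first error.
steps = [
    "fetch_papers.py", "download_pdfs.py", "parse_pdf_to_text.py",
    "thumb_pdf.py", "analyze.py", "buildsvm.py", "make_cache.py",
]
for script in steps:
    subprocess.run(["python", script], check=True)
```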
The service is then run in a screen session: execute screen -S serve to create the session (or use the -r parameter to reconnect), then run:

```bash
python serve.py --prod --port 80
```
The server will load the new files and display them on the site. However, some systems require sudo to use port 80. There are two workarounds: use iptables to redirect the port, or use setcap to elevate the permissions of your Python interpreter:
Stackoverflow.com/questions/4…
However, the setcap approach should be used with caution, and is best applied to the Python of a virtualenv rather than to the system interpreter.
Summary
Finally, here are the site and project addresses once more:
www.arxiv-sanity.com/
Github.com/karpathy/ar…