- Build an Article Recommendation Engine With AI/ML
- Written by Tyler Hawkins
- The Nuggets Translation Project
- Permanent link to this article: github.com/xitu/gold-m…
- Translator: jaredliw
- Proofreaders: Greycodee, KimYangOfCat
Content platforms thrive on recommending relevant content to their users. The more relevant the content a platform offers, the longer users stay on the site, which often translates into advertising revenue for the company.
If you’ve ever visited a news website, digital publisher, or blogging platform, you’ve probably been exposed to recommendation engines. Each platform suggests content you might like, based on your reading history.
As a simple solution, a platform could implement a tag-based recommendation engine: if you read a "business" article, it recommends five more articles tagged "business." An even better way to build a recommendation engine, though, is to use similarity search and a machine learning algorithm.
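To make the contrast concrete, here is a minimal sketch of what such a tag-based engine might look like. The article data and the recommend_by_tag helper are hypothetical, purely for illustration; they are not part of the app we build below:

```python
# A naive tag-based recommender: suggest articles that share a tag
# with something in the user's reading history. All data is made up.
articles = [
    {"id": 1, "title": "Markets rally on strong earnings", "tag": "business"},
    {"id": 2, "title": "Quarterly outlook for retail", "tag": "business"},
    {"id": 3, "title": "Wimbledon finals recap", "tag": "sports"},
]

def recommend_by_tag(read_article_ids, limit=5):
    # Collect the tags of everything the user has already read...
    read_tags = {a["tag"] for a in articles if a["id"] in read_article_ids}
    # ...and return unread articles carrying any of those tags.
    return [a for a in articles
            if a["tag"] in read_tags and a["id"] not in read_article_ids][:limit]

print(recommend_by_tag({1}))  # -> the other "business" article (id 2)
```

The weakness of this approach is plain to see: two articles can share a tag while being about very different things, and a coarse tag can never capture how close two pieces of text actually are. Similarity search over embeddings addresses exactly that.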
In this article, we will build a Python Flask application that uses Pinecone, a similarity search service, to create our own article recommendation engine.
Demo application overview
The short animation below demonstrates how our application works. Ten articles appear on the page initially. The user can choose any combination of those ten articles to represent their reading history. When the user clicks the submit button, that reading history is used as the input to query the article database, and ten more related articles are presented to the user.
As you can see, the related articles returned are highly relevant! With ten articles to choose from, there are 2^10 = 1,024 possible combinations of reading history that can be used as input, and every combination produces meaningful results.
So how did we do it?
When building the application, we first found a dataset of news articles on Kaggle. The dataset contains 143,000 news articles from 15 major publishers, but we used only the first 20,000. (The full dataset contains over 2 million articles!)
After that, we cleaned up the dataset by renaming a couple of columns and removing the unnecessary ones. Next, we ran the articles through an embedding model to create a vector embedding for each one. These embeddings are the metadata that the machine learning algorithm uses to determine how similar various inputs are. We used the Average Word Embeddings model. We then inserted these vector embeddings into a vector index managed by Pinecone.
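To see what the embedding step produces on its own, here is a small sketch that encodes two made-up sentences with the same Average Word Embeddings model and compares them with cosine similarity. It assumes a recent sentence_transformers release that provides util.cos_sim; the sample sentences are invented:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('average_word_embeddings_komninos')

# Each text becomes one fixed-length vector (300 dimensions for this model).
embeddings = model.encode([
    "Serena Williams wins her opening match at Wimbledon",
    "Federer advances to the quarterfinals",
])
print(embeddings.shape)  # (2, 300)

# Cosine similarity near 1.0 means the two texts are semantically close.
print(util.cos_sim(embeddings[0], embeddings[1]))
```

It is this notion of distance between vectors, rather than any hand-assigned tag, that the index uses to decide which articles are "similar."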
Once the vector embeddings are added to the index, we are ready to start finding related content. When a user submits their reading history, a request is made to an API endpoint that uses Pinecone's SDK to query the index of vector embeddings. The endpoint returns ten similar news articles and displays them in the application's UI. That's it! Easy enough, right?
If you want to try it out for yourself, you can find the source code for the app on GitHub. The README contains instructions for running the application on your local machine.
Code implementation
We've seen the inner workings of the application, but how did we actually build it? As mentioned earlier, this is a Python Flask application that uses the Pinecone SDK. The HTML uses template files, and the rest of the front end is built with static CSS and JavaScript. For simplicity, all of the back-end code lives in the app.py file, which is reproduced in full below:
```python
from dotenv import load_dotenv
from flask import Flask
from flask import render_template
from flask import request
from flask import url_for
import json
import os
import pandas as pd
import pinecone
import re
import requests
from sentence_transformers import SentenceTransformer
from statistics import mean
import swifter

app = Flask(__name__)

PINECONE_INDEX_NAME = "article-recommendation-service"
DATA_FILE = "articles.csv"
NROWS = 20000

def initialize_pinecone():
    load_dotenv()
    PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
    pinecone.init(api_key=PINECONE_API_KEY)

def delete_existing_pinecone_index():
    if PINECONE_INDEX_NAME in pinecone.list_indexes():
        pinecone.delete_index(PINECONE_INDEX_NAME)

def create_pinecone_index():
    pinecone.create_index(name=PINECONE_INDEX_NAME, metric="cosine", shards=1)
    pinecone_index = pinecone.Index(name=PINECONE_INDEX_NAME)

    return pinecone_index

def create_model():
    model = SentenceTransformer('average_word_embeddings_komninos')

    return model

def prepare_data(data):
    # rename the id column and remove unnecessary columns
    data.rename(columns={"Unnamed: 0": "id"}, inplace=True)
    data.drop(columns=['date'], inplace=True)

    # extract only the first few sentences of each article for faster vector computation
    data['content'] = data['content'].fillna('')
    data['content'] = data.content.swifter.apply(lambda x: ' '.join(re.split(r'(?<=[.:;])\s', x)[:4]))
    data['title_and_content'] = data['title'] + ' ' + data['content']

    # create a vector embedding based on the title and content columns
    encoded_articles = model.encode(data['title_and_content'], show_progress_bar=True)
    data['article_vector'] = pd.Series(encoded_articles.tolist())

    return data

def upload_items(data):
    items_to_upload = [(row.id, row.article_vector) for i, row in data.iterrows()]
    pinecone_index.upsert(items=items_to_upload)

def process_file(filename):
    data = pd.read_csv(filename, nrows=NROWS)
    data = prepare_data(data)
    upload_items(data)
    pinecone_index.info()

    return data

def map_titles(data):
    return dict(zip(data.id, data.title))

def map_publications(data):
    return dict(zip(data.id, data.publication))

def query_pinecone(reading_history_ids):
    reading_history_ids_list = list(map(int, reading_history_ids.split(',')))
    reading_history_articles = uploaded_data.loc[uploaded_data['id'].isin(reading_history_ids_list)]
    article_vectors = reading_history_articles['article_vector']
    reading_history_vector = [*map(mean, zip(*article_vectors))]

    query_results = pinecone_index.query(queries=[reading_history_vector], top_k=10)
    res = query_results[0]

    results_list = []

    for idx, _id in enumerate(res.ids):
        results_list.append({
            "id": _id,
            "title": titles_mapped[int(_id)],
            "publication": publications_mapped[int(_id)],
            "score": res.scores[idx],
        })

    return json.dumps(results_list)


initialize_pinecone()
delete_existing_pinecone_index()
pinecone_index = create_pinecone_index()
model = create_model()
uploaded_data = process_file(filename=DATA_FILE)
titles_mapped = map_titles(uploaded_data)
publications_mapped = map_publications(uploaded_data)

@app.route("/")
def index():
    return render_template("index.html")

@app.route("/api/search", methods=["POST", "GET"])
def search():
    if request.method == "POST":
        return query_pinecone(request.form["history"])
    if request.method == "GET":
        return query_pinecone(request.args.get("history", ""))
    return "Only GET and POST methods are allowed for this endpoint"

app.run()
```
Let's walk through the important parts of the app.py file so we understand how it works.
On lines 1 through 14, we import our application's dependencies. Our application relies on the following packages:
- dotenv: for reading environment variables from our .env file
- flask: for the web application setup
- json: for working with JSON
- os: also for getting environment variables
- pandas: for working with the dataset
- pinecone: for working with the Pinecone SDK
- re: for working with regular expressions (regex)
- requests: for making API requests to download our dataset
- statistics: for some handy statistics methods
- sentence_transformers: for our embedding model
- swifter: for working with the pandas DataFrame
On line 16, we include some boilerplate code that tells Flask the name of our application.
On lines 18 through 20, we define some constants that will be used throughout the app: the name of our Pinecone index, the file name of the dataset, and the number of rows to read from the CSV file.
On lines 22 through 25, our initialize_pinecone method grabs the API key from the .env file and uses it to initialize Pinecone.
On lines 27 through 29, our delete_existing_pinecone_index method searches our Pinecone instance for an index with the same name as the one we want to use ("article-recommendation-service"). If an existing index is found, we delete it.
On lines 31 through 35, our create_pinecone_index method creates a new index using the name we chose ("article-recommendation-service"), the cosine similarity metric, and a single shard.
On lines 37 through 40, our create_model method uses the sentence_transformers library to load the Average Word Embeddings model. We will use this model later to encode our vector embeddings.
On lines 62 through 68, our process_file method reads the CSV file and then calls the prepare_data and upload_items methods, both of which are described below.
On lines 42 through 56, our prepare_data method adjusts the dataset by renaming the ID column and dropping the date column. It then grabs the first four sentences of each article and combines them with the article title to create a new field that serves as the data to encode. We could create the vector embedding from the entire body of each article, but four sentences are enough and they speed up the encoding process considerably, as the sketch below shows.
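As a quick illustration of that sentence extraction, here is how the same regular expression behaves on a made-up snippet. The lookbehind `(?<=[.:;])\s` splits on any whitespace that directly follows a period, colon, or semicolon:

```python
import re

text = ("First sentence. Second sentence. Third sentence. "
        "Fourth sentence. Fifth sentence.")

# Split on sentence-ending punctuation, keep only the first four sentences.
sentences = re.split(r'(?<=[.:;])\s', text)[:4]
print(' '.join(sentences))
# First sentence. Second sentence. Third sentence. Fourth sentence.
```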
On lines 58 through 60, our upload_items method creates an entry for each article from its vector embedding and then inserts those vector embeddings into the Pinecone index.
On lines 70 through 74, our map_titles and map_publications methods create dictionaries of the titles and publications so that it is easy to look up articles by their IDs later.
Each of the methods described above is called on lines 98 through 104 when the back-end application starts. This work sets us up for the final step: querying the Pinecone index based on user input.
On lines 106 through 116, we define two routes for the application: one for the home page and one for the API endpoint. The home page serves the index.html template file along with its JavaScript and CSS assets, while the API endpoint provides the search functionality for querying the Pinecone index.
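As a sanity check, assuming the app is running locally on Flask's default port 5000, you could exercise the search endpoint like this (the article IDs here are placeholders):

```python
import requests

# "history" is a comma-separated list of article IDs the user has read.
response = requests.get(
    "http://localhost:5000/api/search",
    params={"history": "1,2,3"},
)
print(response.json())  # ten similar articles with id, title, publication, score
```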
Finally, on lines 76 through 96, our query_pinecone method takes the user's reading history input, converts it into a vector embedding, and then queries the Pinecone index to find similar articles. This method is called when the /api/search endpoint is hit, which happens every time the user submits a new search query.
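That reading history vector is simply the element-wise mean of the selected articles' embedding vectors. Here is a tiny worked example using made-up three-dimensional vectors in place of the model's real 300-dimensional ones:

```python
from statistics import mean

# Two (made-up) article embeddings, shortened to 3 dimensions for readability.
article_vectors = [
    [0.25, 1.0, 0.5],
    [0.75, 0.0, 1.5],
]

# Same trick as in app.py: zip(*...) pairs up each dimension, mean averages it.
reading_history_vector = [*map(mean, zip(*article_vectors))]
print(reading_history_vector)  # [0.5, 0.5, 1.0]
```

Averaging places the query vector "between" the articles the user has read, so the nearest neighbors returned by Pinecone reflect the reading history as a whole rather than any single article.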
For visual learners, here’s a diagram outlining how the application works:
Example scenarios
So, putting it all together, what does the user experience look like? Let's look at three scenarios: a user interested in sports, a user interested in technology, and a user interested in politics.
The sports-minded user selects the first two articles, which are about the famous tennis players Serena Williams and Andy Murray, as their reading history. After they submit their choices, the app returns articles about Wimbledon, the US Open, Roger Federer, and Rafael Nadal. Spot on!
The user interested in technology selects articles about Samsung and Apple. After they submit their choices, the app returns articles about Samsung, Apple, Google, Intel, and the iPhone. Great recommendations again!
The user interested in politics selects a single article about voter fraud. After they submit their choice, the app returns articles about voter ID, the 2020 US election, voter turnout, and claims of illegal voting (and why those claims don't hold up).
Three scenarios, three wins! Our recommendation engine has proven to be really useful.
Conclusion
We have now created a simple Python application that solves a real-world problem. If a content site can recommend relevant content to its users, the users will enjoy it more and spend more time on the site, generating more revenue for the company. A win-win!
Similarity search helps you offer better suggestions to your users. And Pinecone, as a similarity search service, makes it easy to give recommendations to users so that you can focus on what you do best: building a great platform full of content worth reading.