Recommendation engines are one of the best-known, most widely used, and highest-value applications of machine learning. While there are many resources available covering the fundamentals of training recommendation models, there are still relatively few that explain how to actually deploy these models to create a large-scale recommendation system.
This Code Pattern demonstrates the key elements of creating such a system using Apache Spark and Elasticsearch.
This repository contains a Jupyter Notebook that demonstrates how to use Spark to train a collaborative filtering recommendation model from ratings data stored in Elasticsearch and save the model factors back to Elasticsearch; Elasticsearch is then used to serve real-time recommendations from this model. The data you’ll be using comes from MovieLens and is a common benchmark dataset in the recommendation community. It consists of a set of ratings given to various movies by users of the MovieLens movie-rating system, along with metadata (title and genres) for each movie.
After completing this Code Pattern, you will know how to:
- Ingest and index user event data into Elasticsearch using the Elasticsearch Spark connector
- Load the event data into Spark DataFrames and use Spark’s machine learning library (MLlib) to train a collaborative filtering recommendation model (see the sketch after this list)
- Export the trained model to Elasticsearch
- Compute personalized user recommendations and similar-item recommendations using a custom Elasticsearch plugin, and combine recommendations with search and content filtering
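At the heart of the Notebook, the training step boils down to a few lines of PySpark. The following is a minimal sketch only, not the Notebook’s exact code: the column names follow the MovieLens schema, and the rank and regularization values are illustrative placeholders.

# Minimal sketch: train an ALS collaborative filtering model with Spark MLlib.
# Assumes a SparkSession and a `ratings` DataFrame with userId, movieId and rating columns.
from pyspark.ml.recommendation import ALS

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=20, regParam=0.01, seed=42)   # rank/regParam are illustrative values
model = als.fit(ratings)

# The fitted model exposes the latent factor vectors that the Notebook
# later exports to Elasticsearch for real-time scoring.
model.userFactors.show(3)   # columns: id, features
model.itemFactors.show(3)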
Flow
- Load the movie data set into Spark.
- Clean up the dataset using Spark DataFrame operations and load it into Elasticsearch.
- Using Spark MLlib, train a collaborative filtering recommendation model.
- Save the resulting model in Elasticsearch.
- Generate some example recommendations using Elasticsearch queries and a custom vector scoring plugin. Use The Movie Database API to display movie poster images for the recommended movies.
Included components
- Apache Spark: An open source, fast, and versatile cluster computing system
- Elasticsearch: An open source search and analytics engine
- Jupyter Notebook: An open source web application for creating and sharing documents that contain live code, equations, visualizations, and explanatory text.
Featured technologies
- Data science: The systematic and scientific method of analyzing structured and unstructured data to extract knowledge and insights.
- Artificial intelligence: Artificial intelligence can be applied to different solution domains to deliver disruptive technologies.
- Python: Python is a programming language that allows you to work faster and integrate systems more efficiently.
Watch the video
Steps
Follow these steps to create the required service and run the Notebook locally.
- Clone the repository
- Set up Elasticsearch
- Download the Elasticsearch Spark connector
- Download Apache Spark
- Download the data
- Start the Notebook
- Run the Notebook
1. Clone the repository
Clone the elasticsearch-spark-recommender repository locally. In a terminal, run the following command:
$ git clone https://github.com/IBM/elasticsearch-spark-recommender.git
2. Set up Elasticsearch
This Code Pattern currently depends on Elasticsearch 5.3.0. Go to the downloads page and get the right package for your system.
For example, on Linux/Mac, you can download the TAR archive and decompress it using the following command:
$ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.3.0.tar.gz
$ tar xfz elasticsearch-5.3.0.tar.gz
Use the following command to change the directory to the newly extracted folder:
$ cd elasticsearch-5.3.0
Next, you need to install the Elasticsearch vector scoring plugin. To do this, run the following command (Elasticsearch will download the plugin file for you):
$ ./bin/elasticsearch-plugin install https://github.com/MLnick/elasticsearch-vector-scoring/releases/download/v5.3.0/elasticsearch-vector-scoring-5.3.0.zip
Next, start Elasticsearch (do this in a separate terminal window so it keeps running):
$ ./bin/elasticsearch
You will see some startup logs. Check that the elasticsearch-vector-scoring plugin has been loaded successfully:
$ ./bin/elasticsearch
[2017-09-08T15:58:18,781][INFO ][o.e.n.Node             ] [] initializing ...
[2017-09-08T15:58:19,406][INFO ][o.e.p.PluginsService   ] [2Zs8kW3] loaded plugin [elasticsearch-vector-scoring]
[2017-09-08T15:58:20,676][INFO ][o.e.n.Node             ] [2Zs8kW3] initialized
Finally, you will need to install the Elasticsearch Python client. To do this, run the following command (use a different terminal window from the one running Elasticsearch):
$ pip install elasticsearch
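With Elasticsearch running and the Python client installed, you can quickly verify the connection from Python. This is a minimal sketch assuming the default local node on port 9200:

# Minimal sanity check that the local Elasticsearch node is reachable.
from elasticsearch import Elasticsearch

es = Elasticsearch()   # defaults to http://localhost:9200
print(es.info())       # prints cluster name and version details if the node responds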
3. Download the Elasticsearch Spark connector
The Elasticsearch Hadoop project provides a connector between Elasticsearch and a variety of Hadoop-compatible systems, including Spark. The project provides a ZIP file for download that contains all of these connectors. When you run the PySpark Notebook, you need to place the Spark-specific connector JAR file on the classpath. Follow these steps to set up the connector:
1. Download the elasticsearch-hadoop-5.3.0.zip file, which contains all the connectors. To do this, you can run:
$ wget http://download.elastic.co/hadoop/elasticsearch-hadoop-5.3.0.zip
2. Run the following command to decompress the file:
$ unzip elasticsearch-hadoop-5.3.0.zip
3. The Spark connector JAR is named elasticsearch-spark-20_2.11-5.3.0.jar and is located in the dist subfolder of the directory you just extracted.
4. Download Apache Spark
This Code Pattern should work with any Spark 2.x version, but it is recommended to download the latest version of Spark (currently 2.2.0) from the downloads page. After downloading the file, run the following command to extract it:
$ tar xfz spark-2.2.0-bin-hadoop2.7.tgz
Note that if you download a different version, you should adjust the relevant commands above and throughout the rest of this Code Pattern accordingly.
You also need NumPy installed to use Spark’s machine learning library, MLlib. If NumPy is not installed, run the following command:
$ pip install numpy
5. Download the data
You will use the MovieLens dataset, which contains a set of movie ratings provided by users together with movie metadata. There are several versions of this dataset; you should download the latest "small" version.
Run the following command from the base directory of the Code Pattern repository:
$ cd data
$ wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
$ unzip ml-latest-small.zip
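Once unzipped, the ml-latest-small folder contains CSV files such as ratings.csv and movies.csv. In the Notebook these are loaded into Spark DataFrames roughly as follows (a sketch assuming a running SparkSession named spark, with paths relative to the repository base directory):

# Minimal sketch: load the MovieLens CSV files into Spark DataFrames.
ratings = spark.read.csv("data/ml-latest-small/ratings.csv", header=True, inferSchema=True)
movies = spark.read.csv("data/ml-latest-small/movies.csv", header=True, inferSchema=True)
ratings.show(5)   # userId, movieId, rating, timestamp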
6. Start the Notebook
This Notebook should work with Python 2.7 or 3.x (it has been tested on 2.7.11 and 3.6.1).
To run this Notebook, you need to start a PySpark session in a Jupyter Notebook. If Jupyter is not installed, you can run the following command to install it:
$ pip install jupyter
When you start the Notebook, include the Elasticsearch Spark connector JAR from step 3 on your classpath.
Run the following command to start your PySpark Notebook server locally. For this command to work correctly, you need to launch it from the base directory of the Code Pattern repository that you cloned in step 1. If you are not in that directory, cd into it first.
PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" ../spark-2.2.0-bin-hadoop2.7/bin/pyspark --driver-memory 4G --driver-class-path ../../elasticsearch-hadoop-5.3.0/dist/elasticsearch-spark-20_2.11-5.3.0.jar
This opens a browser window showing the contents of the Code Pattern folder. Click the notebooks folder, and then click the elasticsearch-spark-recommender.ipynb file to launch the Notebook.
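With the connector JAR on the driver classpath, Elasticsearch is available inside the Notebook as a Spark data source named es. The following is a rough sketch of that usage (the demo/ratings index name mirrors the one mentioned in the troubleshooting section; connection options such as es.nodes default to localhost):

# Minimal sketch: write a DataFrame to Elasticsearch through the es-hadoop connector...
ratings.write.format("es").save("demo/ratings")

# ...and read an index back into a DataFrame the same way.
ratings_from_es = spark.read.format("es").load("demo/ratings")
ratings_from_es.printSchema()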
Optional:
To display images in the recommendation demo, you need access to The Movie Database (TMDb) API. Follow the instructions to obtain an API key. You also need to install the Python client using the following command:
$ pip install tmdbsimple
The demo can be run without API access, but no images will be displayed (so it won’t look as pretty!).
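If you do set up API access, the usage from Python looks roughly like this. The movie ID below is purely illustrative, and the exact calls in the Notebook may differ:

# Minimal sketch: fetch movie metadata (including the poster path) from TMDb.
import tmdbsimple as tmdb

tmdb.API_KEY = 'YOUR_API_KEY'   # the key obtained from your TMDb account
movie = tmdb.Movies(1893)       # illustrative movie ID
info = movie.info()
print(info['title'], info.get('poster_path'))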
7. Run the Notebook
When you run the Notebook, each code cell in the Notebook is executed, in order, from top to bottom.
Each code cell is selectable and is preceded by a tag in the left margin. The tag format is In [x]:. Depending on the state of the Notebook, x can be:
- Blank, indicating that the cell has never been executed.
- A number, indicating the relative order in which this cell was executed.
- A *, indicating that the cell is currently executing.
There are several ways to execute the code cells in the Notebook:
- One cell at a time. Select the cell, then press the Play button in the toolbar. You can also press Shift+Enter to execute the cell and advance to the next cell.
- Batch mode, in sequential order. The Cell menu bar contains several options. For example, you can choose Run All to run all the cells in the Notebook, or Run All Below to start from the first cell below the currently selected cell and then execute all subsequent cells.
Sample output
The sample output in the data/examples folder shows the complete output from running the Notebook. You can view it here.
Note: To see the code and markdown cells without output, view the raw Notebook in the GitHub viewer.
Troubleshooting
- Error:
java.lang.ClassNotFoundException: Failed to find data source: es.
If you see this error when trying to write data from Spark to Elasticsearch in the Notebook, it means that Spark did not find the Elasticsearch Spark connector (elasticsearch-spark-20_2.11-5.3.0.jar) on the classpath when the Notebook was started.
Solution: First relaunch the Notebook using the command in step 6, making sure to run it from the base directory of the Code Pattern repository.
If that doesn’t work, try using the fully qualified path to the JAR file when starting the Notebook, for example: PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" ../spark-2.2.0-bin-hadoop2.7/bin/pyspark --driver-memory 4G --driver-class-path FULL_PATH/elasticsearch-hadoop-5.3.0/dist/elasticsearch-spark-20_2.11-5.3.0.jar, where FULL_PATH is the fully qualified (not relative) path to the directory in which you unzipped the connector ZIP file.
- Error:
org.elasticsearch.hadoop.EsHadoopIllegalStateException: SaveMode is set to ErrorIfExists and index demo/ratings exists and contains data.Consider changing the SaveMode
If you see this error when you try to write data from Spark to Elasticsearch in the Notebook, it means that you have already written data to the relevant index (for example, ratings data to the ratings index).
Solution: Try continuing to work through the Notebook from the next cell. Alternatively, you can first delete all the indexes and re-run the Elasticsearch command to create the index mappings (see Step 2: Load data into Elasticsearch in the Notebook), as in the sketch below.
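For reference, deleting the index from the Python client looks roughly like this (a minimal sketch; demo is the index name that appears in the error message above):

# Minimal sketch: delete the demo index so the Notebook's index-creation
# and data-loading cells can be re-run from scratch.
from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.delete(index='demo', ignore=[404])   # ignore the error if the index is already gone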
- Error:
ConnectionRefusedError: [Errno 61] Connection refused
You may see this error when trying to connect to Elasticsearch in the Notebook. This may mean that your Elasticsearch instance is not running.
Solution: In a new terminal window, cd to the directory where Elasticsearch is installed and run ./bin/elasticsearch to start Elasticsearch.
- Error:
Py4JJavaError: An error occurred while calling o130.save. : org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[127.0.0.1:9200]]
You may see this error when you try to read data from Elasticsearch to Spark (or write data from Spark to Elasticsearch) in the Notebook. This may mean that your Elasticsearch instance is not running.
Solution: In a new terminal window, cd to the directory where Elasticsearch is installed and run ./bin/elasticsearch to start Elasticsearch.
- Error:
ImportError: No module named elasticsearch
If you experience this error, it means that the Elasticsearch Python client is not installed or may not be found on PYTHONPATH.
Solution: Try installing the client using $ pip install elasticsearch (if you are running in a Python virtual environment such as Conda or Virtualenv), or otherwise $ sudo pip install elasticsearch. If that doesn’t work, add your site-packages folder to your Python path (for example, on Mac: export PYTHONPATH=/Library/Python/2.7/site-packages for Python 2.7; for a Linux example, see this Stack Overflow question). Note: The same generic solution applies to any other module import errors you may encounter.
- Error:
HTTPError: 401 Client Error: Unauthorized for url: https://api.themoviedb.org/3/movie/1893?api_key=...
If you see this error in the Notebook when testing TMDb API access or generating recommendations, it means that you have the tmdbsimple Python package installed but have not set up your API key.
Solution: Follow the instructions at the end of step 6 to set up your TMDb account and get your API key. Then copy that key into the tmdb.API_KEY = 'YOUR_API_KEY' line in the Notebook cell after Step 1: Prepare the data (replacing YOUR_API_KEY with the correct key). After doing so, execute that cell to test your access to the TMDb API.
Links
- Demo on Youku: Watch the video.
- Meetup video demo: Watch a meetup presentation covering some of the background and technical details of this Code Pattern.
- Meetup presentation slides: View the slides from the presentation.
- ApacheCon Big Data Europe 2016: Watch an extended version of this meetup talk.
- Data and Analytics: Learn how this Code Pattern fits into the data and analytics reference architecture.
Learn more
- Data analytics Code Patterns: Like this Code Pattern? Check out our other data analytics Code Patterns.
- AI and Data Code Pattern Playlists: Bookmark playlists containing all of our Code Pattern videos
- Data Science Experience: Master the art of Data Science with IBM Data Science Experience
- Spark on the IBM Cloud: Need a Spark cluster? Create up to 30 Spark executors on the IBM Cloud through our Spark service.