Overview

  • To support the "search by image" scenario of retrieving similar images, this project builds an image search system on Elasticsearch (ES) vector scoring and the VGG16 image feature extraction model.
  • Open source: github.com/thirtyonele…

Retrieval scenario

  • Inference: the image is read and the model generates a feature vector
  • Feature storage: the feature vector is stored in ES
  • Retrieval: vectors are searched online in real time
  • The specific process is as follows:
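
As a rough illustration of the three stages (not the project's actual code), the sketch below uses a toy intensity histogram in place of VGG16 and an in-memory dictionary in place of the ES index. The names `extract_features` and `InMemoryIndex` are hypothetical and exist only for this example.

```python
import math


def extract_features(pixels, bins=4):
    """Toy stand-in for VGG16: a unit-length intensity histogram."""
    hist = [0.0] * bins
    for p in pixels:  # p is a grayscale value in 0..255
        hist[min(p * bins // 256, bins - 1)] += 1.0
    norm = math.sqrt(sum(h * h for h in hist)) or 1.0
    return [h / norm for h in hist]


class InMemoryIndex:
    """Stand-in for the ES index: maps an image name to its feature vector."""

    def __init__(self):
        self.docs = {}

    def add(self, name, vector):
        # Feature storage step
        self.docs[name] = vector

    def search(self, query_vector, size=1):
        # Retrieval step: vectors are unit length, so the dot product
        # equals the cosine similarity. Every stored vector is scored.
        scored = [
            (sum(q * v for q, v in zip(query_vector, vec)), name)
            for name, vec in self.docs.items()
        ]
        scored.sort(reverse=True)
        return [name for _, name in scored[:size]]


index = InMemoryIndex()
index.add("dark.jpg", extract_features([10, 20, 30, 40]))
index.add("bright.jpg", extract_features([220, 230, 240, 250]))

# Inference on the query image, then retrieval of the closest stored image
result = index.search(extract_features([5, 15, 25, 60]), size=1)
```

In the real system the histogram would be replaced by the VGG16 feature vector and the dictionary by an ES index.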

ES vector index

  • Dense vector: stores a dense vector as a single-valued array field with a maximum length of 2048; array lengths may differ between documents
  • Sparse vector: stores a sparse vector as a non-nested JSON object. Each key is a dimension position, written as an integer string in the range [0, 65535], and each value is the vector value at that position. Sparse vectors were deprecated in version 7.6, so use them with caution
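
A minimal sketch of an index mapping with a dense vector field, assuming an ES 7.x release such as 7.5, where the mapping declares the vector length through `dims`. The value 512 and the `keyword` type for `id` and `name` are assumptions for illustration; the actual length depends on which VGG16 layer the features come from.

```python
VECTOR_DIMS = 512  # assumption: must match the length of the extracted features


def build_index_body(dims=VECTOR_DIMS):
    """Build the request body for creating an index with a dense_vector field."""
    return {
        "mappings": {
            "properties": {
                "id": {"type": "keyword"},
                "name": {"type": "keyword"},
                # Field name matches the one used in the scoring scripts
                "image_vector": {"type": "dense_vector", "dims": dims},
            }
        }
    }


index_body = build_index_body()
# With the official Python client, this body could be passed as:
# client.indices.create(index="image_retrieval", body=index_body)
```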

ES retrieval implementation

  • ES provides four scoring methods: cosine similarity, Manhattan (L1) distance, Euclidean (L2) distance, and dot product. The code is as follows:
# Pick one of the following script_query definitions.

# Cosine similarity (+1.0 keeps the score non-negative)
script_query = {
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "cosineSimilarity(params.query_vector, doc['image_vector']) + 1.0",
            "params": {"query_vector": query_vector}
        }
    }
}

# Manhattan (L1) distance
script_query = {
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "1 / (1 + l1norm(params.queryVector, doc['image_vector']))",
            "params": {"queryVector": query_vector}
        }
    }
}

# Euclidean (L2) distance
script_query = {
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "1 / (1 + l2norm(params.queryVector, doc['image_vector']))",
            "params": {"queryVector": query_vector}
        }
    }
}

# Dot product (documents without a vector score 0)
script_query = {
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": """
                double value = doc['image_vector'].size() == 0 ? 0 : dotProduct(params.query_vector, doc['image_vector']);
                return value;
            """,
            "params": {"query_vector": query_vector}
        }
    }
}

# Run the search; the returned vector field matches the one used in the scripts
response = self.client.search(
    index=self.index_name,
    body={
        "size": search_size,
        "query": script_query,
        "_source": {"includes": ["id", "name", "image_vector"]}
    }
)
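
To make the scoring formulas concrete, here is a plain-Python sketch of what each script source above computes for a single query/document pair. It only mirrors the arithmetic; the function names are illustrative and are not part of ES or the project.

```python
import math


def cosine_score(q, d):
    """cosineSimilarity(q, d) + 1.0, so the score lies in [0, 2]."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm + 1.0


def l1_score(q, d):
    """1 / (1 + l1norm): a smaller Manhattan distance gives a higher score."""
    return 1.0 / (1.0 + sum(abs(a - b) for a, b in zip(q, d)))


def l2_score(q, d):
    """1 / (1 + l2norm): a smaller Euclidean distance gives a higher score."""
    return 1.0 / (1.0 + math.sqrt(sum((a - b) ** 2 for a, b in zip(q, d))))


def dot_score(q, d):
    """Dot product, or 0 when the document has no vector."""
    return sum(a * b for a, b in zip(q, d)) if d else 0.0


query = [1.0, 0.0]
doc = [0.0, 1.0]
scores = {
    "cosine": cosine_score(query, doc),
    "l1": l1_score(query, doc),
    "l2": l2_score(query, doc),
    "dot": dot_score(query, doc),
}
```

The distance-based variants invert the distance so that closer vectors rank higher, which is what a relevance score requires.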

ES server installation

docker run -it -d -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.5.0

Usage

  • Download the project source code: github.com/thirtyonele…
  • Operation 1: Build the base index
python index.py
  --train_data: path to the training images folder. Default: '<ROOT_DIR>/data/train'
  --index_file: path where the index file is stored. Default: '<ROOT_DIR>/index/train.h5'
  • Operation 2: Run a similarity search
python retrieval.py --engine=es
  --test_data: path to the query image. Default: '<ROOT_DIR>/data/test/001_accordion_image_0001.jpg'
  --index_file: path to the H5 index file
  --db_name: name of the ES or Milvus index. Default: 'image_retrieval'
  --engine: search engine type. Default: 'numpy'. Options: numpy, faiss, es, or milvus

Conclusion

  • Vector fields extend Elasticsearch so that it supports vector retrieval
  • The approach benefits directly from Elasticsearch's distributed, scalable architecture
  • Elasticsearch's query features and plug-ins make it easy to combine vector search with conditions on other fields
  • ES vector scoring is a linear scan: query time grows with the number of documents and depends on hardware performance, so benchmark it before use
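
Because `script_score` only scores the documents matched by its inner query, one way to limit the cost of the linear scan is to replace `match_all` with a filter. The sketch below assumes the index also stores a `category` keyword field; that field and the function name are hypothetical and not part of the project.

```python
def build_filtered_query(query_vector, category):
    """Score only documents whose (hypothetical) category field matches."""
    return {
        "script_score": {
            # The filter narrows the candidate set before vectors are scored
            "query": {"bool": {"filter": [{"term": {"category": category}}]}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, doc['image_vector']) + 1.0",
                "params": {"query_vector": query_vector},
            },
        }
    }


filtered_query = build_filtered_query([0.1, 0.2, 0.3], "accordion")
```

The resulting dictionary can be used as the `"query"` value in the search body, in place of a `match_all` based `script_query`.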

That’s all!