Milvus is a similarity search engine for massive feature vectors, and a single Milvus server can handle data at the billion scale. For data sets beyond that scale, a Milvus cluster that can scale horizontally is needed to deliver high-performance retrieval over massive vector data. Mishards is a Milvus cluster middleware developed in Python. A Milvus cluster built with Mishards provides core functions such as request forwarding, read/write separation, horizontal scaling, and dynamic scaling, giving users the ability to run similarity retrieval over very large vector sets. This article shows how to use the Mishards sharding middleware to build a Milvus cluster. It is divided into three parts:
- Overview of Mishards: how Mishards works and the Milvus cluster architecture
- Setup procedure: setting up a Milvus cluster, using two servers as an example
- Test: importing 100 million vectors into the cluster and analyzing how the cluster operates
Working principle:
Workflow:
- The client sends a request to the Proxy
- The Proxy splits the client request
- The sub-requests are routed to the internal Milvus instances
- Each instance sends its own results to the Proxy
- The Proxy aggregates the results
- The final result is returned to the client
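The workflow above follows a scatter-gather pattern. The sketch below illustrates that pattern in plain Python; it is not the Mishards implementation. The `shards` data, the function names `search_shard` and `proxy_search`, and the choice of squared Euclidean distance are all assumptions made for this example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# Hypothetical in-memory "instances": each holds part of the data
# as (vector_id, vector) pairs. Real Milvus instances are separate servers.
shards = {
    "instance-1": [(1, [0.0, 0.0]), (2, [5.0, 5.0])],
    "instance-2": [(3, [1.0, 1.0]), (4, [9.0, 9.0])],
}


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_shard(vectors, query, top_k):
    """Local top-k on one instance: sorted list of (distance, vector_id)."""
    scored = [(squared_l2(vec, query), vec_id) for vec_id, vec in vectors]
    return heapq.nsmallest(top_k, scored)


def proxy_search(query, top_k):
    """Send the query to every instance, then merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(
            pool.map(lambda vecs: search_shard(vecs, query, top_k), shards.values())
        )
    # Each partial list is already sorted, so a k-way merge keeps global order.
    merged = heapq.merge(*partials)
    return list(islice(merged, top_k))


# Print only the ids of the two nearest vectors across both instances.
print([vec_id for _, vec_id in proxy_search([0.9, 0.9], top_k=2)])
```

Because every instance returns its own top-k, the merged top-k equals the top-k over the whole data set, which is why the Proxy only needs the partial results and never the raw vectors.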
Cluster architecture:
The overall cluster architecture of Mishards is as follows:
- Milvus must be installed and started on each server
- Mishards only needs to be started on one of the servers
- Any server can be chosen to host the shared storage
1. Install MySQL
MySQL only needs to be started on any one server. In this example it is installed on server A (192.168.1.38).
- Install and start the MySQL service by following the tutorials on the official MySQL website
- Or install MySQL with Docker
2. Start Milvus
Milvus must be installed on each server. Different Milvus instances have different read and write permissions. In this example, the Milvus instance on the first server is configured as writable and the Milvus instance on the second server is configured as read-only.
Only one Milvus instance in the cluster can be configured as writable. The others are read-only.
In the server_config.yml file, modify the parameters as follows:

    version: 0.1                     # config version

    server_config:
      address: 0.0.0.0               # milvus server ip address (IPv4)
      port: 19530                    # milvus server port, must be in range [1025, 65534]
      deploy_mode: cluster_readonly  # deployment type: single, cluster_readonly, cluster_writable
      time_zone: UTC+8               # time zone, must be in format: UTC+X

    db_config:
      primary_path: /var/lib/milvus  # path used to store data and meta
      secondary_path:                # path used to store data only, split by semicolon
      backend_url: mysql://root:<password>@<mysql_host>:3306/milvus  # URI format: dialect://username:password@host:port/database
deploy_mode determines whether a Milvus instance is read-only or writable. In the standalone version, this parameter is set to single. When using Mishards, each Milvus instance is set to either cluster_writable or cluster_readonly.
- cluster_writable: the Milvus instance is writable
- cluster_readonly: the Milvus instance is read-only
backend_url must point to the IP address and port of the server where MySQL is installed, in the format shown above. For other settings, refer to the Milvus standalone configuration.
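The rule that exactly one instance is writable is easy to break when several configuration files are edited by hand. The following is a minimal sketch of a pre-flight check, assuming you keep your own mapping of server address to deploy_mode. The function name `find_writable_host` and the `plan` mapping are hypothetical and are not part of Milvus or Mishards; the role assignment in `plan` follows the WOSERVER value used in this article's example.

```python
# Deploy modes that are valid for an instance managed by Mishards.
CLUSTER_MODES = {"cluster_writable", "cluster_readonly"}


def find_writable_host(deploy_modes):
    """Return the single writable host, or raise ValueError.

    deploy_modes is a hypothetical mapping: server address -> deploy_mode,
    copied by hand from each server's server_config.yml.
    """
    for host, mode in deploy_modes.items():
        if mode not in CLUSTER_MODES:
            raise ValueError(f"{host}: {mode!r} is not a cluster deploy_mode")
    writable = [h for h, m in deploy_modes.items() if m == "cluster_writable"]
    if len(writable) != 1:
        raise ValueError(
            f"expected exactly 1 writable instance, found {len(writable)}"
        )
    return writable[0]


# Assumed role assignment for the two servers in this article.
plan = {
    "192.168.1.85": "cluster_writable",
    "192.168.1.38": "cluster_readonly",
}
print(find_writable_host(plan))
```

The check rejects a plan that still contains `single`, because a standalone instance is not meant to be part of a Mishards cluster.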
3. Start Mishards
Mishards only needs to be started on any one server. This example starts Mishards on server A.
Modify the corresponding parameters in the cluster_mishards.yml file:

    version: "2.3"
    services:
      mishards:
        restart: always
        image: milvusdb/mishards
        ports:
          - "0.0.0.0:19531:19531"
          - "0.0.0.0:19532:19532"
        # volumes:
        #   - /tmp/milvus/db:/tmp/milvus/db
        #   - /tmp/mishards_env:/source/mishards/.env
        command: ["python", "mishards/main.py"]
        environment:
          FROM_EXAMPLE: 'true'
          SQLALCHEMY_DATABASE_URI: mysql+pymysql://root:<password>@<mysql_host>:3306/milvus?charset=utf8mb4
          DEBUG: 'true'
          SERVER_PORT: 19531
          WOSERVER: tcp://192.168.1.85:19530
          DISCOVERY_PLUGIN_PATH: static
          DISCOVERY_STATIC_HOSTS: 192.168.1.85,192.168.1.38
          DISCOVERY_STATIC_PORT: 19530
Parameter description:
- SERVER_PORT: defines the service port for Mishards.
- WOSERVER: defines the address of the writable Milvus instance. Currently only static settings are supported. Reference format: tcp://127.0.0.1:19530.
- DISCOVERY_PLUGIN_PATH: user-defined search path for service discovery plug-ins. By default, the system search path is used.
- DISCOVERY_STATIC_HOSTS: list of service addresses, separated by commas, for example 192.168.1.188,192.168.1.190.
- DISCOVERY_STATIC_PORT: the port on which the service addresses listen.
Parameters to change for your own cluster:
- SQLALCHEMY_DATABASE_URI: change it to the address of the MySQL server.
- WOSERVER: change it to the IP address of the writable Milvus instance.
- DISCOVERY_STATIC_HOSTS: list all IP addresses in the cluster.
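As an illustration of how these values relate to each other, the sketch below derives the discovery-related entries from a list of hosts. The helper name `mishards_discovery_env` is hypothetical. The sketch assumes that every Milvus instance listens on the same port and that the writable instance also appears in the host list, as in the example above; neither assumption is stated as a Mishards requirement.

```python
def mishards_discovery_env(writable_host, hosts, port=19530):
    """Build the static-discovery values for the compose `environment` section.

    Assumption taken from the example in this article: the writable instance
    is also listed in DISCOVERY_STATIC_HOSTS.
    """
    if writable_host not in hosts:
        raise ValueError("the writable instance should be one of the cluster hosts")
    return {
        "WOSERVER": f"tcp://{writable_host}:{port}",
        "DISCOVERY_PLUGIN_PATH": "static",
        "DISCOVERY_STATIC_HOSTS": ",".join(hosts),
        "DISCOVERY_STATIC_PORT": str(port),
    }


env = mishards_discovery_env("192.168.1.85", ["192.168.1.85", "192.168.1.38"])
for key, value in env.items():
    print(f"{key}: {value}")
```

Generating the values from one host list avoids the most likely mistake when adding a server, which is updating DISCOVERY_STATIC_HOSTS in one place but not the other settings.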
Test
Data preparation
In this test, we extracted 100 million vectors from the original data set, about 13 GB in size.
Once Mishards is set up and running, the cluster is used in the same way as standalone Milvus. The client connects to the cluster through the IP address of the Mishards server and the Mishards service port:

    >>> from milvus import Milvus
    >>> milvus = Milvus()
    >>> milvus.connect(host='192.168.1.38', port='19531')
Test steps:
Create the table:

    $ python3 milvus_toolkit.py --table <table_name> --dim <dim_num> -c

In milvus_load.py, modify the path to the files to import. After the modification, run the following command to import data:

    $ python3 milvus_load.py --table <table_name> -b

Build the index:

    $ python3 milvus_toolkit.py --table <table_name> --index <sq8 or sq8h or flat or ivf> --build

Run the query performance test with -s. The nprobe value specifies the number of buckets to search when querying:

    $ python3 milvus_toolkit.py --table <table_name> --nprobe <np_num> -s
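To make the role of nprobe concrete, here is a toy inverted-file (IVF) search in plain Python. It only illustrates the general idea of probing a limited number of buckets and is not how Milvus implements its indexes. The `centroids` and `buckets` data and the function `ivf_search` are made up for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical index: three bucket centroids and the (id, vector) pairs
# that were assigned to each bucket when the index was built.
centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
buckets = {
    0: [(1, [1.0, 1.0]), (2, [4.0, 0.0])],
    1: [(3, [6.0, 0.0]), (4, [10.0, 1.0])],
    2: [(5, [0.0, 9.0])],
}


def ivf_search(query, top_k, nprobe):
    """Search only the nprobe buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: sq_dist(centroids[i], query))
    candidates = [item for i in order[:nprobe] for item in buckets[i]]
    candidates.sort(key=lambda item: sq_dist(item[1], query))
    return [vec_id for vec_id, _ in candidates[:top_k]]


query = [4.9, 0.0]
print(ivf_search(query, top_k=2, nprobe=1))  # searches one bucket only
print(ivf_search(query, top_k=2, nprobe=2))  # searches two buckets
```

With nprobe=1 the search misses vector 3, which sits in a neighbouring bucket even though it is the second-closest vector overall. Raising nprobe scans more buckets, which improves accuracy at the cost of query time.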
Operation of the cluster
According to the run log, both 192.168.1.85 and 192.168.1.38 take part in the query.
As shown in the following two figures, all CPUs on 192.168.1.85 and 192.168.1.38 are working. The RES column of the process table (the row headed PID USER) shows that memory usage is 10.7G on 192.168.1.85 and 9344M on 192.168.1.38, for a total of 19.825G.
When the same data set was processed with standalone Milvus, the memory footprint was 15.9G. With Mishards, the footprint is nearly 4G higher. The cluster runs multiple Milvus instances and each one consumes memory of its own, so the total grows with the number of instances. Mishards itself consumes virtually no memory.
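The totals quoted above can be reproduced with a few lines of arithmetic. This sketch assumes that the M and G suffixes in the RES column are binary units (1G = 1024M), which is consistent with the 19.825G total reported in the article.

```python
M_PER_G = 1024  # assumption: RES column uses binary units

mem_server_85_g = 10.7            # RES on 192.168.1.85, already in G
mem_server_38_g = 9344 / M_PER_G  # RES on 192.168.1.38, converted from M to G

cluster_total_g = mem_server_85_g + mem_server_38_g
standalone_g = 15.9               # footprint measured with standalone Milvus
overhead_g = cluster_total_g - standalone_g

print(f"192.168.1.38: {mem_server_38_g:.3f}G")
print(f"cluster total: {cluster_total_g:.3f}G")
print(f"extra memory compared with standalone: {overhead_g:.3f}G")
```

The difference comes to a little under 4G, matching the "nearly 4G" figure in the text.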
This article used Mishards to build a Milvus cluster, then tested the cluster and analyzed its operation with a data set of 100 million vectors. When you need to process a very large number of feature vectors, the Mishards-based Milvus distributed cluster solution is an option worth considering. Mishards will continue to be updated in future releases, and we welcome your feedback and code contributions to help explore better clustering solutions for your scenarios.
github.com/milvus-io/milvus
milvus.io
milvusio.slack.com
© 2020 ZILLIZ ™