This article is a short, simple guide to setting up your first Elasticsearch development environment so you can quickly start exploring and taking advantage of what the technology has to offer. The introduction focuses on the most important APIs provided by Elasticsearch, which are the basis for loading data and running queries. A second goal is to provide links to documentation and other interesting resources covering operational aspects, other great features and various tools.

The target audience is data analysts, web developers and small teams with relevant data use cases who have already heard of Elasticsearch. This is not intended to provide a complete technical overview (clustering, nodes, sharding, replicas, Lucene, inverted indexes, etc.), nor to dive deeply into specific topics, as many excellent resources on those can be found in other articles on my blog.

 

How do we start?

In my opinion, the easiest way to start trying Elasticsearch (and many other software technologies) is through Docker, so the examples will take this approach. Containerization has been a trend for some time and typically carries over to the actual production deployment and operations, so it is two-in-one. Uber, for example, has shared how it runs various parts of its Elasticsearch clusters fully containerized at large scale. So let’s get started…

 

Install the Elasticsearch development environment

I assume you are already familiar with Docker (if not, please install Docker Desktop for Mac or Windows, or Docker for your Linux distribution, and read some introductory content at your leisure). Note: the examples have been tested on macOS Mojave; keep in mind that there may be some Docker-specific differences on Linux distributions or Windows.

 

Our first node

Once the Docker environment is ready, simply open the terminal and create a dedicated network for our Elasticsearch cluster using the following command:

docker network create elastic-network

This will create a basic “space” for our future containers.

Now, if you only want to do one node deployment, run:

docker run --rm --name esn01 -p 9200:9200 -v esdata01:/usr/share/elasticsearch/data --network elastic-network -e "node.name=esn01" -e "cluster.name=liuxg-docker-cluster" -e "cluster.initial_master_nodes=esn01" -e "bootstrap.memory_lock=true" --ulimit memlock=-1:-1 -e ES_JAVA_OPTS="-Xms512m -Xmx512m" docker.elastic.co/elasticsearch/elasticsearch:7.5.0

Here we have created an Elasticsearch cluster called liuxg-docker-cluster, even though there is currently only one node in it.

It is important to note that ES_JAVA_OPTS should be set to a small heap size when running on a laptop, otherwise the container may exit with an out-of-memory error. In my case, 512 MB of heap worked without problems.

From the container's startup log we can see that our cluster has started successfully and is running.

Hooray… we have started our first Elasticsearch node (esn01) running in a Docker container, listening on port 9200 (the Docker -p parameter). Thanks to cluster.initial_master_nodes it creates a new cluster (named via cluster.name) in a process called cluster bootstrapping and immediately elects itself as master (not much else to do when you are alone :) ). The last thing worth mentioning is the -v parameter, which creates a new Docker volume esdata01 and mounts it to the Elasticsearch data directory, so our working data is preserved across restarts. The remaining parameters are important system settings: we disable swapping with bootstrap.memory_lock together with the memlock ulimit, and we set the JVM heap size via ES_JAVA_OPTS – adjust it to your configuration. Keep in mind, however, that Elasticsearch also uses off-heap resources, so don't set the heap above 50% of the available memory. License note: by default the image comes with the Basic license (which, as Elastic states, should always stay free), but to use the pure open-source distribution simply add -oss at the end of the image name.

Two more nodes (optional)

Now, if you want to try a distributed setup, add two (or more) nodes to the cluster. This is not required for this tutorial (we are running on a local machine), nor for the basic operations that follow. However, since distributed deployment is actually one of the core value-adds of Elasticsearch (the indexing part is largely handled by the underlying Apache Lucene library), you should at least be aware of it. So, if we decide to give it a try, we can start two more nodes (each in a separate terminal) as follows:

docker run --rm --name esn02 -p 9202:9200 -v esdata02:/usr/share/elasticsearch/data --network elastic-network -e "node.name=esn02" -e "cluster.name=liuxg-docker-cluster" -e "discovery.seed_hosts=esn01" -e "bootstrap.memory_lock=true" --ulimit memlock=-1:-1 -e ES_JAVA_OPTS="-Xms512m -Xmx512m" docker.elastic.co/elasticsearch/elasticsearch:7.5.0
docker run --rm --name esn03 -p 9203:9200 -v esdata03:/usr/share/elasticsearch/data --network elastic-network -e "node.name=esn03" -e "cluster.name=liuxg-docker-cluster" -e "discovery.seed_hosts=esn01,esn02" -e "bootstrap.memory_lock=true" --ulimit memlock=-1:-1 -e ES_JAVA_OPTS="-Xms512m -Xmx512m" docker.elastic.co/elasticsearch/elasticsearch:7.5.0

In my case, due to memory limitations, I only started two nodes: esn01 and esn02.

We can open localhost:9200 in the browser to see the basic node and cluster information.

Obviously both of our nodes are up.

We can also use the cat nodes API to view the nodes in our cluster:

curl localhost:9200/_cat/nodes?v

The node marked with an * is the elected master node. At this point we have successfully created our liuxg-docker-cluster.

Now it gets more interesting as new nodes join our cluster. You can observe this process in the logs: the log line below, taken from our master node, reports the newly added node esn02. New nodes join by contacting the "existing" nodes listed in the discovery.seed_hosts parameter – this parameter (along with the cluster.initial_master_nodes parameter above) belongs to the important discovery settings. All communication between cluster nodes happens directly at the transport layer (i.e., without HTTP overhead), which is why port 9300 shows up here, as opposed to the 920x ports that we expose for communicating with the cluster from the "outside".

{" type ":" server ", "TIMESTAMP" : "2019-08-18T13:40:28, 169+0000", "Level" : "INFO", "Component" : "O.E.C.S.C lusterApplierService", "cluster. The name" : "stanislavs docker - cluster", "node. The name" : "esn01", "cluster. Uuid" : "IlHqkY4UQuSRYnE5hFbhBg", "node id" : "KKnVMTDwQR6tinaGW_8kKg", "message" : "Added Xowc - _fQDO_lwTlpPks2A {{esn02} {4} {FW - 6 ygvsssmget3mo_gy_q} {172.18.0.3} {9300} 172.18.0.3: {dim} {ml machine_memory = 8360488960, Ml.max_open_jobs =20, xpack.installed=true},}, term: 1, version: 16, reason: Publication{term=1, version=16} "}Copy the code

We can check the health of our cluster by using the following command:

curl localhost:9202/_cluster/health?pretty

The returned status is green, which means that even if one of our nodes fails for some reason, no data will be lost.

Each node plays a different role in Elasticsearch’s cluster, sometimes even multiple roles for a node.

If you want to learn more about these nodes, please refer to my previous article “Some important concepts in Elasticsearch: Cluster, Node, Index, Document, Shards and Replica”.

 

Load the data into Elasticsearch

In this section, we will use CSV as an example to show how to load data into Elasticsearch.

With the initial environment set up, it's time to move our data into the running nodes so it can be searched. The basic principle of Elasticsearch is that any "unit" of data we put into it is called a document, and the required format for these documents is JSON. These documents are not stored individually but are grouped into collections called indexes. Documents in an index share similar characteristics (i.e., data type mappings, settings, and so on), so the index also provides a management and API layer for the documents it contains.

Create Index

Let's use the Create Index API to create an index, and then index a document into it with the Index API…

curl -X PUT \
  http://localhost:9200/test-csv \
  -H 'Content-Type: application/json' \
  -d '{
    "settings" : {
        "index" : {
            "number_of_shards" : 3, 
            "number_of_replicas" : 2 
        }
    }
}'

We created our first index, called test-csv, and defined the important parameters for the number of shards and replicas. Primary shards and replica shards are the way an index (that is, a collection of documents) is split into pieces and kept redundant. The right values depend on how many nodes we have. The number of shards (that is, into how many pieces the index is split) is fixed at index creation and cannot be changed later. The number of replicas determines how many times each primary shard (we have three of them) is copied. With this setting, each node holds one primary shard (1/3 of our data) as well as replicas of the other nodes' data. This ensures that we do not lose data if a node goes down. Tip: you don't actually have to bother with these settings to get started – if you index a document into an index that does not yet exist, it is created automatically (with defaults of 1 shard and 1 replica).

curl -X PUT http://localhost:9203/test-csv/_doc/1?pretty -H 'Content-Type: application/json' -d '{ "value": "first document" }'
curl -X PUT http://localhost:9200/test-csv/_doc/1?pretty -H 'Content-Type: application/json' -d '{ "value": "first document" }'

We have indexed our first document into the test-csv index, and all shards responded successfully. We indexed a very simple JSON document with only one field, but as long as it is valid JSON you can index documents of essentially any size and depth – though try to keep them well below 100 MB per document :). You can now easily retrieve individual documents by their IDs using the Get API (GET test-csv/_doc/1). I'll leave that to you.

Use Python to load CSV data

You could now use curl to index the documents one by one, but more often the data already lives somewhere in your application, a database or a storage drive, and you want to load it from there. For this purpose there are official Elasticsearch clients for various programming languages such as Java, JS, Go, Python and more. You can use any of them, depending on your technology stack, and perform index/delete/update/… operations programmatically through their APIs. However, if you just have the data and want to get it in quickly, there are simpler options… I chose the Python client and a very simple CLI script that indexes data from a given file – CSV and NDJSON (newline-delimited JSON, commonly used for logging, streaming, etc.) formats are supported. You can also browse the client's many other methods and helpers that mirror the Elasticsearch API. Before using the script, don't forget to install the Elasticsearch Python package:

pip install elasticsearch
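As a quick sanity check that the client is installed and can reach the cluster, a minimal sketch like the following (assuming the single-node setup above, reachable on localhost:9200) prints the basic node information and the cluster health status:

from elasticsearch import Elasticsearch

# Connect to the node we exposed on port 9200 earlier
es = Elasticsearch(["http://localhost:9200"])

# Roughly the same JSON you see when opening localhost:9200 in the browser
print(es.info())

# Equivalent to GET /_cluster/health
print(es.cluster.health()["status"])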

The code/script is freely available in this repository:

git clone https://github.com/liu-xiao-guo/load_csv_or_json_to_elasticsearch

There are two important points in this code:

1) It leverages Python generators to iterate over the given file and yield a preprocessed element/line for indexing. This element-by-element approach was chosen for memory efficiency, so that even large files can be handled.

def _csv_generator(self, es_dataset, **kwargs):
    # Read the CSV file row by row; DictReader maps each row to a dict keyed by the header names
    with open(es_dataset.input_file) as csv_file:
        csv_dict_reader = csv.DictReader(csv_file)
        for cnt, row in enumerate(csv_dict_reader):
            # Yield one preprocessed bulk action per row instead of loading the whole file into memory
            yield self._prepare_document_for_bulk(es_dataset, row, cnt)

2) The actual indexing is done with the Bulk API, which indexes documents in chunks of multiple documents (500 items by default), in line with the official indexing-speed tuning recommendations. For this it uses one of the Python client's helpers (convenience abstractions on top of the raw API): elasticsearch.helpers.streaming_bulk.

# streaming_bulk lazily consumes the generator and sends bulk requests of chunk_size documents,
# yielding an (ok, result) tuple for every indexed document
for cnt, response in enumerate(streaming_bulk(self.client, generator, chunk_size)):
    ok, result = response
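To make the whole flow concrete, here is a condensed, self-contained sketch of the same idea – a CSV generator feeding streaming_bulk – written independently of the repository script (the file name, index name and chunk size are just illustrative placeholders):

import csv
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch(["http://localhost:9200"])

def csv_actions(file_name, index_name):
    # Yield one bulk action per CSV row; _source carries the row as the JSON document
    with open(file_name) as csv_file:
        for cnt, row in enumerate(csv.DictReader(csv_file)):
            yield {"_index": index_name, "_id": cnt, "_source": row}

# streaming_bulk lazily pulls actions from the generator and sends them in chunks of 500,
# yielding an (ok, result) tuple for every document
for ok, result in streaming_bulk(es, csv_actions("data.csv", "test-csv"), chunk_size=500):
    if not ok:
        print("Failed to index:", result)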

After the bulk run finishes, the script calls the Refresh API so that the documents become immediately searchable.

self.client.indices.refresh(index=es_dataset.es_index_name)

From a data point of view, there are plenty of options for getting interesting datasets (Kaggle datasets, AWS datasets, or various curated lists of other public datasets), but since we also want to show some full-text capabilities, something from the NLP area is a good fit – a nice collection can be found here. The Jeopardy! questions dataset, for instance, is not a typical demo dataset. Note: if you use such a dataset, consider also removing spaces from the CSV headers so that the column/field names are cleaner.

Run

Now we can run the script from the terminal and watch the stdout log. For testing convenience, we use the following dataset:

data.cityofchicago.org/api/views/x…

This is a salary sheet for Chicago city employees.

$ ls
Current_Employee_Names__Salaries__and_Position_Titles.csv
README.md
load_csv_or_json_to_elasticsearch.py

To run our Python application:

$ python3 load_csv_or_json_to_elasticsearch.py Current_Employee_Names__Salaries__and_Position_Titles.csv test-csv

We type the following command in the browser’s address bar:

http://localhost:9200/_cat/indices?v

We can see that our test-csv index has 33,586 documents. We can even query it with the following URL:

http://localhost:9200/test-csv/_search?q=Department:POLICE&pretty
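The same query can also be run through the Python client we used for loading; a small sketch (index and field names are the ones used above):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Equivalent to GET /test-csv/_count
print(es.count(index="test-csv")["count"])

# Lucene query-string search, equivalent to /test-csv/_search?q=Department:POLICE
result = es.search(index="test-csv", q="Department:POLICE", size=5)
print(result["hits"]["total"]["value"], "matching documents")
for hit in result["hits"]["hits"]:
    print(hit["_source"])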

Now that we know how to import one of our CSV files into Elasticsearch using Python, what else do we need to know?

Other technical points to know:

  • Index mappings/settings: check the structure of an index with the Get Mapping API and its configuration with the Get Settings API – there are quite a few settings options (see the short Python sketch after this list).
  • Index templates: for use cases where similar indexes are created over and over (e.g. time-based indexes), it is convenient to predefine their structure – that is what index templates are for.
  • Index changes: you can move data between indexes with the Reindex API (useful when you need to change the number of shards, which cannot be changed after creation), ideally combined with index aliases, which let you create "nicknames" that your queries point to (this abstraction layer lets you swap the index behind an alias more freely).
  • Analyzers: read more about analyzers and their core building blocks – character filters, tokenizers and token filters – and about creating custom analyzers, which come into play both when indexing and when running full-text queries.
  • Clients: there are official Elasticsearch clients for various programming languages (e.g. Java, JS, Go, Python, etc.).
  • Data ingestion: for time-based data in particular (logs, monitoring, etc.), consider other tools from the Elastic Stack (Beats and/or Logstash) for more complex preprocessing before the data is indexed into Elasticsearch.
  • Memory/storage: if disk space for index data runs low, look at index lifecycle management, which changes or drops indexes based on defined rules. If memory runs low, look at frozen indices, which move transient data structures out of the heap. Or just buy more disk space and memory :)
  • Performance: see the tuning guides for indexing speed and efficient use of disks.
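As a small taste of the mapping/settings and analyzer points above, here is a sketch using the Python client against our test-csv index (the sample text is arbitrary):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Get Mapping API: the field names and data types Elasticsearch inferred for our CSV columns
print(es.indices.get_mapping(index="test-csv"))

# Get Settings API: shards, replicas and other index-level settings
print(es.indices.get_settings(index="test-csv"))

# Analyze API: see how an analyzer splits text into tokens before indexing
tokens = es.indices.analyze(body={"analyzer": "standard", "text": "Police officers of Chicago"})
print([t["token"] for t in tokens["tokens"]])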

Enjoy the convenience brought by Kibana

So far we have been interacting with our Elasticsearch cluster directly – with curl commands for the cluster setup and with the Python client for loading data. Curl is great for quick tests, but when you want to explore your data, put mappings, filters and queries together, or when you need tooling for administrative checks and updates on clusters and indexes, you may prefer a more convenient UI-based application. That is exactly what Kibana is – Elastic's UI on top of Elasticsearch. Without further ado, let's add it to our setup and see what it can do:

docker run --rm --link esn01:elasticsearch --name kibana --network elastic-network -p 5601:5601 docker.elastic.co/kibana/kibana:7.5.0

Once the Kibana container is running, we can type the following address directly into the browser:

localhost:5601

We can see that Kibana has launched successfully, and we can now use it to analyze our data. For more on Kibana, see my articles.

 
