Elasticsearch is usually used to index strings, numbers, dates, and similar data. But what if you want to index files such as .pdf or .doc directly and make them searchable? Real-time use cases like this come up in HCM, ERP, and e-commerce applications.
In today's article we will look at how to implement search for .pdf or .doc files. This solution is available in Elasticsearch 5.0 and later.
How it works
Getting a .pdf file into Elasticsearch's data nodes works as follows. We first Base64-encode the .pdf file and then send it to an ingest node in Elasticsearch for processing. The Ingest Attachment Plugin allows Elasticsearch to extract the content of file attachments in common formats such as PPT, XLS and PDF. Finally, the extracted data is written to Elasticsearch's data nodes, where we can search it.
In the following sections, we step through how to do this.
Import the PDF file to Elasticsearch
Preparing PDF files
We can use Word or other editing software to produce a PDF file. Let's call this file sample.pdf. Its content is simple: it contains a single sentence, "I like this Useful tool".
Install the Ingest Attachment Plugin
The Ingest Attachment Plugin lets Elasticsearch extract the content of file attachments in common formats such as PPT, XLS and PDF by using Tika, the Apache text extraction library. The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types. All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and more.
The source field must be Base64-encoded binary. If you want to avoid the overhead of converting to and from Base64, you can use the CBOR format instead of JSON and specify the field as a byte array instead of a string representation. The processor then skips Base64 decoding.
You can install the plugin with the plugin manager:
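As a minimal sketch of what the Base64 requirement means in practice, the following shell snippet encodes a small stand-in file, wraps the result in a JSON document under a field named file, and decodes it again to confirm that nothing was lost. The file name demo.bin and its content are assumptions made for illustration (it is not a real PDF), and the snippet assumes the base64 and tr utilities are available.

```shell
#!/bin/bash
# Stand-in for a real PDF; the file name and content are illustrative only.
printf 'I like this Useful tool' > demo.bin

# GNU base64 wraps its output at 76 columns, so strip the newlines to get
# a single token that can sit inside a JSON string.
encoded=$(base64 < demo.bin | tr -d '\n')

# The Base64 alphabet (A-Z, a-z, 0-9, +, /, =) needs no JSON escaping.
payload="{\"file\":\"${encoded}\"}"
echo "$payload"

# Decoding restores the original bytes.
decoded=$(printf '%s' "$encoded" | base64 --decode)
echo "$decoded"
```

The JSON document built here has the same shape as the one that is later sent to the attachment pipeline.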
sudo bin/elasticsearch-plugin install ingest-attachment
The plug-in must be installed on every node in the cluster, and each node must be restarted after installation.
After installation, we can check that the plugin was installed successfully with the following command:
./bin/elasticsearch-plugin list
If the installation succeeded, ingest-attachment appears in the output.
Create attachment pipeline
We can create a pipeline called pdfattachment on our ingest node:
PUT _ingest/pipeline/pdfattachment
{
  "description": "Extract attachment information encoded in Base64 with UTF-8 charset",
  "processors": [
    {
      "attachment": {
        "field": "file"
      }
    }
  ]
}
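Before indexing real documents, the pipeline can be tried out with the ingest simulate API, which runs documents through a pipeline and returns the result without indexing anything. The following is a minimal sketch and not part of the original walkthrough: the ES_URL variable, the helper function names, and the assumption of an unsecured cluster on localhost:9200 are all illustrative.

```shell
#!/bin/bash
# ES_URL and the helper names are illustrative; adjust them to your cluster.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build a _simulate request body carrying one Base64-encoded test document.
build_simulate_body() {
  local encoded
  encoded=$(base64 < "$1" | tr -d '\n')
  printf '{"docs":[{"_source":{"file":"%s"}}]}' "$encoded"
}

# Send the body to the pipeline's _simulate endpoint; nothing is indexed.
simulate_pipeline() {
  build_simulate_body "$1" |
    curl -s -XPOST "$ES_URL/_ingest/pipeline/pdfattachment/_simulate?pretty" \
      -H 'Content-Type: application/json' --data-binary @-
}

# Only contact the cluster when a file name is passed on the command line.
if [ "$#" -gt 0 ]; then
  simulate_pipeline "$1"
fi
```

Saved under a name such as simulate.sh (hypothetical), it could be run as ./simulate.sh sample.pdf. If the pipeline works, the response should contain an attachment object with the extracted text.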
Convert the PDF file and upload the content of the PDF file to Elasticsearch
The Ingest Attachment Plugin requires its input to be Base64-encoded. We could do the conversion on a website such as Base64 Encoder, but here we do it directly in a script:
indexPdf.sh
#!/bin/bash
# Base64-encode the PDF; strip line wraps so json.file holds valid JSON
encodedPdf=$(base64 < sample.pdf | tr -d '\n')
# Wrap the encoded content in a JSON document under the "file" field
json="{\"file\":\"${encodedPdf}\"}"
echo "$json" > json.file
# Index the document through the pdfattachment pipeline
curl -XPOST 'http://localhost:9200/pdf-test1/_doc?pipeline=pdfattachment&pretty' -H 'Content-Type: application/json' -d @json.file
In the script above, we Base64-encode sample.pdf and generate a file called json.file. We then upload the contents of json.file to Elasticsearch using the curl command. Afterwards, an index called pdf-test1 exists in Elasticsearch.
We can run the above script directly in Terminal:
./indexPdf.sh
At this point we have imported the PDF file into Elasticsearch.
View the index and search
To query our pdf-test1 index, run the following command:
GET pdf-test1/_search
The response shows that each document in our index has an attachment field whose content subfield holds the text extracted from our PDF file. This is the field we can search. The response also contains a large field named file, which holds the Base64-encoded content we uploaded. If we don't want to keep this field, we can remove it by adding a remove processor to the pipeline:
PUT _ingest/pipeline/pdfattachment
{
  "description": "Extract attachment information encoded in Base64 with UTF-8 charset",
  "processors": [
    {
      "attachment": {
        "field": "file"
      }
    },
    {
      "remove": {
        "field": "file"
      }
    }
  ]
}
This removes the field called file from documents indexed through the updated pipeline; documents that were indexed earlier are not changed.
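The GET pdf-test1/_search request shown earlier returns every document in the index. To search the extracted text itself, a match query can be sent against attachment.content, the field where the attachment processor stores the text by default. The script below is a sketch under assumptions: ES_URL, the helper function names, and an unsecured cluster on localhost:9200 are illustrative and not taken from the original article.

```shell
#!/bin/bash
# ES_URL and the helper names are illustrative; adjust them to your cluster.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build a match query against the extracted text. The term is inserted as-is,
# so this sketch only handles terms without quotes or backslashes.
build_match_query() {
  printf '{"query":{"match":{"attachment.content":"%s"}}}' "$1"
}

# Run the query against the pdf-test1 index.
search_pdf() {
  curl -s -XGET "$ES_URL/pdf-test1/_search?pretty" \
    -H 'Content-Type: application/json' -d "$(build_match_query "$1")"
}

# Only contact the cluster when a search term is passed on the command line.
if [ "$#" -gt 0 ]; then
  search_pdf "$1"
fi
```

Saved under a name such as searchPdf.sh (hypothetical), running ./searchPdf.sh useful should return the document created from sample.pdf, since its text contains the word "Useful" and the default analyzer lowercases terms.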