We’re excited to release the beta of the Elastic App Search web crawler with Elastic Enterprise Search 7.11. It is a simple but powerful way to ingest publicly accessible web content so that it is instantly searchable on your website.
Making content searchable on a website can take several forms. Elastic App Search already allows users to ingest content via JSON uploads, JSON pastes, and API endpoints. In this release, the beta web crawler gives users another convenient way to ingest content.
The web crawler can be used with both self-managed and Elastic Cloud deployments to retrieve information from publicly accessible websites and make that content searchable in your App Search engines. App Search does a lot of heavy lifting in the background on your behalf to make your searchable content relevant and easy to tune with sliders, without writing code.
Now, let’s dive into why the web crawler was introduced to App Search.
What makes this web crawler different?
Short answer: Elastic Cloud.
If you’ve been following Elastic Enterprise Search for years (we love our fan club), you’ll remember that Elastic Site Search has long offered (and still offers) a web crawler. However, only Elastic App Search and Workplace Search are available on the very popular Elastic Cloud.
You may ask, “Why does that matter?”
Well, bringing a completely redesigned and re-architected web crawler to App Search on Elastic Cloud has the following compelling advantages:
- Enjoy peace of mind: As a hosted service for Elasticsearch and Kibana, Elastic Cloud provides the speed, scale, and relevance that define Elastic. One-click upgrades, easy scaling, and index lifecycle management (ILM) are just some of the reasons customers are flocking to Elastic Cloud. Moreover, if you are already an Elastic Observability or Elastic Security customer, you can manage your entire deployment from one powerful console.
- Your data, your choice: Elastic Cloud is available in more than 40 regions worldwide across the top cloud providers: Google Cloud (GCP), Microsoft Azure, and Amazon Web Services (AWS). Your data, your cloud, your way.
- Pricing: With Elastic’s resource-based pricing, you don’t have to worry about arbitrary metrics like number of users, number of queries, document size, or deployed agents. Instead, your costs come down to the hardware resources needed to store, search, and analyze your data.
Although we focus on cloud deployment in this blog, it’s important to note that the App Search web crawler is also available for self-managed deployments. That option is not available with Elastic Site Search (or Swiftype), which cannot be self-hosted.
What exactly does the web crawler crawl?
Before we delve into how to set up the web crawler, let’s first review what it does on the public website you point it at.
When you provide a web address, such as www.elastic.co, the web crawler visits that page. From there, it follows every new link it finds on the page and extracts the content into your App Search engine. This is content discovery. Each link found is crawled in the same manner. The “tree” diagram below outlines how it works.
In the figure above, all blue pages can be crawled and indexed. However, no page links to the pink pages, so they can’t be crawled or indexed. For the web crawler to reach an unlinked page, that page must be provided directly as an entry point or included in the sitemap. We’ll explain how to set up entry points later in this post.
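To make content discovery more concrete, here is a minimal sketch of the idea in Python. It is not the crawler's actual implementation: the site is modeled as a hypothetical dictionary of pages and the links they contain, and the function name `discover` is made up for illustration. Starting from the entry points, it follows links breadth-first and never reaches a page that nothing links to.

```python
from collections import deque


def discover(link_graph, entry_points):
    """Return the pages reachable from the entry points, in breadth-first order."""
    start = list(dict.fromkeys(entry_points))  # drop duplicate entry points, keep order
    seen = set(start)
    queue = deque(start)
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        # Follow every link on this page that has not been seen yet.
        for linked in link_graph.get(page, []):
            if linked not in seen:
                seen.add(linked)
                queue.append(linked)
    return order


# Hypothetical site: "/orphan" exists, but no other page links to it.
site = {
    "/": ["/products", "/blog"],
    "/products": ["/products/app-search"],
    "/blog": ["/", "/blog/crawler-beta"],
    "/orphan": ["/orphan/child"],
}

print(discover(site, ["/"]))             # "/orphan" is never discovered
print(discover(site, ["/", "/orphan"]))  # adding it as an entry point fixes that
```

In this toy model, adding the unlinked page as a second entry point is what makes it, and everything it links to, discoverable.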
The type of content extracted
For the beta version of the web crawler, the following content is extracted from each HTML page:
- Page title
- Meta description
- Meta keywords
- Body (normalized, without HTML tags)
- Canonical URL
- Additional URLs (for the same document)
- Links
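As a rough illustration of what extracting these fields from a page involves, the sketch below uses Python's standard `html.parser` module. It is only an approximation for explanatory purposes, not how the App Search crawler works internally; the class name `PageFieldExtractor` and the dictionary keys are assumptions chosen for this example, and canonical and additional URLs are left out for brevity.

```python
from html.parser import HTMLParser


class PageFieldExtractor(HTMLParser):
    """Collects a few crawler-style fields from one HTML page (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "meta_description": "", "meta_keywords": "",
                       "body_content": "", "links": []}
        self._body_parts = []
        self._in_title = False
        self._in_body = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and attrs.get("name") in ("description", "keywords"):
            self.fields["meta_" + attrs["name"]] = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self.fields["links"].append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] += data
        elif self._in_body and not self._skip_depth:
            self._body_parts.append(data)

    def extract(self, html):
        self.feed(html)
        self.close()
        # Normalize the body: tags are already gone, so just collapse whitespace.
        self.fields["body_content"] = " ".join(" ".join(self._body_parts).split())
        return self.fields


page = """<html><head><title>Crawler beta</title>
<meta name="description" content="A short summary">
<meta name="keywords" content="search, crawler"></head>
<body><h1>Hello</h1><p>Some <b>body</b> text.</p>
<a href="/docs">Docs</a><script>var x = 1;</script></body></html>"""

print(PageFieldExtractor().extract(page))
```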
Hands-on: Getting started with web crawlers
Let’s start from scratch and create a new Elastic Enterprise Search deployment on Elastic Cloud. If you are an existing Elastic Site Search or Swiftype customer, or are new to Elastic Cloud, be sure to sign up for a free 14-day trial to try out the web crawler.
- On www.elastic.co, select Login from the upper right corner.
- Log in with one of the available single sign-on (SSO) methods, or create a new account.
- Once logged in, select “Create Deployment”.
- Select the Elastic Enterprise Search deployment template. The template is optimized for CPU, storage, and availability zones. After you create a deployment, you can customize any deployment template to your specific needs.
- Select your cloud provider from the list. The choice is yours: Google Cloud (GCP), Microsoft Azure, or Amazon Web Services (AWS).
- Name your deployment, then click “Create Deployment”.
- You will see a notification screen indicating that your deployment has been created.
Congratulations! You are on your way to building your first App Search engine.
The Elastic Enterprise Search solution consists of two applications: App Search and Workplace Search. For this tutorial, select the “Launch App Search” button.
Well done! You are now in App Search and can start setting up the web crawler.
The getting started process helps you create your first search engine. Just name your engine (use something like “my-elastic-search-engine”) and you’ll see a screen that offers four ways to ingest data: paste JSON, upload a JSON file, index by API, or use the web crawler. Now you know which one to choose.
At this point, you can add your own website or use elastic.co as the URL of the domain you want to crawl. Keep in mind that the web crawler visits the page at the URL you provide and extracts its content. From there, it follows every new link it finds until there are no more new links to follow.
This is where the entry points feature comes in handy. If you have an isolated page that no other page links to, simply add its full URL as an entry point. From there, the web crawler will index the content and keep following new links to extract content until there are none left to follow.
On the same console page, you can create crawl rules. These rules allow administrators to include or exclude pages whose URLs match the rules. For example, maybe your marketing department uses targeted landing pages under the path /lp for advertising campaigns. These pages are great for driving new business with targeted content, but may not be the type of content you want to include in your search engine.
In the Crawl Rules section, add a new rule that disallows indexing of any URL path that contains /lp.
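The landing-page rule above can be sketched in a few lines of Python. This is a simplified model rather than the crawler's real rule engine: the tuple layout, the `is_allowed` helper, the match-type names, and the assumptions that rules are checked in order with the first match winning and that unmatched URLs are allowed are all illustrative choices for this example.

```python
import re
from urllib.parse import urlparse

# Hypothetical match types, each comparing a URL path against a rule pattern.
MATCHERS = {
    "begins": lambda path, pattern: path.startswith(pattern),
    "ends": lambda path, pattern: path.endswith(pattern),
    "contains": lambda path, pattern: pattern in path,
    "regex": lambda path, pattern: re.search(pattern, path) is not None,
}


def is_allowed(url, rules):
    """Apply (policy, match_type, pattern) rules in order; the first match decides."""
    path = urlparse(url).path or "/"
    for policy, match_type, pattern in rules:
        if MATCHERS[match_type](path, pattern):
            return policy == "allow"
    return True  # assumption: a URL that matches no rule is crawled


# Exclude the marketing landing pages from the example above.
rules = [("disallow", "contains", "/lp")]

print(is_allowed("https://www.example.com/lp/spring-campaign", rules))
print(is_allowed("https://www.example.com/blog/crawler-beta", rules))
```

Note that a "contains" match is broad: it also excludes any other path that happens to include the same characters, so a "begins" rule can be the safer choice when the landing pages share a common prefix.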
The suspense is over! Now it’s time to crawl the web. After setting up all of your entry points and crawl rules, select the “Start Crawl” button.
Click the “Documents” tab and watch your content being ingested into the App Search engine. Or click the Query Tester icon at the top right of the screen to search the engine from anywhere in the App Search UI.
To test the results in a search box right away, select the Reference UI tab. Here, you can use the existing React-based search box. Or better yet, use the Elastic Search UI JavaScript library to build and customize your own search experience.
Now it’s your turn
We think you’ll like the powerful yet simple design of the web crawler. So now it’s your turn to try it!
The Elastic App Search web crawler is currently in beta and is available at all subscription levels, for both self-managed and Elastic Cloud deployments. Existing Elastic Cloud customers can access Enterprise Search directly from the Elastic Cloud console.
New to Elastic Cloud? Check out our Quick Start guide (a short training video to help you get started) and begin a free 14-day trial of Elastic Enterprise Search. Or download the self-managed versions of App Search or Workplace Search for free.
To read more, see Enterprise: Elastic App Search – Web Crawler