Implementing a custom Chinese full-text index in Neo4j

When optimizing database retrieval, the first thing to look at is the index; heavier measures such as load balancing, read-write separation, and horizontal or vertical partitioning of databases and tables come later, as demand requires. An index speeds up retrieval through redundancy: it trades space for time and slows down writes, so the choice of indexed fields matters.

Neo4j creates an index on nodes with a specified label, and the index is updated automatically whenever a matching node property is added or updated. Neo4j indexes are implemented with Lucene by default (custom providers are possible, such as the R-tree index implemented by Neo4j Spatial). A newly created index supports only exact matching (get) by default; fuzzy queries require a full-text index, which controls how Lucene segments text in the background.

Neo4j's default segmentation targets Western languages: exact queries use Lucene's KeywordAnalyzer (the whole value is treated as one keyword), and full-text queries use a whitespace tokenizer (the text is split on spaces). Splitting on spaces and handling letter case mean little for Chinese, so Chinese text needs a dedicated Chinese analyzer plugged in, such as IK Analyzer or ANSJ. Pullword, the deep-learning-based segmentation system by "Factory Director Liang", is more powerful still.
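To see why whitespace tokenization falls short for Chinese, consider the following minimal sketch. It does not use Lucene: a plain split on whitespace stands in for a whitespace tokenizer, and the class and method names are made up for illustration. The Chinese phrase is 苏州教育有限公司 ("Suzhou Education Co., Ltd."), written with Unicode escapes so that the source file stays ASCII-only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration only: String.split stands in for a whitespace tokenizer (no Lucene involved).
final class WhitespaceSplitDemo {

    // "Suzhou Education Co., Ltd." in Chinese, written as Unicode escapes.
    static final String CHINESE_NAME = "\u82cf\u5dde\u6559\u80b2\u6709\u9650\u516c\u53f8";
    // "company" in Chinese: the last two characters of CHINESE_NAME.
    static final String COMPANY = "\u516c\u53f8";

    // Split the text on runs of whitespace.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // English words are separated by spaces, so each word becomes a token.
        System.out.println(tokenize("Suzhou Education Company").size()); // 3
        // Chinese has no spaces between words: the whole phrase stays one token,
        // so a term lookup for "company" alone cannot match it.
        List<String> tokens = tokenize(CHINESE_NAME);
        System.out.println(tokens.size());            // 1
        System.out.println(tokens.contains(COMPANY)); // false
    }
}
```

A Chinese analyzer closes this gap by cutting the phrase into dictionary words before they are written to the index.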

Taking the widely used IK Analyzer as an example, this article shows how to create a full-text index on a property in Neo4j to support fuzzy queries.

IKAnalyzer word segmentation

IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java.

IKAnalyzer3.0 features

It uses its own "forward iteration, finest-grained segmentation" algorithm and supports both fine-grained segmentation and maximum-word-length segmentation, with a throughput of 830,000 characters per second (1600 KB/s).
It uses a multi-sub-processor analysis mode that handles English letters, digits, and Chinese words, and is compatible with Korean and Japanese characters.
Optimized dictionary storage keeps the memory footprint small, and user dictionary extensions are supported.
IKQueryParser, a query parser optimized for Lucene full-text retrieval (recommended by the author), introduces simple search expressions and uses an ambiguity analysis algorithm to optimize the arrangement and combination of query keywords, which can greatly improve Lucene's hit rate.

IK Analyzer is not yet available from a Maven repository, so it has to be downloaded and installed into the local repository manually. When I have time, I plan to set up a private Maven repository on GitHub and upload the toolkits that are missing from Maven Central.
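The difference between the maximum-word-length and fine-grained modes in the feature list can be illustrated with a toy dictionary segmenter. This is not IK's actual algorithm: the sketch substitutes plain forward maximum matching and an exhaustive dictionary lookup, and the class name, method names, and five-word dictionary are assumptions made for the example. The input is again 苏州教育有限公司, written with Unicode escapes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy dictionary segmenter; a simplified stand-in, not IK Analyzer's algorithm.
// It works on UTF-16 chars, which is enough for common Chinese characters.
final class ForwardMaxMatchDemo {

    static final String SUZHOU = "\u82cf\u5dde";     // "Suzhou"
    static final String EDUCATION = "\u6559\u80b2";  // "education"
    static final String LIMITED = "\u6709\u9650";    // "limited"
    static final String COMPANY = "\u516c\u53f8";    // "company"
    static final String LIMITED_COMPANY = LIMITED + COMPANY;

    // Maximum-word-length mode: at each position take the longest dictionary word,
    // falling back to a single character when nothing matches.
    static List<String> segmentLongest(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    // Fine-grained mode: emit every dictionary word that starts at any position,
    // so overlapping words are all kept.
    static List<String> segmentFineGrained(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            int limit = Math.min(text.length(), start + maxWordLength);
            for (int end = start + 1; end <= limit; end++) {
                String candidate = text.substring(start, end);
                if (dictionary.contains(candidate)) {
                    tokens.add(candidate);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(
                Arrays.asList(SUZHOU, EDUCATION, LIMITED, COMPANY, LIMITED_COMPANY));
        String text = SUZHOU + EDUCATION + LIMITED_COMPANY;
        System.out.println(segmentLongest(text, dictionary, 4).size());     // 3 tokens
        System.out.println(segmentFineGrained(text, dictionary, 4).size()); // 5 tokens
    }
}
```

The longest-match pass yields 苏州 / 教育 / 有限公司, while the fine-grained pass also keeps 有限 and 公司, which is why a fine-grained index can answer a query for 公司 alone.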

Customizing the IKAnalyzer user dictionary

Dictionary file

Custom dictionary files use the .dic suffix and must be saved as UTF-8 without a BOM.
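The encoding requirement can be checked programmatically. The sketch below assumes a dictionary with one word per line; the class and method names are hypothetical. It encodes a word list as UTF-8 (Java's encoder does not add a BOM), detects the three BOM bytes EF BB BF, and tolerates them when reading a file that some other editor saved with a BOM.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: writing and checking a .dic file, assuming one word per line.
final class DictionaryEncodingDemo {

    // True if the content starts with the UTF-8 byte order mark EF BB BF.
    static boolean hasUtf8Bom(byte[] bytes) {
        return bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
    }

    // Encode the words as UTF-8, one per line; no BOM is emitted.
    static byte[] encode(List<String> words) {
        return String.join("\n", words).getBytes(StandardCharsets.UTF_8);
    }

    // Decode dictionary content, skipping a leading BOM and blank lines.
    static List<String> decode(byte[] bytes) {
        int offset = hasUtf8Bom(bytes) ? 3 : 0;
        String content = new String(bytes, offset, bytes.length - offset, StandardCharsets.UTF_8);
        List<String> words = new ArrayList<>();
        for (String line : content.split("\\R")) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // "Suzhou" and "limited company" in Chinese, as Unicode escapes.
        List<String> words = Arrays.asList("\u82cf\u5dde", "\u6709\u9650\u516c\u53f8");
        Path file = Files.createTempFile("ext", ".dic");
        Files.write(file, encode(words));
        byte[] saved = Files.readAllBytes(file);
        System.out.println(hasUtf8Bom(saved));           // false
        System.out.println(decode(saved).equals(words)); // true
        Files.delete(file);
    }
}
```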

Dictionary configuration

A note on the paths of the dictionary and the IKAnalyzer.cfg.xml configuration file: IKAnalyzer.cfg.xml must be placed in the root of the src directory. The dictionary files can be placed anywhere, as long as their paths are configured in IKAnalyzer.cfg.xml. With the following configuration, ext.dic and stopword.dic should be in the same directory.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- extension dictionary -->
    <entry key="ext_dict">/ext.dic;</entry>
    <!-- extension stop-word dictionary -->
    <entry key="ext_stopwords">/stopword.dic</entry>
</properties>
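The configuration file follows the standard java.util.Properties XML format (note the properties.dtd DOCTYPE), so it can be inspected with nothing but the JDK. The sketch below is one way to sanity-check the two entries; the class name, the helper names, and the assumption that several paths may be separated by semicolons (suggested by the trailing ";" in ext_dict) are mine, not part of IK Analyzer's documented behaviour.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Sketch: reading the IK configuration as a java.util.Properties XML document.
final class IkConfigDemo {

    // Same entries as the configuration shown above.
    static final String SAMPLE_XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
          + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
          + "<properties>\n"
          + "  <comment>IK Analyzer extension configuration</comment>\n"
          + "  <entry key=\"ext_dict\">/ext.dic;</entry>\n"
          + "  <entry key=\"ext_stopwords\">/stopword.dic</entry>\n"
          + "</properties>\n";

    // Parse the XML text into a Properties object.
    static Properties parse(String xml) {
        Properties config = new Properties();
        try {
            config.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return config;
    }

    // Split an entry into individual paths, assuming ';' as the separator.
    static List<String> paths(Properties config, String key) {
        List<String> result = new ArrayList<>();
        for (String part : config.getProperty(key, "").split(";")) {
            String path = part.trim();
            if (!path.isEmpty()) {
                result.add(path);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Properties config = parse(SAMPLE_XML);
        System.out.println(paths(config, "ext_dict"));      // [/ext.dic]
        System.out.println(paths(config, "ext_stopwords")); // [/stopword.dic]
    }
}
```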

Neo4j full-text index construction

Specify IKAnalyzer as the Lucene analyzer and create a full-text index on the given property of all matching nodes.

@Override
public void createAddressNodeFullTextIndex() {
    try (Transaction tx = graphDBService.beginTx()) {
        IndexManager index = graphDBService.index();
        // Legacy Lucene index that uses IKAnalyzer for word segmentation
        Index<Node> addressNodeFullTextIndex = index.forNodes(
                "addressNodeFullTextIndex",
                MapUtil.stringMap(IndexManager.PROVIDER, "lucene", "analyzer", IKAnalyzer.class.getName()));

        // Add the "text" property of every AddressNode to the index
        ResourceIterator<Node> nodes = graphDBService.findNodes(DynamicLabel.label("AddressNode"));
        while (nodes.hasNext()) {
            Node node = nodes.next();
            Object text = node.getProperty("text", null);
            if (text != null) { // skip nodes that have no "text" property
                addressNodeFullTextIndex.add(node, "text", text);
            }
        }
        tx.success();
    }
}

Neo4j full-text index test

Both single-keyword queries (such as "limited company") and multi-keyword fuzzy queries (such as "Suzhou Education Company") work by default, and the results are sorted by relevance.

package uadb.tr.neodao.test;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;
import org.neo4j.graphdb.index.IndexHits;
import org.neo4j.graphdb.index.IndexManager;
import org.neo4j.helpers.collection.MapUtil;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;
import org.wltea.analyzer.lucene.IKAnalyzer;

import com.lt.uadb.tr.entity.adtree.AddressNode;
import com.lt.util.serialize.JsonUtil;

/**
 * AddressNodeNeoDaoTest
 *
 * @author geosmart
 */
@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations = { "classpath:app.neo4j.cfg.xml" })
public class AddressNodeNeoDaoTest {

    @Autowired
    GraphDatabaseService graphDBService;

    @Test
    public void test_selectAddressNodeByFullTextIndex() {
        try (Transaction tx = graphDBService.beginTx()) {
            IndexManager index = graphDBService.index();
            // Open the index with the same analyzer configuration used to create it
            Index<Node> addressNodeFullTextIndex = index.forNodes("addressNodeFullTextIndex",
                    MapUtil.stringMap(IndexManager.PROVIDER, "lucene", "analyzer", IKAnalyzer.class.getName()));
            // Multi-keyword query on the "text" field; hits are ordered by relevance
            IndexHits<Node> foundNodes = addressNodeFullTextIndex.query("text", "suzhou education company");
            for (Node node : foundNodes) {
                AddressNode entity = JsonUtil.ConvertMap2POJO(node.getAllProperties(), AddressNode.class, false, true);
                System.out.println(entity.getAllAddressFullName());
            }
            tx.success();
        }
    }
}
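The relevance ordering of multi-keyword queries can be pictured with a small in-memory inverted index. This is a deliberately simplified stand-in for Lucene: documents arrive already segmented into tokens, and the score is just the number of distinct query tokens a document contains, which is far cruder than Lucene's real scoring. All names are hypothetical, and English tokens stand in for segmented Chinese words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Simplified inverted index: token -> ids of the documents containing it.
final class RankedSearchDemo {

    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Register a document that has already been segmented into tokens.
    void add(int docId, List<String> tokens) {
        for (String token : tokens) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Return ids of documents matching at least one query token,
    // most matched tokens first; ties are broken by ascending id.
    List<Integer> search(List<String> queryTokens) {
        Map<Integer, Integer> scores = new HashMap<>();
        for (String token : new LinkedHashSet<>(queryTokens)) {
            for (int docId : postings.getOrDefault(token, Collections.<Integer>emptySet())) {
                scores.merge(docId, 1, Integer::sum);
            }
        }
        List<Integer> ids = new ArrayList<>(scores.keySet());
        ids.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return ids;
    }

    public static void main(String[] args) {
        RankedSearchDemo index = new RankedSearchDemo();
        index.add(1, Arrays.asList("suzhou", "education", "company"));
        index.add(2, Arrays.asList("suzhou", "trading", "company"));
        index.add(3, Arrays.asList("nanjing", "school"));
        // Document 1 matches three tokens, document 2 matches two, document 3 none.
        System.out.println(index.search(Arrays.asList("suzhou", "education", "company"))); // [1, 2]
    }
}
```

The write-side cost of indexing is visible here as well: every added document touches one posting list per token.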

Using the custom full-text index in Cypher

Regular query

PROFILE
MATCH (a:AddressNode{ruleAbbr:'TOW',text:'BELONGTO'})<-[r:BELONGTO]-(b:AddressNode{ruleAbbr:'STR'})
WHERE b.text =~ 'jinling.*'
RETURN a, b

Full-text Index Query

PROFILE
START b=node:addressNodeFullTextIndex("text:jinling*")
MATCH (a:AddressNode{ruleAbbr:'TOW',text:'Vbelongto'})<-[r:BELONGTO]-(b:AddressNode)
WHERE b.ruleAbbr = 'STR'
RETURN a, b
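The string passed to the index in the START clause ("text:jinling*") is a Lucene query string, so characters that belong to Lucene's query syntax need escaping when the search term comes from user input. The helper below is a hand-written sketch with hypothetical names; when Lucene is on the classpath, its own QueryParser.escape is the better choice.

```java
// Sketch: building a "field:prefix*" Lucene query string from raw user input.
final class LuceneQueryStringDemo {

    // Characters with special meaning in Lucene's classic query syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    // Put a backslash in front of every special character.
    static String escape(String text) {
        StringBuilder escaped = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                escaped.append('\\');
            }
            escaped.append(c);
        }
        return escaped.toString();
    }

    // Build a prefix query such as text:jinling* for the given field.
    static String prefixQuery(String field, String prefix) {
        return field + ":" + escape(prefix) + "*";
    }

    public static void main(String[] args) {
        System.out.println(prefixQuery("text", "jinling")); // text:jinling*
        System.out.println(prefixQuery("text", "a:b"));     // text:a\:b*
    }
}
```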

Combining exact and full-text legacy indexes in one query

For nodes labelled AddressNode, two different types of index are created on the text property: addressNode_fullText_index for the coarse address levels (province -> city -> district/county -> township/sub-district -> street/lane/residential community) and addressNode_exact_index for the fine levels (door number -> building number -> unit number -> floor number -> room number).

PROFILE
START a=node:addressNode_fullText_index("text:high street"), b=node:addressNode_exact_index("text:second 19")
MATCH (a:AddressNode{ruleAbbr:'STR'})-[r:BELONGTO]-(b:AddressNode{ruleAbbr:'TAB'})
RETURN a, b LIMIT 10