preface

Keyword detection is now almost all web systems must have something. To do complex will involve NLP and other “advanced technology”, this will not be in this paper nonsense. Here is a simple algorithm: DFA.

What’s the DFA

DFA: Deterministic Finite Automaton. Its characteristics are: there is a finite set of states and some edges leading from one state to another state, each edge marked with a symbol, one of the states is the initial state, some states are the final state.

You have a set, and each element in the set has its state identified. By knowing some events, you can get the next state from the current state of the element to know what the next element will be. That is, event + state = next state. And, after all, the set is finite, so it must be possible to find the final node from one node.

The figure above clearly shows the relationship between the elements of this set. State S can find U through event A, and V can be found through event B. So how to use this finite automata keyword detection, here is an example.

For example

Now, there are a few key words: Hyacinth seed, hyacinth grass, hyacinth flower, pikachu, pichilus. We now carry out some operations on these keywords to construct a finite automaton.

If we organize these words into such a tree structure, we can search the tree to which the first word belongs according to it, which will greatly improve the search efficiency.

Take the frog flower as an example:

  1. First through the word “miao” to search, to see whether there is a tree with “miao” as the root node. If it doesn’t exist, it’s not in our keywords.
  2. If the tree exists, we match the second word “frog” to see if it is on the second node of the current tree. If it doesn’t exist, it’s not in our keywords. And so on.
  3. How to tell if the tree has reached the last word of the keyword. When we generate the tree, we add an attribute (isEnd=true) to the last node to indicate that the keyword has ended.

A HashSet instance of this automaton is given below:

Myra = {isEnd = 0 Frog = {isEnd = 0 species = {isEnd = 0 child = {isEnd = 1} grass = {isEnd = 1} Flower = {isEnd = 1}} skin = {isEnd = 0 card = {isEnd = 0 mound = {isEnd = 1} mound = {isEnd = 1}}Copy the code

Code implementation

class DFA
{
    private $tree = [];

    public function build($strWord)
    {

        $len = mb_strlen($strWord.'UTF-8');
        $tree = &$this->tree;

        for ($i = 0; $i < $len; $i{+ +)$word = mb_substr($strWord.$i, 1, 'UTF-8');

            if (isset($tree[$word]) {if ($i= = ($len - 1)) {
                    $tree[$word] ['end'] = 1; }}else {

                if ($i= = ($len - 1)) {
                    $tree[$word] = [];
                    $tree[$word] ['end'] = 1;
                } else {
                    $tree[$word] = [];
                    $tree[$word] ['end'] = 0; }}$tree = &$tree[$word];
        }
    }

    public function search($strWord)
    {
        $len = mb_strlen($strWord.'UTF-8');

        $arrHashMap = $this->tree;

        for ($i = 0; $i < $len; $i{+ +)$word = mb_substr($strWord.$i, 1, 'UTF-8');

            if(! isset($arrHashMap[$word]) {$arrHashMap = $this->tree;
                continue;
            }

            if ($arrHashMap[$word] ['end']) {
                return true;
            }

            $arrHashMap = $arrHashMap[$word];
        }
        
        return false; }}$obj = new DFA();
$obj->build('Frog seed');
$obj->build('Frogweed');
$obj->build('Pikachu');
$obj->build('pichel');

var_dump($obj->search('Lips'));
var_dump($obj->search('Frog seed'));

Copy the code

Some of the problems

  1. This detection keyword is also dependent on the thesaurus itself, the thesaurus itself needs maintenance.
  2. A large thesaurus can take up a lot of memory.
  3. Meaningless characters affect the query and need to be filtered in advance.

To sum up

The above is a simple application of using DFA for keyword detection, which is quite practical in some simple application scenarios. Of course, true keyword detection also requires deep learning and natural language analysis, which will be studied later when the opportunity comes.

PS

Attach a warehouse

Github.com/Devil0023/p…