Search is a common requirement in big data. Splunk and ELK are the leaders in the closed-source and open-source spaces respectively. This article implements a basic data search feature in very little Python code, to help you understand the fundamentals of big data search.

Bloom Filter

The first step is to implement a Bloom filter.

Bloom filters are a common algorithm in the field of big data. Their purpose is to quickly rule out elements that are definitely not in a set: if a search term was never added to the data, the filter can immediately report that it does not exist.

Let’s take a look at the following Bloom filter code:

class Bloomfilter(object):
    """
    A Bloom filter is a probabilistic data structure that trades space for
    accuracy when determining if a value is in a set. It can tell you if a
    value was possibly added, or if it was definitely not added, but it
    can't tell you for certain that it was added.
    """
    def __init__(self, size):
        """Set up the BF with the appropriate size"""
        self.values = [False] * size
        self.size = size

    def hash_value(self, value):
        """Hash the value provided and scale it to fit the BF size"""
        return hash(value) % self.size

    def add_value(self, value):
        """Add a value to the BF"""
        h = self.hash_value(value)
        self.values[h] = True

    def might_contain(self, value):
        """Check if the value might be in the BF"""
        h = self.hash_value(value)
        return self.values[h]

    def print_contents(self):
        """Dump the contents of the BF for debugging purposes"""
        print(self.values)
  • The basic data structure is an array (really a bitmap, using 1/0 to record whether data is present), initialized to all False since it starts out empty. In practice the array is made very long, to keep hash collisions, and thus false positives, rare.
  • A hash function determines which bit should record a given piece of data, i.e. the index into the array.
  • When a piece of data is added to the Bloom filter, its hash value is computed and the corresponding position is set to True.
  • To check whether a piece of data already exists (has been indexed), simply check the True/False bit at the position given by its hash value.

As you can see from this, if the Bloom filter returns False, the data has definitely not been indexed; if it returns True, the data has probably been indexed, but a hash collision can produce a false positive. Using a Bloom filter in the search process lets many searches that would find nothing return early, improving efficiency.

Let’s see how this code works:

bf = Bloomfilter(10)
bf.add_value('dog')
bf.add_value('fish')
bf.add_value('cat')
bf.print_contents()
bf.add_value('bird')
bf.print_contents()
# Note: contents are unchanged after adding bird - it collides
for term in ['dog', 'fish', 'cat', 'bird', 'duck', 'emu']:
    print('{}: {} {}'.format(term, bf.hash_value(term), bf.might_contain(term)))

Results:

[False, False, False, False, True, True, False, False, False, True]
[False, False, False, False, True, True, False, False, False, True]
dog: 5 True
fish: 4 True
cat: 9 True
bird: 9 True
duck: 5 True
emu: 8 False

First, a Bloom filter of capacity 10 is created.

Then ‘dog’, ‘fish’, and ‘cat’ are added to the filter; the first line of output shows its contents at that point.

The ‘bird’ object is then added, and the contents of the Bloom filter do not change, because ‘bird’ happens to have the same hash value as ‘cat’.

Finally, we check whether a batch of objects (‘dog’, ‘fish’, ‘cat’, ‘bird’, ‘duck’, ‘emu’) have been indexed. ‘duck’ returns True even though it was never added, because it happens to have the same hash value as ‘dog’: a false positive. ‘emu’ correctly returns False. (Note that Python 3 randomizes string hashes between runs unless PYTHONHASHSEED is fixed, so the exact slots will differ if you run this yourself.)
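The ‘duck’ false positive shows the weakness of using a single hash function over a small bitmap. Real Bloom filters keep the false-positive rate down by setting and checking k independent bits per value. Below is a minimal sketch of that idea (not part of the original code), assuming extra hash functions can be derived by salting Python’s built-in hash():

class MultiHashBloomfilter(object):
    """Sketch: a Bloom filter that uses k salted hash functions."""
    def __init__(self, size, num_hashes=3):
        self.values = [False] * size
        self.size = size
        self.num_hashes = num_hashes

    def hash_values(self, value):
        # Hashing a (salt, value) tuple simulates k independent hash functions
        return [hash((i, value)) % self.size for i in range(self.num_hashes)]

    def add_value(self, value):
        for h in self.hash_values(value):
            self.values[h] = True

    def might_contain(self, value):
        # A false positive now requires all k bits to collide, not just one
        return all(self.values[h] for h in self.hash_values(value))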

Word Segmentation

The next step is to implement a tokenizer. The purpose of word segmentation is to split our text data into the smallest searchable units: words. Here we focus on English, since Chinese word segmentation involves natural language processing and is considerably more complicated, while English can basically be split on punctuation and whitespace.

Let’s look at the code for word segmentation:

def major_segments(s):
    """
    Perform major segmenting on a string.

    Split the string by all of the major breaks, and return the set of
    everything found. The breaks in this implementation are single
    characters, but in Splunk proper they can be multiple characters.
    A set is used because ordering doesn't matter, and duplicates are bad.
    """
    major_breaks = ' '
    last = -1
    results = set()

    # enumerate() will give us (0, s[0]), (1, s[1]), ...
    for idx, ch in enumerate(s):
        if ch in major_breaks:
            segment = s[last+1:idx]
            results.add(segment)

            last = idx

    # The last character may not be a break so always capture
    # the last segment (which may end up being "", but yolo)
    segment = s[last+1:]
    results.add(segment)

    return results

Major segmentation

Major segmentation here splits only on spaces, but real-world segmentation logic uses other separators as well. For example, Splunk’s default major breakers include the following, and users can also define their own:

] < > ( ) { } | ! ; , ' " * \n \r \s \t & ? + %21 %26 %2526 %3B %7C %20 %2B %3D -- %2520 %5D %5B %3A %0A %2C %28 %29
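Since the loop in major_segments tests single characters, it could be extended to a larger break set simply by widening major_breaks. A quick sketch (the break set below is illustrative only, not Splunk’s actual configuration, and multi-character breakers would need a different loop):

def major_segments_ext(s):
    # Same loop as major_segments, with a wider single-character break set
    major_breaks = ' []<>(){}|!;,\'"*\n\r\t&?+'
    last = -1
    results = set()
    for idx, ch in enumerate(s):
        if ch in major_breaks:
            results.add(s[last+1:idx])
            last = idx
    results.add(s[last+1:])
    return results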

def minor_segments(s):
    """
    Perform minor segmenting on a string.

    This is like major segmenting, except it also captures from the
    start of the input to each break.
    """
    minor_breaks = '_.'
    last = -1
    results = set()

    for idx, ch in enumerate(s):
        if ch in minor_breaks:
            segment = s[last+1:idx]
            results.add(segment)

            segment = s[:idx]
            results.add(segment)

            last = idx

    segment = s[last+1:]
    results.add(segment)
    results.add(s)

    return results

Minor segmentation

The logic of a minor split is similar to that of a major split, except that it also adds the prefix from the start of the string up to each break. For example, the minor segmentation of “1.2.3.4” yields 1, 2, 3, 4, 1.2, 1.2.3, and 1.2.3.4.
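A quick check of that claim, using the function above:

print(sorted(minor_segments('1.2.3.4')))
# ['1', '1.2', '1.2.3', '1.2.3.4', '2', '3', '4']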

def segments(event):
    """Simple wrapper around major_segments / minor_segments"""
    results = set()
    for major in major_segments(event):
        for minor in minor_segments(major):
            results.add(minor)
    return results

The overall segmentation logic is: first perform major segmentation on the text, then minor segmentation on each major segment, and return all of the resulting terms.

Let’s see how this code works:

for term in segments('src_ip = 1.2.3.4'):
    print(term)

Results:

src
1.2
1.2.3.4
src_ip
3
1
1.2.3
ip
2
=
4

Search

Well, with the tokenizer and the Bloom filter in hand, we can now implement the search function.

The code:

class Splunk(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}  # Dictionary of term to set of events
        self.events = []
    
    def add_event(self, event):
        """Adds an event to this object"""

        # Generate a unique ID for the event, and save it
        event_id = len(self.events)
        self.events.append(event)

        # Add each term to the bloomfilter, and track the event by each term
        for term in segments(event):
            self.bf.add_value(term)

            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search(self, term):
        """Search for a single term, and yield all the events that contain it"""
        
        # In Splunk this runs in O(1), and is likely to be in filesystem cache (memory)
        if not self.bf.might_contain(term):
            return

        # In Splunk this probably runs in O(log N) where N is the number of terms in the tsidx
        if term not in self.terms:
            return

        for event_id in sorted(self.terms[term]):
            yield self.events[event_id]
  • The Splunk class represents a searchable collection of indexed events
  • Each collection contains a Bloom filter, an inverted term list (a dictionary), and an array storing all events
  • When an event is added to the index, the following happens:
    • A unique ID is generated for the event; here it is simply its sequence number
    • The event is segmented, and each term is added to the inverted term list, i.e. the mapping from each term to the IDs of the events that contain it. Note that one term may correspond to multiple events, so the values of the inverted list are Sets. Inverted lists are a core structure of most search engines. (A rough illustration of these structures follows this list.)
  • When a term is searched, the following happens:
    • Check the Bloom filter; if it returns False, return immediately
    • Check the term list; if the term is not there, return immediately
    • Look up all the matching event IDs in the inverted list and yield the contents of those events
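As a rough illustration (abbreviated, and assuming the class above), the internal structures after indexing two events would look something like this:

s = Splunk()
s.add_event('src_ip = 1.2.3.4')   # stored as event_id 0
s.add_event('dst_ip = 1.2.3.4')   # stored as event_id 1

# s.events -> ['src_ip = 1.2.3.4', 'dst_ip = 1.2.3.4']
# s.terms  -> {'src_ip': {0}, 'dst_ip': {1}, 'ip': {0, 1},
#              '1.2.3.4': {0, 1}, '=': {0, 1}, ...}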

Let’s run and see:

s = Splunk()
s.add_event('src_ip = 1.2.3.4')
s.add_event('src_ip = 5.6.7.8')
s.add_event('dst_ip = 1.2.3.4')

for event in s.search('1.2.3.4'):
    print(event)
print('-')
for event in s.search('src_ip'):
    print(event)
print('-')
for event in s.search('ip'):
    print(event)
src_ip = 1.2.3.4
dst_ip = 1.2.3.4
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
dst_ip = 1.2.3.4

Isn’t it awesome?

More sophisticated searches

Going a step further, we want to use AND and OR in searches to implement more complex search logic.

The code:

class SplunkM(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}  # Dictionary of term to set of events
        self.events = []
    
    def add_event(self, event):
        """Adds an event to this object"""

        # Generate a unique ID for the event, and save it
        event_id = len(self.events)
        self.events.append(event)

        # Add each term to the bloomfilter, and track the event by each term
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            
            self.terms[term].add(event_id)

    def search_all(self, terms):
        """Search for an AND of all terms"""

        # Start with the universe of all events...
        results = set(range(len(self.events)))

        for term in terms:
            # If a term isn't present at all then we can stop looking
            if not self.bf.might_contain(term):
                return
            if term not in self.terms:
                return

            # Drop events that don't match from our results
            results = results.intersection(self.terms[term])

        for event_id in sorted(results):
            yield self.events[event_id]


    def search_any(self, terms):
        """Search for an OR of all terms"""
        results = set()

        for term in terms:
            # If a term isn't present, we skip it, but don't stop
            if not self.bf.might_contain(term):
                continue
            if term not in self.terms:
                continue

            # Add these events to our results
            results = results.union(self.terms[term])

        for event_id in sorted(results):
            yield self.events[event_id]

Python’s set intersection and union operations make it easy to support AND and OR searches.
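For reference, the two set operations behave like this on sets of event IDs:

a = {0, 1}   # events containing the first term
b = {1, 2}   # events containing the second term
print(a.intersection(b))  # {1}        -> AND: events matching every term
print(a.union(b))         # {0, 1, 2}  -> OR: events matching any term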

The running results are as follows:

s = SplunkM()
s.add_event('src_ip = 1.2.3.4')
s.add_event('src_ip = 5.6.7.8')
s.add_event('dst_ip = 1.2.3.4')

for event in s.search_all(['src_ip', '5.6']):
    print(event)
print('-')
for event in s.search_any(['src_ip', 'dst_ip']):
    print(event)
src_ip = 5.6.7.8
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
dst_ip = 1.2.3.4

Conclusion

The above code merely illustrates the basic principles of big data search: Bloom filters, word segmentation, and inverted lists. If you want to build a real search system on it, you still have a very long way to go.

Author: naughty


Original: https://my.oschina.net/taogang/blog/1579204