Search is a common requirement in big data. Splunk and ELK are the leaders in the closed-source and open-source spaces respectively. This article implements a basic data search feature in a small amount of Python code, to help you understand the basics of big data search.
Bloom Filter
The first step is to implement a Bloom filter.
Bloom filters are a common algorithm in the field of big data. Their purpose is to filter out elements that cannot be the target: if a search term does not exist in the data, the filter can quickly report that the target does not exist.
Let’s take a look at the following bloom filter code:
class Bloomfilter(object):
    """
    A Bloom filter is a probabilistic data-structure that trades space
    for accuracy when determining if a value is in a set. It can tell you
    if a value was possibly added, or if it was definitely not added, but
    it can't tell you for certain that it was added.
    """
    def __init__(self, size):
        """Setup the BF with the appropriate size"""
        self.values = [False] * size
        self.size = size

    def hash_value(self, value):
        """Hash the value provided and scale it to fit the BF size"""
        return hash(value) % self.size

    def add_value(self, value):
        """Add a value to the BF"""
        h = self.hash_value(value)
        self.values[h] = True

    def might_contain(self, value):
        """Check if the value might be in the BF"""
        h = self.hash_value(value)
        return self.values[h]

    def print_contents(self):
        """Dump the contents of the BF for debugging purposes"""
        print(self.values)
- The basic data structure is an array (in practice a bitmap, using 1/0 to record whether data is present), initialized with nothing indexed, so every slot starts as False. In real use the array is very large to keep collisions rare.
- A hash function determines which slot a piece of data maps to, i.e. its index into the array
- When a value is added to the Bloom filter, its hash is computed and the corresponding position is set to True
- To check whether a piece of data already exists or has been indexed, simply check the True/False bit at the position given by its hash
As you can see, if the Bloom filter returns False, the data definitely has not been indexed, whereas if it returns True, the data may have been indexed, but this is not guaranteed (a false positive is possible). Using a Bloom filter in the search process lets searches for terms that were never indexed return early, which improves efficiency.
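To get a feel for how likely false positives are with a single hash function, here is a rough back-of-the-envelope estimate (my own illustration, not part of the original code): after inserting n values into m slots, a term that was never added still maps to an occupied slot with probability 1 - (1 - 1/m)^n.

# Rough single-hash collision estimate (illustrative, not from the article's code)
m, n = 10, 3  # filter size and number of inserted values, matching the example below
print(1 - (1 - 1.0 / m) ** n)  # ~0.271, so false positives are very likely at this size

With only 10 slots, roughly one in four lookups for an absent term will collide, which is why the example below hits a collision so easily.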
Let’s see how the Bloomfilter class above works:
bf = Bloomfilter(10)
bf.add_value('dog')
bf.add_value('fish')
bf.add_value('cat')
bf.print_contents()
bf.add_value('bird')
bf.print_contents()
# Note: contents are unchanged after adding bird - it collides
for term in ['dog', 'fish', 'cat', 'bird', 'duck', 'emu']:
    print('{}: {} {}'.format(term, bf.hash_value(term), bf.might_contain(term)))
Results:
[False, False, False, False, True, True, False, False, False, True]
[False, False, False, False, True, True, False, False, False, True]
dog: 5 True
fish: 4 True
cat: 9 True
bird: 9 True
duck: 5 True
emu: 8 False
First, a Bloom filter with a capacity of 10 is created.
Then 'dog', 'fish' and 'cat' are added to the filter; the contents of the Bloom filter at that point are the first array printed above.
The 'bird' value is then added, and the contents of the Bloom filter do not change, because 'bird' happens to hash to the same slot as 'cat' (both hash to 9).
Finally we check whether a series of values ('dog', 'fish', 'cat', 'bird', 'duck', 'emu') have been indexed. 'duck' returns True even though it was never added, and 'emu' returns False. That is because 'duck' happens to have the same hash as 'dog': a false positive.
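Production Bloom filters reduce such false positives by using several independent hash functions over a much larger bit array: a value is reported as present only if every one of its bits is set. A minimal sketch of that idea (my own illustration, not part of the original article) might look like this:

class MultiHashBloomfilter(object):
    """Sketch: a Bloom filter that sets k bits per value instead of one."""
    def __init__(self, size, num_hashes=3):
        self.values = [False] * size
        self.size = size
        self.num_hashes = num_hashes

    def _positions(self, value):
        # Derive k different positions by salting the value before hashing
        return [hash((i, value)) % self.size for i in range(self.num_hashes)]

    def add_value(self, value):
        for pos in self._positions(value):
            self.values[pos] = True

    def might_contain(self, value):
        # A hit requires all k positions to be set; one unset bit proves absence
        return all(self.values[pos] for pos in self._positions(value))

With k hash functions and a large enough array, a never-added term has to collide on every position at once to produce a false positive, which is far less likely than in the single-hash version above.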
Word Segmentation
The next step is to implement a tokenizer (word segmenter). The purpose of word segmentation is to divide our text data into the smallest searchable units, namely words. Here we focus on English, because Chinese word segmentation involves natural language processing and is more complicated, while English can basically be split on whitespace and punctuation.
Let’s look at the code for word segmentation:
def major_segments(s):
    """
    Perform major segmenting on a string. Split the string by all of
    the major breaks, and return the set of everything found. The breaks
    in this implementation are single characters, but in Splunk proper
    they can be multiple characters. A set is used because ordering
    doesn't matter, and duplicates are bad.
    """
    major_breaks = ' '
    last = -1
    results = set()

    # enumerate() will give us (0, s[0]), (1, s[1]), ...
    for idx, ch in enumerate(s):
        if ch in major_breaks:
            segment = s[last+1:idx]
            results.add(segment)
            last = idx

    # The last character may not be a break so always capture
    # the last segment (which may end up being "", but yolo)
    segment = s[last+1:]
    results.add(segment)

    return results
Major segmentation
Major segmentation splits on spaces, but in real word segmentation logic there are other separators as well. For example, Splunk's default separators include the following, and users can also define their own:
] < > ( ) { } | ! ; , ' " * \n \r \s \t & ? + %21 %26 %2526 %3B %7C %20 %2B %3D -- %2520 %5D %5B %3A %0A %2C %28 %29
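The major_segments implementation above only handles single-character breaks, but the break set is just a string, so widening it is straightforward. The sketch below uses a larger single-character subset of those separators (the exact set is my own choice for illustration, not Splunk's real configuration, which also includes multi-character sequences such as %21 that this code cannot handle):

def major_segments_wide(s, breaks=' \n\r\t[]<>(){}|!;,\'"&?+*'):
    """Same logic as major_segments, parameterized by the break set"""
    last = -1
    results = set()
    for idx, ch in enumerate(s):
        if ch in breaks:
            results.add(s[last + 1:idx])
            last = idx
    results.add(s[last + 1:])
    return results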
def minor_segments(s):
    """
    Perform minor segmenting on a string. This is like major segmenting,
    except it also captures from the start of the input to each break.
    """
    minor_breaks = '_.'
    last = -1
    results = set()

    for idx, ch in enumerate(s):
        if ch in minor_breaks:
            segment = s[last+1:idx]
            results.add(segment)

            segment = s[:idx]
            results.add(segment)

            last = idx

    segment = s[last+1:]
    results.add(segment)
    results.add(s)

    return results
Minor segmentation
The logic of a minor split is similar to that of a major split, except that it also adds the prefix from the start of the string up to each break. For example, the minor segmentation of '1.2.3.4' yields 1, 2, 3, 4, 1.2, 1.2.3 and the full string 1.2.3.4, as the quick check below shows.
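A quick check with the minor_segments function above (sorted here for readability) confirms this:

print(sorted(minor_segments('1.2.3.4')))
# ['1', '1.2', '1.2.3', '1.2.3.4', '2', '3', '4']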
def segments(event):
    """Simple wrapper around major_segments / minor_segments"""
    results = set()
    for major in major_segments(event):
        for minor in minor_segments(major):
            results.add(minor)
    return results
The overall tokenization logic is to first perform major segmentation on the text, then minor segmentation on each major segment, and finally return all of the resulting terms.
Let’s see how this code works:
for term in segments('src_ip = 1.2.3.4'):
    print(term)
src
1.2
1.2.3.4
src_ip
3
1
1.2.3
ip
2
=
4
Search
With the tokenizer and the Bloom filter in place, we can now implement the search function.
The code:
class Splunk(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}  # Dictionary of term to set of events
        self.events = []

    def add_event(self, event):
        """Adds an event to this object"""
        # Generate a unique ID for the event, and save it
        event_id = len(self.events)
        self.events.append(event)

        # Add each term to the bloomfilter, and track the event by each term
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search(self, term):
        """Search for a single term, and yield all the events that contain it"""
        # In Splunk this runs in O(1), and is likely to be in filesystem cache (memory)
        if not self.bf.might_contain(term):
            return

        # In Splunk this probably runs in O(log N) where N is the number of terms in the tsidx
        if term not in self.terms:
            return

        for event_id in sorted(self.terms[term]):
            yield self.events[event_id]
- The Splunk class represents an indexed collection of events with search capability
- Each collection contains a Bloom filter, an inverted index (a dictionary of terms), and an array storing all events
- When an event is added to the index, the following logic runs
- Generate a unique ID for the event, which is simply its sequence number
- Tokenize the event and add each term to the inverted index, i.e. the mapping from each term to the IDs of the events that contain it. Note that a term may correspond to multiple events, so the values of the inverted index are sets. The inverted index is a core feature of most search engines (the sketch after this list shows what it looks like in memory)
- When a term is searched, the following logic runs
- Check the Bloom filter; if it returns False, return immediately
- Check the term dictionary; if the searched term is not present, return
- Look up all the corresponding event IDs in the inverted index and return the contents of those events
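To see what the inverted index looks like in memory, it can help to index a single event and print self.terms directly (a small debugging sketch of my own, assuming the Splunk class above):

s = Splunk()
s.add_event('src_ip = 1.2.3.4')
# Every term produced by segments() maps to the set of IDs of the events containing it
for term in sorted(s.terms):
    print('{} -> {}'.format(term, sorted(s.terms[term])))
# e.g. '1 -> [0]', '1.2 -> [0]', ..., 'src_ip -> [0]': every term points to event 0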
Let’s run and see:
s = Splunk()
s.add_event('src_ip = 1.2.3.4')
s.add_event('src_ip = 5.6.7.8')
s.add_event('dst_ip = 1.2.3.4')

for event in s.search('1.2.3.4'):
    print(event)
print('-')
for event in s.search('src_ip'):
    print(event)
print('-')
for event in s.search('ip'):
    print(event)
src_ip = 1.2.3.4
dst_ip = 1.2.3.4
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
dst_ip = 1.2.3.4
Isn’t it awesome?
More sophisticated searches
Going further, we want to use AND and OR during search to implement more complex search logic.
The code:
class SplunkM(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}  # Dictionary of term to set of events
        self.events = []

    def add_event(self, event):
        """Adds an event to this object"""
        # Generate a unique ID for the event, and save it
        event_id = len(self.events)
        self.events.append(event)

        # Add each term to the bloomfilter, and track the event by each term
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search_all(self, terms):
        """Search for an AND of all terms"""
        # Start with the universe of all events...
        results = set(range(len(self.events)))

        for term in terms:
            # If a term isn't present at all then we can stop looking
            if not self.bf.might_contain(term):
                return
            if term not in self.terms:
                return

            # Drop events that don't match from our results
            results = results.intersection(self.terms[term])

        for event_id in sorted(results):
            yield self.events[event_id]

    def search_any(self, terms):
        """Search for an OR of all terms"""
        results = set()

        for term in terms:
            # If a term isn't present, we skip it, but don't stop
            if not self.bf.might_contain(term):
                continue
            if term not in self.terms:
                continue

            # Add these events to our results
            results = results.union(self.terms[term])

        for event_id in sorted(results):
            yield self.events[event_id]
Python's set intersection and union operations make it easy to support AND and OR semantics.
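For reference, this is all the set algebra involved (a tiny standalone illustration):

a = {0, 1, 2}             # event IDs matching the first term
b = {1, 2, 3}             # event IDs matching the second term
print(a.intersection(b))  # AND: events matching both terms -> {1, 2}
print(a.union(b))         # OR: events matching either term -> {0, 1, 2, 3}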
The running results are as follows:
s = SplunkM()
s.add_event('src_ip = 1.2.3.4')
s.add_event('src_ip = 5.6.7.8')
s.add_event('dst_ip = 1.2.3.4')

for event in s.search_all(['src_ip', '5.6']):
    print(event)
print('-')
for event in s.search_any(['src_ip', 'dst_ip']):
    print(event)
src_ip = 5.6.7.8
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
dst_ip = 1.2.3.4
Conclusion
The code above is only meant to illustrate the basic principles of big data search: the Bloom filter, word segmentation (tokenization) and the inverted index. If you want to build a real search application on this code, you are still a long way off.
Author: naughty
Original: https://my.oschina.net/taogang/blog/1579204