What data structure does a hash table use for storage

Store each item in an array, indexed by its key. There are two problems with this approach:

  1. The key to be stored may not be an int, so it cannot be used as a subscript;

Solution: map the key (for example, a string) to an int

  2. The range of possible keys is huge, so the array needed to index them directly can be very large

Solution: map all possible keys to a table of size m, ideally m = n, where n is the number of keys in the table. Problem: this can cause collisions, where two different keys are hashed but get the same index
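The two solutions above can be sketched in a few lines. This is a minimal illustration, not the article's own code: the helper names `string_to_int` and `table_index` are hypothetical, and reading the string as a base-256 number is just one possible way to turn a key into an int.

```python
def string_to_int(key, base=256):
    """Read the bytes of `key` as the digits of one large base-256 integer."""
    value = 0
    for byte in key.encode("utf-8"):
        value = value * base + byte
    return value


def table_index(key, m):
    """Map a string key to a slot in a table of size m."""
    return string_to_int(key) % m


m = 8
index_ab = table_index("ab", m)
index_ib = table_index("ib", m)
# Two different keys can land in the same slot: that is a collision.
print(index_ab, index_ib, index_ab == index_ib)
```

Here both keys end in the same character, and with m = 8 only the low bits of the last byte survive the modulo, so the two keys collide.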

How to map keys to table indexes

Use a hash function.

The division method

h(k)=k mod m

With this method, m is usually chosen to be a prime number that is not too close to a power of 2
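A small sketch of why the choice of m matters for the division method. The keys below are made up for illustration; they are all multiples of 8, a pattern that a power-of-2 table size cannot break up.

```python
def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


keys = [8, 16, 24, 32, 40, 48]

# With m = 8 (a power of 2) every key falls into slot 0.
slots_power_of_two = [division_hash(k, 8) for k in keys]

# With m = 7 (a prime) the same keys spread over six different slots.
slots_prime = [division_hash(k, 7) for k in keys]

print(slots_power_of_two)
print(slots_prime)
```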

The multiplication method

a is a random number, k contains w bits, and m is generally chosen to be a power of two, m = 2^r

The value is computed as follows:

h(k) = [(a*k) mod 2^w] >> (w-r)
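A minimal sketch of the multiplication method, assuming the common textbook form h(k) = [(a*k) mod 2^w] >> (w-r) with table size m = 2^r. The function name and the concrete values of a, w and r are illustrative assumptions.

```python
def multiplication_hash(k, a, w, r):
    """Keep the low w bits of a*k, then take the top r of those bits."""
    return ((a * k) % (1 << w)) >> (w - r)


w, r = 8, 3          # 8-bit keys, table of size m = 2**3 = 8
a = 181              # an odd multiplier, chosen arbitrarily here
index = multiplication_hash(100, a, w, r)
print(index)

# Every key is mapped into the range 0 .. m-1.
all_indexes = {multiplication_hash(k, a, w, r) for k in range(1 << w)}
```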

Universal hashing

h(k) = [(ak+b) mod p] mod m, where a and b are random numbers from {0, ..., p-1} and p is a large prime number
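The formula above can be sketched as a small factory that draws a and b once and returns the resulting hash function. The name `make_universal_hash` and the default p = 2^31 - 1 (a Mersenne prime) are assumptions made for this example; keys are assumed to be non-negative integers smaller than p.

```python
import random


def make_universal_hash(m, p=2_147_483_647, rng=random):
    """Pick a and b at random from {0, ..., p-1} and return h(k)."""
    a = rng.randrange(p)
    b = rng.randrange(p)

    def h(k):
        return ((a * k + b) % p) % m

    return h


# A seeded generator keeps the example reproducible.
h = make_universal_hash(10, rng=random.Random(0))
print([h(k) for k in range(5)])
```

Because a and b are chosen when the table is created, no fixed set of keys is bad for every choice of hash function.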

Use linked lists to resolve hash collisions

If two keys hash to the same index, keep a linked list at that index of the table and add the new value to it. The worst case is that all the keys hash to the same index, which gives a worst-case search time of O(n).

Simple uniform hashing

Assume that each key is equally likely to be mapped to any index in the table, regardless of where the other keys are hashed. Given this assumption, if there are n keys and the table size is m, then the expected length of each chain is α = n/m, the load factor

So in the general case the running time is O(1+α): 1 for computing the hash and α for walking the chain. Given the assumption, linked lists are a good choice for resolving hash collisions
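A minimal sketch of chaining follows. It makes two simplifications that are not in the text: Python lists stand in for linked lists, and Python's built-in `hash` plays the role of the hash function. The class name `ChainedHashTable` is hypothetical.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by keeping a chain per slot."""

    def __init__(self, m=8):
        self.m = m
        self.n = 0
        self.slots = [[] for _ in range(m)]

    def _index(self, key):
        return hash(key) % self.m

    def insert(self, key, value):
        chain = self.slots[self._index(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, value)  # overwrite an existing key
                return
        chain.append((key, value))
        self.n += 1

    def search(self, key):
        # Walks one chain only, so the cost is O(1 + chain length).
        for existing_key, value in self.slots[self._index(key)]:
            if existing_key == key:
                return value
        return None

    def load_factor(self):
        return self.n / self.m


table = ChainedHashTable(m=8)
table.insert(1, "one")
table.insert(9, "nine")  # 9 % 8 == 1, so it shares a chain with key 1
print(table.search(9), table.load_factor())
```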

Use open addressing to resolve hash collisions

The hash function takes two arguments, the key and the attempt number, and returns a specific index

Suppose that after three inserts, h(586,1)=1, h(133,1)=2, and h(112,1)=4

Now insert 226: the first attempt gives h(226,1)=4, which is already occupied by 112, so the attempt number is increased and h(226,2)=3 gives a free slot

  • Insert: compute the subscript using the given hash function. If that position has no value, or its data has been deleted, insert there. Otherwise, increase the attempt number and compute again
  • Search: compute the subscript using the hash function. If the key stored there is not the key being searched for, increase the attempt number and try again, until the key is found or an empty position is reached
  • Delete: first find the corresponding value, then only mark the data as deleted; do not leave the storage position empty

    In the example, suppose 112 is deleted after 226 has been added. A later search for 226 first computes h(226,1)=4, the position that 112 used to occupy. If that position were simply left empty when 112 was deleted, the search would stop there and report 226 as not found, which is obviously incorrect. Marking the position as deleted solves the problem: a search continues past it, and an insert can still reuse it
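The insert, search and delete rules above can be sketched as follows. The probe function h(k, i) = (k + i) mod m is an assumption chosen to keep the example short; it is not the function behind the article's 586/133/112/226 example. Keys are assumed to be distinct integers, and the names `EMPTY`, `DELETED` and `OpenAddressTable` are hypothetical.

```python
EMPTY = object()    # slot that has never held a key
DELETED = object()  # slot whose key was removed (the "delete mark")


class OpenAddressTable:
    def __init__(self, m=8):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key, attempt):
        return (key + attempt) % self.m

    def insert(self, key):
        for attempt in range(self.m):
            j = self._probe(key, attempt)
            # Both empty slots and slots marked as deleted can be reused.
            if self.slots[j] is EMPTY or self.slots[j] is DELETED:
                self.slots[j] = key
                return j
        raise RuntimeError("table is full")

    def search(self, key):
        for attempt in range(self.m):
            j = self._probe(key, attempt)
            if self.slots[j] is EMPTY:
                return None  # a never-used slot ends the search
            if self.slots[j] is not DELETED and self.slots[j] == key:
                return j
        return None

    def delete(self, key):
        j = self.search(key)
        if j is not None:
            self.slots[j] = DELETED  # mark it, do not empty it


table = OpenAddressTable(m=8)
table.insert(4)    # lands in slot 4
table.insert(12)   # 12 % 8 == 4 is taken, so it lands in slot 5
table.delete(4)
print(table.search(12))  # still found, because slot 4 is only marked
```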

Choosing a probing strategy

  1. Linear probing. Select h(k,i) = (h'(k) + i) mod m, where h'(k) is an ordinary hash function. This visits every location of the storage array. However, it suffers from clustering: runs of occupied positions form and keep growing, which makes later attempts slower
  2. Double hashing. Select h(k,i) = (h1(k) + i*h2(k)) mod m. When h2(k) and m are relatively prime, this also visits every location of the array

    In this case, under the uniform hashing assumption, the expected number of attempts for an insert is at most 1/(1-α), where α = n/m
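A short sketch of a double-hashing probe sequence, assuming the usual form h(k, i) = (h1(k) + i*h2(k)) mod m. The particular h1 and h2 below are illustrative choices: with a prime m, any h2(k) between 1 and m-1 is relatively prime to m, so the sequence reaches every slot.

```python
def double_hash_probes(k, m):
    """Return the full probe sequence for key k in a table of prime size m."""
    h1 = k % m
    h2 = 1 + (k % (m - 1))  # always in 1 .. m-1, never 0
    return [(h1 + i * h2) % m for i in range(m)]


probes = double_hash_probes(226, 7)
print(probes)
# The sequence is a permutation of 0 .. m-1: every slot is visited once.
print(sorted(probes) == list(range(7)))
```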

What size (m) should the hash table be?

We want the expected search time to be constant, so ideally m = Θ(n). If m is too small, queries are slow; if m is too large, space is wasted

  1. Growing m. Start with a small m, then grow it to a new size m' when the table fills up. Consider two growth strategies:
    • m' = m + 1, where the total cost of n inserts is Θ(1 + 2 + ... + n) = Θ(n^2)
    • m' = 2m, where the total cost of n inserts is Θ(1 + 2 + 4 + ... + n) = Θ(n)
  2. Shrinking m. If a lot of data is deleted from the table, the original space is too large and wasteful, so it is best to shrink the table
    • When n drops to m/2, cut m in half. This can thrash: with 8 elements in a table of size 8, inserting another element grows the table to 16, removing an element shrinks it back to 8, and every resize moves all the elements
    • When n drops to m/4, cut m in half. The above problem does not arise in this case
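The resizing rules above can be sketched as follows. Both function names are hypothetical. `new_table_size` assumes the policy "double when full, halve when only a quarter full", and `total_moves` counts how many elements get copied during n inserts under a given growth rule.

```python
def new_table_size(n, m):
    """Table size to use when n keys are stored in m slots."""
    if n >= m:
        return 2 * m          # full: double
    if m > 1 and n <= m // 4:
        return m // 2         # a quarter full: halve
    return m


def total_moves(n, grow):
    """Count element copies caused by resizing during n inserts."""
    m, size, moves = 1, 0, 0
    for _ in range(n):
        if size == m:         # table is full, so rebuild it first
            m = grow(m)
            moves += size     # every stored element is moved
        size += 1
    return moves


print(total_moves(1024, lambda m: m + 1))  # grows by one slot each time
print(total_moves(1024, lambda m: 2 * m))  # doubles each time
```

Growing by one slot copies a number of elements that is quadratic in n, while doubling copies fewer than n elements in total.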

Uses of hashing

Given two strings s and t, we need to determine whether s occurs in t. The easiest way to do this is with two nested loops:

for i in range(len(t) - len(s) + 1):
    # compare s, character by character, with the window of t that starts at i
    if all(t[i + j] == s[j] for j in range(len(s))):
        print("match at index", i)

It tries every starting position in t and compares s against the substring that starts there, which takes O(|s|*|t|) time in the worst case

Karp-Rabin algorithm

The Karp-Rabin algorithm speeds this up. Compute the hash value of the string s to be matched directly. For the string t, first take a window of length |s| and compute its hash value as well. Then, to move the window one character to the right, update its rolling hash r in three steps:

  1. r.skip(oldChar)
  2. r.append(newChar)
  3. Compute the new hash value

If each of these steps can be finished in constant time, the total cost is O(|t|). The specific implementation is as follows:

def rhCombinationMatch(self):
	winLength = len(self.findStr)
	# Build the RollingHash object for the string to look for
	winRh = RollingHashCombination(self.findStr)
	lineLen = len(self.lines)
	# Build a RollingHash object for the window of text; it is updated many times
	matchRh = RollingHashCombination(self.lines[0:winLength])
	for i in range(0, lineLen - winLength + 1):
		# Are the hash values consistent?
		if matchRh.hash() == winRh.hash():
			sequence = self.lines[i:i + winLength]
			# If yes, rule out a hash collision by checking that the strings are equal
			if sequence == self.findStr:
				self.count += 1
		# Slide the window one character to the right while text remains
		if i + winLength < lineLen:
			matchRh.slide(self.lines[i], self.lines[i + winLength])

The RollingHashCombination object is built as follows; it is responsible for combining the individual steps

class RollingHashCombination(object):
	"""Combine each step of the rolling hash."""
	def __init__(self, s):
		base = 7
		# Modulus. Note: 499999 = 31 * 127 ** 2 is not prime; a large prime is the usual choice
		p = 499999
		self.rhStepByStep = RollingHashStepByStep(base, p)
		for c in s:
			self.rhStepByStep.append(c)
		self.chash = self.rhStepByStep.hash()

	def hash(self):
		return self.chash

	def slide(self, preChar, nextChar):
		"""Delete the previous value and add the new value."""
		self.rhStepByStep.skip(preChar)
		self.rhStepByStep.append(nextChar)
		self.chash = self.rhStepByStep.hash()

For example, suppose the text is "ABCDEF" and the window is 3 characters long, and the hash value is formed by simply concatenating the ASCII codes. The calculation then proceeds as follows:

  1. The first window is "ABC" and its hash value is 656667
  2. If it does not match, remove the first character, working in base 100: 656667 - 65*100^2 = 6667
  3. Append the character D at the end, and the result is 6667*100 + 68 = 666768

Thus the value changes from 656667 to 666768. Let

  • n(0) be the old number
  • n(1) be the new number
  • old be the element to delete
  • new be the element to add
  • base be the base
  • k be the length of the string to be compared

So n(1) = (n(0) - old*base^(k-1))*base + new. Assuming the hash of the old number is h1, the hash of the new number is

h2 = [(n(0) - old*base^(k-1))*base + new] mod p = [(n(0) mod p)*base - old*(base^k mod p) + new] mod p

Let magic = base^k mod p. Since h1 = n(0) mod p, h2 = [h1*base - old*magic + new] mod p
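The update rule can be checked numerically on the "ABC" to "BCD" example. This is only a sanity check under the example's own assumptions (base 100, window length 3); the modulus 499999 is borrowed from the article's code.

```python
base, k, p = 100, 3, 499999

n0 = 656667            # "ABC" as concatenated ASCII codes
old, new = 65, 68      # remove 'A', append 'D'

# Exact update on the full number.
n1 = (n0 - old * base ** (k - 1)) * base + new

# The same update carried out on hash values only.
h1 = n0 % p
magic = pow(base, k, p)            # base^k mod p
h2 = (h1 * base - old * magic + new) % p

print(n1)               # the number for "BCD"
print(h2 == n1 % p)     # the rolling update agrees with hashing n1 directly
```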

The code is implemented as follows, which computes the hash value for each step

class RollingHashStepByStep(object):
	"""Split the rolling hash into two steps, append and skip, each of which updates the hash value."""
	def __init__(self, base, p):
		"""Get an initial rolling hash."""
		super(RollingHashStepByStep, self).__init__()
		self.base = base
		# Modulus, ideally a prime number
		self.p = p
		# There are no elements at first
		self.chash = 0
		# magic = base ** k % p, with k = 0 at first
		self.magic = 1
		# Modular inverse of base (needs Python 3.8+ and gcd(base, p) == 1); keeps magic small
		self.ibase = pow(base, -1, p)

	def append(self, newChar):
		"""Add a character to the original hash and compute the hash value."""
		# new10 is the character code of newChar
		new10 = ord(newChar)
		self.chash = (self.chash * self.base + new10) % self.p
		# The window grew by one character, so magic = base ** k % p is multiplied by base
		self.magic = (self.magic * self.base) % self.p

	def skip(self, oldChar):
		"""Compute the hash value by removing a character from the original hash."""
		# The window shrinks by one character: magic becomes base ** (k - 1) % p,
		# which is the weight of the oldest character
		self.magic = (self.magic * self.ibase) % self.p
		# todo: base conversion. Why don't the values passed in need to be converted to the corresponding base?
		old10 = ord(oldChar)
		# Adding a multiple of p does not change the result; Python's % is non-negative either way
		self.chash = (self.chash - old10 * self.magic + self.p * self.base) % self.p

	def hash(self):
		return self.chash
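For comparison, the whole search can also be written as one self-contained function. This is a sketch rather than the article's implementation: the name `karp_rabin_count`, the base 256 and the modulus 1,000,000,007 (a prime) are assumptions, and here magic holds base^(k-1) mod p, the weight of the oldest character, so the skip and append steps are folded into a single update.

```python
def karp_rabin_count(pattern, text, base=256, p=1_000_000_007):
    """Count the (possibly overlapping) occurrences of pattern in text."""
    k, n = len(pattern), len(text)
    if k == 0 or k > n:
        return 0

    magic = pow(base, k - 1, p)  # weight of the window's oldest character
    target = window = 0
    for i in range(k):
        target = (target * base + ord(pattern[i])) % p
        window = (window * base + ord(text[i])) % p

    count = 0
    for i in range(n - k + 1):
        # Equal hashes may still be a collision, so compare the strings too.
        if window == target and text[i:i + k] == pattern:
            count += 1
        if i + k < n:
            # Drop text[i], shift by one position, append text[i + k].
            window = ((window - ord(text[i]) * magic) * base
                      + ord(text[i + k])) % p
    return count


print(karp_rabin_count("ABC", "ABCDEF"))
print(karp_rabin_count("aa", "aaaa"))
```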