Vegetable vegetable’s raise salary application is still waiting for approval….
As a technician, technical problems still need to be solved. After the analysis of online logs, logs adopt the hourly mechanism, one log file per hour, and there are multiple log files in the same hour, that is to say, the logs at the same time may be scattered in multiple log files, which is the main reason why Y always merges. Each log file is about 500 MB, and there are about 100 of them. At this point, what should you do if you read this article? How about 2 minutes of meditation!!
Problem analysis
There are several difficulties in achieving the total requirement of Y:
- How can I sort all log files by time
- The total size of the log files is 500M by 100, or about 50GB, so it is impossible to load them all into memory
- Frequently sort and find the smallest element as the program executes.
So what do we do? One solution is this: the heap
The solution
Heap definition
Heap (English: heap) is a general term for a special type of data structure in computer science. A heap is usually an array object that can be thought of as a tree. A heap always satisfies the following properties:
- The value of a node in the heap is always no greater than or less than its parent
- A heap is always a complete binary tree (a complete binary tree requires a full number of nodes except for the last one, which is left)
For a heap where each node is greater than or equal to the value of each node in the subtree, we call it the “big top heap.” A heap whose value of each node is less than or equal to the value of each node in the subtree is called a “small top heap.”
Heap implementation
Full binary trees are best stored in arrays (linked lists are also possible). Why do you say that? Using arrays to store full binary trees is very storage efficient. Because we don’t need to store Pointers to the left and right child nodes, we can find the left and right child and parent nodes of a node simply by using the index of the array.
- When adding elements, we adjust the heap in a bottom-up way. We insert new elements in the last free position of the array, find the parent element by subscript and superscript, and swap if the value is less than the parent element (the big top heap). As shown in figure:
- Delete maximum (minimum element) For a large top heap, the element at the top of the heap is the maximum element. After deleting this element, we need to bring the second largest element to the top of the heap. And so on, until all the elements along the path have been adjusted.
Further reading
- The top element of the small top heap is actually the smallest element of the heap, and the top element of the large top heap is the largest element of the heap. And that’s one of the great things about heap sort, taking the smallest element or the largest element is order one.
- Remove elements we should note, if with the method of top-down switching elements, in many cases caused a serious imbalance of heap (left and right subtrees depth is large), in order to prevent similar situation, we can put the last element mentioned pile tip, and then adjust the strategy, because the last element is always in the final stage, It doesn’t make a big difference between the left and right subtrees.
- For a heap with duplicate elements, one way to solve this problem is to assume that the first element is the largest, and that the last element to enter the heap is smaller than the first one, so that the search is always thorough. Another way is to store a linked list of the same elements in each element of the heap, similar to a hash table. However, this requires special treatment when deleting this element.
- The time complexity of removing data from the heap and inserting data into the heap is O(logn).
- The process of constantly tweaking the heap is essentially a sorting process, and in some scenarios, we can use the heap to sort.
Asp.net Core emulation code
The following code can be used directly in a production environment with little or no modification
Small top heap implementation code
/// <summary> /// In the small top heap, THE TYPE of T requires the IComparable interface /// </summary> class MinHeap<T> whereT : IComparable { private T[] container; Private int capacity; private int capacity; Private int count; private int count; // The number of data stored in the heap public MinHeap(int _capacity) { container = new T[_capacity + 1]; capacity = _capacity; count = 0; } // Insert an element public bool AddItem(T item) {if (count >= capacity)
{
return false;
}
++count;
container[count] = item;
int i = count;
whileContainer [I].CompareTo(Container [I /2]) < 0) {container[I] = container[I]; container[i] = container[i / 2]; container[i / 2] = temp; i = i / 2; }return true; } // Get the smallest element public TGetMinItem()
{
if (count == 0)
{
return default(T);
}
T result = container[1];
returnresult; } // Delete the smallest element, i.e. the heaptop elementDeteleMinItem()
{
if (count == 0)
{
return false;
}
container[1] = container[count];
container[count] = default(T);
--count;
UpdateHeap(container, count, 1);
return true; Private void UpdateHeap(T[] a, int n, int I) {private void UpdateHeap(T[] a, int n, int I) {while (true) { int maxPos = i; // Iterate over the left and right subtrees to make sure that that is the smallest elementif (i * 2 <= n && a[i].CompareTo(a[i * 2]) > 0)
{
maxPos = i * 2;
}
if (i * 2 + 1 <= n && a[maxPos].CompareTo(a[i * 2 + 1]) > 0)
{
maxPos = i * 2 + 1;
}
if (maxPos == i)
{
break; } T temp = container[i]; container[i] = container[maxPos]; container[maxPos] = temp; i = maxPos; }}}Copy the code
Simulate log file contents
// Because of the need to keep fromlogThe file reads the content, so you need a sumlogClass LogInfoIndex: IComparable {// Which file the message is sent from public int FileIndex {get;set; } // Specific log file contents public LogInfo Data {get;set; }
public int CompareTo(object obj)
{
var tempInfo = obj as LogInfoIndex;
if (this.Data.Index > tempInfo.Data.Index)
{
return 1;
}
else if (this.Data.Index < tempInfo.Data.Index)
{
return- 1; }return0; }} class LogInfo {// public int Index {get;set; }
public string UserName { get; set; }}Copy the code
Generate a mock logger
static void WriteFile()
{
int fileCount = 0;
while (fileCount < 10)
{
string filePath = $@"D:\log\{fileCount}.txt";
int index = 0;
while (index < 100000)
{
LogInfo info = new LogInfo() { Index = index, UserName = Guid.NewGuid().ToString() };
File.AppendAllText(filePath, JsonConvert.SerializeObject(info)+ "\r\n"); index++; } fileCount++; }}Copy the code
The content of the document is as follows:
The test program
static void Main(string[] args)
{
int heapItemCount = 10;
int startIndex = 0;
StreamReader[] allReader = new StreamReader[10];
MinHeap<LogInfoIndex> container = new MinHeap<LogInfoIndex>(heapItemCount); // First read one message per filewhile(startIndex< heapItemCount)
{
string filePath = $@"D:\log\{startIndex}.txt";
System.IO.StreamReader reader = new System.IO.StreamReader(filePath);
allReader[startIndex] = reader;
string content= reader.ReadLine();
var contentObj = JsonConvert.DeserializeObject<LogInfo>(content);
LogInfoIndex item = new LogInfoIndex() { FileIndex= startIndex , Data= contentObj }; container.AddItem(item); startIndex++; } // Then start looping out of the heap and into the heapwhile (true)
{
var heapFirstItem = container.GetMinItem();
if (heapFirstItem == null)
{
break;
}
container.DeteleMinItem();
File.AppendAllText($@"D:\log\total.txt", JsonConvert.SerializeObject(heapFirstItem.Data) + "\r\n");
var nextContent = allReader[heapFirstItem.FileIndex].ReadLine();
if(string) IsNullOrWhiteSpace (nextContent)) {/ / if one of the file is read skipcontinue;
}
var contentObj = JsonConvert.DeserializeObject<LogInfo>(nextContent);
LogInfoIndex item = new LogInfoIndex() { FileIndex = heapFirstItem.FileIndex, Data = contentObj }; container.AddItem(item); } // release StreamReader foreach (var readerin allReader)
{
reader.Dispose();
}
Console.WriteLine("Complete");
Console.Read();
}
Copy the code
The results are as follows:
Add attention, view more beautiful version, harvest more wonderful