1. What's in your toolbox?

Our team has always looked for ways to get more value out of logs. Beyond real-time log query, SLS has strengthened the following DevOps features this year:

  • Context query
  • Real-time tail and intelligent clustering, which make troubleshooting more efficient
  • A variety of anomaly detection and prediction functions for time series data, for smarter inspection and forecasting
  • Visualization of data analysis results
  • Flexible alarm settings and notifications, with Webhook calls to trigger follow-up actions



Today we focus on how to combine log clustering with anomaly alarms to detect exceptions and alert on them more effectively.

2. Platform experiment

2.1 Experimental Data

We use a copy of raw syslog data with the log clustering service enabled. The setup is shown in the screenshot below:



Adjusting the setting in red box 1 in the screenshot below changes the results in red box 2, but the most fine-grained patterns do not change. In other words, the subpattern results are stable and unique, and we can find the corresponding original log entries through the subpattern's signature.



2.2 Generate time series information for a subpattern

Suppose we want to monitor this subpattern:

MSG: vm-11193.tc su: pam_UNIX (*:session): session closed for user root signature_id: log_signature: 1814836459146662485

We have obtained the original logs that match the pattern above, and the chart on the timeline shows how their count changes over time:



The figure above shows that logs matching this pattern are unevenly distributed over time, and some time windows contain none at all. Counting the logs directly by time window gives the following time series chart:

__log_signature__: 1814836459146662485 |  
select 
    date_trunc('minute', __time__) as time, 
    COUNT(*) as num 
from log GROUP BY time order by time ASC limit 10000



In the chart above, the time axis is not continuous, so we need to fill in the missing points of this series.

__log_signature__: 1814836459146662485 | 
select 
    time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, 
    avg(num) as num 
from  ( 
    select 
        __time__ - __time__ % 60 as time, 
        COUNT(*) as num 
    from log GROUP BY time order by time desc ) 
GROUP by time order by time ASC limit 10000
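
The effect of the gap-filling query above can be pictured with a small local sketch. This is only an illustration in Python, not how SLS implements time_series: the helper names and sample timestamps are hypothetical, and the sketch fills gaps only between the first and last observed minute, whereas the service presumably pads across the whole query window.

```python
from collections import Counter

def count_per_minute(timestamps):
    """Bucket Unix timestamps (seconds) by minute, like __time__ - __time__ % 60."""
    return Counter(ts - ts % 60 for ts in timestamps)

def fill_gaps(counts, step=60, padding=0):
    """Return (bucket, count) pairs for every step between the first and last bucket.

    Buckets with no logs get the padding value, mirroring the '0' padding argument.
    """
    if not counts:
        return []
    start, end = min(counts), max(counts)
    return [(t, counts.get(t, padding)) for t in range(start, end + step, step)]

# Hypothetical timestamps: two logs in minute 0, none in minute 60, one in minute 120.
series = fill_gaps(count_per_minute([5, 42, 130]))
print(series)  # [(0, 2), (60, 0), (120, 1)]
```

Without the padding step, the minute starting at 60 would simply be missing from the series, which is the discontinuity seen in the first chart.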



2.3 Anomaly detection on the time series

Use the time series anomaly detection function ts_predicate_arma:

__log_signature__: 1814836459146662485 | 
select 
    ts_predicate_arma(to_unixtime(time), num, 5, 1, 1, 1, 'avg') 
from  ( 
    select 
        time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, 
        avg(num) as num 
    from  ( 
        select 
            __time__ - __time__ % 60 as time, 
            COUNT(*) as num 
        from log GROUP BY time order by time desc ) 
    GROUP by time order by time ASC ) limit 10000
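
To build intuition for what a detector on this kind of series does, here is a deliberately simplified stand-in in Python. It is not ARMA and not what ts_predicate_arma computes: it only flags points that fall outside a band of mean ± k standard deviations over a trailing window. The function name, window size, multiplier, and sample values are all assumptions made for illustration.

```python
from statistics import mean, stdev

def flag_anomalies(values, window=5, k=3.0):
    """Flag each value that lies outside mean ± k*std of the preceding `window` values.

    The first two points are never flagged because a standard deviation
    needs at least two historical samples.
    """
    flags = []
    for i, v in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < 2:
            flags.append(False)
            continue
        mu, sigma = mean(history), stdev(history)
        upper, lower = mu + k * sigma, mu - k * sigma
        flags.append(v > upper or v < lower)
    return flags

# Hypothetical per-minute counts with one spike at index 6.
counts = [2, 3, 2, 3, 2, 3, 40, 2]
flags = flag_anomalies(counts)
print(flags.index(True))  # 6
```

The upper and lower bounds here play the same role as the up and lower columns unpacked in section 2.4, although the service derives them from its own model.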



2.4 How to set alarms

  • Unpack the result of the machine learning function
__log_signature__: 1814836459146662485 | 
select 
    t1[1] as unixtime, t1[2] as src, t1[3] as pred, t1[4] as up, t1[5] as lower, t1[6] as prob 
from  ( 
    select 
        ts_predicate_arma(to_unixtime(time), num, 5, 1, 1, 1, 'avg') as res 
    from  ( 
        select 
            time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, 
            avg(num) as num 
        from  ( 
            select 
                __time__ - __time__ % 60 as time, 
                COUNT(*) as num 
            from log GROUP BY time order by time desc ) 
        GROUP by time order by time ASC )) , unnest(res) as t(t1)
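
The unpacking step above can be sketched locally as follows. The sketch assumes, based on the column aliases in the query, that the function result is an array of six-element arrays in the order unixtime, src, pred, up, lower, prob. The sample numbers are made up. Note that SQL arrays are 1-indexed (t1[1]) while Python lists are 0-indexed.

```python
FIELDS = ("unixtime", "src", "pred", "up", "lower", "prob")

def unpack(res):
    """Turn a list of 6-element arrays into one dict per point.

    This mirrors `unnest(res) as t(t1)` followed by `t1[1] as unixtime, ...`.
    """
    return [dict(zip(FIELDS, row)) for row in res]

# Hypothetical result with two points; values are illustrative only.
res = [
    [1546300800.0, 3.0, 2.8, 5.1, 0.5, 0.0],
    [1546300860.0, 9.0, 3.1, 5.4, 0.8, 1.0],
]
rows = unpack(res)
print(rows[1]["src"], rows[1]["up"])  # 9.0 5.4
```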



  • Restrict the alarm to the last two minutes
__log_signature__: 1814836459146662485 | 
select 
    unixtime, src, pred, up, lower, prob 
from  ( 
    select 
        t1[1] as unixtime, t1[2] as src, t1[3] as pred, t1[4] as up, t1[5] as lower, t1[6] as prob 
    from  ( 
        select 
            ts_predicate_arma(to_unixtime(time), num, 5, 1, 1, 1, 'avg') as res 
        from  ( 
            select 
                time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, 
                avg(num) as num 
            from  ( 
                select 
                    __time__ - __time__ % 60 as time, COUNT(*) as num 
                from log GROUP BY time order by time desc ) 
            GROUP by time order by time ASC )) , unnest(res) as t(t1) ) 
    where is_nan(src) = false order by unixtime desc limit 2
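
The filter at the end of the query above keeps only rows with a real observed value and then takes the two most recent ones. A minimal Python sketch of the same selection follows. It assumes that rows with a NaN src are points that have no observed value (for example the forecast point the function appends); the row contents are hypothetical.

```python
import math

def last_valid_points(rows, n=2):
    """Return the n most recent rows whose observed value `src` is not NaN."""
    valid = [r for r in rows if not math.isnan(r["src"])]
    return sorted(valid, key=lambda r: r["unixtime"], reverse=True)[:n]

# Hypothetical unpacked rows; the last one has no observed value.
rows = [
    {"unixtime": 1546300800, "src": 3.0, "prob": 0.0},
    {"unixtime": 1546300860, "src": 9.0, "prob": 1.0},
    {"unixtime": 1546300920, "src": 8.0, "prob": 1.0},
    {"unixtime": 1546300980, "src": float("nan"), "prob": 0.0},
]
recent = last_valid_points(rows)
print([r["unixtime"] for r in recent])  # [1546300920, 1546300860]
```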



  • Alarm on rising points and set a fallback policy
__log_signature__: 1814836459146662485 | 
select 
    sum(prob) as sumProb, max(src) as srcMax, max(up) as upMax 
from ( 
    select 
        unixtime, src, pred, up, lower, prob 
    from  ( 
        select 
            t1[1] as unixtime, t1[2] as src, t1[3] as pred, t1[4] as up, t1[5] as lower, t1[6] as prob 
        from  ( 
            select 
                ts_predicate_arma(to_unixtime(time), num, 5, 1, 1, 1, 'avg') as res 
            from  ( 
                select 
                    time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, avg(num) as num 
                from  ( 
                    select 
                        __time__ - __time__ % 60 as time, COUNT(*) as num 
                    from log GROUP BY time order by time desc ) 
                GROUP by time order by time ASC )) , unnest(res) as t(t1) ) 
        where is_nan(src) = false order by unixtime desc limit 2 )
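
The aggregation in the query above reduces the last two points to three numbers that an alarm condition can test. The Python sketch below reproduces that reduction and adds one possible trigger rule. The rule itself is an assumption: the article's actual alarm settings are in a screenshot, so the thresholds prob_threshold and hard_limit, and the reading of prob as an anomaly probability between 0 and 1, are hypothetical choices for illustration.

```python
def summarize(points):
    """Compute the same aggregates as the query: sumProb, srcMax, upMax."""
    return {
        "sumProb": sum(p["prob"] for p in points),
        "srcMax": max(p["src"] for p in points),
        "upMax": max(p["up"] for p in points),
    }

def should_alarm(summary, prob_threshold=2.0, hard_limit=100.0):
    """Hypothetical trigger rule, not the article's actual alarm settings.

    rising:   both points look anomalous and the count exceeds the upper bound.
    fallback: the count exceeds a fixed limit regardless of what the model says.
    """
    rising = summary["sumProb"] >= prob_threshold and summary["srcMax"] > summary["upMax"]
    fallback = summary["srcMax"] > hard_limit
    return rising or fallback

# Hypothetical last two points of the series.
points = [
    {"src": 9.0, "up": 5.4, "prob": 1.0},
    {"src": 8.0, "up": 5.2, "prob": 1.0},
]
summary = summarize(points)
print(summary, should_alarm(summary))
```

The fallback branch matters when the model's bounds drift upward along with the data: a fixed limit still fires even if the upper bound has grown to accommodate a sustained surge.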



The specific alarm settings are as follows:



3. Advertisement

3.1 Going further with logs

This is a demo of Log Service:



To learn more about Log Service, see the Log Service Learning Path.



This article is original content of the Alibaba Cloud Yunqi Community and may not be reproduced without permission.