The company platform exposes a variety of APIs for internal and external callers. These APIs serve different functions, such as account queries, version distribution, red-envelope grabbing, and so on.
The logs record how many times each API is accessed per minute, so one API produces 1,440 records per day (1,440 minutes). Concatenating the data day after day gives a curve that looks much like a stock chart.
I want to predict the traffic on day N+1 from the historical data of the previous N days. The predicted values serve as a reasonable reference against which the real values of the new day are compared in real time; when the actual traffic deviates greatly from the prediction, it is treated as abnormal access and an alarm is triggered.
Data exploration
I put a sample data file in the data folder; let's look at its size and structure.
import pandas as pd

data = pd.read_csv(filename)
print('size: ', data.shape)
print(data.head())
Data size:
10,080 records, one per minute, covering seven days.
Field meanings:
date: the timestamp, at minute granularity
count: the number of times the API was accessed in that minute
Plot the sequence to see how it looks. (Some exploratory plotting helpers, including time series plots and moving averages, are collected in test_stationarity.py; try them out yourself if you are interested.)
import matplotlib.pyplot as plt

def draw_ts(timeseries):
    timeseries.plot()
    plt.show()

data = pd.read_csv(path)
data = data.set_index('date')
data.index = pd.to_datetime(data.index)
ts = data['count']
draw_ts(ts)
The sequence
Look at this graph: the points that drop to zero were the first pit I hit. I started modeling as soon as I got the data, and only after much agonizing did I discover that the sharp drops to 0 were missing values that the ETL job had automatically filled with zeros.
So fill in the holes: replace the missing values with the mean of the values before and after them, and take another look:
The sequence after filling in the missing values
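The filling code itself is not shown in the original; below is a minimal sketch of one way to do it, assuming the zeroed points are first marked as missing and then replaced by the average of the nearest valid values before and after them (the helper name fill_missing is made up for illustration):
import numpy as np
import pandas as pd

def fill_missing(ts):
    # Assumption: a count of 0 only appears where the ETL job zero-filled missing data.
    ts = ts.replace(0, np.nan)
    # Mean of the previous and next valid values for each missing stretch.
    filled = (ts.ffill() + ts.bfill()) / 2
    # Edges with only one valid neighbour fall back to that neighbour.
    return filled.fillna(ts.ffill()).fillna(ts.bfill())

# e.g. ts = fill_missing(ts); draw_ts(ts)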
This data has the following characteristics, which must be taken into account in model design and preprocessing:
1. It is a periodic time series whose value fluctuates in a regular daily pattern. The API in the figure is most active in the afternoon and evening and much quieter in the morning and early hours, so a seasonal decomposition is needed before modeling.
2. My second pit: the data itself is not smooth. There are many sudden spikes and drops, which is bad for prediction; the model needs to learn the normal sequence to judge unknown data objectively, otherwise false alarms fire frequently and make things very awkward. So the series must be smoothed.
3. This is only the sequence plot of one API, and different APIs look very different, since they serve different functions. How to adapt the model to differently shaped APIs is also something to consider.
Preprocessing
1. Split the training and test sets
The first six days of data are used for training; the seventh day is the test set.
class ModelDecomp(object):
    def __init__(self, file, test_size=1440):
        self.ts = self.read_data(file)
        self.test_size = test_size
        self.train_size = len(self.ts) - self.test_size
        self.train = self.ts[:len(self.ts) - test_size]
        self.test = self.ts[-test_size:]
2. Smooth the training data
To remove spikes from the data you could use a moving average, but I did not use one here: for my data, I found that a moving average does not actually leave the series smooth after processing. The method I used instead is very simple, yet works well:
Take the difference between each point and the previous point as a new series, clip the outliers, i.e. the points whose change is unreasonably large, and fill them with the mean of the surrounding data, taking care of the case where several consecutive points all change sharply.
from datetime import timedelta
import numpy as np

def _diff_smooth(self, ts):
    dif = ts.diff().dropna()                              # first-order difference series
    td = dif.describe()                                   # min, 25%, 50%, 75%, max
    high = td['75%'] + 1.5 * (td['75%'] - td['25%'])      # upper threshold: 1.5 * IQR above Q3
    low = td['25%'] - 1.5 * (td['75%'] - td['25%'])       # lower threshold, same idea

    # indexes of the points whose change exceeds the thresholds
    forbid_index = dif[(dif > high) | (dif < low)].index
    i = 0
    while i < len(forbid_index) - 1:
        n = 1                                             # how many consecutive points change too sharply
        start = forbid_index[i]
        while i + n < len(forbid_index) and forbid_index[i + n] == start + timedelta(minutes=n):
            n += 1
        i += n - 1
        end = forbid_index[i]
        # replace the whole abnormal stretch with a linear interpolation between
        # the values just before and just after it
        value = np.linspace(ts[start - timedelta(minutes=1)], ts[end + timedelta(minutes=1)], n)
        ts[start:end] = value
        i += 1
    return ts

self.train = self._diff_smooth(self.train)
draw_ts(self.train)
Smoothed training data:
3. Seasonally decompose the training data
Using the statsmodels toolkit:
from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(self.ts, freq=freq, two_sided=False)
# self.ts: the time series
# freq: the period, here 1440 minutes, i.e. one day
# two_sided: look at rows 2 and 4 of the plot below, which are blank on the left;
#   if True, both sides would be blank, while False keeps data at the end of the
#   series, which is needed for forecasting.

self.trend = decomposition.trend
self.seasonal = decomposition.seasonal
self.residual = decomposition.resid
decomposition.plot()
plt.show()
The decomposition plot
In the plot, the first row, observed, is the original data; the second row, trend, is the decomposed trend component; the third row, seasonal, is the periodic component; and the last row, resid, is the residual.
I used the additive model of seasonal_decompose, i.e. observed = trend + seasonal + residual; there is also a multiplicative model.
In modeling we only fit and forecast the trend component. How do we turn the trend forecast into a reasonable final result? By adding the other components back, of course; this is written out in more detail below.
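As a quick sanity check of the additive relationship (a minimal sketch that reuses decomposition and self.ts from the snippet above), the three components should sum back to the original series wherever they are all defined:
import numpy as np

# trend and resid are NaN at the start because two_sided=False,
# so compare only where all components are defined.
recomposed = (decomposition.trend + decomposition.seasonal + decomposition.resid).dropna()
assert np.allclose(recomposed.values, self.ts[recomposed.index].values)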
Model
1. Training
The decomposed trend component is trained separately with an ARIMA model:
def trend_model(self, order):
    self.trend.dropna(inplace=True)
    train = self.trend[:len(self.trend) - self.test_size]
    # order (p, d, q) is chosen the usual ARIMA way; see the official documentation.
    # The parameter selection process is omitted here.
    self.trend_model = ARIMA(train, order).fit(disp=-1, method='css')
2. Forecasting
After the trend is forecast, the periodic component is added back to form the final prediction. More importantly, what we need is not a single value but a reasonable interval; when the real data falls outside this interval, an alarm is triggered. The upper and lower error bounds come from the decomposed residual data:
d = self.residual.describe()
delta = d['75%'] - d['25%']
self.low_error, self.high_error = (d['25%'] - 1 * delta, d['75%'] + 1 * delta)
Make the prediction and do the final addition, giving the predicted values for the seventh day together with the high and low confidence bounds:
def predict_new(self):
    """Predict the new day."""
    n = self.test_size
    # Generate a time index of length n starting right after the training data.
    self.pred_time_index = pd.date_range(start=self.train.index[-1], periods=n + 1, freq='1min')[1:]
    self.trend_pred = self.trend_model.forecast(n)[0]
    self.add_season()

def add_season(self):
    """Add the seasonal component and the error bounds back onto the trend forecast."""
    self.train_season = self.seasonal[:self.train_size]
    values = []
    low_conf_values = []
    high_conf_values = []
    for i, t in enumerate(self.pred_time_index):
        trend_part = self.trend_pred[i]
        # Seasonal component for this minute of the day: the mean of the training
        # seasonal values at the same time of day.
        season_part = self.train_season[self.train_season.index.time == t.time()].mean()
        # predict = trend_part + season_part; the bounds add the residual-based errors
        predict = trend_part + season_part
        low_bound = trend_part + season_part + self.low_error
        high_bound = trend_part + season_part + self.high_error
        values.append(predict)
        low_conf_values.append(low_bound)
        high_conf_values.append(high_bound)
    # Final prediction and confidence bounds as time-indexed series
    self.final_pred = pd.Series(values, index=self.pred_time_index, name='predict')
    self.low_conf = pd.Series(low_conf_values, index=self.pred_time_index, name='low_conf')
    self.high_conf = pd.Series(high_conf_values, index=self.pred_time_index, name='high_conf')
3. Evaluation:
For the seventh-day prediction, the evaluation metric is the root mean square error (RMSE), plus a plot comparing the prediction with the real values:
md = ModelDecomp(file=filename, test_size=1440)
md.decomp(freq=1440)
md.trend_model(order=(1, 1, 3))   # (p, d, q); the q value is truncated in the source, 3 is illustrative
md.predict_new()
pred = md.final_pred
test = md.test

plt.subplot(211)
plt.plot(md.ts)                               # the full original series
plt.title(filename.split('.')[0])
plt.subplot(212)
pred.plot(color='blue', label='Predict')      # prediction
test.plot(color='red', label='Original')      # real values
md.low_conf.plot(color='grey', label='low')   # lower confidence bound
md.high_conf.plot(color='grey', label='high') # upper confidence bound

plt.legend(loc='best')
plt.title('RMSE: %.4f' % np.sqrt(sum((pred.values - test.values) ** 2) / test.size))
plt.tight_layout()
plt.show()
Predicted results
As you can see, an RMSE of 462.8 is acceptable given that the original data is on the order of several thousand. The two mutation points in the test data also fall outside the confidence interval and are reported correctly.
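The real-time alarm check itself is not shown in the article; below is a minimal sketch, assuming actual is the series of real per-minute counts for the new day and low_conf/high_conf are the bounds computed above (the function name check_alarm is made up for illustration):
def check_alarm(actual, low_conf, high_conf):
    # Minutes where the real traffic falls outside the predicted interval.
    abnormal = actual[(actual < low_conf) | (actual > high_conf)]
    for t, value in abnormal.items():
        print('ALARM at %s: count=%s, expected [%.1f, %.1f]'
              % (t, value, low_conf[t], high_conf[t]))
    return abnormal

# e.g. check_alarm(md.test, md.low_conf, md.high_conf)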
Conclusion
As mentioned earlier, different APIs vary greatly in shape; this article shows only one of them.
In this project I also worked with sequences of other shapes: some had an obvious upward or downward trend, some started flat and then grew, and so on.
But they are all typical periodic time series, and the core idea is the same and very simple: do the decomposition well, restore the predicted results, and set the confidence interval. The concrete steps can be adjusted to fit the specific business logic.
Original article: www.jianshu.com/p/09e5218f5…