Background of the problem

In recent years, with the expansion of onshore wind turbine installations, more and more turbines are being installed in areas with abrupt weather changes. When the wind changes suddenly, the lag of the control system can easily lead to excessive loads on the unit, or even toppling of the turbine, causing heavy economic losses. At the same time, the poor accuracy of existing ultra-short-term wind power prediction systems means they offer little reference value for grid dispatching and lead to heavy penalties against generation plans. Common wind measurement products such as lidar are expensive and strongly affected by weather, so large-scale deployment is impractical, and reliable foresight over large spans of time and space remains difficult. Reliable prediction of ultra-short-term wind conditions is therefore urgently needed.

Ultra-short-term wind condition prediction is a worldwide challenge. If big data and artificial intelligence techniques can predict the wind speed and direction at each turbine over the near future, turbines can be controlled more proactively and their load safety improved. Better ultra-short-term wind power prediction would likewise bring significant safety value and economic benefit.

This competition is co-sponsored by the Shenzhen Bao'an District People's Government and the China Academy of Information and Communications Technology. It provides real data and scenarios from industrial production, in the hope of solving challenges in actual production tasks by combining industry with AI and big data.

Data analysis

Training set description

Two years of training data are provided for each of the two wind farms:

  1. Wind speed, wind direction, temperature, and power for each of the 25 wind turbines in each wind farm, together with the corresponding hourly meteorological data, are provided.

  2. The turbines of wind farm 1 are numbered X26-X50, and the training set covers 2018 and 2019.

  3. The turbines of wind farm 2 are numbered X25-X49, and the training set covers 2017 and 2018.

  4. The data files of each turbine are stored as /training set/[wind farm]/[turbine]/[date].csv.

  5. Meteorological data is stored in the /training set/[wind farm]/ folder; a short loading sketch follows this list.
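To illustrate this layout, here is a minimal loading sketch; the concrete paths, file names, and the pd.read_csv usage are assumptions based on the description above, not code from the competition kit:

import glob
import os
import pandas as pd

def load_turbine(root, farm, turbine):
    # concatenate the daily CSV files of one turbine, in date order
    files = sorted(glob.glob(os.path.join(root, farm, turbine, '*.csv')))
    return pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# e.g. all turbines of wind farm 1, numbered X26-X50
farm1 = {f'X{i}': load_turbine('training_set', 'wind_farm_1', f'X{i}')
         for i in range(26, 51)}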

Test set description

  1. The test set is divided into two folders, one for the preliminary round and one for the final; both are organized in the same way.

  2. The preliminary and final folders each contain data for 80 periods, with 1 hour of data per period (30 s resolution, time expressed in seconds) and 20 periods in each of spring, summer, autumn, and winter, numbered 1-20 for the preliminary round and 21-40 for the final. That is, the preliminary round has 80 periods numbered from spring 01 to winter 20, and the final has 80 periods from spring 21 to winter 40.

  3. The data files of each turbine are stored as /test set_**/[wind farm]/[turbine]/[period].csv.

  4. Meteorological data is stored in the /test set_**/[wind farm]/ folder, containing wind speed and direction at the wind farm location for the 80 periods. Each period provides wind speed and direction for the past 12 hours and the following hour. The time coding is the same as above and runs from -11 to 2, where hours 0-1 correspond to the one hour of nacelle data.

The figure above shows how the test set data is divided. The data accumulated over hours -11 to 1 is used as the model input to predict the wind speed and direction for the next 10 minutes (which fall within hours 1 to 2).
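A minimal sketch of this split, assuming each period's files have already been parsed into arrays (the names, shapes, and the separate future array are illustrative, not from the competition kit):

import numpy as np

def make_example(turbine, weather, future):
    # turbine: (120, n_feat) - one hour of nacelle data at 30 s resolution (hours 0-1)
    # weather: (14, n_met)   - hourly meteorological data for hours -11 to 2
    # future:  (>=20, 2)     - wind speed and direction after hour 1, 30 s resolution
    x = (turbine, weather)   # everything observed up to hour 1 becomes model input
    y = future[:20]          # the next 10 minutes = 20 steps at 30 s resolution
    return x, y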

Missing values

The missing values in the data come mainly from two sources: in some cases the whole day's record is missing, in others only some periods within a day are missing. Because of this variety, filling missing values in only one way would leave some of them unfilled, so in the competition we applied forward fill, backward fill, and mean fill together to guarantee full coverage. This process may introduce noise, but neural networks have a certain tolerance to noise, so the effect on the final training result is small. Since missing data may appear both in the training data and in future test data, and both are recorded in the same way, we did not remove rows with missing values but applied the same filling to both, avoiding a mismatch in data distribution caused by different preprocessing.
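A minimal sketch of this combined filling strategy with pandas (the exact column handling is an assumption; the competition code for this step is not shown in the post):

import pandas as pd

def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    df = df.ffill()   # forward fill covers gaps inside a recorded day
    df = df.bfill()   # backward fill covers gaps at the start of the record
    # mean fill covers columns that are still entirely empty
    return df.fillna(df.mean(numeric_only=True))

The same function is applied to training and test data alike, keeping their distributions consistent.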

Model approach

Model structure

In the competition, we used the Encoder-Decoder model to mine the information in the input sequence through the sequence model, and then predict through the Decoder. There are many Encoder and Decoder options, such as LSTM, a common sequence model, or Transformer, which has emerged in recent years. In the game, we stacked several layers of LSTM on the Decoder side and only one layer of LSTM on the Encoder side. Dropout is not added into the model for regularization. This is because there is a large amount of high-frequency noise in the data itself. Adding dropout will lead to slow convergence of the model and affect the training efficiency of the model. PaddleNLP (github.com/PaddlePaddl…) Is the official natural language processing model library for PaddleNLP. Provides convenient data processing API, rich network structure and pre-training model as well as classification, generation and other NLP application examples, very suitable for playing games, will consider using later.

The code for the final model structure, built with the PaddlePaddle framework, is as follows:

import numpy as np
import paddle
import paddle.nn as nn

class network(nn.Layer):
    def __init__(self, name_scope='baseline'):
        super(network, self).__init__(name_scope)
        name_scope = self.full_name()
        # encoders: one bidirectional LSTM per input stream
        # (direction flags were garbled in the original post; 'bidirectional' is
        # required by the hidden-state pooling and the 1536-wide concat below)
        self.lstm1 = paddle.nn.LSTM(128, 128, direction='bidirectional', dropout=0.0)
        self.lstm2 = paddle.nn.LSTM(25, 128, direction='bidirectional', dropout=0.0)
        # the embedding vocabulary sizes were garbled in the original post and are
        # placeholders here; output dims 4 and 16 follow from the comments in forward()
        self.embedding_layer1 = paddle.nn.Embedding(32, 4)
        self.embedding_layer2 = paddle.nn.Embedding(2048, 16)
        self.mlp1 = paddle.nn.Linear(29, 128)
        self.mlp_bn1 = paddle.nn.BatchNorm(120)  # normalizes over the 120 time steps
        self.bn2 = paddle.nn.BatchNorm(14)       # normalizes over the 14 hourly steps
        self.mlp2 = paddle.nn.Linear(1536, 256)
        self.mlp_bn2 = paddle.nn.BatchNorm(256)
        # decoder: stacked bidirectional LSTMs
        self.lstm_out1 = paddle.nn.LSTM(256, 256, direction='bidirectional', dropout=0.0)
        self.lstm_out2 = paddle.nn.LSTM(512, 128, direction='bidirectional', dropout=0.0)
        self.lstm_out3 = paddle.nn.LSTM(256, 64, direction='bidirectional', dropout=0.0)
        self.lstm_out4 = paddle.nn.LSTM(128, 64, direction='bidirectional', dropout=0.0)
        self.output = paddle.nn.Linear(128, 2)
        self.sigmoid = paddle.nn.Sigmoid()

    def forward(self, input1, input2):
        # input1: turbine data, 120 steps at 30 s; input2: meteorological data, 14 hourly steps
        embedded1 = self.embedding_layer1(paddle.cast(input1[:, :, 0], dtype='int64'))
        embedded2 = self.embedding_layer2(paddle.cast(input1[:, :, 1] + input1[:, :, 0],  # * 30
                                                      dtype='int64'))
        x1 = paddle.concat([
            embedded1,
            embedded2,
            input1[:, :, 2:],
            input1[:, :, -2:-1] * paddle.sin(np.pi * 2 * input1[:, :, -1:]),
            input1[:, :, -2:-1] * paddle.cos(np.pi * 2 * input1[:, :, -1:]),
            paddle.sin(np.pi * 2 * input1[:, :, -1:]),
            paddle.cos(np.pi * 2 * input1[:, :, -1:]),
        ], axis=-1)  # 4+16+5+2+2 = 29
        x1 = self.mlp1(x1)
        x1 = self.mlp_bn1(x1)
        x1 = paddle.nn.ReLU()(x1)
        x2 = paddle.concat([
            embedded1[:, :14],
            embedded2[:, :14],
            input2[:, :, :-1],
            input2[:, :, -2:-1] * paddle.sin(np.pi * 2 * input2[:, :, -1:] / 360.),
            input2[:, :, -2:-1] * paddle.cos(np.pi * 2 * input2[:, :, -1:] / 360.),
            paddle.sin(np.pi * 2 * input2[:, :, -1:] / 360.),
            paddle.cos(np.pi * 2 * input2[:, :, -1:] / 360.),
        ], axis=-1)  # 4+16+1+2+2 = 25
        x2 = self.bn2(x2)
        # pool each encoder: final hidden states of both directions + max/mean over time
        x1_lstm_out, (hidden, _) = self.lstm1(x1)
        x1 = paddle.concat([
            hidden[-2, :, :], hidden[-1, :, :],
            paddle.max(x1_lstm_out, axis=1), paddle.mean(x1_lstm_out, axis=1)
        ], axis=-1)
        x2_lstm_out, (hidden, _) = self.lstm2(x2)
        x2 = paddle.concat([
            hidden[-2, :, :], hidden[-1, :, :],
            paddle.max(x2_lstm_out, axis=1), paddle.mean(x2_lstm_out, axis=1)
        ], axis=-1)
        x = paddle.concat([x1, x2], axis=-1)
        x = self.mlp2(x)
        x = self.mlp_bn2(x)
        x = paddle.nn.ReLU()(x)
        # decoder: repeat the context vector for the 20 output steps (10 min at 30 s)
        x = paddle.stack([x] * 20, axis=1)
        x = self.lstm_out1(x)[0]
        x = self.lstm_out2(x)[0]
        x = self.lstm_out3(x)[0]
        x = self.lstm_out4(x)[0]
        x = self.output(x)
        output = self.sigmoid(x) * 2 - 1  # map predictions into (-1, 1)
        output = paddle.cast(output, dtype='float32')
        return output

The PaddlePaddle framework supports several ways of training a model. You can write the training loop and back-propagate gradients yourself, as in other deep learning frameworks, or use the highly encapsulated high-level API. To train with the high-level API we need the data generator and the model structure ready. A data generator is wrapped in PaddlePaddle as follows, and is quite efficient:

import paddle
from paddle.io import Dataset

class TrainDataset(Dataset):
    def __init__(self, x_train_array, x_train_array2, y_train_array=None, mode='train'):
        self.training_data = x_train_array.astype('float32')
        self.training_data2 = x_train_array2.astype('float32')
        self.mode = mode
        if self.mode == 'train':
            self.training_label = y_train_array.astype('float32')
        self.num_samples = self.training_data.shape[0]

    def __getitem__(self, idx):
        data = self.training_data[idx]
        data2 = self.training_data2[idx]
        if self.mode == 'train':
            label = self.training_label[idx]
            return [data, data2], label
        else:
            return [data, data2]

    def __len__(self):
        return self.num_samples
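The train_loader and valid_loader used below can then be built by wrapping this dataset in paddle.io.DataLoader; the batch size and array names here are illustrative:

import paddle

# x_train, x_train2, y_train (and the valid counterparts) are the prepared arrays
train_loader = paddle.io.DataLoader(
    TrainDataset(x_train, x_train2, y_train, mode='train'),
    batch_size=256, shuffle=True)
valid_loader = paddle.io.DataLoader(
    TrainDataset(x_valid, x_valid2, y_valid, mode='train'),
    batch_size=256, shuffle=False)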

With the generator in place, you can train directly through the fit interface:

# the inputs= specification and optimizer class were garbled in the original;
# Adam is a reasonable assumption here
model = paddle.Model(network())
model.prepare(
    optimizer=paddle.optimizer.Adam(parameters=model.parameters()),
    loss=paddle.nn.L1Loss(),
)
model.fit(
    train_data=train_loader,
    eval_data=valid_loader,
    epochs=10,
    verbose=1,
)

Pipeline optimization

We extract features from the data of different turbines in the same way, so we can use the Parallel utility from Python's joblib library to further optimize performance and speed up iteration. The core code is as follows:

from joblib import Parallel, delayed
from tqdm import tqdm

def generate_train_data(station, id):
    df = read_data(station, id)  # read_data and the feature extraction are defined elsewhere
    ...

train_data = []
for station in [1, 2]:
    train_data_tmp = Parallel(n_jobs=-1, verbose=1)(
        delayed(lambda x: generate_train_data(station, x))(id)
        for id in tqdm(range(25))
    )
    train_data = train_data + train_data_tmp

The efficiency gain is roughly proportional to the number of CPU cores. We used an 8-core CPU in the competition, so data generation was about 8 times faster.

Fitting the wind direction

The prediction targets of this competition are wind speed and wind direction. The wind direction is cyclic: with the angle normalized to [0, 1], θ and θ + 1 describe the same direction, i.e.

    sin(2πθ) = sin(2π(θ + 1)),  cos(2πθ) = cos(2π(θ + 1))

The evaluation function is MAE, so directly fitting the wind direction during training is problematic: 0 and 1 represent the same direction, and for a true direction of 0/1 the model tends to predict their mean, 0.5, producing a large error. We therefore convert the wind direction angle into its two orthogonal (sine and cosine) components and predict those instead, avoiding the problems caused by fitting the raw direction.
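A minimal sketch of this encoding and its inverse, assuming the direction θ has been normalized to [0, 1] (plain NumPy, not code from the competition kit):

import numpy as np

def encode_direction(theta):
    # map the cyclic direction onto the unit circle
    return np.sin(2 * np.pi * theta), np.cos(2 * np.pi * theta)

def decode_direction(s, c):
    # recover a normalized direction in [0, 1) from predicted components
    return (np.arctan2(s, c) / (2 * np.pi)) % 1.0

Directions 0 and 1 map to the same point on the circle, so the wrap-around no longer pulls predictions toward 0.5.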

Dealing with noise

After reaching first place on leaderboard A, we tried to deal with the noise in the data. Processing the input side is risky, since it can easily erase useful signal in the input features, so we chose to smooth the labels instead: we took a weighted average of the model's predictions and the original labels, then retrained on the smoothed labels, gaining 0.1 points on leaderboard A.
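A minimal sketch of this label smoothing, assuming predictions and labels are aligned arrays (the mixing weight is illustrative; the post does not state the value used):

import numpy as np

def smooth_labels(y_true, y_pred, alpha=0.8):
    # keep weight alpha on the original labels, the rest on the
    # model's smoother predictions, then retrain on the result
    return alpha * np.asarray(y_true) + (1 - alpha) * np.asarray(y_pred)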

Experimental results

The score is calculated as a function of the mean absolute error (MAE) between the predicted and true wind speed and direction. The experimental results are shown in the following table. It is not hard to see that the improvement in competition results comes mainly from the processing of the data and of the labels, which are also the two elements that deserve the most attention in modeling.
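For reference, a minimal sketch of the MAE on which the score is based (the full scoring formula from the competition is not reproduced here):

import numpy as np

def mae(y_true, y_pred):
    # mean absolute error over all predicted steps and targets
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))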

Final thoughts

In this industrial big data competition, we won second prize in both the wind condition prediction track and the heavy-parts demand prediction track. Through the competition we found that data quality in industrial scenarios is often far from ideal: missing values and noise need to be handled carefully. And in time series prediction tasks, the accumulated historical data may not cover future emergencies, so relying only on the model can produce large deviations; this is another issue that deserves special attention in modeling.

AI Studio project: aistudio.baidu.com/aistudio/pr…
PaddlePaddle: github.com/PaddlePaddl…
PaddleNLP: github.com/PaddlePaddl…

References

[1] Industrial Big Data Innovation Platform, www.industrial-bigdata.com/