1. Garbage classification
How to maximize the use of garbage as a resource, reduce the amount of garbage sent for disposal, and improve the quality of the living environment through waste-sorting management is an urgent issue for countries around the world. According to the unified national standard, household waste is now widely divided into four categories: recyclables, kitchen waste, hazardous waste and other waste. Recyclables are wastes suitable for recycling and resource recovery, mainly waste paper, plastic, glass, metal and cloth; they are collected in blue containers and recycled through comprehensive treatment. Kitchen waste includes food waste such as leftovers, bones, vegetable roots and leaves, and fruit peels, and is collected in green containers. With the development of deep learning, and in order to identify and classify household garbage simply and efficiently, this paper implements a waste classification and recognition method based on a convolutional neural network (CNN). The method requires only simple image preprocessing: the CNN automatically extracts image features, and its pooling operations reduce the number of parameters and the computational complexity. The experimental results show that the CNN overcomes the disadvantages of traditional image classification algorithms; more complex models, such as VGG or architectures using global pooling, are considered even more effective and are left for further study.
2. Convolutional neural network (CNN)
Convolutional Neural Networks (CNNs/ConvNets) are very similar to ordinary neural networks: both consist of neurons with learnable weights and biases. Each neuron receives some inputs and performs a dot product, the final output is a score for each category, and most of the computational tricks developed for ordinary neural networks still apply here.
So what is different? A convolutional neural network assumes by default that its input is an image, which allows us to encode specific properties into the network structure, making the feedforward function more efficient and greatly reducing the number of parameters.
Convolutional neural networks exploit the fact that the input is an image. Neurons are arranged in three dimensions: width, height, and depth. For example, if the input image has size 32×32×3 (RGB), then the input layer also has dimensions 32×32×3. The following figures illustrate the difference:
Traditional neural network
Convolutional neural network
A convolutional neural network consists of many layers. Each layer takes a three-dimensional volume as input and produces a three-dimensional volume as output; some layers have parameters and some do not.
Layers used to build ConvNets
A convolutional neural network usually contains the following layers:
- Convolutional layer: each convolutional layer consists of several convolutional units (filters), whose parameters are optimized by the back-propagation algorithm. The purpose of the convolution operation is to extract different features of the input. The first convolutional layer may only extract low-level features such as edges, lines and corners, while deeper networks can iteratively extract more complex features from these low-level ones.
- ReLU layer (Rectified Linear Units): applies the activation function f(x) = max(0, x) to the neurons of this layer.
- Pooling layer: the features obtained after a convolutional layer usually have a large spatial size. They are cut into several regions and the maximum or average value of each region is taken, producing new features with a smaller spatial size.
- Fully connected layer: combines all local features into global features and computes the final score for each category.
An example arrangement of convolutional neural network layers:
Convolutional Layer
Local Connectivity
An ordinary neural network uses a "fully connected" design between the input layer and the hidden layer. From a computational point of view, learning features over the whole image is feasible for relatively small images. For larger images (such as 96×96), however, learning features of the entire image with such a fully connected network becomes computationally expensive: roughly 10^4 (96 × 96 = 9,216) input units are needed, and assuming 100 features are to be learned, there are on the order of 10^6 parameters to learn. Forward propagation and back-propagation on a 96×96 image are also roughly 10^2 (= 100) times slower than on a 28×28 patch.
A simple way for the convolutional layer to address this problem is to restrict the connections between hidden units and input units: each hidden unit is connected to only a part of the input. For example, each hidden unit may connect only to a small contiguous region of the input image. For inputs other than images, there are also natural choices of "connected regions" for a single hidden unit; if audio is the input, for instance, a hidden unit might be connected only to the input units corresponding to a certain time window of the signal.
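To make the parameter counts above concrete, here is a minimal Python sketch; the 8×8 receptive field used for the locally connected case is an assumption chosen only for illustration (the text above does not fix a patch size):

```python
# Fully connected vs. locally connected parameter counts for a 96x96 input
# with 100 hidden units. The 8x8 receptive field is an assumption used only
# for illustration.
input_units = 96 * 96          # ~10^4 input units
n_hidden = 100

fully_connected = input_units * n_hidden   # every hidden unit sees every pixel
locally_connected = 8 * 8 * n_hidden       # every hidden unit sees one 8x8 patch

print(fully_connected)    # 921600  (~10^6 parameters, as stated above)
print(locally_connected)  # 6400
```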
The size of the input region connected to each hidden unit is called the receptive field of the neuron.
Since the neurons of a convolutional layer are also arranged in three dimensions, they have a depth as well. The parameters of a convolutional layer consist of a series of filters; each filter produces one depth slice of the output, so the number of filters determines the depth of the output volume.
As shown in the figure below, the input volume has size 32×32×3 and the depth of the output volume is 5. Output units at the same position but different depths are connected to the same region of the input image, but with different parameters (filters).
Although each output unit is connected to only a part of the input, its value is computed in the same way as before: a dot product of the weights and the inputs, plus a bias. This is the same as in an ordinary neural network, as shown in the figure below:
Spatial Arrangement
The size of the output volume is controlled by three hyperparameters: depth, stride and zero-padding.
- Depth: as the name implies, it controls the depth of the output volume, i.e. the number of filters, which equals the number of neurons connected to the same input region. This set of neurons is also called a depth column.
- Stride: controls the distance, at the same depth, between the input regions connected to two adjacent hidden units. A small stride (e.g. stride = 1) makes the input regions of adjacent units overlap heavily; a large stride reduces the overlap.
- Zero-padding: pads the border of the input with zeros, which changes the overall input size and thereby lets us control the spatial size of the output volume.
Let us define a few symbols:
- W: input volume size (width or height)
- F: receptive field size of the neurons
- S: stride
- P: amount of zero-padding
- K: depth of the output volume (number of filters)
The number of hidden units along one dimension (width or height) of the output volume can then be calculated as (W − F + 2P)/S + 1.
If this expression is not an integer, the current hyperparameters do not fit the input: the stride is set incorrectly or more zero-padding is needed. The proof is omitted; here is an example to illustrate.
This is a one-dimensional example. The model on the left has 5 input units (W = 5), the border is padded with one zero (P = 1), the stride is 1 (S = 1), and the receptive field is 3 (F = 3), since each output hidden unit is connected to 3 input units. The number of output hidden units is therefore (5 − 3 + 2·1)/1 + 1 = 5, which agrees with the figure. In the model on the right the stride is changed to 2 and everything else stays the same, so the output size is (5 − 3 + 2·1)/2 + 1 = 3, which also agrees with the figure. If the stride were changed to 3, the formula would not divide evenly, showing that a stride of 3 does not fit the input size.
In addition, the network weights are shown in the upper right corner of the figure; the outputs themselves are computed in the same way as in an ordinary neural network.
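These numbers are easy to check in code. The helper below is a minimal sketch of the formula (the function name conv_output_size is introduced here only for illustration):

```python
def conv_output_size(W, F, S, P):
    """Spatial output size of a convolution: (W - F + 2P) / S + 1.

    Raises an error when the hyperparameters do not fit the input.
    """
    numerator = W - F + 2 * P
    if numerator % S != 0:
        raise ValueError(f"stride {S} does not fit input size {W}")
    return numerator // S + 1

print(conv_output_size(W=5, F=3, S=1, P=1))  # 5, as in the left model
print(conv_output_size(W=5, F=3, S=2, P=1))  # 3, as in the right model
# conv_output_size(W=5, F=3, S=3, P=1) would raise an error: stride 3 does not fit
```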
Parameter Sharing
The number of parameters can be drastically reduced by parameter sharing, which is based on the assumption that if a feature is useful at one position (x1, y1) in the image, it should also be useful at another position (x2, y2). In other words, if we call a plane of neurons at the same depth a depth slice (e.g. a volume of size [55×55×96] has 96 depth slices, each of size [55×55]), then all neurons in the same slice share the same weights and bias. These weights can still be learned by gradient descent with only a small change to the original algorithm: the gradient of a shared weight is the sum of the gradients over all the positions that share it.
Why share weights? On the one hand, a repeated unit can recognize a feature regardless of where it appears in the visual field. On the other hand, weight sharing makes feature extraction much more efficient, because it greatly reduces the number of free parameters to learn. By controlling the size of the model in this way, convolutional networks generalize well on vision problems.
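To see the saving concretely, here is a minimal sketch for the [55×55×96] volume mentioned above; the 11×11 filter over a 3-channel input is an assumption made only for illustration (the filter size is not stated in the text):

```python
# Parameters of one convolutional layer with output volume 55x55x96.
# A filter size of 11x11 over a 3-channel input is assumed for illustration.
out_w, out_h, depth = 55, 55, 96
f, in_channels = 11, 3

without_sharing = out_w * out_h * depth * (f * f * in_channels)  # every neuron has its own weights
with_sharing = depth * (f * f * in_channels)                      # one filter per depth slice

print(without_sharing)  # 105415200 weights
print(with_sharing)     # 34848 weights (plus 96 biases)
```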
Convolution
If parameter sharing is applied, the computation in each layer is essentially a convolution of the input with the weights! That is where the convolutional neural network gets its name.
Set convolution aside for a moment and, for simplicity, consider a 5×5 image and a 3×3 convolution kernel. The kernel has nine parameters. In this case the convolution produces nine neurons, whose outputs form a 3×3 matrix called the feature map. The first neuron is connected to the first 3×3 patch of the image, the second neuron to the next patch (note the overlap, just as your gaze sweeps continuously across a scene). The details are shown in the figure below.
The top of the figure is the output of the first neuron, and the bottom is the output of the second neuron. Each neuron still computes a weighted sum of its inputs plus a bias, $\sum_{u,v} w_{uv}\, x_{uv} + b$, followed by the activation function.
Note that although we habitually write this sum with matching indices, $w_{uv}\, x_{uv}$, what is actually used here for the neuron whose window starts at position $(i, j)$ is $\sum_{u,v} w_{uv}\, x_{i+u,\, j+v}$, because the window slides across the image.
Now recall the discrete convolution operation. Suppose $f$ and $g$ are two-dimensional discrete functions; their convolution is defined as
$$ (f * g)(m, n) = \sum_{u}\sum_{v} f(u, v)\, g(m - u,\, n - v). $$
Now there you have it! Once all 9 neurons in the example above have produced their outputs, the result is equivalent to a convolution of the image with the convolution kernel!
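A minimal numpy sketch of this 5×5-image, 3×3-kernel example (the pixel and kernel values are arbitrary, chosen only so the snippet runs):

```python
import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)   # arbitrary 5x5 "image"
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])                  # arbitrary 3x3 kernel
b = 0.0

# Slide the 3x3 window over the 5x5 image: 3x3 output, one value per neuron.
feature_map = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        patch = image[i:i + 3, j:j + 3]
        feature_map[i, j] = np.sum(kernel * patch) + b

print(feature_map)   # the 3x3 feature map
```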
Numpy examples
The numpy code below illustrates the concepts and formulas above.
Assume the input is stored in a numpy array X. Then: the depth column at position (x, y) is X[x, y, :], and the depth slice at depth d is X[:, :, d].
Assume X has shape X.shape == (11, 11, 4), no zero-padding is used (P = 0), the filter (receptive field) size is F = 5, and the stride is S = 2. The spatial size of the output volume is then (11 − 5)/2 + 1 = 4, i.e. both its width and its height are 4. If the output volume is stored in V, its first depth slice (using a filter W0 and bias b0) would be computed as follows:
V[0,0,0] = np.sum(X[:5, :5, :] * W0) + b0
V[1,0,0] = np.sum(X[2:7, :5, :] * W0) + b0
V[2,0,0] = np.sum(X[4:9, :5, :] * W0) + b0
V[3,0,0] = np.sum(X[6:11, :5, :] * W0) + b0

The second depth slice is computed in the same way, using a second filter W1 and bias b1:

V[0,0,1] = np.sum(X[:5, :5, :] * W1) + b1
V[1,0,1] = np.sum(X[2:7, :5, :] * W1) + b1
V[2,0,1] = np.sum(X[4:9, :5, :] * W1) + b1
V[3,0,1] = np.sum(X[6:11, :5, :] * W1) + b1
V[0,1,1] = np.sum(X[:5, 2:7, :] * W1) + b1    (moving along the y direction)
V[2,3,1] = np.sum(X[4:9, 6:11, :] * W1) + b1   (moving along both directions)
Note that in numpy, * denotes element-wise multiplication of two arrays.
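The hand-written assignments above generalize to a pair of loops. The following is a minimal sketch using the same symbols X, V, W0/W1, b0/b1; the random values exist only so the snippet runs on its own:

```python
import numpy as np

X = np.random.randn(11, 11, 4)          # input volume, as in the example
filters = [np.random.randn(5, 5, 4),    # W0
           np.random.randn(5, 5, 4)]    # W1
biases = [0.1, -0.2]                    # b0, b1
S = 2                                   # stride
out = (11 - 5) // S + 1                 # = 4

V = np.zeros((out, out, len(filters)))
for d, (W, b) in enumerate(zip(filters, biases)):
    for i in range(out):                # x direction
        for j in range(out):            # y direction
            patch = X[i*S:i*S + 5, j*S:j*S + 5, :]
            V[i, j, d] = np.sum(patch * W) + b

print(V.shape)  # (4, 4, 2)
```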
Summary of Convolution Layer
- Accepts an input volume of size W1 × H1 × D1.
- Requires four hyperparameters:
  - the number of filters K,
  - their spatial extent F,
  - the stride S,
  - the amount of zero padding P.
- Produces an output volume of size W2 × H2 × D2, where:
  - W2 = (W1 − F + 2P)/S + 1
  - H2 = (H1 − F + 2P)/S + 1
  - D2 = K
- With weight sharing, each filter has F · F · D1 weights, for a total of F · F · D1 · K weights and K biases.
- In the output volume, the d-th depth slice is the result of convolving the d-th filter with the input volume (using stride S) and then adding the d-th bias.
Pooling Layer
Pooling (downsampling) is intended to reduce the size of the feature maps. The pooling operation is applied independently to each depth slice, generally over 2×2 windows. In contrast to the convolution performed by the convolutional layer, a pooling layer usually performs one of the following operations:
- Max pooling: take the maximum of the four values. This is the most common pooling method.
- Mean pooling: take the mean of the four values.
- Gaussian pooling: borrows from Gaussian blurring; not commonly used.
- Trainable pooling: a trainable function f takes the 4 values as input and outputs 1 value; not commonly used.
The most common pooling layer uses a 2×2 window with stride 2, downsampling each depth slice of the input. Each MAX operation is taken over four numbers, as shown below:
The pooling operation preserves the depth of the volume.
If the input size of the pooling layer is not a multiple of 2, it is usually padded with zeros (zero-padding) up to a multiple of 2 and then pooled.
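A minimal numpy sketch of 2×2 max pooling with stride 2 on a single depth slice (the input values are arbitrary):

```python
import numpy as np

x = np.arange(16, dtype=float).reshape(4, 4)   # one 4x4 depth slice

# 2x2 max pooling with stride 2: reshape into 2x2 blocks and take the max of each.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))

print(x)
print(pooled)   # 2x2 output, each value is the max of one 2x2 block
```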
Pooling Layer Summary
- Accepts an input volume of size W1 × H1 × D1.
- Requires two hyperparameters:
  - their spatial extent F,
  - the stride S.
- Produces an output volume of size W2 × H2 × D2, where:
  - W2 = (W1 − F)/S + 1
  - H2 = (H1 − F)/S + 1
  - D2 = D1
- Introduces no new weights, since it computes a fixed function of the input.
Fully connected layer
A fully connected layer and a convolutional layer can be converted into each other:
- Any convolutional layer can be turned into a fully connected layer whose weight matrix is huge and mostly zero except for certain blocks (because of local connectivity), with many blocks sharing identical weights (because of weight sharing).
- Conversely, any fully connected layer can be turned into a convolutional layer. We simply set the filter size to be exactly the size of the entire input volume, so the output has spatial size 1×1.
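A minimal numpy sketch of the second direction, with sizes chosen arbitrarily for illustration: a convolution whose filter covers the entire input volume produces exactly the same number as a fully connected neuron applied to the flattened input.

```python
import numpy as np

x = np.random.randn(7, 7, 32)          # an arbitrary input volume
w_fc = np.random.randn(7 * 7 * 32)     # weights of one fully connected neuron
b = 0.5

fc_out = np.dot(w_fc, x.ravel()) + b   # fully connected computation

# The same neuron expressed as a convolution whose filter is the whole input:
w_conv = w_fc.reshape(7, 7, 32)
conv_out = np.sum(w_conv * x) + b      # a single 1x1 output position

print(np.isclose(fc_out, conv_out))    # True
```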
Convolutional neural network architecture
Layer Patterns
The common convolutional neural network architecture is as follows:
INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
That is, stack a few convolution and rectification (RELU) layers, follow them with a pooling layer, and repeat this pattern until the image has been reduced to a small spatial size; then use fully connected layers to produce the output.
In the expression above, POOL? denotes an optional pooling layer, and N >= 0 (with N <= 3), M >= 0, K >= 0 (with K < 3).
For example, the following patterns are all possible:
- INPUT -> FC: implements a linear classifier (here N = M = K = 0).
- INPUT -> CONV -> RELU -> FC
- INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC: a single CONV layer between every POOL layer.
- INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC: two CONV layers stacked before every POOL layer. This is generally a good idea for larger and deeper networks, because multiple stacked CONV layers can develop more complex features of the input volume before the destructive pooling operation.
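As a minimal sketch only (PyTorch is used here purely for illustration; the implementation in the code section below is MATLAB), the pattern INPUT -> [[CONV -> RELU]*2 -> POOL]*2 -> [FC -> RELU] -> FC might look like this for a hypothetical 32×32 RGB input and the four household-waste categories:

```python
import torch
import torch.nn as nn

# Hypothetical instance of INPUT -> [[CONV -> RELU]*2 -> POOL]*2 -> [FC -> RELU] -> FC
# for a 32x32 RGB input and 4 output classes (both are assumptions for illustration).
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                        # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                        # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 64), nn.ReLU(),   # FC -> RELU
    nn.Linear(64, 4),                       # final FC: one score per class
)

scores = model(torch.randn(1, 3, 32, 32))   # a random batch of one image
print(scores.shape)                         # torch.Size([1, 4])
```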
Layer Sizing Patterns
- Input layer: the side length should be an integer power of 2 (or at least divisible by 2 many times), e.g. 32, 64, 128.
- Conv layer: use small filters (e.g. 3×3 or at most 5×5), a stride of S = 1, and zero-pad the border so that the convolution does not change the spatial size of the input (for F = 3, use P = 1; in general P = (F − 1)/2 preserves the size). Larger filters (such as 7×7) are generally only seen in the first convolutional layer, next to the original input image.
- Pool layer: the most common setting is max pooling with a 2×2 receptive field (F = 2) and a stride of 2 (S = 2).
3. Part of the code
function varargout = cnnMain(varargin)
gui_Singleton = 1;
gui_State = struct('gui_Name',       mfilename, ...
                   'gui_Singleton',  gui_Singleton, ...
                   'gui_OpeningFcn', @cnnMain_OpeningFcn, ...
                   'gui_OutputFcn',  @cnnMain_OutputFcn, ...
                   'gui_LayoutFcn',  [], ...
                   'gui_Callback',   []);
if nargin && ischar(varargin{1})
    gui_State.gui_Callback = str2func(varargin{1});
end
if nargout
    [varargout{1:nargout}] = gui_mainfcn(gui_State, varargin{:});
else
    gui_mainfcn(gui_State, varargin{:});
end

function cnnMain_OpeningFcn(hObject, eventdata, handles, varargin)
handles.output = hObject;
guidata(hObject, handles);
movegui(hObject, 'center');

function varargout = cnnMain_OutputFcn(hObject, eventdata, handles)
varargout{1} = handles.output;

% --- Locate the plate region and draw its bounding box.
function LPBut_Callback(hObject, eventdata, handles)
[img_lp, PL] = LPLocation(handles.img_rgb);
axes(handles.axes1);
hold on;
row = PL.row;
col = PL.col;
plot([col(1) col(2)], [row(1) row(1)], 'g-', 'LineWidth', 3);
plot([col(1) col(2)], [row(2) row(2)], 'g-', 'LineWidth', 3);
plot([col(1) col(1)], [row(1) row(2)], 'g-', 'LineWidth', 3);
plot([col(2) col(2)], [row(1) row(2)], 'g-', 'LineWidth', 3);
hold off;
axes(handles.axes2);
imshow(img_lp);
title('Location map', 'FontWeight', 'Bold');
handles.img_lp = img_lp;
guidata(hObject, handles);

% --------------------------------------------------------------------
function openFile_Callback(hObject, eventdata, handles)
[uuu, vvv] = uigetfile({'*.jpg; *.tif; *.png; *.gif; *.BMP; *.JPEG', 'All Image Files'}, ...
                       'MultiSelect', 'on');    % get a license plate photo
path = strcat(vvv, uuu);
img_rgb = imread(path);
img_rgb = imresize(img_rgb, [240 320]);
axes(handles.axes1);
imshow(img_rgb);
title('Original image', 'FontWeight', 'Bold');
handles.img_rgb = img_rgb;
guidata(hObject, handles);

% --- Segment the located region into individual characters.
function FGBut_Callback(hObject, eventdata, handles)
functionPath = pwd;                                     % save the current path
dataPath = strcat(functionPath, '\cnnData');            % data path
cnnToolPath = strcat(functionPath, '\CNN\DeepLearnToolbox_CNN_lzbV3.0');
addpath(functionPath)
addpath(dataPath)
addpath(cnnToolPath)
[LP_word] = LPWordDivide(handles.img_lp);
figure
for i = 1:7
    subplot(1, 7, i)
    imshow(LP_word(:, :, i));
end
handles.LP_word = LP_word;
guidata(hObject, handles);

% --- Executes on button press in CNNbut.
function CNNbut_Callback(hObject, eventdata, handles)
functionPath = pwd;                                     % save the current path
dataPath = strcat(functionPath, '\cnnData');            % data path
cnnToolPath = strcat(functionPath, '\CNN\DeepLearnToolbox_CNN_lzbV3.0');
addpath(functionPath)
addpath(dataPath)
addpath(cnnToolPath)
if (~exist('trainData.mat', 'file') || ~exist('testData.mat', 'file'))
    % trainData/testData not found: build them automatically (saved in the data folder)
    [train_x, train_y, test_x, test_y] = dataSet();
else
    load trainData
    load testData
end
if (~exist('net.mat', 'file'))
    % no trained network found: train one (it is saved automatically in the data folder)
    cd(functionPath)
    [net, err, bad] = LPNetTrain(train_x, train_y, test_x, test_y);
else
    load net5_3_1.mat
    cd(functionPath)
end
LP = cnnff(net, handles.LP_word);           % forward pass of the CNN on the segmented characters
lplabel = LP.y;                             % predicted labels
[word, position] = label2Word(lplabel);     % convert labels to characters
set(handles.AA, 'String', word);            % show the recognition result

% --- Executes on button press in pushbutton4.
function pushbutton4_Callback(hObject, eventdata, handles)
img_gray = rgb2gray(handles.img_lp);
axes(handles.axes4);
imshow(img_gray);
title('Grayscale image', 'FontWeight', 'Bold');

% --- Executes on button press in pushbutton5.
function pushbutton5_Callback(hObject, eventdata, handles)
se = [1 1];
img_bimr = imerode(handles.img_rgb, se);
se = strel('rectangle', [1, 5]);
img_bimr2 = imdilate(img_bimr, se);
axes(handles.axes5);
imshow(img_bimr2);
title('Erode and dilate', 'FontWeight', 'Bold');
4. Operation results