In the previous article, “Selection of Image Similarity Algorithm for Mobile Terminal”, we tested three approaches to image similarity calculation: perceptual hashing, convolutional neural networks, and local invariant features. We found that similarity calculation based on local invariant features is more accurate than the traditional perceptual hash algorithm and handles rotation better than the convolutional neural network. We also compared in detail the retrieval efficiency of feature point detection with SIFT versus Hessian-Affine, and added a “minimum feature points” constraint. In the end we settled on “Hessian-Affine feature point detection + SIFT feature point description”, which gives an efficient retrieval algorithm that balances robustness to interference and retrieval efficiency.
In this article, we describe, from an engineering point of view, how we ported this algorithm to mobile devices and what we optimized for the mobile side so that video fingerprints can be generated on the device. The whole project is built on top of project [1]. We cover the following parts:
- Video frame extraction: turning the video similarity problem into an image similarity problem
- Porting the image feature extraction algorithm to the device: solving library dependencies and other problems
- Optimization for mobile: speed and package size optimization
- A Bloom Filter based retrieval system: testing video retrieval efficiency
- Pitfalls encountered
1. Video frame extraction
We extract multiple frames from the video and then use the feature point descriptions of those frames as the feature information of the whole video, turning the video fingerprint problem into an image feature extraction problem.
To reduce the amount of computation, we extract at most 10 key frames per video and use the Pearson correlation coefficient to discard frames that are too similar.
#include <math.h>

// Use the Pearson correlation coefficient to measure how similar two frames are
static float pearson_relative(unsigned char* inputX, unsigned char* inputY,
                              int length)
{
    double totalX = 0.0, totalY = 0.0;
    double totalMul = 0.0, totalSqrtX = 0.0, totalSqrtY = 0.0;
    for(int i = 0; i < length; ++i) {
        totalX += inputX[i];
        totalY += inputY[i];
    }
    totalX /= length;
    totalY /= length;
    for(int i = 0; i < length; ++i) {
        totalMul += (inputX[i] - totalX) * (inputY[i] - totalY);
        totalSqrtX += (inputX[i] - totalX) * (inputX[i] - totalX);
        totalSqrtY += (inputY[i] - totalY) * (inputY[i] - totalY);
    }
    return totalMul / (sqrt(totalSqrtX) * sqrt(totalSqrtY));
}
When the correlation is greater than 0.9, the two key frames are considered too similar and only one of them is taken.
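As a usage illustration, here is a minimal sketch of how the candidate frames could be filtered with pearson_relative (keepKeyFrames is a hypothetical helper written for this article, not code from project [1]; it assumes grayscale frame buffers of equal size and that pearson_relative is visible in the same file):

#include <vector>

// Hypothetical helper: keep at most maxFrames frames, dropping a candidate
// whenever its Pearson correlation with the previously kept frame exceeds 0.9.
static std::vector<unsigned char*> keepKeyFrames(
        const std::vector<unsigned char*>& candidates,
        int pixelCount, int maxFrames)
{
    std::vector<unsigned char*> kept;
    for (unsigned char* frame : candidates) {
        if ((int)kept.size() >= maxFrames)
            break;
        // Compare against the last kept frame only, as a cheap dedup pass
        if (!kept.empty() &&
            pearson_relative(kept.back(), frame, pixelCount) > 0.9f)
            continue;  // too similar to the previous key frame, skip it
        kept.push_back(frame);
    }
    return kept;
}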
2. Porting the image feature extraction algorithm to the device
After extracting key frames from the video, the next step is to extract features from them. We need to port the image feature extraction algorithm from project [1] to the device. The general flow of image feature extraction is: feature point detection, feature point description, point-indexed descriptor generation, and Hash binarization.
The corresponding code for feature processing is:
File:
indexer/global_descriptors/gdindex.cc
Function:
void generate_point_indexed_descriptor(const FeatureSet* feature_set,
const int verbose_level,
vector<uint>& feat_assgns,
vector<float>& feat_assgn_weights,
vector < vector < float > >& feat_residuals);
The corresponding code for Hash extraction is:
File:
bloom_filters/point_indexed/binarize_residuals.cc
Function:
void binarize_residual(const vector < vector < float > >& feat_residuals,
vector<uint>& feat_residuals_binarized)
Feature point detection and description:
As mentioned above, after testing we decided to use Hessian-Affine + SIFT, but project [1] only provides Linux and Mac executables for this step and no source code, so we needed to find the relevant code elsewhere.
2.1 Hessian-Affine + SIFT feature extraction source code
After searching online, we found the VISE[2] project, in which detect_points and compute_descriptors correspond to feature point detection and feature point description respectively, providing Hessian-Affine feature point detection and SIFT feature point description. So we extracted the relevant code from it and modified it as needed.
First, feature point detection:
File:
src/external/KMCode_relja/exec/detect_points/detect_points.cpp
Function:
void detect_points_hesaff(std::string jpg_filename, std::vector<ellipse>& regions)
Then, feature point description:
File:
src/external/KMCode_relja/exec/compute_descriptors/compute_descriptors.cpp
Function:
void compute_descriptors_sift(std::string jpg_filename,
std::vector<ellipse>& regions,
uint32_t& feat_count,
float scale_multiplier,
bool upright,
float *&descs )
Looking at the two functions, we can see that both read and parse the image from the jpg_filename input parameter, so this step can be merged and done once outside. Combined with the Hash extraction code of project [1], our code looks roughly like this:
DARY *image = new ImageContent(inputStr.c_str());
if(image == NULL || image->bValid == false)
return -1;
image->toGRAY();
image->char2float();
uint32_t numFeats;
std::vector<ellipse> regions;
float *descs;
vector< CornerDescriptor* > descriptors;
initPatchMask(PATCH_SIZE);
KM_detect_points::detect_points_hesaff(image, regions, 500);
KM_compute_descriptors::compute_descriptors_sift(image,
regions,
numFeats,
3.0,
false,
descs,
descriptors);
vector < uint > feat_assgns;
vector <uint> result_vector = generate_point_indexed_descriptor(descriptors, numFeats, feat_assgns);
The resulting feat_assgns and result_vector are the end product. (The result_vector here is the result of the binarize_residual function, which we merged into generate_point_indexed_descriptor)
2.2 libjpeg and BLAS library dependencies
During the port, we also encounter dependencies on a JPEG image parsing library and on the BLAS computing library.
JPEG library:
The VISE[2] project links directly against the system JPEG library, so we need a JPEG library that can be built for mobile as a replacement. Here we use libjpeg-turbo[3] to compile libraries for iOS and Android separately.
BLAS libraries:
Project [1] relies on the Yael sub-project, which in turn relies on the BLAS scientific computing library, so we also need to resolve the BLAS dependency on the device. Initially we found OpenBLAS[4], which can be built for mobile, but we later realized we could simply strip the BLAS dependency down and replace it with a single function. More on that below.
2.3 Compatibility of individual functions on different platforms
The fvec_new function in common/yael_v260_modif/yael/vector.c in project [1] needs to handle compatibility issues between platforms:
float *fvec_new (long n)
{
#if defined(__custom__linux__)
    float * ret = (float *) malloc (sizeof(*ret) * n);
#elif defined __IPHONE_OS_VERSION_MIN_REQUIRED
    float * ret = (float *) malloc (sizeof(*ret) * n);
#else
    float * ret = (float *) memalign (16, sizeof(*ret) * n);
#endif
    if (!ret) {
        fprintf (stderr, "fvec_new %ld : out of memory\n", n);
        abort ();
    }
    return ret;
}
3 Optimization for mobile terminals
3.1 Speed Optimization
3.1.1 Process optimization: optimize file read and write operations
In the original flow, the feature point descriptions are saved to a file with the .siftgeo suffix after they are generated, read back later for Hash extraction, and the results for a video's multiple frames are then saved to another file. On the device we can simply remove these IO operations and keep the data in memory.
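As a minimal sketch of the in-memory replacement (FrameFeatures and VideoFeatures are hypothetical structs made up for this article, not types from project [1] or VISE[2]): the per-frame descriptors simply stay in a container and are handed straight to the Hash-extraction step, so the .siftgeo write/read pair disappears.

#include <cstdint>
#include <vector>

// Hypothetical in-memory replacement for the per-frame .siftgeo file
struct FrameFeatures {
    uint32_t numFeats = 0;
    std::vector<float> descriptors;  // numFeats * 128 floats for SIFT
};

// Hypothetical container for a whole video: one entry per kept key frame,
// consumed directly by the Hash-extraction step instead of being re-read from disk
struct VideoFeatures {
    std::vector<FrameFeatures> frames;
};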
3.1.2 Feature point control: a maximum of 500 feature points per image
We found that many images yield thousands of feature points (some as many as 5000~6000), and the number of feature points directly affects on-device computation speed, so we need to limit how many feature points are detected. Weighing computation speed against accuracy, we cap the number at 500 feature points.
The number of detected feature points can be limited by modifying the harris_lap function in src/external/KMCode_relja/descriptor/CornerDetector.cpp.
With this cap in place, on-device computation is several times faster: in our tests on a phone, the processing time for a single image (originally about 1700 feature points) dropped from 50~60 seconds to about 16 seconds.
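The exact change inside harris_lap depends on how VISE[2] stores its interest points; as a hedged sketch of the idea only (CandidatePoint and its response field are hypothetical names), the strongest 500 responses are kept and the rest are dropped:

#include <algorithm>
#include <vector>

// Hypothetical point type; the VISE detector keeps its own structures.
struct CandidatePoint {
    float x, y;
    float response;  // Hessian/Harris response strength
};

static const size_t kMaxFeaturePoints = 500;

// Keep only the kMaxFeaturePoints points with the strongest responses.
static void capFeaturePoints(std::vector<CandidatePoint>& points)
{
    if (points.size() <= kMaxFeaturePoints)
        return;
    std::partial_sort(points.begin(),
                      points.begin() + kMaxFeaturePoints,
                      points.end(),
                      [](const CandidatePoint& a, const CandidatePoint& b) {
                          return a.response > b.response;
                      });
    points.resize(kMaxFeaturePoints);
}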
3.1.3 NEON instruction optimization
NEON instruction optimization can be enabled at compile time to speed up computation on the device. Tutorials are easy to find online, so we will not repeat them here.
3.1.4 Multithreading
A video yields multiple key frames, each of which needs image feature extraction, so we can open multiple threads at task granularity to speed up feature extraction for the whole video. We use eight threads, each processing one frame at a time.
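A minimal sketch of this task-level parallelism with std::thread; extractFrameFeatures stands in for the whole per-frame detect/describe/hash pipeline and is a hypothetical wrapper, not a function from project [1]:

#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical per-frame wrapper; its body would call detect_points_hesaff,
// compute_descriptors_sift and the Hash-extraction step for one key frame.
static void extractFrameFeatures(const unsigned char* frame,
                                 int width, int height,
                                 std::vector<float>& descriptorsOut)
{
    (void)frame; (void)width; (void)height; (void)descriptorsOut;
    // ... feature extraction for one key frame ...
}

static void extractVideoFeatures(const std::vector<const unsigned char*>& frames,
                                 int width, int height,
                                 std::vector<std::vector<float>>& results)
{
    results.resize(frames.size());
    const size_t kThreads = 8;  // eight workers, one frame each per batch
    for (size_t start = 0; start < frames.size(); start += kThreads) {
        const size_t end = std::min(frames.size(), start + kThreads);
        std::vector<std::thread> workers;
        for (size_t i = start; i < end; ++i)
            workers.emplace_back(extractFrameFeatures, frames[i],
                                 width, height, std::ref(results[i]));
        for (auto& w : workers)
            w.join();  // wait for this batch before launching the next
    }
}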
3.1.5 Partial algorithm optimization
The file src/external/KMCode_relja/gauss_iir/gauss_iir.cpp in the VISE[2] project contains a lot of Gaussian blur convolutions. When actually running it, we found that some of these functions often compute with many zero values, so that part of the work can be skipped to speed things up.
There are also many operations of this form:
rpix[r][c] = 0.030 * pix[r-3][c] + 0.105 * pix[r-2][c] +
             0.222 * pix[r-1][c] + 0.286 * pix[r][c] + 0.222 * pix[r+1][c] +
             0.105 * pix[r+2][c] + 0.030 * pix[r+3][c];
Using the distributive law to factor out the shared coefficients, this becomes (four multiplications per pixel instead of seven):
rpix[r][c] = 0.030 * (pix[r-3][c] + pix[r+3][c]) + 0.105 * (pix[r-2][c] + pix[r+2][c]) +
             0.222 * (pix[r-1][c] + pix[r+1][c]) + 0.286 * pix[r][c];
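This symmetric form also lends itself to the NEON vectorization mentioned in 3.1.3. Purely as a hedged illustration (it assumes each row of pix and rpix is a contiguous float array, which may not match how VISE[2] actually lays out its image buffers), a hand-written intrinsics version for four adjacent columns could look like this:

#include <arm_neon.h>

// Process 4 adjacent columns of one output row with the symmetric 7-tap kernel.
// pix and rpix are assumed to be row-contiguous float images.
static inline void blur_row4(const float* const* pix, float* const* rpix,
                             int r, int c)
{
    float32x4_t acc = vmulq_n_f32(vld1q_f32(&pix[r][c]), 0.286f);
    acc = vmlaq_n_f32(acc,
                      vaddq_f32(vld1q_f32(&pix[r - 1][c]), vld1q_f32(&pix[r + 1][c])),
                      0.222f);
    acc = vmlaq_n_f32(acc,
                      vaddq_f32(vld1q_f32(&pix[r - 2][c]), vld1q_f32(&pix[r + 2][c])),
                      0.105f);
    acc = vmlaq_n_f32(acc,
                      vaddq_f32(vld1q_f32(&pix[r - 3][c]), vld1q_f32(&pix[r + 3][c])),
                      0.030f);
    vst1q_f32(&rpix[r][c], acc);
}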
3.1.6 Compiling parameter files into the source
Some parameters produced during training are stored as binary files in the videosearch/indexer/global_descriptors/trained_parameters/ directory. To reduce IO operations, we converted the parameter files we use into header files. We use three parameter files:
sift.pre_alpha.0.50.desc_covariance
sift.pre_alpha.0.50.desc_eigenvectors
sift.pre_alpha.0.50.pca.32.gmm.512
After modification, it looks like this:
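The original post showed the generated header at this point. As a hedged illustration of the shape only (the symbol name and bytes below are made up, not the real contents of the trained parameter files), the header is essentially the binary file dumped into a byte array, so loading it becomes a pointer lookup instead of an fread:

// sift_pre_alpha_0_50_desc_covariance.h -- hypothetical generated header,
// produced by dumping the binary parameter file as a byte array.
#pragma once
#include <cstddef>

static const unsigned char kSiftDescCovariance[] = {
    0x3d, 0x80, 0x12, 0xa4, 0x3e, 0x01, 0x5c, 0x7f,  // ... truncated ...
};
static const size_t kSiftDescCovarianceSize = sizeof(kSiftDescCovariance);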
3.2 Package size optimization
3.2.1 Yael project simplification: an sgemm function replaces the OpenBLAS library
As mentioned above, we initially used OpenBLAS[4] as the on-device scientific computing library. It turned out, however, that the only function we actually use from it is the matrix multiplication routine sgemm, so we implemented our own sgemm to replace the whole of OpenBLAS and reduce the package size.
void sgemm(char *transa, char *transb, const int M, const int N,
const int K, float alpha, const float* a, int lda,
const float* b, int ldb, float beta, float* c, int ldc)
{
// Define the sizes of arrays A, B, and C
const int area_a = M * K;
const int area_b = N * K;
const int area_c = M * N;
// Allocate the size of the transposed array
float input_a[M][K];
float input_b[K][N];
// transpose if a matrix inputs 'T', otherwise unchanged
if(M == lda) {
for(int i = 0; i < M; ++i) {
for(int j = 0; j < K; ++j)
input_a[i][j] = a[j * M + i];
}
} else {
int cnt = 0;
for(int i = 0; i < M; ++i)
for(int j = 0; j < K; ++j)
input_a[i][j] = a[cnt++];
}
// transpose if b matrix input 'T', otherwise unchanged
if(K == ldb) {
for(int i = 0; i < K; ++i) {
for(int j = 0; j < N; ++j)
input_b[i][j] = b[j * K + i];
}
} else {
int cnt = 0;
for(int i = 0; i < K; ++i)
for(int j = 0; j < N; ++j)
input_b[i][j] = b[cnt++];
}
float sum;
for(int m = 0; m < M; ++m)
for(int n = 0; n < N; ++n) {
sum = 0;
for(int k = 0; k < K; ++k)
sum += input_a[m][k] * input_b[k][n];
const int index = n * M + m;
c[index] = alpha * sum + beta * c[index];
}
}
3.2.2 Deleting Unnecessary Files
We only use VISE[2] for feature point detection and description, and project [1] for Hash generation; both contain a lot of code we do not need. So we started from the functions we need, extracted only the code they depend on, and cut out the useless code to reduce our package size. As many files are involved, we will not elaborate here.
4 Retrieval system based on Bloom Filter
After the whole pipeline runs on the device, we need to test video retrieval efficiency. The retrieval system is based on the videosearch/bloom_filters/ part of project [1].
The principles of video fingerprint storage and retrieval are as follows:
The video fingerprint library is a three-dimensional list, which we will call A.
When storing video fingerprints:
Suppose a video has id 100, and the Hash of the feature points of each extracted key frame has been computed (each point corresponds to two ints: a Hash function ID from 0 to 512, and a Hash value). Take the first feature point of frame 0, and assume its Hash function ID is 2 and its Hash value is 156324. To store this feature point in the video fingerprint library, we use Hash function 2 to hash “156324”; suppose the result is 1465879456. We then find the list A[2][1465879456] and append “100” (the video ID) to it.
When retrieving a video:
Suppose a feature point of the query video is also represented as 2 and 156324. The list A[2][1465879456] can then be found in the same way. Now suppose this list contains two values, 100 and 3: the videos with IDs 100 and 3 each have a feature point matching the one currently being queried, so we add a certain “score” to both videos. After traversing all feature points of the query video, its degree of match with every other video is given by the accumulated scores.
The size of the score added for each matched feature point is determined by TF-IDF (Term Frequency-Inverse Document Frequency). The basic idea is that the more often a feature point appears in a given video, and the fewer times it appears across all videos, the more representative it is of that video.
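Putting the storage and retrieval steps together, here is a simplified sketch of the structure described above (an illustration only, not the actual bloom_filters code from project [1]; bucketOf stands in for the per-function hashing step, and the TF-IDF weight is replaced by a constant):

#include <cstdint>
#include <map>
#include <unordered_map>
#include <utility>
#include <vector>

// table_[hashFuncId][bucket] -> list of video ids, mirroring the 3-D list A described above.
class VideoFingerprintIndex {
public:
    // Store one feature point of videoId, already expressed as (hashFuncId, hashValue).
    void add(int videoId, int hashFuncId, uint32_t hashValue) {
        table_[hashFuncId][bucketOf(hashFuncId, hashValue)].push_back(videoId);
    }

    // Retrieve: accumulate a score for every video sharing a bucket with the
    // query's feature points. weight would come from TF-IDF in the real system.
    std::map<int, float> query(
            const std::vector<std::pair<int, uint32_t>>& queryPoints,
            float weight = 1.0f) const {
        std::map<int, float> scores;
        for (const auto& p : queryPoints) {
            auto func = table_.find(p.first);
            if (func == table_.end()) continue;
            auto bucket = func->second.find(bucketOf(p.first, p.second));
            if (bucket == func->second.end()) continue;
            for (int videoId : bucket->second)
                scores[videoId] += weight;
        }
        return scores;
    }

private:
    // Placeholder for the per-function hashing step (e.g. function 2 turning
    // 156324 into 1465879456 in the example above).
    static uint32_t bucketOf(int hashFuncId, uint32_t hashValue) {
        return hashValue * 2654435761u + static_cast<uint32_t>(hashFuncId);
    }

    // hash function id -> (bucket -> video ids)
    std::unordered_map<int, std::unordered_map<uint32_t, std::vector<int>>> table_;
};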
Test results:
We used 959 videos as the video library (originally 1000 videos, but some duplicate or near-duplicate videos were removed manually), selected 283 of them as query videos, and obtained the following retrieval results:
Recall: 0.978799
Precision: 0.975352
F1 score (2PR/(P+R)): 0.977
5 Pitfalls encountered
PatchMask initialization problem
During testing, we found that the fingerprint of a video computed for the first time was always inconsistent with the fingerprints computed afterwards. It turned out that initPatchMask was called too late: src/external/KMCode_relja/descriptor/Corner.cpp has a global patch_mask, and on its first use initPatchMask had not yet been executed, so the first result was wrong. Subsequent runs were correct because the mask had already been initialized. We fixed this by moving the call to initPatchMask earlier.
6 Summary
Due to the gap in computing power, implementing video fingerprints on the device requires solving both computation speed and package size, which means striking a balance between computation and accuracy. We have made various optimizations for mobile, but there is still plenty of room to explore. For example, we could try OpenCL to speed up computation, further simplify parts of the pipeline (such as removing the SVD singular value decomposition after feature extraction), and on the algorithm side also consider combining audio information to represent the video more completely. Discussion is welcome.
References
[1] https://github.com/andrefaraujo/videosearch
[2] https://github.com/ox-vgg/vise
[3] https://github.com/libjpeg-turbo/libjpeg-turbo