
Abstract

Language knowledge is of great benefit to scene text recognition. However, how to effectively model language rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capability of language models stems from: 1) implicit language modeling; 2) unidirectional feature representation; and 3) language models with noisy input. Accordingly, we propose ABINet, an autonomous, bidirectional and iterative network for scene text recognition. First, the autonomy principle blocks the gradient flow between the visual and language models to enforce explicit language modeling. Second, a novel bidirectional cloze network (BCN) based on bidirectional feature representation is proposed as the language model. Third, we propose an iterative correction scheme for the language model, which effectively mitigates the impact of noisy input. In addition, based on the ensemble of iterative predictions, we propose a self-training method that can effectively learn from unlabeled images. Extensive experiments show that ABINet is superior on low-quality images and achieves state-of-the-art results on several mainstream benchmarks. Moreover, ABINet trained with ensemble self-training shows promising improvement toward human-level recognition. The code is available at github.com/FangShanche…

1. Introduction

The ability to read text from scene images is essential for artificial intelligence [24, 41]. To this end, early attempts treated characters as meaningless symbols and recognized these symbols with classification models [42, 15]. However, in challenging environments such as occlusion, blur and noise, characters become ambiguous when judged by visual discrimination alone. Fortunately, as text carries rich linguistic information, characters can be inferred from their context. As a result, a number of methods [16, 14, 29] have turned their attention to language modeling and achieved undoubted improvements.

However, how to effectively model the language behavior in human reading is still an open question. From psychological observations, we can make three assumptions about human reading, namely that language modeling is autonomous, bidirectional and iterative: 1) Since both deaf and blind people can have fully functional vision and language respectively, we use the term autonomous to describe that vision and language can be learned independently. Autonomy also implies a good interaction between vision and language, where independently learned language knowledge helps to recognize characters visually. 2) Inferring a character from its context behaves like a cloze task, since an illegible character can be treated as a blank. Thus, the prediction can be made using the clues of legible characters on both the left and right sides of the illegible one, which corresponds to bidirectional. 3) Iterative describes that in challenging environments, humans adopt a progressive strategy to improve prediction confidence by iteratively correcting the recognition results.

First, applying the autonomous principle to scene text recognition (STR) means that the recognition model should be decoupled into a vision model (VM) and a language model (LM), and the sub-models could be learned separately as independent functional units. Recent attention-based methods typically design the LM based on RNNs or Transformer [39], where the language rules are learned implicitly within a coupled model [19, 36, 33] (Figure 1a). However, whether and how well the LM learns character relationships is unknowable. Besides, this kind of approach cannot capture rich prior knowledge by pre-training the LM directly from large-scale unlabeled text.

Second, compared with unidirectional LMs [38], LMs following the bidirectional principle can capture twice as much information. A straightforward way to construct a bidirectional model is to merge a left-to-right model and a right-to-left model [28, 5], either at the probability level [44, 36] or at the feature level [49] (Figure 1e). However, such ensembles are strictly less powerful, since their language features are in fact unidirectional representations. In addition, the ensemble model means twice the cost in computation and parameters. A recent striking work in NLP is BERT [5], which introduces a deep bidirectional representation learned by masking text tokens. Applying BERT directly to STR would require masking each character of a text instance one at a time, which is extremely costly since only one character can be masked per call.

Third, executing LMs in an iterative manner can refine predictions from both visual and linguistic cues, which is not explored in current approaches. The typical way to execute an LM is autoregression [44, 3, 45] (Figure 1d), in which erroneous predictions accumulate as noise and are taken as input for subsequent predictions. To adapt the Transformer architecture, [25, 49] abandon autoregression and adopt parallel prediction (Figure 1e) to improve efficiency. However, noisy input still exists in parallel prediction, where errors from the VM output directly impair LM accuracy. Besides, parallel prediction in SRN [49] suffers from the unaligned-length problem: SRN can hardly infer the correct characters if the VM predicts the text length incorrectly.

Considering the shortcomings of current methods in internal interaction, feature representation and execution manner, we propose ABINet, guided by the autonomous, bidirectional and iterative principles. First, we explore a decoupling approach by blocking the gradient flow (BGF) between the VM and the LM (Figure 1b), which forces the LM to learn language rules explicitly. Besides, both the VM and the LM are autonomous units and can be pre-trained from images and text, respectively. Second, we design a novel bidirectional cloze network (BCN) as the LM, which eliminates the dilemma of combining two unidirectional models (Figure 1c). BCN controls the access to characters on both sides by specifying an attention mask conditioned on the left and right context. Furthermore, access across time steps is not allowed, to prevent information leakage. Third, we propose an execution manner of iterative correction for the LM (Figure 1b). By repeatedly feeding the output of ABINet back into the LM, the prediction can be progressively refined and the unaligned-length problem can be alleviated to a certain extent. Additionally, treating the iterative predictions as an ensemble, a semi-supervised method based on self-training is explored, which exploits a new solution toward human-level recognition.

The main contributions of this paper include: 1) We propose the autonomous, bidirectional and iterative principles to guide the design of the LM in STR. Under these principles, the LM is a functional unit that is required to extract bidirectional representations and to correct predictions iteratively. 2) A novel BCN is introduced, which estimates the probability distribution of characters like a cloze task using a bidirectional representation. 3) The proposed ABINet achieves state-of-the-art (SOTA) performance on mainstream benchmarks, and ABINet with ensemble self-training shows promising improvement toward human-level recognition.

2. Related work

2.1. Language-free methods

Language-free methods, such as CTC-based [7] and segmentation-based [21] methods, generally exploit visual features without considering the relationships between characters. CTC-based methods use a CNN to extract visual features and an RNN to model the feature sequence; the CNN and RNN are then trained end-to-end with the CTC loss [34, 11, 37, 12]. Segmentation-based methods use an FCN to segment characters at the pixel level. Liao et al. recognize characters by grouping the segmented pixels into text regions. Wan et al. [40] propose an additional order segmentation map that transcribes characters in the correct order. Due to the lack of language information, language-free methods cannot handle low-quality images well.

2.2. Language-based approach

Internal interaction between vision and language. In some early works, a bag of N-grams of the text string is predicted by a CNN, which acts as an explicit LM [14, 16, 13]. Later, attention-based methods became popular, which model language implicitly using a more powerful RNN [19, 36] or Transformer [43, 33]. Attention-based methods follow an encoder-decoder architecture, where the encoder processes the image and the decoder attends to either 1D image features [19, 35, 36, 3, 4] or 2D image features [48, 45, 23, 20]. For instance, R2AM [19] uses a recursive CNN as the feature extractor and an LSTM as the learned LM, modeling language implicitly at the character level and avoiding the use of N-grams. Besides, this kind of method is usually enhanced by integrating a rectification module [36, 51, 47] for irregular images before the images are fed into the network. Unlike the approaches above, our method strives to build a more powerful LM through explicit language modeling. In attempts to improve the language expression, some works introduce multiple losses, in which the additional losses come from semantics [29, 25, 49, 6]. SEED [29] proposes to use a pre-trained FastText model to guide the training of the RNN, which brings extra semantic information. We differ from this in that our method pre-trains the LM directly on unlabeled text, which is more feasible in practice.

Representation of language features. Character sequences in attention-based methods are typically modeled from left to right [19, 35, 3, 40]. For example, TextScanner [40] inherits the unidirectional model of attention-based methods; the difference is that it uses an additional position branch to enhance positional information and reduce misrecognition in contextless scenarios. To make use of bidirectional information, some works [8, 36, 44, 49] employ an ensemble of two unidirectional models. Specifically, to capture global semantic context, SRN [49] combines the features of a left-to-right and a right-to-left Transformer for further prediction. We emphasize that the ensemble of bidirectional models is essentially a unidirectional feature representation.

Execution manner of language models. Currently, the network architectures of LMs are mainly based on RNNs and Transformer [39]. RNN-based LMs are typically executed in an autoregressive manner [44, 3, 45], which takes the prediction of the last character as input. DAN [44] first obtains the visual feature of each character using the proposed convolutional alignment module. A GRU then predicts each character by taking the prediction embedding of the last time step and the character feature of the current time step as input. Transformer-based methods have the advantage of parallel execution, where the input at each time step is either a visual feature [25] or a character embedding predicted from visual features [49]. Our method falls into parallel execution, but we attempt to alleviate the noisy-input problem in parallel language models.

3. Proposed method

3.1. Visual model

The visual model consists of a backbone network and a positional attention module (Figure 3). Following previous methods, ResNet [36, 44] and Transformer units [49, 25] are used as the feature extraction network and the sequence modeling network, respectively. For an image $\boldsymbol{x}$, we have:


$$\mathbf{F}_{b}=\mathcal{T}(\mathcal{R}(\boldsymbol{x})) \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times C}$$

where $\mathcal{R}(\cdot)$ is the ResNet backbone, $\mathcal{T}(\cdot)$ denotes the Transformer units, H and W are the height and width of $\boldsymbol{x}$, and C is the feature dimension.

The positional attention module transcribes visual features into character probabilities in parallel, based on the query paradigm [39]:


$$\mathbf{F}_{v}=\operatorname{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{C}}\right) \mathbf{V}$$

Specifically, $\mathbf{Q} \in \mathbb{R}^{T \times C}$ is the positional encoding of the character order [39], where T is the length of the character sequence. $\mathbf{K}=\mathcal{G}(\mathbf{F}_{b}) \in \mathbb{R}^{\frac{HW}{16} \times C}$, where $\mathcal{G}(\cdot)$ is implemented by a mini U-Net [32]. $\mathbf{V}=\mathcal{H}(\mathbf{F}_{b}) \in \mathbb{R}^{\frac{HW}{16} \times C}$, where $\mathcal{H}(\cdot)$ is an identity mapping.
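To make the shapes and data flow concrete, here is a minimal PyTorch sketch of the positional attention module (an illustrative re-implementation, not the released code; the convolutional key projection stands in for the mini U-Net, and the module and argument names are assumptions):

```python
import torch
import torch.nn as nn


class PositionalAttention(nn.Module):
    """Sketch of the positional attention module (illustrative, not the official code)."""

    def __init__(self, max_len=26, channels=512):
        super().__init__()
        # Q: learned positional encodings of the character order (T x C).
        self.pos_query = nn.Parameter(torch.randn(max_len, channels))
        # Stand-in for the mini U-Net G(.) that produces the keys.
        self.key_proj = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.channels = channels

    def forward(self, feat):                                 # feat = F_b: B x C x H/4 x W/4
        b = feat.size(0)
        k = self.key_proj(feat).flatten(2).transpose(1, 2)   # K: B x HW/16 x C
        v = feat.flatten(2).transpose(1, 2)                  # V: B x HW/16 x C (identity mapping H(.))
        q = self.pos_query.unsqueeze(0).expand(b, -1, -1)    # Q: B x T x C
        attn = torch.softmax(q @ k.transpose(1, 2) / self.channels ** 0.5, dim=-1)
        return attn @ v                                      # F_v: B x T x C


# Example: fv = PositionalAttention()(torch.randn(2, 512, 8, 32))  # -> 2 x 26 x 512
```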

3.2. Language Model

3.2.1. Autonomous strategy

As shown in Figure 2, the autonomous strategy has the following characteristics: 1) the LM is regarded as an independent spelling-correction model, which takes probability vectors of characters as input and outputs probability distributions of the expected characters; 2) the flow of the training gradient is blocked at the input vectors (BGF); 3) the LM can be trained separately on unlabeled text data.

Following the autonomous strategy, ABINet can be divided into interpretable units. By taking probabilities as input, the LM is replaceable (i.e., it can be directly substituted by a more powerful model) and flexible (e.g., it can be executed iteratively, as in Section 3.2.3). Importantly, BGF compels the model to learn language knowledge, which is fundamentally different from implicit modeling, where what exactly the model learns is unknowable. Furthermore, the autonomous strategy allows us to benefit directly from advances in the NLP community; for example, pre-training the LM can be an effective way to boost performance.
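As an illustration of how BGF could be realized, the sketch below detaches the VM's output probabilities before they enter the LM (an assumed implementation; `vision_model` and `language_model` are hypothetical modules):

```python
import torch


def forward_with_bgf(vision_model, language_model, images):
    """Sketch of the autonomous strategy: block gradient flow (BGF) at the LM input."""
    vis_logits = vision_model(images)                  # B x T x num_classes
    vis_probs = torch.softmax(vis_logits, dim=-1)
    # detach() blocks the gradient flow, so the LM cannot back-propagate
    # into the visual model and must learn language rules on its own.
    lang_logits = language_model(vis_probs.detach())
    return vis_logits, lang_logits
```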

3.2.2 Bidirectional representation

Given a text string $\boldsymbol{y}=(y_{1},\ldots,y_{n})$ with text length $n$ and class number $c$, the conditional probabilities of $y_i$ for bidirectional and unidirectional models are $P(y_{i} \mid y_{n},\ldots,y_{i+1},y_{i-1},\ldots,y_{1})$ and $P(y_{i} \mid y_{i-1},\ldots,y_{1})$, respectively. From the perspective of information theory, the available entropy of a bidirectional representation can be quantified as $H_{y}=(n-1)\log c$. For a unidirectional representation, however, the information is $\frac{1}{n}\sum_{i=1}^{n}(i-1)\log c=\frac{1}{2}H_{y}$. Our insight is that previous methods typically use an ensemble of two unidirectional models, which are essentially unidirectional representations. A unidirectional representation captures only $\frac{1}{2}H_{y}$ of the information, resulting in limited capability of feature abstraction compared with a bidirectional representation.
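A quick numerical check of this counting argument (a toy calculation, not from the paper):

```python
import math

n, c = 7, 37   # e.g. a 7-character word over a 37-class charset (illustrative numbers)
H_bi = (n - 1) * math.log(c)                                   # each y_i can see the other n-1 characters
H_uni = sum(i - 1 for i in range(1, n + 1)) * math.log(c) / n  # average over positions, left-to-right only
print(H_bi, H_uni, H_uni / H_bi)                               # the ratio is exactly 0.5
```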

Benefiting from the autonomous design in Section 3.2.1, an off-the-shelf NLP model with spelling-correction capability can be transferred. One possible approach is the masked language model (MLM) in BERT [5], which replaces $y_i$ with the token [MASK]. However, we note that this is unacceptable, since the MLM would have to be called n times separately for each text instance, resulting in extremely low efficiency. Instead of masking the input characters, we propose BCN, which specifies the mask inside the attention.

Overall, BCN is a variant of an L-layer Transformer decoder. Each layer of BCN is a series of multi-head attention and feed-forward networks [39], followed by residual connections [10] and layer normalization [1], as shown in Figure 4. Unlike the vanilla Transformer, character vectors are fed into the multi-head attention blocks rather than into the first layer of the network. In addition, the attention mask in the multi-head attention is designed to prevent each character from "seeing itself". Besides, self-attention is not used in BCN, to avoid leaking information across time steps. The attention operation inside the multi-head blocks can be formalized as:


$$\begin{aligned} \mathbf{M}_{ij} &= \begin{cases}0, & i \neq j \\ -\infty, & i=j\end{cases} \\ \mathbf{K}_{i} &= \mathbf{V}_{i} = P\left(y_{i}\right) \mathbf{W}_{l} \\ \mathbf{F}_{mha} &= \operatorname{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{C}}+\mathbf{M}\right) \mathbf{V} \end{aligned}$$

where $\mathbf{Q} \in \mathbb{R}^{T \times C}$ is the positional encoding of the character order in the first layer and the output of the previous layer otherwise; $\mathbf{K}, \mathbf{V} \in \mathbb{R}^{T \times C}$ are obtained from the character probabilities $P(y_{i}) \in \mathbb{R}^{c}$, with $\mathbf{W}_{l} \in \mathbb{R}^{c \times C}$ a linear mapping matrix; and $\mathbf{M} \in \mathbb{R}^{T \times T}$ is the attention mask matrix that prevents attending to the current character. After stacking BCN layers into a deep architecture, the bidirectional representation $\mathbf{F}_{l}$ of text $\boldsymbol{y}$ is obtained.
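The following PyTorch sketch shows one BCN layer under this formulation (an illustrative re-implementation; layer sizes, module names and the feed-forward design are assumptions). The essential detail is the diagonal mask that stops each position from attending to itself:

```python
import torch
import torch.nn as nn


class BCNLayer(nn.Module):
    """Sketch of one bidirectional cloze network (BCN) layer (illustrative)."""

    def __init__(self, channels=512, num_classes=37, heads=8):
        super().__init__()
        self.kv_proj = nn.Linear(num_classes, channels)        # W_l: maps P(y_i) to K_i = V_i
        self.mha = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(channels, channels * 4), nn.ReLU(),
                                 nn.Linear(channels * 4, channels))
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, query, char_probs):
        # query: B x T x C (positional encodings at layer 1, previous layer output afterwards)
        # char_probs: B x T x num_classes, e.g. the softmax output of the visual model
        t = char_probs.size(1)
        kv = self.kv_proj(char_probs)
        # Diagonal mask M: -inf on the diagonal prevents each step from "seeing itself".
        mask = torch.zeros(t, t, device=query.device)
        mask.fill_diagonal_(float('-inf'))
        attn_out, _ = self.mha(query, kv, kv, attn_mask=mask)
        x = self.norm1(query + attn_out)
        return self.norm2(x + self.ffn(x))
```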

By specifying the attention mask like a cloze task, BCN elegantly learns a bidirectional representation that is more powerful than an ensemble of unidirectional representations. Besides, benefiting from the Transformer-like architecture, BCN can perform computation independently and in parallel. It is also more efficient than the ensemble model, since only half of the computation and parameters are needed.

3.2.3 Iterative correction

Parallel prediction in Transformers takes noisy inputs, which are typically approximations from visual prediction [49] or visual features [25]. Concretely, in the example of bidirectional representation shown in Figure 2, the desired condition for P("o") is "sh-wing". However, due to blur and occlusion in the environment, the actual condition obtained from the VM is "sh-ving", in which "v" becomes noise and harms the reliability of the prediction. This tends to become more hostile to the LM as the mispredictions from the VM increase.

To address the problem of noisy input, we propose to execute the LM iteratively (Figure 2). The LM is executed M times with different assignments of $\boldsymbol{y}$. For the first iteration, $\boldsymbol{y}^{i=1}$ is the probability prediction from the VM. For subsequent iterations, $\boldsymbol{y}^{i \geq 2}$ is the probability prediction from the fusion model in the last iteration (Section 3.3). In this way, the LM can correct the visual prediction iteratively.
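A sketch of this execution loop with hypothetical module interfaces (the fusion step follows Section 3.3):

```python
import torch


def iterative_inference(vision_model, language_model, fusion, images, num_iters=3):
    """Sketch of iterative correction: the fused prediction is fed back to the LM."""
    vis_feat, vis_logits = vision_model(images)        # F_v and its logits (assumed interface)
    probs = torch.softmax(vis_logits, dim=-1)          # y^(1): the visual prediction
    fused_logits = vis_logits
    for _ in range(num_iters):
        lang_feat = language_model(probs.detach())     # F_l from the current prediction
        fused_logits = fusion(vis_feat, lang_feat)     # F_f -> logits (assumed interface)
        probs = torch.softmax(fused_logits, dim=-1)    # y^(i+1): fused prediction for the next pass
    return fused_logits
```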

Another observation is that Transformer-based methods usually suffer from the unaligned-length problem [49], meaning the Transformer can hardly correct the visual prediction if the number of characters is inconsistent with the ground truth. The unaligned-length problem is caused by the inevitable padding mask, which is fixed to filter out context beyond the text length. Our iterative LM alleviates this problem, since the visual and linguistic features are fused multiple times, and thus the predicted text length is refined progressively.

3.3. Fusion

Conceptually, the visual model trained on images and the language model trained on text come from different modalities. To align the visual and linguistic features, we simply use a gating mechanism [49, 50] for the final decision:


$$\begin{aligned} \mathbf{G} &= \sigma\left(\left[\mathbf{F}_{v}, \mathbf{F}_{l}\right] \mathbf{W}_{f}\right) \\ \mathbf{F}_{f} &= \mathbf{G} \odot \mathbf{F}_{v} + (1-\mathbf{G}) \odot \mathbf{F}_{l} \end{aligned}$$

where $\mathbf{W}_{f} \in \mathbb{R}^{2C \times C}$ and $\mathbf{G} \in \mathbb{R}^{T \times C}$.
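A minimal sketch of the gated fusion (illustrative; the final classifier layer is an assumption):

```python
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Sketch of the gating mechanism fusing visual (F_v) and linguistic (F_l) features."""

    def __init__(self, channels=512, num_classes=37):
        super().__init__()
        self.gate = nn.Linear(2 * channels, channels)   # W_f
        self.cls = nn.Linear(channels, num_classes)     # final character classifier (assumed)

    def forward(self, f_v, f_l):                        # both B x T x C
        g = torch.sigmoid(self.gate(torch.cat([f_v, f_l], dim=-1)))
        f_f = g * f_v + (1 - g) * f_l                   # element-wise gated combination
        return self.cls(f_f)
```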

3.4. Supervised training

ABINet is trained end-to-end with the following multi-task objective:


$$\mathcal{L}=\lambda_{v} \mathcal{L}_{v}+\frac{\lambda_{l}}{M} \sum_{i=1}^{M} \mathcal{L}_{l}^{i}+\frac{1}{M} \sum_{i=1}^{M} \mathcal{L}_{f}^{i}$$

where $\mathcal{L}_{v}$, $\mathcal{L}_{l}$ and $\mathcal{L}_{f}$ are the cross-entropy losses from $\mathbf{F}_{v}$, $\mathbf{F}_{l}$ and $\mathbf{F}_{f}$, respectively. In particular, $\mathcal{L}_{l}^{i}$ and $\mathcal{L}_{f}^{i}$ are the losses at the $i$-th iteration, and $\lambda_{v}$ and $\lambda_{l}$ are balancing factors.
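For clarity, the objective could be computed as below (a sketch with assumed tensor shapes; `targets` holds character class indices of shape B x T):

```python
import torch.nn.functional as F


def abinet_loss(vis_logits, lang_logits_per_iter, fused_logits_per_iter, targets,
                lambda_v=1.0, lambda_l=1.0):
    """Sketch of the multi-task objective: visual loss plus LM/fusion losses averaged over M iterations."""
    ce = lambda logits: F.cross_entropy(logits.flatten(0, 1), targets.flatten(0, 1))
    m = len(lang_logits_per_iter)
    loss = lambda_v * ce(vis_logits)
    loss = loss + lambda_l / m * sum(ce(l) for l in lang_logits_per_iter)
    loss = loss + 1.0 / m * sum(ce(f) for f in fused_logits_per_iter)
    return loss
```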

3.5. Semi-supervised ensemble self-training

To further explore the superiority of our iterative model, we propose a semi-supervised learning method based on self-training [46] with the ensemble of iterative predictions. The basic idea of self-training is to first generate pseudo-labels with the model itself, and then retrain the model with the additional pseudo-labels. The key problem therefore lies in constructing high-quality pseudo-labels.

To filter noisy pseudo-labels, we propose the following: 1) the minimum confidence of the characters within a text instance is chosen as the text certainty; 2) the iterative predictions of each character are treated as an ensemble to smooth the effect of noisy labels. We therefore define the filtering function as follows:


$$\left\{\begin{array}{l} \mathcal{C}=\min_{1 \leq t \leq T} e^{\mathbb{E}\left[\log P\left(y_{t}\right)\right]} \\ P\left(y_{t}\right)=\max_{1 \leq m \leq M} P_{m}\left(y_{t}\right) \end{array}\right.$$

where $\mathcal{C}$ is the minimum certainty of a text instance, and $P_{m}(y_{t})$ is the probability distribution of the $t$-th character at the $m$-th iteration. The training process is shown in Algorithm 1, where $Q$ is the threshold, $B_{l}$ and $B_{u}$ are training batches from labeled and unlabeled data, $N_{\max}$ is the maximum number of training steps, and $N_{upl}$ is the number of steps between pseudo-label updates.
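A sketch of the certainty filter (illustrative; reading $\mathbb{E}[\log P(y_t)]$ as the expectation under $P(y_t)$ itself, i.e. the exponential of the negative entropy, is our assumption about the notation):

```python
import torch


def text_certainty(iter_probs):
    """Sketch of the pseudo-label filter for one text instance (illustrative interpretation).

    iter_probs: M x T x num_classes, the per-iteration probability distributions.
    P(y_t) is taken as the element-wise max over the M iterations; the certainty of
    each character is exp(E[log P(y_t)]) under P(y_t), and the text certainty C is
    the minimum over characters.
    """
    p = iter_probs.max(dim=0).values                       # T x num_classes
    p = p / p.sum(dim=-1, keepdim=True)                    # renormalize after the max
    neg_entropy = (p * p.clamp_min(1e-12).log()).sum(-1)   # E[log P(y_t)] per character
    return neg_entropy.exp().min()                         # C


def select_pseudo_labels(batch_iter_probs, threshold=0.9):
    """Keep only unlabeled samples whose certainty C exceeds the threshold Q."""
    return [i for i, p in enumerate(batch_iter_probs) if text_certainty(p) >= threshold]
```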

4. Experiments

4.1. Datasets and implementation details

For fair comparison, experiments are conducted following the settings of [49]. Specifically, the training datasets are two synthetic datasets, MJSynth (MJ) [13, 15] and SynthText (ST) [9]. Six standard benchmarks, including ICDAR 2013 (IC13) [18], ICDAR 2015 (IC15) [17], IIIT 5K-Words (IIIT) [27], Street View Text (SVT) [42], Street View Text Perspective (SVTP) [30] and CUTE80 (CUTE) [31], are used as test datasets. Details of these datasets can be found in previous works [49]. In addition, Uber-Text [52] with the labels discarded is used as the unlabeled dataset to evaluate the semi-supervised method.

The model dimension $C$ is set to 512 throughout. BCN has 4 layers, each with 8 attention heads. The balancing factors $\lambda_{v}$ and $\lambda_{l}$ are both set to 1. Images are directly resized to $32 \times 128$, with data augmentation such as geometric transformation (i.e., rotation, affine and perspective), image quality deterioration, and color jitter. We use 4 NVIDIA 1080Ti GPUs to train our model with a batch size of 384. The ADAM optimizer is used with an initial learning rate of 1e-4, which is decayed to 1e-5 after 6 epochs.
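The optimizer setup could look like the following sketch (the placeholder model and the total epoch count are assumptions; only the learning-rate schedule is taken from the text above):

```python
import torch

# Minimal sketch of the training schedule; `model` is a placeholder standing in
# for the assembled ABINet so the snippet runs on its own.
model = torch.nn.Linear(512, 37)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Decay the learning rate from 1e-4 to 1e-5 after 6 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[6], gamma=0.1)

num_epochs = 8   # assumed total; the text only states the decay point
for epoch in range(num_epochs):
    # ... one epoch over MJ + ST with the augmentations listed above, batch size 384 ...
    scheduler.step()
```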

4.2 Ablation study

4.2.1 Visual model

First, we discuss the performance of the VM in terms of feature extraction and sequence modeling. The experimental results are recorded in Table 1. Parallel attention is a popular attention mechanism [25, 49], while the proposed positional attention has a more powerful key/value vector representation. From the statistics we can conclude: 1) simply upgrading the VM brings a notable increase in accuracy, but at the expense of parameters and speed; 2) to upgrade the VM, we can use positional attention for feature extraction and a deeper Transformer for sequence modeling.

4.2.2 Language Model

Autonomous strategy. To analyze the autonomous model, LV and BCN are used as the VM and the LM, respectively. From the results in Table 2, we observe: 1) pre-training the VM is useful, improving accuracy by 0.6%-0.7% on average; 2) pre-training the LM on the training datasets (i.e., MJ and ST) brings negligible benefit; 3) pre-training the LM on additional unlabeled datasets such as WikiText-103 is also helpful, even when the base model already has high accuracy. The above observations suggest that pre-training the VM and the LM is useful for STR. Pre-training the LM on additional unlabeled datasets is more effective than on the training datasets, since their limited text diversity and biased data distribution do not facilitate learning a well-performing LM. Moreover, pre-training the LM on unlabeled datasets is cheap, as the additional data is readily available.

Besides, allowing gradient flow (AGF) between the VM and the LM reduces performance by 0.9% on average (Table 2). We also notice that the training loss of AGF drops sharply to a lower value. This suggests that the LM cheats during training by relying on the VM rather than learning language rules, leading to overfitting, which may also occur in implicit language modeling. Therefore, it is crucial to force the LM to learn independently via BGF. We note that SRN [49] uses an argmax operation after the VM, which is essentially a special case of BGF since argmax is non-differentiable. Another advantage is that the autonomous strategy makes the model more interpretable, as we can gain insight into the performance of the LM directly (e.g., Table 4), which is infeasible with implicit language modeling.

Bidirectional representation. Since BCN is a variant of the Transformer, we compare it with its counterpart in SRN. The Transformer-based SRN [49] shows superior performance with an ensemble of unidirectional representations. For a fair comparison, the experiments are conducted under the same conditions except for the LM networks. We use SV and LV as the VMs to validate the effectiveness at different accuracy levels. As shown in Table 3, although BCN has similar parameters and inference speed to the unidirectional version of SRN (SRN-U), it achieves a competitive advantage in accuracy under different VMs. Moreover, BCN outperforms the bidirectional SRN in ensemble, especially on challenging datasets such as IC15 and CUTE. In addition, ABINet equipped with BCN is about 20%-25% faster than SRN, which is practical for large-scale tasks.

Since Section 3.2.1 argues that the LM can be regarded as an independent unit estimating probability distributions for spelling correction, we also conduct experiments from this perspective. The training set is the text from MJ and ST. The test set consists of 20,000 randomly chosen items, in which, to simulate spelling errors, we add or remove a character for 20% of the text, replace a character for 60% of the text, and keep the rest unchanged. From the results in Table 4, we can see that BCN outperforms SRN by 4.5% in character accuracy and 14.3% in word accuracy, indicating that BCN has a stronger capability of character-level language modeling.
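The spelling-error corruption could be simulated as in the sketch below (the even split between adding and removing a character, and the character pool, are assumptions):

```python
import random
import string


def corrupt(text, rng=random):
    """Sketch of the spelling-error simulation described above: roughly 20% of the
    text gets a character added or removed (split evenly here, an assumption),
    60% gets a character replaced, and the rest is left unchanged."""
    pool = string.ascii_lowercase + string.digits
    r, i = rng.random(), rng.randrange(len(text))
    if r < 0.1 and len(text) > 1:                        # remove one character
        return text[:i] + text[i + 1:]
    if r < 0.2:                                          # add one character
        return text[:i] + rng.choice(pool) + text[i:]
    if r < 0.8:                                          # replace one character
        return text[:i] + rng.choice(pool) + text[i + 1:]
    return text                                          # keep unchanged


# Example: corrupted = [corrupt(w) for w in ["today", "showing", "street"]]
```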

To better understand how BCN works in ABINet, we visualize the top-5 probabilities in Figure 5, taking "today" as an example. On the one hand, since "today" is a string with semantic information, BCN can predict "t" and "a" with high confidence from the inputs "-oday" and "tod-y", and thus contributes to the final fused prediction. On the other hand, since the erroneous characters "l" and "o" are noise for the remaining predictions, BCN becomes less confident there and has little effect on the final prediction. Besides, if there are multiple erroneous characters, BCN can hardly recover the correct text due to the lack of sufficient context.

Iterative correction. We again use SV and LV equipped with BCN to demonstrate the performance of iterative correction at different levels. The experimental results are reported in Table 5, where the number of iterations in training and testing is set to 1, 2 and 3. From the results we can see that BCN with 3 iterations improves the average accuracy by 0.4% and 0.3% for SV and LV, respectively. In particular, IIIT is a relatively easy dataset with clear character appearance and benefits little. However, on harder datasets such as IC15, SVT and SVTP, iterative correction steadily boosts accuracy, improving SVT by 1.3% and 1.0% for SV and LV, respectively. Also note that the inference time increases linearly with the number of iterations.

We further explore the differences between iterating in training and in testing. The fluctuation of average accuracy in Figure 6 shows that: 1) directly applying iterative correction at test time is also effective; 2) iterating during training is beneficial, as it provides the LM with additional training samples; 3) the accuracy saturates after about 3 iterations, so a large number of iterations is unnecessary.

To gain a complete picture of iterative correction, we visualize the intermediate predictions in Figure 7. Typically, the visual prediction can be corrected toward the ground truth, but errors may still occur in some cases; after multiple iterations the prediction is eventually revised. Besides, we observe that iterative correction alleviates the unaligned-length problem, as shown in the last column of Figure 7.

From the ablation studies we conclude: 1) the bidirectional BCN is a powerful LM that effectively improves accuracy while maintaining high efficiency; 2) by further equipping BCN with iterative correction, the noisy-input problem can be alleviated, which suggests a way to handle challenging examples such as low-quality images at the cost of a small amount of extra computation.

4.3 Comparison with state-of-the-art

Generally, it is not easy to compare fairly with other methods directly using the reported statistics [2], due to differences in backbones (i.e., CNN structure and parameters), data processing (i.e., image rectification and data augmentation) and training tricks. To enforce a strictly fair comparison, we reproduce the SOTA algorithm SRN under the same experimental configuration as ABINet, as shown in Table 6. The two re-implementations, SRN-SV and SRN-LV, differ slightly from the reported model by replacing the VM, removing the side effects of multi-scale training, applying a decayed learning rate, etc. Note that SRN-SV performs slightly better than the reported SRN thanks to the above techniques. From the comparison we can see that our ABINet-SV outperforms SRN-SV by 0.5%, 2.3%, 0.4%, 1.4%, 0.6% and 1.4% on IC13, SVT, IIIT, IC15, SVTP and CUTE, respectively. Additionally, ABINet-LV with a more powerful VM improves over its counterpart by 0.6%, 1.2%, 1.8%, 1.4% and 1.0% on the IC13, SVT, IC15, SVTP and CUTE benchmarks.

ABINet also shows impressive performance compared with recent SOTA works trained on MJ and ST (Table 6). In particular, ABINet has a prominent advantage on SVT, SVTP and IC15, since these datasets contain a large number of low-quality images, such as noisy and blurred images, which the VM cannot recognize confidently. Besides, we find that images with unusual fonts and irregular text can also be recognized successfully, since linguistic information acts as an important complement to visual features. Thus, even without image rectification, ABINet achieves the second-best result on CUTE.

4.4 Semi-supervised training

To further push the boundary of accurate reading, we explore a semi-supervised method that uses MJ and ST as the labeled datasets and Uber-Text as the unlabeled dataset. The threshold Q in Section 3.5 is set to 0.9, and the batch sizes of $B_{l}$ and $B_{u}$ are 256 and 128, respectively. The experimental results in Table 6 show that the proposed self-training method ABINet-LVst easily outperforms ABINet-LV on all benchmark datasets. Furthermore, the ensemble self-training ABINet-LVest shows more stable performance owing to improved data utilization. Looking at the improved results, we find that hard samples with scarce fonts and blurred appearance can also be frequently recognized (Figure 8), indicating that exploring semi-supervised learning methods is a promising direction for scene text recognition.

5. Conclusion

In this paper, we propose ABINet, which explores effective ways to utilize language knowledge in scene text recognition. ABINet is 1) autonomous, improving the capability of the language model by explicitly enforcing learning; 2) bidirectional, learning text representations by jointly conditioning on the character context on both sides; and 3) iterative, correcting predictions progressively to alleviate the impact of noisy input. Based on ABINet, we further propose an ensemble self-training method for semi-supervised learning. Experimental results on standard benchmarks demonstrate the superiority of ABINet, especially on low-quality images. Furthermore, we argue that learning from unlabeled data toward human-level recognition is feasible and promising.