Authors: Steven Schwarcz, Alex Gorban (https://research.google/people/105189/), Dar-Shyang Lee, Xavier Gibert


Presented at the IEEE Winter Conference on Applications of Computer Vision (WACV), 2020.

Original publication: research.google/pubs/pub488…

Abstract

In this paper we address the task of training a sequence OCR system for photos of street name signs in a language for which no labeled real-world data is available. Our approach achieves reasonable performance on unlabeled real images by combining gradient-reversal-based domain adaptation with a multi-task learning scheme that uses easily generated synthetic data together with labeled data in a different language. To this end, we adapt and release two new datasets, Hebrew Street Name Signs (HSNS) and Synthetic Hebrew Street Name Signs (SynHSNS), and we also make use of the existing French Street Name Signs (FSNS) dataset. We show that using a synthetic Hebrew dataset together with natural images of French street name signs improves transcription of real Hebrew street name signs, where the synthetic Hebrew data and the real French data overlap with different characteristics of the target data.

1. Introduction

There are currently eight major families of alphabets in use: Arabic, Aramaic, Armenian, Sanskrit, Cyrillic, Georgian, Greek, and Latin, each of which is used to write many different languages. For most of these languages it is difficult to find skilled annotators who can label large datasets at reasonable cost. Without a better way to train a system for a new language, it is impractical to build text-recognition systems for real-world text images, such as those in Google Street View, for non-Latin languages.

At present, most sequence OCR systems are trained on a mixture of real and synthetic data [18, 43]. For printed documents and books there is little difference between synthetic and real data, and there are many ways to build a well-generalizing OCR model. For text recognition in natural scene images such as road signs, however, the gap between synthetic text renderings and real images is too large: most existing OCR methods do not generalize and require large amounts of labeled data.

Our proposed algorithm solves this problem without requiring any new manual labels. Instead, recognition of the new language is achieved using only synthetic data in that language and an existing labeled dataset in another, unrelated language.

Our experiments show that adding another language to training actually reduces the need for more realistic synthetic data. The neural network learns the "content" of the first language from the synthetic data and the "style" of real images from the second language. We use Hebrew as our target language and French as our existing dataset to illustrate the validity of this approach. We intentionally keep the synthetic data relatively simple, both to emphasize that the system does not use the synthetic data to learn anything stylistic and because we believe that the simpler the synthetic data, the more practical our algorithm becomes.

Interestingly, this transfer occurs even though Hebrew (written in a script of Aramaic origin) shares no glyphs or characters with French (written in the Latin script). Our algorithm therefore relies on no language-specific features: in principle, the French dataset should be sufficient to train a system for any language without any manual labeling.

Finally, to ensure that our results are reproducible, we introduce and release the Hebrew Street Name Signs (HSNS) and Synthetic Hebrew Street Name Signs (SynHSNS) datasets, on which we perform all experiments.

Figure 1: We attempt to transcribe real images in a language (Hebrew) without using any labeled training data in that language, using only a combination of synthetic data in the same language (Hebrew) and labeled real data in a completely different language (French). The synthetic Hebrew dataset overlaps with the real Hebrew dataset in content, while the French dataset overlaps in style but not in content. The sources are thus complementary: although they overlap very little with each other, together they clearly cover the target.

2. Related work

2.1 Domain adaptation

A large number of unsupervised and semi-supervised domain adaptation techniques have been invented and explored in computer vision, especially for image classification [29, 26, 25, 24, 23, 14], and also for semantic segmentation [47, 27, 16], object recognition [2], and object detection [4]. In all cases, the goal of these techniques is to match the distribution of the source domain to that of the target domain, i.e., to map data from the different source and target domains into a feature space in which their distributions are as close as possible.

In some cases this is done by explicitly matching the moments of the two distributions. For example, Maximum Mean Discrepancy (MMD) is a loss function that minimizes the norm of the difference between the means of the two distributions and has been used to good effect in [37, 20, 3]. In addition, the works of [31] and [32] achieve good results by matching the second moments of their source and target domains.
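For illustration, here is a minimal sketch of a linear-kernel MMD penalty of this kind (not the exact formulation of the works cited above; the feature extractor named in the usage comment is hypothetical):

```python
import torch

def mmd_linear(source_feats: torch.Tensor, target_feats: torch.Tensor) -> torch.Tensor:
    """Squared MMD with a linear kernel: the distance between the feature means.

    source_feats, target_feats: (batch, dim) deep features from the two domains.
    """
    delta = source_feats.mean(dim=0) - target_feats.mean(dim=0)
    return (delta * delta).sum()

# Usage sketch (hypothetical feature extractor `G_f`): add the penalty to the task loss.
# loss = task_loss + lambda_mmd * mmd_linear(G_f(x_source), G_f(x_target))
```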

In addition to explicit moment matching, another technique, called gradient reversal, has become a powerful paradigm for deep domain adaptation and plays a fundamental role in many deep domain adaptation systems [3, 4, 16]. It has even been applied effectively to problems entirely outside computer vision, such as machine translation [8]. In this setup, a discriminator branch is attached to the deep network and classifies, from deep features, whether a sample comes from the source domain or the target domain. The network simultaneously trains the feature extractor to fool the discriminator by reversing the sign of the discriminator's loss gradient with respect to the feature extractor.

Another closely related paradigm of deep domain adaptation uses adversarial learning to minimize the domain shift [36, 15, 2, 26, 27]. These techniques are similar to GANs in that they also use a discriminator to push the two feature distributions together.

Domain adaptation has also been used for various text-related tasks in computer vision. For example, domain adaptation techniques have been used to recognize fonts in images [42, 41]. Domain adaptation has also been applied to problems in natural language processing [6, 11, 5], an area related to OCR through language modeling and sequence processing.

There is also work on style adaptation in both language and computer vision, although none of it has been applied to in-the-wild sequence OCR. Finally, various techniques exist for training systems with incomplete data; for example, machine translation performance can be improved by augmenting existing data [7] or by using data from other languages [48].

2.2 Optical character recognition

Optical character recognition (OCR) is the task of recognizing a string of characters in an image. Modern OCR methods based on deep learning typically use the following pipeline: a convolutional neural network [18] first extracts features, and subsequent decoding layers then produce the text [30, 43]. In particular, [43] uses the first few layers of the InceptionV3 architecture [34] to extract features, which are then fed into an LSTM to produce the transcription.

Domain adaptation has also been applied to sequence OCR. When the target domain contains a large corpus (such as a book), Gaussian models can be fine-tuned with maximum-likelihood or expectation-maximization estimation under a MAP criterion, exploiting style and language consistency [28, 39]; this is similar to speaker adaptation with speaker-independent HMM models [10]. In recent studies [46, 40], separating style and content has effectively improved digit-recognition transfer from MNIST to SVHN.

Finally, we note that while many of the image classification works above demonstrate their effectiveness on the MNIST [19] and SVHN [22] datasets, that task, although it falls under the OCR umbrella, is simpler than general sequence OCR. MNIST and SVHN involve classifying single digits, whereas the variable-length character sequences in the images we are concerned with must be recognized and classified in the correct order. It is therefore non-trivial to apply the domain adaptation techniques discussed above directly to sequence OCR. For example, our system performs domain adaptation on a network containing recurrent neural network (RNN) and attention components that are not present in any of the non-sequential architectures discussed above.

3. Method

We aim to design a system that can transcribe a language from real images for which no labeled real data exists. To do this, we approach the problem from two different directions simultaneously, targeting the style and the content of the images with two different datasets. Specifically, we use unsupervised domain adaptation to transfer what is learned from the synthetic data (the language itself), while using a simple multi-task learning scheme to make the system robust to the style of real images.

We distinguish between three sets of images available during training. The first source set is the content dataset $S = (X_S, Y_S)$: $X_S$ are synthetic images of text in some language, and $Y_S$ are the associated labels, which are sequences of integers over an alphabet $A$. We will usually refer to $S$ as the content source. Similarly, the second source set $S' = (X_{S'}, Y_{S'})$ is the style dataset: real images with labeled text in another language that uses a different alphabet $A'$.

Concretely, we use French for $S'$, although any other language, even one with different glyphs, would work. We refer to $S'$ as the "style source"; we use $S$ for domain adaptation and $S'$ for multi-task training.

The third domain, the target domain $T$, contains only images. These images are in the same language as $S$ and use the same alphabet $A$, but they are real photographs rather than synthetic renderings. A key feature of this setup is the assumption that the domain shift between $T$ and each of $S$ and $S'$ is not too large, while $S$ and $S'$ have little in common, overlapping with each other in neither content nor style.

3.1 Basic algorithm

We conduct our experiments by extending the architecture introduced in [43]. At a high level, the architecture consists of three components: a CNN acting as a feature extractor $G_f$; an RNN $G_y$ that decodes the extracted visual features into characters; and a spatial attention mechanism that directs the RNN to focus on salient features, which for the purposes of this discussion we fold into $G_y$.

Figure 2: Baseline architecture, following [43]. A feature extractor $G_f$ extracts visual features, in this case from the content domain; these features are fed into an RNN decoder $G_y$, which includes a spatial attention component.

We use the first few layers of the InceptionV3 CNN architecture as our visual feature extractor $G_f$. This mapping is fully convolutional, and we denote its output features by $f = G_f(x)$. We denote the RNN together with the spatial attention component by $G_y$ (see Figure 2 for the framework).

More precisely, to compute the output at step $t$, we first compute a spatial attention mask $\alpha_t$ over the visual features $f$ and use it to form the context vector

$c_t = \sum_{i,j} \alpha_{t,i,j} \, f_{i,j}$    (1)

This context vector is then fed into the RNN together with the previous character:

$(o_t, s_t) = \mathrm{RNN}\big(c_t \oplus \hat{y}_{t-1},\; s_{t-1}\big)$    (2)

where $s_t$ and $o_t$ are the internal state and the output of the RNN at time $t$, and $\hat{y}_{t-1}$ is the one-hot encoding of the previous character, taken from the ground truth during training and from the model's own prediction during inference.

Finally, we compute a distribution over characters

$p(y_t \mid x) = \mathrm{softmax}\big(W_o\, o_t + b_o\big)$    (3)

and take the prediction

$\hat{y}_t = \arg\max_{y}\; p(y_t = y \mid x)$    (4)
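To make Equations (1)-(4) concrete, here is a minimal sketch of one decoder step in PyTorch; the layer sizes, the attention scoring function, and the choice of an LSTM cell are illustrative assumptions rather than the exact configuration of [43]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderStep(nn.Module):
    """One step of an attention RNN decoder over CNN features (a sketch of Eqs. 1-4)."""

    def __init__(self, feat_dim: int, hidden_dim: int, alphabet_size: int):
        super().__init__()
        self.attn_score = nn.Linear(feat_dim + hidden_dim, 1)        # scores each spatial location
        self.rnn = nn.LSTMCell(feat_dim + alphabet_size, hidden_dim)
        self.char_logits = nn.Linear(hidden_dim, alphabet_size)

    def forward(self, feats, prev_char_onehot, state):
        # feats: (B, N, feat_dim) flattened spatial features f from the CNN
        h, c = state
        # Eq. (1): spatial attention mask alpha_t and context vector c_t
        scores = self.attn_score(torch.cat(
            [feats, h.unsqueeze(1).expand(-1, feats.size(1), -1)], dim=-1))
        alpha = F.softmax(scores, dim=1)                 # (B, N, 1)
        context = (alpha * feats).sum(dim=1)             # (B, feat_dim)
        # Eq. (2): RNN step on the context concatenated with the previous character
        h, c = self.rnn(torch.cat([context, prev_char_onehot], dim=-1), (h, c))
        # Eqs. (3)-(4): distribution over characters and greedy prediction
        probs = F.softmax(self.char_logits(h), dim=-1)
        pred = probs.argmax(dim=-1)
        return pred, probs, (h, c)
```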

3.2 Style adaptation

To learn the "style" of real images we use a simple multi-task learning scheme: we train a single network that learns both to transcribe synthetic Hebrew and to transcribe real French. The end result is that the system can implicitly exploit the overlap in real-image style between the French and Hebrew data to better transcribe real Hebrew images. In particular, we train a single feature extractor $G_f$ on images coming simultaneously from real French street signs $x^{S'}$ and synthetic Hebrew street signs $x^{S}$, as shown on the left of Figure 3, producing output features

$f^{S} = G_f(x^{S}), \qquad f^{S'} = G_f(x^{S'})$

These features are then fed into two separate attention RNN components $G_y^{S}$ and $G_y^{S'}$, with parameters $\theta_y^{S}$ and $\theta_y^{S'}$, generating two sets of outputs. We can then train on the two datasets according to their respective cross-entropy losses:

$\mathcal{L}_S\big(\theta_f, \theta_y^{S}\big) = \mathrm{CE}\big(G_y^{S}(G_f(x^{S})),\, y^{S}\big), \qquad \mathcal{L}_{S'}\big(\theta_f, \theta_y^{S'}\big) = \mathrm{CE}\big(G_y^{S'}(G_f(x^{S'})),\, y^{S'}\big)$    (5)

In practice, we make these losses autoregressive, as described in [33], by passing the ground-truth labels as history during training.

To learn the labels of the French images in $S'$, the system must learn to ignore the real-image style of those images and focus on their content. Since the style of the real French images overlaps heavily with the style of the target images, we assume the system also learns to ignore the real-image style of the target images, even though it learns their content from the synthetic images in $S$.

3.3 Content adaptation

Although the system described in Section 3.2 learns Hebrew content from the synthetic data, it does nothing to explicitly increase the similarity between the source domain $S$ and the target domain $T$; in fact, it does not use $T$ during training at all. To address this, we use unsupervised domain adaptation to explicitly adapt the synthetic Hebrew data to the real data.

3.3.1 Gradient reversal

We try to improve performance on the target domain by directly training our system to be robust to the domain shift between synthetic and real Hebrew data. Specifically, we want to reduce the discrepancy between the source and target distributions. To this end, Ben-David et al. [1] showed that the $\mathcal{H}$-divergence between the target domain $T$ and the source domain $S$ can be computed as

$d_{\mathcal{H}}(S, T) = 2\Big(1 - \min_{\eta \in \mathcal{H}} \big[\operatorname{err}_S(\eta) + \operatorname{err}_T(\eta)\big]\Big)$    (6)

where $\mathcal{H}$ is a set of binary classifiers that assign 1 to samples from the source domain and 0 to samples from the target domain, and $\operatorname{err}_S(\eta)$ and $\operatorname{err}_T(\eta)$ are the empirical misclassification errors of $\eta$ on the source and target domains. We can therefore minimize the distance between the two domains by maximizing the classification error of the classifier that best distinguishes them.

Ganin et al. [9] achieve this through a technique called gradient reversal (GR). Training is framed as a saddle-point problem, and the system is divided into three parts: features $f$ are extracted by a feature extractor $G_f$ and then passed both into a task-specific classifier branch $G_y$ and into a domain discriminator branch $G_d$. $G_d$ attempts to classify every sample as coming from the source or the target domain, using the loss

$\mathcal{L}_d\big(\hat{d}_i, d_i\big) = -\,d_i \log \hat{d}_i \; - \; (1 - d_i)\log\big(1 - \hat{d}_i\big), \qquad \hat{d}_i = G_d\big(G_f(x_i)\big)$    (7)

In essence, $G_d$ is a classifier belonging to the hypothesis class $\mathcal{H}$ described above.

Thus, given a task loss function $\mathcal{L}_y$ (cross-entropy, for example), we can define an energy function

$E\big(\theta_f, \theta_y, \theta_d\big) = \sum_{i:\, d_i = 1} \mathcal{L}_y\big(G_y(G_f(x_i)),\, y_i\big) \; - \; \lambda \sum_{i} \mathcal{L}_d\big(G_d(G_f(x_i)),\, d_i\big)$    (8)

where $d_i$ is a domain label equal to 1 if sample $i$ comes from the source domain and 0 otherwise, and $\lambda$ is a hyperparameter that controls the trade-off between the two losses. The desired parameters form the saddle point:

$(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} E\big(\theta_f, \theta_y, \hat{\theta}_d\big), \qquad \hat{\theta}_d = \arg\max_{\theta_d} E\big(\hat{\theta}_f, \hat{\theta}_y, \theta_d\big)$    (9)

Gradient reversal provides a simple way to optimize this saddle-point problem with stochastic gradient descent. To do so, a special gradient reversal layer (GRL) is inserted between $G_f$ and $G_d$. In the forward pass the GRL is the identity mapping; in the backward pass it multiplies the gradient by $-1$. This effectively negates the gradient of the domain loss with respect to the feature extractor, so that ordinary stochastic gradient descent on (8) reaches the desired saddle point.
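For illustration only, here is a minimal PyTorch sketch of such a gradient reversal layer; the module names in the usage comment ($G_f$, $G_y$, $G_d$) follow the text, but the surrounding training code is hypothetical:

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda on the way back,
    as in Ganin et al. [9]."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=0.5):
    return GradientReversal.apply(x, lam)

# Usage sketch (hypothetical names): features feed the task head directly and the
# domain discriminator through the reversal layer, so minimizing the summed loss
# trains the discriminator while pushing the feature extractor to fool it.
# feats = G_f(images)
# task_loss = cross_entropy(G_y(feats), labels)
# domain_loss = cross_entropy(G_d(grad_reverse(feats, lam=0.5)), domain_labels)
# (task_loss + domain_loss).backward()
```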

3.3.2 Adaptive decoder

A simple way to apply gradient reversal to the architecture described in Section 3.1 is to treat $G_d$ exactly as in Section 3.3.1: as a simple classifier acting on the features extracted by $G_f$. Intuitively, this adjusts the visual features to make them robust to the shift between real and synthetic styles.

Figure 3: Left: network configuration for multi-task training. The same feature extractor $G_f$ is applied to images from the content domain $S$ and the style domain $S'$; the resulting features are then fed into two independent RNN decoders. Right: domain adaptation applied to the RNN decoder. The per-step RNN values are aggregated and passed, via gradient reversal, to a domain classifier that distinguishes $S$ from the target domain $T$; beyond what the network learns through multi-task training, no other adjustments are made.

However, in exploring multiple architectures built on this approach, we found experimentally that the main benefit of domain adaptation was its ability to improve the understanding of content, with little improvement in robustness to style. Under this view, domain adaptation is more meaningful in the RNN part of the network, which deals with the structure of the language.

Therefore, we introduce a way to adapt the RNN component of the system directly, as shown in Figure 3. Specifically, we leave most of $G_y$ unchanged, but we aggregate its per-step values into a single vector:

$v = \max_{t}\big(s_t\big) \,\oplus\, \min_{t}\big(s_t\big)$    (10)

where $s_t$ is the internal RNN state introduced in Equation (2), the max and min are taken element-wise over all decoding steps, and $\oplus$ denotes concatenation. We found experimentally that aggregating the RNN outputs with max and min is critical: average-based or attention-based (softmax) aggregation did not produce a system better than the baseline.

The domain discriminator is then applied to this aggregated output; we compute it as

$G_d(v) = \sigma\big(W_d\, v + b_d\big)$    (11)

where $W_d$ and $b_d$ are parameters the network must learn.
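A minimal sketch of how Equations (10) and (11) could be realized; the state dimensionality and the logits-plus-BCE formulation are assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class DecoderDomainDiscriminator(nn.Module):
    """Domain discriminator over max/min-aggregated decoder states (sketch of Eqs. 10-11)."""

    def __init__(self, state_dim: int):
        super().__init__()
        self.linear = nn.Linear(2 * state_dim, 1)   # W_d, b_d of Eq. (11)

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (batch, steps, state_dim) internal RNN states s_t of the decoder
        v = torch.cat([states.max(dim=1).values,
                       states.min(dim=1).values], dim=-1)   # Eq. (10)
        return self.linear(v)   # domain logit; the sigma of Eq. (11) is folded into BCEWithLogitsLoss

# During training the states pass through the gradient reversal layer from the previous
# sketch before reaching this discriminator, so the decoder is pushed to make synthetic
# and real Hebrew sequences indistinguishable.
```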

We define the domain loss $\mathcal{L}_d$ as in Equation (7), and our final energy function is then:

$E\big(\theta_f, \theta_y, \theta_d\big) = \mathcal{L}_S\big(\theta_f, \theta_y\big) \; - \; \lambda\, \mathcal{L}_d\big(\theta_f, \theta_y, \theta_d\big)$    (12)

This modification matters because, with $G_d$ attached to the decoder, adaptation reaches parts of the network that are not directly improved by the additional data. When combined with multi-task learning, our final energy function becomes

$E\big(\theta_f, \theta_y^{S}, \theta_y^{S'}, \theta_d\big) = \mathcal{L}_S\big(\theta_f, \theta_y^{S}\big) + \mathcal{L}_{S'}\big(\theta_f, \theta_y^{S'}\big) \; - \; \lambda\, \mathcal{L}_d\big(\theta_f, \theta_y^{S}, \theta_d\big)$    (13)

At every training step we optimize the losses of all three parts on each training batch. The complete architecture, with all components and unsupervised domain adaptation applied to the decoder, is shown in Figure 3. During training we use λ = 0.5, a value determined experimentally.
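As a concrete illustration, a sketch of how the terms of Equation (13) might be combined into a single scalar objective for one batch, assuming the individual losses have already been computed and the domain discriminator sits behind a gradient reversal layer:

```python
import torch

def combined_objective(loss_syn_hebrew: torch.Tensor,
                       loss_french: torch.Tensor,
                       loss_domain: torch.Tensor,
                       lam: float = 0.5) -> torch.Tensor:
    """Eq. (13): transcription losses on SynHSNS and FSNS plus the weighted domain term.

    With a gradient reversal layer in place, simply adding `lam * loss_domain` here
    trains the discriminator while the reversed gradient simultaneously pushes the
    shared features and decoder to fool it.
    """
    return loss_syn_hebrew + loss_french + lam * loss_domain
```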

4. Experiments

Our proposed setting is unique and highly specific, so to evaluate it properly we introduce two new datasets containing real and synthetic images of Hebrew street name signs. Combining them with the existing French Street Name Signs (FSNS) dataset, we demonstrate the effectiveness of our domain adaptation technique and our simple multi-task learning scheme. We then show that using both techniques together performs better than using either alone, and we provide a detailed empirical analysis of our results.

Table 1: Full-sequence accuracy of the systems discussed in this paper on the test data of each dataset. Check marks indicate which datasets were available during training for each experiment. The most important result is the accuracy on HSNS (the real Hebrew dataset), the target of our system. We also report accuracy on SynHSNS and FSNS, although optimizing performance on these datasets is not our goal; nonetheless, the fact that our system does not severely degrade performance on them is useful for building a more general system.

Figure 4: Sample images from the HSNS (top), SynHSNS (middle), and FSNS (bottom) datasets.

The metric we report for all techniques is full-sequence accuracy: a sample is considered correctly classified only if every character in the sample is predicted correctly.
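A minimal sketch of this metric, assuming predictions and ground truth are given as lists of strings:

```python
def full_sequence_accuracy(predictions, ground_truth):
    """A sample counts as correct only if its entire predicted string matches exactly."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# Example: one of two sequences matches exactly, so the accuracy is 0.5.
assert full_sequence_accuracy(["abc", "de"], ["abc", "df"]) == 0.5
```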

Unfortunately, in the absence of an alternative method for reliable hyperparameter optimization, we follow [3] and tune directly on a small labeled validation set. We acknowledge this is not ideal, since, strictly speaking, any labeled data available during training should be used for training. We therefore hope the research community will propose an alternative way to validate unsupervised domain adaptation schemes; for now, we leave the development of such a metric to future work.

4.1 Datasets

4.1.1 Hebrew street name signs

This is our target dataset: we collected roughly 92,000 cropped images of Hebrew-language street signs from Israel. We split it into 89,936 training images, 899 validation images, and 903 test images, of which only the validation and test images have labels. When splitting the data we kept a geographic distance of at least 100 meters between any training/validation location and any test image, ensuring the system was never exposed to test signs during training or validation. All images have a resolution of 150×150.

Many Hebrew street signs have specific prefixes that translate to words like "street," "road," and "avenue." These words are usually written in a much smaller font than the rest of the sign, making them hard to read at 150×150 resolution. Since many Israeli map services do not include these prefixes, we also chose to exclude them from the transcriptions.

We release these data as the Hebrew Street Name Signs (HSNS) dataset. A sample from this dataset is shown in Figure 4. Although the images were collected in full RGB color and will be released in full RGB color, in all subsequent experiments we convert each image to grayscale to be consistent with our synthetic images, described next.

4.1.2 Synthetic Hebrew street name signs

We chose a relatively simple scheme for generating synthetic data. This decision was made partly because it would be difficult to generate more complex, natural-looking synthetic data, and partly because, based on our observations, the synthetic data only needs to share content with the target data, since we handle style by other means.

Our synthetic images therefore consist of a simple text rendering, a box behind the text, a perspective transformation, and slight blurring. When rendering the text we randomly select one of 19 different Hebrew fonts. In some cases we randomly add English text or numbers below or above the Hebrew, and we do not include this text in the ground-truth transcription. The size and position of the text, the parameters of the perspective transform, and the amount of blur are all chosen randomly. The text itself is chosen from a list of real Israeli street names. To better match the text distribution of HSNS, we also randomly add small-font prefixes that translate to Hebrew words like street, road, and avenue. We found these prefixes to be critical for performance: they frequently appear in real images but are usually too small to read, so including them in the synthetic data while omitting them from the transcription signals to the system that they need not be transcribed. We generate all images at 150×150 resolution.

To further simplify generation, all synthetic images are produced in grayscale. This greatly simplifies the process, since it avoids having to produce images with realistic color ranges. The gray levels of each image are chosen randomly, although we enforce a minimum contrast between the text and the box behind it. We use a solid color as the background, because initial tests with more complex backgrounds, such as Gaussian noise, did not produce any performance difference.
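For illustration, a minimal sketch of a synthetic sign generator in this spirit using Pillow; the font path, layout, and parameter ranges below are placeholders and not the exact SynHSNS settings:

```python
import random
from PIL import Image, ImageDraw, ImageFilter, ImageFont

def render_synthetic_sign(text: str, font_path: str, size: int = 150) -> Image.Image:
    """Render a simple grayscale sign: text on a solid box, a mild random
    perspective warp, and slight blur. All ranges are illustrative."""
    background = random.randint(0, 255)
    box = random.randint(0, 255)
    text_gray = random.randint(0, 255)
    while abs(text_gray - box) < 80:            # enforce minimum text/box contrast
        text_gray = random.randint(0, 255)

    img = Image.new("L", (size, size), background)
    draw = ImageDraw.Draw(img)
    draw.rectangle([15, 50, size - 15, 100], fill=box)          # box behind the text
    font = ImageFont.truetype(font_path, size=random.randint(18, 30))
    draw.text((25, 60), text, fill=text_gray, font=font)

    # Mild random perspective distortion (8 coefficients of a projective map).
    coeffs = [
        1 + random.uniform(-0.05, 0.05), random.uniform(-0.05, 0.05), random.uniform(-5, 5),
        random.uniform(-0.05, 0.05), 1 + random.uniform(-0.05, 0.05), random.uniform(-5, 5),
        random.uniform(-0.0005, 0.0005), random.uniform(-0.0005, 0.0005),
    ]
    img = img.transform((size, size), Image.PERSPECTIVE, coeffs, resample=Image.BILINEAR)
    return img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.0, 1.0)))
```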

We generated approximately 430,000 synthetic images for training and 10,000 for evaluation and testing (see Figure 4). We release these data along with HSNS as the Synthetic Hebrew Street Name Signs (SynHSNS) dataset.

4.1.3 French street name signs

In addition to the two Hebrew datasets above, we also use the existing French Street Name Signs (FSNS) dataset [30] for multi-task learning. FSNS contains approximately 1 million training samples of French street name signs, 20,000 evaluation samples, and 16,000 test samples, each containing 1-4 views of the same sign at 150×150 resolution. To be consistent with HSNS and SynHSNS, we use only one of these views during training, taking the view listed first. Likewise, we convert each image to grayscale to stay consistent with the synthetic images. Sample images from the original FSNS dataset are shown in Figure 4.

4.2 Implementation details

With the exception of the fine-tuning experiment described in Section 4.3.2, all training uses stochastic gradient descent with a learning rate of 0.0047 and a momentum of 0.75. For each domain actually used in training, we train for 800,000 steps with a batch size of 15. When domain adaptation components are used, we enable them starting at 20,000 steps and compute the losses of Equations (12) and (13) with λ = 0.5. All input images have a resolution of 150×150, matching the three datasets.
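A small sketch of these settings, assuming a PyTorch training loop; the model below is only a placeholder standing in for the full OCR network:

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder for the full OCR network
optimizer = torch.optim.SGD(model.parameters(), lr=0.0047, momentum=0.75)

def domain_loss_weight(step: int, enable_at: int = 20_000, lam: float = 0.5) -> float:
    """The domain-adaptation term is switched on after 20,000 steps, with lambda = 0.5."""
    return lam if step >= enable_at else 0.0
```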

4.3 Domain adaptation and joint training

4.3.1 Baselines

To demonstrate the effectiveness of our system, we must show that our method performs better than a naive approach. We therefore define the HSNS baseline as the test performance of a system trained only on SynHSNS data. The results of this experiment are reported as "Baseline" in Table 1.

For reference, Table 1 also includes the performance of a system trained only on the version of FSNS used in all our experiments, listed as "FSNS Baseline." As noted above, our use of FSNS differs from standard usage: we use only one of the up to four possible views of each sign, and we remove all color from the images. Thus, although the FSNS numbers we report here are lower than those reported for the system in [43], the two experiments were not run on exactly the same data. We also emphasize that our goal is to optimize performance on HSNS, not FSNS, so these numbers are for reference only.

4.3.2 Multi-task learning baselines

We report the results of the multi-task learning scheme described in Section 3.2, in which we train on both the SynHSNS and FSNS datasets.

We report this as "Multi-task Training (MT)" in Table 1. As with the baseline above, no HSNS data is seen during training, yet we still achieve 36.54% accuracy on the HSNS test set. Thus, merely by learning to parse real French images, the model improves by roughly 18 points on real Hebrew images, supporting our hypothesis that the system can better understand the real-image style of the Hebrew data just by seeing real French data.

In addition to the joint-training scheme described above, we also evaluate a simple fine-tuning scheme, shown in Table 1 as "Fine-tuning." Here we first train the whole system for 800,000 steps on the FSNS dataset; we then replace FSNS with SynHSNS and train for an additional 66,000 steps at a reduced learning rate of 0.002 (further training steps did not improve HSNS performance). Table 1 reports the results of both approaches. Multi-task learning is superior to fine-tuning, probably because the second training phase erodes some of the benefit gained from the French data in the first phase.

4.3.3 Domain adaptation

To evaluate the effectiveness of gradient reversal, we performed two more experiments, both based on the RNN-centric domain adaptation described in Section 3.3.2.

Figure 5: An example of visually indistinguishable Hebrew letters.

The first experiment, labeled "Domain Adaptation (DA)" in Table 1, applies the RNN-level adaptation using only HSNS and SynHSNS as inputs and explicitly optimizes the loss in Equation (12). The architecture is that of Figure 3 (right), with the FSNS input removed.

Our second experiment, labeled “DA+MT,” uses all three data sets as inputs and is a test of the entire system, as shown in Figure 3 (right). This experiment stands out because it is the only one to make use of all three available data sets.

From these experiments we see that domain adaptation alone between HSNS and SynHSNS is enough to improve performance from 18.49% to 38.64%. More interestingly, combining it with multi-task training improves performance to 50.16%. In particular, the marginal increase from DA to DA+MT (around 11 points) is non-trivial, and likewise the increase from MT to DA+MT (around 14 points) is quite significant.

We believe this supports our hypothesis that domain adaptation targets content while multi-task learning targets style, because it suggests that the improvements provided by the two techniques are largely independent, that is, domain adaptation and multi-task learning help for different reasons. If the techniques were not complementary, and DA and MT improved performance by addressing the same characteristics of the target, we would expect only a small marginal improvement when using them together, since that would indicate substantial overlap between the two.

4.3.4 Error analysis

The Hebrew alphabet is a challenging character set: it has multiple characters that are difficult for both humans (untrained or non-Hebrew speakers) and machines to distinguish, as shown in Figure 5. There are several other such characters, and together they make up 22.7% (1596/7013) of all printable characters in the validation set. Interestingly, all models confuse these characters, and their accuracy on them does not vary much (e.g., the MT model mistook VAV for YOD 40/894 times, and MT+DA 41/894 times).

Another interesting observation concerns how the network learns to represent whitespace characters, specifically the NULL character (which terminates the sequence) and the SPACE character. Figure 6 shows a t-SNE dimensionality-reduction plot of the character embeddings. We observe that the clusters formed by the NULL and SPACE characters become more isolated from other clusters as network performance improves. We also see this confusion in the performance numbers: "MT" classifies SPACE as NULL 88/620 times, while "MT+DA" makes this error only 45/620 times. We believe this phenomenon can be explained by looking at the regions around the characters.

In our view, the main difference in visual appearance between a synthetic image and a natural image lies in the style of the regions without characters. In a tight crop there is little difference between a real and a synthetic image, but our model operates on a larger context in which the area around the text may be too distracting for the model to ignore easily; regions without characters are the ones transcribed as NULL and SPACE.

Figure 6: Visualization of per-character feature embeddings in a network trained with multi-task learning only (left) and with multi-task learning plus DA (right). Each cluster corresponds to an individual character of the Hebrew alphabet. The red dot at the top corresponds to the SPACE character, and the red dot at the bottom corresponds to the NULL (end-of-sequence) character.

5. Conclusion

In this paper we explore ways to build a system that performs sequence OCR on photos of street name signs in a language with no labeled real data. To this end, we introduce two new datasets: SynHSNS, a synthetic Hebrew street-sign dataset, and HSNS, a real dataset of unlabeled Hebrew street name signs. We demonstrate that our approach, which leverages existing labeled data in another language together with easily generated synthetic data in the target language, can greatly improve performance in the target domain by transferring information about both style and content.

References

[1] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 2010. 4

[2] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Pages 95 — 104, July 2017

[3] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, Pages 343 — 351. Curran Associates, Inc., 2016. 2, 6

[4] Y. Chen,W. Li, C. Sakaridis, D. Dai, and L. V. Gool. Domain adaptive faster r-cnn for object detection in the wild. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. 2

[5] C. Chu and R. Wang. A survey of domain adaptation for neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1304–1319. Association for Computational Linguistics, 2018.

[6] H. Daume III. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 256–263. Association for Computational Linguistics, 2007.

[7] M. Fadaee, A. Bisazza, and C. Monz. Data augmentation for low-resource neural machine translation. In ACL, 2017. 2

[8] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32Nd International Conference on Machine Learning – Volume 37, ICML ’15, Pages 1180 — 1189. JMLR.org, 2015. 2

[9] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. J. Mach. Learn. Res., 17(1):2096–2030, Jan. 2016.

[10] J.-L. Gauvain and C.-H. Lee. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 2(2):291–298, April 1994.

[11] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML ’11, Pages 513-520, USA, 2011. Omnipress. 2

[12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D.Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, Pages 2672 — 2680. Curran Associates, Inc., 2014

[13] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf. Covariate shift and local learning by distribution matching, pages 131–160. MIT Press, Cambridge, MA, USA, 2009.

[14] P. Haeusser, T. Frerix, A. Mordvintsev, and D. Cremers. Associative domain adaptation. In 2017 IEEE International Conference on Computer Vision (ICCV), Pages 2784 — 2792, Oct 2017.2

[15] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1989–1998, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[16] J. Hoffman, D. Wang, F. Yu, and T. Darrell. FCNs in the wild: Pixel-level adversarial and constraint-based adaptation. CoRR, abs/1612.02649, 2016.

[17] N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa. Crossdomain weakly-supervised object detection through progressive domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. 2

[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, Pages 1097 — 1105. Curran Associates, Inc., 2012. 1, 2

[19] Y. Lecun, L. Bottou, Y. Bengio, And P. Haffner. Gradient-based learning Applied to Document Recognition. In Proceedings of the IEEE, Pages 2278 — 2324, 2 1998.

[20] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning – Volume 37, ICML ’15, Pages 97 — 105. JMLR.org, 2015. 2

[21] A. Mohammadian, H. Aghaeinia, F. Towhidkhah, and S. Seyyedsalehi. Subject adaptation using selective style transfer mapping for detection of facial action units. Expert Systems with Applications, 56, 03 2016. 2

[22] Y. Netzer, T.Wang, A. Coates, A. Bissacco, B.Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011. 2

[23] K. Saito, Y. Ushiku, and T. Harada. Asymmetric tri-training for unsupervised domain adaptation. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2988–2997, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[24] K. Saito, K.Watanabe, Y. Ushiku, and T. Harada. Maximum Classifier Discrepancy for Unsupervised Domain Adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. 2

[25] K. Saito, S. Yamamoto, Y. Ushiku, and T. Harada. Open set domain adaptation by backpropagation. In The European Conference on Computer Vision (ECCV), September 2018. 2

[26] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. 2

[27] S. Sankaranarayanan, Y. Balaji, A. Jain, S. Lim, and R. Chellappa. Unsupervised domain adaptation for semantic segmentation with GANs. CoRR, abs/1711.06969, 2017.

[28] P. Sarkar and G. Nagy. Style-consistency in isogenous patterns. In Proceedings of Sixth International Conference on Document Analysis and Recognition, Pages 1169 — 1174, Sept 2001.2

[29] R. Shu, H. Bui, H. Narui, and S. Ermon. A DIRT-t approach to unsupervised domain adaptation. In International Conference on Learning Representations (ICLR), 2018. 2

[30] R. Smith, C. Gu, D.-S. Lee, H. Hu, R. Unnikrishnan, J. Ibarz, S. Arnoud, and S. Lin. End-to-end interpretation of the french street name signs dataset. In ECCV Workshops, 2016. 2, 7

[31] B. Sun, J. Feng, and K. Saenko. Return of frustratingly easy domain adaptation. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI ’16, Pages 2058 — 2065. AAAI Press, 2016. 2

[32] B. Sun and K. Saenko. Deep coral: Correlation alignment for deep domain adaptation. In G. Hua and H. J´egou, Editors, Computer Vision — ECCV 2016 Workshops, Pages 443 — 450, Cham, 2016. Springer International Publishing. 2

[33] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, 2014.

[34] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Pages 2818 — 2826, 2016. 2, 3

[35] C. Thomas and A. Kovashka. Artistic object recognition by unsupervised style adaptation. In C. V. Jawahar, H. Li, G. Mori, and K. Schindler, Editors, Computer Vision — ACCV 2018, Pages 460-476, Cham, 2019. Springer International Publishing. 2

[36] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Pages 2962 — 2971, 2017. 2

[37] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing domain invariance. CoRR, abs/1412.3474, 2014.

[38] L. van der Maaten and G. E. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

[39] S. Veeramachaneni and G. Nagy. Adaptive classifiers for multisource ocr. Document Analysis and Recognition, (3) : 154-166, Mar 2003. 2

[40] R. Volpi, P. Morerio, S. Savarese, and V. Murino. Adversarial feature augmentation for unsupervised domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. 2

[41] Z. Wang, J. Yang, H. Jin, E. Shechtman, A. Agarwala, J. Brandt, and T. S. Huang. Real-world font recognition using deep network and domain adaptation. CoRR, abs/1504.00028, 2015.

[42] Z. Wang, J. Yang, H. Jin, E. Shechtman, A. Agarwala, J. Brandt, and T. S. Huang. Decomposition-based domain adaptation for real-world font recognition.

[43] Z.Wojna, A. N. Gorban, D.-S. Lee, K. Murphy, Q. Yu, Y. Li, and J. Ibarz. Attention-based extraction of structured information from street view imagery. 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2017. 1, 2, 3, 6, 7

[44] Z. Yang, Z. Hu, C. Dyer, E. P. Xing, and T. Berg-Kirkpatrick. Unsupervised text style transfer using language models as discriminators. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, Pages 7287 — 7298. Curran Associates, Inc., 2018

[45] X.-Y. Zhang and C.-L. Liu. Writer adaptation with style transfer mapping. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:1773 — 1787, 2013. 2

[46] Y. Zhang, W. Cai, and Y. Zhang. Separating style and content for generalized style transfer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. 2

[47] Y. Zhang, P. David, and B. Gong. Curriculum domain adaptation for semantic segmentation of urban scenes. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017. 2

[48] B. Zoph, D. Yuret, J. May, and K. Knight. Transfer learning for low-resource neural machine translation. Pages 1568–1575, 2016.