UIzard Technologies, a Copenhagen-based startup, has automated part of the web design process for developers by training a neural network to translate screenshots of graphical user interfaces into lines of code. Remarkably, the same model works across platforms, including iOS, Android, and web interfaces, and at its current stage of development the algorithm reaches roughly 77% accuracy.

A research paper published by the company explains how the model, called Pix2Code, works. Here's the takeaway: as with all machine learning, the model must be trained on examples of the task at hand. Unlike models that generate images from images or convert text to text, however, this one takes images as input and generates corresponding text (in this case, code) as output. To achieve this, the researchers break the problem into three steps. First, the model uses computer vision to understand a GUI image and its elements (buttons, bars, etc.). Next, it needs to understand computer code and be able to generate syntactically and semantically correct samples. The final challenge is to tie the previous two steps together: given the scene it has inferred, the model must generate the corresponding textual description, i.e. the code.

UI and graphic designers with only a basic knowledge of code could build their own websites with its help. On the other hand, it also makes it easier to copy code from other sites, a problem that has long plagued developers. While a spirit of collaborative sharing is already common among programmers on sites like GitHub, some developers, especially those who build sites for clients who want their sites to remain original, don't want others to lift their code. In practice, Pix2Code should save developers time: they can feed a JPEG image of a designed interface into Pix2Code, generate working code, and then tweak and optimize it further.

UIzard Technologies continues to refine the model, training it with more data to improve accuracy. Founder and CEO Tony Beltramelli recently completed his machine learning graduate studies at the IT University of Copenhagen and ETH Zurich, and he has also considered contributing Pix2Code back to the school. "The Internet could theoretically support an unlimited amount of training data, given the already vast number of websites accessible online and the fact that new websites are being developed every day," he wrote in the research paper. "We conclude that deep learning used in this way could eventually end the need for manually programmed GUIs [graphical user interfaces]."

Pix2Code is UIzard's first app and is still in beta. The company's vision is to spare developers, designers, and startups from writing code in the early stages of development, leaving more time to prototype and iterate and, ultimately, to produce better apps and websites.

  • Paper: https://arxiv.org/pdf/1705.07962.pdf

  • Project (GitHub): https://github.com/tonybeltramelli/pix2code

  • Try the app: https://uizard.io/?email_field=mmill06%40gmail.com

Machine Heart provides an overview of the paper below:

Abstract: Developers often take screenshots of graphical user interfaces (GUIs) created by designers and implement them in software, websites, and mobile applications by writing computer code. In this paper, we show that deep learning techniques can be used to automatically generate code given a graphical user interface image as input. Our model is able to generate code from a single input image with over 77% accuracy for three different platforms (i.e., iOS, Android, and web-based technologies).

Introduction

It is the developer's responsibility to implement, on the client side, the graphical user interface (GUI) designed by the designer. However, writing the code that implements a GUI is time consuming and keeps developers from implementing the actual features and logic of the software. In addition, the computer languages used to implement such GUIs differ from platform to platform, leading to tedious and repetitive work when software is developed for multiple platforms with native technologies. In this paper, we describe a system that can automatically generate platform-specific code given a graphical user interface screenshot as input. We infer that an extended version of this approach could end the need to program GUIs manually.

The first contribution of this article is Pix2Code, a new approach based on convolutional and recurrent neural networks that can generate computer code from a single GUI screenshot.

The second contribution of this article is the release of a synthesized dataset of GUI screenshots and associated source code for three different platforms. With the publication of this paper, the dataset is made freely available as open source to facilitate future research.

Related Work (omitted)

pix2code

The task of generating code from a GUI screenshot can be compared to the task of generating a textual description of a photographed scene. We can therefore divide the problem into three sub-problems. The first is a computer vision problem: understanding the given scene (in this case, a GUI screenshot) and inferring the identities, positions, and poses of the objects present (buttons, labels, element containers). The second is a language modeling problem: understanding text (in this case, computer code) and producing syntactically and semantically correct samples. Finally, code is generated by combining the solutions to the first two sub-problems, using the latent variables inferred from scene understanding to generate the corresponding textual description (in this case computer code rather than natural language).

Figure 1: Architectural overview of the Pix2Code model. During training, GUI screenshots are encoded by a CNN-based computer vision model, while the sequence of one-hot encoded tokens corresponding to the DSL (domain-specific language) code is encoded by a language model consisting of a stack of LSTM layers. The two resulting encoding vectors are then concatenated and fed into a second stack of LSTM layers used as a decoder. Finally, a softmax layer samples one token at a time; the output size of the softmax layer corresponds to the vocabulary size of the DSL. Given an image and a sequence of tokens, the model (i.e., the contents of the gray box) is differentiable and can therefore be optimized end-to-end by gradient descent to predict the next token in the sequence. At each prediction, the input state (i.e., the token sequence) is updated to include the last predicted token. At sampling time, the generated DSL code is compiled into the desired target language using traditional compiler design techniques.

3.1 Vision Model

CNNs are currently the method of choice for a wide range of vision problems because their topology makes it convenient to learn rich latent representations from the images they are trained on [14, 10]. We use a CNN to perform unsupervised feature learning by mapping the input image to a learned fixed-length vector; it therefore acts as the encoder shown in Figure 1.

Initially, the input image is resized to 256×256 pixels (the aspect ratio is not preserved), and the pixel values are normalized before being fed to the CNN. No further preprocessing is performed. To encode each input image as a fixed-size output vector, we exclusively use small 3×3 receptive fields convolved with a stride of 1, similar to the approach used by Simonyan and Zisserman for VGGNet [15]. These operations are applied twice before downsampling with max pooling. The first convolutional layers have a width of 32, followed by layers of width 64, and finally width 128. Two fully connected layers of size 1024, with rectified linear unit (ReLU) activations, complete the vision model.
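As a rough illustration, the encoder described above could be sketched as follows. This is a minimal reconstruction based only on the description in this section (3×3 convolutions with stride 1, widths 32/64/128, max pooling, two 1024-unit ReLU layers); the use of tensorflow.keras and the function name build_vision_encoder are assumptions, not the authors' actual implementation.

```python
# Minimal sketch of the CNN encoder described in Section 3.1 (assumed
# tensorflow.keras; not the authors' exact code).
from tensorflow.keras import layers, models

def build_vision_encoder(input_shape=(256, 256, 3)):
    """3x3 convs (stride 1) in widths 32/64/128 with max pooling, then two
    1024-unit ReLU dense layers producing the fixed-length encoding vector p."""
    image = layers.Input(shape=input_shape)
    x = image
    for width in (32, 64, 128):
        x = layers.Conv2D(width, (3, 3), strides=1, activation="relu")(x)
        x = layers.Conv2D(width, (3, 3), strides=1, activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation="relu")(x)
    p = layers.Dense(1024, activation="relu")(x)
    return models.Model(inputs=image, outputs=p, name="vision_encoder")
```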

Figure 2: An example of a native iOS GUI written in the DSL.

3.2 Language Model

We designed a simple DSL to describe GUIs, as illustrated in Figure 2. In this work we are interested only in the layout of the GUI, the different graphical controls, and their relationships to one another, so our DSL ignores the text values of label controls. Restricting the DSL in this way also reduces the size of the search space by keeping the vocabulary (i.e., the total number of tokens supported by the DSL) small. As a result, our language model can perform token-level language modeling on discrete inputs using one-hot encoded vectors, eliminating the need for word embedding techniques such as word2vec [12] and thereby greatly reducing the computational cost.
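To make the token-level, one-hot input concrete, here is a small illustrative sketch in Python. The DSL snippet and the vocabulary are invented for illustration (the paper's actual token names come from its released dataset), so only the encoding mechanism itself should be taken from this example.

```python
# Illustrative only: a made-up mini-DSL snippet and vocabulary, showing how a
# token sequence can be one-hot encoded for the language model. The real
# pix2code vocabulary comes from the published dataset, not from this list.
import numpy as np

dsl_snippet = "stack { row { label, btn } }"       # hypothetical DSL layout
vocab = ["<START>", "<END>", "stack", "row", "label", "btn", "{", "}", ","]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def one_hot_encode(tokens, token_to_id):
    """Return a (sequence_length, vocab_size) array of one-hot rows."""
    encoding = np.zeros((len(tokens), len(token_to_id)), dtype=np.float32)
    for position, token in enumerate(tokens):
        encoding[position, token_to_id[token]] = 1.0
    return encoding

tokens = ["<START>"] + dsl_snippet.replace(",", " , ").split() + ["<END>"]
x = one_hot_encode(tokens, token_to_id)
print(x.shape)  # (number of tokens, vocabulary size)
```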

In most programming and markup languages, an element is declared with an opening token; if child elements or instructions are contained in a block, the interpreter or compiler usually requires a closing token as well. When the number of children of a parent element is variable, the model must learn long-term dependencies in order to close an opened block. Traditional recurrent neural networks (RNNs) suffer from vanishing or exploding gradients when fitting such data, so we instead use long short-term memory (LSTM) networks, which handle this problem well. The different LSTM gate outputs can be computed with the following equations:
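The equations themselves appeared as an image in the original article. The LaTeX block below restates the standard LSTM gate updates in a form consistent with the symbol definitions in the next paragraph (φ for the sigmoid, σ for the hyperbolic tangent); the specific weight-matrix subscripts are a notational choice, and this is a reconstruction rather than the paper's exact typesetting.

```latex
\begin{aligned}
i_t &= \phi(W_{ix} x_t + W_{iy} h_{t-1} + b_i) \\
f_t &= \phi(W_{fx} x_t + W_{fy} h_{t-1} + b_f) \\
o_t &= \phi(W_{ox} x_t + W_{oy} h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \sigma(W_{cx} x_t + W_{cy} h_{t-1} + b_c) \\
h_t &= o_t \odot \sigma(c_t)
\end{aligned}
```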

where W are the weight matrices, x_t is the new input vector at time t, h_{t-1} is the previously produced output vector, c_{t-1} is the previously produced cell state output, b are the bias terms, and φ and σ are the sigmoid and hyperbolic tangent functions, respectively.

3.3 Composite Model

Our model is trained in a supervised manner: it is fed an image I and a contextual sequence X of T tokens x_t, t ∈ {0 … T − 1}, as inputs, with the token x_T as the target label. As shown in Figure 1, the CNN-based vision model encodes the input image I into a vector representation p, while the LSTM-based language model encodes the input token x_t into an intermediate representation q_t, allowing the model to focus more on certain tokens and less on others [7].

The language model is implemented as a stack of two LSTM layers with 128 cells each. The vision-encoded vector p and the language-encoded vector q_t are concatenated into a single vector r_t, which is then fed into a second LSTM-based model that decodes the representations learned by both the vision model and the language model. The decoder thus learns to model the relationship between the objects in the input GUI image and the tokens in the DSL code. Our decoder is implemented as a stack of two LSTM layers with 512 cells each. The entire architecture can be expressed mathematically as:
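The equations, shown as an image in the original article, can be reconstructed from the description above roughly as follows (p is the image encoding, q_t the language encoding, r_t their concatenation, and y_t the distribution over the next token); this is a reconstruction rather than the paper's exact typesetting:

```latex
\begin{aligned}
p   &= \mathrm{CNN}(I) \\
q_t &= \mathrm{LSTM}(x_t) \\
r_t &= (q_t, p) \\
y_t &= \mathrm{softmax}\big(\mathrm{LSTM}'(r_t)\big) \\
x_{t+1} &= y_t
\end{aligned}
```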

This architecture allows the entire Pix2Code model to be optimized end-to-end by gradient descent, so that it learns to predict the next token after seeing the image and the preceding tokens in the sequence.
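For readers who prefer code, a rough functional sketch of the composite model is given below. It assumes tensorflow.keras, reuses the hypothetical build_vision_encoder function from the Section 3.1 sketch, and fixes the layer sizes stated in the text; the sequence handling and interface are illustrative assumptions, not the authors' released implementation.

```python
# Rough sketch of the composite pix2code-style model: a vision encoder, a
# two-layer 128-cell LSTM language encoder, and a two-layer 512-cell LSTM
# decoder over the concatenated representations. Assumes the
# build_vision_encoder sketch from Section 3.1 is available.
from tensorflow.keras import layers, models

def build_pix2code_like_model(vocab_size, context_length=48,
                              image_shape=(256, 256, 3)):
    # Vision encoder producing a fixed-length vector p for the screenshot.
    image_input = layers.Input(shape=image_shape)
    p = build_vision_encoder(image_shape)(image_input)
    # Repeat p so it can be concatenated with the encoding of every token.
    p_seq = layers.RepeatVector(context_length)(p)

    # Language encoder: two LSTM layers with 128 cells each, producing q_t.
    token_input = layers.Input(shape=(context_length, vocab_size))
    q = layers.LSTM(128, return_sequences=True)(token_input)
    q = layers.LSTM(128, return_sequences=True)(q)

    # Concatenate r_t = (q_t, p) and decode with two 512-cell LSTM layers,
    # then predict a distribution over the next token with a softmax layer.
    r = layers.concatenate([q, p_seq])
    d = layers.LSTM(512, return_sequences=True)(r)
    d = layers.LSTM(512, return_sequences=False)(d)
    y = layers.Dense(vocab_size, activation="softmax")(d)

    return models.Model(inputs=[image_input, token_input], outputs=y)
```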

3.4 Training

The sequence length T used for training is important for modeling long-term dependencies. After empirical testing, the DSL input files used for training were segmented with a sliding window of size 48; in other words, we unroll the recurrent neural network for 48 steps.
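A minimal sketch of such sliding-window segmentation, assuming an already-tokenized DSL file; the helper name, padding token, and exact padding behavior are illustrative assumptions rather than the paper's preprocessing code.

```python
# Illustrative sliding-window segmentation of a token sequence: each training
# example pairs a 48-token context with the token that follows it. Padding the
# start with an assumed empty token mirrors the sampling setup in Section 3.5.
def sliding_windows(tokens, context_length=48, empty_token="<PAD>"):
    padded = [empty_token] * (context_length - 1) + ["<START>"] + tokens
    examples = []
    for i in range(len(padded) - context_length):
        context = padded[i:i + context_length]        # 48-token input context
        target = padded[i + context_length]            # next token to predict
        examples.append((context, target))
    return examples

# Example with a made-up token list:
pairs = sliding_windows(["stack", "{", "row", "{", "label", "}", "}", "<END>"])
print(len(pairs), pairs[0][1])  # number of examples, first target token
```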

Backpropagation is performed by taking the partial derivatives of the loss function with respect to the network weights, so the model can be trained by minimizing the multiclass log loss:
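The loss itself appeared as an image in the original article; reconstructed from the surrounding definitions, it reads, up to notation:

```latex
L(I, X) = -\sum_{t=1}^{T} x_{t+1} \log\left(y_t\right)
```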

where x_{t+1} is the expected token and y_t is the predicted token.

3.5 Sampling

To generate DSL code, we feed the model a GUI image and a contextual sequence X of T = 48 tokens, where all tokens but the last are initially set to the empty token and the last token of the sequence is set to the special <START> token.
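A minimal sketch of greedy sampling under this scheme, assuming a trained model with the two-input interface sketched in Section 3.3; the <PAD> and <END> token names, the helper name, and the stopping condition are illustrative assumptions.

```python
# Illustrative greedy sampling loop: starting from an "empty" context ending
# in <START>, repeatedly predict the most likely next token, append it to the
# context, and stop at an assumed <END> token.
import numpy as np

def sample_dsl(model, image, token_to_id, id_to_token,
               context_length=48, max_tokens=150):
    vocab_size = len(token_to_id)
    context = ["<PAD>"] * (context_length - 1) + ["<START>"]
    generated = []
    for _ in range(max_tokens):
        # One-hot encode the current 48-token context.
        x = np.zeros((1, context_length, vocab_size), dtype=np.float32)
        for i, tok in enumerate(context):
            x[0, i, token_to_id[tok]] = 1.0
        probs = model.predict([image[np.newaxis], x], verbose=0)[0]
        next_token = id_to_token[int(np.argmax(probs))]    # greedy choice
        if next_token == "<END>":
            break
        generated.append(next_token)
        context = context[1:] + [next_token]               # slide the window
    return " ".join(generated)
```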

Table 1: Data set statistics

Experiments

Figure 3: Training loss on the different training sets, and ROC curves from sampling the models trained for 10 epochs.

Table 2: Test results reported on the test sets described in Table 1.

Figures 4, 5, and 6 show input GUI images (ground truth) alongside the GUIs generated by the trained Pix2Code model.

Figure 4: Test sample from the iOS GUI dataset.

Figure 5: Test sample from the Android GUI dataset.

Figure 6: Test sample from the web GUI dataset.

Conclusion

In this paper, we proposed pix2code, a novel method for generating computer code given a GUI image as input. Although our work demonstrates the potential of such a system to automatically generate GUI code, it only scratches the surface of that potential. Our model has relatively few parameters and was trained on relatively small datasets; building more complex models and training them on larger datasets should significantly improve the quality of the generated code. The quality of the generated code could be improved further by adopting various regularization methods and implementing attention mechanisms [1]. Moreover, the one-hot encoding used in this model does not provide any information about the relationships between tokens, whereas word embedding models such as word2vec [12] might help; one-hot encoding also limits the number of tokens that can be supported. Finally, since generative adversarial networks (GANs) have recently proven extremely good at image generation, perhaps GANs and their underlying ideas could be used to generate computer code from GUI images.