This article is reproduced with the authorization of the AI media outlet QbitAI (public account ID: qbitai); please contact the source for reprint permission.
How much trouble can it be to write a web page? In most companies, there are three steps:
The product manager completes user research and lists a series of technical requirements;
Designers create low-fidelity prototypes based on these requirements and gradually refine them into high-fidelity prototypes and UI design mockups;
Engineers turn these mockups into code, which eventually becomes the product that users use.
With so many stages, a problem at any one of them lengthens the development cycle. As a result, companies like Airbnb have started using machine learning to make the process more efficient.
△ Airbnb's internal AI tools go from drawing to code in one step
It looks nice, but Airbnb has yet to reveal details about end-to-end training of the model or how hand-designed image features contribute to it. It is a proprietary, closed-source solution that may never be made public.
Fortunately, a programmer named Ashwin Kumar has created an open source version to make the developer/designer’s job much easier.
Here’s a translation from his blog:
Ideally, this model can quickly generate a usable HTML site from a simple hand-drawn prototype of the site design:
△ The SketchCode model uses hand-drawn wireframes to generate HTML websites
In fact, the example above is a real website generated by the trained model from an image in the test set; for the code, visit: https://github.com/ashnkumar/sketch-code.
Taking inspiration from image captioning
The problem at hand is part of a broader task called program synthesis, the automatic generation of working source code. While much program synthesis research uses natural language specifications or execution traces to generate code, in this case I start from a source image: a given hand-drawn wireframe.
One of the hottest areas of machine learning research is image captioning, which aims to build models that link images to text, in particular by generating descriptions of the content of a source image.
△ An image captioning model generates a text description of the source image
Inspired by the Pix2Code paper and a related project that applied this approach, I decided to frame my task as an image captioning problem, with a wireframe drawing of a website as the input image and the corresponding HTML code as the output text.
Note: the two references mentioned in the preceding paragraph are:
Pix2Code paper: https://arxiv.org/abs/1705.07962
FloydHub tutorial: https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/?source=techstories.org
Getting the right dataset
Having settled on the image captioning approach, my ideal training dataset would consist of thousands of pairs of hand-drawn wireframes and the corresponding HTML output code. Unsurprisingly, no such dataset existed, so I had to create one for this task.
To start, I used the open-source dataset presented in the Pix2Code paper, which consists of 1,750 screenshots of synthetically generated websites together with their corresponding source code.
△ Generated website images and source code from the Pix2Code dataset
This is a nice data set with several interesting points:
Each generated site in this dataset is built from a few simple helper elements, such as buttons, text boxes, and div objects. Although this means the model is limited to these few elements as the vocabulary it can choose from when generating output, the approach should generalize easily to a larger vocabulary of elements.
The source code for each sample consists of tokens from a domain-specific language (DSL) that the paper's authors created for this task. Each token corresponds to a fragment of HTML and CSS, and a compiler converts the DSL into running HTML code; a toy sketch of this mapping is shown below.
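To make the DSL-to-HTML idea concrete, here is a minimal toy sketch in Python; the token names and HTML fragments are hypothetical, and the real Pix2Code DSL and compiler handle nesting and are considerably more involved.

```python
# Toy illustration only: hypothetical DSL tokens mapped to HTML fragments.
TOKEN_TO_HTML = {
    "header": '<div class="header">Header</div>',
    "btn-active": '<button class="btn btn-primary">Button</button>',
    "text": "<p>Lorem ipsum dolor sit amet.</p>",
    "quadruple": '<div class="col-3"></div>',
}

def compile_dsl(tokens):
    """Map each DSL token to its HTML fragment and wrap the result in a page."""
    body = "\n".join(TOKEN_TO_HTML.get(token, "") for token in tokens)
    return "<html><body>\n{}\n</body></html>".format(body)

print(compile_dsl(["header", "btn-active", "text"]))
```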
Converting website images into hand-drawn sketches
To adapt the dataset to my task, I needed the website images to look as if they had been drawn by hand. I experimented with modifying each image using Python tools such as the OpenCV and PIL libraries, applying operations like grayscale conversion and contour detection.
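As a rough illustration of what such experiments might look like (not the project's actual preprocessing code; file names are placeholders), here is a minimal OpenCV sketch of grayscale conversion and contour detection:

```python
import cv2
import numpy as np

# Load a generated site screenshot (placeholder path) and convert to grayscale.
image = cv2.imread("site_screenshot.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Detect edges, then extract contours (OpenCV 4.x return signature).
edges = cv2.Canny(gray, threshold1=50, threshold2=150)
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Draw the contours on a white canvas to approximate hand-drawn lines.
sketch = np.full_like(gray, 255)
cv2.drawContours(sketch, contours, -1, color=0, thickness=2)
cv2.imwrite("site_sketch.png", sketch)
```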
Eventually, I decided to modify the CSS stylesheet of the original site directly by doing the following:
Changing the border radius of elements on the page to smooth the edges of buttons and div objects;
Adjusting border thickness to imitate a sketch, and adding drop shadows;
Changing the original font to a handwriting-style font.
As a final step, I added image augmentation to this pipeline, applying skews, shifts, and rotations to simulate the variability of actual hand-drawn sketches.
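A minimal sketch of that kind of augmentation using Keras's ImageDataGenerator; the parameter values and directory path are placeholder assumptions rather than the project's actual settings:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Small random rotations, shifts, and shears mimic the natural wobble of a
# hand-drawn wireframe; "nearest" fill keeps the white background intact.
augmenter = ImageDataGenerator(
    rotation_range=5,        # degrees
    width_shift_range=0.05,  # fraction of image width
    height_shift_range=0.05, # fraction of image height
    shear_range=0.05,        # shear intensity
    fill_mode="nearest",
)

# Stream augmented copies of the sketch-style screenshots from disk
# (placeholder directory; images only, no labels).
batches = augmenter.flow_from_directory(
    "data/sketches", target_size=(256, 256), batch_size=32, class_mode=None
)
```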
Using an image captioning model architecture
Now that I’ve processed the data set, it’s time to build the model.
I used a model architecture from image captioning, which consists of three main parts:
A computer vision model that uses a convolutional neural network (CNN) to extract image features from the source image;
A language model that uses a gated recurrent unit (GRU) to encode the sequence of source code tokens;
A decoder model (also a GRU) that takes the output of the first two parts as input and predicts the next token in the sequence, as sketched below.
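A minimal Keras sketch of this three-part architecture; the layer sizes, vocabulary size, and sequence length are placeholder assumptions, and the actual SketchCode model may differ in its details:

```python
from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D, Flatten,
                                     Dense, Embedding, GRU, RepeatVector,
                                     concatenate)
from tensorflow.keras.models import Model

VOCAB_SIZE, MAX_SEQ_LEN = 20, 48   # placeholder values

# 1) CNN vision model: encode the wireframe image into a feature vector.
image_in = Input(shape=(256, 256, 3))
x = Conv2D(32, (3, 3), activation="relu")(image_in)
x = MaxPooling2D()(x)
x = Conv2D(64, (3, 3), activation="relu")(x)
x = MaxPooling2D()(x)
x = Flatten()(x)
image_features = Dense(128, activation="relu")(x)
image_features = RepeatVector(MAX_SEQ_LEN)(image_features)

# 2) GRU language model: encode the partial token sequence seen so far.
tokens_in = Input(shape=(MAX_SEQ_LEN,))
e = Embedding(VOCAB_SIZE, 50)(tokens_in)
encoded_tokens = GRU(128, return_sequences=True)(e)

# 3) GRU decoder: combine both encodings and predict the next token.
decoder_in = concatenate([image_features, encoded_tokens])
d = GRU(256, return_sequences=False)(decoder_in)
next_token = Dense(VOCAB_SIZE, activation="softmax")(d)

model = Model(inputs=[image_in, tokens_in], outputs=next_token)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```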
Training the model with token sequences as input
To train the model, I split the source code into sequences of tokens. A single input to the model is one of these partial sequences together with its source image, and its label is the next token in the document. The model uses cross-entropy as its loss function, comparing its predicted next token with the actual next token.
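A rough sketch of how such (partial sequence, image) → next-token training pairs could be built; the helper function, padding choices, and token IDs are illustrative assumptions rather than the project's actual code:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def make_training_pairs(image, token_ids, vocab_size, max_seq_len):
    """For one example, emit (image, partial sequence) inputs labeled with
    the next token, one pair per position in the token sequence."""
    images, partials, next_tokens = [], [], []
    for i in range(1, len(token_ids)):
        images.append(image)
        partials.append(token_ids[:i])   # everything seen so far
        next_tokens.append(token_ids[i]) # the token to predict
    partials = pad_sequences(partials, maxlen=max_seq_len)
    next_tokens = to_categorical(next_tokens, num_classes=vocab_size)
    return np.array(images), partials, next_tokens
```

The resulting arrays could then be fed to a model like the one sketched above, e.g. model.fit([images, partials], next_tokens).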
Inference works slightly differently, since the model has to generate code from scratch. The image is still processed through the CNN, but the text input is seeded with only a start-of-sequence token. At each step, the token the model predicts next is appended to the current input sequence and fed back to the model as a new input; this repeats until the model predicts an end-of-sequence token or the process hits a predefined limit on the number of tokens per document.
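A minimal sketch of that greedy generation loop; the <START>/<END> token names, the token_to_id / id_to_token lookups, and the padding details are illustrative assumptions, not the project's exact interface:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_tokens(model, image, token_to_id, id_to_token, max_seq_len):
    """Greedy decoding: start from <START>, repeatedly predict the next token
    and append it, until <END> appears or the length limit is reached."""
    sequence = [token_to_id["<START>"]]
    for _ in range(max_seq_len):
        padded = pad_sequences([sequence], maxlen=max_seq_len)
        probs = model.predict([np.array([image]), padded], verbose=0)[0]
        next_id = int(np.argmax(probs))
        sequence.append(next_id)
        if id_to_token[next_id] == "<END>":
            break
    tokens = [id_to_token[i] for i in sequence[1:]]   # drop <START>
    return [t for t in tokens if t != "<END>"]        # drop <END> if present
```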
Once the model has generated a full set of predicted tokens, the compiler converts the DSL tokens into HTML code that can be rendered in any browser.
Evaluating the model with BLEU scores
I decided to use the BLEU score to evaluate the model. This metric, commonly used in machine translation tasks, measures how closely machine-generated text matches what a human would have produced given the same input.
Essentially, BLEU compares n-gram sequences of the generated and reference text to compute a modified form of precision. It suits this project well because it accounts for the actual elements in the generated HTML and how they are positioned relative to one another.
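As a small illustration of the metric (not the project's actual evaluation script), NLTK can compute a sentence-level BLEU score over two token sequences; the tokens below are made up:

```python
from nltk.translate.bleu_score import sentence_bleu

# Reference: tokens from the ground-truth DSL; candidate: the model's output.
reference = ["header", "btn-active", "row", "single", "text"]
candidate = ["header", "btn-active", "row", "text"]

# Use up-to-bigram BLEU here because the toy sequences are very short.
score = sentence_bleu([reference], candidate, weights=(0.5, 0.5))
print("BLEU:", round(score, 3))
```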
Best of all, I could also get a sense of what a given BLEU score means in practice by examining the generated sites.
△ Observe the BLEU score
A BLEU score of 1.0 indicates that, given the source image, the model placed the right elements in the right positions, while a lower score indicates that it predicted the wrong elements or placed them in poorly chosen positions. Our final model achieved a BLEU score of 0.76 on the evaluation set.
Bonus: custom website styles
Later, it occurred to me that since the model only generates the skeleton of the page (the document's tokens), I could add a custom CSS layer during compilation and instantly get generated sites in different styles.
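A minimal sketch of that idea, reusing the toy compile_dsl function sketched earlier; the stylesheet paths and helper function are hypothetical, not SketchCode's actual compiler interface:

```python
def compile_with_style(tokens, stylesheet_path):
    """Compile DSL tokens to HTML and link in a custom stylesheet,
    so the same page skeleton can be rendered in different styles."""
    html = compile_dsl(tokens)  # toy compiler sketched earlier
    link = '<link rel="stylesheet" href="{}">'.format(stylesheet_path)
    return html.replace("<html>", "<html><head>{}</head>".format(link), 1)

# One predicted token sequence, many looks: just swap the stylesheet.
page_clean = compile_with_style(["header", "btn-active", "text"], "themes/clean.css")
page_dark = compile_with_style(["header", "btn-active", "text"], "themes/dark.css")
```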
△ A single sketch generates web pages in multiple styles
Separating style customization from the model's generation step brings several benefits:
If you want to apply the SketchCode model to your own company's product, front-end engineers can use the model as-is, simply swapping in a CSS file that matches the company's web design style;
This built-in extensibility means that, from a single source image, the model can quickly compile the same page with many different predefined styles, letting users envision multiple possible looks for a site and browse the generated pages in a browser.
Summary and Outlook
Inspired by research on image captioning, the SketchCode model can convert hand-drawn website wireframes into usable HTML websites in seconds.
However, there are still some problems with this model, and this is where I might work next:
Since the model was trained with a vocabulary of only 16 elements, it cannot predict tokens beyond those it has seen. A next step might be to generate additional website samples with a more diverse set of elements, including images, drop-down menus, and forms; Bootstrap's components (https://getbootstrap.com/docs/4.0/components/buttons/) are a good place to look for ideas;
Real production websites vary far more than these samples. A good way to improve the generated output would be to build a training set that better reflects this variability, for example by collecting the HTML/CSS code and screenshots of real websites;
Hand drawings also vary in ways that the CSS modification technique cannot capture. A good way to address this would be to use generative adversarial networks (GANs) to create more realistic sketched website images.
Related links
Code: https://github.com/ashnkumar/sketch-code
Original: https://blog.insightdatascience.com/automated-front-end-development-using-deep-learning-3169dd086e82
End