Generating code from front-end page prototypes has long been a topic of interest. In this article, the author builds a front-end code generation model based on work such as pix2code and explains in detail how to use an LSTM and a CNN to turn design mockups into HTML and CSS websites.
Project link: github.com/emilwallner…
Deep learning will change front-end development in the next three years. It will speed up prototyping and lower the barriers to developing software.
Last year, Tony Beltramelli published the pix2code paper, Generating Code from a Graphical User Interface Screenshot, and Airbnb released Sketch2code (https://airbnb.design/sketching-interfaces/).
Currently, the biggest impediment to automating front-end development is computing power. But we can already use current deep learning algorithms, along with synthesized training data, to explore ways for AI to automatically build a front end. In this article, the author teaches a neural network to write an HTML and CSS website based on an image of a design mockup. Here is a brief overview of the process:
1) Feed a design image to the trained neural network
2) The neural network converts the image into HTML markup
3) Render the output
We’ll build three different models in three steps, from easy to hard. First, we’ll build the simplest version to get familiar with the moving parts. The second version, HTML, focuses on automating all the steps and briefly explains the neural network layers. In the final version, Bootstrap, we will create a model that can generalize and explore the LSTM layer.
Code Address:
- https://github.com/emilwallner/Screenshot-to-code-in-Keras
- https://www.floydhub.com/emilwallner/projects/picturetocode
All FloydHub notebooks are in the FloydHub directory, and the local notebook is in the local directory.
The models in this article are based on Beltramelli’s paper pix2code: Generating Code from a Graphical User Interface Screenshot and on Jason Brownlee’s image caption tutorials, and they are written in Python and Keras.
The core logic
Our goal is to build a neural network that generates the HTML/CSS markup corresponding to a screenshot.
To train the neural network, you provide it with several screenshots and the corresponding HTML code. The network learns by predicting the matching HTML markup tag by tag. When predicting the next tag, the network receives the screenshot and all previous correct tags.
Here is a simple example: training data docs.google.com/spreadsheet… .
Creating a word-by-word prediction model is now the most common approach, and it is the approach used in this tutorial.
Note: the neural network receives the same screenshot for every prediction. So if the network needs to predict 20 words, it receives the same design screenshot 20 times. For now, forget about how the neural network works and focus only on its inputs and outputs.
Let’s start with the markup. Suppose we train a neural network to predict the sentence “I can code”. When the network receives “I”, it predicts “can”. Next, it receives “I can” and predicts “code”. It receives all previous words but only predicts the next word.
A neural network builds features from the data to connect the input with the output. It must create representations that capture the content of each screenshot and the HTML syntax it needs to predict; all of this builds the knowledge used to predict the next tag. Applying the trained model to the real world is similar to training it.
Instead of being fed the correct HTML tags, the network receives the tags it has generated so far and predicts the next one. The prediction starts from the start tag and stops at the end tag, or when a maximum length is reached.
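To make this loop concrete, here is a minimal sketch of greedy decoding. The names (model, image_features, the token ids) are placeholders, and the real models later in this article additionally pad the sequence to a fixed length before each prediction:

import numpy as np

def greedy_decode(model, image_features, start_id, end_id, max_tokens=100):
    # Start with only the start tag and feed each prediction back in as input
    generated = [start_id]
    for _ in range(max_tokens):
        # The network always sees the same image plus everything generated so far
        probs = model.predict([image_features, np.array([generated])])
        next_id = int(np.argmax(probs))  # pick the most likely next tag
        generated.append(next_id)
        if next_id == end_id:            # stop at the end tag
            break
    return generated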
Hello World edition
Now let’s build the Hello World version. We will feed the neural network a screenshot of a page that displays “Hello World!” and train it to generate the corresponding markup.
First, the neural network maps the design mockup to a grid of pixel values. Each pixel has three RGB channels, and each channel’s value is between 0 and 255.
To represent the markup in a way the neural network understands, I use one-hot encoding. The sentence “I can code” can thus be mapped to the form below.
In the figure above, our encoding includes the start and end tags. These tags give the neural network cues for where to begin and where to stop its predictions. Below are the various combinations of these tags and their one-hot encodings.
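As a rough illustration (the token indexes here are assumed, not the exact ones in the figure), the same mapping can be produced with Keras’ to_categorical:

import numpy as np
from keras.utils import to_categorical

# Hypothetical five-token vocabulary: the start/end tags plus the three words
vocab = {'start': 0, 'I': 1, 'can': 2, 'code': 3, 'end': 4}
sentence = ['start', 'I', 'can', 'code', 'end']

# Each token becomes a vector with a single 1 at its vocabulary index
one_hot = to_categorical([vocab[w] for w in sentence], num_classes=len(vocab))
print(one_hot)
# [[1. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0.]
#  [0. 0. 1. 0. 0.]
#  [0. 0. 0. 1. 0.]
#  [0. 0. 0. 0. 1.]]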
With each round of training, each word shifts position, which lets the model learn the sequence instead of memorizing the position of each word. In the figure below there are four predictions, one per row. On the left are the RGB channels and the previous words; on the right are the predictions, with the closing tag in red.
#Length of longest sentence
max_caption_len = 3
#Size of vocabulary
vocab_size = 3
# Load one screenshot for each word and turn them into digits
images = []
for i in range(2):
    images.append(img_to_array(load_img('screenshot.jpg', target_size=(224, 224))))
images = np.array(images, dtype=float)
# Preprocess input for the VGG16 model
images = preprocess_input(images)
#Turn start tokens into one-hot encoding
html_input = np.array(
            [[[0., 0., 0.], #start
             [0., 0., 0.],
             [1., 0., 0.]],
             [[0., 0., 0.], #start <HTML>Hello World!</HTML>
             [1., 0., 0.],
             [0., 1., 0.]]])
#Turn next word into one-hot encoding
next_words = np.array(
            [[0., 1., 0.], # <HTML>Hello World!</HTML>
             [0., 0., 1.]]) # end
# Load the VGG16 model trained on imagenet and output the classification feature
VGG = VGG16(weights='imagenet', include_top=True)
# Extract the features from the image
features = VGG.predict(images)
#Load the feature to the network, apply a dense layer, and repeat the vector
vgg_feature = Input(shape=(1000,))
vgg_feature_dense = Dense(5)(vgg_feature)
vgg_feature_repeat = RepeatVector(max_caption_len)(vgg_feature_dense)
# Extract information from the input sequence
language_input = Input(shape=(vocab_size, vocab_size))
language_model = LSTM(5, return_sequences=True)(language_input)
# Concatenate the information from the image and the input
decoder = concatenate([vgg_feature_repeat, language_model])
# Extract information from the concatenated output
decoder = LSTM(5, return_sequences=False)(decoder)
# Predict which word comes next
decoder_output = Dense(vocab_size, activation='softmax')(decoder)
# Compile and run the neural network
model = Model(inputs=[vgg_feature, language_input], outputs=decoder_output)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
# Train the neural network
model.fit([features, html_input], next_words, batch_size=2, shuffle=False, epochs=1000)
In the Hello World version, we use three tokens: “start”, “<HTML><center><H1>Hello World!</H1></center></HTML>”, and “end”. A token can be a character, a word, or a sentence. Character-level models need a smaller vocabulary but constrain the neural network; word-level tokens tend to perform better here.
Here is the code that performs the prediction:
# Create an empty sentence and insert the start token
sentence = np.zeros((1, 3, 3)) # [[0,0,0], [0,0,0], [0,0,0]]
start_token = [1., 0., 0.] # start
sentence[0][2] = start_token # place start in empty sentence
# Making the first prediction with the start token
second_word = model.predict([np.array([features[1]]), sentence])
# Put the second word in the sentence and make the final prediction
sentence[0][1] = start_token
sentence[0][2] = np.round(second_word)
third_word = model.predict([np.array([features[1]]), sentence])
# Place the start token and our two predictions in the sentence
sentence[0][0] = start_token
sentence[0][1] = np.round(second_word)
sentence[0][2] = np.round(third_word)
# Transform our one-hot predictions into the final tokens
vocabulary = ["start", "<HTML><center><H1>Hello World!</H1></center></HTML>", "end"]
for i in sentence[0]:
    print(vocabulary[np.argmax(i)], end=' ')
The output
- 10 epochs: start start start
- 100 epochs: start <HTML><center><H1>Hello World!</H1></center></HTML> <HTML><center><H1>Hello World!</H1></center></HTML>
- 300 epochs: start <HTML><center><H1>Hello World!</H1></center></HTML> end
Pitfalls I ran into:
- Build the first version before collecting the data. Early in the project, I managed to get hold of an archive of an older version of the Geocities hosting site, with 38 million websites. But I underestimated the enormous amount of work needed to trim down its 100K-word vocabulary.
- Training on a terabyte of data requires good hardware or a lot of patience. After a few problems with my Mac, I ended up using a powerful remote server. Expect to rent 8 modern CPUs and a 1 Gbps internet link to get a decent workflow running.
- Until you understand the input and output data, everything else stays vague. The input X is a screenshot and the previously predicted markup tags; the output Y is the next markup tag. Once I understood that, everything else became easier to figure out, and it also became easier to experiment with different architectures.
- An image-to-code network is essentially a model that automatically describes an image. Even though I was aware of this, I skipped many image captioning papers because they didn’t seem cool enough. Once I realized the connection, my understanding of the problem space became much deeper.
Run the code on FloydHub
FloydHub is a deep learning training platform. I have used it since I started learning deep learning, and I often use it to train and manage my experiments. You can install it and run your first model within about 10 minutes, and it is the best option for training models on cloud GPUs. If you haven’t used FloydHub before, allow roughly 10 minutes to install it and learn the basics.
The FloydHub address is www.floydhub.com/
Clone the repo:
git clone https://github.com/emilwallner/Screenshot-to-code-in-Keras.git
Log in and initialize the FloydHub command-line tool:
cd Screenshot-to-code-in-Keras
floyd login
floyd init s2c
Running Jupyter Notebook on a FloydHub cloud GPU machine:
floyd run --gpu --env tensorflow-1.4 --data emilwallner/datasets/imagetocode/2:data --mode jupyter
All notebook files are in the floydhub directory. Once the job is running, the first notebook can be found at floydhub/Helloworld/helloworld.ipynb. See the earlier instructions for more details on the flags used in this project.
The HTML version
In this version, we focus on creating a scalable neural network model. It still cannot predict HTML from arbitrary web pages, but it is an indispensable step toward exploring the dynamics of the problem.
Overview
If we extend the previous architecture to the one shown below on the right, it can handle the identification and transformation process more efficiently.
The architecture consists of two parts: an encoder and a decoder. The encoder is where we create the image features and the features of the preceding markup. Features are the building blocks the network creates to connect design mockups with markup. At the end of the encoder, we attach the image features to each word of the preceding markup. The decoder then combines the mockup features and the markup features to create a feature for the next tag, and predicts the next tag through a fully connected layer.
Design prototype characteristics
Because we need to insert a screenshot for every word, this becomes a bottleneck when training the neural network. So instead of using the images directly, we extract only the information needed to generate the markup. This information is encoded into image features by a pre-trained CNN, and we take the output of the layer just before the classification layer.
We end up with 1536 feature maps of size 8×8. Although they are hard to interpret intuitively, the neural network can extract the objects and positions of elements from these features.
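As a quick sanity check (a sketch assuming the same InceptionResNetV2 setup as the code later in this section), you can confirm the 8×8×1536 output shape like this:

import numpy as np
from keras.applications.inception_resnet_v2 import InceptionResNetV2, preprocess_input

# Without the classification layers, the network outputs spatial feature maps
IR2 = InceptionResNetV2(weights='imagenet', include_top=False)
dummy_image = preprocess_input(np.random.rand(1, 299, 299, 3) * 255)
print(IR2.predict(dummy_image).shape)  # (1, 8, 8, 1536)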
Tag feature
In the Hello World version, we used one-hot encoding to represent the tokens. In this version, we use a word embedding for the input and keep one-hot encoding for the output. The way we construct each sentence stays the same, but the way we map each token changes. One-hot encoding treats every word as an isolated unit, while a word embedding represents the input as lists of real numbers that capture the relationships between tokens.
The embedding dimension above is 8, but embedding dimensions typically range between 50 and 500 depending on the size of the vocabulary. The eight values for each word are similar to weights in a neural network; they tend to capture the relationships between words (Mikolov et al., 2013). This is how we start building markup features, the features the network trains to link the input data to the output data.
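A minimal sketch of the difference, with an assumed toy vocabulary: one-hot encoding gives sparse, independent vectors, while an Embedding layer maps each token index to a dense, trainable vector (8-dimensional here, to match the figure):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

vocab_size = 10    # assumed toy vocabulary size
embedding_dim = 8  # the embedding dimension used in the figure above

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=3))

# Three token indexes in, three dense 8-dimensional vectors out
tokens = np.array([[2, 5, 7]])
print(model.predict(tokens).shape)  # (1, 3, 8)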
The encoder
We now feed the word embeddings into the LSTM, which returns a sequence of markup features. These markup features are then fed into a TimeDistributed dense layer, which can be viewed as a fully connected layer with multiple inputs and outputs.
In parallel with the embedding and LSTM layers, there is another path in which the image features are first flattened into a vector and then fed to a fully connected layer to extract high-level features. These image features are then concatenated with the markup features as the output of the encoder.
Tag feature
As shown in the figure below, we now feed the words into the LSTM layer; all sequences are padded with zeros to the same length.
To mix the signals and find higher-level patterns, we apply a TimeDistributed dense layer to extract the markup features. A TimeDistributed dense layer is very similar to an ordinary fully connected layer, except that it has multiple inputs and outputs, one per time step.
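Here is a small sketch of the shapes involved, using the dimensions from the HTML-version code later in this section: the same dense weights are applied at every time step, so the sequence length is preserved.

from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense

model = Sequential()
# 100 time steps of 200-dimensional word embeddings, as in the code below
model.add(LSTM(256, return_sequences=True, input_shape=(100, 200)))
# The same 256 -> 128 dense layer is applied independently at every time step
model.add(TimeDistributed(Dense(128, activation='relu')))
model.summary()  # output shape: (None, 100, 128)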
Image characteristics
In the parallel process, we flatten all the values of the image features into one long vector. The information is not changed, only reorganized.
As above, we mix the signals through a fully connected layer and extract higher-level concepts. Since we are not dealing with a sequence of inputs here, an ordinary fully connected layer is enough.
Cascading image features and tag features
After padding, every sequence yields three tag features. Since we have already preprocessed the image features, we can add one copy of the image features to each tag feature.
As above, after copying the image features onto each corresponding tag feature, we get new image-tag features. These are the input values we feed to the decoder.
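A minimal sketch of the cascade, using the dimensions from the HTML-version code further below: RepeatVector copies the single image feature once per time step so it can be concatenated with each markup feature.

from keras.layers import Input, Flatten, Dense, RepeatVector, concatenate
from keras.models import Model

image_features = Input(shape=(8, 8, 1536,))
image_flat = Dense(128, activation='relu')(Flatten()(image_features))  # one 128-dim image feature
image_seq = RepeatVector(100)(image_flat)             # copied for each of the 100 time steps

markup_features = Input(shape=(100, 128))             # one 128-dim markup feature per time step
combined = concatenate([image_seq, markup_features])  # image-tag features: (100, 256)

Model(inputs=[image_features, markup_features], outputs=combined).summary()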
The decoder
Now, we use image-tag features to predict the next tag.
In the following example, we use three image-tag feature pairs to output the next tag feature. Note that the LSTM layer here does not return a sequence as long as the input; it only predicts one feature. In our case, this is the feature for the next tag, which contains the information for the final prediction.
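To illustrate the shape (a sketch using the decoder dimensions from the code below), setting return_sequences=False makes the LSTM output a single feature vector for the whole sequence instead of one vector per time step:

from keras.models import Sequential
from keras.layers import LSTM

model = Sequential()
# 100 time steps of 256-dim image-tag features (128 image + 128 markup) go in
model.add(LSTM(512, return_sequences=False, input_shape=(100, 256)))
model.summary()  # output shape: (None, 512) - one feature, not a sequence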
Final prediction
The dense layer works like a traditional feedforward network, connecting the 512 values of the next-tag feature to the final four predictions, i.e. the four words in our vocabulary: start, hello, world, and end. The softmax activation at the end produces a probability distribution over the four categories; for example, [0.1, 0.1, 0.1, 0.7] predicts the fourth word as the next tag.
# Load the images and preprocess them for inception-resnet
images = []
all_filenames = listdir('images/')
all_filenames.sort()
for filename in all_filenames:
    images.append(img_to_array(load_img('images/'+filename, target_size=(299, 299))))
images = np.array(images, dtype=float)
images = preprocess_input(images)
# Run the images through inception-resnet and extract the features without the classification layer
IR2 = InceptionResNetV2(weights='imagenet', include_top=False)
features = IR2.predict(images)
# We will cap each input sequence to 100 tokens
max_caption_len = 100
# Initialize the function that will create our vocabulary
tokenizer = Tokenizer(filters='', split=" ", lower=False)
# Read a document and return a string
def load_doc(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text
# Load all the HTML files
X = []
all_filenames = listdir('html/')
all_filenames.sort()
for filename in all_filenames:
    X.append(load_doc('html/'+filename))
# Create the vocabulary from the html files
tokenizer.fit_on_texts(X)
# Add +1 to leave space for empty words
vocab_size = len(tokenizer.word_index) + 1
# Translate each word in text file to the matching vocabulary index
sequences = tokenizer.texts_to_sequences(X)
# The longest HTML file
max_length = max(len(s) for s in sequences)
# Initialize our final input to the model
X, y, image_data = list(), list(), list()
for img_no, seq in enumerate(sequences):
    for i in range(1, len(seq)):
        # Add the entire sequence to the input and only keep the next word for the output
        in_seq, out_seq = seq[:i], seq[i]
        # If the sentence is shorter than max_length, fill it up with empty words
        in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
        # Map the output to one-hot encoding
        out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
        # Add the image corresponding to the HTML file
        image_data.append(features[img_no])
        # Cut the input sentence to 100 tokens, and add it to the input data
        X.append(in_seq[-100:])
        y.append(out_seq)
X, y, image_data = np.array(X), np.array(y), np.array(image_data)
# Create the encoder
image_features = Input(shape=(8, 8, 1536,))
image_flat = Flatten()(image_features)
image_flat = Dense(128, activation='relu')(image_flat)
ir2_out = RepeatVector(max_caption_len)(image_flat)
language_input = Input(shape=(max_caption_len,))
language_model = Embedding(vocab_size, 200, input_length=max_caption_len)(language_input)
language_model = LSTM(256, return_sequences=True)(language_model)
language_model = LSTM(256, return_sequences=True)(language_model)
language_model = TimeDistributed(Dense(128, activation='relu'))(language_model)
# Create the decoder
decoder = concatenate([ir2_out, language_model])
decoder = LSTM(512, return_sequences=False)(decoder)
decoder_output = Dense(vocab_size, activation='softmax')(decoder)
# Compile the model
model = Model(inputs=[image_features, language_input], outputs=decoder_output)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
# Train the neural network
model.fit([image_data, X], y, batch_size=64, shuffle=False, epochs=2)
# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None
# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    # seed the generation process
    in_text = 'START'
    # iterate over the whole length of the sequence
    for i in range(900):
        # integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0][-100:]
        # pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # predict next word
        yhat = model.predict([photo, sequence], verbose=0)
        # convert probability to integer
        yhat = np.argmax(yhat)
        # map integer to word
        word = word_for_id(yhat, tokenizer)
        # stop if we cannot map the word
        if word is None:
            break
        # append as input for generating the next word
        in_text += ' ' + word
        # Print the prediction
        print(' ' + word, end='')
        # stop if we predict the end of the sequence
        if word == 'END':
            break
    return
# Load an image, preprocess it for IR2, extract the features, and generate the HTML
test_image = img_to_array(load_img('images/87.jpg', target_size=(299, 299)))
test_image = np.array(test_image, dtype=float)
test_image = preprocess_input(test_image)
test_features = IR2.predict(np.array([test_image]))
generate_desc(model, tokenizer, np.array(test_features), 100)
The output
Addresses of websites generated by different rounds of training:
- 250 epochs: emilwallner.github.io/html/250_ep…
- 350 epochs: emilwallner.github.io/html/350_ep…
- 450 epochs: emilwallner.github.io/html/450_ep…
- 550 epochs: emilwallner.github.io/html/550_ep…
Pitfalls I ran into:
- I found LSTMs a bit harder to grasp than CNNs. Once I unrolled the LSTMs, they became easier to understand. It also helps to focus on the input and output features before trying to understand how the LSTM works internally.
- It is much easier to build a vocabulary from scratch than to shrink a huge one. The vocabulary covers everything from fonts, div sizes, and hex colors to variable names and common words.
- Most libraries are built to parse text documents rather than code. Their documentation explains how to split on whitespace, not on code syntax, so we need to customize the parsing.
- We can extract features from models pre-trained on ImageNet. However, the loss was about 30% higher compared with the pix2code model trained end to end. I would also be interested in pre-training an Inception-ResNet-type network on web screenshots.
The Bootstrap version
In the final version, we use the dataset of generated Bootstrap websites from the pix2code paper. By using Twitter’s Bootstrap library (https://getbootstrap.com/), we can combine HTML and CSS and reduce the size of the vocabulary.
We’ll use this version to generate markup for screenshots it has not seen before. We’ll also dig into how it builds knowledge about the screenshots and the markup.
Instead of training on the Bootstrap markup directly, we use 17 simplified tokens that we then compile into HTML and CSS. The dataset (https://github.com/tonybeltramelli/pix2code/tree/master/datasets) includes 1,500 training screenshots and 250 test screenshots. Each screenshot has on average 65 tokens, which yields 96,925 training examples.
We tweaked the model from the pix2code paper so that it predicts the web components with 97% accuracy.
End-to-end approach
Extracting features from a pre-trained model works well for image captioning models. But after a few experiments, I found that pix2code’s end-to-end approach works better for this problem. In our model, we replace the pre-trained image features with a lightweight convolutional neural network. Instead of using max pooling to increase information density, we increase the strides. This preserves the position and color of the front-end elements.
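A minimal comparison with assumed layer sizes: both halve the spatial resolution, but the strided convolution learns how to downsample instead of simply keeping the maximum value, which helps preserve position and color information. The Bootstrap model below uses the strided variant.

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D

# Downsampling with max pooling: keeps only the strongest activation in each 2x2 window
pooled = Sequential()
pooled.add(Conv2D(16, (3, 3), activation='relu', padding='same', input_shape=(256, 256, 3)))
pooled.add(MaxPooling2D((2, 2)))  # 256x256 -> 128x128

# Downsampling with a strided convolution: the downsampling itself is learned
strided = Sequential()
strided.add(Conv2D(16, (3, 3), activation='relu', padding='same', input_shape=(256, 256, 3)))
strided.add(Conv2D(16, (3, 3), activation='relu', padding='same', strides=2))  # 256x256 -> 128x128

pooled.summary()
strided.summary()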
There are two core models: the convolutional neural network (CNN) and the recurrent neural network (RNN). The most commonly used recurrent neural network is the long short-term memory (LSTM) network. I covered CNNs in a previous article; this one focuses on LSTMs.
Understand the time step in LSTM
One of the harder things to grasp about LSTMs is the notion of time steps. Our first neural network had two time steps: if you give it “Hello”, it predicts “World”. But with only two time steps, it struggles to predict longer sequences. In the example below, the input has four time steps, one for each word.
An LSTM is a neural network suited to sequential input. As the unrolled diagram below shows, the same weights are kept for every time step.
The weighted input and output features are concatenated and fed into the activation function, producing the output for the current time step. Because we reuse the same weights, they draw information from several inputs and build up knowledge of the sequence. Below is a simplified version of the LSTM process for each time step:
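A small sketch of the weight sharing, using the embedding and unit sizes from the Bootstrap model below: whether the input has two or four time steps, the LSTM layer has exactly the same number of parameters because the same weights are reused at every step.

from keras.models import Sequential
from keras.layers import LSTM

def lstm_param_count(time_steps):
    model = Sequential()
    model.add(LSTM(128, return_sequences=True, input_shape=(time_steps, 50)))
    return model.count_params()

# The same weights are applied at every time step, so the count does not change
print(lstm_param_count(2))  # 91648
print(lstm_param_count(4))  # 91648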
Understand the units in the LSTM hierarchy
The number of units in each LSTM layer determines its memory capacity and also the dimension of each output feature. Each unit in an LSTM layer learns to track a different aspect of the syntax. Below is a visualization of a unit tracking the information for each row of markup; this is the simple markup language we use to train the Bootstrap model.
Each LSTM unit maintains a cell state, which we can think of as its memory. The weights and activations modify the state in different ways, which lets the LSTM layer fine-tune what information to keep and what to forget for each input. In addition to producing an output for the current input, the LSTM unit updates its cell state and passes it on to the next time step.
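As a quick illustration (a sketch using the sizes from the Bootstrap model), the number of units determines the dimension of the output feature, the hidden state, and the cell state alike:

from keras.models import Model
from keras.layers import Input, LSTM

inputs = Input(shape=(48, 50))  # 48 time steps of 50-dimensional embeddings
output, state_h, state_c = LSTM(128, return_state=True)(inputs)

model = Model(inputs, [output, state_h, state_c])
model.summary()
# The output, hidden state, and cell state each have 128 values - one per unit

The full training code for the Bootstrap model follows.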
dir_name = 'resources/eval_light/'

# Read a file and return a string
def load_doc(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text

def load_data(data_dir):
    text = []
    images = []
    # Load all the files and order them
    all_filenames = listdir(data_dir)
    all_filenames.sort()
    for filename in (all_filenames):
        if filename[-3:] == "npz":
            # Load the images already prepared in arrays
            image = np.load(data_dir+filename)
            images.append(image['features'])
        else:
            # Load the bootstrap tokens and wrap them in a start and end tag
            syntax = '<START> ' + load_doc(data_dir+filename) + ' <END>'
            # Separate all the words with a single space
            syntax = ' '.join(syntax.split())
            # Add a space after each comma
            syntax = syntax.replace(',', ' ,')
            text.append(syntax)
    images = np.array(images, dtype=float)
    return images, text

train_features, texts = load_data(dir_name)

# Initialize the function to create the vocabulary
tokenizer = Tokenizer(filters='', split=" ", lower=False)
# Create the vocabulary
tokenizer.fit_on_texts([load_doc('bootstrap.vocab')])

# Add one spot for the empty word in the vocabulary
vocab_size = len(tokenizer.word_index) + 1
# Map the input sentences into the vocabulary indexes
train_sequences = tokenizer.texts_to_sequences(texts)
# The longest set of bootstrap tokens
max_sequence = max(len(s) for s in train_sequences)
# Specify how many tokens to have in each input sentence
max_length = 48

def preprocess_data(sequences, features):
    X, y, image_data = list(), list(), list()
    for img_no, seq in enumerate(sequences):
        for i in range(1, len(seq)):
            # Add the sentence until the current count(i) and add the current count to the output
            in_seq, out_seq = seq[:i], seq[i]
            # Pad all the input token sentences to max_sequence
            in_seq = pad_sequences([in_seq], maxlen=max_sequence)[0]
            # Turn the output into one-hot encoding
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # Add the corresponding image to the bootstrap token file
            image_data.append(features[img_no])
            # Cap the input sentence to 48 tokens and add it
            X.append(in_seq[-48:])
            y.append(out_seq)
    return np.array(X), np.array(y), np.array(image_data)

X, y, image_data = preprocess_data(train_sequences, train_features)

#Create the encoder
image_model = Sequential()
image_model.add(Conv2D(16, (3, 3), padding='valid', activation='relu', input_shape=(256, 256, 3,)))
image_model.add(Conv2D(16, (3,3), activation='relu', padding='same', strides=2))
image_model.add(Conv2D(32, (3,3), activation='relu', padding='same'))
image_model.add(Conv2D(32, (3,3), activation='relu', padding='same', strides=2))
image_model.add(Conv2D(64, (3,3), activation='relu', padding='same'))
image_model.add(Conv2D(64, (3,3), activation='relu', padding='same', strides=2))
image_model.add(Conv2D(128, (3,3), activation='relu', padding='same'))
image_model.add(Flatten())
image_model.add(Dense(1024, activation='relu'))
image_model.add(Dropout(0.3))
image_model.add(Dense(1024, activation='relu'))
image_model.add(Dropout(0.3))
image_model.add(RepeatVector(max_length))

visual_input = Input(shape=(256, 256, 3,))
encoded_image = image_model(visual_input)

language_input = Input(shape=(max_length,))
language_model = Embedding(vocab_size, 50, input_length=max_length, mask_zero=True)(language_input)
language_model = LSTM(128, return_sequences=True)(language_model)
language_model = LSTM(128, return_sequences=True)(language_model)

#Create the decoder
decoder = concatenate([encoded_image, language_model])
decoder = LSTM(512, return_sequences=True)(decoder)
decoder = LSTM(512, return_sequences=False)(decoder)
decoder = Dense(vocab_size, activation='softmax')(decoder)

# Compile the model
model = Model(inputs=[visual_input, language_input], outputs=decoder)
optimizer = RMSprop(lr=0.0001, clipvalue=1.0)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

#Save the model for every 2nd epoch
filepath = "org-weights-epoch-{epoch:04d}--val_loss-{val_loss:.4f}--loss-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_weights_only=True, period=2)
callbacks_list = [checkpoint]

# Train the model
model.fit([image_data, X], y, batch_size=64, shuffle=False, validation_split=0.1, callbacks=callbacks_list, verbose=1, epochs=50)
Test accuracy
Finding a fair way to measure accuracy is tricky. Say you compare word by word: if one word is out of sync, your accuracy might be 0%. If you remove a word so that an otherwise perfect prediction lines up again, the accuracy might end up as 99/100.
I use the BLEU score, the best practice in both machine translation and image captioning models. It breaks the sentence into four n-grams, sequences of 1 to 4 words. In the prediction below, “cat” is supposed to be “code”.
To get the final score, each n-gram precision is weighted by 25%: (4/5) × 0.25 + (2/4) × 0.25 + (1/3) × 0.25 + (0/2) × 0.25 = 0.2 + 0.125 + 0.083 + 0 = 0.408. The sum is then multiplied by a sentence-length penalty. Since the length is correct in our example, the sum is already our final score.
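Here is a small sketch of the same arithmetic in code. The two sentences are hypothetical, chosen only so the n-gram counts match the fractions above, and the calculation follows the simplified 25% weighting rather than the geometric mean used by standard BLEU implementations:

def ngram_precision(reference, candidate, n):
    # Fraction of the candidate's n-grams that also appear in the reference
    ref = [tuple(reference[i:i + n]) for i in range(len(reference) - n + 1)]
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    return sum(1 for g in cand if g in ref) / len(cand)

reference = "a code sat on mat".split()  # hypothetical reference
candidate = "a cat sat on mat".split()   # hypothetical prediction ("cat" should be "code")

# Weight each n-gram precision by 25% and sum, as in the calculation above
score = sum(0.25 * ngram_precision(reference, candidate, n) for n in range(1, 5))
print(round(score, 3))  # 0.408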
You can increase the number of n-grams; a four n-gram model corresponds best to human translations. I suggest reading the code below:
#Create a function to read a file and return its content
def load_doc(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text

def load_data(data_dir):
    text = []
    images = []
    files_in_folder = os.listdir(data_dir)
    files_in_folder.sort()
    for filename in tqdm(files_in_folder):
        #Add an image
        if filename[-3:] == "npz":
            image = np.load(data_dir+filename)
            images.append(image['features'])
        else:
            # Add text and wrap it in a start and end tag
            syntax = '<START> ' + load_doc(data_dir+filename) + ' <END>'
            #Separate each word with a space
            syntax = ' '.join(syntax.split())
            #Add a space between each comma
            syntax = syntax.replace(',', ' ,')
            text.append(syntax)
    images = np.array(images, dtype=float)
    return images, text

#Initialize the function to create the vocabulary
tokenizer = Tokenizer(filters='', split=" ", lower=False)
#Create the vocabulary in a specific order
tokenizer.fit_on_texts([load_doc('bootstrap.vocab')])

dir_name = '../../../../eval/'
train_features, texts = load_data(dir_name)

#Load model and weights
json_file = open('../../../../model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("../../../../weights.hdf5")
print("Loaded model from disk")

# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

print(word_for_id(17, tokenizer))

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    photo = np.array([photo])
    # seed the generation process
    in_text = '<START> '
    # iterate over the whole length of the sequence
    print('\nPrediction---->\n\n<START> ', end='')
    for i in range(150):
        # integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # predict next word
        yhat = loaded_model.predict([photo, sequence], verbose=0)
        # convert probability to integer
        yhat = argmax(yhat)
        # map integer to word
        word = word_for_id(yhat, tokenizer)
        # stop if we cannot map the word
        if word is None:
            break
        # append as input for generating the next word
        in_text += word + ' '
        # stop if we predict the end of the sequence
        print(word + ' ', end='')
        if word == '<END>':
            break
    return in_text

max_length = 48

# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
    actual, predicted = list(), list()
    # step over the whole set
    for i in range(len(texts)):
        yhat = generate_desc(model, tokenizer, photos[i], max_length)
        # store actual and predicted
        print('\n\nReal---->\n\n' + texts[i])
        actual.append([texts[i].split()])
        predicted.append(yhat.split())
    # calculate BLEU score
    bleu = corpus_bleu(actual, predicted)
    return bleu, actual, predicted

bleu, actual, predicted = evaluate_model(loaded_model, texts, train_features, tokenizer, max_length)

#Compile the tokens into HTML and css
dsl_path = "compiler/assets/web-dsl-mapping.json"
compiler = Compiler(dsl_path)
compiled_website = compiler.compile(predicted[0], 'index.html')

print(compiled_website)
print(bleu)
The output
Links to sample output:
- Generated website 1 – Original 1 (emilwallner.github.io/bootstrap/r…)
- Generated website 2 – Original 2 (emilwallner.github.io/bootstrap/r…)
- Generated website 3 – Original 3 (emilwallner.github.io/bootstrap/r…)
- Generated website 4 – Original 4 (emilwallner.github.io/bootstrap/r…)
- Generated website 5 – Original 5 (emilwallner.github.io/bootstrap/r…)
Pitfalls I ran into:
- Understand the model's weaknesses instead of testing random changes. At first, I tried random things such as batch normalization and bidirectional networks, and attempted to add attention. After looking at the test data and seeing that the model could not predict color and position with high accuracy, I realized the CNN had a weakness. This led me to replace max pooling with larger strides. Validation loss dropped from 0.12 to 0.02 and the BLEU score increased from 85% to 97%.
- Only use pre-trained models if they are relevant. Given the small dataset, I thought a pre-trained image model would improve performance. In my experiments, the end-to-end model was slower to train and needed more memory, but was about 30% more accurate.
- Be prepared for small differences when running the model on a remote server. On my Mac, files are read in alphabetical order; on the server, they were read in random order. This created a mismatch between the screenshots and the code.
The next step
Front-end development is an ideal space for applying deep learning. Data is easy to generate, and current deep learning algorithms can map most of the logic. One of the most exciting areas is applying attention mechanisms to LSTMs. This would not only improve accuracy, but also let us visualize where the CNN is focusing when it generates markup. Attention is also key for communication between markup, stylesheets, scripts, and eventually the back end. Attention layers can keep track of variables, enabling the network to communicate between programming languages.
But in the near future, the biggest impact will come from building a scalable way to synthesize data. Then you can add fonts, colors, and animations step by step. Most of the progress so far has been in taking sketches and turning them into template apps. In less than two years, we’ll be able to draw a sketch and get the corresponding front end in under a second. Airbnb’s design team and Uizard have already built two working prototypes. Here are some possible experiments:
Experiments
Getting started
- Run all models
- Try different hyperparameters
- Test a different CNN architecture
- Add a bidirectional LSTM model
- Implement the model with different data sets
Further experiments
- Create a stable random application/web generator using the appropriate syntax
- Build sketch-to-app-model data. Automatically convert app/web screenshots into sketches and use a GAN to create variety.
- Visualize each predicted image focus using the attention layer, similar to this model
- Create a framework for a modular approach. For example, have one encoder model for fonts, one for colors, another for layout, and combine them with a single decoder. Stable image features would be a good start.
- Feed simple HTML components to the neural network and teach it to generate animations using CSS. It would be fascinating to use an attention approach and visualize the focus on the two input sources.
The original link: blog.floydhub.com/turning-des…