Summer has turned to autumn, and it's once again the season of having nothing to wear. Fancy a jacket, but not sure whether it goes with your plaid shirt? No problem: researchers from Zalando Research in Germany have tweaked Style GAN to render a high-definition look of a specific mix of items worn on a model, in whatever pose you specify. Develop a virtual dress-up app with this technology, and why worry about finding a girlfriend?
Excerpted from arXiv, by Gokhan Yildirim et al. Compiled by Heart of Machine, with contributions from Han Fang and Zhang Qian.
Link: arxiv.org/pdf/1908.08…
Fashion e-commerce platforms simplify clothing shopping through search and personalization. Visual try-on can further enhance the user experience. Previous studies focused mainly on dressing fashion models in existing images [5, 2] or generating low-resolution images from scratch given a pose and clothing color [8]. This article focuses on generating high-resolution images of fashion models wearing desired outfits in specific poses.
In recent years, advances in generative adversarial networks (GANs) [1] have enabled researchers to sample realistic images through implicit generative modeling. One such improvement is Style GAN [7], which builds on Progressive GAN [6] to generate high-resolution images and modulates them through adaptive instance normalization (AdaIN) [4]. In this paper, the authors apply and modify Style GAN on a dataset of fashion models, outfits, and poses. First, they train the original Style GAN on a set of fashion model images; the results show that the clothing color and body pose of one generated fashion model can be transferred to another. Second, they modify Style GAN to condition the generation process on an outfit and a body pose. This makes it possible to quickly visualize custom clothes across different body poses and shapes.
Clothing dataset
The authors used a proprietary image dataset with approximately 380K entries. Each entry contains a fashion model in a particular pose wearing a particular outfit. An outfit can contain up to six items. To obtain body pose, the authors extracted 16 key points using a deep pose estimator [10]. Figure 1 visualizes a sample of the dataset; the red marks on the fashion models are the extracted key points. Both the model images and the individual item images have a resolution of 1024×768.
The experiment
The pipeline for the unconditional Style GAN is shown in Figure 2 (a). The model has 18 generator layers, each of which receives an affine-transformed copy of the style vector for adaptive instance normalization. The discriminator is the same as in the original Style GAN. The authors trained the network on four NVIDIA V100 GPUs for 160 epochs, which took about four weeks.
Figure 2 (a): Flow chart for the unconditional Style GAN.
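To make the AdaIN step concrete, here is a minimal sketch in PyTorch of how a style vector can modulate a generator layer through a learned affine transform. The class and shapes are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Modulates normalized feature maps with a per-layer style (sketch)."""
    def __init__(self, style_dim: int, num_channels: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels)
        # Learned affine transform mapping the style vector w to
        # per-channel scale and bias (one such transform per layer).
        self.affine = nn.Linear(style_dim, 2 * num_channels)

    def forward(self, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        scale, bias = self.affine(w).chunk(2, dim=1)  # (B, C) each
        scale = scale[:, :, None, None]               # broadcast to (B, C, 1, 1)
        bias = bias[:, :, None, None]
        return self.norm(x) * (1 + scale) + bias
```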
In the conditional version, the authors modify Style GAN with an embedding network, as shown in Figure 2 (b). The input to this network is six item images (18 channels in total) and a 16-channel heat map computed from the 16 key points. The item images are concatenated in a fixed ordering to obtain semantic consistency across outfits; the ordering is shown in Figure 1.
Figure 2 (b): Flow chart for the conditional Style GAN.
If an outfit has no item of a particular semantic category, the corresponding slot is filled with an empty gray image. The embedding network produces a 512-dimensional vector that is concatenated with the latent vector to produce the style vector. This model was also trained for four weeks (115 epochs). The discriminator in the conditional model uses a separate network to compute embedding vectors for the input items and heat map, and then uses the method in [9] to compute the final score.
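As a rough illustration of how this conditioning input could be assembled, the sketch below stacks six RGB item images into 18 channels and appends the 16-channel key-point heat map. The tensor resolution and the gray fill value are assumptions, not details from the paper.

```python
import torch

NUM_ITEMS, NUM_KEYPOINTS = 6, 16
H, W = 1024, 768  # assumed resolution; the paper may use a downscaled copy

def build_condition(items: list, heatmap: torch.Tensor) -> torch.Tensor:
    """items: up to NUM_ITEMS (3, H, W) tensors in the fixed semantic order,
    with None for missing categories; heatmap: a (16, H, W) tensor."""
    gray = torch.full((3, H, W), 0.5)  # empty gray image for missing categories
    padded = (items + [None] * NUM_ITEMS)[:NUM_ITEMS]
    slots = [img if img is not None else gray for img in padded]
    item_stack = torch.cat(slots, dim=0)            # (18, H, W)
    return torch.cat([item_stack, heatmap], dim=0)  # (34, H, W)

# The embedding network then maps this input to a 512-d vector, which is
# concatenated with the latent code to form the style vector.
```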
Unconditional
In Figure 3 below, the authors show images generated by the unconditional model. As you can see, both the individual garments and the body parts are generated realistically at the full resolution of 1024×768 pixels. During training, the synthesis network can be regularized by swapping the style vectors fed to certain layers between images; the same operation enables information transfer between generated images.
Figure 3: Model images generated by the unconditional Style GAN.
In Figure 4 below, the authors show two examples of such transfer. First, the source style vector is copied into layers 13 through 18 of the generator (before the affine transformation in Figure 2), which transfers the color of the source clothing onto the target image, as shown in Figure 4. Pose transfer can be achieved by copying the source style vector into lower layers instead.
Figure 4: Transferring clothing color or body pose from one generated fashion model to another.
Table 1 shows the layers to which the source and target style vectors are propagated to achieve each desired transfer effect.
Table 1: Layers for propagating style vectors.
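The mechanics of this kind of transfer can be sketched as follows, assuming a generator that accepts one style vector per layer. The exact layer ranges (especially for pose) are placeholders for the values in Table 1, and `G.synthesize` is a hypothetical API, not the paper's code.

```python
import torch

NUM_LAYERS = 18
COLOR_LAYERS = set(range(12, 18))  # layers 13-18 (1-indexed): clothing color
POSE_LAYERS = set(range(0, 4))     # assumed low layers carrying coarse pose

def mix_styles(w_source: torch.Tensor, w_target: torch.Tensor, layers: set) -> torch.Tensor:
    """Build one style vector per generator layer, copying the source style
    into the chosen layers and keeping the target style everywhere else."""
    per_layer = [w_source if i in layers else w_target for i in range(NUM_LAYERS)]
    return torch.stack(per_layer, dim=1)  # (batch, 18, style_dim)

# color_mix = G.synthesize(mix_styles(w_src, w_tgt, COLOR_LAYERS))  # hypothetical API
# pose_mix  = G.synthesize(mix_styles(w_src, w_tgt, POSE_LAYERS))
```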
Conditional
Once the conditional model is trained, a set of desired items and a specific pose can be fed in to visualize an outfit, as shown in Figure 5. Figures 5 (a) and (b) show the two outfits used to generate the images, and Figures 5 (c) and (d) show models generated with four randomly selected poses. You can observe that the items are rendered correctly on the generated bodies, and the pose remains consistent across different outfits. Figure 5 (e) shows the visualization produced by adding the jacket from the first outfit to the second: the texture and size of the denim jacket are represented correctly on the fashion model. Note that, due to spurious correlations in the dataset, the generated model's face may vary with clothing and pose.
Figure 5: Two different outfits (a) and (b) were used to generate the model images in (c) and (d). In (e), the jacket from outfit #1 is added to outfit #2 to customize the visualization.
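In terms of the conditioning sketch above, this jacket swap amounts to replacing one slot before rebuilding the input tensor. The slot index here is hypothetical; the real position follows the fixed ordering shown in Figure 1.

```python
# Hypothetical slot index for jackets in the fixed semantic ordering.
JACKET_SLOT = 0
outfit2_items[JACKET_SLOT] = outfit1_items[JACKET_SLOT]
cond = build_condition(outfit2_items, heatmap)  # feed to the conditional generator
```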
The dataset includes fashion models of various body types, reflecting differences in gender, shape, and weight. These differences are implicitly encoded in the relative distances between the extracted key points. The conditional model can capture and reproduce fashion models of different body types, as shown in the fourth generated image in Figure 5. The results are promising, and the method could be extended to end users through virtual try-on in the future.
Quantitative results
The authors evaluate the quality of the generated images by computing the Fréchet Inception Distance (FID) [3] for the unconditional and conditional GANs. As Table 2 shows, the unconditional GAN produces higher-quality images, a conclusion also supported by comparing Figures 3 and 5. The conditional discriminator has the additional task of checking whether the input clothing and pose are rendered correctly, which can lead to a trade-off between image quality (or "realism") and direct control over the generated clothing and pose.
Table 2: FID scores for the unconditional and conditional models.
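For reference, FID compares the mean and covariance of Inception features computed over real and generated images. A standard implementation of the distance itself looks like the following; this mirrors the common definition, not necessarily the authors' evaluation code.

```python
import numpy as np
from scipy import linalg

def fid(mu1, sigma1, mu2, sigma2):
    """FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * (S1 S2)^{1/2}),
    where mu/sigma are the mean and covariance of Inception-v3
    features over the real and generated image sets."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```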