CANVASEMB: LEARNING LAYOUT REPRESENTATION WITH LARGE-SCALE PRE-TRAINING FOR GRAPHIC DESIGN

Abstract

Layout representation, which models visual elements in a canvas and their inter-relations, plays a crucial role in graphic design intelligence. Given the large variety of layout designs and the unique characteristic of layouts that visual elements are defined by a list of categorical (e.g., shape type) and numerical (e.g., position and size) properties, it is challenging to learn a general and compact representation with limited data. Inspired by the recent success of self-supervised pre-training techniques in natural language processing, we propose CanvasEmb (Canvas Embedding), which pre-trains a deep representation from unlabeled graphic designs by jointly conditioning on all the context elements in the same canvas, using a multi-dimensional feature encoder and a multi-task learning objective. The pre-trained CanvasEmb model can be fine-tuned with just one additional output layer and a small amount of training data to create models for a wide range of downstream tasks. We verify our approach on presentation slides: we construct a large-scale dataset with more than one million slides, and propose two novel layout understanding tasks with human-labeled sets, namely element role labeling and image captioning. Evaluation results on these two tasks show that our fine-tuned model achieves state-of-the-art performance. Furthermore, we conduct a deep analysis to understand the modeling mechanism of CanvasEmb, and demonstrate its potential for further applications such as layout auto-completion and layout retrieval.

1. INTRODUCTION

Graphic design leverages layout to arrange visual elements in a canvas for conveying messages across different types of documents, while layout representation is the reverse process of understanding visual elements and their inter-relations in a canvas, which is key to the analysis (Stoffel et al., 2010), retrieval (Beusekom et al., 2006) and generation (Li et al., 2020b; Lee et al., 2020) of graphic designs. However, elements in a layout are complex: they are defined by multi-dimensional properties such as type (e.g., text box, image or button), position and color. For example, the web page and presentation slide shown in Figure 1 are each constructed from several elements, and each element is defined by several properties. Due to the complex and sparse features of elements, as well as the rich diversity of layouts, learning a general and compact layout representation is challenging with a limited amount of data. Previous works related to layout representation (Li et al., 2019; Tabata et al., 2019; Lee et al., 2020) are mostly task-oriented: they simplify the layout to only the positions of elements, and directly optimize task-specific labels with fewer than a few thousand instances. Recently, self-supervised pre-trained models such as ELMo (Peters et al., 2018), GPT (Radford, 2018) and BERT (Devlin et al., 2019) have shown promising results in improving a variety of natural language processing (NLP) tasks. The success of pre-trained models in NLP has inspired us to learn contextual layout representations from large-scale unlabeled graphic designs, which can facilitate various downstream tasks for design intelligence. As one highly related work, LayoutLM (Xu et al., 2019) is a pre-trained model incorporating both text content and layout information for scanned documents.
However, it is difficult to generalize to other document types, since its input is word-level and it defines layout only as word position, which is insufficient to describe a layout in graphic design. In this paper, we present CanvasEmb, a large-scale pre-trained model for learning contextual layout representations. It pre-trains a deep representation from unlabeled graphic designs by jointly conditioning on all the context elements in the same canvas, and the pre-trained CanvasEmb model can be fine-tuned with just one additional output layer and a small amount of training data to create models for a wide range of downstream tasks. Specifically, we define a generic and high-coverage vocabulary to describe element properties in the canvas. A feature encoder is designed to jointly incorporate multi-dimensional properties, and it is built on the multi-layer Transformer (Devlin et al., 2019) for modeling element contexts. To ensure the representation conditions on all dimensions of element contexts, we adopt a masked language modeling strategy with a multi-task objective, where we randomly mask some properties of elements and predict them during pre-training. To verify our approach, we construct a large-scale dataset with more than one million presentation slides containing rich layout meta-information for pre-training. We then propose two novel downstream tasks for layout understanding, with human-labeled sets, to evaluate the performance of our pre-trained CanvasEmb model. The first task is element role labeling: given only the layout information, the goal is to classify the semantic role of each element (e.g., title, subtitle). The second task is image captioning, which detects whether a text box and an image in a slide are in an image-captioning relation. Experimental results on the two tasks show that fine-tuning the CanvasEmb model achieves state-of-the-art performance.
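The masked multi-property pre-training objective can be illustrated with a minimal sketch. The property names, vocabulary, and masking rate below are illustrative assumptions, not the paper's exact configuration: each element is a set of categorical and numerical properties, and the corruption step produces one prediction target per masked property (one head per property type in the multi-task objective).

```python
import random

# Hypothetical property vocabulary: categorical properties take discrete
# values; numerical ones (position/size) are normalized to [0, 1].
MASK = "[MASK]"

def mask_properties(elements, mask_prob=0.15, seed=0):
    """Randomly mask element properties for pre-training.

    elements: list of dicts, e.g. {"type": "text_box", "x": 0.1, ...}.
    Returns (corrupted elements, targets), where each target is
    (element index, property name, original value) and is predicted
    by the head for that property type.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for i, element in enumerate(elements):
        element = dict(element)  # do not mutate the caller's data
        for prop in list(element):
            if rng.random() < mask_prob:
                targets.append((i, prop, element[prop]))
                element[prop] = MASK
        corrupted.append(element)
    return corrupted, targets
```

The corrupted elements would then be fed through the feature encoder and Transformer, with each masked slot scored against its own property vocabulary.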
Furthermore, we conduct a deep analysis to understand the modeling mechanism of CanvasEmb. We also demonstrate the potential of our pre-trained CanvasEmb with two extended applications: layout auto-completion (Li et al., 2020b) and layout retrieval. The contributions of this work are as follows:
• We propose CanvasEmb, which to the best of our knowledge is the first pre-trained model for layouts in graphic design. It can be fine-tuned with a small amount of training data for a wide range of downstream tasks.
• We construct a large-scale dataset of presentation slides with rich layout information, as well as two novel tasks for layout understanding (i.e., element role labeling and image captioning) with human-labeled sets.
• We demonstrate that our model achieves state-of-the-art performance on the two downstream tasks, and show its potential for further applications such as layout auto-completion and layout retrieval.
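The fine-tuning recipe of a single additional output layer can be sketched as follows. The role label set, representation shapes, and softmax head are assumptions for illustration; the paper does not specify this exact interface.

```python
import numpy as np

# Hypothetical role label set for the element role labeling task.
ROLES = ["title", "subtitle", "body", "image", "footer"]

def role_classification_head(element_reprs, W, b):
    """One linear output layer on top of pre-trained representations.

    element_reprs: (num_elements, hidden_dim) contextual element
    representations from the pre-trained encoder (assumed shape).
    W: (hidden_dim, len(ROLES)) weights; b: (len(ROLES),) bias.
    Returns per-element probabilities over role labels via softmax.
    """
    logits = element_reprs @ W + b                     # (num_elements, num_roles)
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)
```

During fine-tuning, only this head (and optionally the encoder) is trained with cross-entropy on the small labeled set; the pairwise image-captioning task would analogously apply a head to a (text box, image) representation pair.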

2. RELATED WORK

Layout representation is the focal point of design in rich media, including presentation slides, magazines, comics, posters and web pages. High-quality representations can be conducive to multiple practical design tasks. Early works on design or document layout mainly rely on templates (Hurst et al., 2009; Damera-Venkata et al., 2011) or heuristic rules (O'Donovan et al., 2014; Tabata et al., 2019), and require professional knowledge and manual effort. To efficiently facili-



Figure 1: Example layouts for different document types: (a) web page; (b) slide. As shown on the right, elements have multiple properties, which make layouts complex and diverse.

