CANVASEMB: LEARNING LAYOUT REPRESENTATION WITH LARGE-SCALE PRE-TRAINING FOR GRAPHIC DESIGN

Abstract

Layout representation, which models visual elements in a canvas and their interrelations, plays a crucial role in graphic design intelligence. Given the large variety of layout designs and the unique characteristic that visual elements are defined by a list of categorical (e.g., shape type) and numerical (e.g., position and size) properties, it is challenging to learn a general and compact representation with limited data. Inspired by the recent success of self-supervised pre-training techniques on various natural language processing tasks, in this paper we propose CanvasEmb (Canvas Embedding), which pre-trains a deep representation from unlabeled graphic designs by jointly conditioning on all the context elements in the same canvas, with a multi-dimensional feature encoder and a multi-task learning objective. The pre-trained CanvasEmb model can be fine-tuned with just one additional output layer and a small amount of training data to create models for a wide range of downstream tasks. We verify our approach on presentation slides. We construct a large-scale dataset with more than one million slides, and propose two novel layout understanding tasks with human-labeled sets, namely element role labeling and image captioning. Evaluation results on these two tasks show that our fine-tuned model achieves state-of-the-art performance. Furthermore, we conduct a deep analysis to understand the modeling mechanism of CanvasEmb, and demonstrate its potential for further applications such as layout auto-completion and layout retrieval.

1. INTRODUCTION

Graphic design leverages layout to arrange visual elements in a canvas for conveying messages in different types of documents, while layout representation is the reverse process of understanding visual elements and their inter-relations in a canvas, which is key to the analysis (Stoffel et al., 2010), retrieval (Beusekom et al., 2006) and generation (Li et al., 2020b; Lee et al., 2020) of graphic designs. However, elements in a layout are complex: they are defined by multi-dimensional properties such as type (e.g., text box, image or button), position and color. For example, the web page and presentation slide shown in Figure 1 are each defined by a large number of settings, as each example is constructed from several elements and each element is defined by several properties. Due to the complex and sparse features of elements, as well as the rich diversity of layouts, learning a general and compact layout representation is challenging with a limited amount of data.

Previous works related to layout representation (Li et al., 2019; Tabata et al., 2019; Lee et al., 2020) are mostly task-oriented. They simplify the layout to only the positions of elements, and directly optimize task-specific labels with fewer than a few thousand instances. Recently, self-supervised pre-trained models such as ELMo (Peters et al., 2018), GPT (Radford, 2018) and BERT (Devlin et al., 2019) have shown promising results in improving a variety of natural language processing (NLP) tasks. The success of pre-trained models in NLP has inspired us to learn contextual layout representations from large-scale unlabeled graphic designs, which can facilitate various downstream tasks for design intelligence. As one highly related work, LayoutLM (Xu et al., 2019) is a document pre-trained model incorporating both text content and layout information for scanned documents. However, it is difficult to generalize to other document types, since its input is
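To make the mixed categorical/numerical nature of element properties concrete, the following is a minimal, hypothetical sketch of how one layout element might be encoded as model input. The vocabulary, function names, and normalization scheme are our illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical encoding of a layout element's properties (not the paper's
# actual feature encoder): categorical properties (e.g., element type) map
# to vocabulary indices, while numerical properties (e.g., position and
# size) stay continuous.

TYPE_VOCAB = {"text_box": 0, "image": 1, "button": 2, "shape": 3}  # illustrative

def encode_element(elem_type, x, y, width, height, canvas_w, canvas_h):
    """Return (categorical_ids, numerical_features) for one element."""
    cat = [TYPE_VOCAB[elem_type]]
    # Normalize geometry to [0, 1] so layouts on different canvas sizes
    # share one feature scale.
    num = [x / canvas_w, y / canvas_h, width / canvas_w, height / canvas_h]
    return cat, num

# A canvas is then a sequence of such encoded elements, over which a
# context-conditioned encoder can attend jointly.
slide = [
    encode_element("text_box", 60, 40, 840, 120, 960, 540),
    encode_element("image", 60, 200, 400, 300, 960, 540),
]
```

The sparsity the paragraph mentions arises because most elements set only a few of the many possible properties, so the full feature vector is largely empty for any single element.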

