RECOVERING GEOMETRIC INFORMATION WITH LEARNED TEXTURE PERTURBATIONS

Abstract

Regularization is used to avoid overfitting when training a neural network; unfortunately, this reduces the attainable level of detail, hindering the network's ability to capture the high-frequency information present in the training data. Although various approaches may be used to re-introduce high-frequency detail, the result typically does not match the training data and is often not temporally coherent. In the case of network-inferred cloth, these shortcomings manifest as either a lack of detailed wrinkles or unnatural and/or temporally incoherent surrogate wrinkles. Thus, we propose a general strategy whereby high-frequency information is procedurally embedded into low-frequency data so that when the latter is smeared out by the network, the former still retains its high-frequency detail. We illustrate this approach by learning texture coordinates which, when smeared, do not in turn smear out the high-frequency detail in the texture itself but merely smoothly distort it. Notably, we prescribe perturbed texture coordinates that are subsequently used to correct the over-smoothed appearance of inferred cloth, and correcting the appearance from multiple camera views naturally recovers lost geometric information.



1. INTRODUCTION

Since neural networks are trained to generalize to unseen data, regularization is important for reducing overfitting, see e.g. Goodfellow et al. (2016); Schölkopf & Smola (2001). However, regularization also removes some of the high variance characteristic of much of the physical world. Even though high-quality ground truth data can be collected or generated to reflect the desired complexity of the outputs, regularization will inevitably smooth network predictions. Rather than attempting to directly infer high-frequency features, we alternatively propose to learn a low-frequency space in which such features can be embedded. We focus on the specific task of adding high-frequency wrinkles to virtual clothing, noting that the idea of learning a low-frequency embedding may be generalized to other tasks. Rather than attempting to amend over-smoothing errors directly, we perturb texture so that the rendered cloth mesh appears to more closely match the ground truth; see Figure 1. Then, given texture perturbations from at least two unique camera views, 3D geometry can be accurately reconstructed Hartley & Sturm (1997) to recover high-frequency wrinkles. Similarly, for AR/VR applications, correcting the visual appearance from two views (one for each eye) is enough to allow the viewer to accurately discern 3D geometry. Our proposed texture coordinate perturbations are highly dependent on the camera view. Thus, we demonstrate that one can train a separate texture sliding neural network (TSNN) for each of a finite number of cameras laid out in an array and use nearby networks to interpolate results valid for any view enveloped by the array. Although an approach similar in spirit might be pursued for various lighting conditions, this limitation is left as future work since many applications feature ambient/diffuse/non-directional lighting; in such situations, this further complication may be ignored without significant repercussion.
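To make the per-camera interpolation concrete, the following sketch blends UV perturbations predicted by the nearest per-camera networks using convex weights for a novel view. This is an illustration only: the function name, the array shapes, and the weighting scheme (e.g. bilinear weights within the camera array) are assumptions, not the paper's implementation.

```python
import numpy as np

def interpolate_uv_offsets(offsets, weights):
    """Blend per-camera UV perturbations for a novel viewpoint.

    offsets: (k, n, 2) array -- UV offsets predicted by the k nearest
             per-camera networks for each of n mesh vertices.
    weights: (k,) convex weights for the novel view (must sum to 1).
    Returns an (n, 2) array of interpolated UV offsets.
    """
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0) and (weights >= 0).all()
    # Weighted sum over the camera axis k.
    return np.einsum('k,knc->nc', weights, np.asarray(offsets, dtype=float))
```

For example, a view halfway between two cameras whose networks predict offsets of (1, 0) and (0, 1) for a vertex would receive the blended offset (0.5, 0.5).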

3. METHODS

We define texture sliding as the changing of texture coordinates on a per-camera basis such that any point which is visible from some stereo pair of cameras can be triangulated back to its ground truth position. Other stereo reconstruction techniques can also be used in place of triangulation because the images we generate are consistent with the ground truth geometry. See e.g. Bradley et al. (2008a); Hartley & Sturm (1997); Seitz et al. (2006) .
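The triangulation step can be sketched with the standard linear (DLT) method analyzed by Hartley & Sturm (1997); the camera matrices and image observations below are illustrative placeholders, and in practice one of the more robust variants they discuss may be preferable.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of a 3D point from two views.

    P1, P2: 3x4 camera projection matrices for the stereo pair.
    x1, x2: 2D image observations of the same point in each view.
    Returns the estimated 3D point in world coordinates.
    """
    # Each observation contributes two linear constraints on the
    # homogeneous point X, derived from x cross (P X) = 0.
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Least-squares solution: right singular vector with the
    # smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

Applying such a routine to corresponding perturbed texture coordinates seen from two cameras recovers a 3D position consistent with the ground truth geometry.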



Figure 1: Texture coordinate perturbations (texture sliding) reduce shape inference errors: ground truth (blue), prediction (orange).

Because cloth wrinkles/folds are high-frequency features, existing deep neural networks (DNNs) trained to infer cloth shape tend to predict overly smooth meshes Alldieck et al. (2019a); Daněřek et al. (2017); Guan et al. (2012); Gundogdu et al. (2019); Jin et al. (2020); Lahner et al. (2018); Natsume et al. (2019); Santesteban et al. (2019); Wang et al. (2018); Patel et al. (2020).

2. RELATED WORK

While physically-based cloth simulation has matured as a field over the last few decades Baraff & Witkin (1998); Baraff et al. (2003); Bridson et al. (2002; 2003); Selle et al. (2008), data-driven methods are attractive for many applications. There is a rich body of work on reconstructing cloth from multiple views or 3D scans, see e.g. Bradley et al. (2008b); Franco et al. (2006); Vlasic et al. (2008). More recently, optimization-based methods have been used to generate higher resolution reconstructions Huang et al. (2015); Pons-Moll et al. (2017); Wu et al. (2012); Yang et al. (2016). Some of the most interesting work focuses on reconstructing the body and cloth separately Bălan & Black (2008); Neophytou & Hilton (2014); Yang et al. (2018); Zhang et al. (2017). With advances in deep learning, one can aim to reconstruct 3D cloth meshes from single views. A number of approaches reconstruct a joint cloth/body mesh from a single RGB image Alldieck et al. (2019a;b); Natsume et al. (2019); Onizuka et al. (2020); Saito et al. (2019; 2020), RGB-D image Yu et al. (2019), or video Alldieck et al. (2018a;b); Habermann et al. (2019); Xu et al. (2018). To reduce the dimensionality of the output space, DNNs are often trained to predict the pose/shape parameters of human body models such as SCAPE Anguelov et al. (2005) or SMPL Loper et al. (2015) (see also Pavlakos et al. (2019)). Habermann et al. (2019); Natsume et al. (2019); Varol et al. (2018) leverage predicted pose information to infer shape. When only the garment shape is predicted, a number of recent works output predictions in UV space to represent geometric information as pixels Daněřek et al. (2017); Jin et al. (2020); Lahner et al. (2018), although others Gundogdu et al. (2019); Santesteban et al. (2019); Patel et al. (2020) define loss functions directly in terms of the 3D cloth vertices.

Wrinkles and Folds: Cloth realism can be improved by introducing wrinkles and folds.
In the graphics community, researchers have explored both procedural and data-driven methods for generating wrinkles De Aguiar et al. (2010); Guan et al. (2012); Hahn et al. (2014); Müller & Chentanez (2010); Rohmer et al. (2010); Wang et al. (2010). Other works add real-world wrinkles as a post-processing step to improve smooth captured cloth: Popa et al. (2009) extract the edges of cloth folds and then apply space-time deformations, while Robertini et al. (2014) solve for shape deformations directly by optimizing over all frames of a video sequence. Recently, Lahner et al. (2018) used a conditional Generative Adversarial Network Mirza & Osindero (2014) to generate normal maps as proxies for wrinkles on captured cloth.

More broadly, deep learning on 3D meshes falls under the umbrella of geometric deep learning, a term coined by Bronstein et al. (2017) to characterize learning in non-Euclidean domains. Scarselli et al. (2008) was one of the earliest works in this area and introduced the notion of a Graph Neural Network (GNN) in relation to CNNs. Subsequent works similarly extend the CNN architecture to graphs and manifolds Boscaini et al. (2016); Maron et al. (2017); Masci et al. (2015); Monti et al. (2017). Kostrikov et al. (2018) introduce a latent representation that explicitly incorporates the Dirac operator to detect principal curvature directions. Tan et al. (2018) train a mesh generative model to generate novel meshes outside an original dataset. Returning to the specific application of virtual cloth, Jin et al. (2020) embed a non-Euclidean cloth mesh into a Euclidean pixel space, making it possible to directly use CNNs to make non-Euclidean predictions.

