COLLABORATIVE SYMMETRICITY EXPLOITATION FOR OFFLINE LEARNING OF HARDWARE DESIGN SOLVER

Anonymous

Abstract

This paper proposes collaborative symmetricity exploitation (CSE), a novel symmetric learning scheme for contextual policies in offline black-box placement problems. Leveraging symmetricity increases data efficiency by reducing the solution space, and improves generalization capability by capturing the invariance that holds regardless of the changing context. To this end, we design a learning scheme that reduces the order bias (e.g., a neural network recognizes {1, 2, 3} and {2, 1, 3} as different placement designs) inherited from the sequential decision-making scheme of a neural policy by imposing the action-permutation (AP)-symmetricity (i.e., the permuted sequences are symmetric placements of the original sequence) of placement problems. We first define the order bias and prove that AP-symmetricity is imposed when the order bias of the neural policy becomes zero. Then, we design two collaborative losses for learning a neural policy with reduced order bias: expert exploitation and self-exploitation. The expert exploitation loss clones the behavior of the expert solutions while accounting for order bias. The self-exploitation loss is a special form of order bias that measures AP-symmetricity on self-generated solutions. CSE is applied to the decoupling capacitor placement problem (DPP) benchmark, a significant offline black-box placement design problem in the hardware domain that requires a contextual policy. Experiments show that CSE outperforms state-of-the-art solvers on the DPP benchmark.

1. INTRODUCTION

With shrinking CMOS technology and increasing data rates, the design complexity of very large-scale integration (VLSI) has increased. Human experts are no longer able to design hardware without the help of electronic design automation (EDA) tools, and EDA tools now suffer from long simulation times and insufficient computing power, making the application of machine learning (ML) to hardware design inevitable. Many studies have already shown that deep reinforcement learning (DRL), one of the representative ML methods for sequential decision making, is promising in various tasks in modern chip design: chip placement (Mirhoseini et al., 2021; Agnesina et al., 2020), routing (Liao et al., 2019; 2020), circuit design (Zhao & Zhang, 2020), logic synthesis (Hosny et al., 2020; Haaswijk et al., 2018), and bi-level hardware optimization (Cheng & Yan, 2021).

However, most previous DRL-based hardware design methods do not take the following into consideration. (a) Online simulators for hardware are usually time-intensive and inaccurate; thus, learning from existing offline data produced by experts is more reliable. Since only a limited amount of offline hardware data exists, a data-efficient learning scheme is necessary. (b) Hardware design is composed of electrically coupled multi-level tasks where task conditions are determined by the design of higher-level tasks; thus, a solver (i.e., a contextualized policy conditioned on higher-level tasks) with high generalization capability to adapt to varying task conditions is necessary.

In this paper, we leverage the solution symmetricity of placement problems for data efficiency and generalization capability. Conventional sequential decision-making schemes for placement problems (Park et al., 2020; Mirhoseini et al., 2021; Cheng & Yan, 2021) auto-regressively generate solutions without considering the solution symmetricity and thus have an order bias: the neural network identifies action-permutation (AP)-symmetric solutions (i.e., identical placement designs), for instance {1, 2, 3} and {2, 1, 3}, as different solutions. Our proposed method overcomes the order bias limitation of previous sequential decision-making schemes with a novel regularization technique.

Tackling the order bias (i.e., inducing AP-symmetricity) improves the data efficiency of training and the generalization capability of the trained policy for two reasons. First, data efficiency in training improves because learning the AP-symmetricity reduces the exploration space (see Fig. 1): the neural network can learn not only from the explored trajectories but also from their symmetric solution trajectories without additional exploration and simulation. Second, generalization capability under task variation improves because AP-symmetricity is a task-agnostic property of placement problems.

Figure 1: Conventional sequential decision-making method's heterogeneous trajectories from an AP-symmetric solution group.

To this end, we devise the collaborative symmetricity exploitation (CSE) framework, a simple but effective method to induce AP-symmetricity with two collaborative learning schemes: expert exploitation and self-exploitation. Expert exploitation simply augments the offline expert data (sequential data) with a random permutation operator and uses it for imitation learning. Self-exploitation generates pseudo-labeled solutions from the current training policy, transforms each pseudo-labeled solution with a random permutation operator, and forces the solver to assign identical probabilities to the original pseudo-labeled solution and the transformed solution.

To verify the effectiveness of CSE, we applied CSE to the decoupling capacitor (decap) placement problem (DPP), one of the significant hardware design benchmarks.
The objective of DPP is to place a given number of decaps on a power distribution network (PDN) under two varying conditions, the keep-out regions and the probing port location, which are determined by higher-level problems such as chip placement and routing. The goal of CSE is to train a solver (i.e., a contextualized policy) that generalizes well to any given task condition.

Contribution 1: A novel symmetric learning scheme for contextualized policy. There exist several works (Cohen & Welling, 2016; Thomas et al., 2018; Fuchs et al., 2020; Satorras et al., 2021) that learn various symmetricities of input data in the domain space for regression and classification tasks. However, learning symmetricity in the solution space is less studied because learning the symmetricities in the solution space of a sequential policy (generative decision) is challenging. Bengio et al. (2021) tackled the solution symmetricity of a sequential policy by turning the Markov decision process (MDP) tree model into a directed acyclic graph (DAG)-based flow model. However, they target single-task optimization where the optimal solution set is unchanged. In contrast, CSE is an effective solution-symmetric learning scheme for a contextualized policy capable of adapting to newly given task conditions.

Contribution 2: DPP benchmark release. DPP is a widely studied task in the hardware domain, yet the simulation models and source code of existing methods have not been publicly released. Also, DPP can be seen as a contextual offline black-box optimization benchmark with extended properties compared to design-bench (Trabucco et al., 2022), a representative non-contextual offline black-box optimization benchmark. In this work, by releasing the DPP benchmark with open-source simulation models and our reproduced baselines (DRL-based methods, meta-heuristic methods, and behavior cloning-based methods) together with our state-of-the-art CSE method, we expect a significant industrial impact on the hardware and ML communities.

2. DECAP PLACEMENT PROBLEM (DPP) FORMULATION

This paper seeks to solve the decoupling capacitor placement problem (DPP), one of the essential hardware design problems. A decoupling capacitor (decap) is a hardware component that reduces power noise along the power distribution network (PDN) of hardware devices and improves the power integrity (PI). With transistor scaling and continuously decreasing supply voltage margins (Hwang et al., 2021), power noise has become a major technical bottleneck in high-speed computing systems. Generally, the more decaps are placed, the more reliable the power supply is. However, adding more decaps requires more space and is costly. Thus, finding an optimal placement of decaps is essential in terms of hardware performance and cost/space saving.

The goal of the DPP benchmark is to optimally place a pre-defined number of decaps on a target PDN, given two conditions determined by higher-level tasks. First, keep-out regions are action-restricted areas where decaps cannot be placed as a design constraint. Second, the probing port is the target chip/logic block location where the objective, power integrity (PI), is evaluated. The ports on the PDN are represented as an $N_{row} \times N_{col}$ grid, and the number of decaps is denoted by $K$. See Appendix A.2 for details of the PDN and decap modeling for the benchmark.

Note that DPP cannot be formulated as a conventional mixed-integer linear programming (MILP)-based combinatorial optimization problem because PI performance cannot be expressed in closed analytical form; it can only be measured or simulated. This study aims to learn an effective DPP solver that can be used in practice.

2.1. CONTEXTUAL MARKOV DECISION PROCESS (MDP) OF DPP

As shown in Fig. 2, the procedure for solving DPP is modeled as a contextual Markov decision process (CMDP), an augmented MDP proposed by Hallak et al. (2015). The parameters of a CMDP, namely the transition and reward functions, change with the context variable. DPP can be formulated as a CMDP in which the objective function $J(\cdot; x)$ is determined by the task-condition $x \in X$ (i.e., the context), where $X$ is the task-condition space. Specifically, $J(\cdot; x)$ is determined by the PDN, which is contextualized by $x$. The contextualized PDN is represented as a set of three-dimensional feature vectors $x = \{\mathbf{x}_i\}_{i=1}^{N_{row} \times N_{col}}$, where each grid point (i.e., port) on the PDN is represented as $\mathbf{x}_i = (x_i, y_i, c_i)$: $x_i, y_i$ are the 2D coordinates of the port, and $c_i$ indicates whether the port is the probing port $I_{probe}$ ($c_i = 2$), belongs to the keep-out regions $I_{keepout}$ ($c_i = 1$), or is a decap-allowed port $I_{allowed}$ ($c_i = 0$). Note that $I_{keepout}$ and $I_{allowed}$ are the index sets of the keep-out ports and the decap-allowed ports, respectively, and $I_{probe}$ is the index of the probing port. See Appendix A.3.

The design process sequentially places decaps on the available PDN ports until all $K$ designated decaps are placed. We model this CMDP with state, action, and policy as follows:

State $s_t$ contains the task-condition $x$ and the previously selected actions: $s_t = \{x, a_{1:t-1}\}$.

Action $a_t \in \{1, \dots, N_{row} \times N_{col}\} \setminus s_{t-1}$ is the allocation of a decap to one of the available ports on the PDN. The available ports are the ports on the PDN except for the probing port, the keep-out ports, and the previously selected ports. The concatenation of the sequentially selected actions $a = a_{1:K}$ is the final decap placement solution.
Policy $\pi_\theta(a|x)$ is the probability of producing a specific solution $a = a_{1:K}$ given the task-condition $x$, and is factorized as

$$\pi_\theta(a|x) = \prod_{t=1}^{K} p_\theta(a_t|s_t), \tag{1}$$

where $p_\theta(a_t|s_t)$ is the segmented one-step action policy parameterized by the neural network.

The objective of DPP is to find the optimal parameter $\theta^*$ of the policy $\pi_\theta(\cdot|x)$:

$$\theta^* = \arg\max_\theta \mathbb{E}_{x \sim p_X(x)} \mathbb{E}_{a \sim \pi_\theta(\cdot|x)} [J(a; x)], \tag{2}$$

where $p_X(x)$ is the probability distribution of the varying task-condition $x$ and $J$ is the objective function. Finding the optimal policy for various DPPs is a task-contextual learning problem in which each DPP has a distinct task condition. Once the task $x$ is sampled from $p_X(x)$, the state-action space, with complexity $\binom{N_{row} \times N_{col} - 1 - |I_{keepout}|}{K}$, is determined. Thus, an efficient policy $\pi_\theta(a|x)$ should capture the contextual features among varying task conditions $x$. Note that CSE is based on imitation learning, by which the objective function $J$ is implicitly induced through offline expert data. The objective function of DPP is described in equation 3.

Performance Evaluation Metric. The performance of DPP is evaluated by a power integrity (PI) simulation that computes the level of impedance suppression over a specified frequency domain and is quantified as

$$J := \sum_{f \in F} (Z_{initial}(f) - Z_{final}(f)) \cdot \frac{1\,\mathrm{GHz}}{f}, \tag{3}$$

where $Z_{initial}$ and $Z_{final}$ are the impedance at frequency $f$ before and after placing decaps, respectively, and $F$ is the set of specified frequency points. As shown in Fig. 3, the PI simulation for an ($N_{row} \times N_{col}$) PDN requires the calculation of a Z-parameter matrix of size $N_{row} N_{col} \times N_{row} N_{col} \times n_{freq}$ (where $n_{freq}$ is the number of frequency points) because each port is electrically coupled to the rest of the ports and Z (i.e., impedance) is frequency-dependent. Thus, performance evaluation with a large Z-parameter matrix calculation is costly. The more the impedance is suppressed, the better the power integrity and the higher the performance score.
Note that this performance metric was also used for collecting the offline expert data using a genetic algorithm (GA).
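The performance metric and the search-space size from this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `pi_score` and `num_placements`, the three frequency points, and the impedance values are hypothetical, and the impedances are assumed to be real-valued magnitudes in ohms rather than the output of an actual PI simulation.

```python
import math
import numpy as np

def pi_score(z_initial, z_final, freqs_hz):
    """Sum of impedance suppression weighted by (1 GHz / f) over the frequency points."""
    z_initial = np.asarray(z_initial, dtype=float)
    z_final = np.asarray(z_final, dtype=float)
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    return float(np.sum((z_initial - z_final) * (1e9 / freqs_hz)))

def num_placements(n_row, n_col, n_keepout, k):
    """Number of unordered ways to place k decaps on the allowed ports.

    One port is reserved for the probing port and n_keepout ports are blocked.
    """
    return math.comb(n_row * n_col - 1 - n_keepout, k)

# Hypothetical impedance values at 0.1 GHz, 1 GHz and 10 GHz.
freqs = [1e8, 1e9, 1e10]
example_score = pi_score([2.0, 4.0, 8.0], [1.0, 2.0, 8.0], freqs)

# Search-space size for a 10 x 10 PDN with 5 keep-out ports (assumed) and K = 20.
example_space = num_placements(10, 10, 5, 20)
```

The weighting by 1 GHz / f means that suppression at low frequencies contributes more to the score than the same suppression at high frequencies.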

3. METHODOLOGY

This section provides technical details of the proposed collaborative symmetricity exploitation (CSE) framework and the modified attention model (AM) neural architecture with two domain-specifically devised context neural networks for training a DPP solver.

3.1. ACTION-PERMUTATION SYMMETRICITY AND ORDER BIAS OF PLACEMENT TASK

The symmetricity found in placement problems is the action-permutation (AP)-symmetricity: the order of placement does not affect the design performance. Let $t_i$ denote a permutation of an action sequence $\{1, \dots, K\}$, where $K$ is the length of the action sequence. We then define the AP-transformation $T_{AP} = \{t_i\}_{i=1}^{K!}$ as the set of all possible permutations. The AP-symmetricity of DPP is induced in the learned solver through the AP-transformation $T_{AP}$.

Definition 3.1 (AP-symmetricity). For any $a \in A$, $x \in X$, $t \in T_{AP}$, where $A$ is the solution space and $X$ is the task-condition space,
• A scalar-valued function $f : A \times X \to \mathbb{R}$ is AP-symmetric if $f(a, x) = f(t(a), x)$.
• A conditional probability $\pi$ is AP-symmetric if $\pi(a|x) = \pi(t(a)|x)$.

The objective function $J : A \times X \to \mathbb{R}$ of DPP is an AP-symmetric function because $a$ and $t(a)$ have identical placement designs. The main role of CSE is to induce AP-symmetricity in the policy $\pi$ (a conditional probability) to reflect the AP-symmetricity of the objective function $J$. Moreover, we define an order bias metric, $b(\pi; p)$, to measure AP-symmetricity.

Definition 3.2 (Order bias on distributions $p = \{p_X, p_A, p_{T_{AP}}\}$). For a conditional probability $\pi(a|x)$, where $x \in X$ (task-condition space) and $a \in A$ (solution space), the order bias $b(\pi; p)$ is defined as

$$b(\pi; p) = \mathbb{E}_{x \sim p_X(x)} \mathbb{E}_{a \sim p_A(a)} \mathbb{E}_{t \sim p_{T_{AP}}(t)} [\|\pi(a|x) - \pi(t(a)|x)\|_1].$$

Intuitively, the order bias $b(\pi; p)$ is a general property of a sequential solution generation scheme. It measures how different the probabilities are with which the solver $\pi(a|x)$ generates AP-symmetric solutions. The order bias metric satisfies the following theorem:

Theorem 3.1. A task-conditioned policy $\pi(a|x)$ is AP-symmetric if and only if the order bias is zero ($b(\pi; p) = 0$), provided the distributions are non-zero, i.e., $p_X(x) > 0$, $p_A(a) > 0$, $p_{T_{AP}}(t) > 0$ for any $x \in X$, $a \in A$, and $t \in T_{AP}$.

Proof. See Appendix G.
The CSE framework was designed to induce the AP-symmetricity to the trained model to improve the generalization capability and to allow data-efficiency in training.
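To make the order bias of Definition 3.2 concrete, the following sketch evaluates it exactly for a toy sequential policy. The toy policy is an assumption made purely for illustration and is not the neural policy of this paper: it handles a single fixed task condition and picks ports without replacement with probability proportional to fixed per-port weights. The distributions $p_A$ and $p_{T_{AP}}$ are taken to be uniform over all ordered solutions and all permutations.

```python
import itertools
import numpy as np

def sequence_prob(weights, actions):
    """Probability of an ordered placement under the toy without-replacement policy."""
    remaining = float(np.sum(weights))
    prob = 1.0
    for a in actions:
        prob *= weights[a] / remaining  # one-step probability p(a_t | s_t)
        remaining -= weights[a]         # the chosen port is no longer available
    return prob

def order_bias(weights, k):
    """Exact order bias for one context, averaged uniformly over solutions and permutations."""
    n_ports = len(weights)
    gaps = []
    for a in itertools.permutations(range(n_ports), k):  # every ordered solution
        p_a = sequence_prob(weights, a)
        for t in itertools.permutations(range(k)):       # every element of T_AP
            t_a = tuple(a[i] for i in t)
            gaps.append(abs(p_a - sequence_prob(weights, t_a)))
    return float(np.mean(gaps))

uniform_bias = order_bias(np.ones(4), k=3)                     # AP-symmetric toy policy
skewed_bias = order_bias(np.array([4.0, 2.0, 1.0, 1.0]), k=3)  # order-biased toy policy
```

In this toy setting, the uniform policy assigns the same probability to every ordering and therefore has zero order bias, whereas unequal weights make the probability of a placement depend on the order in which its ports are chosen.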

3.2. COLLABORATIVE SYMMETRICITY EXPLOITATION (CSE) FRAMEWORK

To train a contextualized policy with a limited amount of expert data, we designed the CSE loss term $L$, consisting of the expert exploitation loss $L_{Expert}$ and the self-exploitation loss $L_{Self}$. Each loss function is mainly designed to reduce the order bias (Definition 3.2):

$$L := L_{Expert} + \lambda L_{Self} \tag{4}$$

$$L_{Expert} = -\mathbb{E}_{(a^*, x) \sim D_{aug}} [\log \pi_\theta(a^*|x)] \tag{5}$$

$$L_{Self} = \mathbb{E}_{x \sim U_X} \mathbb{E}_{a' \sim \pi_\theta(\cdot|x)} \mathbb{E}_{t \sim U_{T_{AP}}} [\|\pi_\theta(a'|x) - \pi_\theta(t(a')|x)\|_1] \tag{6}$$

Expert Exploitation. The major role of expert exploitation is to train a high-quality symmetric contextualized policy for various task-conditions $x$ by leveraging the offline expert data $a^*$ with $T_{AP}$. $T_{AP}$ transforms the existing offline expert data $a^*$ $P$ times to augment the offline expert dataset $D_{exp} = \{(x^{(i)}, a^{(i)*})\}_{i=1}^{N}$ so that it reflects the AP-symmetric nature of the placement task. Specifically, we randomly choose $\{t_1, \dots, t_P\} \subset T_{AP}$ to generate $D_{aug} = \{(x^{(i)}, a^{(i)*}), (x^{(i)}, t_1(a^{(i)*})), \dots, (x^{(i)}, t_P(a^{(i)*}))\}_{i=1}^{N}$. Then, $L_{Expert}$ is expressed as a teacher-forcing imitation learning scheme with the augmented expert dataset $D_{aug}$. Note that expert exploitation is expected to reduce the order bias defined with three uniform distributions: $x \sim U_{D_{exp}}(x)$, $a \sim U_{D_{exp}}(a)$, and $t \sim U_{T_{AP}}(t)$.

Self-Exploitation. While $D_{aug}$ only contains expert-quality data, self-exploitation involves self-generated data, whose quality is poor at the beginning but improves over the course of training. Thus, the self-exploitation scheme is designed to induce AP-symmetricity in a wider action space to achieve greater generalization capability. Formally, the self-exploitation loss is a special form of order bias defined on the distributions $x \sim U_X$, $a \sim \pi_\theta$ (the current policy), and $t \sim U_{T_{AP}}$, where $U$ is a uniform distribution: $L_{Self} = b(\pi_\theta, p = \{U_X, \pi_\theta, U_{T_{AP}}\})$.
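The two losses can be illustrated with a small NumPy sketch. It is a simplified evaluation of the loss value only, under several assumptions: the policy is exposed through a hypothetical callable `log_prob_fn(a, x)` returning $\log \pi_\theta(a|x)$, the self-generated solutions are passed in as a list instead of being sampled from the policy, one random permutation is drawn per self-generated solution, and no gradient computation is shown. The names `augment_expert_data` and `cse_loss` are illustrative.

```python
import numpy as np

def augment_expert_data(dataset, num_perms, rng):
    """Build D_aug: each (x, a*) pair plus num_perms randomly permuted copies of a*."""
    augmented = []
    for x, a_star in dataset:
        augmented.append((x, list(a_star)))
        for _ in range(num_perms):
            augmented.append((x, list(rng.permutation(a_star))))
    return augmented

def cse_loss(log_prob_fn, d_aug, self_samples, lam, rng):
    """Evaluate L = L_Expert + lam * L_Self for a policy given by log_prob_fn(a, x)."""
    # Expert exploitation: negative log-likelihood of the augmented expert solutions.
    l_expert = -np.mean([log_prob_fn(a, x) for x, a in d_aug])
    # Self-exploitation: probability gap between a solution and a random permutation of it.
    gaps = []
    for x, a in self_samples:
        t_a = list(rng.permutation(a))
        gaps.append(abs(np.exp(log_prob_fn(a, x)) - np.exp(log_prob_fn(t_a, x))))
    l_self = np.mean(gaps)
    return float(l_expert + lam * l_self)
```

With N = 50 expert pairs and P = 3 permutations, `augment_expert_data` returns 50 × (3 + 1) = 200 guiding pairs, each of which contains the same set of ports as its source solution.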

3.3. CONTEXTUAL ATTENTION MODEL

To further improve the generalization capability of the trained DPP solver, we modified the attention model (AM) (Kool et al., 2019) and termed the result the contextual attention model. As described in Fig. 2, the decision-making procedure involves two newly devised context neural networks: (1) the encoder captures the initial design conditions while contextualizing the probing port through the probing port context network (PCN), and (2) the decoder sequentially allocates decaps on the PDN while contextualizing the stages of the partial solution through the recurrent context network (RCN). See Appendix C for the detailed implementation.

Encoder. Our encoder consists of multi-head attention (MHA) and feedforward (FF) layers, similar to the transformer network (Vaswani et al., 2017). The encoder takes the task-conditioned PDN feature vector $x$ as input and outputs the embedding $h$ of all nodes. Encoding is performed once at the initial state, $t = 0$. The node embedding $h$ is time-invariant (i.e., fixed after the encoding process) and is used in the decoding process. We propose a novel probing port context network (PCN) in the encoding process so that the learned solver can adapt well to a new task. PCN is a simple but effective two-layer perceptron model with a ReLU activation layer that takes the hidden embedding of the probing port node $h_{probe}$ and outputs the probing port contextual vector $c_{probe} = \mathrm{MLP}_{PCN}(h_{probe})$.

Decoder. With the node embedding $h$ generated by the encoder, the decoder sequentially selects an action $a_t$ until all $K$ decaps are placed. At each step $t$, the decoder takes (1) the node embedding $h$ (static information), (2) the current state $s_t$, (3) the probing port contextual vector $c_{probe}$, and (4) the previous action context vector $c_{a_{t-1}} = \mathrm{MLP}_{RCN}(h_{a_{t-1}})$ generated by the recurrent context network (RCN) as inputs and outputs a new action $a_t$. To leverage sequential state transitions in the decoder, we devised the recurrent context network (RCN). RCN is a two-layer perceptron model with a ReLU activation layer that embeds the previously selected node's embedding $h_{a_{t-1}}$ at step $t$ into $c_{a_{t-1}}$. Then, the overall context vector is updated as $c = c_{probe} + c_{a_{t-1}}$ and is used as the query for the attention decoder, which eventually infers the next action $a_t$ by the attention mechanism, where the keys and values come from $h$. See Appendix C.2 for the detailed process.

2020) applied convolutional neural network (CNN)-based Q approximators to solve a target DPP. However, their methods require many iterations involving costly reward calculations, and their trained policies were non-reusable; if the DPP condition changes, they must be re-trained. In an effort to overcome the reusability limitation by training a solver, Park et al. (2022) and Kim et al. (2021) implemented promising neural combinatorial optimization (NCO) models, the attention model (Kool et al., 2019) and the pointer network (Vinyals et al., 2015), to construct a contextualized policy without iterative exploration and domain knowledge. However, their methods still showed poor data efficiency in training and unsatisfactory generalization performance.

Symmetricity Learning in Solution Space

There exist several studies that leverage the symmetricity in the solution space. Kwon et al. (2020) suggested a new reinforcement learning scheme, policy optimization for multiple optima (POMO), to leverage the solution symmetricity of the traveling salesman problem (TSP): the cyclic property that an identical solution can be expressed as N heterogeneous trajectories by permuting the initially visited node. Kim et al. (2022) proposed the symmetric neural combinatorial optimization (Sym-NCO) method, which extends POMO to general-purpose symmetric learning for various combinatorial optimization tasks. Bengio et al. (2021) proposed the generative flow network (GFlowNet) to train a policy distribution proportional to the reward distribution, $\pi \propto R$, considering solution symmetricity. GFlowNet performs sequential decision making on a directed acyclic graph (DAG), instead of the classical tree structure, to induce solution symmetricity. GFlowNet has been applied to molecule optimization and biological sequence design (Jain et al., 2022). While POMO (Kwon et al., 2020) and Sym-NCO (Kim et al., 2022) leverage DRL, CSE focuses on offline imitation learning. Though GFlowNet (Bengio et al., 2021) can be trained in a fully offline manner, it is not yet designed for training a contextualized policy. Thus, CSE is positioned between POMO and GFlowNet as an offline symmetricity learning method for training a contextualized policy.

.4 ). Decap is modeled as a unit-cell with a single port that is attached to a specific port on the PDN when an action is made. The RLGC electrical parameters of the decap unit-cell are also shown in Appendix A.2. For performance evaluation, 100 PDN cases were generated for testing and another 100 PDN cases for validation. We made sure that the test, validation, and training data did not overlap. We used K = 20 decaps for every training run, but K can be changed at inference.

Offline Expert Data Collection.
Since expert data for the DPP benchmark was not available, we synthetically generated offline expert data using a genetic algorithm (GA) for this study. The number of iterations performed to collect a single data sample is denoted by $M$. Note that the higher the $M$, the better the quality of the data, but the higher the simulation cost. We used GA{M = 100} to collect the offline expert data. In addition, we denote by $N$ the number of offline expert data used for training CSE. Note that the lower the $N$, the more data-efficient the training is.

Hyperparameters. For training, we generate three transformed data per expert data; that is, the number of AP-transformed data per offline data is $P = 3$. Thus, the total number of guiding data becomes $N \times (P + 1)$. For instance, $P = 3$ and $N = 50$ give a total of 50 × 3 + 50 = 200 guiding data. We set the distribution ρ, described in Section 2.1, to the uniform distribution for training. We used $N = 2{,}000$ offline expert data for training CSE and the IL-based baselines. For the learning algorithm, we used ADAM (Kingma & Ba, 2015) with a learning rate of $10^{-5}$. We trained our model with batch size 100 for $N < 200$ and batch size 1,000 for $N = 1{,}000$ and $2{,}000$. We trained each model for a maximum of 200 epochs; for CSE and the other ML baselines, we used the model with the best validation score to evaluate performance. See Appendix D for the detailed setup.

Baselines for Comparison. For search heuristic baseline methods, we implemented random search and a genetic algorithm. Since these methods are iterative solvers, they require a large number of simulations (i.e., $M \geq 100$; see Table 1) for each problem.
For non-iterative learning-based solvers, we report two RL baselines, AM-RL (Park et al., 2022) and Arb-RL (Kim et al., 2021), and two IL baselines, AM-IL and Arb-IL, which are AM-RL and Arb-RL modified to use imitation learning instead of reinforcement learning. The IL baselines allow us to investigate the effectiveness of the CSE components against imitation learning approaches with different learning strategies and neural architectures. Implementation details of the baselines are provided in Appendix D.2 and Appendix D.3.

5.2. GENERALIZATION CAPABILITY EVALUATION

To verify the generalization capability of the trained solver, each method is given the same 100 unseen DPPs, and the average performance score is measured after allocating K = 20 decaps on each. As shown in Table 1, our CSE significantly outperformed all baselines in terms of average performance score.

Online search methods generally find solutions with a high average performance. This is due to the large number of search iterations $M$, which incurs the same number of costly simulations. On the other hand, the learning-based baselines and CSE do not require simulations to generate solutions; once trained, they only require a single simulation to measure the performance. Although all learning-based methods generate solutions cheaply, CSE is the only method whose zero-shot inference finds solutions that outperform the highly iterative online search methods.

When the number of costly simulations was limited, the RL-based methods (AM-RL, Arb-RL) showed poorer generalization capability than their IL versions (AM-IL, Arb-IL) due to inefficient exploration of the extremely large combinatorial action space of DPP. We believe that the imitation learning approach, which fits the policy to offline expert data, has greater exploration capability with the help of the expert policy and is thus able to achieve higher performance with a limited simulation budget (see Appendix D.2). Note that with an infinite budget for reward simulation (which never happens in a real-world hardware setting), DRL could achieve greater performance and generalization capability given a sufficiently long learning loop.

Among the IL approaches trained with the same number of offline expert data ($N = 2{,}000$), CSE showed the highest performance.
We believe that such higher generalization capability comes from both symmetricity exploitation schemes and the newly devised neural architecture: (1) expert exploitation and self-exploitation with symmetric label transformation amplify the number of data to train with and induce solution symmetricity to improve generalization capability. (2) the neural architecture with PCN and RCN makes the policy easily adapt to new task conditions. Extrapolation over Expert Method. The CSE policy trained with offline expert data generated by the expert policy, GA{100}, outperformed GA{500}, with zero-shot inference. That is, the CSE policy trained with low-quality offline expert data produced higher-quality designs. We believe this was possible because we trained a factorized form of policy that does not predict labels in a single step but produced a solution through a serial iterative roll-out process, during which a good strategy for placing decaps can be identified. In addition, the CSE with symmetric label transformation has further guided the policy to learn such an effective decap placement design scheme. Order Bias Measurement. We empirically show that our CSE successfully induces AP-symmetricity by reducing the order bias (see Appendix H). 

5.4. SCALABILITY EVALUATION

For scalability verification, learning-based DPP methods were pre-trained for a fixed-scale PDN, (10 × 10), and a fixed number of decaps, K = 20. Then, the pre-trained models were asked to place a varying number of decaps, K ∈ {12, 16, 20, 24, 30}, on the (10 × 10) PDN and K ∈ {20, 40} on a larger (15 × 15) PDN without additional training (i.e., zero-shot). We chose two baseline methods for comparison: GA{100} and AM-IL. As shown in Table 2, our CSE outperformed GA{100} and AM-IL at all scales. Furthermore, CSE achieved greater performance with fewer decaps. Reducing the number of decaps has a significant industrial impact: because hardware devices are mass-produced, removing even a single decap saves enormous fabrication cost. CSE was also validated on several scales and on multiple hardware devices. CSE is a general-purpose offline learning scheme for placement tasks that can be further applied to other hardware placement tasks, including chip placement, ball grid array (BGA) placement, and via placement.

A DPP ELECTRICAL MODELING AND PROBLEM DEFINITION

This section provides the electrical modeling details of the PDN and decap models used for the verification of CSE in DPP. Note that these electrical models can be substituted by those of your interest. There are three methods to extract PDN and decap models that are also used for objective evaluation: a 3D EM simulation tool, the ADS circuit simulation tool, and the unit-cell segmentation method. For each method, there exists a trade-off between time complexity and accuracy. See Table 3. Out of the three methods, we used the unit-cell segmentation method for the benchmark. Simulation time was evaluated using the same PDN model on an Intel i7. Note that simulation time depends on the size and complexity of the PDN model.

The development of AI has led to an increased demand for high-performance computing systems. High-performance computing systems not only require precise design of hardware chips such as CPUs, GPUs, and DRAM, but also require stable delivery of power to the operating integrated circuits. Power delivery has become a major technical bottleneck of hardware devices due to the continuously decreasing supply voltage margin along with the technology shrink of CMOS transistors (Hwang et al., 2021). Fig. 9 (a) shows the power distribution network (PDN), consisting of all the power/ground planes from the voltage source to the operating chips. Power is generated in the VRM and delivered through the electrical interconnections of the PCB, package, and chip. Finding ways to deliver the desired voltage and current from the power source to the destinations along the PDN is critical because failure to achieve power integrity (PI) leads to various reliability problems such as incorrect switching of transistors, crosstalk from neighboring signals, and timing margin errors (Swaminathan & Engin, 2007). Decoupling capacitors (decaps) placed on the PDN allow a reliable power supply to the operating chips, thus improving the power integrity of the hardware. As shown in Fig. 9 (b)-(c), the role of a decap is analogous to that of water storage tanks placed along the city, apartment, and household, which can provide water uninterruptedly and reliably. Just as placing more water tanks makes the water supply more stable, placing more decaps makes the power supply more reliable. However, because adding more decaps requires more space and is costly, optimal placement of decaps is important in terms of PI and cost/space saving.

A.2 PDN AND DECAP MODELS FOR VERIFICATION

Unit-Cell Segmentation Method. The segmentation method (Kim et al., 2010) is a simple and fast way to generate approximate electrical models. Because the analysis of the full electrical model using EM simulation is very time-consuming, we divided the full PDN model into smaller unit-cells and constructed the full PDN model using the unit-cell segmentation method. For fast simulation, we used an equation-based segmentation method implemented in Python, illustrated in Fig. 10. The segmentation method was used to generate the PDN model, consisting of a chip layer and a package layer, for verification, as illustrated in Fig. 10 (a). The segmentation method was also used for the objective evaluation of DPP. When a solution for DPP is made, decaps are placed on the corresponding ports on the PDN using the segmentation method, as illustrated in Fig. 10 (b). Note that these electrical parameters and PDN structures were used as a benchmark. For practical use of CSE, these PDN and decap models can be substituted by those of your interest.
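As a rough illustration of how attaching decaps to PDN ports changes the impedance seen at the remaining ports, the sketch below terminates selected ports of a Z-parameter matrix with one-port loads at a single frequency. It uses the generic port-termination identity $Z' = Z_{pp} - Z_{pq}(Z_{qq} + Z_{d} I)^{-1} Z_{qp}$ from circuit theory and is an assumption about the kind of computation involved, not the benchmark's actual segmentation code. The function name `attach_decaps` and the example values are hypothetical.

```python
import numpy as np

def attach_decaps(z_pdn, decap_ports, z_decap):
    """Terminate the given ports with identical one-port decaps at one frequency.

    Returns the reduced Z matrix over the remaining open ports and their indices.
    """
    n_ports = z_pdn.shape[0]
    q = np.asarray(decap_ports)
    p = np.setdiff1d(np.arange(n_ports), q)    # ports that stay open
    z_pp = z_pdn[np.ix_(p, p)]
    z_pq = z_pdn[np.ix_(p, q)]
    z_qp = z_pdn[np.ix_(q, p)]
    z_qq = z_pdn[np.ix_(q, q)]
    load = z_decap * np.eye(len(q))            # decap impedance on each terminated port
    z_reduced = z_pp - z_pq @ np.linalg.solve(z_qq + load, z_qp)
    return z_reduced, p

# Hypothetical two-port example: port 1 is shorted (zero decap impedance).
z_example = np.array([[2.0, 1.0], [1.0, 2.0]])
z_after, open_ports = attach_decaps(z_example, [1], 0.0)
```

In this example the self-impedance at port 0 drops from 2.0 to 1.5 ohms, which is the kind of suppression the performance metric rewards. A frequency sweep would repeat the computation once per frequency point with complex-valued matrices.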

B EXPERT LABEL COLLECTION

We used a genetic algorithm (GA) as the expert policy to collect expert guiding labels for imitation learning. GA is the most widely used search heuristic method for DPP (Erdin & Achar, 2019; de Paulis et al., 2020; Xu et al., 2021; Juang et al., 2021). We devised our own GA for DPP, whose objective is to find the placement of a given number (K) of decaps on a PDN with a probing port and 0-15 keep-out regions that best suppresses the impedance of the probing port.

Notations. $M$ is the number of candidate solutions that undergo objective evaluation before the best solution is returned. $M$ equals the population size $P_0$ times the number of generations $G$. $K$ is the number of decaps to be placed. $P_{elite}$ is the size of the elite population.

Guiding Dataset. To generate expert labels, guiding problems were generated in the same way as the test dataset. We made sure that the guiding problems do not overlap with the test problems and that no two guiding problems overlap with each other. Each guiding problem goes through the process described in Fig. 15 to collect the corresponding expert label. Once the initial population is generated randomly, a new population is generated through elitism, crossover, and mutation. Generating a new population in this way constitutes one generation; the process is repeated $G - 1$ times.

Elitism. Once the initial population is formed, the entire population undergoes objective evaluation and is sorted by objective value. The size of the elite population is pre-defined as $P_{elite} = 4$ for GA {M = 100} (the expert policy). That is, the top 4 solutions in the population become the elite population and are kept for the next generation.

Crossover. Crossover is the process by which new population candidates are generated. Each solution of the current population, including the elites, is divided in half. Then, as described in Fig. 16 (c), the left halves and the right halves are randomly recombined $P_0$ times to generate a new population. If an elite population is available, $P_0 - P_{elite}$ random crossovers take place so that the total population size, including the elite population, becomes $P_0$.

Mutation. As shown in Fig. 16 (d), solutions may contain overlapping (repeated) numbers after the random crossover. We replace each overlapping number with a randomly generated number, and we call this mutation.

Select Best. When generation $G$ is reached, the final population is evaluated by the performance metric. The solution with the highest objective value then becomes the final guiding solution for the given DPP. The guiding problems and the corresponding solutions generated by the GA are saved and used as guiding expert labels for imitation learning.
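The GA procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_objective` is a hypothetical stand-in for the PI simulation, the function names are ours, and the mutation step draws the replacement from ports not yet used in the solution (the text only says "a randomly generated number").

```python
import random

def toy_objective(solution, probe):
    # Hypothetical stand-in for the PI simulation (assumption):
    # rewards decaps whose port index is close to the probing port.
    return -sum(abs(port - probe) for port in solution)

def repair(child, allowed, rng):
    # Mutation: replace each repeated port with a random unused allowed port.
    seen, fixed = set(), []
    for port in child:
        if port in seen:
            candidates = [p for p in allowed if p not in seen and p not in child]
            port = rng.choice(candidates)
        seen.add(port)
        fixed.append(port)
    return fixed

def genetic_algorithm(allowed, k, objective, p0=20, g=5, p_elite=4, seed=0):
    rng = random.Random(seed)
    # Random initial population of P0 placements with K distinct ports each.
    population = [rng.sample(allowed, k) for _ in range(p0)]
    evaluations = 0
    for _ in range(g - 1):
        # Elitism: evaluate, sort by objective value, keep the top P_elite.
        population.sort(key=objective, reverse=True)
        evaluations += len(population)
        elites = [list(s) for s in population[:p_elite]]
        # Crossover: recombine random left halves with random right halves.
        half = k // 2
        lefts = [s[:half] for s in population]
        rights = [s[half:] for s in population]
        children = [
            repair(rng.choice(lefts) + rng.choice(rights), allowed, rng)
            for _ in range(p0 - p_elite)
        ]
        population = elites + children
    # Select best: evaluate the final population once more.
    evaluations += len(population)
    best = max(population, key=objective)
    return best, evaluations

# Usage: a 10 x 10 chip with a hypothetical probing port at index 45.
allowed_ports = [p for p in range(100) if p != 45]
best, evals = genetic_algorithm(allowed_ports, 20, lambda s: toy_objective(s, 45))
```

With the default settings ($P_0 = 20$, $G = 5$), the sketch performs $P_0 \times G = 100$ objective evaluations, matching the budget of GA {M = 100}.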

C DETAILS OF NEURAL ARCHITECTURE DESIGN

Our neural architecture is based on the AM (Kool et al., 2019) with a modified context embedding. The AM is a transformer (Vaswani et al., 2017)-based encoder-decoder model designed to solve combinatorial optimization problems. We use the conventional notation of the transformer (Vaswani et al., 2017) and AM (Kool et al., 2019), including multi-head attention (MHA), feed forward (FF), and query, key and value (Q, K, V). Because their terminology is well organized, we kept their notation wherever possible. In this section, we focus on the main differences between AM and our architecture. See Kool et al. (2019) for the detailed mechanism of AM.

C.1 CHANGE OF NOTATIONS.

We made a few small revisions to the notation of Kool et al. (2019). In AM, TSP nodes are denoted $x_i$, $i \in \{1, ..., N\}$, where $N$ is the number of TSP nodes. This paper uses $I_{probe}$ for the node of the probing port, $I_{keepout}$ for the nodes of the keep-out regions, and $I_{allowed}$ for the nodes of the decap-allowed ports. Kool et al. (2019) denote the action as $\pi$ (representing a permutation action), whereas we denote the action as $a$. In Kool et al. (2019), the notation $h^{(N)}$ refers to the output of $N$ MHA layers in the encoder; we write it as $h$ for readability. There are two additional notations: $c_{probe}$ is the probing context embedding from the probing port context network (PCN in Section 3.2), and $c_{a_{t-1}}$ is the recurrent context embedding from the recurrent context network (RCN in Section 3.2) at step $t$.

C.2 HIGHLIGHT OF MODIFICATIONS: CONTEXT EMBEDDING.

The main difference between the AM and our model is the context embedding, illustrated in Fig. 17. The context embedding of AM (Kool et al., 2019) is given by:

$$h^{(c)} = \mathrm{MHA}([h^{(g)}, h_{a_{t-1}}, h_{a_1}], h)$$

Context embedding of AM. Since the AM was originally designed for TSP and its variants, the context embedding of AM captures the entire graph by taking the average of all node embeddings, $h^{(g)}$.

$$h^{(c)} = \mathrm{MHA}([h^{(g)}, h_{a_{t-1}}, h_p], h)$$

Context embedding of Ours. We observed that $h^{(g)}$ degrades the performance of the model on DPP. DPP differs from TSP, so a DPP-specific context embedding strategy is needed. We therefore focus on the probing port more than on the other ports by proposing the PCN. We removed $h^{(g)}$ and $h_{a_1}$ from the context embedding and replaced them with our newly designed context embedding, which is described as follows:

$$h^{(c)} = \mathrm{MHA}(c_{probe} + c_{a_{t-1}}, h)$$

$$c_{probe} = \mathrm{MLP}_{PCN}(h_{probe}) \tag{10}$$

$$c_{a_{t-1}} = \mathrm{MLP}_{RCN}(h_{a_{t-1}})$$

Note that both $\mathrm{MLP}_{PCN}$ and $\mathrm{MLP}_{RCN}$ are two-layer perceptron models with ReLU activation whose input and output dimensions are identical ($d = 128$ in all experiments).
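To make the modified context embedding concrete, the following NumPy sketch builds the query $c_{probe} + c_{a_{t-1}}$ from two two-layer perceptrons. It is an illustration only: the hidden width (set equal to $d$), the placement of the ReLU between the two layers, the random initialization, and all function names are assumptions that the text does not specify, and the subsequent MHA over the node embeddings $h$ is omitted.

```python
import numpy as np

D = 128  # embedding dimension stated in the text

def init_mlp(rng, d=D):
    # Two-layer perceptron with identical input and output dimensions.
    # The hidden width d and the initialization scale are assumptions.
    scale = d ** -0.5
    return {
        "W1": rng.normal(0.0, scale, (d, d)), "b1": np.zeros(d),
        "W2": rng.normal(0.0, scale, (d, d)), "b2": np.zeros(d),
    }

def mlp(params, h):
    # Linear -> ReLU -> Linear.
    hidden = np.maximum(h @ params["W1"] + params["b1"], 0.0)
    return hidden @ params["W2"] + params["b2"]

def context_query(pcn, rcn, h_probe, h_prev):
    # c_probe from the probing port context network (PCN) and
    # c_{a_{t-1}} from the recurrent context network (RCN), summed.
    c_probe = mlp(pcn, h_probe)
    c_prev = mlp(rcn, h_prev)
    return c_probe + c_prev

rng = np.random.default_rng(0)
pcn, rcn = init_mlp(rng), init_mlp(rng)
h_probe = rng.normal(size=D)  # encoder embedding of the probing port
h_prev = rng.normal(size=D)   # encoder embedding of the previous action
query = context_query(pcn, rcn, h_probe, h_prev)
```

The resulting vector plays the role of the MHA query in place of AM's concatenation $[h^{(g)}, h_{a_{t-1}}, h_{a_1}]$, so its dimension stays at $d$ rather than $3d$.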
C.3 CALCULATION OF PROBABILITY.

The probability calculations using the context hidden embedding $h^{(c)}$ and the PDN hidden embeddings $h_i$, $i \in \{1, ..., N_{row} \times N_{col}\}$, in (11-14) are identical to (5-8) in Kool et al. (2019), except for the masking mechanism in equation 13 and equation 14. Because Kool et al. (2019) solve TSP, they mask the previously selected actions by setting the compatibility $u_{(c)j}$ to $-\infty$. For DPP, we mask not only the previously selected actions $a_{1:t-1}$ but also the probing port index $I_{probe}$ and the keep-out region indices $I_{keepout}$; selecting any of these is forbidden. The query, key and value are computed by:

$$q_c = W^Q h^{(c)}, \quad k_i = W^K h_i, \quad v_i = W^V h_i$$

AM-RL. AM-RL is an AM-based DPP solver proposed by Park et al. (2022). We reproduced AM-RL by following the implementation of Kool et al. (2019) 2 and the paper of Park et al. (2022). We trained for 2,000 steps with batch size B = 100, for a total of 200,000 PI simulations. During the inference phase, each learned model produces a greedy solution from its policy (i.e., M = 1), following Kool et al. (2019).

D.3 IMPLEMENTATION OF META-HEURISTIC BASELINES.

Genetic Algorithm (GA). GA {M = 100} and GA {M = 500} are implemented as baselines. For the detailed procedures and operators used for GA, see Appendix B. GA {M = 100} is the expert policy used to generate expert data for imitation learning in CSE. For GA {M = 100}, the population size $P_0$ is 20, the number of generations $G$ is 5, and the elite population size $P_{elite}$ is 4. For GA {M = 500}, $P_0$ is 50, $G$ is 10, and $P_{elite}$ is 10.

Random Search (RS). The random search method generates M random samples for a given problem and selects the sample with the highest objective value.

Fig. 18 shows the performance of GA and RS as a function of the number of iterations (M). The performance was measured as the average over 100 test problems solved by each method at each M. GA outperformed RS at every M, and the performance of both methods increased with M. However, the rate of improvement diminished as M increased. In contrast, our CSE showed higher performance than GA {M = 100} and RS {M = 10,000} with a single inference (M = 1).
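As an illustration of the RS baseline and of how the GA budgets above relate to M, a minimal sketch is given below. The objective passed in is a hypothetical placeholder for the PI simulation, and the function names are ours rather than the authors'.

```python
import random

def random_search(allowed, k, objective, m, seed=0):
    # Draw M random placements of K distinct ports and keep the best one.
    rng = random.Random(seed)
    best, best_value = None, float("-inf")
    for _ in range(m):
        candidate = rng.sample(allowed, k)
        value = objective(candidate)
        if value > best_value:
            best, best_value = candidate, value
    return best, best_value

def ga_budget(p0, g):
    # Number of objective evaluations of the GA: M = P0 * G.
    return p0 * g

# Hypothetical toy objective (assumption): prefer ports near index 45.
toy = lambda s: -sum(abs(p - 45) for p in s)
ports = [p for p in range(100) if p != 45]
_, value_small = random_search(ports, 20, toy, m=20)
_, value_large = random_search(ports, 20, toy, m=200)
```

Because both calls share the same seed, the larger budget revisits the first 20 samples and can only match or improve on them, which mirrors the monotone but flattening trend reported for increasing M.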

E EXPERIMENTAL RESULTS IN TERMS OF POWER INTEGRITY

The objective of DPP is to suppress the impedance of the probing port as much as possible over a specified frequency range, as measured by the objective metric

$$\mathrm{Obj} := \sum_{f \in F} \left(Z_{initial}(f) - Z_{final}(f)\right) \cdot \frac{1\,\mathrm{GHz}}{f}.$$

The performance of CSE was evaluated in comparison to GA {M = 100} (expert policy), GA {M = 500}, RS {M = 10,000}, AM-RL and AM-IL on 100 unseen PDN cases. Each method was asked to place 20 decaps (K = 20) on each test case. The role of decaps in hardware design is to decouple the loop inductance of the PDN. In terms of PI, the analysis of loop inductance is critical but also complex (Farrahi & Koether, 2019).
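The objective metric above can be evaluated numerically as in the short sketch below. It assumes that the impedances are given as real-valued magnitudes sampled at the frequencies in $F$ (in Hz); the function and argument names are illustrative.

```python
import numpy as np

def dpp_objective(freqs_hz, z_initial, z_final):
    # Sum over frequencies of the impedance reduction, weighted by 1 GHz / f,
    # so that suppression at lower frequencies counts more.
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    reduction = np.asarray(z_initial, dtype=float) - np.asarray(z_final, dtype=float)
    weights = 1e9 / freqs_hz
    return float(np.sum(reduction * weights))

# Two frequency points: a 1-ohm reduction at 1 GHz and at 2 GHz.
score = dpp_objective([1e9, 2e9], [2.0, 2.0], [1.0, 1.0])
```

In this toy case the reduction at 1 GHz contributes 1.0 and the same reduction at 2 GHz contributes 0.5, so the score is 1.5.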

E.1 IMPEDANCE SUPPRESSION PLOTS

The loop inductance distribution of PDN highly depends on various design parameters such as the location of probing port, spacing between power/ground, size of PDN, and hierarchical layout of PDN (Fan et al., 2000) . When human experts place decaps on PDN, there are too many domain rules to consider. On the other hand, CSE understands the PDN structure and its electrical properties by data-driven learning. According to Fig. 21 , CSE tends to place decaps near the probing port, which is a well-known expert rule in the PI domain. 

F FURTHER ABLATION STUDY

This section reports ablation studies on action permutation invariance and the hyperparameters N (number of guiding samples), λ (weight of the self-exploitation loss term), and P (number of permutation-transformed labels).

F.1 ABLATION STUDY ON N

N is the number of expert labels generated by the expert policy, GA {M = 100}. We ablate N ∈ {100, 500, 1000, 2000} with fixed P = 3 and λ = 8 and compare to the AM-IL baseline for all N. As shown in Table 7, CSE with N = 2000 gives the best performance, and CSE outperforms AM-IL for every N. The performance of AM-IL saturates at N > 500, while the performance of CSE keeps increasing with N.

Table 7: Ablation study on N for CSE (P = 3, λ = 8) and AM-IL.

F.2 ABLATION STUDY ON λ

λ is the weight of the self-exploitation loss term $L_{Self}$ in the collaborative learning loss $L := L_{Expert} + \lambda L_{Self}$. To bring $\lambda \times L_{Self}$ into the range 0.1 ∼ 1, we first scaled λ by $10^{32}$, because the probability of a specific solution is extremely small. Then, we ablated λ ∈ {1, 2, 4, 6, 7, 8, 9, 10} (the factor $10^{32}$ is omitted) with fixed N = 100 and P = 3. For every λ, the self-exploitation term prevents overfitting of the model compared with the baseline trained only with $L_{Expert}$ (see Fig. 23). According to Table 9, λ = 8 gives the best validation score.

F.3 ABLATION STUDY ON P

P is the number of permutation-transformed labels per expert label used for imitation-learning-based expert exploitation. We ablate P ∈ {3, 5, 7} with fixed N = 100 and λ = 8 and compare collaborative symmetricity exploitation (i.e., both expert and self-exploitation) to training with expert exploitation only. As shown in Table 9, P = 3 with {Expert exploitation + Self-exploitation} gives the best performance. For every P, {Expert exploitation + Self-exploitation} matches or improves on expert exploitation alone, indicating that the self-exploitation scheme prevents overfitting when training on a sparse dataset.

Table 9: Ablation study on P with and without unsupervised loss term.

| Training scheme | Validation Score |
|---|---|
| Expert exploitation {P = 3} | 12.97 |
| + Self-exploitation {λ = 8} | 12.98 |
| Expert exploitation {P = 5} | 12.95 |
| + Self-exploitation {λ = 8} | 12.95 |
| Expert exploitation {P = 7} | 12.93 |
| + Self-exploitation {λ = 8} | 12.95 |

Assume that there exist $a^* \in \mathcal{A}$, $x^* \in \mathcal{X}$, and $t^* \in T_{AP}$ such that $\pi(a^*|x^*) \neq \pi(t(a^*)|x^*)$. Then

H ORDER BIAS MEASUREMENT

This section reports the order bias measurements of AM-IL and our CSE. We measured $b(\pi; p = \{U_X, \pi, U_{T_{AP}}\})$ with a sample width of 100 and took the average value. As shown in Appendix H, our CSE significantly reduced the order bias, verifying that CSE successfully induced the AP-symmetricity.
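The measurement can be illustrated with a small Monte Carlo estimator of $b(\pi; p)$. The sketch below makes several simplifying assumptions that are not from the paper: a single fixed context $x$, a uniform sampler for the solution $a$ (the paper samples $a$ from $\pi$), and a toy autoregressive policy defined by per-step logits. All names are hypothetical.

```python
import math
import random

def sequence_probability(seq, logits_per_step):
    # Probability that a toy autoregressive policy selects `seq`,
    # choosing one action per step without replacement.
    chosen, prob = set(), 1.0
    for step, action in enumerate(seq):
        logits = logits_per_step[step]
        norm = sum(math.exp(l) for j, l in enumerate(logits) if j not in chosen)
        prob *= math.exp(logits[action]) / norm
        chosen.add(action)
    return prob

def order_bias(policy_prob, sample_solution, num_samples=100, seed=0):
    # Monte Carlo average of |pi(a|x) - pi(t(a)|x)| over sampled solutions a
    # and uniformly random action permutations t.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        a = sample_solution(rng)
        t_a = a[:]
        rng.shuffle(t_a)
        total += abs(policy_prob(a) - policy_prob(t_a))
    return total / num_samples

sampler = lambda rng: rng.sample(range(5), 3)  # 3 of 5 ports, uniform (assumption)
symmetric = [[0.0] * 5 for _ in range(3)]       # same probability for any order
biased = [[0, 1, 2, 3, 4], [4, 3, 2, 1, 0], [0, 0, 0, 0, 0]]  # step-dependent logits

bias_symmetric = order_bias(lambda a: sequence_probability(a, symmetric), sampler)
bias_biased = order_bias(lambda a: sequence_probability(a, biased), sampler)
```

The uniform-logit policy assigns the same probability to every ordering of a placement, so its estimated order bias is zero, whereas the step-dependent policy yields a strictly positive estimate.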



1 https://github.com/pemami4911/neural-combinatorial-rl-pytorch
2 https://github.com/wouterkool/attention-learn-to-route



Figure 2: Overall pipeline of DPP contextualized policy parameterization with offline learning.

2.2 OBJECTIVE FUNCTION OF DPP (a) Unit-cell representation of target PDN of a hardware device (b) Z-parameter of target PDN.

Figure 3: Unit-cell and Z-parameter representations of real-world target PDN.

Figure 4: Illustration of collaborative symmetricity exploitation (CSE) process.

Machine Learning-based Methods for DPP. Deep reinforcement learning (DRL) has been widely used to solve DPP. Park et al. (2018) employed Q-learning and Park et al. (2020); Zhang et al. (

Figure 5: Ablation study on CSE components

Ablation Study. We conducted ablation studies to validate the effectiveness of the CSE components and the context neural networks with sparse offline data (N = 100). We ablated expert exploitation (EE) and self-exploitation (SE) in two policy networks: the original AM and the contextual AM (ours). The original AM refers to the AM-IL baseline. Each component of CSE improved generalization capability in both policy networks, and the contextual AM with the newly devised context neural networks outperformed the original AM. Therefore, we verified that both the CSE components and the modifications of AM contributed to the promising performance.

Figure 6: Offline data-efficiency evaluation (P = 3, λ = 8)

We investigated how the number N of offline data generated by the expert method, GA {M = 100}, affects the design performance of CSE and the two baselines, AM-IL and Arb-IL. As shown in Fig. 6, CSE outperformed the baselines for every N; CSE trained with N = 100 even performed better than the baselines trained with N = 2000. Moreover, CSE improved monotonically with N, while the others saturated when N > 500.

Figure 7: Structure of HBM PDN model

(a) An example of hierarchical power distribution network (PDN). (b) Electrical circuit model of the hierarchical PDN in (a). (c) Water supply chain from the source to household.

Figure 9: Illustration of Hierarchical Power Distribution Network (PDN) analogous to Water Supply Chain.

Figure 10: Segmentation Method Implemented for PDN Generation and Decap Placement on PDN.

The PDN model we used for verification has a two-layer structure: a package layer at the bottom and a chip layer on top of it, as illustrated in Fig. 11. The PDN was modeled with the unit-cell segmentation method. The package layer was composed of 40 × 40 package unit-cells and the chip layer of 10 × 10 (i.e., $N_{row} \times N_{col}$) chip unit-cells. Because the DPP benchmark places MOS-type decaps, which are placed on the chip, ports are only available on the chip. Thus, we extracted the 10 × 10 port information from the chip layer. See Fig. 14 (a), which illustrates the chip PDN divided into 10 × 10 unit-cells, each of which is numbered.

(a) Top-View of PDN model. (b) Side-View of PDN model.

Figure 11: Top-view and Side-view of PDN Model used for Verification

Figure 12: Electrical Modeling of Chip and Package Unit-Cells for PDN Model generation.

Figure 13: Decap Unit-Cell with the Electrical Parameters used for Verification.

Figure 14: Illustration of how the DPP problem with specific condition is given as an input and decap placement solution is generated as an output.

Figure 15: Process Flow of Genetic Algorithm for DPP.

Figure 16: Illustration of each GA Operators used for DPP Guiding Data Generation.

Figure 17: Overview of main difference between AM and modified version of AM.

| Method | Training sample complexity |
|---|---|
| {N = 2000} | 200,000 (N = 2000, M = 100 from GA expert) |
| AM-IL {N = 2000} | 200,000 (N = 2000, M = 100 from GA expert) |
| CSE {N = 100} (ours) | 10,000 (N = 100, M = 100 from GA expert) |
| CSE {N = 1000} (ours) | 100,000 (N = 1000, M = 100 from GA expert) |
| CSE {N = 2000} (ours) | 200,000 (N = 2000, M = 100 from GA expert) |

Figure 18: Performance of GA and RS with varying number of iterations (M ) in comparison to CSE at M = 1.

Figure 19: Impedance suppressed by each method, GA {M = 100} (expert policy), GA {M = 500}, RS {M = 10, 000}, AM-RL , AM-IL and CSE (Ours) for 6 example PDN cases out of 100 test dataset. (The lower the better.)

Figure 20: Corresponding decap placement solutions to Fig. 19 by each method. Red represents probing port, black represents keep-out ports and blue represents decap locations.

Fig. 21 shows the decap placement solutions of 6 PDN cases plotted in Fig. 19. The solutions by the search-heuristic methods, GA and RS, tend to be scattered while the solutions by learning-based methods, AM-RL, AM-IL and CSE, are clustered. Since search-heuristic methods are based on

POWER NOISE ANALYSIS ON HBM PDN (a) Impedance suppression. (b) Initial power noise before decap placement. (c) Power noise after decap placement by our CSE{M = 1, K = 26} and GA{M = 100, K = 40}.

Figure 21: Power noise analysis in terms of simultaneous switching noise (SSN) on HBM PDN before and after decap placed by our CSE{M = 1, K = 26} and GA{M = 100, K = 40}.

Figure 22: Validation graph CSE in comparison to AM-IL for varying number of offline expert data N ∈ {100, 500, 1000, 2000}.

Figure 23: Validation graph of λ ∈ {1, 2, 4, 6, 7, 8, 9, 10} on fixed P = 3 and N = 100.

Figure 24: Validation score of P ablation with and without self-exploitation loss term.

Performance evaluation with the average score of 100 PDN cases (the higher the better).

Scalability evaluations on larger PDN scale and varying number of decap K.

Time Taken for an Objective Evaluation of a PDN model described in Appendix A.2

AM-IL is an imitation learning version of AM-RL trained on our training data. For the experiments in Table 1, we set N = 2000 and B = 1000 for training. For the ablation studies, we mainly ablate N; when N = 100, we set B = 100. The training sample complexity (the number of PI simulations during training) of each ML baseline and CSE is as follows.

Training sample complexity of ML baselines and CSE.

Ablation study of λ on fixed P = 3 and N = 100.

$$b(\pi; p) = \mathbb{E}_{x \sim p_X(x)} \mathbb{E}_{a \sim p_A(a)} \mathbb{E}_{t \sim p_{T_{AP}}(t)} \left[ \|\pi(a|x) - \pi(t(a)|x)\|_1 \right] \geq p_X(x^*)\, p_A(a^*)\, p_{T_{AP}}(t^*)\, \|\pi(a^*|x^*) - \pi(t(a^*)|x^*)\|_1 > 0,$$

which results in a contradiction. Therefore, $\pi(a|x) = \pi(t(a)|x)$ for any $a \in \mathcal{A}$, $x \in \mathcal{X}$, $t \in T_{AP}$; i.e., the policy $\pi(a|x)$ is AP-symmetric.

Evaluation of Order Bias


Note that $W^Q$, $W^K$ and $W^V$ are 128-to-128 linear projections. After that, the compatibility $u_{(c)j}$ is computed as the dot product of the query and the key, with the masking mechanism (setting it to $-\infty$ so that the actions in $s_{t-1}$ are not selected). The tanh clipping follows Bello et al. (2016) and Kool et al. (2019). Finally, the probability is computed using the softmax function.
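The masked probability computation described here can be sketched in NumPy as follows. This is a simplified single-head illustration in the style of Kool et al. (2019), not the authors' code: it skips the multi-head glimpse that uses the value projection, the clipping constant `clip=10.0` is an assumption, and the function name and the random weights are placeholders.

```python
import numpy as np

def placement_probabilities(h_c, h, W_q, W_k, forbidden, clip=10.0):
    # h_c: context embedding (d,), h: port embeddings (n, d),
    # forbidden: boolean mask (n,) covering the probing port, keep-out
    # regions, and previously selected ports.
    q = W_q @ h_c                          # query, shape (d,)
    k = h @ W_k.T                          # keys, shape (n, d)
    u = k @ q / np.sqrt(q.shape[0])        # scaled dot-product compatibility
    u = np.where(forbidden, -np.inf, clip * np.tanh(u))  # clip, then mask
    u = u - u.max()                        # stabilize the softmax
    p = np.exp(u)                          # exp(-inf) = 0 for masked ports
    return p / p.sum()

rng = np.random.default_rng(0)
d, n = 128, 100                            # d = 128, 10 x 10 chip ports
h = rng.normal(size=(n, d))
h_c = rng.normal(size=d)
W_q = rng.normal(size=(d, d)) / np.sqrt(d)
W_k = rng.normal(size=(d, d)) / np.sqrt(d)

forbidden = np.zeros(n, dtype=bool)
forbidden[[45, 3, 7, 10]] = True           # hypothetical probe, keep-outs, past action
probs = placement_probabilities(h_c, h, W_q, W_k, forbidden)
```

Masked ports receive exactly zero probability, so neither sampling nor greedy decoding can select the probing port, a keep-out region, or an already occupied port.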

D DETAILED EXPERIMENTAL SETTINGS

This section provides detailed experimental settings for main experiments and ablation studies.

D.1 TRAINING HYPERPARAMETERS.

There are several training hyperparameters; we tried to fix them as Kool et al. (2019) did, to show the practicality of the framework. We then provide ablation studies on each hyperparameter to analyze how each component contributes to the performance improvement. The training hyperparameters are identical to those used in AM for TSP (Kool et al., 2019), except for the learning rate, the unsupervised regularization rate λ, the number of expert data N, the number of action-permutation-transformed data per expert datum P, and the batch size B.

There are two main ML baselines, Arb-RL (Kim et al., 2021) and AM-RL (Park et al., 2022).

Arb-RL. Arb-RL is a PointerNet-based DPP solver proposed by Kim et al. (2021). However, reproducible source code was not available. Therefore, we implemented Arb-RL following the implementation of Bello et al. (2016) 1 and the paper of Kim et al. (2021). We trained for 1,600 steps with batch size B = 100, for a total of 160,000 PI simulations.

