DISTRIBUTED INFERENCE AND FINE-TUNING OF LARGE LANGUAGE MODELS OVER THE INTERNET

Anonymous

Abstract

Large language models (LLMs) are useful in many NLP tasks and become more capable with size, scaling to over 100 billion parameters. With the release of BLOOM-176B and OPT-175B, everyone can download pretrained models of this scale. Still, using a pre-trained 100B+ model requires high-end hardware, making it inaccessible to most researchers. Recent studies in memory-efficient training (e.g., offloading) could alleviate these costs, but they do not cover important use cases of LLMs, such as autoregressive inference. In this work, we investigate methods for cost-efficient inference of large language models, comparing local and distributed strategies. We observe that a large enough model (100B+) can run efficiently on geodistributed devices in a consumer-grade network, for example, by connecting the existing compute resources of multiple research groups or pooling under-utilized compute from multiple cloud regions. To run LLMs in this unconventional setting, we develop a fault-tolerant algorithm for inferencing language models. We propose PETALS, a decentralized system for running LLMs, and show that it can run BLOOM-176B over the Internet more than 10× faster than offloading for sequential generation. We evaluate the performance of our system in both simulated conditions and an actual distributed system spanning two continents. The design of PETALS allows participants to run inference, fine-tune the model, or use fine-tuned model versions simultaneously without affecting each other's results.

1. INTRODUCTION

In recent years, the NLP community has found that pretrained language models greatly accelerate progress on many research problems through either fine-tuning (Radford et al., 2018) or simple prompting (Brown et al., 2020). Furthermore, their quality tends to improve as we increase model scale (Radford et al., 2019; Kaplan et al., 2020). Following this trend, modern language models often have hundreds of billions of parameters (Brown et al., 2020; Rae et al., 2021; Zeng et al., 2021; Kim et al., 2021). Most recently, several research groups open-sourced their pretrained LLMs with over 100B parameters for everyone to use (Zhang et al., 2022; Khrushchev et al., 2022; Zeng et al., 2022).

Even though these models are publicly available, they remain difficult to use due to their sheer size. For example, OPT-175B and BLOOM-176B need over 350 GB of accelerator memory for inference and significantly more for fine-tuning. As a result, even basic inference for large language models requires multiple high-end GPUs or multi-node clusters. Several recent studies propose algorithms for running large models on more affordable hardware (Pudipeddi et al., 2020; Ren et al., 2021), e.g., by offloading parameters to RAM. However, these techniques are inefficient in many practical LLM usage scenarios, as we show in Section 3.1.

In this work, we search for a cost-effective way of running large pre-trained language models in their main use cases: inference, in-context learning, and fine-tuning. We systematically analyze latency and throughput for training and inference in these use cases and determine which factors become dominant for very large models. Notably, for models with over 100B parameters, communicating activations over a slow network can be faster than swapping layers from local RAM or SSD. Based on these observations, it should be possible to run LLMs cost-effectively by pooling together commodity hardware over the Internet.
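To see why network transfer can win, consider a rough back-of-envelope estimate. The numbers below (8-bit weights, PCIe bandwidth, hop count, RTT, link speed) are illustrative assumptions for this sketch, not measurements:

```python
# Back-of-envelope comparison: time per generated token for offloading
# (re-reading all weights from RAM over PCIe on every step) vs. distributed
# inference (sending one token's hidden state over the Internet per hop).

def offloading_step_time(param_bytes: float, pcie_bytes_per_s: float) -> float:
    """Seconds per token if all weights must be re-read over PCIe."""
    return param_bytes / pcie_bytes_per_s

def network_step_time(hidden_size: int, num_hops: int, rtt_s: float,
                      link_bits_per_s: float, bytes_per_value: int = 2) -> float:
    """Seconds per token to forward one hidden state across all pipeline hops."""
    payload_bytes = hidden_size * bytes_per_value   # one fp16 hidden state
    return num_hops * (rtt_s + payload_bytes * 8 / link_bits_per_s)

# Assumptions: ~176 GB of 8-bit weights; PCIe 3.0 x16 at ~16 GB/s.
t_offload = offloading_step_time(176e9, 16e9)
# Assumptions: 14336-dim hidden state, 8 hops, 100 ms RTT, 100 Mbit/s links.
t_network = network_step_time(14336, 8, 0.1, 100e6)
print(f"offloading: {t_offload:.1f} s/token, network: {t_network:.2f} s/token")
```

Under these assumptions, offloading spends about 11 s per token just on weight transfers, while the network path is dominated by latency rather than bandwidth and stays under a second per token.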
However, existing algorithms are not designed to run inference with unreliable devices or over high-latency networks. To bridge this gap, we formulate a novel algorithm for fault-tolerant distributed inference of very large models. This algorithm addresses inference-specific problems: maintaining distributed attention caches between inference steps and recovering when some of the remote devices fail or leave abruptly. It enables several cost-effective ways of using LLMs, such as combining under-utilized GPUs in multiple cloud regions, or forming a collaboration of multiple research groups that connect their existing infrastructure to run large models together.

The contributions of this work can be summarized as follows:

• We analyze the performance of training and inference for very large language models and identify bottlenecks that are not addressed by existing algorithms. Notably, both local offloading and existing distributed algorithms struggle with sequential generation, albeit in different ways.

• We propose a novel distributed algorithm that can inference large (100B+) language models on distributed unreliable devices. To the best of our knowledge, this is the first algorithm that can inference LLMs with 100B+ parameters over the Internet.

• Using this algorithm, we design PETALS, a decentralized system for inferencing and fine-tuning LLMs over the Internet. When inferencing BLOOM-176B, PETALS outperforms offloading by roughly 10× for autoregressive generation. The implementation of our algorithms and the PETALS system is available online.¹
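The core fault-tolerance idea can be sketched in a few lines of Python. This toy model (our illustration, not the actual PETALS code) replaces a transformer block with a dummy computation and uses a single server instead of a chain, but keeps the key invariant: the client logs every input it sends, so when a server fails, a replacement can rebuild the lost attention cache by replaying that log.

```python
# Toy sketch of fault-tolerant autoregressive inference with cache replay.
import random

class UnreliableServer:
    """Stand-in for a remote server hosting a block of model layers."""
    def __init__(self, fail_prob=0.2):
        self.cache = []                      # "attention cache": past inputs
        self.fail_prob = fail_prob

    def forward(self, x):
        if random.random() < self.fail_prob:
            raise ConnectionError("server left the swarm")
        self.cache.append(x)                 # extend the cache with this step
        return x + len(self.cache)           # dummy stand-in for a layer

def rebuild(make_server, history):
    """Keep trying until a fresh server has replayed the full history."""
    while True:
        server = make_server()
        try:
            for past in history:             # replay to restore the cache
                server.forward(past)
            return server
        except ConnectionError:
            continue                         # that server failed too; retry

def generate(num_tokens, make_server):
    server = make_server()
    history, outputs = [], []                # client-side log of sent inputs
    x = 0.0
    while len(outputs) < num_tokens:
        try:
            y = server.forward(x)
        except ConnectionError:
            server = rebuild(make_server, history)
            continue                         # retry the same inference step
        history.append(x)
        outputs.append(y)
        x = y                                # feed output back as next input
    return outputs

random.seed(0)
# Replay restores the exact server state, so generation produces identical
# outputs with and without failures: recovery is transparent to the client.
assert generate(5, lambda: UnreliableServer(0.3)) == \
       generate(5, lambda: UnreliableServer(0.0))
```

In the full setting a chain of servers each holds part of the cache and only the failed segment is replayed; the single-server sketch above keeps the mechanism visible without that bookkeeping.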

2. BACKGROUND: EFFICIENT TRAINING AND INFERENCE

There is a wide variety of methods that can optimize training and inference for most deep learning workloads. In this section, we focus on two lines of research that are relevant for our analysis: model parallelism and parameter offloading.

2.1. MODEL PARALLELISM

Model parallelism is a family of distributed training algorithms in which each device holds a subset of model parameters, runs a subset of computations, and communicates output activations. There are two main types of model parallelism: "traditional" model parallelism and pipeline parallelism.

Traditional model parallelism (or tensor parallelism) assigns each device a subset of each model layer (e.g., a subset of neurons); the devices then communicate results to each other and proceed to the next layer (Krizhevsky et al., 2012; Ben-Nun & Hoefler, 2019; Tang et al., 2020). Each device performs a symmetric computation, applied to a different slice of the model weights, which makes traditional model parallelism compatible with MPI-based communication. In turn, the main performance overhead of this strategy comes from all-to-all communication (and synchronization) after each layer (Krizhevsky, 2014).

Pipeline parallelism reduces the communication overhead by assigning each device one or several full layers (Huang et al., 2019; Narayanan et al., 2019; Yang et al., 2019). During the forward pass, each stage applies its subset of layers to the inputs supplied by the previous stage, then sends the outputs of the last layer to the next stage. For the backward pass, this process is reversed, with each pipeline stage passing the gradients to the same device that previously supplied it with input activations.
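To make the contrast concrete, here is a toy NumPy sketch (a single-process illustration, not a distributed implementation) of the two strategies on a two-layer MLP: tensor parallelism splits the columns of the first weight matrix across two simulated "devices" and must gather their partial outputs before the next layer can run, while pipeline parallelism assigns whole layers to stages and passes only one activation tensor across the stage boundary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                    # batch of 4 inputs
W1, W2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))

# Tensor parallelism: each "device" holds half of W1's columns (a slice of
# neurons); the halves must be gathered (all-to-all style exchange) before
# the next layer can proceed.
h_dev0 = x @ W1[:, :4]                         # "device 0"
h_dev1 = x @ W1[:, 4:]                         # "device 1"
h = np.concatenate([h_dev0, h_dev1], axis=1)   # communication after the layer
y_tensor = np.maximum(h, 0) @ W2               # ReLU, then second layer

# Pipeline parallelism: "device 0" holds all of layer 1, "device 1" holds
# layer 2; only a single activation tensor crosses the device boundary.
stage0_out = np.maximum(x @ W1, 0)             # "device 0": full layer 1
y_pipeline = stage0_out @ W2                   # "device 1": full layer 2

assert np.allclose(y_tensor, y_pipeline)       # same math, different layout
```

Both layouts compute the same function; they differ only in what is communicated, which is the trade-off the surrounding text describes: per-layer exchanges for tensor parallelism versus one activation hand-off per stage for pipelines.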



¹ See github.com/iclr2023-anonymous/petals



Figure 1: An overview of the PETALS system. Servers store pre-trained LLM layers and temporarily hold attention caches for inference. Clients hold embedding layers, as well as learned prompts/adapters when inferencing fine-tuned models. Arrows denote temporary chains formed for inference.

