# Understanding PCIe performance for end host networking

*Rolf Neugebauer*, Gianni Antichi, José Fernando Zazo, Yury Audzevich, Sergio López-Buedo, Andrew W. Moore



# The idea of end hosts participating in the implementation of network functionality has been extensively explored in enterprise and datacenter networks

Papers shown on the slide: "Enabling End-host Network Functions"; "Network Exception Handlers: Host-network Control in Enterprise Networks"; "Extending Networking into the Virtualization Layer"; "Fabric: A Retrospective on Evolving SDN"; "SideCar: Building Programmable Datacenter Networks without Programmable Switches".

# More recently, programmable NICs and FPGAs enable offload and NIC customisation

- Isolation
- QoS
- Load balancing
- Application specific processing

Papers shown on the slide: "High Performance Packet Processing with FlexNIC"; "Enabling End-host Network Functions"; "SENIC: Scalable NIC for End-Host Rate Limiting"; "KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC"; "HyperLoop: Group-Based NIC-Offloading to Accelerate Replicated Transactions in Multi-Tenant Storage Systems".

## Not "just" in academia, but in production!

Paper shown on the slide: "Azure Accelerated Networking: SmartNICs in the Public Cloud" (Daniel Firestone et al., Microsoft).

## Implementing offloads is not easy

Many potential bottlenecks


# PCI Express (PCIe) and its implementation by the host is one of them!

# PCIe overview



- De facto standard for connecting high-performance IO devices to the rest of the system, e.g. NICs, NVMe, graphics, TPUs
- PCIe devices transfer data to/from host memory via DMA (direct memory access)
- DMA engines on each device translate requests like "Write these 1500 bytes to host address 0x1234" into multiple PCIe Memory Write (MWr) "packets"
- PCIe is almost like a network protocol, with packets (TLPs), headers, an MTU (MPS), flow control, addressing and switching (and NAT ;)
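
As a rough illustration of the DMA engine's job described above, the sketch below splits one DMA write into per-TLP chunks. The function name `split_dma_write` and the default MPS of 256 bytes are assumptions made for this example; real devices negotiate the MPS with the host and may choose chunk boundaries differently. The sketch also honours the PCIe rule that a memory request must not cross a 4 KiB boundary.

```python
def split_dma_write(addr, length, mps=256):
    """Split one DMA write into (address, size) pairs, one per MWr TLP.

    mps is the Maximum Payload Size in bytes (256 is an assumed value).
    """
    tlps = []
    while length > 0:
        # A single memory request must not cross a 4 KiB boundary.
        to_boundary = 4096 - (addr % 4096)
        size = min(mps, length, to_boundary)
        tlps.append((addr, size))
        addr += size
        length -= size
    return tlps


# The example from the slide: 1500 bytes written to host address 0x1234.
tlps = split_dma_write(0x1234, 1500)
for tlp_addr, tlp_size in tlps:
    print(f"MWr addr=0x{tlp_addr:x} size={tlp_size}")
```

With these assumed parameters the 1500-byte write becomes six MWr TLPs: five carrying 256 bytes and one carrying the remaining 220 bytes.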

### PCIe protocol overheads



Model: PCIe gen 3 x8 64 bit addressing
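
A minimal sketch of how per-TLP overhead eats into usable bandwidth for DMA writes on the modelled link (PCIe Gen 3 x8, 64-bit addressing). The 24 bytes of overhead per MWr TLP (framing, data link layer fields and a 4DW header) and the MPS of 256 bytes are assumptions of this sketch. It ignores flow-control and acknowledgement traffic as well as descriptor DMAs, so it gives an optimistic upper bound rather than the full model.

```python
import math

# Gen 3: 8 GT/s per lane, 8 lanes, 128b/130b encoding.
LINK_GBPS = 8.0 * 8 * 128 / 130

# Assumed per-TLP overhead in bytes for a MWr with 64-bit addressing.
TLP_OVERHEAD = 24


def mwr_wire_bytes(size, mps=256):
    """Bytes on the link for a DMA write of `size` payload bytes."""
    return math.ceil(size / mps) * TLP_OVERHEAD + size


def write_efficiency(size, mps=256):
    """Fraction of link bytes that carry payload."""
    return size / mwr_wire_bytes(size, mps)


def effective_write_gbps(size, mps=256):
    """Upper bound on payload bandwidth for back-to-back DMA writes."""
    return LINK_GBPS * write_efficiency(size, mps)


for size in (64, 128, 256, 512, 1500):
    print(f"{size:5d}B  {effective_write_gbps(size):5.1f} Gb/s")
```

The efficiency is lowest for small transfers, where a fixed header is paid for very little payload, which is the qualitative behaviour the model on the slide captures.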


### PCIe latency



Exablaze ExaNIC x40, Intel Xeon E5-2637v3 @3.5GHz (Haswell)

### PCIe latency imposes constraints



Exablaze ExaNIC x40, Intel Xeon E5-2637v3 @3.5GHz (Haswell)


## It gets worse...

### Distribution of 64B DMA Read latency



#### Xeon E5

- 547ns median
- 573ns 99th percentile
- 1136ns max

#### Xeon E3

- 1213ns(!) median
- 5707ns(!) 99th percentile
- 5.8ms(!!!) max

Netronome NFP-6000, Intel Xeon E5-2637v3 @ 3.5GHz (Haswell)

Netronome NFP-6000, Intel Xeon E3-1226v3 @ 3.3GHz (Haswell)
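
To get a feel for what these latencies mean for a NIC, the sketch below estimates how many DMAs must be in flight to keep up with line rate. The 40Gb/s line rate, the 128-byte packet size, the 20 bytes of per-packet Ethernet overhead (preamble and inter-frame gap) and the one-DMA-per-packet simplification are all assumptions of this example. The latencies plugged in are the 64B DMA read figures listed above, so the results are indicative only.

```python
import math

ETH_OVERHEAD = 20  # bytes: preamble + start delimiter (8) and inter-frame gap (12)


def inter_packet_ns(packet_bytes, line_rate_gbps):
    """Time between back-to-back packets at line rate, in nanoseconds."""
    return (packet_bytes + ETH_OVERHEAD) * 8 / line_rate_gbps


def dmas_in_flight(latency_ns, packet_bytes=128, line_rate_gbps=40.0):
    """DMAs that must be outstanding to hide `latency_ns`, one DMA per packet."""
    return math.ceil(latency_ns / inter_packet_ns(packet_bytes, line_rate_gbps))


for label, latency in (("E5 median", 547), ("E3 median", 1213), ("E3 99th", 5707)):
    print(f"{label}: {dmas_in_flight(latency)} DMAs in flight")
```

Under these assumptions a new packet arrives roughly every 30ns, so the Xeon E5 median calls for about 19 outstanding DMAs and the Xeon E3 median for about 41, before counting descriptor transfers.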


### PCIe host implementation is evolving

- Tighter integration of PCIe and CPU caches (e.g. Intel's DDIO)
- PCIe device is local to some memory (NUMA)
- IOMMU interposed between PCIe device and host memory



PCIe transactions depend on the temporal state of the host and on the location of the data in host memory


# PCIe data-path with IOMMU (simplified)

- IOMMUs translate addresses in PCIe transactions to host addresses
- Use a Translation Lookaside Buffer (TLB) as cache
- On a TLB miss, perform a costly page table walk and replace a TLB entry
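
The translation path above can be sketched as a small cache in front of a page table. This is a toy model, not a description of any real IOMMU: the class name `IoTlb`, the flat dictionary standing in for the page table, and the LRU replacement policy are all assumptions made for illustration.

```python
from collections import OrderedDict

PAGE_SIZE = 4096


class IoTlb:
    """Toy IOMMU TLB with LRU replacement (the policy is an assumption)."""

    def __init__(self, entries, page_table):
        self.entries = entries
        self.page_table = page_table  # device page number -> host page number
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def translate(self, dev_addr):
        page, offset = divmod(dev_addr, PAGE_SIZE)
        if page in self.cache:
            self.hits += 1
            self.cache.move_to_end(page)  # mark as most recently used
        else:
            self.misses += 1
            # The dictionary lookup stands in for the costly page table walk.
            self.cache[page] = self.page_table[page]
            if len(self.cache) > self.entries:
                self.cache.popitem(last=False)  # evict least recently used
        return self.cache[page] * PAGE_SIZE + offset


tlb = IoTlb(entries=2, page_table={0: 10, 1: 11, 2: 12})
first = tlb.translate(0x10)   # miss: page 0 is walked and cached
tlb.translate(0x20)           # hit: same page
tlb.translate(PAGE_SIZE)      # miss: page 1
tlb.translate(2 * PAGE_SIZE)  # miss: page 2, evicts page 0
tlb.translate(0)              # miss again: page 0 was evicted
```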



# Measuring the impact of the IOMMU

- DMA reads of fixed size
- From random addresses on the host
- Systematically change the address range (window) we access
- Measure achieved bandwidth (or latency)
- Compare with non-IOMMU case
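
The access pattern for this experiment can be sketched as follows. The helper names `random_offsets` and `window_sweep`, the power-of-two sweep and the fixed seed are assumptions of this example; the actual benchmark runs on the device and may generate addresses differently.

```python
import random


def random_offsets(window, transfer_size, count, seed=0):
    """Random, transfer-size aligned offsets that stay inside the window."""
    rng = random.Random(seed)
    slots = window // transfer_size
    return [rng.randrange(slots) * transfer_size for _ in range(count)]


def window_sweep(start=4096, stop=64 * 1024 * 1024):
    """Window sizes to test, doubling from `start` up to `stop` bytes."""
    size = start
    while size <= stop:
        yield size
        size *= 2


for window in window_sweep(4096, 16384):
    offsets = random_offsets(window, transfer_size=64, count=4)
    print(window, offsets)
```

Each window size would be run once with the IOMMU enabled and once without, and the achieved bandwidth (or latency) compared.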

### IOMMU results



- Different transfer sizes
- Throughput drops dramatically once the accessed region exceeds 256KB
- TLB thrashing
- The TLB appears to have 64 entries (256KB / 4096B pages). Not published by Intel!
- Effect more dramatic for smaller transfer sizes
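
The inference from the 256KB knee to the TLB size is simple arithmetic, sketched below together with the hit rate one would expect for uniformly random accesses. The 4KiB page size comes from the slide; the assumptions that transfers never straddle a page and that accesses are independent and uniform are mine.

```python
PAGE_SIZE = 4096


def tlb_reach(entries, page_size=PAGE_SIZE):
    """Bytes of memory the TLB can map without any misses."""
    return entries * page_size


def inferred_entries(knee_bytes, page_size=PAGE_SIZE):
    """TLB entries implied by the window size where throughput drops."""
    return knee_bytes // page_size


def expected_hit_rate(window, entries=64, page_size=PAGE_SIZE):
    """Steady-state hit rate for uniform random accesses over `window` bytes."""
    pages = max(1, window // page_size)
    return min(1.0, entries / pages)


print(inferred_entries(256 * 1024))
for window_kib in (128, 256, 512, 1024, 4096):
    print(window_kib, expected_hit_rate(window_kib * 1024))
```

Once the window grows beyond the TLB reach, the hit rate in this model falls in proportion to the window size, and every miss adds a page table walk to the transaction. That fixed cost weighs more heavily on small transfers.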

### Understanding PCIe performance is important

• A plethora of tools exist to analyse and understand OS and application performance

... but very little data available on PCIe contributions

• Important when implementing offloads to programmable NICs

... but also applicable to other high performance IO devices such as ML accelerators, modern storage adapters, etc

## Introducing pcie-bench

- A model of PCIe to quickly analyse protocol overheads
- A suite of **benchmark tools** in the spirit of lmbench/hbench
- Records latency of individual transactions and bandwidth of batches
- Allows systematically varying:
  - Type of PCIe transaction (PCIe read/write)
  - Transfer size of PCIe transaction
  - Offsets for host memory address (for unaligned DMA)
  - Address range and NUMA location of memory to access
  - Access pattern (seq/rand)
  - State of host caches

Provides detailed insights into PCIe host and device implementations
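
One way to picture the parameter space listed above is as a sweep over all combinations. The sketch below is purely illustrative: `BenchParams`, `sweep` and the field names and labels are hypothetical and do not reflect the actual pcie-bench control program or its options.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BenchParams:
    """One benchmark configuration (hypothetical field names)."""
    op: str             # "read" or "write"
    transfer_size: int  # bytes per PCIe transaction
    offset: int         # byte offset used to force unaligned DMA
    window: int         # bytes of host memory accessed
    pattern: str        # "seq" or "rand"
    cache_state: str    # "warm" or "cold" (hypothetical labels)
    numa_local: bool    # buffer on the NUMA node local to the device


def sweep(ops, sizes, offsets, windows, patterns, cache_states, numa):
    """Enumerate every combination of the given parameter values."""
    combos = product(ops, sizes, offsets, windows, patterns, cache_states, numa)
    return [BenchParams(*combo) for combo in combos]


runs = sweep(
    ops=["read", "write"],
    sizes=[64, 256, 1024],
    offsets=[0],
    windows=[8192, 1 << 20],
    patterns=["seq", "rand"],
    cache_states=["warm", "cold"],
    numa=[True, False],
)
print(len(runs), runs[0])
```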

### Two independent implementations

- Netronome NFP-4000 and NFP-6000
  - Firmware written in Micro-C (~1500 loc)
  - Timer resolution 19.2ns
  - Kernel driver (~400 loc) and control program (~1600 loc)
- NetFPGA and Xilinx VC709 evaluation board
  - Logic written in Verilog (~1200 loc)
  - Timer resolution 4ns
  - Kernel driver (~800 loc) and control program (~600 loc)

[implementations on other devices possible]
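
Because the two implementations have different timer resolutions, raw tick counts need converting before latencies can be compared. The sketch below does that conversion and summarises a set of samples. The resolutions are the ones quoted above; the function names, the device labels and the nearest-rank percentile method are assumptions of this example.

```python
import math

# Timer resolution in nanoseconds per tick, as quoted for each implementation.
TIMER_NS = {"nfp": 19.2, "netfpga": 4.0}


def ticks_to_ns(ticks, device):
    """Convert a raw timer delta to nanoseconds."""
    return ticks * TIMER_NS[device]


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarise(tick_samples, device):
    """Median, 99th percentile and maximum latency in nanoseconds."""
    ns = [ticks_to_ns(t, device) for t in tick_samples]
    return {"median": percentile(ns, 50), "p99": percentile(ns, 99), "max": max(ns)}


summary = summarise([10, 20, 30, 40], "netfpga")
print(summary)
```

Note that a measurement can never be more precise than one tick, so latencies from the NFP implementation are quantised in 19.2ns steps.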

### Conclusions

- The PCIe protocol adds significant overhead, especially for small transactions
- PCIe implementations have a significant impact on IO performance:
  - They contribute significantly to the latency (70-90% on ExaNIC)
  - Big difference between the two implementations we measured (what about AMD, arm64, POWER?)
  - Performance is dependent on temporal host state (TLB, caches)
  - Dependent on other devices?
- Introduced pcie-bench to
  - understand PCIe performance in detail
  - aid development of custom NIC offload and other IO accelerators
- Presented the first detailed study of PCIe performance in modern servers

# Thank you!

Source code and all the data are available at:

https://www.pcie-bench.org

https://github.com/pcie-bench