

## INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS

ARKAPRAVA BASU, JOSEPH L. GREATHOUSE, GURU VENKATARAMANI, JÁN VESELÝ

AMD RESEARCH, ADVANCED MICRO DEVICES, INC.






#### 6<sup>th</sup> Gen. AMD A-Series Processor "Carrizo"



#### Accelerators from Industry and Academia

- Machine Learning
- Databases
- Computer Vision
- Regular Expressions
- Physics
- Graph Analytics
- Finite State Machines
- Genome Sequencing
- Reconfigurable (e.g., FPGA)

[Figure: CPU, data transfer, accelerator]






















- Work launched by the CPU to accelerators
- Data marshalled by the CPU
- Communication and computation are coarse-grained


























[Figure: CPU, data transfer, accelerator]



- Accelerators and CPUs can launch work to one another (and themselves)
- Data transfer implicit based on usage
- Communication and computation are fine-grained

#### **ACCELERATORS CAN NOW REQUEST OS SERVICES**


- GPU-to-CPU Callbacks (Owens et al., UCHPC 2010)
- Page Faults / Demand Paging (Veselý et al., ISPASS 2016)
- GPU-Initiated Network Requests (GPUnet, USENIX 2014)
- GPU-Initiated File System Requests (GPUfs, ASPLOS 2013)
- "Generic" System Calls (Genesys, ISCA 2018)
  - ioctl() for other devices
  - Memory management (e.g., sbrk(), mmap())
  - Signals












- Step 1: Set up request arguments in memory
- Step 2: Send request interrupt to a CPU
- Step 3: (a) Schedule bottom-half handler, (b) ACK request to GPU
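
The three steps above can be sketched as a toy simulation. This is a minimal illustration, not the driver's actual code: the `Cpu` and `Gpu` classes, the `memory` dictionary, and the function names are all hypothetical stand-ins for hardware and kernel state.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Cpu:
    """Hypothetical CPU model: holds a queue of deferred (bottom-half) work."""
    cpu_id: int
    bottom_half_queue: deque = field(default_factory=deque)


@dataclass
class Gpu:
    """Hypothetical GPU model: records which requests have been ACKed."""
    acked: list = field(default_factory=list)


def gpu_issue_ssr(memory, request_id, args):
    """Step 1: the GPU places the request arguments in shared memory."""
    memory[request_id] = args


def top_half_handler(cpu, gpu, request_id):
    """Step 3: runs on the CPU that received the request interrupt."""
    cpu.bottom_half_queue.append(request_id)  # Step 3a: defer the real work
    gpu.acked.append(request_id)              # Step 3b: ACK the request


memory = {}
cpu, gpu = Cpu(cpu_id=0), Gpu()
gpu_issue_ssr(memory, request_id=7, args={"type": "page_fault", "addr": 0x1000})
top_half_handler(cpu, gpu, request_id=7)      # Step 2: interrupt delivered to cpu
```

After the top half returns, the request is acknowledged but the actual service work is still queued on the CPU, which is where contention with unrelated CPU work can arise.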



















- GPUs and accelerators can request OS (system) services via system service requests (SSRs)
- These SSRs can interfere with unrelated CPU-based work
- Unrelated CPU-based work can slow down GPU SSR handling
- CPUs lose opportunities to sleep because of GPU SSRs













10 | INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS | October 5, 2018















[Figure: SSR handling timeline. Legend: CPU user work, GPU user work, mode/task switch, kernel task operation, low-performance user-mode operation]
















- Direct CPU Overheads
- Indirect CPU Overheads
- GPU Overheads


- AMD A10-7850K APU
  - 4x 3.7 GHz CPU cores (AMD Family 15h, Model 30h)
  - 720 MHz AMD GCN 1.1 ("Sea Islands", gfx700) GPU, 8 CUs
  - 32 GB Dual-Channel DDR3-1866
- Ubuntu 14.04.3 LTS (AMD64)
  - Linux<sup>®</sup> kernel 4.0.0 with HSA Drivers (amdkfd 1.6.1)
- PARSEC benchmarks for "CPU work"
- OpenCL<sup>™</sup> benchmark applications modified to create SSRs (page faults)
  - BPT [Veselý et al., ISPASS 2016]
  - XSBench [Veselý et al., ISPASS 2016]
  - SHOC BFS
  - SHOC SpMV
  - Pannotia SSSP
  - µBenchmark

#### **CPU PERFORMANCE W/ SSR-USING GPU APPLICATIONS**



#### **INDIRECT CPU OVERHEADS FROM GPU SSRS**

[Figure: increase of user-level branch mispredict rate (y-axis 0% to 30%) for the PARSEC benchmarks streamcluster, blackscholes, fluidanimate, swaptions, freqmine, raytrace, dedup, facesim, and canneal. A second panel titled "Increase of User-Level" is truncated in the source.]

#### **GPU PERFORMANCE W/ CONCURRENT CPU APPS**



#### **MITIGATION STRATEGIES**

Take inspiration from other domains (e.g., high-performance networking):

- Interrupt Coalescing
- Interrupt Steering
- Merged SSR Handlers
- Driver for Enforcing CPU QoS









Interrupt coalescing: wait until multiple SSRs arrive before interrupting a CPU core.
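
Coalescing can be illustrated with a small batching model. This is a sketch under stated assumptions: the `InterruptCoalescer` class and its `batch_size` parameter are hypothetical, and practical coalescing schemes usually also flush on a timeout so that a lone request is not delayed indefinitely; that timer is represented here only by an explicit `flush()` call.

```python
class InterruptCoalescer:
    """Hypothetical model: raise one interrupt per batch of SSRs."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []           # SSRs waiting for an interrupt
        self.delivered = []         # one list of SSRs per interrupt raised
        self.interrupts_raised = 0

    def submit(self, ssr):
        """Queue an SSR; interrupt the CPU only once the batch is full."""
        self.pending.append(ssr)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Deliver all pending SSRs with a single interrupt (if any)."""
        if not self.pending:
            return
        self.interrupts_raised += 1
        self.delivered.append(list(self.pending))
        self.pending.clear()


coalescer = InterruptCoalescer(batch_size=4)
for ssr_id in range(10):
    coalescer.submit(ssr_id)
interrupts_before_flush = coalescer.interrupts_raised  # two full batches
coalescer.flush()                                      # stand-in for a timeout
```

With ten SSRs and a batch size of four, the CPU sees three interrupts instead of ten, at the cost of added latency for the requests that wait for their batch to fill.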














Interrupt steering: ensure that all SSR interrupts go to the same core, rather than round-robin to all cores.
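
The difference between the two delivery policies can be shown by counting how many cores are disturbed. The function names and the core counts below are hypothetical and chosen only for illustration; how the steering is actually configured on the test system is not stated here.

```python
from collections import Counter
from itertools import cycle


def round_robin_targets(num_cores, num_interrupts):
    """Default-style policy: spread interrupts across all cores in turn."""
    cores = cycle(range(num_cores))
    return [next(cores) for _ in range(num_interrupts)]


def steered_targets(target_core, num_interrupts):
    """Steering policy: send every interrupt to one chosen core."""
    return [target_core] * num_interrupts


round_robin = Counter(round_robin_targets(num_cores=4, num_interrupts=8))
steered = Counter(steered_targets(target_core=0, num_interrupts=8))
```

Round-robin delivery touches all four cores, while steering concentrates the same eight interrupts on core 0 and leaves the other three free to run CPU work undisturbed.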














Merged SSR handlers: merge SSR pre-processing (4) with the interrupt handler (3).






#### **PARETO CURVE OF MITIGATION STRATEGY TRADEOFFS**

















[Figure: flowchart for the CPU QoS driver, including the check "CPU cycles handling SSRs > threshold"]
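
The threshold check above can be sketched as a single predicate. Several details are assumptions made for illustration: the threshold is treated as a fraction of the cycles in a measurement window, the function name `should_defer_ssrs` is hypothetical, and deferring SSR handling is only one plausible response when the threshold is exceeded.

```python
def should_defer_ssrs(ssr_cycles, total_cycles, threshold):
    """Return True when SSR handling exceeds its share of CPU cycles.

    ssr_cycles:   cycles spent handling SSRs in the current window
    total_cycles: all cycles in the current window (must be positive)
    threshold:    assumed maximum fraction of cycles allowed for SSRs
    """
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return ssr_cycles / total_cycles > threshold


over_budget = should_defer_ssrs(ssr_cycles=300, total_cycles=1000, threshold=0.25)
under_budget = should_defer_ssrs(ssr_cycles=100, total_cycles=1000, threshold=0.25)
```

A lower threshold protects CPU work more strongly but leaves less CPU time for servicing GPU requests.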













#### **CPU PERFORMANCE AT DIFFERENT QOS LEVELS**



#### **GPU PERFORMANCE SUFFERS FOR CPU QOS**



#### SUMMARY

- Heterogeneous systems can include an increasing number of accelerators
- GPUs and accelerators can now request system services
- These requests can cause interference between accelerators and unrelated CPU work
- The problem may worsen in the future
- Existing mitigation strategies help, but are not a complete solution


# **QUESTIONS?**

#### Disclaimer

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Linux is a registered trademark of Linus Torvalds. OpenCL is a trademark of Apple Inc. used by permission by Khronos. Other names used herein are for identification purposes only and may be trademarks of their respective companies.


# **BACKUP SLIDES**







#### **PARETO CURVE OF MITIGATION STRATEGY TRADEOFFS**

[Figure: Pareto plot of mitigation strategies. X-axis: geomean of CPU workload performance running with µbench, 0.4 to 0.8 (right is better). Y-axis: geomean of µbench performance running with CPU applications, 0.5 to 2 (higher is better). Labeled points include Default, Interrupt Steering, Merged Handler, Interrupt Steering + Interrupt Coalescing, Merged Handler + Interrupt Steering, Merged Handler + Interrupt Coalescing, and Merged Handler + Interrupt Steering + Interrupt Coalescing.]
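
A Pareto curve of this kind keeps only the configurations that no other configuration beats on both axes at once. The sketch below shows that selection; the configuration names and scores are hypothetical placeholders, not measurements from this work.

```python
def pareto_front(points):
    """Return the names of non-dominated configurations, sorted by name.

    points maps a name to (cpu_perf, gpu_perf); higher is better for both.
    A point is dominated if another point is at least as good on both
    metrics and strictly better on at least one.
    """
    front = []
    for name, (cpu_perf, gpu_perf) in points.items():
        dominated = any(
            other_cpu >= cpu_perf
            and other_gpu >= gpu_perf
            and (other_cpu > cpu_perf or other_gpu > gpu_perf)
            for other, (other_cpu, other_gpu) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical (cpu_perf, gpu_perf) pairs, for illustration only.
example_points = {
    "config_a": (0.5, 1.0),
    "config_b": (0.7, 0.9),
    "config_c": (0.6, 0.8),
    "config_d": (0.8, 0.6),
}
front = pareto_front(example_points)
```

Here `config_c` drops out because `config_b` is better on both metrics, while the remaining three each represent a different tradeoff between CPU and GPU performance.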

#### SSRS LIMIT LOW-POWER SLEEP STATES



#### **MITIGATION EFFECT ON SLEEP STATES**

