

# A new perspective on processing-in-memory architecture design

Dong Ping Zhang, Nuwan Jayasena, Alexander Lyashevsky, Joe Greathouse, Mitesh Meswani, Mark Nutter, Mike Ignatowski

**AMD Research** 

These data are submitted with limited rights under Government Contract No. DE-AC52-8MA27344 and subcontract B600716 from Advanced Micro Devices, Inc. on behalf of AMD Advanced Research LLC. These data may be reproduced and used by the Department of Energy on a need-to-know basis, with the express limitation that they will not, without written permission of the AMD Advanced Research LLC, be used for purposes of manufacture nor disclosed outside the Government.

This notice shall be marked on any reproduction of these data, in whole or in part.



# PROCESSING-IN-MEMORY?







## **HIGHLIGHT**

- Memory system is a key limiter
  - At 4TB/s, vast majority of node energy could be consumed by the memory system
- Prior PIM research constrained by
  - Implementation technology
  - Non-traditional programming models
- Our focus
  - 3D die stacking
  - Use base logic die(s) in memory stack
    - General-purpose processors
    - Support familiar programming models
    - Arbitrary programs vs a set of special operations





# PROCESSING-IN-MEMORY (PIM) --- OVERVIEW (I)

- Moving compute close to memory promises significant gains
  - Memory is a key limiter (performance and power)
  - Exascale goals of 4 TB/s per node and <5 pJ/bit
- Prior research
  - Integration of caches and computation
    - "A logic-in-memory computer" (1970)
    - No large scale integration was possible.
  - Logic in DRAM processes
    - In-memory processors with reduced performance or highly specialized for a limited set of operations.
    - Reduced DRAM density due to the presence of compute logic.
  - Embedded DRAM in logic processes
    - Cannot cost-effectively accommodate sufficient memory capacity
    - Reduced density of embedded memory
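
The exascale targets above imply a memory power budget that is easy to work out by hand. The sketch below does the arithmetic, assuming 1 TB = 10^12 bytes and that every transferred bit costs the quoted energy; the function name `memory_power_watts` is illustrative and not part of any AMD tool.

```c
#include <assert.h>

/* Memory-system power in watts for a sustained bandwidth (bytes per second)
 * and an access energy (picojoules per bit).
 * Assumption: every transferred bit costs the same energy. */
double memory_power_watts(double bytes_per_second, double picojoules_per_bit)
{
    assert(bytes_per_second >= 0.0 && picojoules_per_bit >= 0.0);
    double bits_per_second = bytes_per_second * 8.0;
    return bits_per_second * picojoules_per_bit * 1e-12; /* pJ -> J */
}
```

At the stated goals, `memory_power_watts(4e12, 5.0)` comes to about 160 W per node. Any access energy above the 5 pJ/bit target scales that figure linearly, which is why the memory system can dominate node energy at this bandwidth.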



# PROCESSING-IN-MEMORY (PIM) --- OVERVIEW (II)

- New opportunity: logic die stacked with memory
  - Logic die needed anyway for signal redistribution and integrity
  - Potential for non-trivial compute
- Key benefits
  - Reduce bandwidth bottlenecks
  - Improve energy efficiency
  - Increase compute for a fixed interposer area
  - Processor can be optimized for high BW/compute ratio
- Challenges
  - Programming models and interfaces
  - Architectural tradeoffs
  - Application refactoring



## **OUTLINE**

- PIM architecture baseline
- API specification
- Emulator and performance models
- Application studies



#### NODE HARDWARE ORGANIZATION

- Single level of PIM-attached memory stacks
- Host has direct access to all memory
  - Non-PIM-enabled apps still work
- Unified virtual address space
  - Shared page-tables between host and PIMs
- Low-bandwidth inter-PIM interconnect





#### PIM API OVERVIEW

- Current focus is on single PIM-enabled node
  - Current PIM hardware baseline is general-purpose processor
- Key goals of API
  - Facilitate data layout control
  - Dispatch compute to PIMs using standard parallel abstractions
- A "convenient" level of abstraction for early PIM evaluations
  - Aimed at PIM feasibility studies
  - Annotation of data and compute for PIM with reasonable programmer effort
- Key API features:
  - Discover device
  - Query device characteristics
  - Manage locality
  - Dispatch compute



#### PIM PTHREAD EXAMPLE: PARALLEL PREFIX SUM

```
/* Discover the PIM devices in the node. */
list_of_pims = malloc(max_pims * sizeof(pim_device_id));
failure = pim_get_device_ids(PIM_CLASS_0, max_pims, list_of_pims, &num_pims);

/* Query each PIM's characteristics (here, its CPU cores). */
for (i = 0; i < num_pims; i++) {
    failure = pim_get_device_info(list_of_pims[i], PIM_CPU_CORES, needed_size, device_info, NULL);
}

/* Allocate one input chunk and one output chunk per PIM, tagged with that PIM's id. */
for (i = 0; i < num_pims; i++) {
    parallel_input_array[i] = pim_malloc(sizeof(uint32_t) * chunk_size, list_of_pims[i],
                                         PIM_MEM_DEFAULT_FLAGS, PIM_PLATFORM_PTHREAD_CPU);
    parallel_output_array[i] = pim_malloc(sizeof(uint64_t) * chunk_size, list_of_pims[i],
                                          PIM_MEM_DEFAULT_FLAGS, PIM_PLATFORM_PTHREAD_CPU);
}

/* Dispatch one pthread per PIM; the arguments mirror those of pthread_create. */
for (i = 0; i < num_pims; i++) {
    pim_args[PTHREAD_ARG_THREAD] = &(pim_threads[i]); // pthread_t
    arg_size[PTHREAD_ARG_THREAD] = sizeof(pthread_t);
    pim_args[PTHREAD_ARG_ATTR] = NULL; // pthread_attr_t
    arg_size[PTHREAD_ARG_ATTR] = sizeof(pthread_attr_t);
    pim_args[PTHREAD_ARG_INPUT] = &(thread_input[i]); // void * for thread input
    arg_size[PTHREAD_ARG_INPUT] = sizeof(void *);
    pim_function.func_ptr = parallel_prefix_sum;
    spawn_error = pim_spawn(pim_function, pim_args, arg_size, NUM_PTHREAD_ARGUMENTS,
                            list_of_pims[i], PIM_PLATFORM_PTHREAD_CPU);
}

/* Wait for every PIM thread to finish. */
for (i = 0; i < num_pims; i++) {
    pthread_join(pim_threads[i], NULL);
}
```
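
The listing above does not show the body of `parallel_prefix_sum` or how the per-PIM chunks are combined. The following is a minimal host-only sketch of one common approach, using plain pthreads in place of `pim_spawn`: each thread scans its own chunk, then the caller adds the running total of all earlier chunks. The `prefix_sum_chunk` struct and the `chunked_prefix_sum` driver are hypothetical names introduced for illustration, and the two-phase scheme is an assumption rather than the authors' stated method.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Per-thread work description (hypothetical stand-in for thread_input[i]). */
typedef struct {
    const uint32_t *input;  /* this thread's chunk of the input array */
    uint64_t *output;       /* this thread's chunk of the output array */
    size_t length;          /* number of elements in the chunk */
    uint64_t chunk_total;   /* sum of the chunk, written by the thread */
} prefix_sum_chunk;

/* Phase 1: inclusive scan within a single chunk. */
static void *parallel_prefix_sum(void *arg)
{
    prefix_sum_chunk *chunk = arg;
    uint64_t running = 0;
    for (size_t i = 0; i < chunk->length; i++) {
        running += chunk->input[i];
        chunk->output[i] = running;
    }
    chunk->chunk_total = running;
    return NULL;
}

/* Runs phase 1 with one thread per chunk, then phase 2 on the caller:
 * each chunk's local scan is shifted by the sum of all preceding chunks.
 * Returns 0 on success, or the pthread_create error code. */
int chunked_prefix_sum(prefix_sum_chunk *chunks, pthread_t *threads, size_t num_chunks)
{
    assert(chunks != NULL && threads != NULL);
    size_t started = 0;
    int err = 0;
    for (; started < num_chunks; started++) {
        err = pthread_create(&threads[started], NULL, parallel_prefix_sum, &chunks[started]);
        if (err != 0)
            break;
    }
    /* Join only the threads that were actually created. */
    for (size_t c = 0; c < started; c++)
        pthread_join(threads[c], NULL);
    if (err != 0)
        return err;

    uint64_t offset = 0;
    for (size_t c = 0; c < num_chunks; c++) {
        for (size_t i = 0; i < chunks[c].length; i++)
            chunks[c].output[i] += offset;
        offset += chunks[c].chunk_total;
    }
    return 0;
}
```

In a PIM setting, phase 1 is the part that would run next to each memory stack, since it touches every element of the chunk; phase 2 is shown on the host only for simplicity.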



#### PIM EMULATION AND PERFORMANCE MODEL



- Phase 1: Native execution on commodity hardware
  - Capture execution trace and performance stats
- Phase 2: Post-process with performance models
  - Predict overall performance on future memory and processors
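
To make the two-phase flow concrete, here is a deliberately simple sketch of what a phase 2 post-processing step could look like. Everything in it is an assumption for illustration: the `trace_summary` and `machine_model` types, the choice of counters, and the additive model (compute time plus unoverlapped memory time) are hypothetical and are not the performance models used in this work.

```c
#include <assert.h>

/* Counters assumed to be captured during phase 1 (hypothetical fields). */
typedef struct {
    double compute_cycles;   /* cycles spent on non-memory work */
    double memory_accesses;  /* accesses that go all the way to DRAM */
} trace_summary;

/* Parameters of the machine being predicted (hypothetical fields). */
typedef struct {
    double frequency_hz;      /* core clock frequency */
    double memory_latency_s;  /* latency of one DRAM access, in seconds */
} machine_model;

/* Toy phase 2 model: compute time scales with clock frequency, memory time
 * with access latency, and the two are assumed not to overlap. */
double predict_runtime_seconds(trace_summary trace, machine_model machine)
{
    assert(machine.frequency_hz > 0.0);
    double compute_time = trace.compute_cycles / machine.frequency_hz;
    double memory_time = trace.memory_accesses * machine.memory_latency_s;
    return compute_time + memory_time;
}
```

The same captured trace can then be replayed against several `machine_model` settings, for example a slower clock combined with a shorter memory latency, without re-running the application.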



#### PARALLEL PREFIX SUM - HOST-ONLY VS HOST+PIM





- Host frequency: 4 GHz
- PIM frequency: 2 GHz
- Host latency: current
- PIM latency: 30% reduction
- Total runtime: 0.820 s (15% faster)
- PIM computation benefits from reduced latency



#### WAXPBY - HOST-ONLY VS HOST+PIM
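
The kernel itself is not shown on this slide. WAXPBY conventionally denotes the vector update w = alpha * x + beta * y, and the sketch below is a plain host-only reference version under that assumption; the signature is illustrative and is not taken from the PIM API.

```c
#include <assert.h>
#include <stddef.h>

/* Reference WAXPBY: w[i] = alpha * x[i] + beta * y[i] for i in [0, n).
 * Each element is read once and written once, so the kernel is
 * dominated by memory traffic rather than arithmetic. */
void waxpby(size_t n, double alpha, const double *x,
            double beta, const double *y, double *w)
{
    assert(n == 0 || (x != NULL && y != NULL && w != NULL));
    for (size_t i = 0; i < n; i++)
        w[i] = alpha * x[i] + beta * y[i];
}
```

Because every iteration is independent, the index range can be split into contiguous chunks with one chunk per PIM, in the same way as the prefix sum example.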



### SUMMARY AND FUTURE WORK

- PIM architecture baseline
- API specification
- Emulator and performance models
- Application studies
- Design space study
- Evaluation of the performance models
- Further work on API, execution model, applications, etc.



#### **ACKNOWLEDGEMENT:**

Lee Howes
Gabe Loh

**QUESTIONS?** 

# **FURTHER DISCUSSIONS:**

Dongping.zhang@amd.com

