1. Graphics Pipelines of the Last 20 Years

(Figure: evolution of graphics pipelines)

2. Unified Pipeline

Why Unify

With separate vertex and pixel shader units, one stage can sit idle while the other is saturated; running all shader work on one pool of unified processors keeps the workload balanced.

(Figures: shader workload; balanced workload on a unified shader)

Why a Scalar Instruction Shader

  • Vector ALU: efficiency varies with the instruction mix
  • Vector ALU with scalar co-issue: better, but still not perfect

Vector and VLIW architectures require more compiler work to keep all lanes busy, while scalar instructions are always 100% efficient and simple to compile for. For example, a three-component operation on a four-wide vector ALU uses only three of four lanes (75% utilization), whereas scalar cores issue it as three instructions at full utilization. Hence the design choice: build a unified architecture out of scalar cores, where all shader operations are done on the same processors.

3. Stream Processing

How CPUs and GPUs differ

  • Latency Intolerance vs Latency Tolerance
  • Task Parallelism vs Data Parallelism
  • Multi-thread Cores vs SIMT Cores
  • 10s of Threads vs 10000s of Threads

CPUs are low-latency, low-throughput processors; GPUs are high-latency, high-throughput processors.

GPUs can fit more ALUs on a chip of the same size and can therefore run many more threads of computation; modern GPUs run tens of thousands of threads concurrently.

(Figure: CPU vs. GPU die area)

What is Stream Processing

Given a (typically large) set of data (a “stream”), run the same series of operations (a “kernel” or “shader”) on all of the data (SIMD).

  • GPU designed to solve problems that tolerate high latencies
  • High latency tolerance, lower cache requirements
  • Less transistor area for cache, more area for computing units
  • More computing units, 10,000s of SIMD threads and high throughput
  • Threads are managed by hardware; you are not required to write code for each thread and manage them yourself
  • Easier to increase parallelism by adding more processors

So the fundamental unit of a modern GPU is the stream processor.
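To make the stream/kernel idea concrete, here is a minimal CUDA sketch (not from the referenced slides; the kernel name scale_kernel and the sizes are illustrative): one kernel is applied to every element of a large array, with the threads created and scheduled by the hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The "kernel": the same series of operations, run on every element
// of the "stream" by thousands of hardware-managed threads.
__global__ void scale_kernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;  // ~1M elements in the stream
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Launch tens of thousands of threads; scheduling is done in hardware.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    printf("done\n");
    return 0;
}
```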

4. G80 and GT200 Streaming Processor Architecture

(Figures: G80 and GT200 block diagrams)

SM (Streaming Multiprocessor)

  • Workloads are partitioned into blocks of threads among multiprocessors
    • a block runs to completion
    • a block doesn’t run until resources are available
  • Allocation of hardware resource
    • shared memory is partitioned among blocks (see the sketch after this list)
    • registers are partitioned among threads
  • Hardware thread scheduling
    • any thread not waiting for something can run
    • context switching is free and can happen every cycle
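A brief CUDA sketch of how those resources map onto code (an assumed example; block_sum and the 256-thread block size are illustrative, not from the slides): the __shared__ tile below is partitioned per block, while ordinary locals like v live in per-thread registers.

```cuda
// Assumes a launch with exactly 256 threads per block.
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float tile[256];        // shared memory: one tile per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;  // v occupies a per-thread register
    tile[threadIdx.x] = v;
    __syncthreads();

    // Tree reduction within the block; the block runs to completion
    // on the SM it was assigned to.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}
```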

Need a large number of threads to hide latency

  • Minimum: 128 threads/SM typically
  • Maximum: 1024 threads/SM on GT200

Warp size

The warp size is 32 threads. If threads within a warp diverge, both sides of the branch execute across all 32 threads; this is still more efficient than an architecture with a branch granularity of 48 threads.

(Figure: warp size)
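A small CUDA sketch of divergence (a hypothetical example, not from the slides): the warp executes both branch paths below, masking off the inactive lanes each time.

```cuda
__global__ void diverge(float *x) {
    int i = threadIdx.x;
    if (i % 2 == 0) {
        x[i] *= 2.0f;   // even lanes run this; odd lanes are masked
    } else {
        x[i] += 1.0f;   // odd lanes run this; even lanes are masked
    }
}
```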

5. Conclusion: G80 and GT200 Streaming Processor Architecture

  • Thread scheduling is automatically handled by hardware
  • Memory latency is covered by large number of in-flight threads

6. Stream Processing for Graphics

Graphics Pipeline (Logical View)

(Figure: logical graphics pipeline)

Anti-Aliasing

(Figure: aliasing)

7. CUDA

A scalable parallel programming model and software environment for parallel computing
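One illustration of the "scalable" part (an assumed example, not from the slides; the kernel name add_one is hypothetical): with a grid-stride loop, the same kernel runs unchanged on GPUs with any number of SMs, because the hardware schedules independent blocks onto SMs as they become free.

```cuda
__global__ void add_one(float *x, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {  // stride over the whole grid
        x[i] += 1.0f;
    }
}
```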

Reference

https://download.nvidia.com/developer/cuda/seminar/TDCI_Arch.pdf