High Performance Cloud Computing on Multi-core Computers
Jianchen Shan
Department of Computer Science, DeMatteis School of Engineering & Applied Science

Research Problem
Root Cause: Time Sharing in Multi-tenant Clouds
• Applications are parallelized to benefit from many-core architectures. What happens when they are migrated to virtualized environments?
• vCPU discontinuity: vCPUs (virtual CPUs) are descheduled to time-share a pCPU (physical CPU), which leads to excessive spinning in thread synchronization.
[Figure: vCPUs of VM 1 and VM 2 time-share the pCPUs under the VMM; when its time slice expires, a vCPU is descheduled.]
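As an illustration, the pCPU time wasted by spinning on a lock whose holder has been descheduled can be seen in a minimal discrete-time model (all numbers hypothetical, not measured on the testbed):

```python
# Toy discrete-time model of vCPU discontinuity (illustrative; all numbers
# hypothetical). Two vCPUs round-robin on one pCPU in fixed slices. vCPU0
# is preempted while holding a spinlock, so vCPU1's whole next slice is
# burned spinning on a lock the descheduled holder cannot release.

SLICE = 5  # pCPU time units per slice

def simulate(slices):
    remaining_cs = 7          # critical-section work left on vCPU0
    useful = wasted = 0
    for i in range(slices):
        running = i % 2       # round-robin: vCPU0, vCPU1, vCPU0, ...
        for _ in range(SLICE):
            if running == 0:  # vCPU0 makes progress on (or past) the lock
                remaining_cs = max(0, remaining_cs - 1)
                useful += 1
            elif remaining_cs > 0:
                wasted += 1   # vCPU1 spins: the holder is off the pCPU
            else:
                useful += 1   # lock released, vCPU1 does real work
    return useful, wasted
```

With a 7-unit critical section and 5-unit slices, vCPU1's first slice is pure spinning: `simulate(4)` yields 15 useful and 5 wasted units, a 25% overhead from one preemption.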
• vCPU inactivity leads to a severe I/O inactivity problem: after a vCPU is descheduled (time slices last milliseconds), the I/O tasks on it become inactive and cannot generate I/O requests.
[Figure: I/O tasks on a descheduled vCPU stop generating requests until the vCPU runs again.]

Experimental Results
• Multithreading induces high virtualization overhead, mainly caused by synchronization, user-level spinning, and NUMA management.
[Figure: (a) the Lock Holder Preemption (LHP) problem, where a preempted lock holder causes excessive spinning among vCPUs, and the Lock Waiter Preemption (LWP) problem of ticket spinlocks; (b) relative overhead δηr (%) (avg|max) of PARSEC and SPLASH-2 benchmarks (Blackscholes, Barnes, Bodytrack, Canneal, Dedup, Facesim, Ferret, ...) in over-committed (OC) and under-committed (UC) guest and host settings.]
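The LWP problem comes from the strict FIFO hand-off of ticket spinlocks; a minimal model (a sketch, not the kernel's spinlock code, with hypothetical parameters) shows runnable waiters spinning behind a single preempted waiter:

```python
# Sketch of a ticket spinlock's FIFO hand-off under Lock Waiter Preemption
# (a toy model, not kernel code). Waiters take numbered tickets and the
# lock serves them strictly in order, so if the vCPU holding the next
# ticket is descheduled, every later waiter spins even though it is
# runnable and could otherwise have acquired the lock.

def serve(tickets, preempted_ticket, wake_step):
    """Serve tickets 0..tickets-1; `preempted_ticket`'s vCPU is off the
    pCPU until time `wake_step`, stalling the whole FIFO queue."""
    now_serving, t = 0, 0
    served, spin_steps = [], 0
    while now_serving < tickets:
        if now_serving == preempted_ticket and t < wake_step:
            spin_steps += 1   # all later waiters burn cycles here
        else:
            served.append(now_serving)
            now_serving += 1
        t += 1
    return served, spin_steps
```

For example, `serve(4, 1, 5)` stalls tickets 2 and 3 for 4 steps behind the preempted holder of ticket 1, even though their vCPUs are running; an unordered lock would have let them proceed.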
• vCPU discontinuity delays launching GPU kernels and receiving their results; these delays lead to vGPU underutilization.
• vCPUs in a big VM are actually asymmetric; the load balancer is unaware of this and makes uninformed scheduling decisions.
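A toy calculation (hypothetical capacities, not the Linux balancer's actual logic) shows why balancing on run-queue length alone hurts when vCPU capacities are asymmetric:

```python
# Toy model of load balancing across asymmetric vCPUs (all numbers
# hypothetical). vCPU3 is frequently descheduled and gets only a quarter
# of a pCPU, but a balancer that only equalizes run-queue lengths treats
# all vCPUs as equal and parks as much work on it as on the fast ones.

def makespan(tasks_per_vcpu, capacity):
    # each queue drains at its vCPU's effective capacity; the VM
    # finishes when its slowest vCPU does
    return max(t / c for t, c in zip(tasks_per_vcpu, capacity))

capacity = [1.0, 1.0, 1.0, 0.25]          # vCPU3 is time-shared away
naive = makespan([4, 4, 4, 4], capacity)  # queue lengths equalized
aware = makespan([5, 5, 5, 1], capacity)  # capacity-aware placement
```

Equal queue lengths give a makespan of 16.0 time units because the slow vCPU becomes the straggler, while a capacity-aware placement of the same 16 tasks finishes in 5.0, illustrating the cost of the uninformed decision.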
[Figure: CPU-GPU synchronization timeline inside a VM (CPU exec., kernel invocation, data sync/copy, GPU exec., idle) for polling and blocking waits; vCPU descheduling delays the hand-off, and blocking adds wakeup latency.]
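The timeline above can be approximated with a small model (illustrative intervals, not CUDA's implementation) of when a polling versus a blocking CPU thread observes kernel completion:

```python
# Toy timeline model of the two CPU-GPU synchronization styles
# (illustrative, not a real runtime's code). The CPU thread's vCPU is on
# a pCPU only during `vcpu_slices`; the GPU kernel completes at `done`.

def completion_detected(done, vcpu_slices, wake_latency=0, mode="poll"):
    """vcpu_slices: (start, end) intervals when the vCPU is scheduled.
    Polling notices completion at the first on-CPU instant >= done;
    blocking is woken at done + wake_latency and must then be on-CPU."""
    target = done if mode == "poll" else done + wake_latency
    for start, end in vcpu_slices:
        if end <= target:
            continue              # vCPU slice ends before the event
        return max(start, target)
    return None                   # vCPU never runs again in this window

slices = [(0, 10), (30, 40)]      # vCPU descheduled during [10, 30)
early = completion_detected(5, slices)                       # -> 5
early_blk = completion_detected(5, slices, 2, "block")       # -> 7
late = completion_detected(15, slices)                       # -> 30
```

When the kernel finishes while the vCPU runs, polling sees it immediately (t=5) and blocking pays only the wakeup latency (t=7); but if the kernel finishes at t=15 while the vCPU is descheduled, both styles are delayed to t=30, matching the poster's claim that vCPU discontinuity delays receiving kernel results and leaves the vGPU underutilized.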
[Figure: GPU utilization (%) and normalized runtime of GROMACS, NAMD (polling and blocking) and Rodinia kernels (r.particle_filter, r.srad) with 1-4 co-runners, 16/32/64 vCPUs, and Baseline/Shared1/Shared2 configurations, with and without CPU interference.]
Solutions to Optimize Performance in Clouds
• GPU workloads suffer from poor and unpredictable performance [1].

References
[1] "Diagnosing the Interference on CPU-GPU Synchronization Caused by CPU Sharing in Multi-Tenant GPU Clouds," 2021 40th IEEE International Performance, Computing and Communications Conference (IPCCC).
[2] "CoPlace: Effectively Mitigating Cache Conflicts in Modern Clouds," 2021 ACM International Conference on Parallel Architectures and Compilation Techniques (PACT).
[3] "Paratick: Reducing Timer Overhead in Virtual Machines," 2021 50th International Conference on Parallel Processing (ICPP).
[4] "The Linux Load Balance: Wasted vCPUs in Clouds," 2020 IEEE Cloud Summit.
[5] "Pace Control in Federated Training System via Adaptive Dropout," 2020 IEEE Cloud Summit.
[6] "ptlbmalloc2: Reducing TLB Shootdowns with High Memory Efficiency," 2020 18th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA).
[7] "vSMT-IO: Improving I/O Performance and Efficiency on SMT Processors in Virtualized Clouds," 2020 USENIX Annual Technical Conference (ATC).
[8] "Effectively Mitigating I/O Inactivity in vCPU Scheduling," 2018 USENIX Annual Technical Conference (ATC).
[9] "Virtualization Overhead of Multithreading in X86: State of the Art & Remaining Challenges," 2021 IEEE Transactions on Parallel and Distributed Systems (TPDS).