High Performance Cloud Computing on Multi-core Computers
Jianchen Shan
Department of Computer Science, DeMatteis School of Engineering & Applied Science

Research Problem
Root Cause: Time Sharing in Multi-tenant Clouds
• Applications are parallelized to benefit from many-core architectures. What happens when they are migrated to virtualized environments?
• vCPU discontinuity: vCPUs (virtual CPUs) are descheduled to time-share a pCPU (physical CPU), which leads to excessive spinning in thread synchronization.
[Figure: vCPUs of VM 1 and VM 2 time-share the pCPUs under the VMM; when its time slice expires, a vCPU is descheduled.]
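As an illustration, the pCPU time wasted by spinning on a lock whose holder has been descheduled can be seen in a minimal discrete-time model (all numbers hypothetical, not measured on the testbed):

```python
# Toy discrete-time model of vCPU discontinuity (illustrative; all numbers
# hypothetical). Two vCPUs round-robin on one pCPU in fixed slices. vCPU0
# is preempted while holding a spinlock, so vCPU1's whole next slice is
# burned spinning on a lock the descheduled holder cannot release.

SLICE = 5  # pCPU time units per slice

def simulate(slices):
    remaining_cs = 7          # critical-section work left on vCPU0
    useful = wasted = 0
    for i in range(slices):
        running = i % 2       # round-robin: vCPU0, vCPU1, vCPU0, ...
        for _ in range(SLICE):
            if running == 0:  # vCPU0 makes progress on (or past) the lock
                remaining_cs = max(0, remaining_cs - 1)
                useful += 1
            elif remaining_cs > 0:
                wasted += 1   # vCPU1 spins: the holder is off the pCPU
            else:
                useful += 1   # lock released, vCPU1 does real work
    return useful, wasted
```

With a 7-unit critical section and 5-unit slices, vCPU1's first slice is pure spinning: `simulate(4)` yields 15 useful and 5 wasted units, a 25% overhead from one preemption.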
• vCPU inactivity leads to a severe I/O inactivity problem: after a vCPU is descheduled (time slices last milliseconds), the I/O tasks on it become inactive and cannot generate I/O requests.
[Figure: I/O tasks on a descheduled vCPU stop generating requests until the vCPU runs again.]

Experimental Results
• Multithreading induces high virtualization overhead, mainly caused by synchronization, user-level spinning, and NUMA management.
[Figure: (a) the Lock Holder Preemption (LHP) problem, where a preempted lock holder causes excessive spinning among vCPUs, and the Lock Waiter Preemption (LWP) problem of ticket spinlocks; (b) relative overhead δηr (%) (avg|max) of PARSEC and SPLASH-2 benchmarks (Blackscholes, Barnes, Bodytrack, Canneal, Dedup, Facesim, Ferret, ...) in over-committed (OC) and under-committed (UC) guest and host settings.]
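The LWP problem comes from the strict FIFO hand-off of ticket spinlocks; a minimal model (a sketch, not the kernel's spinlock code, with hypothetical parameters) shows runnable waiters spinning behind a single preempted waiter:

```python
# Sketch of a ticket spinlock's FIFO hand-off under Lock Waiter Preemption
# (a toy model, not kernel code). Waiters take numbered tickets and the
# lock serves them strictly in order, so if the vCPU holding the next
# ticket is descheduled, every later waiter spins even though it is
# runnable and could otherwise have acquired the lock.

def serve(tickets, preempted_ticket, wake_step):
    """Serve tickets 0..tickets-1; `preempted_ticket`'s vCPU is off the
    pCPU until time `wake_step`, stalling the whole FIFO queue."""
    now_serving, t = 0, 0
    served, spin_steps = [], 0
    while now_serving < tickets:
        if now_serving == preempted_ticket and t < wake_step:
            spin_steps += 1   # all later waiters burn cycles here
        else:
            served.append(now_serving)
            now_serving += 1
        t += 1
    return served, spin_steps
```

For example, `serve(4, 1, 5)` stalls tickets 2 and 3 for 4 steps behind the preempted holder of ticket 1, even though their vCPUs are running; an unordered lock would have let them proceed.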
• vCPU discontinuity delays launching GPU kernels and receiving their results; these delays lead to vGPU underutilization.
• vCPUs in a big VM are actually asymmetric; the load balancer is unaware of this and makes uninformed scheduling decisions.
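A toy calculation (hypothetical capacities, not the Linux balancer's actual logic) shows why balancing on run-queue length alone hurts when vCPU capacities are asymmetric:

```python
# Toy model of load balancing across asymmetric vCPUs (all numbers
# hypothetical). vCPU3 is frequently descheduled and gets only a quarter
# of a pCPU, but a balancer that only equalizes run-queue lengths treats
# all vCPUs as equal and parks as much work on it as on the fast ones.

def makespan(tasks_per_vcpu, capacity):
    # each queue drains at its vCPU's effective capacity; the VM
    # finishes when its slowest vCPU does
    return max(t / c for t, c in zip(tasks_per_vcpu, capacity))

capacity = [1.0, 1.0, 1.0, 0.25]          # vCPU3 is time-shared away
naive = makespan([4, 4, 4, 4], capacity)  # queue lengths equalized
aware = makespan([5, 5, 5, 1], capacity)  # capacity-aware placement
```

Equal queue lengths give a makespan of 16.0 time units because the slow vCPU becomes the straggler, while a capacity-aware placement of the same 16 tasks finishes in 5.0, illustrating the cost of the uninformed decision.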
[Figure: CPU-GPU synchronization timeline inside a VM (CPU exec., kernel invocation, data sync/copy, GPU exec., idle) for polling and blocking waits; vCPU descheduling delays the hand-off, and blocking adds wakeup latency.]
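The timeline above can be approximated with a small model (illustrative intervals, not CUDA's implementation) of when a polling versus a blocking CPU thread observes kernel completion:

```python
# Toy timeline model of the two CPU-GPU synchronization styles
# (illustrative, not a real runtime's code). The CPU thread's vCPU is on
# a pCPU only during `vcpu_slices`; the GPU kernel completes at `done`.

def completion_detected(done, vcpu_slices, wake_latency=0, mode="poll"):
    """vcpu_slices: (start, end) intervals when the vCPU is scheduled.
    Polling notices completion at the first on-CPU instant >= done;
    blocking is woken at done + wake_latency and must then be on-CPU."""
    target = done if mode == "poll" else done + wake_latency
    for start, end in vcpu_slices:
        if end <= target:
            continue              # vCPU slice ends before the event
        return max(start, target)
    return None                   # vCPU never runs again in this window

slices = [(0, 10), (30, 40)]      # vCPU descheduled during [10, 30)
early = completion_detected(5, slices)                       # -> 5
early_blk = completion_detected(5, slices, 2, "block")       # -> 7
late = completion_detected(15, slices)                       # -> 30
```

When the kernel finishes while the vCPU runs, polling sees it immediately (t=5) and blocking pays only the wakeup latency (t=7); but if the kernel finishes at t=15 while the vCPU is descheduled, both styles are delayed to t=30, matching the poster's claim that vCPU discontinuity delays receiving kernel results and leaves the vGPU underutilized.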
[Figure: GPU utilization (%) and normalized runtime of GROMACS, NAMD (polling and blocking) and Rodinia kernels (r.particle_filter, r.srad) with 1-4 co-runners, 16/32/64 vCPUs, and Baseline/Shared1/Shared2 configurations, with and without CPU interference.]
Solutions to Optimize Performance in Clouds
• GPU workloads suffer from poor and unpredictable performance [1].

References
[1] "Diagnosing the Interference on CPU-GPU Synchronization Caused by CPU Sharing in Multi-Tenant GPU Clouds," 2021 40th IEEE International Performance, Computing and Communications Conference (IPCCC).
[2] "CoPlace: Effectively Mitigating Cache Conflicts in Modern Clouds," 2021 ACM International Conference on Parallel Architectures and Compilation Techniques (PACT).
[3] "Paratick: Reducing Timer Overhead in Virtual Machines," 2021 50th International Conference on Parallel Processing (ICPP).
[4] "The Linux Load Balance: Wasted vCPUs in Clouds," 2020 IEEE Cloud Summit.
[5] "Pace Control in Federated Training System via Adaptive Dropout," 2020 IEEE Cloud Summit.
[6] "ptlbmalloc2: Reducing TLB Shootdowns with High Memory Efficiency," 2020 18th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA).
[7] "vSMT-IO: Improving I/O Performance and Efficiency on SMT Processors in Virtualized Clouds," 2020 USENIX Annual Technical Conference (ATC).
[8] "Effectively Mitigating I/O Inactivity in vCPU Scheduling," 2018 USENIX Annual Technical Conference (ATC).
[9] "Virtualization Overhead of Multithreading in X86: State of the Art & Remaining Challenges," 2021 IEEE Transactions on Parallel and Distributed Systems (TPDS).