Matthew Elbing - 2020 Student Research and Creativity Forum - Hofstra University

The Linux Load Balance: Wasted vCPUs in Clouds

Matthew Elbing and Jianchen Shan, Ph.D.

Fred DeMatteis School of Engineering and Applied Science, Hofstra University

ABSTRACT

BACKGROUND

Load balancing in Linux is essential to good performance. The load balancer is a kernel component that ensures each core in a multi-core system carries a similar amount of work. It does this in two primary ways: first, by placing newly created threads on the least loaded core, and second, by migrating threads from highly loaded cores to less loaded ones. This approach allows modern multi-core systems to fully utilize the performance of their underlying hardware. Linux also performs load balancing in a virtual environment; however, instead of balancing load across physical cores, it balances load across the virtual machine's vCPUs. The Linux load balancer does not behave any differently when running in a virtual environment. To diagnose the relationship between vCPU capacity and other users of the system, we use the steal time percentage. Steal time is the time during which a vCPU is ready to run but the hypervisor has assigned the underlying physical CPU elsewhere. In a multi-tenant system this percentage is dominated by time stolen by other users sharing the CPU resources.
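As a rough illustration of how steal time can be sampled from inside a VM (a sketch of ours, not the paper's tooling), the aggregate `cpu` line of `/proc/stat` on Linux exposes a cumulative steal counter as its eighth field; the steal percentage over an interval is the delta of that counter divided by the delta of all counters:

```python
import time

def cpu_times():
    """Read the aggregate 'cpu' line of /proc/stat (Linux-only).

    Field order per proc(5): user nice system idle iowait irq softirq steal ...
    """
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]  # drop the leading "cpu" label
    return [int(x) for x in fields]

def steal_percent(interval=1.0):
    """Steal time as a percentage of all CPU time over `interval` seconds."""
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    delta = [a - b for a, b in zip(after, before)]
    total = sum(delta)
    return 100.0 * delta[7] / total if total else 0.0  # index 7 = steal
```

On an uncontended host this hovers near zero; on a busy multi-tenant host it is the signal Fig. 2 correlates with capacity variation.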

Our observations indicate that vCPUs are dynamically asymmetric when the physical machine is time-shared by multiple virtual machines (VMs). A vCPU can reach its maximal capacity when all other vCPUs colocated on the same core are idle: in this case, all of the core's CPU time can be used by that vCPU due to the work-conserving principle, which may even let a vCPU consume more CPU time than it was assigned. On the other hand, a vCPU has lower capacity on a core that is heavily contended by multiple co-running vCPUs. A vCPU's capacity therefore varies as contention on the core changes, which makes vCPU capacity dynamic. The Linux load balancer is unaware of this heterogeneous and dynamic nature of vCPUs when running on a multi-tenant cloud system. This leads to load balance problems on the physical hardware, which in turn reduce throughput reliability and overall performance. Each vCPU has its own unique capacity that changes over time; because the Linux load balancer is unaware of this, it assumes every core has the same capacity. Since that is not the case, its scheduling choices do not effectively balance the system, and thus fail to exploit the full performance of modern multicore systems [6].

[Figure area: Capacity (with Min-Max error bars) and Coefficient of Variation (%) plotted against Steal Time]

The vCPUs of a multi-tenant cloud system are dynamic and asymmetric. Each individual vCPU has a unique capacity that changes dynamically over time. Thus, vCPUs in multi-tenant cloud systems behave more like heterogenous systems than traditional symmetric multi-core systems, but unlike heterogenous multi-processors, vCPUs in a multi-tenant cloud system frequently change their capacity. The Linux operating system is unaware of this behavior and attempts to make load balancing choices under the assumption that the underlying system utilizes a symmetric multi-processor (SMP) where all processor cores have that same capacity.

PROBLEM STATEMENT

[Figure area: Fig. 1 x-axis: vCPUs 1-7; Fig. 2 x-axis: Time (s), 0-1800]

Fig 1. vCPU capacity at a given time. Fig 2. Coefficient of variation of capacity compared to steal time percentage.

Figure 1 shows the capacity of each vCPU over the course of thirty minutes, normalized by the capacity of the slowest core. The minimum and maximum capacities are shown as error bars to illustrate the dynamic nature of vCPU capacities. Figure 2 shows that as steal time increases, so does the coefficient of variation of the per-core capacities. In other words, increased contention on the system leads to increased asymmetry between core capacities, which worsens the load balancer's unawareness problem.
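The coefficient of variation plotted in Fig. 2 is simply the standard deviation of the per-vCPU capacities divided by their mean. A minimal sketch, using hypothetical capacity values (not the measured data):

```python
import statistics

def coefficient_of_variation(values):
    """Population standard deviation / mean, as a percentage."""
    return 100.0 * statistics.pstdev(values) / statistics.mean(values)

# Hypothetical capacities, normalized to the slowest vCPU as in Fig. 1
caps = [1.00, 1.12, 1.05, 1.31, 1.08, 1.22]
cov = coefficient_of_variation(caps)
```

A CoV of zero would correspond to perfectly symmetric cores (the SMP assumption the load balancer makes); rising steal time pushes the CoV well above zero.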

MOTIVATING EXPERIMENT

To demonstrate the load balancing problem on multi-tenant cloud systems, we used a 6-core Intel i7-8700K processor running Linux kernel version 5.4.47 and created two virtual machines via the KVM kernel module. This simulates a multi-tenant cloud environment: one VM acts as the main VM and the other as a co-runner, and each has six vCPUs pinned one-to-one to physical cores. To monitor the scheduling choices Linux makes, we use a profiling tool on the host that shows where threads from both virtual machines are scheduled, exposing non-ideal thread placement. Both single-threaded and multi-threaded workloads suffer scheduling mistakes: heavily loaded threads are balanced within each individual virtual machine, but on the host there is a clear balance problem.

For the single-threaded workload we ran CPU-bound tasks pinned to cores 0 through 4 on the co-runner. Within the main virtual machine we then created one CPU-bound task and let the operating system perform the scheduling. For the multi-threaded test we created six threads on the co-runner: the first three are pinned to the first three cores and run a CPU-bound task, while the other three are pinned to cores 3, 4, 5 and run a light task. On the main virtual machine we then create three heavy CPU-bound tasks and three light tasks and let the operating system schedule them.

In the following figures, the hardware CPU cores are shown; each has two one-to-one pinned vCPUs bound to it, one from each virtual machine. The bars show the number of jobs scheduled on each core: white means the core is idle, green means one job, and orange means two jobs. The purple and red bars indicate run queues larger than two, which result from very short-running background operating system jobs and do not affect the results.
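The task pinning described above can be reproduced with Linux's `sched_setaffinity`. A minimal sketch (our own names and structure, not the authors' scripts) of a "heavy" busy-spinning task pinned to one core:

```python
import os
import time

def cpu_bound(core, seconds=0.5):
    """Pin the calling process to `core`, then busy-spin for `seconds`."""
    os.sched_setaffinity(0, {core})       # pid 0 = the calling process
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        pass                              # CPU-bound spin, as in the experiment
    return os.sched_getaffinity(0)        # confirm the pinning took effect

# e.g., inside the co-runner VM, one such task would be launched
# pinned to each of cores 0-4 for the single-threaded experiment.
```

A "light" task would replace the spin loop with mostly sleeping; the equivalent shell-level tool is `taskset -c N <command>`.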

Fig 3. Single-threaded execution on DA-vCPUs. Fig 4. Multi-threaded execution on DA-vCPUs.

In both cases the operating system fails to make good initial scheduling decisions, and the load balancer fails to correct those mistakes. In the single-threaded case, despite there being an idle hardware core, the heavy CPU-bound task on the main virtual machine is scheduled on a vCPU mapped to the same physical core as a co-runner vCPU that is already running a heavy CPU-bound task. As a result, while the load is evenly distributed over the vCPUs of the co-runner and over the cores of the main virtual machine, the actual load on the physical machine is not balanced: one physical core is completely unutilized, and another has twice as much work as the other busy cores. Had the load been balanced correctly, there would have been one CPU-bound task on each of the six cores: five from the co-runner on cores 0-4 and one from the main virtual machine on core 5.

In the multi-threaded case we also found poor scheduling decisions. In the ideal balanced case, the light threads from the main virtual machine would share cores 0, 1, 2 with the heavy tasks from the co-runner, and the heavy CPU-bound tasks from the main virtual machine would be scheduled on cores 3, 4, 5, each sharing a core with a light task from the co-runner. Instead, only one of the heavy CPU-bound tasks is scheduled alongside the lighter tasks, and only one of the lighter tasks from the main virtual machine lands on a core with an existing heavy CPU-bound load from the co-runner. The Linux load balancer cannot effectively schedule tasks when multiple virtual machines contend for CPU resources.

PROPOSED SOLUTION (work in progress)

We propose to periodically collect and expose the vCPU capacity in order to assist the Linux load balancer in making more informed decisions and optimize the resource utilization in the cloud.
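One way such a capacity signal could be collected (a sketch under our own assumptions; the actual mechanism is work in progress, and the measurement code lives at [7]) is to periodically time a fixed unit of work on each vCPU and normalize to the slowest:

```python
import os
import time

def time_fixed_work(core, n=1_000_000):
    """Pin to `core` and time a fixed CPU-bound loop there."""
    os.sched_setaffinity(0, {core})
    start = time.perf_counter()
    acc = 0
    for i in range(n):
        acc += i
    return time.perf_counter() - start

def relative_capacities(cores):
    """Capacity of each vCPU relative to the slowest one (slowest = 1.0).

    A longer runtime for the same work means a lower-capacity vCPU,
    so capacity is the slowest time divided by this vCPU's time.
    """
    times = {c: time_fixed_work(c) for c in cores}
    slowest = max(times.values())
    return {c: slowest / t for c, t in times.items()}
```

Exposing such per-vCPU values to the scheduler would let the load balancer weigh run queues by capacity rather than assuming symmetric cores.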

FUTURE WORK

We will investigate the behavior of DA-vCPUs with realistic workloads to measure the related performance degradation. Further investigation could uncover other issues (e.g., the fairness problem) that are also caused by the load balancer's unawareness of the nature of the vCPUs. Other parts of the operating system (e.g., memory management) might also be optimized for the heterogeneous nature of DA-vCPUs; such optimizations could be adapted from existing ideas in the literature on improving resource utilization on asymmetric hardware systems.

REFERENCES

[1] Bouron, Justinien, et al. "The Battle of the Schedulers: FreeBSD ULE vs. Linux CFS." USENIX ATC 2018.
[2] Ding, Xiaoning, et al. "Gleaner: Mitigating the blocked-waiter wakeup problem for virtualized multicore applications." USENIX ATC 2014.
[3] Cheng, Luwei, et al. "vScale: automatic and efficient processor scaling for SMP virtual machines." EuroSys 2016.
[4] Zhao, Yong, et al. "Characterizing and optimizing the performance of multithreaded programs under interference." PACT 2016.
[5] Koufaty, David, et al. "Bias scheduling in heterogeneous multi-core architectures." EuroSys 2010.
[6] Lozi, Jean-Pierre, et al. "The Linux scheduler: a decade of wasted cores." EuroSys 2016.
[7] vCPU capacity measurement, https://github.com/melbing1/da-vcpus

