# **KVM-devirt: Extending KVM to a Zero-overhead Partition Hypervisor**

Liang Deng, System Technologies and Engineering (STE) team

KVM Forum 2022



# Motivation

#### **Trends:** the number of cores in a single server is increasing rapidly

- Genoa: 384 cores
- Icelake: 224 cores

#### **Problems**

Scalability: the Linux kernel hits many-core scalability bottlenecks, e.g., lock contention in the file system, network stack, scheduler, and memory management.

Fault isolation: more applications run on a single server. If one application crashes the kernel, the whole server crashes.

Kernel customization: a single kernel can hardly satisfy every application's custom requirements (e.g., kernel configuration, kernel boot parameters).

#### **Partition the server using current KVM Virtualization**

Separate guest kernels

Virtualization overhead is non-trivial



#### **ByteDance**字节跳动 **KVM Virtualization Overhead**

#### 1. VM exits

- Timer virtualization
- **IPI** virtualization
- virtio notifications
- HLT and MWAIT instructions
- **CPUID** instruction
- Host interrupt

#### 2. Posted interrupt

- Although VM exit can be eliminated
- Overhead is still non-trivial due to complex hardware path

#### 3. Additional address translations

- Stage-2 address translations (EPT or NPT)
- DMA remap translation on IOMMU page tables



# **KVM-devirt**

- Interrupt passthrough
- IPI passthrough
- Timer passthrough
- Memory de-virtualization
- DMA de-virtualization
- Virtio notification passthrough

- Remove all VM exits after guest kernel initialization phase
- Remove all additional address translations
- Support both Intel and AMD platforms



# Interrupt Passthrough

- Posted interrupts are not used because of their extra overhead in the hardware path.
- Pass the local APIC registers (IRR, ISR, EOI) directly to the BM. Use separate host and guest interrupt vectors so the two are never mixed up.
- Guest interrupts arriving on guest cores are configured as IRQs, which are delivered directly to non-root mode.
- Host interrupts arriving on guest cores are configured as NMIs, which cause VM exits.
- Re-trigger a self-IPI (IRQ) in the host NMI handler to solve the IRQ-mask issue.
- Send self-IPIs at VM entry to inject virtual guest interrupts of emulated devices.
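The delivery rules in this list can be pictured with a toy model. This is only an illustrative Python sketch, not kernel code: the `Core` class, its method names, and the vector split at 0x80 are hypothetical (the slide says only that host and guest vectors are kept separate).

```python
from dataclasses import dataclass, field

# Hypothetical vector split, chosen only for illustration.
GUEST_VECTORS = range(0x20, 0x80)
HOST_VECTORS = range(0x80, 0x100)


@dataclass
class Core:
    """Toy model of a physical core running the guest in non-root mode."""
    vm_exits: int = 0
    guest_delivered: list = field(default_factory=list)
    host_handled: list = field(default_factory=list)

    def guest_irq(self, vector: int) -> None:
        # Guest interrupts are plain IRQs: delivered in non-root mode, no VM exit.
        assert vector in GUEST_VECTORS
        self.guest_delivered.append(vector)

    def host_nmi(self, vector: int) -> None:
        # Host interrupts arrive as NMIs, which force a VM exit.
        assert vector in HOST_VECTORS
        self.vm_exits += 1
        # The host NMI handler re-sends the interrupt as a self-IPI (IRQ),
        # modelled here as handing it to the regular host handler.
        self.host_handled.append(vector)
```

In this model only host interrupts ever increment `vm_exits`, which is the property the slide is after.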





# Interrupt Remap

- The interrupt posting capability of IOMMU interrupt remapping is not used.
- VFIO fills in the IRTE with the guest vector of the guest device interrupt and the APIC\_ID of the physical core where BM runs.
- When BM changes the virq-vcpu binding relations or guest vectors, VFIO updates the IRTE with the new value.
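A minimal sketch of the IRTE bookkeeping described above, assuming a dictionary stands in for the interrupt remapping table. `Irte`, `program_irte`, and the `vcpu_to_papic` map are hypothetical names; a real IRTE has many more fields than the two named on the slide.

```python
from dataclasses import dataclass


@dataclass
class Irte:
    """Toy IRTE holding only the two fields named on the slide."""
    vector: int        # guest vector of the device interrupt
    dest_apic_id: int  # APIC ID of the physical core running the target vCPU


def program_irte(irt, irq, guest_vector, vcpu, vcpu_to_papic):
    """(Re)write the entry for `irq`; called on every vector or binding change."""
    irt[irq] = Irte(vector=guest_vector, dest_apic_id=vcpu_to_papic[vcpu])
    return irt[irq]
```

Calling `program_irte` again with a new vector or vCPU simply overwrites the entry, mirroring the "VFIO updates the IRTE with the new value" step.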






# **IPI Passthrough**

- KVM maps a vAPIC\_ID-to-pAPIC\_ID mapping into BM at BM startup and updates it whenever a VCPU thread is migrated to a new physical core.
- At the sending core, BM maps vAPIC\_ID of target VCPU to pAPIC\_ID, and directly accesses ICR to send IPI with guest vector.
- At the receiving core, IPI (IRQ) is directly delivered to BM without VM exits.
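The sending path can be sketched as follows. This assumes x2APIC mode, where the ICR is a single 64-bit MSR (0x830) with the destination APIC ID in bits 63:32 and the vector in bits 7:0; the slide does not say which APIC mode is used. `build_icr`, `send_guest_ipi`, and the injected `wrmsr` callback are hypothetical names for illustration.

```python
X2APIC_ICR_MSR = 0x830  # x2APIC Interrupt Command Register


def build_icr(papic_id: int, vector: int) -> int:
    """Encode a fixed-delivery, physical-destination IPI for the x2APIC ICR.

    Delivery mode (bits 10:8) and destination mode (bit 11) are left at 0,
    which means "fixed" and "physical" respectively.
    """
    if not 0 <= vector <= 0xFF:
        raise ValueError("vector must fit in 8 bits")
    if not 0 <= papic_id <= 0xFFFFFFFF:
        raise ValueError("APIC ID must fit in 32 bits")
    return (papic_id << 32) | vector


def send_guest_ipi(vapic_to_papic, target_vapic_id, guest_vector, wrmsr):
    """Translate the target vAPIC_ID, then write the ICR directly (no VM exit)."""
    papic_id = vapic_to_papic[target_vapic_id]  # table mapped into BM by KVM
    wrmsr(X2APIC_ICR_MSR, build_icr(papic_id, guest_vector))
```

Because the translation table lives in guest-visible memory, KVM has to refresh it whenever a VCPU thread moves to another physical core, as the first bullet states.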





# Timer Passthrough

- BM uses physical Local APIC timer and Host Linux uses broadcast timer.
- KVM maps the TSC offset value into BM and updates it whenever modified.
- In lapic\_next\_event, BM subtracts the TSC offset value from the guest TSC deadline and writes the result to the TSCDEADLINE MSR.
- On LAPIC timer expiration, timer interrupt (IRQ) is directly delivered to BM without VM exits.
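The deadline conversion is a one-line subtraction. The sketch below assumes the usual relation guest TSC = host TSC + TSC offset and 64-bit wraparound arithmetic; `host_deadline`, the callback parameters, and this Python `lapic_next_event` are hypothetical stand-ins for the guest kernel code.

```python
MASK64 = (1 << 64) - 1
IA32_TSC_DEADLINE = 0x6E0  # TSC-deadline timer MSR


def host_deadline(guest_deadline: int, tsc_offset: int) -> int:
    """Convert a deadline in guest TSC units to the physical TSC value."""
    return (guest_deadline - tsc_offset) & MASK64


def lapic_next_event(delta_cycles, read_guest_tsc, tsc_offset, wrmsr):
    """Arm the physical LAPIC timer `delta_cycles` from now, as seen by the guest."""
    guest_deadline = read_guest_tsc() + delta_cycles
    wrmsr(IA32_TSC_DEADLINE, host_deadline(guest_deadline, tsc_offset))
```

The offset is read from the value KVM maps into BM, so KVM must keep that copy current whenever it changes the offset.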





# **Memory De-virtualization**

- BM uses only the stage-1 page table; stage-2 (EPT or NPT) is disabled.
- At BM startup, KVM statically pins BM's guest memory, initializes both the gfn-to-pfn and pfn-to-gfn tables, and maps them into BM.
- When BM writes its own guest page tables (with the set\_pgd/set\_pud/set\_pmd/set\_pte PV interfaces), it translates the gfn into a pfn. Thus guest page tables directly use pfns.
- The hardware MMU directly uses BM's guest page tables.
- When BM reads guest page tables (with the pgd\_val/pud\_val/pmd\_val/pte\_val PV interfaces), it translates the pfn back into a gfn.
- Use a hypercall in the guest page fault handler to emulate an MMIO trap.
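The write-side and read-side translations can be sketched on a single page-table entry. This is a simplified model assuming the x86-64 layout with the frame number in bits 51:12; the `PvPageTableOps` class is hypothetical, and leaving non-present entries untouched is an assumption of this sketch, not something the slide states.

```python
PAGE_SHIFT = 12
FRAME_MASK = ((1 << 52) - 1) & ~((1 << PAGE_SHIFT) - 1)  # bits 51:12
PRESENT = 1 << 0


class PvPageTableOps:
    """Toy versions of the set_pte / pte_val paravirtual hooks."""

    def __init__(self, gfn_to_pfn):
        self.gfn_to_pfn = dict(gfn_to_pfn)
        self.pfn_to_gfn = {pfn: gfn for gfn, pfn in self.gfn_to_pfn.items()}

    @staticmethod
    def _swap_frame(entry, table):
        flags = entry & ~FRAME_MASK            # keep permission and NX bits
        frame = (entry & FRAME_MASK) >> PAGE_SHIFT
        return (table[frame] << PAGE_SHIFT) | flags

    def set_pte(self, guest_entry):
        """Value actually stored in the page table walked by the hardware MMU."""
        if not guest_entry & PRESENT:          # assumption: skip non-present entries
            return guest_entry
        return self._swap_frame(guest_entry, self.gfn_to_pfn)

    def pte_val(self, hw_entry):
        """Value the guest kernel sees when it reads the entry back."""
        if not hw_entry & PRESENT:
            return hw_entry
        return self._swap_frame(hw_entry, self.pfn_to_gfn)
```

The two hooks are inverses of each other, so the rest of the guest kernel keeps reasoning in gfns while the hardware only ever sees pfns.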





# **DMA De-virtualization**

- At BM startup, KVM statically pins BM's guest memory and maps both the gfn-to-pfn and pfn-to-gfn mappings into BM.
- When the passthrough device driver invokes dma\_map to map a DMA buffer before issuing a DMA request, it first uses the gfn-to-pfn mappings to translate the gpa in the request to an hpa. Thus DMA remap is not required.
- VFIO in the host configures the IOMMU in passthrough mode to disable DMA remap address translations.
- Additional modifications to the device driver are required to ensure that dma\_map is invoked at PAGE\_SIZE granularity.
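A sketch of per-page DMA mapping, assuming 4 KiB pages. The PAGE\_SIZE requirement presumably exists because pages that are contiguous in guest-physical space need not be contiguous in host-physical space, so each page has to be translated on its own. This Python `dma_map` and its return format are hypothetical, not the kernel DMA API.

```python
PAGE_SIZE = 4096


def dma_map(gpa: int, length: int, gfn_to_pfn: dict) -> list:
    """Split a guest-physical buffer into per-page (hpa, length) segments."""
    segments = []
    end = gpa + length
    while gpa < end:
        offset = gpa % PAGE_SIZE
        chunk = min(PAGE_SIZE - offset, end - gpa)  # never cross a page boundary
        hpa = gfn_to_pfn[gpa // PAGE_SIZE] * PAGE_SIZE + offset
        segments.append((hpa, chunk))
        gpa += chunk
    return segments
```

For example, a 200-byte buffer that starts 96 bytes before a page boundary yields two segments whose host addresses may be far apart.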





# Virtio Notification Passthrough

- When the virtio frontend in BM sends a notification to the backend, it directly accesses the ICR to send an IPI (with a host vector) to the host core, without any VM exit.
- When the backend sends a notification to the frontend in BM, it sends an IPI (with a guest vector) to the guest core. The IPI (configured as an IRQ) is then delivered directly to BM through interrupt passthrough.





## **Other Optimizations to Remove VM exits**

- CPU isolation and no-hz full
- Handle cpuid in BM with dynamic binary rewriting
- The handling of some host IPIs to guest cores is delayed until the next VM exit.
- HLT and MWAIT instruction passthrough



# **Micro-benchmark Result for IPI latency**

[Figure: IPI latency in cycles at concurrency = 1, 2, 4]

Intel result

[Figure: IPI latency in cycles at concurrency = 1, 2, 4; legend: host, swx2apic VM, avic VM, BM]

**AMD** result





# **Micro-benchmark Result for Timer latency**

[Figure: timer latency in cycles]

Intel result



**AMD** result





## **Micro-benchmark Result for Cache Line Prefetch**

|                      | Cache Line prefetch latency<br>(lower is better) |
|----------------------|--------------------------------------------------|
| Native Host          | 9.32                                             |
| BM                   | 9.35                                             |
| VM with 1GB-size EPT | 11.1                                             |
| VM with 4KB-size EPT | 14.3                                             |



# **Real-world Application Result (BM vs VM)**

#### **Two Real-world Applications in ByteDance**

- XX

**BM vs VM Result** 

| Optimizations                       | XX end-to-end latency improvement based on VM |
|-------------------------------------|-----------------------------------------------|
| Interrupt + IPI + Timer Passthrough | 8%                                            |
| Memory devirtualization             | 14%                                           |
| DMA devirtualization                | 2%                                            |
| ALL                                 | 20%-30%                                       |



# **Real-world Application Result (BM vs Native Host)**

#### **One Partition**

- Native host: partition the server with only one runc container and run an XX in it.
- BM: partition the server with only one BM and run an XX in it.

**One Partition Result**

|             | XX End-to-end latency<br>(normalized, lower is better) |
|-------------|--------------------------------------------------------|
| Native Host | 1                                                      |
| BM          | 1.01                                                   |

#### **Four Partitions**

- Native host: partition the server with four runc containers and run an XX in each partition.
- BM: partition the server with four BMs and run an XX in each partition.

**Four Partitions Result**

|             | XX End-to-end latency<br>(normalized, lower is better) |
|-------------|--------------------------------------------------------|
| Native Host | 1                                                      |
| BM          | 0.91                                                   |





## **Status and Future Work**

#### **Status**

- Support both Intel and AMD
- Support both QEMU and Cloud-hypervisor as VMM

#### **Future work**

- Post the kernel patches upstream
- Live migration support
- virtio-balloon and memory hot-plug support





# Thanks very much

Q&A



