Xen Hypervisor Core
Understanding the Hypervisor
The Xen hypervisor is the foundational layer that sits between hardware and virtual machines, providing virtualization services while maintaining minimal size and complexity. This page explores the internal workings of the hypervisor, including its core subsystems, scheduling algorithms, memory management techniques, and interaction models with guest domains.
Understanding hypervisor internals is valuable for performance tuning, troubleshooting, and advanced configuration scenarios where knowledge of low-level behavior is essential.
Hypervisor Components Overview
The Xen hypervisor binary (xen.gz or xen.efi) contains several key subsystems that work together to provide virtualization services:
CPU Scheduler
Manages allocation of physical CPU time to virtual CPUs across all domains.
- Multiple scheduler implementations
- Configurable scheduling policies
- Load balancing algorithms
- NUMA-aware placement
Memory Manager
Controls physical memory allocation and implements memory virtualization.
- Page frame allocation
- Shadow page tables
- Hardware-assisted paging
- Memory sharing and ballooning
Interrupt Handler
Receives and routes hardware interrupts to appropriate domains.
- Interrupt virtualization
- Event channel delivery
- MSI/MSI-X support
- Interrupt affinity management
Timer Subsystem
Provides virtual timer services to domains for time-based operations.
- Virtual timer interrupts
- Wallclock time synchronization
- TSC virtualization
- Periodic and one-shot timers
Hypercall Interface
System call-like interface for domains to request hypervisor services.
- Privileged operations
- Domain management
- Memory operations
- Event channel management
IOMMU Support
Hardware memory protection for direct device assignment.
- VT-d (Intel) support
- AMD-Vi support
- DMA isolation
- Interrupt remapping
CPU Virtualization Deep Dive
Virtual CPU Management
Each domain can have multiple virtual CPUs (VCPUs). The hypervisor maintains complete CPU state for each VCPU, including registers, flags, segment descriptors, and control registers. When the scheduler switches between VCPUs, the hypervisor performs a full context switch, saving the current VCPU state and loading the new one.
VCPU State Components
- General Purpose Registers: RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP, R8-R15
- Instruction Pointer: RIP register indicating next instruction to execute
- Flags Register: RFLAGS containing condition codes and system flags
- Segment Registers: CS, DS, ES, FS, GS, SS with descriptors and limits
- Control Registers: CR0, CR3 (page table base), CR4 for CPU features
- Debug Registers: DR0-DR7 for breakpoint and debug support
- FPU/SSE State: Floating point and vector instruction state
- Extended State: AVX, AVX-512 and other extended CPU features
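To make the bookkeeping concrete, the following sketch collects the state listed above into a single C structure. It is illustrative only and does not reproduce Xen's actual struct vcpu, which additionally tracks pending events, paging mode, and scheduler state.
/* Illustrative container for the per-VCPU register state saved on a
 * context switch; layout is hypothetical, not Xen's struct vcpu. */
#include <stdint.h>

struct vcpu_saved_state {
    uint64_t gprs[16];         /* RAX..R15 general purpose registers        */
    uint64_t rip;              /* next instruction to execute               */
    uint64_t rflags;           /* condition codes and system flags          */
    uint16_t seg_sel[6];       /* CS, DS, ES, FS, GS, SS selectors          */
    uint64_t cr0, cr3, cr4;    /* control registers (CR3 = page-table base) */
    uint64_t dr[8];            /* DR0-DR7 debug registers                   */
    uint8_t  xsave_area[4096]; /* FPU/SSE/AVX and other extended state      */
};
A context switch saves the outgoing VCPU's values from the hardware registers into its structure, then restores the incoming VCPU's saved values before resuming guest execution.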
Privilege Levels and Protection
x86 processors have four privilege rings (0-3), with ring 0 being most privileged. Xen uses these rings differently depending on virtualization mode:
| Ring | PV Mode | HVM Mode | Purpose |
|---|---|---|---|
| Ring 0 | Xen Hypervisor | Guest Kernel | Most privileged code |
| Ring 1 | Guest Kernel (32-bit PV) | Unused | Guest OS kernel in 32-bit PV |
| Ring 2 | Unused | Unused | Legacy use only |
| Ring 3 | Guest User Apps (and 64-bit PV kernel) | Guest User Apps | Least privileged |
In PV mode, the guest kernel is demoted out of ring 0 so it cannot execute truly privileged instructions: 32-bit PV kernels run in ring 1, while 64-bit PV kernels run in ring 3 alongside user space (separated by page tables), because long mode paging only distinguishes supervisor from user mode and does not enforce segment limits. When the guest needs to perform a privileged operation, it makes a hypercall to the hypervisor running in ring 0.
Hypercalls
Hypercalls are the interface through which domains request services from the hypervisor, similar to how system calls allow user applications to request services from the OS kernel.
Common Hypercall Categories
- Domain Management: domctl sub-operations (createdomain, destroydomain, pausedomain, unpausedomain)
- Memory Operations: memory_op for allocation, update_va_mapping for page tables
- Virtual CPU Control: vcpu_op for VCPU management and control
- Event Channels: event_channel_op for inter-domain communication
- Grant Tables: grant_table_op for memory sharing setup
- Scheduling: sched_op for yield, block, and scheduling hints
- Console: console_io for emergency output and debugging
- Physical I/O: physdev_op for hardware access (Dom0 only)
Trap and Exception Handling
When a guest executes a privileged instruction in PV mode, or when any exception occurs (page fault, divide by zero, etc.), control transfers to the hypervisor. The hypervisor determines whether to handle the exception itself or reflect it back to the guest OS.
Exception Reflection
Most exceptions are reflected back to the guest OS, which handles them as if they occurred natively. The hypervisor injects the exception into the guest's exception handler, maintaining the guest's normal exception handling semantics.
Scheduler Implementation
Credit Scheduler
The Credit scheduler was Xen's default scheduler for many years (recent Xen releases default to Credit2). It implements work-conserving, proportional-share scheduling: each domain receives credits based on its weight, and credits are consumed as its VCPUs execute.
Credit Scheduler Concepts
- Weight: Proportional share of CPU (default 256, range 1-65535)
- Cap: Maximum CPU usage limit (percentage of one physical CPU)
- Credits: Accounting units allocated each scheduling period (30ms default)
- UNDER State: VCPU still has credits remaining; UNDER VCPUs are scheduled ahead of OVER VCPUs
- OVER State: VCPU has consumed more than its allocation (credits have gone negative); runs only when no UNDER VCPUs are runnable
- Load Balancing: Periodic rebalancing of VCPUs across physical CPUs
Credit Allocation Formula
Domain Credit = (Domain Weight / Sum of All Weights) × Total Credits per Period
Example: If Domain A has weight 256 and Domain B has weight 512, and there are 1000 credits per period, Domain A gets 333 credits and Domain B gets 667 credits.
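As a sanity check on that arithmetic, the short C sketch below (illustrative only, not the hypervisor's accounting code) computes each domain's share from its weight; after integer rounding the figures come out to 333 and 667.
/* Illustrative credit split: each domain's share is proportional to its weight. */
#include <stdio.h>

static unsigned int credits_for(unsigned int weight, unsigned int total_weight,
                                unsigned int credits_per_period)
{
    /* Integer arithmetic, so shares may differ by one credit from the exact value. */
    return (unsigned int)(((unsigned long long)weight * credits_per_period
                           + total_weight / 2) / total_weight);
}

int main(void)
{
    unsigned int total = 256 + 512;                           /* Domain A + Domain B */
    printf("Domain A: %u\n", credits_for(256, total, 1000));  /* 333 */
    printf("Domain B: %u\n", credits_for(512, total, 1000));  /* 667 */
    return 0;
}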
Credit2 Scheduler
Credit2 is a newer scheduler that addresses scalability limitations of the Credit scheduler on large systems with many cores and many domains; it is the default scheduler in recent Xen releases.
Credit2 Improvements
- Per-CPU Run Queues: Reduces lock contention on multi-core systems
- Hierarchical Structure: VCPUs organized in hierarchy for efficient scheduling
- Better Fairness: More accurate proportional share implementation
- Lower Overhead: Reduced scheduling overhead on large systems
- Improved Latency: Better worst-case latency characteristics
RTDS (Real-Time Deferrable Server) Scheduler
RTDS provides real-time scheduling guarantees using a deferrable-server scheme built on the Earliest Deadline First (EDF) algorithm. Each VCPU has a period and a budget, and is guaranteed to receive its budget of CPU time within each period.
RTDS Parameters
Period: Time interval in microseconds (e.g., 10000 = 10ms)
Budget: Guaranteed CPU time within each period (e.g., 5000 = 5ms)
A VCPU with period=10000 and budget=5000 is guaranteed 5ms of CPU every 10ms, or 50% CPU utilization.
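The guarantee is easiest to reason about as a utilization figure, as in the C sketch below. This is an illustrative necessary condition (total utilization must not exceed the number of physical CPUs), not RTDS's actual admission-control logic.
/* Illustrative utilization check: sum(budget/period) must not exceed #PCPUs. */
#include <stdio.h>

struct rt_vcpu { unsigned int period_us, budget_us; };

int main(void)
{
    struct rt_vcpu vcpus[] = { {10000, 5000}, {20000, 5000} }; /* 50% + 25% */
    unsigned int pcpus = 1;
    double total = 0.0;

    for (unsigned int i = 0; i < sizeof(vcpus) / sizeof(vcpus[0]); i++)
        total += (double)vcpus[i].budget_us / vcpus[i].period_us;

    printf("Total utilization %.2f on %u PCPU(s): %s\n", total, pcpus,
           total <= pcpus ? "admissible" : "over-committed");
    return 0;
}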
ARINC 653 Scheduler
Designed for avionics and safety-critical systems requiring certification. Implements fixed time partitioning where each domain gets exclusive CPU access during predetermined time slots.
Memory Management Internals
Physical Memory Layout
When Xen boots, it establishes its own memory map and reserves regions for hypervisor code, data structures, and per-domain metadata. The remaining memory is available for domain allocation.
Xen Memory Regions
- Hypervisor Code: The hypervisor binary itself (typically 1-2 MB)
- Hypervisor Data: Global data structures and per-CPU data
- Domain Descriptors: Metadata for each domain (struct domain)
- VCPU Descriptors: Per-VCPU metadata (struct vcpu)
- Page Frame Database: Array tracking state of every physical page frame
- M2P Table: Machine-to-Physical address translation table (PV mode)
- Frame Table: Per-page metadata including ownership and state
Page Frame Allocation
Xen maintains a free list of available page frames. When a domain needs memory, the hypervisor allocates pages from this free list and updates the frame table to record ownership.
Page Frame States
- free: Available for allocation
- allocated: Assigned to a domain but not yet mapped
- page_table: Used as part of a page table
- segdesc: Used as segment descriptor table (x86-specific)
- shared: Shared between multiple domains via grant tables
- loaned: Loaned to another domain
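A minimal sketch of per-frame bookkeeping corresponding to the states above is shown below. It is illustrative only; Xen's real frame table packs type, ownership, and reference counts into bitfields of struct page_info.
/* Illustrative frame-table entry; not Xen's struct page_info layout. */
#include <stdint.h>

typedef uint16_t domid_t;

enum frame_state {
    FRAME_FREE,          /* available for allocation                 */
    FRAME_ALLOCATED,     /* assigned to a domain but not yet mapped  */
    FRAME_PAGE_TABLE,    /* in use as part of a page table           */
    FRAME_SEGDESC,       /* in use as a segment descriptor table     */
    FRAME_SHARED,        /* shared between domains via grant tables  */
    FRAME_LOANED,        /* loaned to another domain                 */
};

struct frame_entry {
    enum frame_state state;
    domid_t          owner;     /* owning domain, if any             */
    uint32_t         refcount;  /* outstanding mappings/references   */
};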
Shadow Page Tables
In HVM mode without hardware-assisted paging, Xen maintains shadow page tables that mirror guest page tables but contain actual machine addresses instead of guest-physical addresses.
Shadow Paging Process
- Guest OS updates its page tables (guest virtual to guest physical)
- Xen intercepts these updates (page tables are write-protected)
- Xen updates corresponding shadow page table (guest virtual to machine physical)
- CPU's MMU uses shadow page table for actual memory access
- Guest reads see original page tables; CPU uses shadow tables
Hardware-Assisted Paging (HAP)
Modern processors provide Extended Page Tables (EPT on Intel) or Nested Page Tables (NPT on AMD) that eliminate the need for shadow page tables. With HAP, the hardware performs two-level address translation automatically.
Two-Level Address Translation
Level 1: Guest virtual → Guest physical (managed by guest OS)
Level 2: Guest physical → Machine physical (managed by hypervisor)
Hardware combines both translations, eliminating VM exits for most memory accesses and significantly improving performance.
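Conceptually, the hardware walk composes the two translations, as in the C sketch below. Both lookup helpers are placeholders standing in for the guest page-table walk and the hypervisor's physical-to-machine (p2m) mapping; with EPT/NPT, the processor performs both walks itself on a TLB miss, with no hypervisor involvement.
/* Conceptual composition of the two translation levels under HAP.
 * Both helpers are placeholders, not real Xen functions. */
#include <stdint.h>

typedef uint64_t gva_t;   /* guest virtual address    */
typedef uint64_t gpa_t;   /* guest physical address   */
typedef uint64_t mpa_t;   /* machine physical address */

static gpa_t walk_guest_page_tables(gva_t va) { return va;  /* placeholder */ }
static mpa_t p2m_lookup(gpa_t gpa)            { return gpa; /* placeholder */ }

static mpa_t translate(gva_t va)
{
    gpa_t gpa = walk_guest_page_tables(va); /* level 1: guest-managed      */
    return p2m_lookup(gpa);                 /* level 2: hypervisor-managed */
}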
Memory Ballooning
The balloon driver allows dynamic memory adjustment. When Dom0 needs to reclaim memory from a guest, it tells the balloon driver to "inflate," allocating pages within the guest and then releasing them back to the hypervisor for use elsewhere.
Memory Sharing (Page Deduplication)
Xen can identify identical memory pages across domains and share them, using copy-on-write to maintain isolation. This is particularly effective when running multiple instances of the same OS or application.
Interrupt Virtualization
Physical Interrupt Handling
When hardware generates an interrupt, it's delivered to the hypervisor first. The hypervisor determines which domain should receive the interrupt and routes it accordingly.
Interrupt Flow
- Hardware device generates interrupt
- CPU delivers interrupt to hypervisor interrupt handler
- Hypervisor determines target domain (typically Dom0 for hardware interrupts)
- Hypervisor converts to event channel notification
- Event channel fires in target domain
- Domain's event channel handler processes interrupt
Event Channels
Event channels are Xen's virtual interrupt mechanism. They provide asynchronous notifications between domains or from the hypervisor to domains.
Event Channel Types
- Physical IRQ: Binds to a physical hardware interrupt (Dom0 only)
- Inter-domain: Connects two domains for notifications
- Virtual IRQ: Hypervisor-generated virtual interrupt
- IPI: Inter-processor interrupt between VCPUs
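As an illustration of the inter-domain case, the sketch below models the usual handshake: one side allocates an unbound port for its peer, the peer binds to it, and either side can then send notifications. The struct shapes only approximate those in Xen's public event_channel.h; evtchn_op() and the OP_* constants are hypothetical stand-ins for the event_channel_op hypercall and its command numbers.
/* Illustrative inter-domain event-channel handshake (stand-in types). */
#include <stdint.h>

typedef uint16_t domid_t;
typedef uint32_t evtchn_port_t;

struct evtchn_alloc_unbound {    /* side A: allocate a port B may bind to */
    domid_t dom, remote_dom;
    evtchn_port_t port;          /* OUT: A's newly allocated local port   */
};

struct evtchn_bind_interdomain { /* side B: bind to A's advertised port   */
    domid_t remote_dom;
    evtchn_port_t remote_port;
    evtchn_port_t local_port;    /* OUT: B's local port                   */
};

struct evtchn_send { evtchn_port_t port; };

enum { OP_ALLOC_UNBOUND, OP_BIND_INTERDOMAIN, OP_SEND }; /* placeholders */
static int evtchn_op(int cmd, void *arg) { (void)cmd; (void)arg; return 0; }

static void bind_and_notify(domid_t peer, evtchn_port_t peer_port)
{
    struct evtchn_bind_interdomain bind = { .remote_dom  = peer,
                                            .remote_port = peer_port };
    evtchn_op(OP_BIND_INTERDOMAIN, &bind);       /* obtain bind.local_port */

    struct evtchn_send send = { .port = bind.local_port };
    evtchn_op(OP_SEND, &send);                   /* raise the notification */
}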
MSI and MSI-X Support
Modern devices use Message Signaled Interrupts instead of traditional wire-based interrupts. Xen supports MSI/MSI-X, allowing efficient interrupt delivery with less overhead than traditional interrupts.
Timer Management
Virtual Timers
Each domain has access to multiple time sources:
| Timer Type | Description | Use Case |
|---|---|---|
| System Time | Monotonically increasing time since boot | Elapsed time measurement |
| Wall Clock | Real-world time (can jump forward/backward) | Current time display |
| Virtual TSC | Virtualized Time Stamp Counter | High-resolution timing |
| Periodic Timer | Regular interval interrupts | Scheduler ticks |
| One-shot Timer | Single event at specified time | Timeout handling |
TSC Virtualization
The Time Stamp Counter (TSC) is a high-resolution CPU counter that increments with every clock cycle. Xen provides multiple TSC modes to balance accuracy and performance:
TSC Modes
- Native: Guest reads actual hardware TSC (fast but may be inconsistent)
- PV: Hypervisor provides virtual TSC via memory area (good for PV guests)
- Emulated: Trap TSC reads and emulate (slow but accurate)
- Hybrid: Combination approach for best performance/accuracy balance
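In the PV mode above, the hypervisor publishes a per-VCPU time record (a TSC snapshot, the system time at that snapshot, and a fixed-point scale factor) in shared memory, and the guest converts raw TSC readings to nanoseconds itself. The sketch below follows the usual pvclock-style conversion; the structure is a simplified stand-in for Xen's shared vcpu_time_info, and real guests also re-check a version counter to guard against concurrent updates.
/* Pvclock-style conversion of a raw TSC reading to system time (ns).
 * Simplified stand-in for the shared per-VCPU time record. */
#include <stdint.h>

struct pv_time_info {
    uint64_t tsc_timestamp;     /* TSC value at the last hypervisor update   */
    uint64_t system_time_ns;    /* system time (ns) at that same instant     */
    uint32_t tsc_to_system_mul; /* fixed-point scale: ns = delta * mul >> 32 */
    int8_t   tsc_shift;         /* pre-scale: shift delta before multiplying */
};

static uint64_t tsc_to_ns(const struct pv_time_info *t, uint64_t tsc_now)
{
    uint64_t delta = tsc_now - t->tsc_timestamp;

    if (t->tsc_shift >= 0)
        delta <<= t->tsc_shift;
    else
        delta >>= -t->tsc_shift;

    /* 64x32-bit multiply, keep the upper part of the 96-bit product. */
    return t->system_time_ns +
           (uint64_t)(((unsigned __int128)delta * t->tsc_to_system_mul) >> 32);
}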
IOMMU Support
Purpose of IOMMU
The IOMMU (Input-Output Memory Management Unit) provides memory protection for DMA operations. Without IOMMU, a device assigned to a domain could access any physical memory, potentially compromising other domains.
IOMMU Benefits
- DMA Isolation: Devices can only access memory belonging to their assigned domain
- Safe Device Assignment: PCI passthrough is safe even with untrusted guests
- Interrupt Remapping: Protects against interrupt injection attacks
- Address Translation: Translates DMA addresses from guest-physical to machine-physical
VT-d (Intel) and AMD-Vi
Both Intel and AMD provide IOMMU implementations that Xen supports. These allow secure direct device assignment where a physical PCI device is exclusively assigned to a guest domain with full DMA protection.
Boot and Initialization
Hypervisor Boot Sequence
Detailed Boot Process
- Bootloader Handoff: GRUB loads hypervisor and transfers control
- Long Mode Setup: Hypervisor switches from the bootloader's protected mode into 64-bit long mode
- Memory Detection: Queries firmware for physical memory map
- CPU Initialization: Sets up per-CPU data structures
- MMU Setup: Creates hypervisor page tables
- Interrupt Setup: Initializes interrupt controllers (APIC/IOAPIC)
- Timer Initialization: Sets up system timers and calibrates TSC
- IOMMU Init: Discovers and initializes IOMMU hardware
- Domain 0 Creation: Builds initial domain (Dom0)
- Dom0 Kernel Load: Loads Dom0 kernel into memory
- Dom0 Start: Transfers control to Dom0 kernel entry point
Command-Line Parameters
The hypervisor accepts various parameters at boot time to control behavior:
- dom0_mem=2G - Set Dom0 memory allocation
- dom0_max_vcpus=2 - Limit Dom0 virtual CPUs
- sched=credit2 - Select scheduler
- iommu=on - Enable IOMMU support
- console=com1 - Use serial console
- loglvl=all - Increase log verbosity
- tsx=0 - Disable Intel TSX (if vulnerable)
Hypercall Interface Details
Making a Hypercall
In x86-64 PV guests such as Linux, hypercalls are issued with the SYSCALL instruction (normally reached through the hypercall page that Xen maps into the guest), with the hypercall number in RAX and up to five arguments in RDI, RSI, RDX, R10, and R8:
; Hypercall example (x86-64 assembly)
mov rax, __HYPERVISOR_console_io ; Hypercall number
mov rdi, CONSOLEIO_write ; First argument
mov rsi, buffer_address ; Second argument
mov rdx, buffer_length ; Third argument
syscall ; Invoke hypercall
Privileged vs. Unprivileged Hypercalls
| Hypercall | Privilege | Purpose |
|---|---|---|
| set_timer_op | Any | Set virtual timer |
| console_io | Any | Emergency console output |
| grant_table_op | Any | Manage grant table entries |
| event_channel_op | Any | Manage event channels |
| sched_op | Any | Yield CPU, block |
| domctl | Dom0 only | Domain management operations |
| sysctl | Dom0 only | System-wide operations |
Performance Optimization
CPU Pinning
Pin VCPUs to specific physical CPUs to reduce cache misses and improve locality.
xl vcpu-pin domainname 0 2
NUMA Placement
On NUMA systems, place a domain's VCPUs and memory on the same node to minimize remote memory accesses, for example by restricting the domain to a single node in its configuration file:
cpus="node:0"
Scheduler Tuning
Adjust scheduler parameters for specific workload characteristics.
xl sched-credit -d domain -w 512
TSC Mode
Select an appropriate TSC mode for the guest workload type (set in the domain configuration file).
tsc_mode="native"
Debugging and Troubleshooting
Hypervisor Debugging Tools
- xl dmesg: View hypervisor console log
- xl debug-keys: Trigger debug output (system state, domain info, etc.)
- xentrace: Capture hypervisor execution traces
- xenpm: Monitor power management and CPU states
- Serial Console: Access hypervisor console via serial port
Debug Keys
Pressing Ctrl-A three times on the hypervisor's serial console switches console input to Xen, where single-key debug commands can be issued:
- d - Dump domain info
- q - Dump domain run queues
- m - Dump memory info
- i - Dump interrupt bindings
- t - Dump timer queues
- h - Show help for all keys
Security Considerations
Hypervisor Security Best Practices
- Keep hypervisor updated with security patches
- Enable IOMMU for device assignment security
- Use stub domains for QEMU isolation
- Minimize Dom0 attack surface
- Consider XSM/FLASK for mandatory access control
- Monitor hypervisor logs for anomalies
- Disable unused features to reduce attack surface
Note: Understanding hypervisor internals helps optimize performance and troubleshoot issues. However, most users can achieve excellent results with default settings and standard best practices.