vP

The Role of Dual DPU Support in vSphere 8 for High Availability

In vSphere 8, VMware introduced dual Data Processing Unit (DPU) support to improve high availability and performance in virtualized environments. DPUs, also known as SmartNICs, offload networking and security tasks from the main CPU, which frees CPU cycles for workloads. With dual DPU support, each ESXi host can use two DPUs, configured either in Active/Standby mode for redundancy or as independent devices to double offload capacity. In Active/Standby mode, offloaded network functions keep running if one DPU fails. As independent devices, the two DPUs provide more capacity for network function acceleration.


vSphere Distributed Services Engine is a core vSphere capability that enables customers to use DPUs with vSphere and VMware Cloud Foundation. As the diagram below illustrates, DPUs can also handle additional capabilities such as storage offload and bare metal management, but these capabilities are not currently supported.



vSphere Distributed Services Engine offloads and accelerates infrastructure functions on the DPU by introducing a VMware vSphere Distributed Switch on the DPU, together with VMware NSX Networking and Observability. This lets you proactively monitor, identify, and mitigate network infrastructure bottlenecks without complex network taps. The DPU becomes a new control point to scale infrastructure functions and enables security controls that are agentless and decoupled from the workload domain.


With vSphere Distributed Services Engine, you can:

  • Install and update ESXi images simultaneously on the x86 server and the attached supported DPU to reduce operational overhead of DPU lifecycle management with integrated vSphere workflows.

  • Set alarms for DPU hardware alerts and monitor performance metrics on core, memory, and network throughput from the familiar vCenter interfaces, without the need for new tools.

  • Accelerate vSphere Distributed Switch on the DPU to improve network performance and utilize available CPU cycles to achieve higher workload consolidation per ESXi host.

  • Use vSphere DRS and vSphere vMotion with VMs running on hosts that have DPUs attached, so you get the benefits of passthrough without sacrificing VM portability.

  • Improve the security of infrastructure with zero-trust security.


vSphere Distributed Services Engine does not require a separate ESXi license. An internal network, isolated from other networks, connects the DPUs with ESXi hosts. ESXi 8.0 server builds are unified images, which contain both x86 and DPU content. In your vSphere system, you see DPUs as new objects during installation and upgrade, and in networking, storage, and host profile workflows.


Installation of VMware vSphere Distributed Services Engine with 2 DPUs

ESXi 8.0 Update 3 server builds are unified images, which contain both x86 and DPU content, and you cannot install x86 and DPU content separately. The installation procedure, whether interactive or scripted, runs on both DPUs in parallel, so you see minimal performance loss compared to a single-DPU system.


With vSphere 8.0 Update 3, you can get a pre-installed server configuration with 2 DPUs from Dell or Lenovo, or add a second DPU to a single-DPU system on the supported dual-DPU servers from Dell or Lenovo.


With DPUs from vendors such as NVIDIA and AMD Pensando, and server designs from Dell and Lenovo, dual-DPU systems fit into existing infrastructure. vSphere Lifecycle Manager updates the DPUs and the x86 host together, which keeps their software versions in sync and simplifies maintenance.


Error Handling, Failover, and Rollback for VMware vSphere Distributed Services Engine

Before installing VMware vSphere Distributed Services Engine, review the error handling, failover, and rollback options.


Error Handling

An installation failure of either x86 or DPU content on an ESXi host marks the entire installation procedure as failed.


While the expectation is that the software state of DPUs remains identical at all times, in the unlikely case of an error during a lifecycle operation, such as installation or upgrade of a Component, the operation might pass on one DPU but fail on the other. Since each lifecycle operation occurs within the boundaries of each DPU, errors do not affect the state of the other DPU, but the overall result of the installation is still marked as a failure.


During interactive install, in vSphere Lifecycle Manager workflows, and when you use ESXCLI, you receive information about the DPU on which the operation failed.
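The "any failure fails the whole operation" rule described above can be sketched in a few lines of Python. This is only an illustrative model, not a VMware API: the `TargetResult` class, the `overall_result` function, and the target names `x86`, `dpu0`, and `dpu1` are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TargetResult:
    """Outcome of a lifecycle operation on one target (hypothetical model)."""
    target: str       # illustrative names such as "x86", "dpu0", "dpu1"
    succeeded: bool


def overall_result(results: List[TargetResult]) -> dict:
    """Mark the whole operation as failed if any single target failed.

    Each target keeps its own outcome, so the caller can still see
    which DPU (or the x86 side) the operation failed on.
    """
    failed = [r.target for r in results if not r.succeeded]
    return {
        "status": "failed" if failed else "succeeded",
        "failed_targets": failed,
    }


if __name__ == "__main__":
    # The operation passes on the x86 host and one DPU but fails on the other.
    outcome = overall_result([
        TargetResult("x86", True),
        TargetResult("dpu0", True),
        TargetResult("dpu1", False),
    ])
    print(outcome)
```

In this model the second DPU's failure does not change the recorded state of the first DPU, but the overall status is still reported as failed, along with the name of the target that failed.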


After a successful installation, in case of DPU errors, the recommended action is to restart the affected ESXi host. If the DPU is still accessible from the host, the general log bundle collection is sufficient for troubleshooting. If the DPU is not accessible from the host, logging in to the DPU from a BMC, iLO, or iDRAC interface can provide troubleshooting logs.


Failover

Failover support in vSphere 8.0 Update 3 is limited to cases where one of the DPUs becomes non-functional, either because of software errors within the DPU or because of a physical disconnect, such as a cable disconnect. Failover due to Peripheral Component Interconnect (PCI) level errors is not supported.
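To make the failover limits easier to reason about, here is a minimal Python sketch that encodes them as a lookup. It is a hypothetical model for illustration only: the `DpuFault` enum and the `failover_supported` function are names invented for this example and do not correspond to any vSphere interface.

```python
from enum import Enum


class DpuFault(Enum):
    """Fault categories named in the text above (hypothetical model)."""
    SOFTWARE_ERROR = "software error within the DPU"
    PHYSICAL_DISCONNECT = "physical disconnect, such as a cable disconnect"
    PCI_ERROR = "PCI-level error"


# Fault types for which the text says failover is supported.
SUPPORTED_FAULTS = {DpuFault.SOFTWARE_ERROR, DpuFault.PHYSICAL_DISCONNECT}


def failover_supported(fault: DpuFault, failed_dpus: int = 1) -> bool:
    """Return True only when exactly one DPU fails with a supported fault."""
    return failed_dpus == 1 and fault in SUPPORTED_FAULTS


if __name__ == "__main__":
    for fault in DpuFault:
        print(f"{fault.value}: failover supported = {failover_supported(fault)}")
```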


Rollback

Rollback is a best-effort mechanism to restore the system to a previous working state in case of a failure before the jumpstart phase of the ESXi boot. Rollback on both x86 servers and the attached supported DPUs is automatic in case of an error during booting. You can also opt for a manual rollback by pressing Shift+R before the bootloader starts, to return to a previous good state.

Any failure after the jumpstart phase starts does not result in a rollback.
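The rollback rules above boil down to a small decision function. The sketch below is a simplified model in Python, assuming only two coarse boot phases. The `BootPhase` enum and the `rollback_outcome` function are hypothetical names for this example, and the real boot sequence has more detail than this.

```python
from enum import Enum
from typing import Optional


class BootPhase(Enum):
    """Coarse boot phases, named only for this illustration."""
    BEFORE_JUMPSTART = "before jumpstart"
    JUMPSTART_OR_LATER = "jumpstart or later"


def rollback_outcome(failure_phase: Optional[BootPhase],
                     shift_r_pressed: bool = False) -> str:
    """Describe what happens for a given failure phase.

    failure_phase is None when the boot completes without errors.
    shift_r_pressed models pressing Shift+R before the bootloader starts.
    """
    if shift_r_pressed:
        return "manual rollback"
    if failure_phase is None:
        return "no rollback needed"
    if failure_phase is BootPhase.BEFORE_JUMPSTART:
        return "automatic rollback (best effort)"
    # Failures once the jumpstart phase has started do not trigger a rollback.
    return "no rollback"


if __name__ == "__main__":
    print(rollback_outcome(BootPhase.BEFORE_JUMPSTART))
    print(rollback_outcome(BootPhase.JUMPSTART_OR_LATER))
    print(rollback_outcome(None, shift_r_pressed=True))
```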


Rollback scenarios for VMware vSphere Distributed Services Engine installation (image courtesy of VMware)

In summary, vSphere 8's dual DPU support offers enhanced high availability and performance, providing a robust solution for modern data centers. By offloading tasks from the CPU, DPUs free up resources for critical workloads, improving overall system efficiency. The dual DPU configuration further enhances this by providing redundancy and increased capacity, ensuring that virtualized environments remain resilient and performant.


Thank you for reading!



*** Explore | Share | Grow ***
