How PCIe architecture and RAID can unlock the full potential of GPUDirect Storage
Editor: Baoxingwei Technology | 2023-02-01
With faster graphics processing units (GPUs) delivering significantly higher computing power, the data path between storage devices and GPU memory has become a bottleneck that prevents applications from reaching optimal performance. NVIDIA's Magnum IO GPUDirect Storage solution goes a long way toward solving this problem by enabling a direct path between the storage device and GPU memory. It is equally important, however, to pair it with a fault-tolerant system so that critical data is protected in the event of a catastrophic failure. The solution described here connects logical RAID volumes through a PCIe fabric and can raise data rates to as much as 26 GB/s under the PCIe 4.0 specification. To understand how these benefits are realized, it helps to first examine the key components of the solution and how they work together.
Magnum IO GPUDirect Storage
The key advantage of the Magnum IO GPUDirect Storage solution is that it removes one of the major performance bottlenecks: it no longer uses system memory on the host to stage data on its way from the storage device to the GPU. Traditionally, data moving to the GPU relies on a bounce buffer in the CPU's system memory, where multiple copies of the data are made before it is finally transferred to the GPU. Moving large amounts of data along this path adds latency, degrades GPU performance, and consumes many CPU cycles on the host. With the Magnum IO GPUDirect Storage solution, the CPU is taken out of the data path and the inefficiency of the bounce buffer is avoided (Figure 1).
Figure 1: The Magnum IO GPUDirect Storage solution removes the CPU from the data path, avoiding bounce buffers and directly improving performance as the volume of transferred data grows.

The amount of data being transferred grows exponentially with the large, distributed data sets required by artificial intelligence (AI), machine learning (ML), deep learning (DL), and other data-intensive applications. These advantages apply whether data is stored locally or remotely, allowing petabytes of remote storage to be accessed faster than the page cache in CPU memory.
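For concreteness, applications reach this direct path through NVIDIA's cuFile API, which is the programming interface of GPUDirect Storage. The sketch below is a minimal, hedged example of that flow; the file path, transfer size, and omitted error handling are illustrative assumptions, not a reference implementation.

```c
/* Minimal GPUDirect Storage read sketch using the cuFile API.
 * Assumes CUDA and the cuFile/nvidia-fs stack are installed;
 * the path and size below are illustrative placeholders. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <cufile.h>

int main(void)
{
    const char  *path = "/mnt/nvme/dataset.bin";   /* hypothetical file */
    const size_t size = 64 << 20;                  /* 64 MiB transfer */

    cuFileDriverOpen();                            /* bring up the GDS driver */

    int fd = open(path, O_RDONLY | O_DIRECT);      /* O_DIRECT bypasses the page cache */

    CUfileDescr_t descr = {0};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t fh;
    cuFileHandleRegister(&fh, &descr);             /* register the file with cuFile */

    void *devPtr = NULL;
    cudaMalloc(&devPtr, size);                     /* destination buffer in GPU memory */
    cuFileBufRegister(devPtr, size, 0);            /* register the GPU buffer for DMA */

    /* DMA straight from storage into GPU memory: no CPU bounce buffer. */
    cuFileRead(fh, devPtr, size, 0 /* file offset */, 0 /* buffer offset */);

    cuFileBufDeregister(devPtr);
    cudaFree(devPtr);
    cuFileHandleDeregister(fh);
    close(fd);
    cuFileDriverClose();
    return 0;
}
```

The essential point is the single cuFileRead call: the transfer lands directly in GPU memory instead of being copied through a host-side buffer first.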
Optimizing RAID performance
The next element in the solution is RAID, which adds data redundancy and fault tolerance. Although software RAID can provide data redundancy, the underlying software RAID engine still relies on a reduced instruction set computer (RISC) architecture for operations such as parity calculations. When comparing write I/O latency at advanced RAID levels such as RAID 5 and RAID 6, hardware RAID remains much faster than software RAID because a dedicated processor performs these operations and provides write-back caching. In streaming applications, the long response times of software RAID can cause data to accumulate in the cache. Hardware RAID solutions avoid this cache build-up and include dedicated backup batteries to prevent data loss in the event of a catastrophic system power failure.
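To see why parity generation is costly on the host, the sketch below shows the XOR parity computation at the heart of RAID 5 in plain C. It is only an illustration of the work a software RAID stack performs on every full-stripe write (the stripe geometry and block size are arbitrary assumptions); a hardware RAID controller offloads exactly this kind of loop, along with write-back caching, to its own dedicated engine.

```c
/* Illustrative RAID 5 parity calculation: P = D0 ^ D1 ^ ... ^ D(n-1).
 * A software RAID engine runs loops like this on the host CPU for every
 * full-stripe write; hardware RAID offloads it to a dedicated processor. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

void raid5_parity(uint8_t *parity, const uint8_t *const *data,
                  size_t ndisks, size_t block_len)
{
    memcpy(parity, data[0], block_len);      /* start with the first data block */
    for (size_t d = 1; d < ndisks; d++)      /* fold in the remaining blocks */
        for (size_t i = 0; i < block_len; i++)
            parity[i] ^= data[d][i];
}
```

The same XOR is used in degraded mode to rebuild a missing block from parity plus the surviving blocks, which is why this operation sits on the critical path of both writes and rebuilds.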
Standard hardware RAID relieves the host of parity management, but a large amount of data must still pass through the RAID controller on its way to the NVMe drives, which complicates the data path. The answer to this problem is NVMe-optimized hardware RAID, which provides a simplified data path that moves data without passing through the firmware or the RAID controller chip, while still maintaining hardware-based protection and encryption services.
Hybrid PCIe architecture
PCIe Gen 4 is now the basic system interconnect within the storage subsystem, but a standard PCIe switch retains the same tree-based hierarchy as previous generations. This means that host-to-host communication requires non-transparent bridging (NTB) to cross partitions, which becomes complicated, especially in multi-host, multi-switch configurations. Solutions such as Microchip's PAX Advanced Fabric PCIe switches overcome these limitations because they support redundant paths and loops that are not possible with conventional PCIe. A fabric switch has two separate domains: a host virtual domain, dedicated to each physical host, and a fabric domain, containing all endpoints and fabric links. Transactions from a host domain are translated into IDs and addresses in the fabric domain, and vice versa, with non-hierarchical routing used for traffic in the fabric domain. This allows all hosts in the system to share the fabric links and the endpoints attached to the switch. Fabric firmware running on an embedded CPU virtualizes a PCIe-compliant switch with a configurable number of downstream ports. As a result, the switch always appears to the host as a standard single-level PCIe switch with directly attached endpoints, regardless of where those endpoints sit in the fabric. It can do this because the fabric switch intercepts all configuration-plane traffic from the host (including the PCIe enumeration process) and selects the optimal path. This allows endpoints such as GPUs to be bound to any host in the domain (Figure 2).
The following example (Figure 3) shows a dual-host PCIe fabric setup. Here, fabric virtualization allows each host to see a transparent PCIe topology with one upstream port, three downstream ports, and three endpoints attached to them, and each host can enumerate them correctly. Of particular interest in Figure 3 is an SR-IOV SSD with two virtual functions: virtual functions of the same drive can be shared with different hosts through Microchip's PAX Advanced Fabric PCIe switch.
The PAX fabric switch also supports direct cross-domain peer-to-peer transfers across the fabric, which reduces root-port congestion and further relieves the CPU performance bottleneck, as shown in Figure 4.
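Because the fabric firmware presents each host with a standard, single-level PCIe switch, no fabric-aware software is needed on the host: ordinary PCI enumeration simply works. As a rough illustration, the sketch below walks the Linux sysfs PCI tree and prints the devices a host would see behind its virtual switch; it is a generic enumeration example, not part of any Microchip tooling.

```c
/* List PCI devices as the host sees them via Linux sysfs.
 * Behind a PAX-style fabric switch, the virtualized downstream ports and
 * their endpoints show up here like any ordinary PCIe switch would. */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
    DIR *dir = opendir("/sys/bus/pci/devices");
    if (!dir)
        return 1;

    struct dirent *e;
    while ((e = readdir(dir)) != NULL) {
        if (e->d_name[0] == '.')
            continue;

        char path[512];
        unsigned vendor = 0, device = 0, class_code = 0;
        FILE *f;

        snprintf(path, sizeof path, "/sys/bus/pci/devices/%s/vendor", e->d_name);
        if ((f = fopen(path, "r"))) { fscanf(f, "%x", &vendor); fclose(f); }

        snprintf(path, sizeof path, "/sys/bus/pci/devices/%s/device", e->d_name);
        if ((f = fopen(path, "r"))) { fscanf(f, "%x", &device); fclose(f); }

        snprintf(path, sizeof path, "/sys/bus/pci/devices/%s/class", e->d_name);
        if ((f = fopen(path, "r"))) { fscanf(f, "%x", &class_code); fclose(f); }

        printf("%s  vendor=%04x device=%04x class=%06x\n",
               e->d_name, vendor, device, class_code);
    }
    closedir(dir);
    return 0;
}
```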
Performance optimization
Having explored all the components involved in optimizing data transfer between NVMe drives and GPUs, we can now combine them to achieve the desired result. The best way to show this is to walk through the steps graphically. Figure 5 shows the host CPU with its root port and the various configurations that lead to the best result. As shown on the left of Figure 5, a PCIe Gen 4 x4 NVMe drive (4.5 GB/s) is limited to a maximum data rate of 3.5 GB/s by root-port overhead, even with a high-performance NVMe controller. The SmartRAID controller, however, can build two RAID volumes from four NVMe drives each and, by aggregating multiple drives in a RAID logical volume while routing conventional PCIe peer-to-peer traffic through the root port (as shown on the right), raise the data rate to 9.5 GB/s. With cross-domain peer-to-peer transfers (bottom of the figure), the traffic can instead be routed over a fabric link rather than through the root port, reaching 26 GB/s, the highest rate possible with the SmartROC 3200 RAID controller. In this final scenario, the fabric switch provides a direct data path that does not pass through firmware while still maintaining hardware-based RAID protection and encryption services, unlocking the full potential of GPUDirect Storage.
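To put the quoted rates in context, the short calculation below works out the theoretical PCIe Gen 4 link bandwidth from the 16 GT/s lane rate and 128b/130b encoding. It is only back-of-the-envelope arithmetic: real throughput, such as the 3.5 GB/s per x4 drive and 26 GB/s aggregate figures above, is further reduced by protocol, controller, and RAID overheads.

```c
/* Back-of-the-envelope PCIe Gen 4 bandwidth: 16 GT/s per lane with
 * 128b/130b encoding, before protocol and controller overheads. */
#include <stdio.h>

int main(void)
{
    const double gt_per_lane = 16.0;                          /* GT/s, PCIe Gen 4 */
    const double encoding    = 128.0 / 130.0;                 /* 128b/130b line code */
    const double gb_per_lane = gt_per_lane * encoding / 8.0;  /* GB/s per lane */

    const int widths[] = { 4, 8, 16 };
    for (int i = 0; i < 3; i++)
        printf("x%-2d raw link bandwidth: %.1f GB/s\n",
               widths[i], gb_per_lane * widths[i]);
    return 0;
}
```

Against the x16 theoretical limit of roughly 31.5 GB/s, the 26 GB/s achieved over the fabric link represents a large fraction of the available bandwidth.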
Conclusion
High-performance PCIe fabric switches, such as Microchip's PAX, allow multiple hosts to share drives that support single-root I/O virtualization (SR-IOV), and allow pools of GPUs and NVMe SSDs to be dynamically partitioned and shared among multiple hosts.
Microchip's PAX fabric switch can dynamically reallocate endpoint resources to whichever host needs them. The solution also uses the SmartPQI driver supported by the SmartROC 3200 RAID controller family, so no custom driver is required. Microchip's SmartROC 3200 RAID controller is currently the only device capable of delivering the highest transfer rate of 26 GB/s. It offers extremely low latency, provides up to 16 PCIe Gen 4 lanes to the host, and is backward compatible with PCIe Gen 2. Combined with Microchip's Flashtec-based NVMe SSDs, the full potential of PCIe and Magnum IO GPUDirect Storage can be realized in multi-host systems. Taken together, these features make it possible to build powerful systems that meet the real-time needs of AI, ML, DL, and other high-performance computing applications.