In this post, we'll take an in-depth look at communication bottlenecks on the PCIe bus and how some of Exxact's latest systems are designed to push the limits of intranode GPU-to-GPU communication, with far-reaching benefits for GPU-heavy fields such as Machine Learning and Life Science Research.
Before we discuss the advantages and disadvantages of different PCIe bus layouts, it is first necessary to understand why the communication speed of the PCIe bus has become so important. Until the GPU computing revolution, which began in earnest in 2007, the PCIe bus was generally only used for communication to and from disk, or to other nodes through interconnects such as InfiniBand. Data being written to disk or sent over the network resides in DRAM attached to the CPUs, so such communication usually flows from DRAM, through the CPU and its memory controller hub (MCH), over the PCIe bus, and onto the attached device. A typical PCIe bus layout for such a node is shown in Figure 1.
Figure 1: Typical PCIe bus layout of a traditional dual-socket node.
When it comes to GPU computing, where it is important to be able to send data between individual GPUs as quickly as possible, there are three main problems with this traditional design.
The first is that the GPUs are split into two domains, with separate memory controller hubs attached to different CPU sockets. The nature of the QPI link connecting the two CPUs is such that a direct P2P copy between GPU memories is not possible if the GPUs reside in different PCIe domains. A copy from GPU 0 memory to GPU 2 memory therefore requires first copying over the PCIe link to memory attached to CPU 0, then across the QPI link to CPU 1, and finally over PCIe again to GPU 2. As you might imagine, this process adds a significant amount of overhead in both latency and bandwidth.
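To make this concrete, here is a minimal CUDA runtime sketch (a generic illustration written for this discussion, not code from the original post) that checks whether GPU 0 can reach GPU 2 directly and, if not, stages the copy through host memory, which is effectively what happens when the GPUs sit in different PCIe domains:

```
// Sketch: direct P2P copy when the topology allows it, otherwise stage the
// transfer through host (CPU-attached) DRAM.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const size_t bytes = 64 << 20;            // 64 MB test buffer
    int canP2P = 0;
    cudaDeviceCanAccessPeer(&canP2P, 0, 2);   // can GPU 0 access GPU 2 directly?

    void *src = nullptr, *dst = nullptr;
    cudaSetDevice(0); cudaMalloc(&src, bytes);
    cudaSetDevice(2); cudaMalloc(&dst, bytes);

    if (canP2P) {
        // Same PCIe domain: enable peer access and copy directly over PCIe.
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(2, 0);
        cudaMemcpyPeer(dst, 2, src, 0, bytes);
    } else {
        // Different PCIe domains (e.g. across QPI): the data has to pass
        // through host DRAM, which is where the extra latency and lost
        // bandwidth come from.
        std::vector<char> staging(bytes);
        cudaSetDevice(0); cudaMemcpy(staging.data(), src, bytes, cudaMemcpyDeviceToHost);
        cudaSetDevice(2); cudaMemcpy(dst, staging.data(), bytes, cudaMemcpyHostToDevice);
    }
    printf("Direct P2P between GPU 0 and GPU 2: %s\n", canP2P ? "yes" : "no");

    cudaSetDevice(0); cudaFree(src);
    cudaSetDevice(2); cudaFree(dst);
    return 0;
}
```

On the topology of Figure 1, the check fails for the 0 and 2 pair but succeeds for 0 and 1, which mirrors the bandwidth and latency differences measured below.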
The second is that the number of PCIe lanes available is limited to what the CPU provides. In the case of the current generation of Intel Haswell CPUs, such as the E5-26XX v3 series, that is 40 lanes per CPU. A dual-socket system is therefore limited to a maximum of 4 GPUs per node, and a single-socket system to only 2 GPUs, if each GPU is to have the full x16 PCIe bandwidth. Using more GPUs requires a multi-node configuration, which is expensive, and interconnects such as InfiniBand are relatively slow compared to PCIe P2P communication. The small number of leftover PCIe lanes also limits the bandwidth of the interconnect that can be installed, and in the example above leads to unwanted heterogeneity in how efficiently different GPUs can reach the internode link. For example, GPUs 2 and 3 must go through the CPUs to communicate over the InfiniBand link.
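To make the lane budget concrete (simple arithmetic, not figures quoted from the original post): two GPUs at x16 already consume 2 × 16 = 32 of the 40 lanes, leaving only 8 lanes per CPU for everything else, including storage controllers and the internode interconnect.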
The third problem is that even for pairs of GPUs that can communicate directly via P2P copies, the nature of the MCH is such that the full PCIe bandwidth cannot be utilized.
A node's PCIe topology and its bandwidth limitations can be investigated further using two tools, one provided with the NVIDIA driver and one with the CUDA toolkit. The first is the nvidia-smi command, which can be used to display the PCIe topology:
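On a node laid out like Figure 1, the topology query produces a matrix of roughly the following form (an illustrative sketch rather than output captured from a specific machine; the exact legend and CPU affinity values vary with driver version and hardware):

```
$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      PHB     SYS     SYS     0-7
GPU1    PHB      X      SYS     SYS     0-7
GPU2    SYS     SYS      X      PHB     8-15
GPU3    SYS     SYS     PHB      X      8-15

Legend:
  PHB = connection traverses a PCIe host bridge (GPUs on the same CPU)
  SYS = connection traverses PCIe plus the inter-socket (QPI) link
```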
This table shows us what Figure 1 above depicts graphically: the GPUs sit in pairs, 0 & 1 and 2 & 3, with each pair connected via a PCIe host bridge. Using the p2pBandwidthLatencyTest utility provided with the NVIDIA CUDA Samples, we can easily visualize the bottlenecks introduced by such a traditional PCIe topology.
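The test ships as source with the CUDA samples; assuming a default CUDA installation (the path below is typical but may differ between CUDA versions, and you may need to copy the samples tree somewhere writable first), it can be built and run as follows:

```
$ cd /usr/local/cuda/samples/1_Utilities/p2pBandwidthLatencyTest
$ make
$ ./p2pBandwidthLatencyTest
```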
The difference in communication bandwidth when the QPI connection is involved is immediately apparent: ~12 GB/s vs. ~19 GB/s. What is less obvious, but will become clearer when we look at the alternative motherboard designs below, is that because the PCIe host bridge is involved, the achievable bidirectional bandwidth even in the P2P case falls short of the theoretical x16 PCIe Gen 3.0 limit of 32 GB/s. This communication bottleneck also shows up in the communication latency between GPUs, which the p2pBandwidthLatencyTest tool reports as well.
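For reference, the 32 GB/s figure comes directly from the link parameters: PCIe Gen 3.0 signals at 8 GT/s per lane with 128b/130b encoding, so each lane carries about 8 × 128/130 ≈ 0.985 GB/s of payload per direction. A x16 link therefore provides roughly 16 × 0.985 ≈ 15.75 GB/s per direction, or about 31.5 GB/s bidirectionally, which is commonly rounded to 32 GB/s; packet and protocol overhead further reduce what any real transfer can achieve.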
Here it is immediately apparent that the communication latency between GPUs in different PCIe domains is much higher than between GPUs that can communicate via P2P.
The net result of these issues is that, while often economical, the traditional PCIe bus design of single- and multi-socket motherboards is not well suited to modern multi-GPU accelerated software. Fortunately, the design of PCIe is such that one is not limited to this traditional CPU-centric approach to motherboard design. Using PCIe (PLX) switch chips (Figure 2), which are currently available in 48-, 80- and 96-lane designs, it is possible to build GPU-centric motherboards that maximize the intranode GPU-to-GPU communication potential while remaining very cost-effective. This is the approach Exxact has taken with its range of GPU-optimized workstation and server designs. In the following sections, we will examine each of the available designs and highlight the pros and cons of each.
The Spectrum TXN003-0128N, better known as the Deep Learning Dev Box, was first offered by Exxact in January 2014, and the design was subsequently used as the basis for NVIDIA's DIGITS DevBox (https://developer.nvidia.com/devbox). This system is designed to provide an optimal balance between PCIe communication speed across 4 GPUs and price, in a desktop form factor. The PCIe topology is based on combining two PLX 8747 switches so that 4 GPUs, each with full x16 bandwidth, can be hosted on a single-socket system while P2P communication is supported between all 4 GPUs. The PCIe topology is shown in Figure 3 below.
In contrast to the traditional design, the use of two cost-effective PLX 8747 switches makes it possible to host 4 GPUs in the same PCIe domain and for the system to need only one CPU socket. This allows Exxact to offer the complete system, with the full Deep Learning software stack, for only $8999. P2P communication between all 4 GPUs is possible here, although there is still an x16 (and MCH) bottleneck between the two banks of GPUs. However, as can be seen from the p2pBandwidthLatencyTest results, the communication performance between the 4 GPUs is much improved compared to the traditional motherboard design.
Not only does the bandwidth between GPU pairs 0 & 1 and 2 & 3 improve significantly, getting closer to the theoretical maximum of 32 GB/s because the MCH link is replaced with a direct PLX switch, but so does the bandwidth between the two GPU banks, although there is still a drop due to the MCH and the fact that all communication must ultimately share the single x16 connection between the two banks. The result is a very cost-effective yet high-performance GPU-centric workstation. The design also greatly improves communication latency.
If we compare this to the traditional design above, we can see that in the worst case the latency between any pair of GPUs is no more than 6.31 microseconds, a huge improvement over the 16.86 microseconds of the traditional design. This is why this design is so effective as a workstation for machine learning applications.
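Since all 4 GPUs in this design sit in one PCIe domain, an application can simply enable peer access between every pair that reports P2P capability. The following is a minimal generic CUDA runtime sketch (written for this discussion, not taken from the original post):

```
// Sketch: enable P2P access between every pair of GPUs that supports it.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i);
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, i, j);
            if (ok) cudaDeviceEnablePeerAccess(j, 0);  // direct copies over the PLX switches
            printf("GPU %d -> GPU %d: %s\n", i, j, ok ? "P2P" : "via host memory");
        }
    }
    return 0;
}
```

On the Dev Box every pair should report P2P, whereas on the traditional dual-socket layout only the pairs behind the same host bridge do.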
Often, having 4 GPUs on a node is not enough horsepower. Of course, one can always connect nodes together using high-speed links, but doing so is expensive and introduces a major communication bottleneck. The solution is Exxact's Tensor TXR430-1500R 8 GPU system.
This effectively provides two Deep Learning Dev Boxes in the same node. The topology of each bank of 4 GPUs is the same as in the Dev Box above. The design has the obvious limitation that P2P communication is not possible between the two banks of 4 GPUs, as seen in the bandwidth figures below, but the use of cost-effective PLX 8747 switches allows for a very efficient node design. In fact, this 8-way system can prove cheaper than two separate Deep Learning Dev Boxes while providing the flexibility of having 8 GPUs available on a single node, with the savings of needing only one case, one disk system, one set of CPU memory DIMMs, and so on.
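One practical way to work with this limitation (a suggested usage pattern, not something prescribed by Exxact) is to pin each job to one bank of four GPUs using CUDA_VISIBLE_DEVICES, so that heavy P2P traffic never needs to cross between banks; train_model_a and train_model_b below are placeholder program names:

```
# Bank 0: GPUs 0-3 behind the first pair of PLX switches
CUDA_VISIBLE_DEVICES=0,1,2,3 ./train_model_a &
# Bank 1: GPUs 4-7 behind the second pair of PLX switches
CUDA_VISIBLE_DEVICES=4,5,6,7 ./train_model_b &
```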
As mentioned above, there are currently three versions of the PLX PCIe switch: 48-lane, 80-lane and 96-lane. Designs 1 and 2 above use the cost-effective 48-lane switch, but by moving to a higher lane-count switch it is possible, within certain spacing limitations imposed by the requirements of PCIe, to build systems with very high-bandwidth peer-to-peer support throughout the system's PCIe tree. Exxact offers two such designs, one supporting up to 8 GPUs and the other up to a whopping 20 GPUs. The design of the 8 GPU system (Tensor TXR550-512R) is as follows.
It is the first system to offer peer-to-peer support between all 8 GPUs.