As GPU Computing hardware continues to advance, the ability to actually use this added compute power rests in the hands of programmers, who must write the code that manages the large volumes of data flowing through these massively parallel architectures.
To address the needs of GPU programmers, NVIDIA announced the NVIDIA® CUDA® Toolkit 4.0 Release Candidate on Friday, March 4th. While this update includes a long list of productivity improvements, two innovations are especially noteworthy for making better use of available system resources.
NVIDIA’s GPUDirect™ 2.0 technology supports peer-to-peer communication among GPUs within a single server or workstation. On GPGPU computing platforms built with multiple GPUs, data can be transferred directly between GPUs without the added step of copying it to system memory. With GPUDirect 2.0, which is included in the CUDA 4.0 Toolkit, kernels running on one GPU can directly read from, or write to, the memory of another GPU that resides on a common PCI Express bus. This change in programming methodology takes the system chipset, CPU memory controller and main system memory out of the data path for these operations.
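The peer-to-peer access described above can be sketched with the CUDA runtime API roughly as follows. This is a minimal illustration rather than NVIDIA's reference code: it assumes a machine with at least two peer-capable GPUs at device indices 0 and 1, the names `scaleFromPeer`, `peerAccessAvailable` and `runPeerDemo` are hypothetical, and error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Kernel launched on one GPU; `src` may point into a peer GPU's memory.
__global__ void scaleFromPeer(const float* src, float* dst, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i] * factor;
}

// True when devices 0 and 1 both exist and can address each other's memory.
bool peerAccessAvailable() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) return false;
    int fwd = 0, rev = 0;
    cudaDeviceCanAccessPeer(&fwd, 0, 1);
    cudaDeviceCanAccessPeer(&rev, 1, 0);
    return fwd && rev;
}

// Hypothetical demo: returns the number of mismatched elements,
// or -1 if peer-to-peer access is unavailable on this machine.
int runPeerDemo(int n) {
    if (!peerAccessAvailable()) return -1;
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n);
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    float *onGpu1 = nullptr, *copyOnGpu0 = nullptr, *resultOnGpu0 = nullptr;

    // Place the input data in GPU 1's memory.
    cudaSetDevice(1);
    cudaMalloc(&onGpu1, bytes);
    cudaMemcpy(onGpu1, host.data(), bytes, cudaMemcpyHostToDevice);

    // Switch to GPU 0 and let it map GPU 1's memory (the flags argument must be 0).
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc(&copyOnGpu0, bytes);
    cudaMalloc(&resultOnGpu0, bytes);

    // (a) Explicit GPU-to-GPU copy: destination on device 0, source on device 1.
    cudaMemcpyPeer(copyOnGpu0, 0, onGpu1, 1, bytes);

    // (b) A kernel running on GPU 0 reads GPU 1's buffer directly.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scaleFromPeer<<<blocks, threads>>>(onGpu1, resultOnGpu0, n, 2.0f);
    cudaDeviceSynchronize();

    // Bring both results back to the host and verify them.
    std::vector<float> copied(n), scaled(n);
    cudaMemcpy(copied.data(), copyOnGpu0, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(scaled.data(), resultOnGpu0, bytes, cudaMemcpyDeviceToHost);
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (copied[i] != host[i] || scaled[i] != 2.0f * host[i]) ++mismatches;

    // Clean up on the device that owns each allocation.
    cudaDeviceDisablePeerAccess(1);
    cudaFree(copyOnGpu0);
    cudaFree(resultOnGpu0);
    cudaSetDevice(1);
    cudaFree(onGpu1);
    return mismatches;
}
```

Calling `runPeerDemo(1024)` from a host program should return 0 on a suitable dual-GPU system and -1 where peer access is not supported. Whether two particular GPUs can act as peers depends on the hardware and bus topology, so checking `cudaDeviceCanAccessPeer` first is the safer pattern.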
Future releases of CUDA are expected to take this approach a step further by allowing GPUs located on servers within an InfiniBand cluster to directly access each other’s data and memory. In this fashion, peer-to-peer communication can occur over the PCI Express bus at the system level, as well as via InfiniBand links across multiple servers. This shift will further reduce the processing load on system CPUs and allow for more efficient use of the hundreds to thousands of GPU cores within such systems.
Another innovation in CUDA Toolkit 4.0 is NVIDIA’s Unified Virtual Addressing (UVA). Taking advantage of the 64-bit memory addressing present in Fermi architecture GPUs, UVA provides a single merged address space for main system (CPU) memory and the combined memory of the GPUs. As a result, developers no longer have to keep track of host pointers and device pointers separately, because CUDA can keep track of where the data is stored.
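One practical consequence of UVA is that a copy no longer needs an explicit direction. The sketch below, offered as an illustration under assumptions rather than a definitive implementation, passes `cudaMemcpyDefault` so that the runtime infers the direction of each transfer from the pointer values. The function name `runUvaDemo` is hypothetical, device 0 is assumed to be the target GPU, pinned host memory is used so that the host buffers are known to the CUDA runtime, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Hypothetical demo: round-trips n floats host -> device -> host using
// cudaMemcpyDefault. Returns the number of mismatched elements, or -1 if
// no GPU with unified addressing is available.
int runUvaDemo(int n) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 1) return -1;

    // UVA requires hardware and platform support; the device properties report it.
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    if (!prop.unifiedAddressing) return -1;

    cudaSetDevice(0);
    const size_t bytes = n * sizeof(float);
    float *hostIn = nullptr, *hostOut = nullptr, *dev = nullptr;
    cudaMallocHost(&hostIn, bytes);   // pinned host memory
    cudaMallocHost(&hostOut, bytes);  // pinned host memory
    cudaMalloc(&dev, bytes);          // device memory

    for (int i = 0; i < n; ++i) {
        hostIn[i] = static_cast<float>(i);
        hostOut[i] = -1.0f;
    }

    // No HostToDevice / DeviceToHost here: the runtime works out where each
    // pointer lives from its value in the unified address space.
    cudaMemcpy(dev, hostIn, bytes, cudaMemcpyDefault);
    cudaMemcpy(hostOut, dev, bytes, cudaMemcpyDefault);

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hostOut[i] != hostIn[i]) ++mismatches;

    cudaFree(dev);
    cudaFreeHost(hostIn);
    cudaFreeHost(hostOut);
    return mismatches;
}
```

Calling `runUvaDemo(1024)` should return 0 where UVA is supported. Note that a unified address space simplifies copies and pointer bookkeeping; it does not by itself let host code dereference a pointer to device memory.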
Where is this technology ultimately heading? Time will tell, but I can already imagine banks of servers, each containing multiple GPUs, clustered with massive solid-state storage devices (SSDs), operating in a unified peer-to-peer fashion. New supercomputing hybrids will continue to evolve over the next few years as designers leverage the unique advantages of CPUs and GPUs to maximize performance, and that promises to make the future of GPU Computing quite exciting.
To obtain further information on the features and capabilities of NVIDIA’s CUDA Toolkit 4.0, please visit http://www.nvidia.com/cuda.
As a member of the NVIDIA Tesla Preferred Partner (TPP) program, Trenton can design a GPU Computing platform that meets your unique specifications. Give us a call at 770-287-3100 to discuss your needs.






