Right now, the market for GPUs for use in machine learning is essentially a market of one: Nvidia.
AMD, the only other discrete GPU vendor of consequence, holds around 30 percent of total GPU sales, compared to Nvidia’s 70 percent. For machine-learning work, though, Nvidia’s lead is near-total. That is not just because the major clouds with GPU support are overwhelmingly Nvidia-powered, but because the GPU middleware used in machine learning is by and large Nvidia’s own CUDA.
AMD has long had plans to fight back. It’s been prepping hardware that can compete with Nvidia on performance and price, but it’s also ginning up a platform for vendor-neutral GPU programming resources: a way for developers to freely choose AMD when putting together a GPU-powered solution without worrying about software support.
AMD recently announced its next steps toward those goals. First is a new GPU product, the Radeon Vega, based on a new though previously unveiled GPU architecture. Second is a revised release of the open source software platform, ROCm, a software layer that allows machine-learning frameworks and other applications to leverage multiple GPUs.
Both pieces, the hardware and the software, matter equally. Both need to be in place for AMD to fight back.
AMD’s new star GPU performer: Vega
AMD has long focused on delivering the biggest bang for the buck, whether by way of CPUs or GPUs (or long-rumored combinations of the two). Vega, the new GPU line, is not simply meant to be a more cost-conscious alternative to the likes of Nvidia’s Pascal series. It’s meant to beat Pascal outright.
Some preliminary benchmarks released by AMD, as dissected by Hassan Mujtaba at WCCFTech, show a Radeon Vega Frontier Edition (a professional-grade edition of the GPU) beating the Nvidia Tesla P100 on the DeepBench benchmark by a factor of somewhere between 1.38 and 1.51, depending on which version of Nvidia’s drivers was in use.
Benchmarks are always worth taking with a jumbo-sized grain of salt, but even that much of an improvement is still impressive. What matters is at what price AMD can deliver that kind of improvement. A Tesla P100 retails for approximately $13,000, and no list price has been set yet for the Vega Frontier. Still, even offering the Vega at the same price as the competition is tempting, and falls in line with AMD’s general business approach.
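To make the price question concrete, here is a minimal Python sketch of the arithmetic. It uses only the figures quoted above (the roughly $13,000 P100 price and AMD’s claimed 1.38x to 1.51x DeepBench advantage) and assumes, as a simplification, that a DeepBench speedup translates directly into value for the buyer. The function names are illustrative, and the Vega Frontier’s actual price remains unknown.

```python
# Figures quoted in the article. The Vega Frontier price is not known,
# so this sketch solves for the price at which the two cards would tie.
P100_PRICE_USD = 13_000
SPEEDUP_RANGE = (1.38, 1.51)  # Vega Frontier vs. Tesla P100 on DeepBench, per AMD


def break_even_price(baseline_price, speedup):
    """Price at which the faster card matches the baseline's performance per dollar."""
    return baseline_price * speedup


def perf_per_dollar_gain(baseline_price, speedup, candidate_price):
    """Ratio of the candidate's performance per dollar to the baseline's."""
    return speedup * baseline_price / candidate_price


if __name__ == "__main__":
    for speedup in SPEEDUP_RANGE:
        price = break_even_price(P100_PRICE_USD, speedup)
        print(f"speedup {speedup:.2f}: break-even price ${price:,.0f}")
```

Under those assumptions, the break-even price works out to between $17,940 and $19,630. Anything below that range would give the Vega Frontier better DeepBench performance per dollar than the P100, and at price parity the gain is simply the speedup factor itself.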
AMD’s answer to CUDA: ROCm-roll
What matters even more for AMD to get a leg up, though, is not beating Nvidia on price, but ensuring its hardware is supported at least as well as Nvidia’s for common machine-learning applications.
By and large, software that uses GPU acceleration uses Nvidia’s CUDA libraries, which work only with Nvidia hardware. OpenCL, an open standard, provides vendor-neutral support across device types, but its performance isn’t as good as that of dedicated solutions like CUDA.
Rather than struggle to bring OpenCL up to snuff, a slow, committee-driven process, AMD has chosen to spin up its own open source GPU computing platform: ROCm, the Radeon Open Compute Platform. The theory is that it provides a language- and hardware-independent middleware layer for GPUs, primarily AMD’s own but theoretically any GPU. ROCm can talk to GPUs by way of OpenCL if needed, but it also provides its own direct paths to the underlying hardware.
There’s little question ROCm can provide major performance boosts to machine learning over OpenCL. A port of the Caffe framework to ROCm yielded something like an 80 percent speedup over the OpenCL version. What’s more, AMD is touting how the process of converting code to use ROCm can be heavily automated, another incentive for existing frameworks to try it. Support for other frameworks, like TensorFlow and MXNet, is also being planned.
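To give a rough sense of why such conversion lends itself to automation: ROCm’s HIP layer exposes runtime calls whose names closely mirror CUDA’s, so much of a port is mechanical renaming. The Python sketch below is a toy illustration of that idea, not AMD’s actual conversion tool. The mapping table is a small hand-picked subset, and the `translate` helper is hypothetical.

```python
import re

# A tiny, hand-picked subset of CUDA runtime names and their HIP counterparts.
# A real port covers far more of the API; this table is illustrative only.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Match whole identifiers only, so names that merely contain a CUDA call
# (for example a user-defined wrapper) are left untouched.
_PATTERN = re.compile(
    r"\b(" + "|".join(sorted(CUDA_TO_HIP, key=len, reverse=True)) + r")\b"
)


def translate(source: str) -> str:
    """Rename the known CUDA runtime identifiers in `source` to HIP names."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(1)], source)


if __name__ == "__main__":
    cuda_snippet = (
        "cudaMalloc(&d_buf, n);\n"
        "cudaMemcpy(d_buf, h_buf, n, cudaMemcpyHostToDevice);\n"
        "cudaFree(d_buf);\n"
    )
    print(translate(cuda_snippet))
```

Renaming alone does not finish a port. Kernel launch syntax, vendor-specific libraries, and performance tuning still need attention, which is presumably why AMD describes the process as heavily, rather than fully, automated.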
AMD is playing the long game
The ultimate goal AMD has in mind isn’t complicated: Create an environment where its GPUs can work as drop-in replacements for Nvidia’s in the machine-learning space. Do that by offering as good, or better, hardware performance for the dollar, and by ensuring the existing ecosystem of machine-learning software will also work with its GPUs.
In some ways, porting the software is the easiest part. It’s mostly a matter of finding manpower enough to convert the needed code for the most crucial open source machine-learning frameworks, and then to keep that code up to date as both the hardware and the frameworks themselves move forward.
What’s likely to be toughest of all for AMD is finding a foothold in the places where GPUs are offered at scale. All the GPUs offered in Amazon Web Services, Azure, and Google Cloud Platform are strictly Nvidia. Demand doesn’t yet support any other scenario. But if the next iteration of machine-learning software becomes that much more GPU-independent, cloud vendors will have one less excuse not to offer Vega or its successors as an option.
Still, any plans AMD has to bootstrap that demand are brave. They’ll take years to get up to speed, because AMD is up against the weight of a world that has for years been Nvidia’s to lose.