AMD’s Opteron, unlike other 64-bit CPUs, can run the 32-bit software companies use now, as well as the 64-bit software they will increasingly rely on. Tom Yager looks at the newcomer.
The launch two months ago of Opteron, Advanced Micro Devices’ new 64-bit processor, was auspicious in many ways. At the time of its introduction, AMD revealed Standard Performance Evaluation Corporation (SPEC) and Transaction Processing Performance Council (TPC) benchmark results that confirm what AMD has been saying all along: Opteron is not only a very capable 32-bit CPU, but even in its first generation its 64-bit performance compares well against the market-leading processors Power, Alpha, and Itanium 2 from IBM, Hewlett-Packard, and Intel, respectively.
Opteron is already available in two- and four-processor servers now shipping. And users will have their pick of 64-bit operating systems. Microsoft committed to a 64-bit edition of Windows Server 2003, Red Hat voiced its support, SuSE Linux and MandrakeSoft are already shipping commercial Linux for Opteron, and Sun Microsystems has announced plans to build an Opteron version of Solaris. IBM perhaps made the biggest splash with a dual commitment: the leading Linux database, DB2, is now available as a 64-bit downloadable beta for Opteron. And IBM will soon build and market Opteron-based servers, potentially making Opteron a serious contender against Intel’s universally adopted Xeon and fast-rising Itanium 2.
Opteron is a contender in entry-level HPC (high-performance computing) servers and workstations today, with several system and component manufacturers already delivering Opteron-based products. But chief rivals Intel and IBM aren’t standing still. Indeed, both vow to bury Opteron, with Itanium 2 and Power5 respectively. The only questions we can examine now relate to Opteron’s strengths and weaknesses compared with other CPUs, mainly Intel’s. The sword hanging over AMD’s head is the planned reshaping of the CPU market by much larger competitors.
AMD’s primary challenger is undoubtedly Intel. Xeon owns the 32-bit server market and has a healthy chunk of overall server sales up to the mid-tier level. Intel has given Xeon faster bus and memory speeds, and Xeon scales up to 32 processors per system. However, nothing Intel does can give Xeon the 64-bit computation and data-handling capabilities that make Opteron and Itanium 2 so appealing on large-scale database and scientific and technical workloads. But Itanium 2 has a longer track record, more buy-in from major vendors (including IBM), and the ability to extract substantial speed gains from smarter compilers. AMD is hoping that Itanium 2 will be hampered by a higher price tag, higher power consumption, and the inability to run 32-bit x86 software at native speeds.
Native versus emulated
Opteron’s main claim to fame is a capability no other 64-bit CPU possesses: it can run 32-bit x86 applications at the chip’s full clock speed. The 1.8GHz Opteron runs 32-bit applications at speeds comparable to or exceeding a 2GHz Xeon MP, and that will improve as AMD makes Opteron’s clock faster. Booted in Legacy Mode, Opteron is indistinguishable from an Athlon CPU. It runs 32-bit and 16-bit applications. A four-processor Opteron server will boot into DOS, a capability that can be a lifesaver if you must run system recovery tools such as Ghost or reorganise your disk with Partition Magic. Thirty-two-bit operating systems including Windows, Linux, and Unix install and run in Opteron’s Legacy Mode exactly as they would on Xeon.
In Long Mode, Opteron’s 64-bit hardware kicks in, removing memory size barriers. It also turns on access to a large set of 64-bit registers, the temporary storage that holds operands of mathematical computations and points to addresses in memory. Long Mode is activated by a 64-bit operating system. Remarkably, it will still run 32- and 16-bit software, although not DOS. Thirty-two-bit applications running in Long Mode are unaware that they’re running in a 64-bit environment. However, the operating system’s base services, covering such fundamentals as memory management and disk storage, handle data at 64-bit speed and have access to Opteron’s extended registers. That removes barriers and conveys huge performance benefits to 32-bit applications hosted in Long Mode.
Intel has announced plans to release a software emulator that allows its Itanium family of 64-bit CPUs to run x86 programs, claiming that performance under emulation will be comparable to that of a 1.5GHz Xeon. That’s about half of the top speed of a Xeon or Pentium 4, and we can’t predict the performance impact that running x86 emulation will have on native Itanium programs running at the same time. If history is any guide, maintaining performance and compatibility will be major challenges. However, 32-bit x86 emulation on Alpha and PowerPC processors does work. An open source project called Bochs emulates x86 well enough to run x86 Linux and Windows NT on a variety of processor architectures. But software CPU emulation has proved to be something one only does as an academic exercise or when no alternatives exist. Perfect emulation is impossible. Unless Intel has achieved some breakthrough in software CPU mimicry, its approach to 32-bit software support will be disappointing. It certainly poses no threat to AMD.
Not another x86
Aside from its x86 compatibility, AMD’s architecture addresses many of the problems of other CPUs. A multichannel, high-speed I/O bus, dubbed HyperTransport, links the Opteron CPU to memory, peripherals, and other CPUs in the same system. Memory and bus controllers are external components in other CPU designs; AMD integrated these functions into the core processor chip, resulting in simpler system designs. On-chip controllers also reduce latency to external components. In other words, the processor spends less time waiting for data to be sent to or retrieved from memory and devices. I/O speed and main memory throughput are key performance criteria for large-scale applications, and they are usually more important than raw calculation speed.
The Opteron’s L1 (Level 1) cache, the fast on-chip memory that sits between the CPU’s execution circuitry and main memory, is the same size as that of the Athlon MP, the 32-bit AMD server CPU on which Opteron is based. The Opteron’s L2 (Level 2) cache has quadrupled in size from Athlon’s 256KB to a full megabyte (note that there is a desktop Athlon now with a 512KB L2 cache, and a matching Athlon MP is planned).
The maximum speed of Opteron’s channels to main memory far outstrips Xeon’s: Opteron can pump a total of 5.4GBps of data through its dual memory interfaces. Xeon’s top memory throughput is 3.2GBps. Itanium 2 runs four channels to memory, giving it a maximum potential throughput of 6.4GBps.
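A quick back-of-the-envelope comparison puts those figures side by side. The sketch below uses only the peak numbers and channel counts quoted above; the dictionary and function names are ours, chosen for illustration, and peak throughput says nothing about what a real workload will sustain.

```python
# Peak memory throughput quoted in the article, in GB per second,
# plus the number of memory channels where the article states one.
PEAK_GBPS = {"Opteron": 5.4, "Xeon": 3.2, "Itanium 2": 6.4}
CHANNELS = {"Opteron": 2, "Itanium 2": 4}


def ratio(cpu_a, cpu_b):
    """Return how many times more peak throughput cpu_a has than cpu_b."""
    return PEAK_GBPS[cpu_a] / PEAK_GBPS[cpu_b]


def per_channel(cpu):
    """Return peak throughput per memory channel, in GB per second."""
    return PEAK_GBPS[cpu] / CHANNELS[cpu]


if __name__ == "__main__":
    print(f"Opteron vs Xeon: {ratio('Opteron', 'Xeon'):.2f}x")
    print(f"Itanium 2 vs Opteron: {ratio('Itanium 2', 'Opteron'):.2f}x")
    for cpu in CHANNELS:
        print(f"{cpu}: {per_channel(cpu):.1f}GBps per channel")
```

On these numbers Opteron has roughly 1.7 times Xeon’s peak, and each of its two channels carries more than each of Itanium 2’s four.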
Opteron innately implements NUMA (nonuniform memory access) to greatly accelerate access to main memory. NUMA gives each CPU in an SMP (symmetric multiprocessor) system a dedicated bank of RAM. This separation isn’t visible to users or software; if you have a dual processor system with 3GB of memory per processor, your OS reports 6GB total. Using the HyperTransport controllers built into the Opteron CPU, each processor transparently negotiates with others for access to their memory. This approach is aided by the fact that the HyperTransport bus controllers are on the processor chip.
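To make the idea of per-processor banks behind a single memory total concrete, here is a minimal sketch. It assumes each processor’s bank occupies one contiguous slice of a flat physical address space, which is a simplification of ours and not a description of how Opteron systems actually lay out memory; the function names are hypothetical.

```python
GB = 1024 ** 3


def total_memory(bank_sizes):
    """Return the single memory total the OS would report."""
    return sum(bank_sizes)


def home_node(address, bank_sizes):
    """Return the index of the processor whose bank holds `address`.

    Assumes banks are laid out back to back in one flat address space,
    which is an illustrative simplification.
    """
    if address < 0:
        raise ValueError("address must be non-negative")
    base = 0
    for node, size in enumerate(bank_sizes):
        if address < base + size:
            return node
        base += size
    raise ValueError("address is beyond installed memory")


if __name__ == "__main__":
    banks = [3 * GB, 3 * GB]  # the dual-processor example from the text
    print(total_memory(banks) // GB, "GB reported")
    print("Address at 1GB lives on node", home_node(1 * GB, banks))
    print("Address at 4GB lives on node", home_node(4 * GB, banks))
```

Software sees one 6GB pool; only the hardware needs to know which processor owns a given address.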
The typical downside of NUMA is that access to remote memory (memory attached to another processor) is very slow. Traditionally, maximum performance is achieved by running a NUMA SMP system in what amounts to a cluster configuration: Each CPU has its own distinct workload, communicating with other CPUs only when it needs to pass application data back and forth. Opteron’s HyperTransport speeds access to remote memory. What’s more, Opteron can work as either an SMP or NUMA system, even in Legacy Mode. The 32-bit retail version of Windows Server 2003 has a boot configuration setting that enables NUMA on Opteron. In our testing, it worked like a charm. Sixty-four-bit Linux distributions for Opteron have separate kernels for NUMA and SMP modes. It’s easy to switch between them.
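The cost of remote access can be pictured as a weighted average of local and remote latency. The sketch below is a simple model of ours, not a measurement: the latency values are hypothetical placeholders, and real systems are further affected by caches and contention.

```python
def average_latency(local_ns, remote_ns, local_fraction):
    """Return the mean memory latency in nanoseconds.

    local_fraction is the share of accesses (0.0 to 1.0) served from the
    processor's own bank; the rest go to another processor's bank.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    # Hypothetical latencies chosen only for illustration, not measurements.
    LOCAL_NS, REMOTE_NS = 100.0, 150.0
    for fraction in (1.0, 0.75, 0.5):
        mean = average_latency(LOCAL_NS, REMOTE_NS, fraction)
        print(f"{fraction:.0%} local: {mean:.1f}ns")
```

The narrower the gap between local and remote latency, the less it matters where a workload’s data happens to live, which is why faster remote access makes it practical to run the same machine in either mode.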
The immediate benefit of NUMA for a 32-bit OS is the elimination of the 4GB memory ceiling associated with x86. In Legacy Mode, Opteron is limited to 4GB of memory as are Xeon and Athlon MP. But in Opteron’s case, that’s 4GB per CPU. A dual-processor system can have as much as 8GB of RAM, a quad-processor machine can host as much as 16GB, and so on. Memory size limits effectively vanish when Opteron is switched into Long Mode. Then it can address as much as a terabyte (1024GB) of physical memory and 256TB of virtual memory.
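The memory ceilings above follow from simple powers of two. The sketch below derives the address widths implied by the sizes quoted in the text and reproduces the per-CPU Legacy Mode totals; the function names are ours, and the bit widths are arithmetic consequences of those sizes.

```python
GB = 2 ** 30
TB = 2 ** 40


def address_bits(size_in_bytes):
    """Return how many address bits are needed to reach size_in_bytes."""
    return (size_in_bytes - 1).bit_length()


def legacy_numa_ceiling(cpu_count, per_cpu_bytes=4 * GB):
    """Return the Legacy Mode ceiling described in the text: 4GB per CPU."""
    return cpu_count * per_cpu_bytes


if __name__ == "__main__":
    print("4GB needs", address_bits(4 * GB), "address bits")
    print("1TB needs", address_bits(1 * TB), "address bits")
    print("256TB needs", address_bits(256 * TB), "address bits")
    for cpus in (1, 2, 4):
        ceiling_gb = legacy_numa_ceiling(cpus) // GB
        print(f"{cpus} CPU(s) in Legacy Mode: up to {ceiling_gb}GB")
```

A 4GB limit corresponds to 32-bit addresses, while the Long Mode figures correspond to 40 bits of physical and 48 bits of virtual address.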
Can’t stand still
Opteron has two Achilles heels: cache speed and floating-point performance. Intel has caching down to a science. Opteron’s cache is markedly slower than those in Xeon and Itanium 2, and both Intel CPUs can utilise an L3 (Level 3) cache. It helps that Opteron’s caches are larger, but for the most part, size is something a CPU designer resorts to only if he or she can’t get speed. Fortunately, fast main memory access and the NUMA architecture help compensate for Opteron’s slower cache.
Intel and IBM processors excel at floating-point maths, a capability that plays a critical role in scientific, technical, and digital media software. Opteron’s integer performance is on a par with 64-bit competitors, but its shortcomings in floating point give rivals powerful ammunition to use against AMD. That competition will intensify with the delivery of IBM’s Power5 (later this year or early next) and future generations of Itanium. The next major target for all CPU makers will be multicore designs. Multicore will be the norm within five years. It is already on the roadmap for Power, Itanium, and Sparc.
AMD’s 64-bit debut was nothing short of extraordinary. But its engineers can’t rest. They’ll have to work doubly hard to match competitors’ advances in cache, floating-point, and multicore technology. That’s an expensive undertaking for a company that’s been losing money quarter after quarter. However, no one expected Opteron would be so capable right out of the chute. Intel and IBM should brace for more surprises from AMD.