Thursday, 27 August 2015

DirectX 12 tested: An early win for AMD and disappointment for Nvidia



First DX12 gaming benchmark shows R9 290X going toe-to-toe with a GTX 980 Ti.


Windows 10 brings a slew of features to the table—the return of the Start menu, Cortana, the Xbox App—but the most interesting for gamers is obvious: DirectX 12 (DX12). The promise of a graphics API that allows console-like low-level access to the GPU and CPU, as well as improved performance for existing graphics cards, is tremendously exciting. Yet for all the Windows 10 information to trickle out in the three weeks since the OS launched, DX12 has remained the platform's most mysterious aspect. There's literally been no way to test these touted features and see just what kind of performance uplift (if any) there is. Until now, that is.
Enter Oxide Games' real-time strategy game Ashes of the Singularity, the very first publicly available game that natively uses DirectX 12. Even better, Ashes has a DX11 mode too. For the first time, we can make a direct comparison between the real-world (i.e. actual game) performance of the two APIs across different hardware. While earlier benchmarks like 3DMark's API Overhead feature test were interesting, they were entirely synthetic. Such tests only focused on the maximum number of draw calls per second (which allows a game engine to draw more objects, textures, and effects) achieved by each API.

What's so special about DirectX 12?

DirectX 12 features an entirely new programming model, one that works on a wide range of existing hardware. On the AMD side, that means any GPU featuring GCN 1.0 or higher (cards like the R9 270, R9 290X, and Fury X) is supported, while Nvidia says anything from Fermi (400-series and up) will work. Not every one of those graphics cards will support every feature of DirectX 12 though, because the API is split into different feature levels. These include extra features like Conservative Rasterization, Tiled Resources, Raster Order Views, and Typed UAV Formats.
Some of those features are interesting and very technical (I refer you to this handy glossary if you're interested in exactly what some of them do). But the good news is that the most important features of DirectX 12 are supported across the board. In theory, that means most people should see some sort of performance uplift when moving to DX12. And AMD has been particularly vocal about the performance of the new API, a move that's undoubtedly tied to its poor DX11 performance (particularly on low-end CPUs) compared to Nvidia.

Diagram illustrating the difference between DX11 and DX12 graphics pipeline.

Before a graphics card renders a scene, the CPU first has to send instructions to the GPU. The more complex the scene, the more draw calls need to be sent. Under DX11, Nvidia's driver tended to process those draw calls more efficiently than AMD's, leading to more consistent performance. However, both were held back by DX11. GPUs mostly consist of thousands of small cores (shaders), so they tend to excel at parallel workloads. But DX11 was largely serial in its thinking: it sends one command to the GPU at a time, usually from just one CPU core. In contrast, DX12 introduces command lists. These bundle together commands needed to execute a particular workload on the GPU. Because each command list is self-contained, the driver can pre-compute all the necessary GPU commands up-front and in a free-threaded manner across any CPU core. The only serial process is the final submission of those command lists to the GPU, which is theoretically a highly efficient process. Once a command list hits the GPU, it can then process all the commands in a parallel manner rather than having to wait for the next command in the chain to come through. Thus, DX12 increases performance.
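To make the command list idea a little more concrete, here is a toy model in Python. It is not the Direct3D 12 API: the function names, the workloads, and the string "commands" are all hypothetical, chosen only to show the shape of the approach. Independent command lists are recorded on separate CPU threads, and only the final hand-off to the queue is serial.

```python
from concurrent.futures import ThreadPoolExecutor


def record_command_list(workload_name, draw_count):
    """Build one self-contained list of toy commands for a workload."""
    return [f"{workload_name}:draw:{i}" for i in range(draw_count)]


def submit(queue, command_lists):
    """Serial step: append each finished command list to the simulated GPU queue."""
    for commands in command_lists:
        queue.extend(commands)


# Hypothetical workloads: (name, number of draw commands).
workloads = [("terrain", 3), ("units", 4), ("effects", 2)]

# Recording happens on worker threads because the lists do not depend on each other.
with ThreadPoolExecutor(max_workers=3) as pool:
    command_lists = list(pool.map(lambda w: record_command_list(*w), workloads))

gpu_queue = []
submit(gpu_queue, command_lists)
print(len(gpu_queue))  # 9 commands in total
```

In a real engine the recording step is where the CPU time goes, which is why spreading it across cores matters; the submission loop stays short and ordered.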
In the DX11 era, Nvidia was the undisputed king, but this is great news for AMD. The company's GCN architecture has long featured asynchronous compute engines (ACE), which up until now haven't really done it any favours when it comes to performance.  Under DX12, those ACEs should finally be put to work, with tasks like physics, lighting, and post-processing being divided into different queues and scheduled independently for processing by the GPU. On the other hand, Nvidia's cards are very much designed for DX11. Anandtech found that any pre-Maxwell GPU from the company (that is, pre-980 Ti, 980, 970, and 960) had to either execute in serial or pre-empt to move tasks ahead of each other. That's not a problem under DX11, but it potentially becomes one under DX12.
There's another big feature in DX12 that's going to be of particular interest to those with an iGPU or APU: Explicit Multiadaptor. With DX12, support for multiple GPUs is baked into the API, allowing separable and contiguous workloads to be executed in parallel on different GPUs, regardless of whether they come from Intel, AMD, or Nvidia. Post-processing in particular stands to gain a lot from Explicit Multiadaptor. By offloading some of the post-processing to a second GPU, the first GPU is free to start on the next frame much sooner.
Tied into this is Split-Frame Rendering (SFR). Instead of multiple GPUs rendering an entire frame each, a process known as Alternate Frame Rendering (AFR), each frame is split into tiles for each GPU to render before being transferred to a display. In theory, this should eliminate much of the frame variance that afflicts current multi-GPU CrossFire and SLI setups.
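The difference between the two schemes comes down to how work is handed out. The sketch below is a simplified illustration, not how any driver or engine actually does it: the function names are hypothetical, and the SFR "tiles" are assumed to be simple horizontal strips of equal height.

```python
def afr_assignment(frame_count, gpu_count):
    """AFR: whole frames are dealt out to the GPUs in turn."""
    return [frame % gpu_count for frame in range(frame_count)]


def sfr_tiles(frame_height, gpu_count):
    """SFR: one frame is cut into contiguous row ranges, one per GPU.

    Returns (start_row, end_row) pairs with end_row exclusive.
    """
    base, extra = divmod(frame_height, gpu_count)
    tiles = []
    start = 0
    for gpu in range(gpu_count):
        rows = base + (1 if gpu < extra else 0)
        tiles.append((start, start + rows))
        start += rows
    return tiles


print(afr_assignment(4, 2))  # [0, 1, 0, 1]
print(sfr_tiles(1080, 2))    # [(0, 540), (540, 1080)]
```

Under AFR each GPU finishes its own frame on its own schedule, which is where uneven frame pacing can creep in; under SFR every GPU contributes to every frame.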
Finally, DX12 will allow for multiple GPUs to pool their memory. If you've got two 4GB graphics cards in your machine, the game will have access to the full 8GB.

The benchmark

Unfortunately, neither Explicit Multiadaptor nor Split-Frame Rendering is currently supported in the Ashes benchmark (trials for each are due to arrive soon). The rest of DX12 is supported, thankfully, and the Ashes benchmark does a surprisingly good job of digging out performance data from the nascent API.
The benchmark runs a three-minute real-time demo that executes live game code, complete with AI scripts, audio processing, physics, and more. That means the benchmark isn't exactly the same each time, but the variations are low enough to reliably establish larger performance trends.
Once the benchmark has completed, it spits out a huge amount of useful data. At a high level this includes the overall average frame rate as well as a breakdown by the scene categories of normal, medium, and heavy. The normal scene features a low number of draw calls, around 10,000. The medium scene doubles this to around 20,000, while the heavy scenes push things further still. While the overall average frame rate can be useful as a rough indicator of performance, it's the heavier scenes that are of particular interest for testing DirectX 12.

An example of what the Ashes benchmark spits out when run under DX12.
Because there are more draw calls, the CPU has to do more work dishing them out to the GPU, which is a good indicator of CPU performance. In theory, the faster the CPU and the more cores it has, the more draw calls it can send to the GPU. Ashes also adds a "percent GPU bound" score to the results, which shows if it's the GPU or the CPU that's proving to be the bottleneck in a particular scene. The benchmark tracks the GPU to see if it has completed its work before the game sends the next frame to be rendered. If it hasn’t, then the CPU must wait for the GPU, and thus it's the GPU that's the bottleneck. If the percentage is anything below 99 percent, then it's the CPU that's beginning to struggle to keep up with the GPU.
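As a rough illustration of how a "percent GPU bound" figure could be derived, the sketch below assumes the benchmark keeps one flag per frame recording whether the GPU was still busy when the next frame was ready to go. That per-frame flag, and the function name, are assumptions for the sake of the example; Oxide has not published its exact bookkeeping here.

```python
def percent_gpu_bound(gpu_still_busy_flags):
    """Share of frames (0 to 100) in which the CPU had to wait for the GPU."""
    if not gpu_still_busy_flags:
        raise ValueError("need at least one frame")
    return 100.0 * sum(gpu_still_busy_flags) / len(gpu_still_busy_flags)


# Hypothetical run: the GPU was the bottleneck in three of four frames.
flags = [True, True, True, False]
print(percent_gpu_bound(flags))  # 75.0
```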
But what if we had an infinitely fast GPU? Ashes provides that data too via the CPU frame rate. It shows the theoretical frame rate of a GPU that could process all of the data the CPU throws at it, if that CPU is doing so quicker than the GPU can handle. This gives a good indication of the performance gulf between, say, a six-core system with hyperthreading and a quad-core system without or a stock clocked CPU compared to one that's been overclocked.
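One way to picture the CPU frame rate is to compare it with the frame rate you actually get. The sketch below is a simplified model with made-up timings, not Ashes' own calculation: it assumes CPU and GPU work overlap perfectly, so each frame takes as long as the slower of the two.

```python
def fps_from_times(frame_times_ms):
    """Average frames per second for a list of per-frame times in milliseconds."""
    return 1000.0 * len(frame_times_ms) / sum(frame_times_ms)


def cpu_frame_rate(cpu_times_ms):
    """Frame rate if the GPU were infinitely fast: only CPU time counts."""
    return fps_from_times(cpu_times_ms)


def actual_frame_rate(cpu_times_ms, gpu_times_ms):
    """Frame rate when each frame waits for the slower of CPU and GPU."""
    slowest = [max(c, g) for c, g in zip(cpu_times_ms, gpu_times_ms)]
    return fps_from_times(slowest)


# Hypothetical per-frame timings in milliseconds.
cpu_ms = [10.0, 10.0, 20.0]
gpu_ms = [20.0, 20.0, 20.0]
print(cpu_frame_rate(cpu_ms))             # 75.0
print(actual_frame_rate(cpu_ms, gpu_ms))  # 50.0
```

The gap between the two numbers is the headroom the CPU has in hand; a faster or wider CPU raises the first figure even when the second stays pinned by the GPU.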
Other useful data includes weighted frame rates, where the benchmark squares the millisecond timing of each frame and then takes the square root at the end, weighting slower frames more than faster frames.
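Read literally, that describes something like a root-mean-square of the frame times. The sketch below takes that reading, assuming the squared timings are averaged before the square root is taken and the result is then converted to frames per second; the averaging step and the function names are my interpretation rather than anything Oxide has spelled out.

```python
import math


def average_fps(frame_times_ms):
    """Plain average frame rate from per-frame times in milliseconds."""
    return 1000.0 * len(frame_times_ms) / sum(frame_times_ms)


def weighted_fps(frame_times_ms):
    """Frame rate based on the root-mean-square frame time.

    Squaring makes slow frames count for more than fast ones.
    """
    mean_square = sum(t * t for t in frame_times_ms) / len(frame_times_ms)
    return 1000.0 / math.sqrt(mean_square)


# Hypothetical run: three quick frames and one long stutter.
times_ms = [10.0, 10.0, 10.0, 50.0]
print(average_fps(times_ms))             # 50.0
print(round(weighted_fps(times_ms), 1))  # 37.8
```

The single 50 ms frame drags the weighted figure well below the plain average, which is the point of the metric.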
You can go totally nuts with the data from Ashes of the Singularity if you want to. Among other metrics, the benchmark can record individual frame times for the CPU and GPU, the approximate total frame time spent in the driver, the number of commands sent to the GPU for that frame, and whether the game was CPU or GPU bound at that point.

A couple of caveats


Before the results, there are a few provisos to the benchmark. Firstly, as mentioned earlier, there's no Explicit Multiadaptor support, so we could only test single-GPU setups. Secondly, despite our best efforts, the benchmark won't run on the integrated GPU of a shiny new Intel Skylake Core i7-6700K. That's particularly disappointing, because it would have been interesting to see if the performance uplift from DX12 was enough to make Intel's integrated GPUs a viable option for 1080p gaming at medium to high settings. Thirdly, and perhaps most intriguingly, there was a bit of a kerfuffle over the weekend when Oxide sent the benchmark out to press. A follow-up e-mail from Oxide recommended that the press didn't use MSAA during testing. "It is implemented differently" under DX12 than DX11 according to the company, and "it has not been optimized yet by us or by the graphics vendors." That was slightly annoying given that we'd just run a set of benchmarks with it enabled, but fair enough. However, that note was swiftly followed by another e-mail from one of the GPU companies saying that MSAA was broken and unable to be used at all.
To somehow make things even more complicated, a second Ashes reviewer's guide was sent around by the same GPU company. This confirmed the previously mentioned bug and offered up differing guidelines for benchmarking. Finally, in a blog post from Oxide co-founder Dan Baker, the company outlined what the benchmark's numbers mean, and stated that MSAA was not broken.
"There are incorrect statements regarding issues with MSAA," Baker wrote. "Specifically, that the application has a bug in it which precludes the validity of the test. We assure everyone that is absolutely not the case. Our code has been reviewed by Nvidia, Microsoft, AMD and Intel. It has passed the very thorough D3D12 validation system provided by Microsoft specifically designed to validate against incorrect usages. All IHVs have had access to our source code for over year, and we can confirm that both Nvidia and AMD compile our very latest changes on a daily basis and have been running our application in their labs for months. Fundamentally, the MSAA path is essentially unchanged in DX11 and DX12. Any statement which says there is a bug in the application should be disregarded as inaccurate information."
Normally, the behind-the-scenes goings on with vendors, developers, the press, and whoever else is—despite popular belief—a rather dry affair. But the fuss made over the Ashes benchmark was quite something, and it may be rather telling. This is, after all, not a definitive look at DX12 performance across different graphics cards. Instead, it's an insight into the performance of a single game. It might be representative of future performance, but until there are more DX12 games out there (roll on Fable Legends), take the numbers from this benchmark with a pinch of salt.
All that said, in order to remove any doubt from the benchmarks, all tests were run with MSAA disabled.

The benchmarks

TEST SYSTEM SPECIFICATIONS
OS: Windows 10
CPU: Intel Core i7-5930K (6-core) @ 4.5GHz
RAM: 32GB Corsair DDR4 at 3000MHz
HDD: Samsung SM951 512GB M.2 PCIe SSD
Motherboard: Asus X99 Deluxe
Power Supply: Corsair HX1200i
Cooling: Hydro Series H110i GTX 280mm Liquid Cooler
GPUs: Nvidia GTX 980 Ti, AMD R9 290X
In order to get the best out of the Ashes benchmark, you need to test it on a range of configurations. In our case that meant testing on the Ars Technica UK benchmarking PC. For graphics, we used an Nvidia GTX 980 Ti along with an AMD R9 290X. Unfortunately, the R9 290X is the newest AMD card we had access to at the time of benchmarking, a card that is generations behind the GTX 980 Ti, which makes a direct performance comparison between the two difficult. That said, it does support the vast majority of DX12 features. If nothing else, the card should give us an insight into how AMD's older hardware performs under the new API, and if it might be able to close the gap with Nvidia.
As well as the two graphics cards, the benchmark was run under two different CPU configurations: one using all six cores and hyperthreading of the Core i7-5930K, and another with hyperthreading and two cores disabled. This allowed us to mimic a mainstream quad-core Core i5 processor. Both, however, were run at the same 4.5GHz clock speed. In retrospect, running the CPU at progressively lower clock speeds would have made for some even more interesting results, particularly if you had a slow six- or eight-core CPU versus a supremely overclocked CPU with fewer cores. Would the new efficiencies afforded by DX12 negate that clock speed difference?
For now, at least for the six-core tests, the CPU is essentially taken out of the equation. The focus is instead largely on the GPUs. To help things along, the benchmark was run in three different resolutions: 1080p, 1440p, and 2160p (4K). All were run at the same "high" preset with MSAA disabled. With two GPUs, two CPU configurations, three resolutions, and two APIs, that gives us a total of 24 separate benchmark runs, each with multiple data points to look at: basically, a lot of stuff.
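For anyone wanting to reproduce the test matrix, the 24 runs are simply every combination of GPU, CPU configuration, resolution, and API. A minimal sketch follows; the label strings are mine, not anything the benchmark itself uses.

```python
from itertools import product

gpus = ["GTX 980 Ti", "R9 290X"]
cpu_configs = ["6 cores with hyperthreading", "4 cores without hyperthreading"]
resolutions = ["1080p", "1440p", "2160p"]
apis = ["DX11", "DX12"]

# Every combination of the four variables is one benchmark run.
runs = list(product(gpus, cpu_configs, resolutions, apis))
print(len(runs))  # 24
```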
First up are the average FPS scores, made up of the frame times for the entire benchmark run, combining data for normal, medium, and heavy batch scenes. These give us a rough idea of the performance across each GPU and resolution, and immediately Nvidia's scores stand out: under DX12, performance decreases. While I wasn't expecting a particularly big jump in performance for team green, I certainly wasn't expecting performance to go down. It's not by a huge amount, but the results are consistent. Nvidia's GPU doesn't perform as well under DX12 as it does under DX11 in the Ashes benchmark.
Contrast that with the AMD results, which show a huge uplift in performance. The climb is as high as 70 percent in some cases. While you have to bear in mind that AMD is coming from a bad place here—its DX11 performance is nowhere near Nvidia's—that's still an impressive result. Under DX12, the much older and much cheaper R9 290X nearly matches the performance of the GTX 980 Ti. At 4K, it actually beats it, if only by a few frames per second. That's an astonishing result no matter how you slice it.
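For clarity, the uplift percentages quoted here are the change in frame rate relative to the DX11 result. The numbers in the sketch below are purely hypothetical, picked so the arithmetic is easy to follow; they are not figures from our benchmark runs.

```python
def percent_uplift(dx11_fps, dx12_fps):
    """Change in frame rate going from DX11 to DX12, as a percentage of DX11."""
    return 100.0 * (dx12_fps - dx11_fps) / dx11_fps


# Hypothetical example values, not measured results.
print(percent_uplift(20.0, 34.0))  # 70.0
print(percent_uplift(50.0, 48.0))  # -4.0 (a regression shows up as a negative number)
```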
Interestingly, under AMD, switching to a four-core CPU without hyperthreading made little difference to the benchmark results. Granted, the 4.5GHz overclock meant that the benchmark was never CPU-bound, but the fact there's no difference indicates that a standard quad-core Core i5 with a decent overclock is more than up to the task. It was largely the same story under Nvidia, with consistent scores across the six- and four-core CPU. Only the 4K tests showed a difference, with the six-core CPU turning in a few extra frames per second.
Next up are the results from just the heavy scenes in the benchmark. Because there is a much larger number of draw calls happening in these scenes, in theory it should be more taxing on the CPU and on the DX12 API. Again, under DX11, the 980 Ti wipes the floor with the 290X. Under DX12, though, the results are far closer. It's tough to know exactly what's happening with the Nvidia card here. Clearly, Nvidia's DX11 driver implementation is far superior to AMD's, and it has been over the course of the DX11 era. It's almost as if the Nvidia GPU and driver don't have much room to improve... but that wouldn't explain the drop in performance you can see from the move to DX12. At the very least, we can see that Nvidia has some work to do to improve DX12 performance.
As for AMD, if these results pan out in future DX12 games, the company is back in the running, and not just with its latest and greatest GPUs, either, but older ones as well. Its work on Mantle and Vulkan, along with more direct access to hardware under DX12 (which makes driver-level optimisations less important than before) and its hardware-level use of ACEs for parallel processing, is finally beginning to pay off.
You can also see that, once again, a quad-core CPU would do just as good a job as a huge multicore CPU in driving data to the GPU, if a high enough clock speed is used.
Finally we have the 99th percentile frame rates—that is, the minimum frame rate you can expect to see 99 percent of the time—calculated from the frame times that the Ashes benchmark spits out. This time, the R9 290X card actually manages to beat the 980 Ti when it comes to minimum frame rates. So in Ashes of the Singularity at least, you'll have a slightly smoother experience with AMD. Given that AMD has suffered with erratic frame timings in the past, this is surprising to see.
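If you want to work this out from your own frame-time logs, one simple approach is to sort the frame times, pick the one that 99 percent of frames are at or below, and convert it to frames per second. The sketch below uses the nearest-rank method and made-up timings; it is one reasonable way to do the calculation, not necessarily the exact method used for our charts.

```python
def percentile_99_fps(frame_times_ms):
    """Frame rate that 99 percent of frames meet or beat (nearest-rank method)."""
    ordered = sorted(frame_times_ms)
    # Ceiling of 0.99 * n using integer arithmetic; rank is 1-based.
    rank = (99 * len(ordered) + 99) // 100
    return 1000.0 / ordered[rank - 1]


# Hypothetical run: 196 frames at 20 ms and four slow frames at 40 ms.
times_ms = [20.0] * 196 + [40.0] * 4
print(percentile_99_fps(times_ms))  # 25.0
```

The average for that made-up run would be close to 49 FPS, so the percentile figure tells you much more about how the stutters feel.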
Oddly, when looking at the four-core CPU results, Nvidia sees a boost of 4 FPS at 1080p. That's not a huge amount, but it's outside the margin of error. It raises questions about why fewer CPU cores would result in more performance at the lower resolution.

An AMD coup

To say these benchmark results are unexpected would be an understatement. While it's true that AMD has been banging the DX12 drum for a while, its performance in Ashes is astonishing. AMD's cheaper, older, and less efficient GPU is able to almost match and at one point beat Nvidia's top-of-the-line graphics card. AMD performance boosts reach almost 70 percent under DX12. On the flip side, Nvidia's performance is distinctly odd, with its GPU dropping in performance under DX12 even when more CPU cores are thrown at it. The question is why?

AMD's graphics cards just got a lot more interesting.
Did AMD manage to pull off some sort of crazy-optimised driver coup? Perhaps, but it’s unlikely. It's well known that Nvidia has more software development resources at its disposal, and while AMD's work with Mantle and Vulkan will have helped, it's more likely that AMD has the underlying changes behind DX12 to thank. Since the 600-series of GPUs in 2012, Nvidia has been at the top of the GPU performance pile, mostly in games that use DX10 or 11. DX11 is an API that requires a lot of optimisation at the driver level, and clearly Nvidia's work in doing so has paid off over the past few years. Even now, with the Ashes benchmark, you can see just how good its DX11 driver is.
Optimising for DX12 is a trickier beast. It gives developers far more control over how its resources are used and allocated, which may have rendered much of Nvidia's work in DX11 obsolete. Or perhaps this really is the result of earlier hardware decisions, with Nvidia choosing to optimise for DX11 with a focus on serial scheduling and pre-empting as AMD looks to the future with massively parallel processing.

Alas, without more data to draw from in the form of other DX12 games, it's hard to draw any concrete conclusions from the Ashes benchmark. Yes, AMD's performance gets a dramatic boost, and yes, Nvidia's doesn't. But with only one other major DX12 game on the way—Fable Legends—does it matter all that much right now? While DX12 usage will ramp up, DX11 isn't going anywhere for a long time. And who's to say that Nvidia won't see better performance in Unreal Engine, Unity, and others when they eventually get used in DX12 games?
On top of all those variables, there'll also be new hardware before games really start to use DX12 in earnest. The next generation of graphics cards is promising huge leaps in performance, thanks in part to the move from a positively ancient 28nm manufacturing process to 16nm. Nvidia will have Pascal, which, like AMD's current Fury cards, will feature a form of high-bandwidth memory. While less is known about AMD's follow-up to Fury, it's presumably already hard at work on something.
For now, the Ashes of the Singularity benchmark gives us a tantalising glimpse at the future: a future where AMD strikes back after years on the sidelines, perhaps finally turning the tide against Nvidia.
This post originated on Ars Technica UK
