Friday, October 10, 2008

"A Closer Look at GPUs" Published in October CACM

An updated version of Mike Houston's and my ACM Queue article "GPUs: A Closer Look" was republished in the October 2008 issue of Communications of the ACM as "A Closer Look at GPUs" (also available on my web page). The text has been improved slightly from the original version, and references to GPUs on the market have been updated to reflect current product lines. A historical note to readers: this work was the basis for my talk at the Beyond Programmable Shading class at SIGGRAPH 2008. The material undergoes constant improvement, and I personally feel that the SIGGRAPH talk, as the most recent iteration, is the most evolved and finely tuned description of GPU architecture concepts. However, the article goes beyond the scope of the talk to describe a precise, but simple, model of the modern real-time graphics pipeline (and the resulting workload) that was not presented at SIGGRAPH due to time constraints. I give much credit to Kurt Akeley, Pat Hanrahan, and others in the Stanford Graphics Lab for establishing this mental model of the graphics pipeline over the past two years.

Tuesday, October 7, 2008

GRAMPS: A Programming Model for Graphics Pipelines (it's finally out the door)

Fellow students at Stanford and I have been exploring the feasibility of custom graphics pipelines for the past year or two. This work has resulted in a research system called GRAMPS. GRAMPS generalizes ideas from modern graphics pipelines by permitting application programmers to create arbitrary computation graphs (not just pipelines) whose programmable and fixed-function stages exchange data via explicitly named queues. The GRAMPS abstractions anticipate high-throughput implementations that leverage a combination of CPU-like and GPU-like processing cores as well as fixed-function units.
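To make the "stages connected by named queues" idea concrete, here is a toy sketch in Python. It is not the GRAMPS API: the Stage class, the run_graph scheduler, and the queue names are all hypothetical, and the scheduler is a naive single-threaded loop rather than anything resembling a high-throughput implementation. It only illustrates that a stage can route its output to more than one queue, so the graph need not be a linear pipeline.

```python
from collections import deque


class Stage:
    """Hypothetical stage: consumes items from one named input queue.

    func maps one input item to a list of (queue_name, item) pairs,
    so a stage may push results to any number of named queues.
    """

    def __init__(self, name, in_queue, func):
        self.name = name
        self.in_queue = in_queue
        self.func = func


def run_graph(stages, queues):
    """Naive scheduler: keep running stages until no stage has pending input."""
    progress = True
    while progress:
        progress = False
        for stage in stages:
            pending = queues[stage.in_queue]
            while pending:
                item = pending.popleft()
                for out_name, out_item in stage.func(item):
                    queues[out_name].append(out_item)
                progress = True


# Named queues (names are illustrative only).
queues = {name: deque() for name in ("input", "even", "odd", "output")}
queues["input"].extend([1, 2, 3, 4])

# A small graph with fan-out: "split" feeds two different downstream stages.
stages = [
    Stage("split", "input",
          lambda x: [("even" if x % 2 == 0 else "odd", x)]),
    Stage("double_even", "even", lambda x: [("output", x * 2)]),
    Stage("negate_odd", "odd", lambda x: [("output", -x)]),
]

run_graph(stages, queues)
result = list(queues["output"])
print(result)
```

In a real system the interesting questions are how stages are scheduled across cores and how queue sizes are bounded; this sketch deliberately ignores both.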

I have recently received a number of requests about this work. To those interested: just last week we shipped off a final copy of our paper, entitled "GRAMPS: A Programming Model for Graphics Pipelines", to ACM TOG. If all goes well, I'm told it should appear in early 2009. Until then, you can find an electronic copy of the submitted draft here.

Wednesday, August 13, 2008

How to Count to 800 (comparing the NV GTX 280, ATI Radeon 4870, and what I hear about LRB)

This week at SIGGRAPH, Intel presented a technical paper entitled "Larrabee: A Many-Core x86 Architecture for Visual Computing", which describes Intel's new graphics architecture intended to compete with high-end GPU products from NVIDIA and ATI/AMD. As people try to draw comparisons between Larrabee and more traditional GPUs, some confusion has ensued over what the various companies mean by terms such as "stream processors", "thread processors", and "processing cores". For example, NVIDIA describes the recent GeForce GTX 280 as having 240 thread processors. AMD describes its Radeon 4870 as an 800 stream processor chip. Intel's paper describes the possibility of Larrabee incarnations with core counts in the 12 to 48 range.

Okay, so what gives? Obviously this isn't an apples-to-apples comparison, but as it turns out, it's very difficult to define a notion of "processing core" that is consistent and accurate across the three architectures. Moreover, counting cores is no measure of absolute chip performance given differences in clock rate, core capabilities, memory subsystem performance, the presence of fixed-function processing, etc. However, since we all like to count things when drawing comparisons, it's useful to at least try to count consistently.

I've decided to create a few slides about how I think about the three chips (I reiterate: this is my own mental model, but it has worked well for me). The following set of slides describes how to derive the peak 32-bit floating-point capability of the three GPUs' programmable components, beginning with the organization of ALUs on the chip. As a bonus, I throw in a few notes about how programs (such as Direct3D shader programs or CUDA programs) map onto the compute resources of the three architectures. The iconography and terminology of the slides follow from descriptions and principles presented in the talk "From Shader Code to a Teraflop: How a Shader Core Works", given as part of the SIGGRAPH 2008 class "Beyond Programmable Shading: Fundamentals".
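The counting exercise can be sketched in a few lines of Python. This is a rough illustration, not a reproduction of the slides. The per-chip organization below (30 cores of 8 ALUs for the GTX 280, 10 cores of 16 lanes with 5 ALUs each for the Radeon 4870, and 16-wide vector units per Larrabee core) is my assumption based on public descriptions of the chips, not something stated in this post. The function names are hypothetical, the 1.0 GHz clock is a placeholder rather than a real product clock, and counting a multiply-add as 2 flops per ALU per clock is a simplifying assumption that ignores any additional issue capability.

```python
def alu_count(cores, simd_lanes, alus_per_lane=1):
    """Total 32-bit ALUs = cores x SIMD lanes per core x ALUs per lane."""
    return cores * simd_lanes * alus_per_lane


def peak_gflops(alus, clock_ghz, flops_per_alu_per_clock=2):
    """Peak rate, assuming each ALU retires one multiply-add (2 flops) per clock."""
    return alus * flops_per_alu_per_clock * clock_ghz


# Assumed organizations (see the caveats above).
gtx280 = alu_count(cores=30, simd_lanes=8)                       # "240 thread processors"
radeon4870 = alu_count(cores=10, simd_lanes=16, alus_per_lane=5)  # "800 stream processors"
larrabee_12 = alu_count(cores=12, simd_lanes=16)                  # low end of the 12 to 48 core range
larrabee_48 = alu_count(cores=48, simd_lanes=16)                  # high end of the range

for name, alus in [("GTX 280", gtx280), ("Radeon 4870", radeon4870),
                   ("Larrabee (12 cores)", larrabee_12),
                   ("Larrabee (48 cores)", larrabee_48)]:
    # The 1.0 GHz clock is a placeholder; substitute a real shader clock to compare chips.
    print(f"{name}: {alus} ALUs, {peak_gflops(alus, clock_ghz=1.0):.0f} GFLOPS at 1.0 GHz")
```

Counted this way, the marketing numbers (240 and 800) line up with ALU counts rather than core counts, which is why they are not directly comparable to Larrabee's 12 to 48 cores.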

Click here for the full talk in pdf form.
Click here for just the notes on NVIDIA, ATI, and proposed Larrabee chips.


Welcome to Kayvon's GBLOG, an attempt to provide simple technical explanations of modern graphics architectures and throughput computing programming models and techniques. I hope to use this blog to clarify and disambiguate the large amount of (often conflicting) terminology that gets thrown around in the field of graphics architecture and, more generally, the emerging field of commodity throughput computing.

I am a PhD candidate in the Computer Graphics Lab at Stanford University. My contact information can be found at