Tuesday, December 6, 2011

First CMU Course Complete (Graphics and Imaging Architectures)

I just finished up the final lecture for CMU 15-869: Graphics and Imaging Architectures (my first course at CMU). Since making a whole bunch of new lectures was quite a task, I'm not being shy about telling folks about it. ;-)


Many of the slides are meant to support class discussions and/or me talking, so they may not serve as the best reference. However, a number of them provide coverage of the latest and greatest in performance-centric graphics that I haven't seen elsewhere.


Friday, September 11, 2009

DiagSplit: Parallel, Crack-Free, Adaptive Tessellation for Micropolygon Rendering

As hoped, we just kicked another micropolygon-related paper out the door. We've been working on a parallel algorithm for generating micropolygons via adaptive tessellation for the past year and the result of this study is an algorithm that we've called DiagSplit.

DiagSplit is an implementation of Split-Dice with two interesting modifications. First, instead of what many consider to be the "traditional Reyes" dicer that generates tensor-product UV grids of quadrilateral micropolygons, DiagSplit's dicing step is the D3D11 Tessellation stage. Thus, it produces slightly irregular meshes as output. Second, to get everything to work without creating cracks, the splitting process must sometimes split subpatches along non-isoparametric directions. In other words, the algorithm sometimes makes diagonal splits in parametric space (hence the name DiagSplit).

DiagSplit is intended for tight integration with the real-time graphics pipeline. In the short term, an implementation might do all the splitting on the CPU or within a compute shader, then ship diceable subpatches (not final triangles) over to the graphics pipeline for all the heavy lifting of dicing and surface evaluation. We've really designed DiagSplit for even tighter integration with future graphics pipelines and can imagine the entire adaptive splitting process being implemented in the pipeline itself with only a few extensions to D3D11. For those interested in an early read, the final draft of the paper, which will appear in SIGGRAPH Asia 2009, has been placed online here.

Paper Abstract:

We present DiagSplit, a parallel algorithm for adaptively tessellating displaced parametric surfaces into high-quality, crack-free micropolygon meshes. DiagSplit modifies the split-dice tessellation algorithm to allow splits along non-isoparametric directions in the surface's parametric domain, and uses a dicing scheme that supports unique tessellation factors for each subpatch edge. Edge tessellation factors are computed using only information local to subpatch edges. These modifications allow all subpatches generated by DiagSplit to be processed independently without introducing T-junctions or mesh cracks and without incurring the tessellation overhead of binary dicing. We demonstrate that DiagSplit produces output that is better (in terms of image quality and number of micropolygons produced) than existing parallel tessellation schemes, and as good as highly adaptive split-dice implementations that are less amenable to parallelization.
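As a rough illustration of the "information local to subpatch edges" idea in the abstract, here is a minimal Python sketch. It is not the paper's metric or algorithm: the `edge_factor` and `patch_edge_factors` names, the screen-space length heuristic, and the `target_len` parameter are all assumptions made for this example. The only point it shows is that when an edge's factor depends solely on that edge's two endpoints, two subpatches sharing the edge compute the same factor without talking to each other.

```python
import math

def edge_factor(p0, p1, target_len=1.0):
    """Integer tessellation factor for one edge, from its two endpoints only.

    Assumption for this sketch: points are 2D screen-space positions and the
    goal is segments of roughly target_len pixels.
    """
    # Sort the endpoints so the result is identical whichever patch asks
    # and in whichever direction that patch walks the edge.
    a, b = sorted((p0, p1))
    return max(1, math.ceil(math.dist(a, b) / target_len))

def patch_edge_factors(corners, target_len=1.0):
    """Factors for the edges of a patch whose corners are listed in order."""
    n = len(corners)
    return [edge_factor(corners[i], corners[(i + 1) % n], target_len)
            for i in range(n)]

# Two hypothetical subpatches that share the edge (4, 0)-(4, 3).
left = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
right = [(4.0, 0.0), (12.5, 0.0), (12.5, 3.0), (4.0, 3.0)]

left_factors = patch_edge_factors(left)
right_factors = patch_edge_factors(right)

# The shared edge is edge 1 of `left` and edge 3 of `right`.
shared_agrees = left_factors[1] == right_factors[3]
```

Each patch can be handed to a different worker, and the shared boundary still receives matching vertex counts. The sketch leaves out everything else the paper covers, including the diagonal splits themselves.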

Friday, July 24, 2009

HPG09 submission: Data-parallel Rasterization of Micropolygons with Defocus and Motion Blur

For those interested, I've placed our HPG09 paper, Data-parallel Rasterization of Micropolygons with Defocus and Motion Blur, online on the Stanford Graphics Lab pages. It was surprising how tricky this problem can be, and, as is clear from the paper, there's still room for improvement in this area. Look for more micropolygon-related papers to come (we hope).

One of the major research goals at Stanford right now is the design of a real-time micropolygon rendering pipeline. There's a lot of recent and interesting work out there on implementing REYES-like algorithms on existing GPUs (see the RenderAnts folks, Anjul Patney's tessellation work, and NVIDIA's upcoming tech demos at SIGGRAPH). Our interest is not necessarily in implementing REYES; there are a lot of merits to the existing graphics pipeline. Rather, we're trying to determine how a real-time graphics pipeline such as D3D11 (as well as corresponding future GPU architectures) should evolve to efficiently accommodate micropolygon workloads. At the Beyond Programmable Shading II course at SIGGRAPH 2009, I'll be getting the chance to talk a bit about what those pipeline changes might be, and what we (and the rest of the field) have learned about building an efficient real-time micropolygon rendering pipeline. Also, in the morning session of the course I will give an extended version of last year's GPU architecture talk: From Shader Code to a Teraflop: How a GPU Core Works.

HPG09 Paper abstract: Current GPUs rasterize micropolygons (polygons approximately one pixel in size) inefficiently. We design and analyze the costs of three alternative data-parallel algorithms for rasterizing micropolygon workloads for the real-time domain. First, we demonstrate that efficient micropolygon rasterization requires parallelism across many polygons, not just within a single polygon. Second, we produce a data-parallel implementation of an existing stochastic rasterization algorithm by Pixar, which is able to produce motion blur and depth-of-field effects. Third, we provide an algorithm that leverages interleaved sampling for motion blur and camera defocus. This algorithm outperforms Pixar's algorithm when rendering objects undergoing moderate defocus or high motion and has the added benefit of predictable performance.

Friday, October 10, 2008

"A Closer Look at GPUs" Published in October CACM

An updated version of Mike Houston's and my ACM Queue article "GPUs: A Closer Look" was republished in the October 2008 issue of Communications of the ACM as "A Closer Look at GPUs" (also available on my web page). The text has been improved slightly from the original version, and various references to GPUs on the market have been updated to reference current product lines. A historical note to readers: this work was the basis for my talk at the Beyond Programmable Shading class at SIGGRAPH08. It undergoes constant improvement, and I personally feel that, as the most recent iteration, the SIGGRAPH talk constitutes the most evolved and finely tuned description of GPU architecture concepts. However, the article does go beyond the scope of the talk to describe a precise, but simple, model of the modern real-time graphics pipeline (and resulting workload) that was not presented at SIGGRAPH due to time constraints. I give much credit to Kurt Akeley, Pat Hanrahan, as well as others in the Stanford Graphics Lab, for establishing this mental model of the graphics pipeline over the past two years.

Tuesday, October 7, 2008

GRAMPS: A Programming Model for Graphics Pipelines (it's finally out the door)

Fellow students at Stanford and I have been exploring the feasibility of custom graphics pipelines for the past year or two. This work has resulted in a research system called GRAMPS. GRAMPS generalizes ideas from modern graphics pipelines by permitting application programmers to create arbitrary computation graphs (not just pipelines) that contain programmable and fixed-function stages that exchange data via explicitly named queues. The GRAMPS abstractions anticipate high-throughput implementations that leverage a combination of CPU and GPU-like processing cores as well as fixed-function units.
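To make the "stages that exchange data via explicitly named queues" idea concrete, here is a toy, single-threaded Python sketch. It is not the GRAMPS API: the `Graph` class, its methods, and the `route`/`halve` stages are hypothetical names invented for this example, and a real implementation would schedule stages across many cores rather than loop over them. The graph deliberately contains a cycle (the `halve` stage feeds its output back into the input queue) to show something a strict pipeline cannot express.

```python
from collections import deque

class Graph:
    """Toy computation graph: stages connected by named queues."""

    def __init__(self):
        self.queues = {}   # queue name -> deque of pending items
        self.stages = []   # (stage name, function, input queue, output queues)

    def add_queue(self, name):
        self.queues[name] = deque()

    def add_stage(self, name, fn, src, dsts):
        self.stages.append((name, fn, src, set(dsts)))

    def push(self, queue, item):
        self.queues[queue].append(item)

    def run(self):
        """Run stages round-robin until no stage has input left."""
        progress = True
        while progress:
            progress = False
            for name, fn, src, dsts in self.stages:
                q = self.queues[src]
                while q:
                    item = q.popleft()
                    # A stage returns (destination queue, item) pairs.
                    for dst, out in fn(item):
                        if dst not in dsts:
                            raise ValueError(f"{name} wrote to undeclared queue {dst}")
                        self.queues[dst].append(out)
                    progress = True

def route(n):
    # Even values need more work; odd values are finished.
    return [("even", n)] if n % 2 == 0 else [("out", n)]

def halve(n):
    # Feed the result back to the input queue: a cycle in the graph.
    return [("in", n // 2)]

g = Graph()
for q in ("in", "even", "out"):
    g.add_queue(q)
g.add_stage("route", route, src="in", dsts=["even", "out"])
g.add_stage("halve", halve, src="even", dsts=["in"])

for value in (12, 5, 8):
    g.push("in", value)
g.run()
results = list(g.queues["out"])
```

Declaring each stage's output queues up front lets the graph reject writes to queues a stage never announced, which is one plausible reason to name queues explicitly.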

I have recently received a number of requests asking about this work. To those interested, just last week we shipped off a final copy of our paper, entitled "GRAMPS: A Programming Model for Graphics Pipelines" to ACM TOG. If all goes well, I'm told it should be appearing in early 2009. Until then, you can find an electronic copy of the submitted draft here.

Wednesday, August 13, 2008

How to Count to 800 (comparing the NV GTX 280, ATI Radeon 4870, and what I hear about LRB)

This week at SIGGRAPH, Intel presented a technical paper entitled Larrabee: A Many-Core x86 Architecture for Visual Computing, which described Intel's new graphics architecture that is intended to compete with high-end GPU products from NVIDIA and ATI/AMD. In trying to draw comparisons between Larrabee and more traditional GPUs, some confusion has ensued over what the various companies mean by terms such as "stream processors", "thread processors", and "processing cores". For example, NVIDIA describes the recent GeForce GTX 280 as having 240 thread processors. AMD describes its Radeon 4870 as an 800 stream processor chip. Intel's paper describes the possibility of Larrabee incarnations with core counts in the 12 to 48 range.

Okay, so what gives? Obviously this isn't an apples-to-apples comparison, but as it turns out, it's very difficult to define a precise notion of "processing core" that is consistent and accurate across the three architectures. Moreover, counting cores is no measure of absolute chip performance given differences in clock rate, core capabilities, memory subsystem performance, the presence of fixed-function processing, etc. However, since we all like to count things when drawing comparisons, it's useful to try to count consistently.

I've decided to create a few slides about how I think about the three chips (I reiterate: this is my own mental model, but it has worked well for me). The following set of slides describes how to derive the peak 32-bit floating point capability of the three GPUs' programmable components, beginning with the organization of ALUs on the chip. As a bonus, I throw in a few notes about how programs (such as Direct3D shader programs, or CUDA programs) map onto the compute resources of the three architectures. The iconography and terminology of the slides follow from descriptions and principles presented in the talk "From Shader Code to a Teraflop: How a Shader Core Works" given as part of the SIGGRAPH 2008 Class: "Beyond Programmable Shading: Fundamentals".

Click here for the full talk in pdf form.
Click here for just the notes on NVIDIA, ATI, and proposed Larrabee chips.
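For readers who prefer arithmetic to slides, the counting exercise can be sketched in a few lines of Python. This is a back-of-the-envelope model, not a reproduction of the slides. Only the 240 and 800 totals and the 12-to-48 core range come from the text above; the per-core ALU organization, the clock rates, and the choice to count a multiply-add as two operations per ALU per clock are assumptions taken from commonly quoted public specifications, and the Larrabee configuration is purely hypothetical. The `Chip` class and its field names are invented for this example.

```python
from dataclasses import dataclass

@dataclass
class Chip:
    name: str
    cores: int                  # units that each fetch their own instruction stream
    alus_per_core: int          # 32-bit ALUs (SIMD lanes) fed by one core
    clock_ghz: float            # ALU clock (assumed values, see lead-in)
    flops_per_alu_clock: int = 2  # one multiply-add counted as 2 operations

    def alus(self) -> int:
        return self.cores * self.alus_per_core

    def peak_gflops(self) -> float:
        return self.alus() * self.flops_per_alu_clock * self.clock_ghz

chips = [
    # 30 cores x 8 ALUs = the "240 thread processors" in NVIDIA's description.
    Chip("GeForce GTX 280", cores=30, alus_per_core=8, clock_ghz=1.296),
    # 10 cores x (16 lanes x 5-wide) = the "800 stream processors" in AMD's.
    Chip("Radeon 4870", cores=10, alus_per_core=80, clock_ghz=0.75),
    # Hypothetical point inside the paper's 12-to-48 core range; the
    # 16-wide vector unit and 1 GHz clock are assumptions.
    Chip("Larrabee (hypothetical)", cores=32, alus_per_core=16, clock_ghz=1.0),
]

for chip in chips:
    print(f"{chip.name}: {chip.cores} cores, {chip.alus()} ALUs, "
          f"{chip.peak_gflops():.0f} peak GFLOPS")
```

Counting this way makes the marketing numbers comparable: "240" and "800" are ALU counts, while Intel's "12 to 48" is a core count, so the like-for-like comparison is 30 versus 10 versus 12 to 48 cores, or 240 versus 800 versus (cores x vector width) ALUs.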


Welcome to Kayvon's GBLOG, an attempt to provide simple technical explanations of modern graphics architectures and throughput computing programming models and techniques. I hope to use this blog to clarify and disambiguate the large amount of (often conflicting) terminology that gets thrown around in the field of graphics architecture and, more generally, the emerging field of commodity throughput computing.

I am a PhD candidate in the Computer Graphics Lab at Stanford University. My contact information can be found at http://graphics.stanford.edu/~kayvonf