Project Description
GPGPUs offer significant computational horsepower in our computers that is unfortunately not easily available to .NET programs. 4-Centauri is a system capable of mapping .NET bytecode to GPU intermediate languages (e.g. nVidia PTX) so that you can run .NET algorithms on state-of-the-art hardware.

In this project we consider the problem of the diverging gap between actual hardware architectures and the abstract view offered by Virtual Machines, a gap that will eventually lead to the under-use of hardware resources by VM-based programs. Recently, Graphics Processing Units (GPUs), as well as the Cell BE architecture, have begun to expose to programs forms of non-determinism going beyond the traditional model. These architectures are difficult to target from the Just-In-Time compiler module of a Virtual Machine (VM) because their features (execution and memory models) are hidden by the abstraction layer provided by the intermediate language.

There are three major problems related to this gap:
  1. The design of VMs (Sun’s JVM, Microsoft’s CLR) was influenced by the dominant idea that processors would maintain a Von Neumann model while hiding special-purpose aspects.
  2. Special-purpose architectures expose quite different parallel computational models and require different programming models.
  3. The Just-In-Time compiler cannot target special-purpose architecture features since they are hidden from the abstraction layer provided by the intermediate language (IL).

The VM to GPU mapping problem

In order to fill that gap, we concentrate mainly on using standard mechanisms provided by VMs to represent different parallel computations without changing the design of the VM itself. Our work is not tailored to a specific architecture or a single execution model; instead, it considers different classes of parallel models:
  • Shared memory, e.g. the Hierarchical PRAM (HPRAM), the Weakly Coherent PRAM (WPRAM);
  • Distributed memory, e.g. Bulk Synchronous Parallel (BSP).

From parallel models to parallel execution on different architectures.

The idea is to define and implement a meta-programming technique that can map the VM stack-based programming model to different models of parallel computation:
  • without affecting the general structure of the VM;
  • by raising the semantic level to eliminate explicit sequencing;
  • by providing suitable programming abstractions that define a single, unified programming model without losing expressivity or forcing the use of a single source language.

At runtime, through reflection, 4-Centauri evaluates the source code and its metadata, and generates the code required to exploit the special features of the underlying non-Von Neumann architecture. 4-Centauri is a meta-program that maps Common Intermediate Language (CIL) code to nVidia Parallel Thread eXecution (PTX) code so that it can execute on general-purpose GPUs.

class HMapClass
{
    static Type SField;
    Type IField;
    // …

    void MethExeOnCPU ( … )   // executed on the CPU
    // …

    [Kernel]  // executed on PRAMs
    void MethOne ( InputStream<Ti> input, OutputStream<To> output )
    {
        Type lVar;
        // method body
    }
}

public sealed class Kernel : Attribute { public int NrThreads { get; set; } }
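To illustrate how a meta-program can consume this metadata, here is a minimal, self-contained sketch of discovering [Kernel]-marked methods via reflection. The attribute and class names mirror the fragment above; the discovery logic and the example parameter types are our illustration, not 4-Centauri's actual implementation.

```csharp
using System;
using System.Linq;
using System.Reflection;

[AttributeUsage(AttributeTargets.Method)]
public sealed class Kernel : Attribute { public int NrThreads { get; set; } }

class HMapClass
{
    void MethExeOnCPU() { }                  // stays on the CPU

    [Kernel(NrThreads = 256)]                // candidate for PTX generation
    void MethOne(int[] input, int[] output) { }
}

static class KernelScanner
{
    // Return the methods of a type that carry the [Kernel] attribute;
    // these are the ones a meta-program would translate to GPU code.
    public static MethodInfo[] FindKernels(Type t) =>
        t.GetMethods(BindingFlags.Instance | BindingFlags.Static |
                     BindingFlags.Public | BindingFlags.NonPublic)
         .Where(m => m.GetCustomAttribute<Kernel>() != null)
         .ToArray();
}
```

Here `KernelScanner.FindKernels(typeof(HMapClass))` would return only `MethOne`, since `MethExeOnCPU` carries no attribute.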

The CIL to PTX translation

The design goal is to provide a single and unified programming model without forcing the use of a single source language. In order to allow programmers to continue developing and debugging at source level, we consider dividing compilation, optimization, and specialization processes into different stages by introducing a programming tool chain that controls the generation process.

At a low level, 4-Centauri maps the Microsoft Intermediate Language to the nVidia Parallel Thread eXecution language. The former is stack-based VM code, whereas the latter is register-based VM code. The translation requires:
  • abstract interpretation of the operand stack;
  • applying one-to-one translation rules, whenever possible;
  • optimization of stack operations;
  • branch translation.
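The first two steps can be sketched with a toy translator. This is our simplified illustration, not 4-Centauri's code: the opcode names follow CIL conventions, the output mimics PTX-style three-address syntax, and the register-naming scheme is an assumption. A symbolic stack tracks which virtual register holds each operand, so a stack instruction sequence can be flattened into register code.

```csharp
using System;
using System.Collections.Generic;

static class StackToRegister
{
    // Abstractly interpret a stack-based sequence, emitting one
    // register-based instruction per stack operation.
    public static List<string> Translate(string[] ilOps)
    {
        var stack = new Stack<string>();   // holds virtual register names
        var output = new List<string>();
        int next = 0;                      // next free virtual register

        foreach (var op in ilOps)
        {
            if (op.StartsWith("ldarg."))
            {
                // Loading argument N: bind it to a fresh virtual register.
                var r = "%r" + next++;
                output.Add($"mov.s32 {r}, arg{op.Substring(6)};");
                stack.Push(r);
            }
            else if (op == "add")
            {
                // 'add' pops two operands and pushes the result: a
                // one-to-one mapping to a three-address instruction.
                var b = stack.Pop();
                var a = stack.Pop();
                var r = "%r" + next++;
                output.Add($"add.s32 {r}, {a}, {b};");
                stack.Push(r);
            }
        }
        return output;
    }
}
```

For example, `Translate(new[] { "ldarg.0", "ldarg.1", "add" })` flattens the stack sequence into `mov.s32 %r0, arg0;`, `mov.s32 %r1, arg1;`, `add.s32 %r2, %r0, %r1;`.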

The source code for testing our meta-program will be available as soon as possible.

GPGPU Performance Model

We are also developing a software tool that determines whether a generic computation completes faster on the CPU or on the GPU. This is necessary because data transfers between host and device can become the bottleneck for a parallelized computation. The tool relies on a mathematical model that predicts the completion time of a computation on a GPGPU-enabled device. Computations written in MIL can be:
  • compiled to PTX and executed on the GPU, or
  • executed directly on the CPU.
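The shape of such a prediction can be sketched with a simple additive cost model. This is our assumption for illustration, not the project's published model: GPU completion time includes both host-to-device and device-to-host transfers, which can dominate and make the CPU the better choice.

```csharp
using System;

static class CompletionModel
{
    // Predicted GPU completion time: transfer in, kernel execution,
    // transfer out. Bandwidth is in bytes per second.
    public static double GpuTime(double bytes, double bandwidth,
                                 double kernelSeconds) =>
        bytes / bandwidth       // host -> device transfer
        + kernelSeconds         // kernel execution on the device
        + bytes / bandwidth;    // device -> host transfer

    // Pick the device with the smaller predicted completion time.
    public static string ChooseDevice(double cpuSeconds, double gpuSeconds) =>
        gpuSeconds < cpuSeconds ? "GPU" : "CPU";
}
```

For example, moving 1 GB each way at 5 GB/s around a 0.1 s kernel predicts 0.5 s total, so against a 0.3 s CPU run the tool would keep the computation on the CPU even though the kernel itself is faster.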

We naturally integrated this tool into 4-Centauri.

Last edited Apr 26, 2010 at 2:01 PM by ilPongista, version 21