2 Comments

OpenCL for Your “Embarrassingly Parallelizable” Code

Spin had a great blog post a few days ago on Mean Shift Clustering. It’s a powerful algorithm with a ton of applications, but an Achille’s heel:

The most glaring disadvantage is its slowness. …it can take a long time to execute. The one silver lining is that, while it is slow, it is also embarrassingly parallelizable.

How many times has that happened to you? Your code solves a small problem perfectly, but it just isn’t fast enough for the real world. Sometimes the solution can be to just find a bigger computer. Luckily almost every computer has a bigger computer inside it: the graphics hardware. Where your computer’s CPU might have 8 cores, its GPU can have hundreds. OpenCL is a standard framework that gives you access to all that power.

The sample I’m going to show was built on Mac OS 10.10 running Xcode 6. Apple has done a nice job integrating OpenCL into their environment while preserving the data types and language features defined in OpenCL. Hopefully Windows and Linux will be in similar good situations. Much of it will come down to your graphics hardware vendor support for OpenCL–debugging especially may be rough. I’ve found online materials and Apple’s documentation to be usable, but a good book really helped. “OpenCL Programming Guide” was a nice introduction and also included useful summaries of data types and built-in functions that really helped.

I started by creating a Mac OS X command line app in Xcode. Then I added the OpenCL framework. The OpenCL documentation on Apple’s developer site is brief and worth reading.

Apple’s OpenCL can use a grand central dispatch queue on either the GPU or the CPU. Be very careful running OpenCL code on a CPU queue because it’s a lot less efficient than the GPU for running massively parallel code. My machine was completely hammered at 800% utilization with the fans at max. Running on the GPU queue in comparison has machine utilization at less than 100% with the fan off.


dispatch_queue_t queue = gcl_create_dispatch_queue(CL_DEVICE_TYPE_GPU, NULL);
 
if (queue == NULL) {
queue = gcl_create_dispatch_queue(CL_DEVICE_TYPE_CPU, NULL);
fprintf(stderr, "Warning: Running on CPU\n");
}
else {
char name[128];
 
cl_device_id gpu = gcl_get_device_id_with_dispatch_queue(queue);
clGetDeviceInfo(gpu, CL_DEVICE_NAME, 128, name, NULL);
fprintf(stderr, "Running on GPU %s\n", name);
}

That said, there’s one good reason to run on the CPU during development: your GPU has no memory protection. A bug in the code running on the GPU can be very difficult to find. I strongly recommend a disciplined approach going from verification of the algorithm in plain C code, followed by porting to OpenCL on the CPU and finally to the GPU.

Which brings us to the code that actually runs on the GPU:

kernel void mean_shift_point(global float2 *points,
                             global float2 *original_points,
                             size_t num_points,
                             float bandwidth,
                             global float2 *shifted_points)
{
    float base_weight = 1. / (bandwidth * sqrt(2. * M_PI_F));
    float2 shift = { 0, 0 };
    float scale = 0;
 
    size_t i = get_global_id(0);
 
    for (size_t j = 0; j < num_points; j++) {
        float dist = distance(points[i], original_points[j]);
        float weight = base_weight * exp(-0.5f * pow(dist / bandwidth, 2.f));
 
        shift += original_points[j] * weight;
        scale += weight;
    }
 
    shifted_points[i] = shift / scale;
}

That looks like C code, but it’s really a “kernel” that is compiled by a separate OpenCL toolchain provided by your graphics hardware vendor. The language has nice support for small vectors (float2 for example is a 2D point) and lots of built-in and optimized math routines. The main difficulty in writing OpenCL code is getting the data in and out of the GPU.

You have two copies of the data that the kernel will work on: one is accessible by the CPU and the other by the GPU (called the “device” by OpenCL). The CPU is responsible for preparing the data, loading it from disk, sending it to the GPU, etc. The copy of the data in the GPU is only for use by the GPU–it’s not in the same memory as your CPU uses.

Here are the data buffers that the CPU uses:

cl_float2 *points = (cl_float2 *)malloc(BUFFER_SIZE);
cl_float2 *original_points = (cl_float2 *)malloc(BUFFER_SIZE);
cl_float2 *shifted_points = (cl_float2 *)malloc(BUFFER_SIZE);
 
for (int i = 0; i < NUM_VALUES; i++) {
    points[i].x = (cl_float)i;
    points[i].y = (cl_float)i;
}
 
memcpy(original_points, points, BUFFER_SIZE);

Here are the associated data buffers that the GPU uses:

void *device_points = gcl_malloc(BUFFER_SIZE, points,
                              CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR);
void *device_original_points = gcl_malloc(BUFFER_SIZE, original_points,
                                          CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR);
void *device_shifted_points = gcl_malloc(BUFFER_SIZE, NULL, CL_MEM_WRITE_ONLY);

At first I was confused as why they were not properly typed like the CPU data buffers, but then I realized that’s sort of a feature to help prevent you from accidentally using those buffers on the CPU. The only thing you can do with them is pass them to OpenCL functions.

Everything is now setup and ready to schedule the kernel on the GPU. This code fits into Apple’s normal multi-threading approach quite well. It will look much different on Windows or Linux, but I’m not so worried about portability that this glue code bothers me.

dispatch_sync(queue, ^{
    size_t work_group_size;
    gcl_get_kernel_block_workgroup_info(mean_shift_point_kernel,
                                        CL_KERNEL_WORK_GROUP_SIZE,
                                        sizeof(work_group_size),
                                        &work_group_size, NULL);
    cl_ndrange range = {
        1,
        { 0, 0, 0 },
        { NUM_VALUES, 0, 0 },
        { work_group_size, 0, 0 }
    };
 
    mean_shift_point_kernel(&range, (cl_float2 *)device_points,
                            (cl_float2 *)device_original_points,
                            NUM_VALUES, BANDWIDTH,
                            (cl_float2 *)device_shifted_points);
 
    gcl_memcpy(shifted_points, device_shifted_points, BUFFER_SIZE);
});

If you’re interested in building the code yourself, it’s on github under an MIT license.