GPU Image Processing using OpenCL
Implementation of two image processing methods in less than 120 lines of code using Python and OpenCL
Large speed-ups can be achieved by using GPUs instead of CPUs for certain tasks. (Image: Joseph Greve)
Besides the obvious use-case of a Graphics Processing Unit (GPU), namely rendering 3D objects, it is also possible to perform general-purpose computations using frameworks like OpenCL or CUDA. One famous use-case is bitcoin mining. We will look at another interesting use-case: image processing. After discussing the basics of GPU programming, we implement dilation and erosion in less than 120 lines of code using Python and OpenCL.
Why is image processing well suited for GPUs?
First reason
Many image processing operations iterate from pixel to pixel in the image, do some calculation using the current pixel value, and finally write each computed value to an output image. Fig. 1 shows a gray-value-inverting operation as an example. We invert the gray-value of each pixel individually. This can obviously be done at the same time (in parallel) for each pixel as the output values do not depend on each other.
Fig. 1: Inverting the color of an image. The output pixel at some location can be computed by only taking the input pixel at the same location into account. Some corresponding pixels are shown by colored lines.
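The inversion shown in Fig. 1 can be sketched in a few lines of NumPy (a sketch only; it assumes 8-bit gray values):

```python
# Gray-value inversion: every output pixel depends only on the input pixel
# at the same location, so all pixels could be computed in parallel.
import numpy as np

def invert(image: np.ndarray) -> np.ndarray:
    """Invert an 8-bit gray-value image pixel by pixel."""
    return 255 - image

img = np.array([[0, 100], [200, 255]], dtype=np.uint8)
print(invert(img))  # each pixel is mapped independently of its neighbors
```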
Second reason
When doing image processing, we need fast access to pixel values. GPUs are designed for graphics tasks, and one of them is texturing, so the hardware for accessing and manipulating pixels is well optimized. Fig. 2 shows what texturing in 3D programs is typically used for.
Fig. 2: Textures are often used to make models (in this case a sphere) look more realistic.
Dilation and erosion
Let’s have a short look at the image processing methods we implement. Both operations belong to the class of morphological operations. A mask (often called structuring element) is shifted through the input image, pixel by pixel. For each location, the maximum (for dilation) or minimum (for erosion) pixel value under the mask is written to the output image. Fig. 3 shows an illustration of how dilation is computed.
Fig. 3: The 3×3 mask is shifted through the image. For dilation, the maximum pixel value under the mask is taken as the resulting value. Two locations of the mask are shown, one resulting in a black pixel (0), the other one resulting in a white pixel (255).
To get an idea of both operations, look at Fig. 4. Further, Fig. 5 shows a gray-value image and its eroded version.
Fig. 4: Top: input image. Bottom: output images for dilation and erosion.
Fig. 5: This example shows how to increase the pen width using multiple erosions.
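The mask-shifting procedure described above can be sketched directly in NumPy. This is a sketch, not the article's code: it uses edge padding as one possible way to handle border pixels.

```python
# Dilation/erosion with a 3x3 square mask: shift the mask through the image
# and write the maximum (dilation) or minimum (erosion) under the mask.
import numpy as np

def morph_op(image: np.ndarray, erode: bool = False) -> np.ndarray:
    padded = np.pad(image, 1, mode="edge")      # replicate border pixels
    out = np.empty_like(image)
    h, w = image.shape
    for y in range(h):
        for x in range(w):
            window = padded[y:y + 3, x:x + 3]   # 3x3 neighborhood of (x, y)
            out[y, x] = window.min() if erode else window.max()
    return out

img = np.zeros((5, 5), dtype=np.uint8)
img[2, 2] = 255                       # single white pixel
print(morph_op(img))                  # dilation grows it to a 3x3 white square
print(morph_op(img, erode=True))      # erosion removes it entirely
```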
Exploiting the parallelism of GPUs
As you have seen, when applying dilation or erosion, a mask is shifted through the image. At each location the same basic operation (select set of pixels by mask, output maximum or minimum pixel value) is applied. Further, the output pixels do not depend on each other.
This suggests implementing the basic operation as a function on the input image with an additional argument for the location. Then, we call the function for all possible locations (x between 0 and width-1, y between 0 and height-1) at the same time (in parallel) to get the output image.
Let’s do a pseudo-code implementation first: either we call synchronous_work(image) and iterate through all (x, y) locations, applying the same operation at each iteration step. Or we call parallel_work(image, x, y) with its additional location arguments for all locations in parallel (e.g. by starting a thread for each instance). Fig. 6 illustrates how the output pixel values are computed in parallel.
Synchronous implementation:
synchronous_work(image):
for all (x, y) locations in image:
1. select pixels from 3×3 neighborhood of (x, y)
2. take maximum (or minimum)
3. write this value to output image at (x, y)
in main: run one instance of synchronous_work(image)
Parallel implementation:
parallel_work(image, x, y):
1. select pixels from 3×3 neighborhood of (x, y)
2. take maximum (or minimum)
3. write this value to output image at (x, y)
in main: run one instance of parallel_work(image, x, y) for each (x, y) location in the image at the same time
Fig. 6: Three mask locations are shown. The resulting pixel values do not depend on each other and can therefore be computed at the same time.
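The parallel_work idea can already be tried out on the CPU. The sketch below uses Python threads as a stand-in for GPU work-items (an illustration only; on the GPU, OpenCL will launch one work-item per location instead):

```python
# One task per (x, y) location; each task computes one output pixel
# independently, which is exactly what makes the operation parallelizable.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def parallel_work(image, out, x, y):
    """Compute one output pixel: maximum of the 3x3 neighborhood (dilation)."""
    h, w = image.shape
    ys = range(max(0, y - 1), min(h, y + 2))   # clamp neighborhood at borders
    xs = range(max(0, x - 1), min(w, x + 2))
    out[y, x] = max(image[j, i] for j in ys for i in xs)

image = np.zeros((4, 4), dtype=np.uint8)
image[1, 1] = 255
out = np.zeros_like(image)

with ThreadPoolExecutor() as pool:
    for y in range(image.shape[0]):            # one task per pixel location
        for x in range(image.shape[1]):
            pool.submit(parallel_work, image, out, x, y)
print(out)
```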
OpenCL implementation
Let’s port our pseudo-code such that it runs on a real GPU: we will use the Open Computing Language (OpenCL) framework. Fig. 7 illustrates what our implementation does: we first copy our input image to the GPU, compile the kernel (GPU program), execute it for all pixel locations in parallel, and finally copy the resulting image back from the GPU.
We need to implement one program for the host and one for the device:
CPU (“host”): a Python program which sets up OpenCL, issues data transfers between CPU and GPU, takes care of file handling and so on
GPU (“device”): an OpenCL kernel which does the actual image processing using a C-like language
Fig. 7: A rough overview of what our OpenCL implementation does.
Large parts of the CPU or host code can be regarded as boilerplate code. Setting up the main OpenCL data-structures is done the same way most of the time. Also, preparing the input and output data follows a certain pattern, the same applies for GPU program compilation, execution and data transfer. You should be able to implement other algorithms like edge-detection or brightness adjustment using the provided code without touching the CPU program. Let’s have a quick walk through the main steps. The numbers in brackets correspond to those in the code for easier navigation.
(1) First, OpenCL initialization has to be done:
Select platform: a platform corresponds to an installed driver, e.g. AMD
Select GPU device from selected platform: you might have more than one GPU installed. We simply take the first entry in the list of GPUs
Create a cl.Context object: this object ties together all relevant information (selected device, data, …) for running a GPU program
Create a command queue object of type cl.CommandQueue: this object is a FIFO queue which allows us to issue commands to send/receive data to/from the GPU and also allows us to execute programs on the GPU
(2) Prepare data:
Read image from file into a numpy image
Allocate output numpy image of same size as a placeholder. We will later fill it with the result from the GPU
Create buffers of type cl.Image which hold the image data for OpenCL
(3) Prepare GPU program (Fig. 8 shows object hierarchy which represents the GPU program):
Read source code of GPU program from file and create cl.Program object from it
Compile program with the cl.Program.build method
Create kernel object of type cl.Kernel which represents the entry point of our GPU program
Set the arguments to the kernel function: this is the connection between the data in our CPU program and the data accessed in the GPU program. We pass the input and output image buffers and also an integer indicating if we want to apply dilation (=0) or erosion (=1). We use the cl.Kernel.set_arg method and specify both the index of the kernel argument (e.g. 0 corresponds to first argument in the argument list of our kernel function) and the CPU data
Fig. 8: OpenCL objects representing GPU program and its arguments.
(4) Execute GPU program and transfer data:
Issue a command to copy the input image to the input buffer using cl.enqueue_copy
Execute the GPU program (kernel): we implemented the morphological operation on pixel-level, therefore we will execute an instance of our kernel for each (x, y) location. This combination of data and kernel is called work-item. The number of work-items we need to process the whole image is equal to width*height. We pass both the kernel and the image size to the cl.enqueue_nd_range_kernel function to execute the work-items
When the GPU program finishes its work, we copy the data back from the output buffer to the output image using cl.enqueue_copy (and wait until this operation finishes to return the filled output image to the calling Python function)
To summarize: we setup OpenCL, prepare input and output image buffers, copy the input image to the GPU, apply the GPU program on each image-location in parallel, and finally read the result back to the CPU program.
GPU program (kernel running on device)
OpenCL GPU programs are written in a language similar to C. The entry point is called kernel function, in our case it is the function morphOpKernel(…). Its arguments correspond to the arguments we set in the CPU program. In the kernel it is necessary to access the work-item IDs. You remember: a work-item is a combination of code (kernel) and data (the image-location in our case). When calling cl.enqueue_nd_range_kernel in the CPU program, we specified that we want one work-item for each (x, y) location in the image. We access the work-item IDs by using the function get_global_id. As we have two dimensions (because the image has two dimensions) we get the IDs for both dimensions. The first ID corresponds to x, the second one to y. Now that we know the pixel location, we can read the pixel values in the 3×3 neighborhood, compute the output value and finally write it to the same location in the output image.
(1) Kernel:
Entry point when executing a GPU program (there might also be sub-functions called by the kernel)
We pass a read-only input image and a write-only output image to the kernel. Further, we pass an integer which tells us if we have to apply dilation (=0) or erosion (=1)
(2) Global IDs identify the work-item, and correspond to the x and y coordinate in our case.
(3) The structuring element (mask) is implemented by two nested loops which represent a 3×3 square neighborhood around the center pixel.
(4a) Pixel data is accessed using the read_imagef function which returns a vector of 4 floats. For gray-value images, we only need the first component which we can index by s0. It is a float value between 0 and 1 representing the gray-value of the pixel. We pass the image, the pixel location and a sampler to this function. A sampler allows things like interpolating between pixels, but in our case the sampler is doing nothing more than returning the pixel value at the given location.
(4b) The sampler is defined by a combination of options. In our case, we want the sampler to use unnormalized coordinates (going from 0 to width-1 instead of going from 0 to 1). Further, we disable interpolation between pixels (as we do not sample in between pixel locations) and we also define what happens when we access locations outside the image.
(5) Depending on the operation, we are looking for the largest/smallest pixel value inside the 3×3 neighborhood.
(6) Write result to output image. Again, we specify the image and the location. We have to pass a vector of 4 floats, but we only have to fill the first entry because our image is gray-valued.
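Putting the numbered steps together, the kernel could look roughly like the sketch below. It mirrors the description above (morphOpKernel is the name used in the article), but treat the details as an illustration rather than the exact source:

```c
// (4b) Sampler: unnormalized coordinates, no interpolation, clamp at borders.
__constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP_TO_EDGE |
                               CLK_FILTER_NEAREST;

// (1) Kernel entry point: read-only input, write-only output, operation flag.
__kernel void morphOpKernel(__read_only image2d_t in,
                            __write_only image2d_t out,
                            int op)  // 0 = dilation, 1 = erosion
{
    // (2) Global IDs identify the work-item, i.e. the (x, y) pixel location.
    int x = get_global_id(0);
    int y = get_global_id(1);

    float best = (op == 0) ? 0.0f : 1.0f;
    // (3) Two nested loops implement the 3x3 structuring element.
    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            // (4a) read_imagef returns 4 floats; the gray value is component s0.
            float v = read_imagef(in, sampler, (int2)(x + dx, y + dy)).s0;
            // (5) Keep the largest (dilation) or smallest (erosion) value.
            best = (op == 0) ? fmax(best, v) : fmin(best, v);
        }
    }
    // (6) Write the result; only the first component matters for gray images.
    write_imagef(out, (int2)(x, y), (float4)(best, best, best, 1.0f));
}
```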
Test it
Execute python main.py, which applies both operations to the input image shown in Fig. 9. After execution, two files appear in the directory: dilate.png and erode.png, the results of dilation and erosion.
Fig. 9: This image is used to test the implementation.
What happens behind the scenes?
We create a work-item for each pixel as shown in Fig. 10. The GPU starts some kind of hardware-thread for each work-item to be processed in parallel. Of course, the number of threads active at the same time is limited by the hardware. An interesting feature is that these threads are not completely independent: there are groups of threads executing in lock-step and applying the same operation to different instances of data (SIMD). Therefore branching in the kernel (if-else, switch-case, while, for) might slow down execution: the GPU has to process both branches for all threads executing in lock-step, even if only a single thread diverges.
Fig. 10: Work-items are instances of the kernel created for all image-locations. Here, three work-items are shown.
There are different types of parallelism on GPUs (like using vector types, or partitioning the data and applying an operation to each data element in parallel, …). We only used one of these types by splitting the data on pixel-level and computing the output for each pixel location individually.
If you look closely at the GPU program you might notice that variables have different address spaces: the input and output image are globally visible to all work-items, while the variables used as a scratch-pad inside the function are only visible to a work-item instance. These address spaces correspond to the underlying hardware. To give an example: the images are handled by the texture units and their samplers. These units are also used for texturing 3D objects in games and similar applications.
Further reading
If you want to get a deeper understanding of GPU programming and OpenCL, there is no way around reading a good book on this topic. There are many use-cases which can efficiently be computed on the GPU, not just famous ones like image processing and bitcoin mining.
I recommend the book [1] to get you started with OpenCL. The book starts with simple programs like 4D matrix-vector multiplication and ends with complex topics like FFT or bitonic sort.
Further, the book [2] is also a good read, but it already expects some experience with OpenCL.
Depending on your GPU you may find information about its OpenCL implementation to better understand how OpenCL is translated to hardware, e.g. see [3] for AMD.
Khronos provides the specification [4] and a very useful API reference card [5].
[1] Scarpino — OpenCL in Action
[2] Gaster et al. — Heterogeneous Computing with OpenCL
[3] AMD Accelerated Parallel Processing — OpenCL Programming Guide
[4] Khronos — The OpenCL Specification
[5] Khronos — OpenCL API Reference Card