The beauty of the vendor-independent standard OpenCL is that a single kernel language is sufficient to program many different architectures, ranging from dual-core CPUs over Intel's Many Integrated Cores (MIC) architecture to GPUs and even FPGAs. The kernels are just-in-time compiled during the program run, which has several advantages and disadvantages:

- Advantage: The binary can be fully optimized for the underlying hardware.
- Disadvantage: Just-in-time compilation induces overhead.
- Disadvantage: No automatic performance portability.

Today's blog post is about just-in-time (jit) compilation overhead. Ideally, jit-compilation is infinitely fast. In reality, it is sufficient to keep the jit-compilation time small compared to the overall execution time. But what is 'small'?

Consider simple OpenCL kernels like the following:

```c
__kernel void kernel_1_2(__global float * x) { x[2] = 1.0f; }
```

The kernel only sets the third entry of the buffer x to 1. Since the kernel is so simple, it is reasonable to expect that a jit-compiler requires only a fraction of a second to compile it.

If you run a more involved OpenCL application, you may need a couple of different kernels. Taking the iterative solvers in ViennaCL as an example, one quickly needs about 10 different kernels for simple setups. More elaborate preconditioners quickly drive the number up to 15 or 20, and each of those kernels is more involved than the one shown above. To mimic a realistic workload, consider the compilation of 64 kernels similar to the one above. By varying the kernel name (hence the _1_2 suffix) and the index used, we make sure that all the kernels are indeed distinct and no caching optimizations in the jit-compilers are triggered.

How are the 64 kernels compiled? There are several options with OpenCL: One may put all kernels into a single OpenCL program (i.e. compilation unit) and call the jit-compiler only once. On the other hand, one may use one OpenCL program per kernel, resulting in 64 compilation units. Other combinations such as 2, 4, 8, 16, or 32 kernels per OpenCL program are also considered in this benchmark (always adding up to a total of 64 kernels).

For comparison we select the recent OpenCL SDKs from the major vendors AMD, Intel, and NVIDIA. The benchmark is run on an OpenSUSE 13.2 machine with an AMD FirePro W9100 GPU to measure jit-compilation overhead for an AMD GPU. All other benchmarks are taken on a Linux Mint Maya machine with an AMD A10-5800K APU equipped with a discrete NVIDIA GeForce GTX 750 Ti GPU. The CPUs on the two machines are comparable: the deviation of the jit-compilation targeting the CPU with the AMD OpenCL SDK is below ten percent. Unfortunately, a suitable Intel GPU was not available for measuring the jit-overhead of the Intel OpenCL SDK when targeting GPUs. The OpenCL JIT compilation benchmark source code and results are available for download and should answer any remaining questions on details.

*Figure: Time required for the just-in-time compilation of 64 simple OpenCL kernels.*

As the figure above shows, the rule of thumb for jit compilation overhead is 0.1 to 1 second, with slight deviations between the three vendors and the target device. Overall, it is better to pack all kernels into two to four OpenCL programs rather than compiling all kernels in individual programs. A common trend is that compiling each kernel individually in a separate OpenCL program is worst in terms of overhead: it took more than 5 seconds to compile the 64 kernels for the NVIDIA GPU. However, this 5-second overhead is only encountered at the first run. From the second run onwards, the compiled binaries are taken from the cache (more details on jit caching here), reducing the 'experienced' overhead dramatically from a few seconds to a few milliseconds.

I hope that AMD and Intel will provide a similar jit caching mechanism soon, because a few seconds of kernel compilation cannot be considered 'short' for many applications. Otherwise, OpenCL-enabled libraries are forced to implement their own caching functionality, resulting in unnecessary code duplication across the community. When dealing with OpenCL, one has to think about granularity.
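The kernel-generation and grouping scheme described above can be sketched as follows. This is an illustrative sketch, not the benchmark's actual code: the helper names are made up, and in a real run each concatenated source string would be handed to clCreateProgramWithSource followed by one clBuildProgram call per program.

```python
def make_kernel(i, j):
    # Distinct kernel name (the _i_j suffix) and distinct index, so that
    # caching optimizations in the jit-compiler cannot collapse the kernels.
    return (f"__kernel void kernel_{i}_{j}(__global float * x) "
            f"{{ x[{j}] = 1.0f; }}\n")

def pack_programs(total=64, per_program=4):
    # Group 'total' kernels into OpenCL programs of 'per_program' kernels each.
    assert total % per_program == 0
    programs = []
    for p in range(total // per_program):
        src = "".join(make_kernel(p, k) for k in range(per_program))
        # In the real benchmark, 'src' is one compilation unit: it would be
        # passed to clCreateProgramWithSource + clBuildProgram exactly once.
        programs.append(src)
    return programs

progs = pack_programs(64, 4)
print(len(progs))                   # -> 16 programs ...
print(progs[0].count("__kernel"))   # -> ... of 4 kernels each
```

Varying `per_program` over 1, 2, 4, 8, 16, 32, and 64 reproduces the configurations compared in the benchmark.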
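Until all vendors ship a built-in cache, a library has to implement the caching itself. The following minimal sketch illustrates the idea only: the compile function is a stand-in, and all names are hypothetical. In a real OpenCL application, the binary would be extracted after the first build via clGetProgramInfo with CL_PROGRAM_BINARIES and reloaded on later runs via clCreateProgramWithBinary, with the cache keyed by kernel source and device.

```python
import hashlib
import os
import tempfile

def cached_build(source, device_name, compile_fn, cache_dir):
    # Key the cache entry by device and source, since binaries are
    # device-specific and must be rebuilt when the source changes.
    key = hashlib.sha256((device_name + source).encode()).hexdigest()
    path = os.path.join(cache_dir, key + ".bin")
    if os.path.exists(path):
        # Warm run: load the cached binary, skipping the jit entirely.
        with open(path, "rb") as f:
            return f.read(), True
    # Cold run: invoke the (expensive) jit-compiler and fill the cache.
    binary = compile_fn(source)
    with open(path, "wb") as f:
        f.write(binary)
    return binary, False

# Stand-in for clBuildProgram + clGetProgramInfo(CL_PROGRAM_BINARIES):
fake_jit = lambda src: src.encode()[::-1]

cache = tempfile.mkdtemp()
_, hit1 = cached_build("__kernel void k(__global float*x){}", "GTX 750 Ti",
                       fake_jit, cache)
_, hit2 = cached_build("__kernel void k(__global float*x){}", "GTX 750 Ti",
                       fake_jit, cache)
print(hit1, hit2)   # first run misses the cache, second run hits it
```

Keying on the device name matters: the same source built for the CPU and for the GPU must occupy separate cache entries.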