Optimizing CUDA code by kernel fusion: application on BLAS

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F15%3A00083436" target="_blank" >RIV/00216224:14330/15:00083436 - isvavai.cz</a>
Result on the web
<a href="http://dx.doi.org/10.1007/s11227-015-1483-z" target="_blank" >http://dx.doi.org/10.1007/s11227-015-1483-z</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/s11227-015-1483-z" target="_blank" >10.1007/s11227-015-1483-z</a>

Alternative languages

Result language
English
Title in original language
Optimizing CUDA code by kernel fusion: application on BLAS
Result description in original language
Contemporary GPUs have significantly higher arithmetic throughput than memory throughput. Hence, many GPU kernels are memory-bound and cannot exploit the arithmetic power of the GPU. Examples of memory-bound kernels are BLAS-1 (vector-vector) and BLAS-2 (matrix-vector) operations. However, when kernels share data, kernel fusion can improve memory locality by placing shared data, originally passed via off-chip global memory, into faster, but distributed, on-chip memory. In this paper, we show how kernels performing map, reduce, or their nested combinations can be fused automatically by our source-to-source compiler. To demonstrate the usability of the compiler, we have implemented several BLAS-1 and BLAS-2 routines and show how the performance of their sequences can be improved by fusion. Compared with similar sequences using CUBLAS, our compiler generates code that is up to 2.24x faster for the examples tested.
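To make the data flow behind fusion concrete, here is a minimal sequential C++ sketch. It is not CUDA and not output of the authors' compiler; it shows loop fusion, the CPU analogue of kernel fusion, for one BLAS-1 pair (AXPY followed by DOT, a fused map + reduce) and one BLAS-2 pair (q = A p and s = A^T r, as in BiCG-style solvers, a fused nested map/reduce sharing the matrix A). All function names are illustrative assumptions. On a GPU the fused loops would become one kernel, with intermediates kept in registers or shared memory; the cross-thread reduction a CUDA version needs for s is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// ---- BLAS-1: AXPY followed by DOT (hypothetical helper names) ----
// Unfused: two separate passes. AXPY writes z to memory and DOT reads it
// back: (3n) + (2n) = 5n element accesses in total.
Vec axpy(double alpha, const Vec& x, const Vec& y) {
    assert(x.size() == y.size());
    Vec z(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) z[i] = alpha * x[i] + y[i];
    return z;
}

double dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double acc = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) acc += a[i] * b[i];
    return acc;
}

// Fused map + reduce: z[i] is produced and consumed in the same iteration,
// so it is never re-read from memory (3n reads + n writes; the write of z
// could be dropped entirely if z is not needed afterwards).
double axpy_dot_fused(double alpha, const Vec& x, const Vec& y,
                      const Vec& w, Vec& z) {
    assert(x.size() == y.size() && y.size() == w.size());
    z.assign(x.size(), 0.0);
    double acc = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double zi = alpha * x[i] + y[i];  // map (AXPY)
        z[i] = zi;
        acc += zi * w[i];                       // reduce (DOT)
    }
    return acc;
}

// ---- BLAS-2: q = A p and s = A^T r, sharing A (m x n, row-major) ----
// Unfused: two matrix-vector products, each streaming all m*n entries of A.
void bicg_unfused(const Vec& A, std::size_t m, std::size_t n,
                  const Vec& p, const Vec& r, Vec& q, Vec& s) {
    assert(A.size() == m * n && p.size() == n && r.size() == m);
    q.assign(m, 0.0);
    s.assign(n, 0.0);
    for (std::size_t i = 0; i < m; ++i)  // first sweep over A: q = A p
        for (std::size_t j = 0; j < n; ++j) q[i] += A[i * n + j] * p[j];
    for (std::size_t i = 0; i < m; ++i)  // second sweep over A: s = A^T r
        for (std::size_t j = 0; j < n; ++j) s[j] += A[i * n + j] * r[i];
}

// Fused nested map/reduce: each A(i,j) is loaded once and feeds both products.
void bicg_fused(const Vec& A, std::size_t m, std::size_t n,
                const Vec& p, const Vec& r, Vec& q, Vec& s) {
    assert(A.size() == m * n && p.size() == n && r.size() == m);
    q.assign(m, 0.0);
    s.assign(n, 0.0);
    for (std::size_t i = 0; i < m; ++i) {
        double acc = 0.0;
        for (std::size_t j = 0; j < n; ++j) {
            const double a = A[i * n + j];  // single load of A(i,j)
            acc += a * p[j];                // row reduction for q
            s[j] += a * r[i];               // column accumulation for s
        }
        q[i] = acc;
    }
}
```

For example, with x = (1, 2, 3), y = (4, 5, 6), w = (1, 1, 2) and alpha = 2, both BLAS-1 paths give z = (6, 9, 12) and a dot product of 39; the fused BLAS-2 version halves the number of passes over A while producing the same q and s.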

Classification

Type
J<sub>x</sub> - Unclassified - Article in a specialist periodical (Jimp, Jsc and Jost)
CEP field
IN - Informatics
OECD FORD field
—

Result linkages

Project
<a href="/cs/project/EE2.3.30.0037" target="_blank" >EE2.3.30.0037: Employing the best young scientists to develop international cooperation</a><br>
Linkages
P - Research and development project financed from public funds (with a link to CEP)

Other

Year of application
2015
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Periodical title
The Journal of Supercomputing
ISSN
0920-8542
e-ISSN
—
Volume
71
Issue number within volume
10
Publisher's country
US - United States of America
Number of pages
24
Pages (from-to)
3934-3957
Article UT WoS code
000361531500013
Result EID in Scopus database
—

Similar results (10)

Automatic Fusions of CUDA-GPU Kernels for Parallel Map
THE ENERGY CONSUMPTION OPTIMIZATION OF THE BLAS ROUTINES
OpenCL Kernel Fusion for GPU, Xeon Phi and CPU
