CUDA
CUDA is the name Nvidia has given to its development environment for creating high-performance GPU-accelerated applications. The CUDA libraries enable acceleration across multiple domains such as linear algebra, image and video processing, deep learning, and graph analytics. These libraries offload work normally done on a CPU to the GPU. Any program built with the CUDA toolkit is tied to the Nvidia family of GPUs.
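To make that concrete, here is a minimal sketch of what offloading looks like in CUDA C (my own illustration, not something shipped with the toolkit): the vecAdd kernel runs on the GPU while the host allocates device memory, copies data over, launches the kernel, and copies the result back. Once the toolkit is installed as described below, you can save it as, say, vecadd.cu and compile it with nvcc.

#include <stdio.h>

/* Kernel: runs on the GPU, one thread per array element */
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    float h_a[1024], h_b[1024], h_c[1024];
    for (int i = 0; i < n; i++) { h_a[i] = i; h_b[i] = 2 * i; }

    /* Allocate GPU memory and copy the inputs to the device */
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* Launch 4 blocks of 256 threads, then fetch the result */
    vecAdd<<<4, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[10] = %.0f (expected 30)\n", h_c[10]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}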
Setting it up
The first step is to go get the toolkit. This is not shipped by any distribution. You have to get it directly from Nvidia. You can find the toolkit here:
https://developer.nvidia.com/cuda-downloads
On the download page, pick your OS, architecture, distribution, version, and installer type. I like the local rpm option because it installs all the CUDA rpms in a local repo that you can then install from as you need them.
Download it. Even though it says F23, it still works fine on F25.
The day I downloaded it, 8.0.44 was the current release; today it may be different. I'll continue using my version numbers, and you'll have to make the appropriate substitutions. So, let's continue the setup as root...
rpm -ivh ~/Downloads/cuda-repo-fedora23-8-0-local-8.0.44-1.x86_64.rpm
This installs a local repo of CUDA developer rpms. The repo is located in /var/cuda-repo-8-0-local/. You can list that directory to see all the rpms. Let's install the core libraries that are necessary for Deep Learning:
dnf install /var/cuda-repo-8-0-local/cuda-misc-headers-8-0-8.0.44-1.x86_64.rpm
dnf install /var/cuda-repo-8-0-local/cuda-core-8-0-8.0.44-1.x86_64.rpm
dnf install /var/cuda-repo-8-0-local/cuda-samples-8-0-8.0.44-1.x86_64.rpm
Next, we need to make sure that the utilities provided, such as the GPU compiler nvcc, are in our path, and that the libraries can be found. The easiest way to do this is by creating a bash profile file that gets sourced whenever you start a shell.
edit /etc/profile.d/cuda.sh (which is a new file you are creating now):
export PATH="/usr/local/cuda-8.0/bin${PATH:+:${PATH}}"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
export EXTRA_NVCCFLAGS="-Xcompiler -std=c++03"
The reason CUDA is aimed at F23 rather than F25 is that Nvidia does not test against the newest gcc, so they put a version check in the headers that makes compilation fail on anything newer.
I spoke with people from Nvidia at the GTC conference about why they don't support new gcc. Off the record, they said they do extensive testing on everything they support, and that the newer gcc simply wasn't what they developed with when creating CUDA 8, but it will probably be supported in CUDA 9.
It's easy enough to fix by altering the one line in the header that tests the gcc version. Since we have gcc-6.3, we can change the header to only fail for gcc 7 or later. To do this:
edit /usr/local/cuda-8.0/targets/x86_64-linux/include/host_config.h
On line 119 change from:
#if __GNUC__ > 5
to:
#if __GNUC__ > 6
This will allow things to compile with current gcc. There is one more thing that we need to fix in the headers so that Theano can compile GPU code later. The error looks like this:
math_functions.h(8901): error: cannot overload functions distinguished by return type alone
This is because newer glibc/gcc headers also declare these functions (returning bool), which conflicts with the int-returning declarations that Nvidia ships. The best solution I can come up with is simply to:
edit /usr/local/cuda-8.0/targets/x86_64-linux/include/math_functions.h
and around lines 8897 and 8901 you will find:
/* GCC 6.1 uses ::isnan(double x) for isnan(double x) */
__DEVICE_FUNCTIONS_DECL__ __cudart_builtin__ int isnan(double x) throw();
__DEVICE_FUNCTIONS_DECL__ __cudart_builtin__ constexpr bool isnan(long double x);
__DEVICE_FUNCTIONS_DECL__ __cudart_builtin__ constexpr bool isinf(float x);
/* GCC 6.1 uses ::isinf(double x) for isinf(double x) */
__DEVICE_FUNCTIONS_DECL__ __cudart_builtin__ int isinf(double x) throw();
__DEVICE_FUNCTIONS_DECL__ __cudart_builtin__ constexpr bool isinf(long double x);
What I did is comment out the two int-returning declarations, i.e. the line that immediately follows each of the comments about GCC 6.1.
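After the edit, that region should look something like this (a sketch of my patched copy; the exact line numbers may vary):

/* GCC 6.1 uses ::isnan(double x) for isnan(double x) */
/* __DEVICE_FUNCTIONS_DECL__ __cudart_builtin__ int isnan(double x) throw(); */
__DEVICE_FUNCTIONS_DECL__ __cudart_builtin__ constexpr bool isnan(long double x);
__DEVICE_FUNCTIONS_DECL__ __cudart_builtin__ constexpr bool isinf(float x);
/* GCC 6.1 uses ::isinf(double x) for isinf(double x) */
/* __DEVICE_FUNCTIONS_DECL__ __cudart_builtin__ int isinf(double x) throw(); */
__DEVICE_FUNCTIONS_DECL__ __cudart_builtin__ constexpr bool isinf(long double x);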
OK. Next we need to fix the CUDA install paths just a bit. As root:
# cd /usr/local/
# ln -s /usr/local/cuda-8.0/targets/x86_64-linux/ cuda
# cd cuda
# ln -s /usr/local/cuda-8.0/targets/x86_64-linux/lib/ lib64
cuDNN setup
One of the goals of this blog is to explore Deep Learning, and you will need the cuDNN libraries for that. So, let's put them in place while we are setting up the system. For some reason cuDNN is not shipped as an rpm, which leads to a manual installation that I don't like.
You'll need cuDNN version 5. Go to:
https://developer.nvidia.com/cudnn
To get this you have to have a membership in the Nvidia Developer Program. It's free to join.
Look for "Download cuDNN v5 (May 27, 2016), for CUDA 8.0". Get the Linux one. I moved it to /var/cuda-repo-8-0-local. Assuming you did too, as root:
# cd /var/cuda-repo-8-0-local
# tar -xzvf cudnn-8.0-linux-x64-v5.0-ga.tgz
# cp cuda/include/cudnn.h /usr/local/cuda/include/
# cp cuda/lib64/libcudnn.so.5.0.5 /usr/local/cuda/lib
# cd /usr/local/cuda/lib
# ln -s /usr/local/cuda/lib/libcudnn.so.5.0.5 libcudnn.so.5
# ln -s /usr/local/cuda/lib/libcudnn.so.5.0.5 libcudnn.so
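To confirm that the header and library ended up where the toolchain can find them, you can build a tiny version check. This is my own sketch, not part of the cuDNN package; the file name check_cudnn.c is made up, and it can be compiled with something like: nvcc check_cudnn.c -lcudnn -o check_cudnn

#include <stdio.h>
#include <cudnn.h>

/* cudnnGetVersion() reports the version of the cuDNN library actually linked */
int main(void)
{
    printf("cuDNN version: %zu\n", (size_t) cudnnGetVersion());
    return 0;
}

For v5.0.5 it should print 5005.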
Testing it
To verify the setup, we will build some of the sample programs shipped with the toolkit. I had you install them quite a few steps ago. The following instructions assume that you have used my recipe for an rpm build environment. As a normal user:
cd working/BUILD
mkdir cuda-samples
cd cuda-samples
cp -rp /usr/local/cuda-8.0/samples/* .
make
When it's done (and hopefully successful):
cd 1_Utilities/deviceQuery
./deviceQuery
You should get something like:
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1050 Ti"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 4038 MBytes (4234608640 bytes)
( 6) Multiprocessors, (128) CUDA Cores/MP: 768 CUDA Cores
GPU Max Clock rate: 1468 MHz (1.47 GHz)
Memory Clock rate: 3504 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 1048576 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
<snip>
You can also check the device bandwidth as follows:
cd ../bandwidthTest
./bandwidthTest
You should see something like:
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: GeForce GTX 1050 Ti
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6354.8
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6421.6
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 94113.5
Result = PASS
At this point you are done. I will refer back to these instructions in the future. If you see anything wrong or anything that needs updating, please comment on this article.