Thursday, June 29, 2017

Setting up a CUDA development environment on Fedora 25

The aim of this blog is to explore Linux security topics using a data science approach to things. Many people don't like the idea of putting proprietary blobs of code on their nice open source system. But I am pragmatic about things and have to admit that Nvidia is the king of GPU right now. And GPU is the approach to accelerate Deep Learning for the last few years. So, today I'll go over what it takes to correctly setup a CUDA development environment for Fedora 25. This is a continuation of the earlier post about how to get an Nvidia GPU card setup in Fedora. That step is a prerequisite to this blog post.

CUDA
CUDA is the name that NVidia has given to a development environment for creating high performance GPU-accelerated applications. CUDA libraries enable acceleration across multiple domains such as linear algebra, image and video processing, deep learning and graph analytics.These libraries offload work normally done on a CPU to the GPU. And any program created by the CUDA toolkit  is tied to the Nvidia family of GPU's.


Setting it up
The first step is to go get the toolkit. This is not shipped by any distribution. You have to get it directly from Nvidia. You can find the toolkit here:

https://developer.nvidia.com/cuda-downloads

Below is a screenshot of the web site. All the dark boxes are the options that I selected. I like the local rpm option because that installs all CUDA rpms in a local repo that you can then install as you need.



Download it. Even though it says F23, it still works fine on F25.

The day I downloaded it, 8.0.44 was the current release. Today its different. So, I'll continue by using my version numbers and you'll have to make the appropriate substitutions. So, let's continue the setup as root...

rpm -ivh ~/Download/cuda-repo-fedora23-8-0-local-8.0.44-1.x86_64.rpm



This installs a local repo of cuda developer rpms. The repo is located in /var/cuda-repo-8-0-local/. You can list the directory to see all the rpms. Let's install the core libraries that are necessary for Deep Learning:

dnf install /var/cuda-repo-8-0-local/cuda-misc-headers-8-0-8.0.44-1.x86_64.rpm
dnf install /var/cuda-repo-8-0-local/cuda-core-8-0-8.0.44-1.x86_64.rpm
dnf install /var/cuda-repo-8-0-local/cuda-samples-8-0-8.0.44-1.x86_64.rpm


Next, we need to make sure that utilities provided such as the GPU software compiler, nvcc, are in our path and that the libraries can be found. The easiest way to do this by creating a bash profile file that gets included when you start a shell.

edit /etc/profile.d/cuda.sh (which is a new file you are creating now):

export PATH="/usr/local/cuda-8.0/bin${PATH:+:${PATH}}"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64 ${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
export EXTRA_NVCCFLAGS="-Xcompiler -std=c++03"


The reason CUDA is aimed at F23 rather than 25 is that NVidia is not testing against the newest gcc. So, they put something in the headers to make it fail.

I spoke with people from Nvidia at the GTC conference about why they don't support new gcc. Off the record they said they do extensive testing on everything they support and that its just not something they developed with when creating CUDA 8, but newer gcc will probably be support in CUDA 9.

Its easy enough to fix by altering one line in the header to test for the gcc version. Since we have gcc-6.3, we can fix the header to test for gcc 7 or later and then fail. To do this:

edit /usr/local/cuda-8.0/targets/x86_64-linux/include/host_config.h

On line 119 change from:

#if __GNUC__ > 5

to:

#if __GNUC__ > 6


This will allow things to compile with current gcc. There is one more thing that we need to fix in the headers so that Theano can compile GPU code later. The error looks like this:

math_functions.h(8901): error: cannot overload functions distinguished by return type alone

This is because gcc defines the function also and conflicts with the one NVidia ships. The solution as best I can tell is simply to:

edit /usr/local/cuda-8.0/targets/x86_64-linux/include/math_functions.h

and around lines 8897 and 8901 you will find:

/* GCC 6.1 uses ::isnan(double x) for isnan(double x) */
__DEVICE_FUNCTIONS_DECL__ __cudart_builtin__ int isnan(double x) throw();
__DEVICE_FUNCTIONS_DECL__ __cudart_builtin__ constexpr bool isnan(long double x);
__DEVICE_FUNCTIONS_DECL__ __cudart_builtin__ constexpr bool isinf(float x);
/* GCC 6.1 uses ::isinf(double x) for isinf(double x) */
__DEVICE_FUNCTIONS_DECL__ __cudart_builtin__ int isinf(double x) throw();

__DEVICE_FUNCTIONS_DECL__ __cudart_builtin__ constexpr bool isinf(long double x);

What I did is to comment out both lines that immediately follow the comment about gcc 6.1.

OK. Next we need to fix the cuda install paths just a bit. As root:

# cd /usr/local/
# ln -s /usr/local/cuda-8.0/targets/x86_64-linux/ cuda
# cd cuda
# ln -s /usr/local/cuda-8.0/targets/x86_64-linux/lib/ lib64



cuDNN setup
One of the goals of this blog is to explore Deep Learning. You will need the cuDNN libraries for that. So, let's put that in place while we are setting up the system. For some reason this is not shipped in an rpm and this leads to a manual installation that I don't like.

You'll need cuDNN version 5. Go to:

https://developer.nvidia.com/cudnn

To get this you have to have a membership in the Nvidia Developer Program. Its free to join.

Look for "Download cuDNN v5 (May 27, 2016), for CUDA 8.0". Get the Linux one. I moved it to /var/cuda-repo-8-0-local. Assuming you did, too...as root:

# cd /var/cuda-repo-8-0-local
# tar -xzvf cudnn-8.0-linux-x64-v5.0-ga.tgz
# cp cuda/include/cudnn.h /usr/local/cuda/include/
# cp cuda/lib64/libcudnn.so.5.0.5 /usr/local/cuda/lib
# cd /usr/local/cuda/lib
# ln -s /usr/local/cuda/lib/libcudnn.so.5.0.5 libcudnn.so.5
# ln -s /usr/local/cuda/lib/libcudnn.so.5.0.5 libcudnn.so



Testing it
To verify setup, we will make some sample program shipped with the toolkit. I had you to install them quite a few steps ago. The following instructions assume that you have used my recipe for a rpm build environment. As a normal user:

cd working/BUILD
mkdir cuda-samples
cd cuda-samples
cp -rp /usr/local/cuda-8.0/samples/* .
make


When its done (and hopefully its successful):

cd 1_Utilities/deviceQuery
./deviceQuery


You should get something like:

  CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1050 Ti"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 4038 MBytes (4234608640 bytes)
  ( 6) Multiprocessors, (128) CUDA Cores/MP:     768 CUDA Cores
  GPU Max Clock rate:                            1468 MHz (1.47 GHz)
  Memory Clock rate:                             3504 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 1048576 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024


<snip>

 You can also check the device bandwidth as follows:

cd ../bandwidthTest
./bandwidthTest



You should see something like:

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 1050 Ti
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            6354.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            6421.6

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            94113.5

Result = PASS


At this point you are done. I will refer back to these instructions in the future. If you see anything wrong or needs updating, please comment on this article.

No comments: