Showing posts with label CUDA. Show all posts

Monday, March 3, 2025

Selecting CUDA Devices

I noticed that when I run a PyTorch program on a system with two GPUs, PyTorch dispatches the computation to both of them. Since the program is not optimized for multiple GPUs, it performs worse on two GPUs than on one. A simple fix is to tell PyTorch which GPU to use via the environment variable CUDA_VISIBLE_DEVICES, which limits the CUDA devices a process can see.

For instance, to run a task run_task.sh, we can run

 CUDA_VISIBLE_DEVICES=0 ./run_task.sh SEED=1234

which results in running the task on a single GPU. 

For the non-optimized program, I got much better computational efficiency by running two tasks, each pinned to its own GPU, than by letting each task use both GPUs:

 CUDA_VISIBLE_DEVICES=0 ./run_task.sh SEED=1234

 CUDA_VISIBLE_DEVICES=1 ./run_task.sh SEED=4321
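
The two commands above can also be launched from a single script. The following is a minimal sketch, assuming the runs are meant to proceed concurrently and that GPU indices start at 0; the function name launch_per_gpu and the TASK variable are made up for illustration, and run_task.sh is the task script from this post.

```shell
#!/bin/sh
# Sketch: start one background job per seed, each restricted to a single GPU.
# TASK defaults to the script from the post; it is a hypothetical knob so the
# sketch can be pointed at another program.
TASK=${TASK:-./run_task.sh}

launch_per_gpu() {
  # Each argument is a seed; the i-th seed is pinned to GPU index i.
  gpu=0
  for seed in "$@"; do
    CUDA_VISIBLE_DEVICES=$gpu "$TASK" "SEED=$seed" &
    gpu=$((gpu + 1))
  done
  wait  # block until every background job has finished
}

# Example: launch_per_gpu 1234 4321
```

Passing more seeds than there are GPUs would assign indices that do not exist, so the number of seeds should not exceed the number of devices.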




 


 

Sunday, March 26, 2023

Installing GPU Driver for PyTorch and TensorFlow

To use a GPU with PyTorch and TensorFlow, a method I have grown fond of is to install the GPU driver from RPM Fusion, in particular on systems such as Fedora, whose official repositories include only free packages. With this method, we install only the driver from RPM Fusion and use a Python virtual environment to bring in the CUDA libraries.

  1. Configure the RPM Fusion repositories by following the project's instructions, e.g.,
    
        sudo dnf install https://mirrors.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm https://mirrors.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm
        
  2. Install the driver, e.g.,
    
        sudo dnf install akmod-nvidia
        
  3. Add CUDA support, i.e.,
    
        sudo dnf install xorg-x11-drv-nvidia-cuda
        
  4. Check the driver by running nvidia-smi. If it complains that it cannot communicate with the driver, reboot the system.
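
The check in step 4 can be wrapped in a small helper. This is only a sketch: the function name check_nvidia_driver and its return codes are my own choices, while the --query-gpu and --format options are standard nvidia-smi options.

```shell
# Sketch: print the GPU name and driver version, or explain what went wrong.
check_nvidia_driver() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: is the driver package installed?" >&2
    return 1
  fi
  if ! nvidia-smi --query-gpu=name,driver_version --format=csv,noheader; then
    echo "nvidia-smi could not reach the driver: try rebooting" >&2
    return 2
  fi
}

# Example: check_nvidia_driver
```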

If we use only PyTorch or TensorFlow, there is no need to install CUDA from Nvidia, since the CUDA libraries come in through the Python virtual environment.

Reference

  1. https://rpmfusion.org/Configuration

Tuesday, February 7, 2023

Tensorflow Complains "successful NUMA node read from SysFS had negative value (-1)"

To test GPU support for TensorFlow, we should run the following, according to the TensorFlow manual:


python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

However, in my case, I saw an annoying message:


2023-02-07 14:40:01.345350: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

A Stack Overflow discussion has an excellent explanation of this. I have a single CPU and a single GPU installed on the system, which runs Ubuntu 20.04 LTS. Following the advice given there, the following command gets rid of the message:


su -c "echo 0 | tee /sys/module/nvidia/drivers/pci:nvidia/*/numa_node"

That is sweet!
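
Before overwriting anything, it can be useful to see which devices actually report a negative value. The following is a minimal sketch; the function name list_negative_numa_nodes is hypothetical, and the default directory is simply the path used in the command above. Since sysfs is rebuilt at boot, the fix presumably has to be reapplied after a reboot, which makes a quick check like this handy.

```shell
# Sketch: print every numa_node file under a driver directory whose value is negative.
list_negative_numa_nodes() {
  # Optional first argument: directory to scan (defaults to the path from the post).
  dir=${1:-/sys/module/nvidia/drivers/pci:nvidia}
  for f in "$dir"/*/numa_node; do
    [ -r "$f" ] || continue   # skip when the glob matches nothing readable
    value=$(cat "$f")
    if [ "$value" -lt 0 ]; then
      echo "$f: $value"
    fi
  done
}

# Example: list_negative_numa_nodes
```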

Reference

  1. https://www.tensorflow.org/install/pip#linux_setup
  2. https://stackoverflow.com/questions/44232898/memoryerror-in-tensorflow-and-successful-numa-node-read-from-sysfs-had-negativ

Saturday, January 21, 2023

Verifying CUDA Installation

For a full CUDA installation, we can verify it via the following steps:


  # check driver is installed
  cat /proc/driver/nvidia/version
  
  # check the version of the CUDA Toolkit
  CUDA_PATH=/usr/local/cuda
  ${CUDA_PATH}/bin/nvcc --version
  
  # run deviceQuery demo program
  ${CUDA_PATH}/extras/demo_suite/deviceQuery
  
  # run bandwidthTest demo program
  ${CUDA_PATH}/extras/demo_suite/bandwidthTest
  
  # run busGrind demo program
  ${CUDA_PATH}/extras/demo_suite/busGrind
  
  # run vectorAdd demo program
  ${CUDA_PATH}/extras/demo_suite/vectorAdd
  
  # finally, run sample programs from Nvidia
  git clone https://github.com/NVIDIA/cuda-samples
  cd cuda-samples
  make
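
The demo programs above can also be run in one pass with a short summary. This is a sketch under the assumption that each demo signals failure through a nonzero exit status; the function name run_demos is made up for illustration.

```shell
# Sketch: run each named demo from a directory and report PASS or FAIL.
run_demos() {
  demo_dir=$1
  shift
  failed=0
  for demo in "$@"; do
    if "$demo_dir/$demo" >/dev/null 2>&1; then
      echo "PASS $demo"
    else
      echo "FAIL $demo"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]   # succeed only if no demo failed
}

# Example:
# run_demos /usr/local/cuda/extras/demo_suite deviceQuery bandwidthTest vectorAdd
```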
  

Wednesday, January 18, 2023

More Space Needed on Root File System When Installing CUDA Toolkit

Following the instructions on Nvidia's site, I was setting up the CUDA Toolkit on a Fedora Linux host when the installation failed because there was not enough free space on the root file system, as indicated by the error message below:


$ sudo dnf -y install cuda
...
Running transaction check
Transaction check succeeded.
Running transaction test
The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'dnf clean packages'.
Error: Transaction test error:
  installing package cuda-nvcc-12-0-12.0.76-1.x86_64 needs 67MB more space on the / filesystem
  installing package cuda-gdb-12-0-12.0.90-1.x86_64 needs 84MB more space on the / filesystem
  installing package cuda-driver-devel-12-0-12.0.107-1.x86_64 needs 85MB more space on the / filesystem
  installing package cuda-libraries-devel-12-0-12.0.0-1.x86_64 needs 85MB more space on the / filesystem
  installing package cuda-visual-tools-12-0-12.0.0-1.x86_64 needs 85MB more space on the / filesystem
  installing package cuda-documentation-12-0-12.0.76-1.x86_64 needs 85MB more space on the / filesystem
  installing package cuda-demo-suite-12-0-12.0.76-1.x86_64 needs 98MB more space on the / filesystem
  installing package cuda-cuxxfilt-12-0-12.0.76-1.x86_64 needs 99MB more space on the / filesystem
  installing package cuda-cupti-12-0-12.0.90-1.x86_64 needs 210MB more space on the / filesystem
  installing package cuda-cuobjdump-12-0-12.0.76-1.x86_64 needs 210MB more space on the / filesystem
  installing package cuda-compiler-12-0-12.0.0-1.x86_64 needs 210MB more space on the / filesystem
  installing package cuda-sanitizer-12-0-12.0.90-1.x86_64 needs 248MB more space on the / filesystem
  installing package cuda-command-line-tools-12-0-12.0.0-1.x86_64 needs 248MB more space on the / filesystem
  installing package cuda-tools-12-0-12.0.0-1.x86_64 needs 248MB more space on the / filesystem
  installing package cuda-toolkit-12-0-12.0.0-1.x86_64 needs 248MB more space on the / filesystem
  installing package cuda-12-0-12.0.0-1.x86_64 needs 248MB more space on the / filesystem
  installing package cuda-12.0.0-1.x86_64 needs 248MB more space on the / filesystem

Error Summary
-------------
Disk Requirements:
   At least 248MB more space needed on the / filesystem.
...
$

It turns out that CUDA is installed under the /usr/local directory, and indeed, the free space on / was low. The solution is to move /usr/local to a file system that has sufficient disk space and bind-mount it back in place. The following steps illustrate this solution, assuming that the file system mounted at /disks/disk1 has sufficient space:


sudo mkdir /disks/disk1/local
# copy the contents (including dotfiles), preserving permissions and ownership
sudo rsync -av /usr/local/ /disks/disk1/local/
sudo rm -r /usr/local
sudo mkdir /usr/local
sudo mount --bind /disks/disk1/local /usr/local
# back up fstab, then make the bind mount persistent across reboots
sudo cp /etc/fstab /etc/fstab.bu
su -c "echo \
  '/disks/disk1/local /usr/local none defaults,bind,nofail,x-systemd.device-timeout=2 0 0' \
  >> /etc/fstab"
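
To catch this kind of problem before starting a large installation, the free space on a file system can be checked up front. The following is a minimal sketch; the function name has_free_mb is hypothetical, and the threshold in the example is an arbitrary illustration rather than a figure from Nvidia.

```shell
# Sketch: succeed only if the file system holding a path has enough free space.
# Usage: has_free_mb PATH NEED_MB
has_free_mb() {
  path=$1
  need_mb=$2
  # df -Pk prints one POSIX-format line per file system in 1024-byte blocks;
  # the fourth column of the data line is the available space.
  avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
  [ "$((avail_kb / 1024))" -ge "$need_mb" ]
}

# Example (the 8192 MB threshold is only an illustration):
# has_free_mb /usr/local 8192 || echo "not enough space under /usr/local"
```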