
Monday, March 3, 2025

Selecting CUDA Devices

I observed that when I run a PyTorch program on a system with two GPUs, PyTorch dispatches the computational tasks to both of them. Since the program is not optimized for multiple GPUs, the performance with two GPUs is worse than with just one. A simple way to address this turns out to be telling PyTorch to use a designated GPU via the environment variable CUDA_VISIBLE_DEVICES.

For instance, to run a task run_task.sh, we can 

 CUDA_VISIBLE_DEVICES=0 ./run_task.sh SEED=1234

which results in running the task on a single GPU. 
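To double-check from inside the program that only one device is visible, we can ask PyTorch how many devices it sees. A minimal sketch, assuming a CUDA-enabled PyTorch build:

import torch

# with CUDA_VISIBLE_DEVICES=0, PyTorch sees exactly one device,
# which it renumbers as cuda:0 regardless of its physical index
print(torch.cuda.device_count())      # expected: 1
print(torch.cuda.get_device_name(0))  # name of the selected GPU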

For the non-optimized program, I got much better computational efficiency by running each task on its own GPU rather than letting each task run on both GPUs:

 CUDA_VISIBLE_DEVICES=0 ./run_task.sh SEED=1234

 CUDA_VISIBLE_DEVICES=1 ./run_task.sh SEED=4321
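Alternatively, the restriction can be applied from inside the Python program itself, as long as the variable is set before PyTorch initializes CUDA. A minimal sketch; setting it before importing torch is the safest order:

import os

# must happen before CUDA is initialized
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.device_count())  # expected: 1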

Wednesday, September 20, 2023

Setting up Conda Virtual Environment for TensorFlow

These steps create a Python virtual environment for running TensorFlow on a GPU. They work on Fedora Linux 38 and Ubuntu 22.04 LTS.

To install Miniconda, we can run the following as a regular user:


curl -s "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh" | bash

Following that, we create a conda virtual environment for Python.


# create conda virtual environment
conda create -n tf213 python=3.11 pip

# activate the environment in order to install packages and libraries
conda activate tf213

#
# the following are from Tensorflow pip installation guide
#
# install CUDA Toolkit 
conda install -c conda-forge cudatoolkit=11.8.0

# install python packages
pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.13.*

#
# setting up library and tool search paths
# scripts in activate.d are run when the environment
# is activated
#
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
# get CUDNN_PATH
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
# set LD_LIBRARY_PATH
echo 'export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
# set XLA_FLAGS (on some systems, without this, XLA fails with a 'libdevice not found at ./libdevice.10.bc' error)
echo 'export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

To test it, we can run


source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

Enjoy!

Wednesday, August 16, 2023

Bus Error (Core Dumped)!

I was training a machine learning model written in PyTorch on a Linux system. During training, I encountered "Bus error (core dumped)." The error produces no stack trace. Eventually, I figured out that it resulted from the exhaustion of shared memory; the symptom is that "/dev/shm" is full.
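A quick way to confirm the symptom is to check how full /dev/shm is, for example with a few lines of Python (shutil.disk_usage works on any mounted path):

import shutil

# report usage of the shared-memory filesystem
usage = shutil.disk_usage('/dev/shm')
print(f"total={usage.total >> 20} MiB, "
      f"used={usage.used >> 20} MiB, "
      f"free={usage.free >> 20} MiB")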

To resolve this issue, I simply doubled the size of "/dev/shm", following the instructions in this Stack Overflow post:

How to resize /dev/shm?

Basically, we edit the /etc/fstab file. If it already has an entry for /dev/shm, we simply increase the size; if not, we add a line such as

none /dev/shm tmpfs defaults,size=32G 0 0

To put the change into effect, we remount the filesystem:

sudo mount /dev/shm
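As an aside, when resizing /dev/shm is not an option, the pressure can also be reduced on the PyTorch side: DataLoader workers hand batches to the main process through shared memory, so loading data in the main process sidesteps /dev/shm. A hedged sketch with a stand-in dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 8))  # stand-in dataset

# num_workers=0 loads batches in the main process, so there is
# no worker-to-main transfer through shared memory
loader = DataLoader(dataset, batch_size=32, num_workers=0)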