
Wednesday, September 20, 2023

Setting up a Conda Virtual Environment for TensorFlow

These steps create a Python virtual environment for running TensorFlow on a GPU. They work on Fedora Linux 38 and Ubuntu 22.04 LTS.

To install Miniconda, we can run the following as a regular user:


curl -s "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh" | bash
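
If piping a remote script straight into bash is a concern, a variant (a sketch; -b runs the Miniconda installer non-interactively) is to download the installer first and check it before running:


curl -sO "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh"
# compare against the SHA-256 hashes published on repo.anaconda.com/miniconda/
sha256sum Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b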

Following that, we create a conda virtual environment for Python.


# create conda virtual environment
conda create -n tf213 python=3.11 pip

# activate the environment in order to install packages and libraries
conda activate tf213

#
# the following steps are from the TensorFlow pip installation guide
#
# install the CUDA Toolkit
conda install -c conda-forge cudatoolkit=11.8.0

# install python packages
pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.13.*

#
# set up library and tool search paths;
# scripts in activate.d are run whenever
# the environment is activated
#
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
# get CUDNN_PATH
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
# set LD_LIBRARY_PATH
echo 'export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
# set XLA_FLAGS (without this, some systems fail with a 'libdevice not found at ./libdevice.10.bc' error)
echo 'export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
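
After these commands, $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh contains exactly the three lines the echo commands appended; the single quotes keep the variables unexpanded, so they are resolved each time the environment is activated:


CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH
export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX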

To test it, we can run:


source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
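
If a GPU shows up in the list, a slightly stronger smoke test (a minimal sketch; the matrix sizes are arbitrary) is to place an actual computation on the device:


python3 - <<'EOF'
import tensorflow as tf

# pin a matrix multiplication to the first GPU
with tf.device('/GPU:0'):
    a = tf.random.normal((1024, 1024))
    b = tf.random.normal((1024, 1024))
    c = tf.matmul(a, b)

# the device string should end with .../device:GPU:0
print(c.device)
EOF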

Enjoy!

Sunday, March 26, 2023

Installing the GPU Driver for PyTorch and TensorFlow

To use a GPU with PyTorch and TensorFlow, a method I have grown fond of is installing the GPU driver from RPM Fusion, in particular on Fedora systems, whose official repositories include only free packages. With this method, we install only the driver from RPM Fusion and let a Python virtual environment bring in the CUDA libraries.

  1. Configure the RPM Fusion repositories by following the official instructions [1], e.g.:
    
        sudo dnf install https://mirrors.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm https://mirrors.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm
        
  2. Install the driver, e.g.,
    
        sudo dnf install akmod-nvidia
        
  3. Add CUDA support, i.e.,
    
        sudo dnf install xorg-x11-drv-nvidia-cuda
        
  4. Check the driver by running nvidia-smi. If it complains about not being able to connect to the driver, reboot the system.

If we only use PyTorch or TensorFlow, there is no need to install the CUDA Toolkit from Nvidia; the Python packages bring in the CUDA libraries they need.
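
As a quick sanity check (a sketch, assuming a virtual environment named torch-env; recent PyTorch wheels for Linux bundle the CUDA runtime as nvidia-* packages):


python3 -m venv torch-env
source torch-env/bin/activate
# the Linux wheel pulls in the CUDA runtime libraries it needs
pip install torch
# expect: True
python3 -c "import torch; print(torch.cuda.is_available())"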

Reference

  1. https://rpmfusion.org/Configuration

Tuesday, February 7, 2023

TensorFlow Complains "successful NUMA node read from SysFS had negative value (-1)"

To test GPU support for TensorFlow, we should run the following, according to the TensorFlow installation guide [1]:


python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

However, in my case, I saw an annoying message:


2023-02-07 14:40:01.345350: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

A Stack Overflow discussion [2] has an excellent explanation of this. I have a single CPU and a single GPU installed in the system, which runs Ubuntu 20.04 LTS. Following the advice given there, the following command gets rid of the message:


su -c "echo 0 | tee /sys/module/nvidia/drivers/pci:nvidia/*/numa_node"
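
We can confirm the value stuck; each device entry should now report 0. Note that sysfs settings do not survive a reboot, so the command has to be reapplied (or wired into a boot-time script) after a restart:


cat /sys/module/nvidia/drivers/pci:nvidia/*/numa_node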

That is sweet!

Reference

  1. https://www.tensorflow.org/install/pip#linux_setup
  2. https://stackoverflow.com/questions/44232898/memoryerror-in-tensorflow-and-successful-numa-node-read-from-sysfs-had-negativ