Wednesday, September 20, 2023

Setting up Conda Virtual Environment for Tensorflow

These steps create a Python virtual environment for running Tensorflow on a GPU. They work on Fedora Linux 38 and Ubuntu 22.04 LTS:

To install Miniconda, we can run the following as a regular user:


curl -s "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh" | bash

Following that, we create a conda virtual environment for Python.


# create conda virtual environment
conda create -n tf213 python=3.11 pip

# activate the environment in order to install packages and libraries
conda activate tf213

#
# the following are from Tensorflow pip installation guide
#
# install CUDA Toolkit 
conda install -c conda-forge cudatoolkit=11.8.0

# install python packages
pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.13.*

#
# setting up library and tool search paths
# scripts in activate.d shall be run when the environment
# is being activated
#
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
# get CUDNN_PATH
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
# set LD_LIBRARY_PATH
echo 'export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
# set XLA_FLAGS (on some systems, omitting this leads to a 'libdevice not found at ./libdevice.10.bc' error)
echo 'export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

To test it, we can run


source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

Enjoy!

Monday, September 18, 2023

Mounting File Systems in a Disk Image on Linux

On Linux systems, we can create a disk image using the dd command. This post lists the steps to mount file systems, in particular LVM volumes, in an image of a whole disk, which is often created as follows,


dd if=/dev/sdb of=/mnt/disk1/sdb.img bs=1M status=progress

Assuming the disk has multiple partitions, how do we mount the file systems on these partitions? The following are the steps,


# 1. mount the disk where the disk image is
#    we assume the disk is /dev/sdb1, and we mount
#    it on directory win
sudo mount /dev/sdb1 win

# 2. map the partitions to loopback devices
#    here we assume the disk image is win/disks/disk1.img
sudo losetup -f -P win/disks/disk1.img

# 3. list the LVM volumes
sudo lvdisplay

# 4. suppose from the output of the above command,
#    the volume is shown as /dev/mylvm/lvol0,
#    and we want it mounted on directory lvol0
sudo mount /dev/mylvm/lvol0 lvol0

# 5. do something we want ...


# 6. unmount the volume
sudo umount lvol0

# 7. deactivate LVM volume
#    we can query and confirm the volume group with
#    vgdisplay
sudo vgchange -a n mylvm

# 8. detach the loopback device
#    assuming the device is /dev/loop0
sudo losetup -d /dev/loop0

# 9. umount the disk
sudo umount win

Sunday, September 17, 2023

Mounting ZFS Dataset as /home

The following steps work:


# list ZFS pools and datasets
zfs list

# Query current mount point for a ZFS dataset, e.g., mypool/mydataset
zfs get mountpoint mypool/mydataset

# Set new mountpoint to /home
zfs set mountpoint=/home mypool/mydataset

# Always verify
zfs list
zfs get mountpoint mypool/mydataset

Persistent Mount Bind

The following /etc/fstab entry works:


/from_dir_path   /to_dir_path  none    bind,nofail  0  0

Wednesday, August 16, 2023

Bus Error (Core Dumped)!

I was training a machine learning model written in PyTorch on a Linux system. During the training, I encountered "Bus error (core dumped)." This error produces no stack trace. Eventually, I figured out that it resulted from the exhaustion of shared memory, the symptom of which is that "/dev/shm" is full.

To resolve this issue, I simply doubled the size of "/dev/shm", following the instructions given in this Stack Overflow post,

How to resize /dev/shm?

Basically, we edit the /etc/fstab file. If the file already has an entry for /dev/shm, we simply increase its size. If not, we add a line to the file, such as

none /dev/shm tmpfs defaults,size=32G 0 0

To bring it into effect, we remount the file system, as in,

sudo mount /dev/shm
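Before resizing, it is worth confirming that /dev/shm is actually the culprit. A quick check; the usage variable below is just an illustrative name for use in scripts:

```shell
# show how full /dev/shm is; a (nearly) full tmpfs here is the symptom
df -h /dev/shm

# grab the usage percentage programmatically, e.g., for a watchdog script
usage=$(df --output=pcent /dev/shm | tail -n 1 | tr -d ' %')
echo "shm usage: ${usage}%"
```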

 

Thursday, March 30, 2023

Binding Process to TCP/UDP Port Failure on Windows

Windows has the concept of reserved TCP/UDP ports. These ports cannot be bound by ordinary applications. This can be annoying because the reserved ports do not show as in use when we query used ports with netstat. For instance, if we want to bind TCP port 23806 to an application, we check its availability using the netstat command, such as


C:> netstat -anp tcp | find ":23806"

C:>

The output is blank, which means that the port is unused. However, when we attempt to bind the port to a process of our choice, we encounter an error, such as


bind [127.0.0.1]:23806: Permission denied

This is annoying. The reason is that the port somehow becomes a reserved port. To see this, we can query reserved ports, e.g.,


C:> netsh int ipv4 show excludedportrange protocol=tcp

Protocol tcp Port Exclusion Ranges

Start Port    End Port
----------    --------
      1155        1254
      ...          ...
     23733       23832
     23833       23932
     50000       50059     *

* - Administered port exclusions.


C:>
  

which shows that 23806 is now a reserved port. What is really annoying is that the ranges can be updated by Windows dynamically. There are several methods to deal with this.

  1. Method 1. Stop and start the Windows NAT Driver service.
    
      net stop winnat
      net start winnat
      
    After this, query the reserved ports again. Often, the reserved ranges are much more limited than before, e.g.,
    
    C:>netsh int ipv4 show excludedportrange protocol=tcp
    
    Protocol tcp Port Exclusion Ranges
    
    Start Port    End Port
    ----------    --------
          2869        2869
          5357        5357
         50000       50059     *
    
    * - Administered port exclusions.
    
    C:>
      
  2. Method 2. If we don't wish to use this feature of Windows, we can disable the reserved port ranges.
    
    reg add HKLM\SYSTEM\CurrentControlSet\Services\hns\State /v EnableExcludedPortRange /d 0 /f
    

Tuesday, March 28, 2023

Installing and Using CUDA Toolkit and cuDNN in Conda Virtual Environment of Python

This is straightforward.

  1. Create a conda virtual environment, e.g.,
    
    conda create -n cudacudnn python=3.9 pip
          
  2. Activate the virtual environment, i.e.,
    
    conda activate cudacudnn
        
  3. Assume we are using PyTorch 2.0; install it, e.g., via
    
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118  
        
  4. Install CUDA toolkit and cuDNN, e.g.,
    
    conda install -c conda-forge cudnn=8 cudatoolkit=11.8
        
  5. Add the library path of the conda environment to LD_LIBRARY_PATH. There are several approaches. Two of them are as follows, assuming the environment is at $HOME/.conda/envs/cudacudnn and we want to run foo.py,
    
    virtenv_path=$HOME/.conda/envs/cudacudnn
    export LD_LIBRARY_PATH=${virtenv_path}/lib:$LD_LIBRARY_PATH
    python foo.py
        
    or
    
    virtenv_path=$HOME/.conda/envs/cudacudnn
    LD_LIBRARY_PATH=${virtenv_path}/lib:$LD_LIBRARY_PATH python foo.py
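The difference between the two approaches is scoping: export changes LD_LIBRARY_PATH for the rest of the shell session, whereas the VAR=value command form sets the variable only for that single command. A minimal sketch of this behavior with a harmless placeholder variable (MYVAR is purely illustrative):

```shell
# the child process sees the per-command variable ...
MYVAR=hello sh -c 'echo "child sees: $MYVAR"'   # prints: child sees: hello

# ... but the current shell does not keep it afterwards
echo "parent sees: ${MYVAR:-unset}"             # prints: parent sees: unset
```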
        

Sunday, March 26, 2023

Installing GPU Driver for PyTorch and Tensorflow

To use a GPU for PyTorch and Tensorflow, a method I have grown fond of is to install the GPU driver from RPM Fusion, in particular on Fedora or Red Hat systems, whose stock repositories include only free packages. Via this method, we install only the driver from RPM Fusion, and use a Python virtual environment to bring in the CUDA libraries.

  1. Configure RPM Fusion repo by following the instruction, e.g., as follows:
    
        sudo dnf install https://mirrors.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm https://mirrors.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm
        
  2. Install driver, e.g.,
    
        sudo dnf install akmod-nvidia
        
  3. Add CUDA support, i.e.,
    
        sudo dnf install xorg-x11-drv-nvidia-cuda
        
  4. Check driver by running nvidia-smi. If it complains about not being able to connect to the driver, reboot the system.

If we use PyTorch or Tensorflow only, there is no need to install CUDA from Nvidia.

Reference

  1. https://rpmfusion.org/Configuration

Tuesday, February 7, 2023

Tensorflow Complains "successful NUMA node read from SysFS had negative value (-1)"

To test GPU support for Tensorflow, we should run the following, according to the Tensorflow manual


python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

However, in my case, I saw an annoying message:


2023-02-07 14:40:01.345350: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

A Stack Overflow discussion has an excellent explanation of this. I have a single CPU and a single GPU installed on the system, which runs Ubuntu 20.04 LTS. Following the advice given there, the following command gets rid of the message,


su -c "echo 0 | tee /sys/module/nvidia/drivers/pci:nvidia/*/numa_node"

That is sweet!

Reference

  1. https://www.tensorflow.org/install/pip#linux_setup
  2. https://stackoverflow.com/questions/44232898/memoryerror-in-tensorflow-and-successful-numa-node-read-from-sysfs-had-negativ

Saturday, February 4, 2023

Checking RAM Type on Linux

We can use the following command to check RAM types and slots


sudo dmidecode --type 17

Reloading WireGuard Configuration File without Completely Restarting WireGuard Session

On Linux systems, under bash, we can run the following command to reload and apply a revised WireGuard configuration file without restarting the session and disrupting the clients


wg syncconf wg0 <(wg-quick strip wg0)

Note that this command may not work in shells other than bash. However, we can always accomplish this in three steps.


wg-quick strip wg0 > temp_wg0.conf
wg syncconf wg0 temp_wg0.conf
rm temp_wg0.conf
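The bash dependency comes from the <( ... ) process substitution used in the one-liner; POSIX sh has no such construct. A small sketch, independent of WireGuard, of what it does:

```shell
# <(cmd) expands to a file-like path (e.g., /dev/fd/63) from which cmd's
# output can be read, so a command's output can stand in for a file argument
echo <(true)
cat <(printf 'a\nb\n')   # prints the two lines a and b
```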

Determining File System of Current Directory on Linux

On Linux, a simple command can reveal the file system where the current directory is actually located. The command is


df -hT .
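When a script needs only the file-system type rather than the whole table, the output can be reduced. Two sketches using common GNU options:

```shell
# print only the file-system type of the current directory
df --output=fstype . | tail -n 1

# alternatively, query the file system directly via stat
stat -f -c %T .
```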

Sunday, January 29, 2023

Resetting Network Stack on Windows

Sometimes, I want to reset the network stack on Windows. I found that Intel has good documentation for it. I copy the steps below:

Resetting the network stack


ipconfig /release
ipconfig /flushdns
ipconfig /renew
netsh int ip reset
netsh winsock reset

Quick Note on WireGuard Configuration Files

Assume that we set up a VPN server, and a number of clients are the peers of the server. Below are example configuration files

  1. Server Configuration
    
    [Interface]
    Address = 10.188.0.1/32
    PrivateKey = (Private key of the server, generated via: wg genkey | tee server.private)
    ListenPort = 51820
    
    
    
    [Peer]
    PublicKey = (Public key of the client, generated via: wg genkey | tee client.2.private | wg pubkey)
    AllowedIPs = 10.188.0.2/32
    
    [Peer]
    PublicKey = (Public key of the client, generated via: wg genkey | tee client.3.private | wg pubkey)
    AllowedIPs = 10.188.0.3/32
    
    [Peer]
    PublicKey = (Public key of the client, generated via: wg genkey | tee client.4.private | wg pubkey)
    AllowedIPs = 10.188.0.4/32
    
    [Peer]
    PublicKey = (Public key of the client, generated via: wg genkey | tee client.5.private | wg pubkey)
    AllowedIPs = 10.188.0.5/32  
    
    • The AllowedIPs in the Peer section assigns the IP address to the client.
  2. Client Configuration
    
    [Interface]
    Address = 10.188.0.5/32
    PrivateKey = (Private key of the client, e.g., the content of client.5.private)
    DNS = 192.168.1.1,1.1.1.1,8.8.8.8
    
    
    
    [Peer]
    PublicKey = (Public key of the server, generated via: cat server.private | wg pubkey)
    AllowedIPs = 10.188.0.1/32,10.188.0.5/32
    Endpoint = Server_Public_IP_OR_Hostname:51820
    
    
    • The AllowedIPs controls which parts of the network the client can access. My experience is that you must give the client access to the server, i.e., it must include the server's IP address 10.188.0.1; otherwise, there will be a reachability problem.
    • Since it is a client, we should also include the Endpoint.
    • Numerous examples on the Web use AllowedIPs = 0.0.0.0/0,::/0 in the client configuration. Although further investigation is needed to confirm it, my experience is that this can be a problematic setup for Windows clients, in particular when both the server and the client reside in private networks with the same network prefix, e.g., 192.168.1.0/24. Windows does not appear to set up proper routes and seems confused about which private network to reach when given an IP address like 192.168.1.1. In my experience, when this happens, ping on Windows reports "General Failure."

Running WireGuard Windows GUI Client as Non-administrator User

As indicated in this document, and also referenced in several places, we can run the WireGuard Windows GUI client as a non-administrator user, with functionality limited to toggling the existing VPN tunnel configurations on or off.

This generally involves two steps as an administrator on the Windows host:

  1. Create a registry key, as specified in the command below
    
        reg add HKLM\Software\WireGuard /v LimitedOperatorUI /t REG_DWORD /d 1 /f
        
  2. Add the non-administrator user who should be able to toggle the tunnel on/off to the built-in Network Configuration Operators group. We can do this by invoking the lusrmgr.msc command.

Friday, January 27, 2023

Mysterious bash while read var behavior understood!

This is a note about a mysterious behavior of while read var in the Bash shell. To understand the problem, let's consider the following problem:

Given a text file called example.txt as follows, write a Bash shell script called join_lines.sh to join the lines


BEGIN Line 1 Line 1
Line 1 Line 1
BEGIN Line 2 Line 2
Line 2 Line 2
Line 2 Line 2
Line 2
BEGIN Line 3 Line 3 Line 3
Line 3
Line 3

The output should be 3 lines, as illustrated in the example below:


$ ./join_lines.sh
Joined Line: BEGIN Line 1 Line 1 Line 1 Line 1
Joined Line: BEGIN Line 2 Line 2 Line 2 Line 2 Line 2 Line 2 Line 2
Joined Line: BEGIN Line 3 Line 3 Line 3 Line 3 Line 3

Our first implementation of join_lines.sh is as follows:


#!/bin/bash

joined=""
cat example.txt | \
    while read line; do
        echo ${line} | grep -E -q "^BEGIN"
        if [ $? -eq 0 ]; then
            if [ "${joined}" != "" ]; then
                echo "Joind Line: ${joined}"
                joined=""
            fi
        fi
        joined="${joined} ${line}"
    done
echo "Joind Line: ${joined}"

Unfortunately, the output is actually the following:


$ ./join_lines.sh
Joind Line:  BEGIN Line 1 Line 1 Line 1 Line 1
Joind Line:  BEGIN Line 2 Line 2 Line 2 Line 2 Line 2 Line 2 Line 2
Joind Line:
$

Why does the variable joined lose its value? That is a mystery, isn't it? To understand this, let's revise the script to print out the process IDs of the shell. The revised version is as follows:


#!/bin/bash

joined=""
cat example.txt | \
    while read line; do
        echo ${line} | grep -E -q "^BEGIN"
        if [ $? -eq 0 ]; then
            if [ "${joined}" != "" ]; then
                echo "In $$ $BASHPID: Joind Line: ${joined}"
                joined=""
            fi
        fi
        joined="${joined} ${line}"
    done
echo "In $$ $BASHPID: Joind Line: ${joined}"

If we run this revised script, we shall get something like the following:


$ ./join_lines.sh
In 7065 7067: Joind Line:  BEGIN Line 1 Line 1 Line 1 Line 1
In 7065 7067: Joind Line:  BEGIN Line 2 Line 2 Line 2 Line 2 Line 2 Line 2 Line 2
In 7065 7065: Joind Line:
$

By carefully examining the output, we can see that $$ and $BASHPID have different values on the first two lines. So, what is the difference between $$ and $BASHPID, and why are they different?

The Bash manual page states this:


$ man bash
...
 BASHPID
              Expands  to  the  process  ID of the current bash process.  This
              differs from $$ under certain circumstances, such  as  subshells
              that  do  not require bash to be re-initialized.  Assignments to
              BASHPID have no effect.  If BASHPID is unset, it loses its  spe‐
              cial properties, even if it is subsequently reset.
 ...
$

The above experiment reveals that the while read-loop actually runs in a subshell. In fact, there are two variables, both called joined: one lives in the parent bash process and the other in the child. A simple fix to the script is to put the while read-loop and the last echo command in the same subshell, e.g., as follows:


#!/bin/bash

joined=""
cat example.txt | \
	( \
    while read line; do
        echo ${line} | grep -E -q "^BEGIN"
        if [ $? -eq 0 ]; then
            if [ "${joined}" != "" ]; then
                echo "In $$ $BASHPID: Joind Line: ${joined}"
                joined=""
            fi
        fi
        joined="${joined} ${line}"
    done
echo "In $$ $BASHPID: Joind Line: ${joined}" \
	)

Let's run this revised script. We shall get:


$ ./join_lines.sh
In 7119 7121: Joind Line:  BEGIN Line 1 Line 1 Line 1 Line 1
In 7119 7121: Joind Line:  BEGIN Line 2 Line 2 Line 2 Line 2 Line 2 Line 2 Line 2
In 7119 7121: Joind Line:  BEGIN Line 3 Line 3 Line 3 Line 3 Line 3

The mystery is solved!
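Besides wrapping the pipeline in an explicit subshell, another common fix is to drop the pipe entirely and feed the file to the loop with input redirection, so the loop runs in the current shell and joined keeps its value. A self-contained sketch (it recreates a small sample input just for illustration):

```shell
#!/bin/bash

# recreate a small sample input
printf 'BEGIN Line 1\nLine 1\nBEGIN Line 2\nLine 2\n' > example.txt

joined=""
# input redirection instead of "cat example.txt |": no subshell is created
while read line; do
    if echo "${line}" | grep -E -q "^BEGIN"; then
        if [ "${joined}" != "" ]; then
            echo "Joined Line:${joined}"
            joined=""
        fi
    fi
    joined="${joined} ${line}"
done < example.txt
# the final echo now sees the preserved value of joined
echo "Joined Line:${joined}"
```

With the original cat ... | while form, the final echo prints an empty joined, exactly as shown earlier in the post.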

Wednesday, January 25, 2023

Disabling Linux Boot Splash Window

Most Linux systems use plymouthd to display a splash screen during boot. If you run the computer as a server and do not log in from the console, plymouthd can sometimes bring more trouble than it is worth. For one, to display the splash screen, plymouthd needs to interact with the driver of the graphics adapter in the system, and if there is an issue there, the system will not boot successfully. Since the server's console may not be conveniently accessible, this can be a real inconvenience.

To remove it on Linux systems like Fedora and Redhat, we can do the following,


sudo grubby --update-kernel=ALL --remove-args="quiet"
sudo grubby --update-kernel=ALL --remove-args="rhgb"
# directly edit /etc/default/grub and add "rd.plymouth=0 plymouth.enable=0" to GRUB_CMDLINE_LINUX
sudo vi /etc/default/grub
sudo grub2-mkconfig -o /etc/grub2.cfg
sudo dnf remove plymouth

Saturday, January 21, 2023

Verifying CUDA Installation

For a full CUDA installation, we can verify it via the following steps


  # check driver is installed
  cat /proc/driver/nvidia/version
  
  # check the version of the CUDA Toolkit
  CUDA_PATH=/usr/local/cuda
  ${CUDA_PATH}/bin/nvcc --version
  
  # run deviceQuery demo program
  ${CUDA_PATH}/extras/demo_suite/deviceQuery
  
  # run bandwidthTest demo program
  ${CUDA_PATH}/extras/demo_suite/bandwidthTest
  
  # run busGrind demo program
  ${CUDA_PATH}/extras/demo_suite/busGrind
  
  # run vectorAdd demo program
  ${CUDA_PATH}/extras/demo_suite/vectorAdd
  
  # finally, run sample programs from Nvidia
  git clone https://github.com/NVIDIA/cuda-samples
  cd cuda-samples
  make
  

Thursday, January 19, 2023

Removing Pandas SettingWithCopyWarning in Python Programs

Pandas can issue SettingWithCopyWarning messages. Although these messages can be false positives, more often than not they indicate a bug or potential bug in our Python program. However, it is sometimes not straightforward to remove them, not until we have addressed a few thorny cases. This note documents a scenario in which such a warning message manifests. First, let's take a look at the following Python program:


"""
test_copywarn.py
"""
import numpy as np
import pandas as pd


def get_subdf(df, rows):
    return df.iloc[rows]

def process_row(c1, c2):
    return c1+c2, c1-c2

if __name__ == '__main__':
    columns = ['c{}'.format(i) for i in range(3)]
    indices = ['i{}'.format(i) for i in range(8)]
    df = pd.DataFrame(np.random.random((8, 3)),
                      columns=columns,
                      index=indices)
    print(df)

    rows = [i+2 for i in range(4)]
    df2 = get_subdf(df, rows)
    print(df2)


    df2[['d', 'e']] = \
            df2.apply(lambda row: process_row(row['c1'], row['c2']),
                      axis=1,
                      result_type='expand')

    print(df2)

In the program, we use the pandas DataFrame.apply() function to compute new columns from existing columns.

For reproducibility, we document the versions of Python and the two packages imported:


$ python --version
Python 3.9.15
$ python -c "import pandas as pd; print(pd.__version__)"
1.5.2
$ python -c "import numpy as np; print(np.__version__)"
1.23.5
$

Now let's run the Python program:


$ python test_copywarn.py
          c0        c1        c2
i0  0.989495  0.071666  0.767847
i1  0.728875  0.881395  0.878282
i2  0.620991  0.391125  0.758265
i3  0.344082  0.971074  0.666805
i4  0.794103  0.554744  0.687492
i5  0.037881  0.790503  0.175453
i6  0.545525  0.493586  0.859064
i7  0.797247  0.271426  0.995042
          c0        c1        c2
i2  0.620991  0.391125  0.758265
i3  0.344082  0.971074  0.666805
i4  0.794103  0.554744  0.687492
i5  0.037881  0.790503  0.175453
test_copywarn.py:25: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2[['d', 'e']] = \
test_copywarn.py:25: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2[['d', 'e']] = \
          c0        c1        c2         d         e
i2  0.620991  0.391125  0.758265  1.149390 -0.367141
i3  0.344082  0.971074  0.666805  1.637879  0.304269
i4  0.794103  0.554744  0.687492  1.242236 -0.132747
i5  0.037881  0.790503  0.175453  0.965956  0.615050
$

Python complains about the line where we compute new columns from existing columns via the apply function, and suggests that we use .loc[row_indexer,col_indexer] instead. The result appears to be correct despite the warning messages. However, we shall see that blindly following the suggestion can have disastrous results. In the following, we replace:


df2[['d', 'e']] = \
            df2.apply(lambda row: process_row(row['c1'], row['c2']),
                      axis=1,
                      result_type='expand')

with


df2.loc[:, ['d', 'e']] = \
            df2.apply(lambda row: process_row(row['c1'], row['c2']),
                      axis=1,
                      result_type='expand')

and run it again:


$ python test_copywarn.py
...
test_copywarn.py:25: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2.loc[:, ['d', 'e']] = \
          c0        c1        c2   d   e
i2  0.182985  0.635170  0.476586 NaN NaN
i3  0.157991  0.587269  0.498907 NaN NaN
i4  0.576238  0.669497  0.622658 NaN NaN
i5  0.304192  0.539268  0.618814 NaN NaN
$

We observe that columns d and e now have incorrect values. Two lessons here are:

  1. If we want to add new columns to a DataFrame, it is wrong to use .loc, because .loc slices the DataFrame, and when the slice does not exist, the result can be incorrect.
  2. The error may not be at the line where the SettingWithCopyWarning is issued.

For this particular example, a closer examination reveals that the error results from the chained assignment, effectively:


	df.iloc[rows][['d', 'e']] = df.iloc[rows].apply(...)

because df2 is returned from get_subdf. Pandas is, in effect, asking us: do we want to change the original DataFrame df? Having understood this, we have two ways to fix it:

We can make a deep copy of the slice so that it becomes a new DataFrame, as below:


    ...
    df2 = get_subdf(df, rows).copy()
    ...
    df2[['d', 'e']] = \
            df2.apply(lambda row: process_row(row['c1'], row['c2']),
                      axis=1,
                      result_type='expand')
    ...

Alternatively, if we never use the original DataFrame again, we can reuse the name df for the slice, which also gets rid of the warning: whether we want to change the original DataFrame df is irrelevant, since we lose access to it once we do df = get_subdf(df, rows). Just to emphasize this point, the complete program with this revision is below:


$ cat test_copywarn.py
import numpy as np
import pandas as pd

def get_subdf(df, rows):
    return df.iloc[rows]

def process_row(c1, c2):
    return c1+c2, c1-c2


if __name__ == '__main__':

    columns = ['c{}'.format(i) for i in range(3)]
    indices = ['i{}'.format(i) for i in range(8)]
    df = pd.DataFrame(np.random.random((8, 3)),
                      columns=columns,
                      index=indices)
    print(df)

    rows = [i+2 for i in range(4)]
    df = get_subdf(df, rows).copy()
    print(df)


    df[['d', 'e']] = \
            df.apply(lambda row: process_row(row['c1'], row['c2']),
                      axis=1,
                      result_type='expand')

    print(df)
$ python test_copywarn.py
          c0        c1        c2
i0  0.588995  0.706887  0.684446
i1  0.142972  0.481663  0.318174
i2  0.669792  0.869648  0.439205
i3  0.663541  0.951182  0.062734
i4  0.084048  0.089704  0.264744
i5  0.952133  0.087036  0.796757
i6  0.180122  0.819766  0.949701
i7  0.761599  0.772481  0.559961
          c0        c1        c2
i2  0.669792  0.869648  0.439205
i3  0.663541  0.951182  0.062734
i4  0.084048  0.089704  0.264744
i5  0.952133  0.087036  0.796757
          c0        c1        c2         d         e
i2  0.669792  0.869648  0.439205  1.308853  0.430444
i3  0.663541  0.951182  0.062734  1.013916  0.888447
i4  0.084048  0.089704  0.264744  0.354449 -0.175040
i5  0.952133  0.087036  0.796757  0.883793 -0.709720
$

which is interesting and worth noting.

Wednesday, January 18, 2023

More Space Needed on Root File System When Installing CUDA Toolkit

Following the instructions on Nvidia's site, I was setting up the CUDA Toolkit on a Fedora Linux host and encountered a problem: the installation failed due to not enough free space on the root file system, as indicated by the error message below


$ sudo dnf -y install cuda
...
Running transaction check
Transaction check succeeded.
Running transaction test
The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'dnf clean packages'.
Error: Transaction test error:
  installing package cuda-nvcc-12-0-12.0.76-1.x86_64 needs 67MB more space on the / filesystem
  installing package cuda-gdb-12-0-12.0.90-1.x86_64 needs 84MB more space on the / filesystem
  installing package cuda-driver-devel-12-0-12.0.107-1.x86_64 needs 85MB more space on the / filesystem
  installing package cuda-libraries-devel-12-0-12.0.0-1.x86_64 needs 85MB more space on the / filesystem
  installing package cuda-visual-tools-12-0-12.0.0-1.x86_64 needs 85MB more space on the / filesystem
  installing package cuda-documentation-12-0-12.0.76-1.x86_64 needs 85MB more space on the / filesystem
  installing package cuda-demo-suite-12-0-12.0.76-1.x86_64 needs 98MB more space on the / filesystem
  installing package cuda-cuxxfilt-12-0-12.0.76-1.x86_64 needs 99MB more space on the / filesystem
  installing package cuda-cupti-12-0-12.0.90-1.x86_64 needs 210MB more space on the / filesystem
  installing package cuda-cuobjdump-12-0-12.0.76-1.x86_64 needs 210MB more space on the / filesystem
  installing package cuda-compiler-12-0-12.0.0-1.x86_64 needs 210MB more space on the / filesystem
  installing package cuda-sanitizer-12-0-12.0.90-1.x86_64 needs 248MB more space on the / filesystem
  installing package cuda-command-line-tools-12-0-12.0.0-1.x86_64 needs 248MB more space on the / filesystem
  installing package cuda-tools-12-0-12.0.0-1.x86_64 needs 248MB more space on the / filesystem
  installing package cuda-toolkit-12-0-12.0.0-1.x86_64 needs 248MB more space on the / filesystem
  installing package cuda-12-0-12.0.0-1.x86_64 needs 248MB more space on the / filesystem
  installing package cuda-12.0.0-1.x86_64 needs 248MB more space on the / filesystem

Error Summary
-------------
Disk Requirements:
   At least 248MB more space needed on the / filesystem.
...
$

It turns out that CUDA is installed under the /usr/local directory, and indeed, the free space on / was low. The solution to this problem is to bind-mount /usr/local onto a file system that has sufficient disk space. The following steps illustrate this solution, provided that the file system mounted at /disks/disk1 has sufficient space


sudo mkdir /disks/disk1/local
sudo rsync -azv /usr/local/* /disks/disk1/local/
sudo rm -r /usr/local
sudo mkdir /usr/local
sudo mount --bind /disks/disk1/local /usr/local
sudo cp /etc/fstab /etc/fstab.bu
su -c "echo \
  '/disks/disk1/local /usr/local none defaults,bind,nofail,x-systemd.device-timeout=2 0 0' \
  >> /etc/fstab"

Tuesday, January 17, 2023

Installing Missing LaTeX Packages?

I recently discovered that I can easily install missing LaTeX packages on Fedora Linux, that is, via


sudo dnf install 'tex(beamer.cls)' 
sudo dnf install 'tex(hyperref.sty)' 

Can we do something similar on Debian/Ubuntu distributions?

Reference

  1. https://docs.fedoraproject.org/en-US/neurofedora/latex/

Monday, January 16, 2023

Creating and Starting KVM Virtual Machine: Basic Steps

This is just a note documenting the basic steps to create and start KVM virtual machines on Linux systems

  1. Make a plan for virtual machine resources. For this, we should query host resources.
    
        # show available disk space
        df -h
        # show available memory
        free -m
        # CPUs
        lscpu
        
  2. Assume we are installing an Ubuntu server system. We shall download the ISO image for the system, e.g.,
    
        wget \
          https://releases.ubuntu.com/22.04.1/ubuntu-22.04.1-live-server-amd64.iso \
          -O /var/lib/libvirt/images/ubuntu-22.04.1-live-server-amd64.iso
        
  3. Create a virtual disk for the virtual machine, e.g.,
    
        sudo truncate --size=10240M /var/lib/libvirt/images/officeservice.img
        
  4. Decide how we should configure the virtual machine network. First, we query the existing virtual networks:
    
        virsh --connect qemu:///system  net-list --all
        
  5. Now create a virtual machine and set up Ubuntu Linux on it, e.g.,
    
        sudo virt-install --name ubuntu \
        --description 'Ubuntu Server LTS' \
        --ram 4096 \
        --vcpus 2 \
        --disk path=/var/lib/libvirt/images/officeservice.img,size=10 \
        --osinfo detect=on,name=ubuntu-lts-latest \
        --network network=default \
        --graphics vnc,listen=127.0.0.1,port=5901 \
        --cdrom /var/lib/libvirt/images/ubuntu-22.04.1-live-server-amd64.iso  \
        --noautoconsole \
        --connect qemu:///system
        
  6. Suppose that we connect to the Linux host via ssh from a Windows host. We cannot directly access the console of the virtual machine (which is at 127.0.0.1:5901 via VNC). In this case, we tunnel to the Linux host (assume its host name is LinuxHost) from the Windows host:
    
        ssh -L 15901:localhost:5901 LinuxHost
        
  7. We can now access the console via a VNC viewer on the Windows host at localhost:15901.
  8. Once the Ubuntu installation is over, we lose the VNC connectivity. But we can list the virtual machine created.
    
        sudo virsh --connect qemu:///system list --all
        
  9. To start the virtual machine, we run
    
        sudo virsh --connect qemu:///system  start ubuntu
        
  10. To make the virtual machine start when we boot the host, set it to autostart, e.g.,
    
    	virsh --connect qemu:///system autostart ubuntu
    	

References

  1. https://docs.fedoraproject.org/en-US/quick-docs/getting-started-with-virtualization/
  2. https://ubuntu.com/blog/kvm-hyphervisor
  3. https://askubuntu.com/questions/160152/virt-install-says-name-is-in-use-but-virsh-list-all-is-empty-where-is-virt-i
  4. https://www.cyberciti.biz/faq/rhel-centos-linux-kvm-virtualization-start-virtual-machine-guest/
  5. https://www.cyberciti.biz/faq/howto-linux-delete-a-running-vm-guest-on-kvm/

Listing Physical Disks behind Hardware RAID Controller on Linux

Without rebooting into the BIOS or the hardware RAID controller's firmware, can we figure out which disks the controller manages? The answer is generally yes. However, the method can vary from one RAID controller to another. To list the physical disks on Linux, we first need to figure out the RAID controller model, such as,


lspci | grep RAID

In my case, I have a MegaRAID, a popular RAID controller. To figure out the disks connected to the RAID controller, we can use smartctl as follows,


sudo smartctl -i -d megaraid,0 /dev/sdb

where "megaraid" is the controller type, "0" is the 0-th disk behind the controller, and "/dev/sdb" is the Linux device for the disk array

Having understood this, we can list all the disks with a script as follows:


#!/bin/bash
device=/dev/sdb
disk=0
while [ 1 ]; do
   sudo smartctl -i -d megaraid,${disk} ${device}
   if [ $? -ne 0 ]; then
     break
   fi
   let disk=${disk}+1
done

Listing Bind Mounts on Linux Systems

To list bind mounts, we can use the findmnt command. For bind mounts, findmnt prints the mounted directories in a pair of square brackets. Thus, we can use the following:


findmnt | grep -E "\[.*\]"

Thursday, January 12, 2023

Python failed to load a pickle: __randomstate_ctor() takes from 0 to 1 positional arguments but 2 were given

When I tried to load, on one host (host A), a Python pickle created on another host (host B), I encountered an error as follows:


__randomstate_ctor() takes from 0 to 1 positional arguments but 2 were given

It turns out that I had different versions of numpy on hosts A and B. To fix it, I went to host B, where the pickle was created, and figured out the version of numpy


$ pip list --format=freeze | grep numpy
numpy==1.24.1

On host A, I installed the same version of numpy


 pip install numpy==1.24.1

The problem went away!