Thursday, January 19, 2023

Removing Pandas SettingWithCopyWarning in Python Programs

Pandas can issue SettingWithCopyWarning messages. Although these can be false positives, more often than not they indicate a bug or potential bug in our Python program. However, removing them is not always straightforward, and a few thorny cases have to be addressed first. This note documents one scenario in which the warning appears. First, let's take a look at the following Python program:


"""
test_copywarn.py
"""
import numpy as np
import pandas as pd


def get_subdf(df, rows):
    return df.iloc[rows]

def process_row(c1, c2):
    return c1+c2, c1-c2

if __name__ == '__main__':
    columns = ['c{}'.format(i) for i in range(3)]
    indices = ['i{}'.format(i) for i in range(8)]
    df = pd.DataFrame(np.random.random((8, 3)),
                      columns=columns,
                      index=indices)
    print(df)

    rows = [i+2 for i in range(4)]
    df2 = get_subdf(df, rows)
    print(df2)


    df2[['d', 'e']] = \
            df2.apply(lambda row: process_row(row['c1'], row['c2']),
                      axis=1,
                      result_type='expand')

    print(df2)

In the program, we use the pandas.DataFrame.apply() function to compute new columns from existing columns.

For reproducibility, we document the versions of Python and of the two imported packages:


$ python --version
Python 3.9.15
$ python -c "import pandas as pd; print(pd.__version__)"
1.5.2
$ python -c "import numpy as np; print(np.__version__)"
1.23.5
$

Now let's run the Python program:


$ python test_copywarn.py
          c0        c1        c2
i0  0.989495  0.071666  0.767847
i1  0.728875  0.881395  0.878282
i2  0.620991  0.391125  0.758265
i3  0.344082  0.971074  0.666805
i4  0.794103  0.554744  0.687492
i5  0.037881  0.790503  0.175453
i6  0.545525  0.493586  0.859064
i7  0.797247  0.271426  0.995042
          c0        c1        c2
i2  0.620991  0.391125  0.758265
i3  0.344082  0.971074  0.666805
i4  0.794103  0.554744  0.687492
i5  0.037881  0.790503  0.175453
test_copywarn.py:25: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2[['d', 'e']] = \
test_copywarn.py:25: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2[['d', 'e']] = \
          c0        c1        c2         d         e
i2  0.620991  0.391125  0.758265  1.149390 -0.367141
i3  0.344082  0.971074  0.666805  1.637879  0.304269
i4  0.794103  0.554744  0.687492  1.242236 -0.132747
i5  0.037881  0.790503  0.175453  0.965956  0.615050
$

Pandas complains about the line where we compute new columns from existing columns via the apply function, and suggests that we use .loc[row_indexer,col_indexer] instead. The result appears to be correct despite the warning messages. However, as we shall see, blindly following this suggestion can have disastrous results. In the following, we replace:


df2[['d', 'e']] = \
            df2.apply(lambda row: process_row(row['c1'], row['c2']),
                      axis=1,
                      result_type='expand')

with


df2.loc[:, ['d', 'e']] = \
            df2.apply(lambda row: process_row(row['c1'], row['c2']),
                      axis=1,
                      result_type='expand')

and run it again:


$ python test_copywarn.py
...
test_copywarn.py:25: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2.loc[:, ['d', 'e']] = \
          c0        c1        c2   d   e
i2  0.182985  0.635170  0.476586 NaN NaN
i3  0.157991  0.587269  0.498907 NaN NaN
i4  0.576238  0.669497  0.622658 NaN NaN
i5  0.304192  0.539268  0.618814 NaN NaN
$

We observe that columns d and e now contain only NaN values, which is incorrect. There are two lessons here:

  1. If we want to add new columns to a DataFrame, .loc is the wrong tool. It is a label-based indexer that selects existing rows and columns, and when the selected columns do not exist yet, the assignment can produce incorrect results.
  2. The error may not be at the line where the SettingWithCopyWarning is issued.
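A likely explanation for the NaN values in the first lesson is label alignment: when a DataFrame is assigned through .loc, pandas aligns it on row and column labels, and the frame returned by apply with result_type='expand' carries the default column labels 0 and 1 rather than d and e. The sketch below only inspects those labels and then shows one label-safe way to attach the new columns. The small two-row frame and the names expanded, raw_labels and out are made up for illustration and are not part of the original program.

```python
import pandas as pd

# Tiny stand-in for the DataFrame used in the post (illustrative values).
df = pd.DataFrame({'c1': [1.0, 2.0], 'c2': [0.5, 4.0]}, index=['i0', 'i1'])


def process_row(c1, c2):
    return c1 + c2, c1 - c2


# apply(..., result_type='expand') turns each returned tuple into columns
# that are labelled 0 and 1, not 'd' and 'e'.
expanded = df.apply(lambda row: process_row(row['c1'], row['c2']),
                    axis=1,
                    result_type='expand')
raw_labels = list(expanded.columns)
print(raw_labels)

# Give the columns their final names, then join on the shared row index.
# join returns a new DataFrame, so df itself is left untouched.
expanded.columns = ['d', 'e']
out = df.join(expanded)
print(out)
```

Because the new columns are named before they are attached, the result no longer depends on whether the assignment matches columns by position or by label.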

For this particular example, closer examination shows that the warning results from what is effectively a chained assignment:


	df.iloc[rows][['d', 'e']] = df.iloc[rows].apply(...)

because df2 is what get_subdf returns, namely df.iloc[rows]. Pandas is in effect asking us: do we intend to change the original DataFrame df? With this understanding, there are two ways to fix the problem.

First, we can make a deep copy of the slice so that it becomes a new, independent DataFrame, as below:


    ...
    df2 = get_subdf(df, rows).copy()
    ...
    df2[['d', 'e']] = \
            df2.apply(lambda row: process_row(row['c1'], row['c2']),
                      axis=1,
                      result_type='expand')
    ...
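As a rough check of this first fix, the sketch below records any warnings raised while the new columns are added to an explicit copy, and confirms that the original DataFrame is left unchanged. The helper add_sum_diff, the variable copy_warnings and the deterministic np.arange data are assumptions made for illustration. Warnings are matched by class name so the check does not depend on where a given pandas version exposes the warning class.

```python
import warnings

import numpy as np
import pandas as pd


def add_sum_diff(frame):
    """Add d = c1 + c2 and e = c1 - c2 in place.

    Returns the class names of all warnings raised during the assignment.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        frame[['d', 'e']] = frame.apply(
            lambda row: (row['c1'] + row['c2'], row['c1'] - row['c2']),
            axis=1,
            result_type='expand')
    return [w.category.__name__ for w in caught]


# Deterministic data instead of np.random.random, so results are repeatable.
df = pd.DataFrame(np.arange(24, dtype=float).reshape(8, 3),
                  columns=['c0', 'c1', 'c2'],
                  index=['i{}'.format(i) for i in range(8)])

# Explicit copy: df2 is an independent DataFrame, not a slice of df.
df2 = df.iloc[[2, 3, 4, 5]].copy()

copy_warnings = [name for name in add_sum_diff(df2)
                 if 'SettingWithCopy' in name]
print(copy_warnings)      # expected to be empty for an explicit copy
print(list(df.columns))   # the original frame has no 'd' or 'e' column
print(df2)
```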

Alternatively, if we never use the original DataFrame again, we can rename df2 to df. This also gets rid of the warning: once we do df = get_subdf(df, rows), we lose access to the original DataFrame, so the question of whether we meant to change it becomes irrelevant, and no SettingWithCopyWarning is issued any more. To emphasize this point, the complete program with this revision is shown below (the listing also keeps the .copy() call from the first fix):


$ cat test_copywarn.py
import numpy as np
import pandas as pd

def get_subdf(df, rows):
    return df.iloc[rows]

def process_row(c1, c2):
    return c1+c2, c1-c2


if __name__ == '__main__':

    columns = ['c{}'.format(i) for i in range(3)]
    indices = ['i{}'.format(i) for i in range(8)]
    df = pd.DataFrame(np.random.random((8, 3)),
                      columns=columns,
                      index=indices)
    print(df)

    rows = [i+2 for i in range(4)]
    df = get_subdf(df, rows).copy()
    print(df)


    df[['d', 'e']] = \
            df.apply(lambda row: process_row(row['c1'], row['c2']),
                      axis=1,
                      result_type='expand')

    print(df)
$ python test_copywarn.py
          c0        c1        c2
i0  0.588995  0.706887  0.684446
i1  0.142972  0.481663  0.318174
i2  0.669792  0.869648  0.439205
i3  0.663541  0.951182  0.062734
i4  0.084048  0.089704  0.264744
i5  0.952133  0.087036  0.796757
i6  0.180122  0.819766  0.949701
i7  0.761599  0.772481  0.559961
          c0        c1        c2
i2  0.669792  0.869648  0.439205
i3  0.663541  0.951182  0.062734
i4  0.084048  0.089704  0.264744
i5  0.952133  0.087036  0.796757
          c0        c1        c2         d         e
i2  0.669792  0.869648  0.439205  1.308853  0.430444
i3  0.663541  0.951182  0.062734  1.013916  0.888447
i4  0.084048  0.089704  0.264744  0.354449 -0.175040
i5  0.952133  0.087036  0.796757  0.883793 -0.709720
$

This behavior is interesting and worth noting.
