Pandas can issue SettingWithCopyWarning messages. Although these warnings can be false positives,
more often than not they indicate a bug or potential bug in our Python program. However, it is sometimes
not straightforward to remove them until we have addressed a few thorny cases. This note documents
a scenario in which such a warning message manifests. First, let's take a look at the following Python
program:
"""
test_copywarn.py
"""
import numpy as np
import pandas as pd

def get_subdf(df, rows):
    return df.iloc[rows]

def process_row(c1, c2):
    return c1+c2, c1-c2

if __name__ == '__main__':
    columns = ['c{}'.format(i) for i in range(3)]
    indices = ['i{}'.format(i) for i in range(8)]
    df = pd.DataFrame(np.random.random((8, 3)),
                      columns=columns,
                      index=indices)
    print(df)
    rows = [i+2 for i in range(4)]
    df2 = get_subdf(df, rows)
    print(df2)
    df2[['d', 'e']] = \
        df2.apply(lambda row: process_row(row['c1'], row['c2']),
                  axis=1,
                  result_type='expand')
    print(df2)
In the program, we use the pandas.DataFrame.apply()
function to compute new columns from existing columns.
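To make the shape of that result concrete, here is a minimal sketch on a small hand-made DataFrame (the name demo and its values are illustrative and not part of the program above). With axis=1 and result_type='expand', each tuple returned by the function is spread across columns, and those columns carry the integer labels 0 and 1 rather than names of our choosing:

```python
import pandas as pd

def process_row(c1, c2):
    return c1 + c2, c1 - c2

# Illustrative data, not taken from the program above.
demo = pd.DataFrame({'c1': [1.0, 2.0], 'c2': [0.5, 4.0]},
                    index=['i0', 'i1'])

# Each returned tuple becomes one row of a new DataFrame.
expanded = demo.apply(lambda row: process_row(row['c1'], row['c2']),
                      axis=1,
                      result_type='expand')

print(list(expanded.columns))  # integer column labels
print(list(expanded.index))    # same row labels as demo
print(expanded)
```

The row index is preserved, which is what lets the result be assigned back to the DataFrame it was computed from.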
For reproducibility, we document the versions of Python and the two imported packages:
$ python --version
Python 3.9.15
$ python -c "import pandas as pd; print(pd.__version__)"
1.5.2
$ python -c "import numpy as np; print(np.__version__)"
1.23.5
$
Now let's run the Python program:
$ python test_copywarn.py
          c0        c1        c2
i0  0.989495  0.071666  0.767847
i1  0.728875  0.881395  0.878282
i2  0.620991  0.391125  0.758265
i3  0.344082  0.971074  0.666805
i4  0.794103  0.554744  0.687492
i5  0.037881  0.790503  0.175453
i6  0.545525  0.493586  0.859064
i7  0.797247  0.271426  0.995042
          c0        c1        c2
i2  0.620991  0.391125  0.758265
i3  0.344082  0.971074  0.666805
i4  0.794103  0.554744  0.687492
i5  0.037881  0.790503  0.175453
test_copywarn.py:25: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df2[['d', 'e']] = \
test_copywarn.py:25: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df2[['d', 'e']] = \
          c0        c1        c2         d         e
i2  0.620991  0.391125  0.758265  1.149390 -0.367141
i3  0.344082  0.971074  0.666805  1.637879  0.304269
i4  0.794103  0.554744  0.687492  1.242236 -0.132747
i5  0.037881  0.790503  0.175453  0.965956  0.615050
$
Pandas complains about the line where we compute new columns from existing columns via the apply
function, and suggests that we use .loc[row_indexer,col_indexer]
instead. The result appears to be correct despite the warning messages. However, we shall see that blindly following this suggestion can have disastrous results.
In the following, we replace:
    df2[['d', 'e']] = \
        df2.apply(lambda row: process_row(row['c1'], row['c2']),
                  axis=1,
                  result_type='expand')
with
    df2.loc[:, ['d', 'e']] = \
        df2.apply(lambda row: process_row(row['c1'], row['c2']),
                  axis=1,
                  result_type='expand')
and run it again:
$ python test_copywarn.py
...
test_copywarn.py:25: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df2.loc[:, ['d', 'e']] = \
          c0        c1        c2    d    e
i2  0.182985  0.635170  0.476586  NaN  NaN
i3  0.157991  0.587269  0.498907  NaN  NaN
i4  0.576238  0.669497  0.622658  NaN  NaN
i5  0.304192  0.539268  0.618814  NaN  NaN
$
We observe that columns d and e now have incorrect values. Two lessons here are:
- If we want to add new columns to a DataFrame, it is wrong to use .loc, because .loc is meant to select an existing slice of the DataFrame; when the slice does not exist yet, the result can be incorrect.
- The error may not be at the line where the SettingWithCopyWarning is issued.
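One likely explanation for the NaN values, sketched here with hand-made data rather than the random data above: the DataFrame returned by apply has columns labelled 0 and 1, and assignment through .loc aligns the right-hand side on column labels, so nothing matches d and e. Plain df[[...]] = ... assignment pairs the columns by position instead. The names demo, via_brackets, and via_loc are illustrative.

```python
import pandas as pd

def process_row(c1, c2):
    return c1 + c2, c1 - c2

# Illustrative data, not taken from the program above.
demo = pd.DataFrame({'c1': [1.0, 2.0], 'c2': [0.5, 4.0]},
                    index=['i0', 'i1'])

# The expanded result has integer column labels 0 and 1.
expanded = demo.apply(lambda row: process_row(row['c1'], row['c2']),
                      axis=1,
                      result_type='expand')

# Bracket assignment: 'd' takes column 0 and 'e' takes column 1, by position.
via_brackets = demo.copy()
via_brackets[['d', 'e']] = expanded

# .loc assignment: looks for columns labelled 'd' and 'e' in expanded,
# finds none, and fills the new columns with NaN.
via_loc = demo.copy()
via_loc.loc[:, ['d', 'e']] = expanded

print(via_brackets)
print(via_loc)
```

Both assignments here act on explicit copies, so neither should trigger a SettingWithCopyWarning; the difference in the d and e columns comes only from how the right-hand side is matched to the new columns.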
For this particular example, closer examination shows that the warning results from what is effectively a chained assignment:
    df.iloc[rows][['d', 'e']] = df.iloc[rows].apply(...)
because df2 is what get_subdf returns, namely df.iloc[rows]. The pandas designers are asking us: do we want to
change the original DataFrame df? Having understood this, we have two ways to fix it:
We can make a deep copy of the slice, so that it becomes a new DataFrame, i.e., as below:
    ...
    df2 = get_subdf(df, rows).copy()
    ...
    df2[['d', 'e']] = \
        df2.apply(lambda row: process_row(row['c1'], row['c2']),
                  axis=1,
                  result_type='expand')
    ...
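As a small check of what the copy buys us, the sketch below reuses the get_subdf helper on hand-made data (the values are illustrative, and a single new column d stands in for the two columns above). After copying, adding a column to df2 leaves the original df untouched, so there is no ambiguity about which object is being modified:

```python
import pandas as pd

def get_subdf(df, rows):
    return df.iloc[rows]

# Illustrative data, not taken from the program above.
df = pd.DataFrame({'c1': [1.0, 2.0, 3.0], 'c2': [0.5, 4.0, 1.0]},
                  index=['i0', 'i1', 'i2'])

# copy() defaults to a deep copy, so df2 owns its own data.
df2 = get_subdf(df, [1, 2]).copy()
df2['d'] = df2['c1'] + df2['c2']

print(list(df.columns))  # the original still has only c1 and c2
print(df2)
```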
Alternatively, if we never use the original DataFrame again, we can rename df2 to df,
which also gets rid of the warning: whether we want to change the original DataFrame df
is irrelevant, since we lose access to it when we do df = get_subdf(df, rows). Because of this, there is no
SettingWithCopyWarning any more. Just to emphasize this point, the complete program with this revision
(which also keeps the .copy() call from the first fix) is below:
$ cat test_copywarn.py
import numpy as np
import pandas as pd

def get_subdf(df, rows):
    return df.iloc[rows]

def process_row(c1, c2):
    return c1+c2, c1-c2

if __name__ == '__main__':
    columns = ['c{}'.format(i) for i in range(3)]
    indices = ['i{}'.format(i) for i in range(8)]
    df = pd.DataFrame(np.random.random((8, 3)),
                      columns=columns,
                      index=indices)
    print(df)
    rows = [i+2 for i in range(4)]
    df = get_subdf(df, rows).copy()
    print(df)
    df[['d', 'e']] = \
        df.apply(lambda row: process_row(row['c1'], row['c2']),
                 axis=1,
                 result_type='expand')
    print(df)
$ python test_copywarn.py
          c0        c1        c2
i0  0.588995  0.706887  0.684446
i1  0.142972  0.481663  0.318174
i2  0.669792  0.869648  0.439205
i3  0.663541  0.951182  0.062734
i4  0.084048  0.089704  0.264744
i5  0.952133  0.087036  0.796757
i6  0.180122  0.819766  0.949701
i7  0.761599  0.772481  0.559961
          c0        c1        c2
i2  0.669792  0.869648  0.439205
i3  0.663541  0.951182  0.062734
i4  0.084048  0.089704  0.264744
i5  0.952133  0.087036  0.796757
          c0        c1        c2         d         e
i2  0.669792  0.869648  0.439205  1.308853  0.430444
i3  0.663541  0.951182  0.062734  1.013916  0.888447
i4  0.084048  0.089704  0.264744  0.354449 -0.175040
i5  0.952133  0.087036  0.796757  0.883793 -0.709720
$
The program now runs without any warning, which is interesting and worth noting.