ENH: Safety net for operations not compatible with a given dtype.

This issue has been tracked since 2022-09-20.

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I wish I could have a safety net preventing me from performing incorrect operations caused by dtype issues/incompatibilities.

Example:

In [1]: import pandas as pd
In [2]: s = pd.Series([1,3,5,129]).astype("uint8")
   ...: s + s

Out[2]: 
0     2
1     6
2    10
3     2
dtype: uint8

The expected value of 129 + 129 would be 258; however, the value is 2.
This is because the maximum value of a "uint8" is 255, and 255 + 1 wraps back around to zero.
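
To make the wraparound explicit, the result is plain modular arithmetic (a small illustration, not pandas output):

(129 + 129) % 256   # 2 -- uint8 keeps only the low 8 bits of 258

A workaround today is to upcast before operating, e.g. s.astype("uint16") + s.astype("uint16").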

Another kind of misleading behaviour occurs if one uses astype carelessly:

In [1]: import pandas as pd
In [2]: s = pd.Series([1,-1,10**5]).astype("int8")
In [3]: s

Out[3]: 
0     1
1    -1
2   -96
dtype: int8
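
The same modular behaviour explains this result (a small illustration):

10**5 % 256   # 160 -- the low byte of 100000
160 - 256     # -96 -- the same bits reinterpreted as a signed int8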

Feature Description

A solution could be to make sure that an operation is compatible with the columns' dtypes.
Below is an example of what I mean for the + operation.

    @unpack_zerodim_and_defer("__add__")
    def __add__(self, other):
        check_dtype_operation_add(self, other)       # The check function checking that an operation is ok
        return self._arith_method(other, operator.add)

A mock of the function, checking only an operation between a uint8 Series and a number:

import logging
import numbers

def check_dtype_operation_add(self, other):
    if self.dtype == "uint8" and isinstance(other, numbers.Number):
        logging.warning("you are performing an addition between a uint8 and a number, "
                        "this could result in an overflow if the operation leads to a number greater than 255"
        )
        if self.max() + other > 255:
            raise Exception(f"adding {self.dtype} and a number would lead to an overflow, please upcast your series dtype")

The final behaviour would be the following:

  • print a warning (this should be kept for small dtypes: int8, uint8, Int8, etc.)
  • raise an exception if the operation leads to an overflow

In [1]: import pandas as pd

In [2]: s = pd.Series([1,2,200]).astype("uint8")

In [3]: s+200
WARNING:root:you are performing an addition between a uint8 and a number, this could result in an overflow if the operation leads to a number greater than 255
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In [3], line 1
----> 1 s+200

File /workspaces/pandas/pandas/core/ops/common.py:73, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     69             return NotImplemented
     71 other = item_from_zerodim(other)
---> 73 return method(self, other)

File /workspaces/pandas/pandas/core/arraylike.py:121, in OpsMixin.__add__(self, other)
    119 @unpack_zerodim_and_defer("__add__")
    120 def __add__(self, other):
--> 121     check_dtype_operation_add(self, other)
    122     return self._arith_method(other, operator.add)

File /workspaces/pandas/pandas/core/arraylike.py:43, in check_dtype_operation_add(self, other)
     39 logging.warning("you are performing an addition between a uint8 and a number, "
     40                 "this could result in an overflow if the operation leads to a number greater than 255"
     41 )
     42 if self.max() + other > 255:
---> 43     raise Exception(f"adding {self.dtype} and a number would lead to an overflow, please upcast your series dtype")
     44 else:
     45     pass

Exception: adding uint8 and a number would lead to an overflow, please upcast your series dtype

Concerning .astype, we would just have to make sure, in the case of downcasting, that s.max() and s.min() are compatible with the lower dtype.
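
A minimal sketch of such a check (the helper name is hypothetical, not pandas API):

import numpy as np
import pandas as pd

def safe_integer_downcast(s, dtype):
    # Hypothetical helper: refuse a downcast that cannot represent the values
    info = np.iinfo(dtype)
    if s.min() < info.min or s.max() > info.max:
        raise OverflowError(f"values in [{s.min()}, {s.max()}] do not fit in {dtype}")
    return s.astype(dtype)

With the earlier example, safe_integer_downcast(pd.Series([1, -1, 10**5]), "int8") would raise instead of silently producing -96.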

I guess that would slightly impact performance; an option to deactivate this behaviour (e.g. pd.options.dtype.dtype_checks = False) could solve the issue for users who want maximum performance.

Alternative Solutions

Another solution would be to implicitly upcast the dtype when needed. I would actually enjoy that solution; it could possibly be implemented as an option pd.options.dtype.automatic_casting = True

In [1]: import pandas as pd
In [2]: s = pd.Series([1,3,5,129]).astype("uint8")
   ...: s + s

Out[2]: 
0     2
1     6
2    10
3     258
dtype: uint16
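
A sketch of how that upcast could be computed for integer addition (the function name is hypothetical; the idea is to compute in a wide dtype, then keep the smallest integer dtype that can represent the result):

import numpy as np
import pandas as pd

def add_with_auto_upcast(a, b):
    # Hypothetical sketch: compute in int64, then shrink to the smallest
    # integer dtype that can hold both the min and the max of the result
    wide = a.astype("int64") + b.astype("int64")
    target = np.promote_types(np.min_scalar_type(int(wide.max())),
                              np.min_scalar_type(int(wide.min())))
    return wide.astype(target)

add_with_auto_upcast(s, s)   # dtype: uint16, with 258 preserved
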
Similarly, with such an option, astype could refuse a lossy cast and keep a dtype that preserves the values:

In [1]: import pandas as pd
In [2]: s = pd.Series([1,3,5,129, 3.2]).astype("uint8")
In [3]: s

Out[3]: 
0      1.0
1      3.0
2      5.0
3    129.0
4      3.2
dtype: float64

One could also imagine an option that saves the maximum amount of memory, pd.options.dtype.automatic_casting = "aggressive"

In [1]: import pandas as pd
In [2]: s = pd.Series([1,3,5,129, 3.2]).astype("uint8")
In [3]: s

Out[3]: 
0      1.000000
1      3.000000
2      5.000000
3    129.000000
4      3.199219
dtype: float16
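
The 3.199219 above comes from float16's limited precision (roughly 11 significand bits); converting to a Python float shows the exact stored value:

import numpy as np

float(np.float16(3.2))   # 3.19921875, the nearest float16 to 3.2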

Additional Context

My experience & why I think something should be done

While I have known about this pandas behavior since 2014, I remember being fairly surprised the first time it happened to me.
I was working with a 2000-column dataframe in a codebase of more than 2000 lines; my downstream dataframe looked strange, and I spent a great amount of time finding the issue.

I am certain that users wanting to save a little memory with smaller dtypes have faced (silently or not) the same issue, and I am pretty sure that production code is running with bugs due to this. Moreover, the documentation lacks warnings about it.

Documentation concerning the subject:

I looked in several places for a warning about this pandas behaviour, and I did not find anything.

API Reference:

It is the same for the Series part of the API reference: I did not find any warning there either.

User guide

The basics dtypes section is fairly complete, but apart from an example that explicitly shows the issue, there is no warning in the whole guide about possible issues with operations or downcasting that would lead to unintended behaviour.

Function docstrings

I was not able to find any warning in the function docstrings about this issue.

phofl wrote this answer on 2022-09-20

Hi, thanks for your report.

pandas does not implement these arithmetic methods itself. We fall back to numpy, from which this behavior is inherited:

import numpy as np

na = np.array([129], dtype="uint8")

na + na

returns

[2]
adrienpacifico wrote this answer on 2022-09-20

Hi, thanks for your answer.
As pandas is higher level than numpy, I think it would make sense to protect (or warn) users against operations that would lead to unexpected behavior.

I hope that I showed that pandas could have these protections with quite small modifications to the code base. This behavior being inherited from numpy does not mean that pandas cannot deal with it, right?

@phofl should I consider that the pandas library is not open to such contributions and will never implement protections against these behaviours?

In any case, would documenting the issue more thoroughly, with warnings in the user guide, API reference, and function docstrings, be an improvement to the current documentation?

mroeschke wrote this answer on 2022-09-21

Improving the docstrings would probably be the better course of action here. Generally, pandas tries to align with numpy semantics unless numpy lacks behaviors that pandas requires. Deviating from numpy semantics would be too difficult to maintain if numpy decided to change its behaviors.

Side note: the new pyarrow dtypes in 1.5 will raise overflow errors for addition:

In [2]: ser = pd.Series([129], dtype="uint8[pyarrow]")

In [3]: ser
Out[3]:
0   129
dtype: uint8[pyarrow]

In [4]: ser + ser
ArrowInvalid: overflow