BUG: Wrong index constructor if new DF/Series is created from DF/Series

This issue has been tracked since 2022-09-21.

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from pandas import DataFrame as pDF, Series as pSe

# INIT
# ────────────────────────────────────────────────────────────
# DATA
x = pDF({'A':[1,2,3,4],'B':[5,6,7,8]}, index=[1,0,1,0])

# NEW VAR TO BE ADDED IN SEVERAL WAYS
N = [1,2,3,4]


# (1) Works as expected → New variable from a List
# ────────────────────────────────────────────────────────────
x['N'] = N

# (1) RESULT → OK
	A	B	N
1	1	5	1
0	2	6	2
1	3	7	3
0	4	8	4


# (2) Works as expected → New variable from pSe/PDF
# ────────────────────────────────────────────────────────────
x['N'] = pDF(N, index=x.index)
# OR 
x['N'] = pSe(N, index=x.index)
# Produces the same correct result

# (2) RESULT → OK
	A	B	N
1	1	5	1
0	2	6	2
1	3	7	3
0	4	8	4

# (3) DOESN'T Work as expected → New variable from ► INNER pSe/PDF ◄
# ────────────────────────────────────────────────────────────
x['N'] = pDF(pSe(N), index=x.index)
# OR 
x['N'] = pSe(pSe(N), index=x.index)
# Produces WRONG Result as index mismatch

# (3) RESULT → WRONG → INDEX MISMATCH
	A	B	N
1	1	5	2
0	2	6	1
1	3	7	2
0	4	8	1

# (4) WRONG behavior can be overcome by using set_index as follows:
x['N'] = pDF(pSe(N)).set_index(x.index)
# Which produces correct result

Issue Description

Working with variables and creating new ones is a crucial part of pandas.
So I assume that constructing a new variable should give consistent results regardless of whether index= or the set_index() method is used.

The example above shows that the issue is inside the constructor when an inner DataFrame or Series is used and the targeted DataFrame has a non-unique index → in that situation the index= param of the constructor doesn't work as expected and is inconsistent with the rest of the approaches.

Expected Behavior

Consistent behavior of the constructor for index= and set_index regardless of the input object passed to DataFrame/Series.

Installed Versions

INSTALLED VERSIONS

commit : e8093ba
python : 3.9.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19043
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 1.4.3
numpy : 1.20.3
pytz : 2021.3
dateutil : 2.8.2
setuptools : 58.0.4
pip : 21.2.4
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : 4.2.0
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.29.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.2
brotli :
fastparquet : None
fsspec : 2021.10.1
gcsfs : None
markupsafe : 1.1.1
matplotlib : 3.5.2
numba : 0.54.1
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : 1.4.22
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : None

rhshadrach wrote this answer on 2022-09-24

Thanks for the report! I believe what you're highlighting is intended behavior. When data is a DataFrame or Series and the index argument is specified, pandas will align the data with the provided index. I thought this was well documented in the Series / DataFrame constructor, but I'm not seeing that now. We may have another issue about documenting this.
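
A minimal sketch of the alignment described here, using made-up labels and values that are not taken from the report: when the data is a Series, the index argument selects values by label, whereas a plain list is simply labelled in order.

```python
import pandas as pd

# Illustrative data only; labels and values are assumptions for this sketch.
data = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Series as data: index= looks each label up in `data` (alignment).
aligned = pd.Series(data, index=["c", "a", "z"])

# List as data: index= labels the values positionally.
positional = pd.Series([10, 20, 30], index=["c", "a", "z"])

print(aligned)     # "c" and "a" pick up 30 and 10; "z" has no match and becomes NaN
print(positional)  # 10, 20, 30 in the order given
```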

MichalRIcar wrote this answer on 2022-09-24

Thank you for the reply.
From MPOV the issue is that the behavior is dangerously inconsistent:
Adding a new variable to a DataFrame from another DF/Series which has a different index from the targeted DF:
► x['N'] = pSe(pSe(N), index=x.index)
► then the constructor behaves like pd.concat and silently ignores the specified index=x.index ◄ as the example above shows.

To wrap it up:
x['N'] = pSe(pSe(N), index=x.index) == pd.concat([x, pSe(N)], 1)
► silently ignoring key factor index=x.index

however

other approaches like
x['N'] = pSe(pSe(N)).set_index(x.index)
x['N'] = pSe(N, index=x.index)
x['N'] = list(N)
etc. produce the correct, expected results and do not behave like pd.concat.

The problem from my side is that only one of the 4 approaches above silently ignores the specified index and acts like pd.concat.

Thus, to conclude, shouldn't the result table be just the same regardless of using pDF(index=x.index) or .set_index(x.index)?

rhshadrach wrote this answer on 2022-09-25

x['N'] = pSe(pSe(N), index=x.index)

This combines two different operations:

  1. the creation of a Series by providing a Series for data and specifying index; and
  2. the assignment of a Series to a DataFrame

We can only determine the behavior of each one individually, as this logically determines what happens when you perform them sequentially. Can you specify which of these two you think behaves improperly, and why? Include both if you believe they are both incorrect, but do so individually.
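
As a rough sketch of that decomposition, the two operations can be run one at a time on the data from the reproducible example. The variable names inner and step1 are introduced here only for illustration.

```python
import pandas as pd

x = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]}, index=[1, 0, 1, 0])
N = [1, 2, 3, 4]

# The inner Series gets the default index 0, 1, 2, 3.
inner = pd.Series(N)

# Operation 1: construct a Series from a Series while specifying index.
# Each label in x.index is looked up in `inner`, so label 1 -> 2 and label 0 -> 1.
step1 = pd.Series(inner, index=x.index)
print(step1.tolist())

# Operation 2: assign the resulting Series to the DataFrame.
# step1 already carries x's index, so the values are stored as they are.
x["N"] = step1
print(x)
```

Looking at step1 on its own shows that the values are already reordered before the assignment takes place.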

MichalRIcar wrote this answer on 2022-09-26

Hello,

thank you, a good idea to reduce it to the elements.

The main point is that the pandas constructor of a new DataFrame or Series, when given another pandas object (DF/Series) as input, works like pd.concat, as the result shows:

image

The problem is that the second line is just the same with a slight difference - instead of using index=x.index inside pDF(), we specify the index outside via set_index(x.index).

Thus, the new DataFrame constructor with an inner index specification works like pd.concat, not like a new dataframe... if it's intentional, then I believe it should be highlighted in the documentation, as the behavior is not compatible with the other approaches or with pandas behavior in general.

rhshadrach wrote this answer on 2022-09-26

Thanks for elaborating here. Constructing a DataFrame from a Series while specifying an index will align the Series to the provided index:

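import pandas as pd

# The Series is labelled 2, 1 (in that order), matching the printed output.
ser = pd.Series([3, 4], index=[2, 1])
print(ser)
# 2    3
# 1    4
# dtype: int64

# The constructor aligns ser to the requested labels; label 3 has no match.
df = pd.DataFrame(ser, index=[1, 2, 3])
print(df)
#      0
# 1  4.0
# 2  3.0
# 3  NaN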

Using .set_index on the other hand replaces the existing index without modifying the values. Is it these two behaviors which you view as being incompatible?
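
The contrast can be sketched side by side, under the assumption that relabelling is what the reporter wants. Series has no set_index method, so the sketch uses Series.set_axis for the Series case and DataFrame.set_index (with an Index object as the key) for the DataFrame case. The variable names are illustrative.

```python
import pandas as pd

ser = pd.Series([3, 4], index=[2, 1])

# Alignment: values follow their labels, and the unmatched label 3 becomes NaN.
aligned = pd.DataFrame(ser, index=[1, 2, 3])

# Relabelling a Series: values stay in place, only the labels change.
relabelled = ser.set_axis([1, 2])

# Relabelling a DataFrame: set_index with an Index object replaces the index.
frame = ser.to_frame().set_index(pd.Index([1, 2]))

print(aligned)
print(relabelled)
print(frame)
```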

In any case, the previous issue I referenced above was #42818. This was closed by adding the line

If a dict contains Series which have an index defined, it is aligned by its index.

which is good, but I think there could still be improvement here. Namely, mentioning alignment with Series or DataFrame inputs (and not just those in a dictionary).
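
For reference, the dictionary case that the quoted line documents can be illustrated as follows. The column names and values are made up for this sketch.

```python
import pandas as pd

# Two Series with partially overlapping labels (illustrative data).
a = pd.Series([1, 2], index=[0, 1])
b = pd.Series([3, 4], index=[1, 2])

# Each Series is aligned by its own index; labels missing from a Series become NaN.
df = pd.DataFrame({"a": a, "b": b})
print(df)
```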

MichalRIcar wrote this answer on 2022-09-27

Thank you, #42818 is discussing this. I also believe raising a warning would be appropriate, as the behavior is not consistent.
Simply put, creating a new dataframe is by definition a command to create a new dataframe, not to merge it into another dataframe.
Not trying to convince the pandas team to think about this twice :) but I hope you can see my point that an inherent inconsistency is present in the result.

To sum it up
→ when creating a new dataframe, the user should obtain the same output regardless of the input format.
→ when the user wants to join dataframes, pd.concat/pd.merge has to be called accordingly.

Thank you once again for the discussion, I will put a big red alert in our internal documentation, warning to never use the inner index and to always use set_index.

Best,
Michal

rhshadrach wrote this answer on 2022-09-28

Reopening - I think the docs could be improved here as mentioned above.
