StataReader processes whole file before reading in chunks

This issue has been tracked since 2022-09-21.

I've noticed that when reading large Stata files with the chunksize parameter, the time it takes to create the StataReader object depends on the size of the file. This is a bit surprising: all of the metadata it needs is contained in the file header, so it seems like it should take the same time regardless of the total file size.

I took a look at the code, and the culprit seems to be this line, which reads the entire file into a BytesIO object before parsing the header. I'm not entirely sure what this accomplishes. Ideally, it would be possible to create the StataReader object after processing just the header portion of the file.

pandas/pandas/io/stata.py

Lines 1167 to 1175 in 71fc89c

with get_handle(
    path_or_buf,
    "rb",
    storage_options=storage_options,
    is_text=False,
    compression=compression,
) as handles:
    # Copy to BytesIO, and ensure no encoding
    self.path_or_buf = BytesIO(handles.handle.read())
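
To make the cost concrete, here is a minimal stdlib-only sketch, not pandas code. `CountingReader` and the 128-byte header size are made up for illustration; the only point is that wrapping `handle.read()` in `BytesIO` pulls every byte through before any header parsing can start.

```python
import io


class CountingReader(io.BytesIO):
    """Hypothetical helper: a BytesIO that counts the bytes handed out by read()."""

    def __init__(self, data: bytes) -> None:
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


HEADER_SIZE = 128  # assumed header length, for illustration only
payload = b"h" * HEADER_SIZE + b"d" * 1_000_000  # fake "header" plus "data"

# Eager copy, as in the snippet above: the work grows with the whole file.
eager = CountingReader(payload)
buffered = io.BytesIO(eager.read())

# Header-only read: the work is fixed by the header size.
lazy = CountingReader(payload)
header = lazy.read(HEADER_SIZE)

print(eager.bytes_read, lazy.bytes_read)
```

The first count scales with the payload, while the second stays at the header size no matter how much data follows.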

twoertwein wrote this answer on 2022-09-22

Is that a behavior change you have noticed since 1.5, or did it also exist in previous versions? I think these particular lines of code have been around since 1.3, and even before that (I think) the logic was similar.

I think the issue is that some IO-like objects are not seekable, but read_stata does a lot of seeking internally (some of the compression IO objects don't support seeking). We might be able to change the above line to read the file completely only if it isn't seekable.
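
Whether a stream can seek is something Python's `io` objects report through `seekable()`. A small stdlib sketch, using an OS pipe as a stand-in for a non-seekable source (the pipe is chosen purely for illustration and is not something the pandas code above uses):

```python
import io
import os

# An in-memory buffer supports seeking.
buf = io.BytesIO(b"stata bytes")
buf_seekable = buf.seekable()

# The read end of a pipe does not.
read_fd, write_fd = os.pipe()
os.write(write_fd, b"stata bytes")
os.close(write_fd)

with os.fdopen(read_fd, "rb") as pipe_reader:
    pipe_seekable = pipe_reader.seekable()
    try:
        pipe_reader.seek(0)
        seek_failed = False
    except OSError:  # io.UnsupportedOperation is a subclass of OSError
        seek_failed = True
    # Copying into BytesIO is what makes such a stream seekable again.
    copied = io.BytesIO(pipe_reader.read())

print(buf_seekable, pipe_seekable, seek_failed, copied.seekable())
```

This is the trade-off in the thread: the full copy is only needed for the second kind of stream.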

sterlinm wrote this answer on 2022-09-24

I don't think it changed in 1.5; I had already noticed it with 1.4. I didn't look back to see when it was introduced or if it has always been there.

I see the uses of seek in parsing the header but it seems like it should be possible to avoid that.

EDIT: Commented too soon. I think the suggestion to skip that when the file is seekable is simpler.

twoertwein wrote this answer on 2022-09-24

Feel free to open a PR!

I think the main change is

self.handles = get_handle(...)
if hasattr(self.handles.handle, "seekable") and self.handles.handle.seekable():
    self.path_or_buf = self.handles.handle
else:
    with self.handles:
        self.path_or_buf = BytesIO(self.handles.handle.read())

# and then appropriate code to close self.handles (and self.path_or_buf in case of BytesIO)
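
One way the cleanup mentioned in that comment could look, as a stdlib-only sketch rather than the eventual pandas change. `SeekableSource` and `_owns_copy` are hypothetical names, and the sketch assumes it owns the handle it is given; real code would go through `get_handle` and keep its handles object instead of a bare file object.

```python
import io
from typing import BinaryIO


class SeekableSource:
    """Hypothetical wrapper: buffer the input only when it cannot seek."""

    def __init__(self, handle: BinaryIO) -> None:
        self._handle = handle
        if handle.seekable():
            # Use the stream directly; nothing is read yet.
            self.path_or_buf = handle
            self._owns_copy = False
        else:
            # Fall back to an in-memory copy, then release the source.
            self.path_or_buf = io.BytesIO(handle.read())
            self._owns_copy = True
            handle.close()

    def close(self) -> None:
        # Close the copy if one was made, and the original handle either way
        # (assumption: this object owns the handle).
        if self._owns_copy:
            self.path_or_buf.close()
        if not self._handle.closed:
            self._handle.close()

    def __enter__(self) -> "SeekableSource":
        return self

    def __exit__(self, *exc_info) -> None:
        self.close()
```

Used as `with SeekableSource(open(path, "rb")) as src:`, a regular file is passed through untouched, and only a non-seekable stream pays for the full read.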

sterlinm wrote this answer on 2022-09-24

> Feel free to open a PR!

I'll give it a shot over the weekend. Thanks!

akx wrote this answer on 2022-10-03

By the looks of it, 2f0ada3 was the commit that changed this behavior, way back when.

(Came here via answering https://stackoverflow.com/a/73934594/51685 :-) )

