Numpy: Fromfile For Gzipped File

I am using numpy.fromfile to construct an array which I can pass to the pandas.DataFrame constructor:

import numpy as np
import pandas as pd

def read_best_file(file, **kwargs):

Solution 1:

gzip.open() doesn't return a true file object. It's a duck-typed one: it walks like a duck and sounds like a duck, but as far as numpy is concerned it isn't quite a duck. numpy is being strict here; since much of it is written in lower-level C code, it might require an actual file descriptor.

You can get the underlying file from the gzip.open() call, but that's just going to get you the compressed stream.
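To make the point about the compressed stream concrete, here is a minimal sketch. The file name and sample values are made up for the demo, and a plain open() on the same path stands in for the underlying file object. It shows that the raw file starts with the gzip magic bytes, while reading through gzip gives back the original payload.

```python
import gzip
import os
import tempfile

import numpy as np

# Hypothetical sample data, written to a temporary .gz file for the demo.
values = np.arange(4, dtype=np.int16)
path = os.path.join(tempfile.mkdtemp(), "sample.dat.gz")
with gzip.open(path, "wb") as f:
    f.write(values.tobytes())

# Reading the file without gzip yields the compressed stream, which
# starts with the two gzip magic bytes 0x1f 0x8b.
with open(path, "rb") as raw:
    compressed = raw.read()

# Reading through gzip yields the original, uncompressed payload.
with gzip.open(path, "rb") as f:
    payload = f.read()

starts_with_magic = compressed[:2] == b"\x1f\x8b"
payload_matches = payload == values.tobytes()
print(starts_with_magic, payload_matches)
```

Anything that reads from the raw descriptor sees the first kind of bytes, which is why handing numpy the underlying file does not help.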

What I would do is use subprocess.Popen() to invoke zcat and uncompress the file as a stream:

>>> import subprocess
>>> p = subprocess.Popen(["/usr/bin/zcat", "foo.txt.gz"], stdout=subprocess.PIPE)
>>> type(p.stdout)
<type 'file'>
>>> p.stdout.read()
'hello world\n'

Now you can pass p.stdout as a file object to numpy:

np.fromfile(p.stdout, ...)
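The session above is from Python 2. On Python 3 the pipe is a buffered reader rather than the Python 2 file type, and np.fromfile may not accept a non-seekable pipe. A rough sketch of the same idea that sidesteps this is to collect the decompressed bytes and hand them to np.frombuffer. The helper name is hypothetical, and the default command assumes a gzip binary on PATH (gzip -dc writes to stdout like zcat).

```python
import gzip
import os
import shutil
import subprocess
import tempfile

import numpy as np


def fromfile_via_subprocess(path, dtype, cmd=("gzip", "-dc")):
    """Decompress `path` with an external tool and build an array from its stdout.

    `cmd` is an assumption: `gzip -dc` writes the decompressed data to
    stdout, like zcat. The helper name is hypothetical.
    """
    result = subprocess.run(list(cmd) + [path], stdout=subprocess.PIPE, check=True)
    # frombuffer only needs bytes, so the pipe never has to be seekable.
    return np.frombuffer(result.stdout, dtype=dtype)


# Usage with made-up sample data; skipped when gzip is not installed.
expected = np.arange(6, dtype=np.int16)
sample_path = os.path.join(tempfile.mkdtemp(), "sample.dat.gz")
with gzip.open(sample_path, "wb") as f:
    f.write(expected.tobytes())

if shutil.which("gzip"):
    arr = fromfile_via_subprocess(sample_path, np.int16)
    print(arr)
```

This reads the whole decompressed output into memory, so it suits files that fit comfortably in RAM.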

Solution 2:

I have had success reading arrays of raw binary data from gzipped files by feeding the read() results through numpy.frombuffer(). This code works in Python 3.7.3, and perhaps in earlier versions also.

# Example: read short integers (signed) from gzipped raw binary file
import gzip
import numpy as np

fname_gzipped = 'my_binary_data.dat.gz'
raw_dtype = np.int16
with gzip.open(fname_gzipped, 'rb') as f:
    from_gzipped = np.frombuffer(f.read(), dtype=raw_dtype)

# Demonstrate equivalence with direct np.fromfile()
fname_raw = 'my_binary_data.dat'
from_raw = np.fromfile(fname_raw, dtype=raw_dtype)

# True
print('raw binary and gunzipped are the same: {}'.format(
    np.array_equiv(from_gzipped, from_raw)))

# False
wrong_dtype = np.uint8
binary_as_wrong_dtype = np.fromfile(fname_raw, dtype=wrong_dtype)
print('wrong dtype and gunzipped are the same: {}'.format(
    np.array_equiv(from_gzipped, binary_as_wrong_dtype)))
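One thing to watch with this approach: an array built by np.frombuffer from a bytes object is read-only. If you want something closer to np.fromfile, a small wrapper can copy the result and pass through count and offset. This is only a sketch; the helper name fromfile_gz, the file name and the 4-byte header are invented for the example.

```python
import gzip
import os
import tempfile

import numpy as np


def fromfile_gz(path, dtype, count=-1, offset=0):
    """Hypothetical helper: like np.fromfile for binary data, but for a .gz file.

    `offset` is in bytes of the decompressed data; `count` is in items.
    """
    with gzip.open(path, "rb") as f:
        data = f.read()
    # frombuffer over bytes gives a read-only view; copy() makes it writable.
    return np.frombuffer(data, dtype=dtype, count=count, offset=offset).copy()


# Made-up sample: a 4-byte header followed by int16 values.
values = np.array([10, -20, 30, -40], dtype=np.int16)
gz_path = os.path.join(tempfile.mkdtemp(), "with_header.dat.gz")
with gzip.open(gz_path, "wb") as f:
    f.write(b"HDR0" + values.tobytes())

arr = fromfile_gz(gz_path, np.int16, count=3, offset=4)
arr[0] = 11  # allowed because of the copy
print(arr)
```

The copy costs one extra pass over the data; drop it if a read-only array is enough.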
