Numpy: Fromfile For Gzipped File
I am using numpy.fromfile to construct an array which I can pass to the pandas.DataFrame constructor:
import numpy as np
import pandas as pd
def read_best_file(file, **kwargs):
Solution 1:
gzip.open() doesn't return a true file object. It returns a duck-typed one: it walks like a duck and sounds like a duck, but isn't quite a duck as far as numpy is concerned. numpy is being strict here (since much of it is written in lower-level C code, it might require an actual file descriptor).
You can get the underlying file from the gzip.open() call, but that's just going to get you the compressed stream.
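To make the difference concrete, here is a minimal sketch that writes a small gzipped file and compares what the plain file gives you with what gzip.open() gives you. The helper name compare_raw_and_decompressed and the temporary file are only for illustration.

```python
import gzip
import os
import tempfile


def compare_raw_and_decompressed(payload=b"hello world\n"):
    """Write `payload` to a temporary .gz file and read it back two ways."""
    fd, path = tempfile.mkstemp(suffix=".gz")
    os.close(fd)
    try:
        with gzip.open(path, "wb") as f:
            f.write(payload)
        # Plain open(): these are the compressed bytes stored on disk.
        with open(path, "rb") as raw:
            compressed = raw.read()
        # gzip.open(): a GzipFile wrapper that decompresses on read().
        with gzip.open(path, "rb") as gz:
            is_gzipfile = isinstance(gz, gzip.GzipFile)
            decompressed = gz.read()
    finally:
        os.remove(path)
    return is_gzipfile, compressed, decompressed


is_gzipfile, compressed, decompressed = compare_raw_and_decompressed()
print(is_gzipfile)
print(compressed[:2])
print(decompressed)
```

The first two bytes of the plain file are the gzip magic number, which is why handing the underlying file to numpy would parse compressed data rather than your values.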
This is what I would do: I would use subprocess.Popen() to invoke zcat to uncompress the file as a stream.
>>> import subprocess
>>> p = subprocess.Popen(["/usr/bin/zcat", "foo.txt.gz"], stdout=subprocess.PIPE)
>>> type(p.stdout)
<type 'file'>
>>> p.stdout.read()
'hello world\n'
Now you can pass p.stdout as a file object to numpy:
np.fromfile(p.stdout, ...)
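The session above is from Python 2, where p.stdout is a real file. On Python 3 the pipe is not seekable, and np.fromfile may reject it depending on the NumPy version. A hedged variant of the same idea is to let the external process decompress and then parse the collected bytes. In this sketch the helper name read_gzipped_via_subprocess is hypothetical, and "gzip -dc" is swapped in for /usr/bin/zcat on the assumption that a gzip executable is on the PATH.

```python
import subprocess

import numpy as np


def read_gzipped_via_subprocess(path, dtype):
    """Decompress `path` with an external gzip process and parse the bytes."""
    # "gzip -dc" writes the decompressed stream to stdout, much like zcat.
    result = subprocess.run(
        ["gzip", "-dc", path],
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if gzip exits with an error
    )
    # frombuffer only needs a bytes object, so no seekable file is required.
    return np.frombuffer(result.stdout, dtype=dtype)
```

This holds the whole decompressed file in memory at once, so it suits files that fit comfortably in RAM.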
Solution 2:
I have had success reading arrays of raw binary data from gzipped files by feeding the read() results through numpy.frombuffer(). This code works in Python 3.7.3, and perhaps in earlier versions also.
# Example: read short integers (signed) from gzipped raw binary file
import gzip
import numpy as np
fname_gzipped = 'my_binary_data.dat.gz'
raw_dtype = np.int16
with gzip.open(fname_gzipped, 'rb') as f:
    from_gzipped = np.frombuffer(f.read(), dtype=raw_dtype)
# Demonstrate equivalence with direct np.fromfile()
fname_raw = 'my_binary_data.dat'
from_raw = np.fromfile(fname_raw, dtype=raw_dtype)
# True
print('raw binary and gunzipped are the same: {}'.format(
    np.array_equiv(from_gzipped, from_raw)))
# False
wrong_dtype = np.uint8
binary_as_wrong_dtype = np.fromfile(fname_raw, dtype=wrong_dtype)
print('wrong dtype and gunzipped are the same: {}'.format(
    np.array_equiv(from_gzipped, binary_as_wrong_dtype)))
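If you want something closer to a drop-in for np.fromfile, the same approach can be wrapped in a small helper. This is only a sketch: the name fromfile_gz and its count parameter are my own, loosely modelled on np.fromfile's count argument, and it assumes the file holds raw binary values of a single dtype.

```python
import gzip

import numpy as np


def fromfile_gz(path, dtype, count=-1):
    """Read `count` items of `dtype` from a gzipped raw binary file.

    A negative `count` reads the whole file.
    """
    dtype = np.dtype(dtype)
    with gzip.open(path, "rb") as f:
        if count < 0:
            data = f.read()
        else:
            # Read only as many decompressed bytes as `count` items need.
            data = f.read(count * dtype.itemsize)
    # Raises ValueError if len(data) is not a multiple of dtype.itemsize.
    return np.frombuffer(data, dtype=dtype)
```

One thing to keep in mind: an array created by np.frombuffer from a bytes object is read-only, so call .copy() on the result if you need to modify it in place.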