How Can I "sparsify" On Two Values?
Solution 1:
Since you tagged this with scipy
, I'll show you what a scipy.sparse
matrix is like:
In [31]: n=100In [32]: arr=np.array([[0]*n+[1]*n],int)
In [33]: M=sparse.csr_matrix(arr)
In [34]: M.data
Out[34]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
In [35]: M.indices
Out[35]:
array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151,
152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177,
178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190,
191, 192, 193, 194, 195, 196, 197, 198, 199], dtype=int32)
In [36]: M.indptr
Out[36]: array([ 0, 100], dtype=int32)
It has replaced the n
elements of arr
with 2 arrays each with n/2
elements. Even if I replace the int
with uint8
, the M.indices
array will still be int32
.
The fact that your pandas
version has half the memory usage,suggests that it is just storing the indices, and some how noting that the data
part is all 1s. But that's just a guess.
How much greater sparification do you expect?
====================
http://pandas.pydata.org/pandas-docs/stable/sparse.html
This example looks like pandas is implementing some sort of 'run' compression:
In [4]: sts
Out[4]:00.4691121-0.2828632NaN3NaN4NaN5NaN6NaN7NaN8-0.8618499-2.104569
dtype: float64
BlockIndex
Block locations: array([0,8], dtype=int32)
Block lengths: array([2,2], dtype=int32)
It has identified 2 blocks, of length 2 each. It still has to store the 4 nonfill values in some array.
A csr sparse equivalent (for a row array):
In [1052]: arr=np.random.rand(10)
In [1053]: arr[2:-2]=0In [1055]: M=sparse.csr_matrix(arr)
In [1056]: M
Out[1056]:
<1x10 sparse matrix of type '<class 'numpy.float64'>'with4 stored elements in Compressed Sparse Row format>In [1057]: M.data
Out[1057]: array([ 0.37875012, 0.73703368, 0.7935645 , 0.22948213])
In [1058]: M.indices
Out[1058]: array([0, 1, 8, 9], dtype=int32)
In [1059]: M.indptr
Out[1059]: array([0, 4], dtype=int32)
The pandas version might be more compact if the fill values occur in blocks. But I suspect
01.011.02NaN3NaN4NaN5NaN6NaN7NaN81.091.0
would produce the same blocks. I don't see evidence of it's trying to identify the identical 1.0
values, and storing those as a value plus a count.
================
Based on @MaxU
answer your ds stores 1000 1's
, and two single element arrays that tell it where those values are stored.
In [56]: sp.memory_usage()
Out[56]: 1080
In [57]: sp.sp_index
Out[57]:
BlockIndex
Block locations: array([1000])
Block lengths: array([1000])
As long the nonfill values occur in big runs, the block
arrays will be small. But if you scattered those 1000 values through out the series, you'd multiply the number of blocks substantially
block locations: array([1,3,6,10,...])
block lengths: array([1,1,1,2,1,...])
I can imagine mapping between the csr
layout and the pandas blocks, but haven't worked out the details. The csr
layout is meant to work with 2d arrays, with a clear concept of rows and columns. Looks like a sparse dataframe just contains sparse series objects.
===================
https://stackoverflow.com/a/38157234/901925 shows how to map from sparse dataframe values to a scipy sparse matrix. For each column (data series) it uses sp_values
,fill_value
,sp_index
.
pandas/pandas/sparse/scipy_sparse.py
has the code for interaction between scipy sparse and data series.
===================
kind='integer' produces sparse structure more like
scipy.sparse`:
In [62]: n=5; s=pd.Series([0]*5+[1]*5, dtype=int)
In [63]: ss=s.to_sparse(fill_value=0, kind='integer')
In [64]: ss
Out[64]:
00102030405161718191
dtype: int32
IntIndex
Indices: array([5, 6, 7, 8, 9])
contrast that with the default block
:
dtype:int32BlockIndexBlock locations:array([5])Block lengths:array([5])
And equivalent column sparse matrix can be built with:
In [89]: data=ss.values
In [90]: data=ss.sp_values
In [91]: rows=ss.sp_index.indices
In [92]: cols=np.zeros_like(rows)
In [93]: sparse.csr_matrix((data,(rows,cols)))
Out[93]:
<10x1 sparse matrix of type '<class 'numpy.int32'>'with5 stored elements in Compressed Sparse Row format>
There is a to_coo
method, but it only works with the more complex pd.MultiIndex
object (why?).
Solution 2:
Pandas documentation says:
Currently,
float64
,int64
andbool
dtypes are supported.
so let's try to convert your series to bool
value:
In [53]: s.memory_usage()
Out[53]: 8080
In [54]: s.to_sparse().memory_usage()
Out[54]: 4080
In [55]: sp = s.astype(bool).to_sparse()
In [56]: sp.memory_usage()
Out[56]: 1080
In [57]: sp.sp_index
Out[57]:
BlockIndex
Block locations: array([1000])
Block lengths: array([1000])
Post a Comment for "How Can I "sparsify" On Two Values?"