Pandas: Need A Speedier Way Of Index Slicing

Anyone care to take a stab at speeding up this dataframe index slicing scheme? I'm trying to slice and dice some huge dataframes, so every bit counts. I need to somehow find a fast

Solution 1:

You can use a dictionary comprehension together with loc to do the dataframe indexing:

finDict = {pair: df.loc[pd.IndexSlice[:, pair[0], pair[1]], :] 
           for pair in pd.unique(initFrame[['bar1', 'bar4']].values).tolist()}

>>> finDict
{(5, 1):                     bar1 bar2  bar3  bar4
 ifoo1  ifoo2 ifoo3                       
 LABEL1 515    a    11116    b    222,
 (6, 2):                     bar1 bar2  bar3  bar4
 ifoo1  ifoo2 ifoo3                       
 LABEL2 625    c    331,
 (6, 3):                     bar1 bar2  bar3  bar4
 ifoo1  ifoo2 ifoo3                       
 LABEL2 636    d    443}
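For anyone who wants to try the loc plus pd.IndexSlice pattern in isolation, here is a minimal sketch. The frame below is made up for illustration (the names frame, pairs and sliced are mine, not from the question), and it assumes the two inner index levels mirror bar1 and bar4, as the output above suggests.

```python
import pandas as pd

# Hypothetical data: a three-level index (ifoo1, ifoo2, ifoo3) whose two
# inner levels mirror the columns bar1 and bar4. This layout is an assumption.
index = pd.MultiIndex.from_tuples(
    [("LABEL1", 5, 1), ("LABEL2", 5, 1), ("LABEL2", 6, 2), ("LABEL3", 6, 3)],
    names=["ifoo1", "ifoo2", "ifoo3"],
)
frame = pd.DataFrame(
    {"bar1": [5, 5, 6, 6], "bar2": list("abcd"), "bar4": [1, 1, 2, 3]},
    index=index,
).sort_index()  # label slicing on a MultiIndex needs a sorted index

# Unique (bar1, bar4) pairs as plain tuples, so they can serve as dict keys.
pairs = list(
    frame[["bar1", "bar4"]].drop_duplicates().itertuples(index=False, name=None)
)

# For each pair, take every ifoo1 label and fix the two inner levels.
idx = pd.IndexSlice
sliced = {pair: frame.loc[idx[:, pair[0], pair[1]], :] for pair in pairs}
```

Note that the keys have to be tuples here; a list coming out of .tolist() cannot be used as a dictionary key.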

Solution 2:

I don't know exactly what you want to do, but here are some hints to speed up your code:

change

uniqueList = list(pd.unique(initFrame[['bar1','bar4']].values))

to

uniqueList = initFrame[["bar1", "bar4"]].drop_duplicates().values.tolist()

and change the for loop to:

g = initFrame.groupby(level=(1, 2))
# rows from .values.tolist() are lists; convert them to tuples so they are hashable
uniqueSet = set(map(tuple, uniqueList))
dict((key, df) for key, df in g if key in uniqueSet)

or:

g = initFrame.groupby(level=(1, 2))
# get_group and dict keys both need a hashable tuple, not a list
dict((key, g.get_group(key)) for key in map(tuple, uniqueList))

Here is the %timeit compare:

import numpy as np
import pandas as pd

arr = np.random.randint(0, 10, (10000, 2))
df = pd.DataFrame(arr, columns=("A", "B"))

%timeit df.drop_duplicates().values.tolist()
%timeit list(pd.unique(arr))

outputs:

100 loops, best of 3: 3.51 ms per loop
10 loops, best of 3: 94.7 ms per loop
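To make the groupby suggestion easier to try, here is a small self-contained sketch. The frame is invented for illustration and assumes the inner index levels carry the same values as bar1 and bar4; the names frame, groups, wanted and by_pair are placeholders of mine.

```python
import pandas as pd

# Hypothetical data: the two inner index levels mirror bar1 and bar4.
index = pd.MultiIndex.from_tuples(
    [("LABEL1", 5, 1), ("LABEL2", 5, 1), ("LABEL2", 6, 2), ("LABEL3", 6, 3)],
    names=["ifoo1", "ifoo2", "ifoo3"],
)
frame = pd.DataFrame(
    {"bar1": [5, 5, 6, 6], "bar2": list("abcd"), "bar4": [1, 1, 2, 3]},
    index=index,
)

# groupby splits the frame by the two inner index levels in a single pass.
groups = frame.groupby(level=["ifoo2", "ifoo3"])

# Rows are converted to tuples so they can be stored in a set.
wanted = set(map(tuple, frame[["bar1", "bar4"]].drop_duplicates().values.tolist()))

# Keep only the groups whose (ifoo2, ifoo3) key is one of the wanted pairs.
by_pair = {key: sub for key, sub in groups if key in wanted}
```

The point of this design is that the frame is split once, rather than sliced again for every pair.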

Solution 3:

This is not an answer, just a way to visualise a thought from my comment: since a multi-index is grouped, we can compare each ('bar1', 'bar4') value with the previous one, skip the iteration when they are equal, and only perform the dict update when the value changes.

It may not be faster, but if your dataset is huge it could save you from a memory consumption problem. Pseudo code:

# ...replace timer1...
prev, finDict = None, {}
for idx, _ in initFrame[['bar1', 'bar4']].iterrows():
    # idx is the index tuple (ifoo1, ifoo2, ifoo3)
    current = (idx[1], idx[2])
    if current == prev:
        continue  # same pair as the previous row, nothing new to do
    prev = current
    #... whatever faster way to solve your 2nd timer...
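A runnable variant of the same idea, as a rough sketch only: it walks the index instead of calling iterrows, and records the position where each new pair first shows up. The data and the names frame, inner and starts are invented for illustration, and it assumes the frame is already sorted so that equal pairs sit next to each other.

```python
import pandas as pd

# Hypothetical data, already sorted so equal (ifoo2, ifoo3) pairs are adjacent.
index = pd.MultiIndex.from_tuples(
    [("LABEL1", 5, 1), ("LABEL2", 5, 1), ("LABEL2", 6, 2), ("LABEL3", 6, 3)],
    names=["ifoo1", "ifoo2", "ifoo3"],
)
frame = pd.DataFrame({"bar2": list("abcd")}, index=index)

# Drop the outer level so each element is an (ifoo2, ifoo3) tuple.
inner = frame.index.droplevel("ifoo1")

prev, starts = None, {}
for pos, current in enumerate(inner):
    if current == prev:
        continue  # same pair as the previous row, skip it
    prev = current
    # Remember the row position where this pair first appears.
    starts.setdefault(current, pos)
```

Only the index is touched in the loop, so no row Series are built along the way.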

Personally I think @Alexander answers your 2nd timer rather nicely.
