Split Pandas Dataframe By String

I'm new to using Pandas dataframes. I have data in a .csv like this: foo, 1234, bar, 4567 stuff, 7894 New Entry,, morestuff,1345 I'm reading it into the dataframe with df = pd.r

Solution 1:

1) One approach is to split on the fly: read the file line by line and start a new group each time a NewEntry line is encountered.

2) Alternatively, if the dataframe already exists, find the NewEntry rows and slice the dataframe at those positions, collecting the pieces in a dict, dff = {}.

df                                                                 
        col1  col2  
0        foo  1234    
1        bar  4567                
2      stuff  7894                                                        
3   NewEntry   NaN                       
4  morestuff  1345 

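Find the positions of the NewEntry rows, then add -1 at the front and len(df.index) at the end as boundary values, so the first and last chunks are sliced the same way as the ones in between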

import numpy as np

rows = [-1] + np.where(df['col1'] == 'NewEntry')[0].tolist() + [len(df.index)]
rows
[-1, 3, 5]

Create the dict of dataframes

dff = {}
for i, r in enumerate(rows[:-1]):
    # rows strictly between one boundary and the next (marker rows excluded)
    dff[i] = df.iloc[r + 1:rows[i + 1]]

Dict of dataframes {0: dataframe1, 1: dataframe2}

dff                           
{0:     col1  col2            
 0    foo  1234               
 1    bar  4567               
 2  stuff  7894, 1:         col1  col2  
 4  morestuff  1345}

Dataframe 1

dff[0]              
    col1  col2      
0    foo  1234      
1    bar  4567      
2  stuff  7894      

Dataframe 2

dff[1]              
        col1  col2  
4  morestuff  1345 
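
Approach 1 (splitting while reading the file) is not shown in code above. A minimal sketch of it could look like the following. The function name split_while_reading, its parameters, and the in-memory sample are my own assumptions for illustration, and the marker is taken to be the first field of its line.

```python
import csv
import io

import pandas as pd


def split_while_reading(handle, marker="NewEntry", names=("col1", "col2")):
    """Read CSV rows from handle and start a new chunk at each marker row.

    Hypothetical helper: assumes every data row has exactly len(names) fields.
    """
    chunks, current = [], []
    for fields in csv.reader(handle):
        if not fields:
            continue  # skip blank lines
        if fields[0].strip() == marker:
            # marker row: close the current chunk (if any) and start a new one
            if current:
                chunks.append(current)
            current = []
        else:
            current.append([f.strip() for f in fields])
    if current:
        chunks.append(current)  # rows after the last marker
    # values stay as strings here; convert with pd.to_numeric if needed
    return {i: pd.DataFrame(rows, columns=list(names))
            for i, rows in enumerate(chunks)}


sample = io.StringIO("foo,1234\nbar,4567\nstuff,7894\nNewEntry,,\nmorestuff,1345\n")
dff = split_while_reading(sample)
```

Unlike the slicing approach, each dataframe built this way gets a fresh index starting at 0, and the numeric column is left as text unless you convert it afterwards.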

Solution 2:

Using your example data, concatenated three times (I named the columns 'a', 'b', 'c' for convenience), we find the indices of the 'New Entry' rows and then build a list of tuples from consecutive positions to mark the beginning and end of each range.

We can then iterate over this list of tuples, slice the original df, and append each slice to a list. Note that, as written, the rows after the last 'New Entry' (row 14 here) are not included, because no range is built for them:

In [22]:

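import io
import pandas as pd

t="""foo,1234,
bar,4567
stuff,7894
New Entry,,
morestuff,1345"""
# rows with fewer than three fields are padded with NaN
df = pd.read_csv(io.StringIO(t), header=None, names=['a', 'b', 'c'])
# repeat the sample three times to get several 'New Entry' markers
df = pd.concat([df]*3, ignore_index=True)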
df
Out[22]:
            a     b   c
0         foo  1234 NaN
1         bar  4567 NaN
2       stuff  7894 NaN
3   New Entry   NaN NaN
4   morestuff  1345 NaN
5         foo  1234 NaN
6         bar  4567 NaN
7       stuff  7894 NaN
8   New Entry   NaN NaN
9   morestuff  1345 NaN
10        foo  1234 NaN
11        bar  4567 NaN
12      stuff  7894 NaN
13  New Entry   NaN NaN
14  morestuff  1345 NaN
In [30]:

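# index labels of the marker rows
idx = df[df['a'] == 'New Entry'].index
# first range: from the start of the frame up to the first marker
idx_list = [(0, idx[0])]
# remaining ranges: each marker paired with the next one
idx_list = idx_list + list(zip(idx, idx[1:]))
idx_list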
Out[30]:
[(0, 3), (3, 8), (8, 13)]
In [31]:

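df_list = []
for i in idx_list:
    print(i)
    if i[0] == 0:
        # the first range starts on a data row, so keep it
        df_list.append(df[i[0]:i[1]])
    else:
        # later ranges start on a marker row, so skip it
        df_list.append(df[i[0]+1:i[1]])
df_list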
(0, 3)
(3, 8)
(8, 13)
Out[31]:
[       a     b   c
 0    foo  1234 NaN
 1    bar  4567 NaN
 2  stuff  7894 NaN,            a     b   c
 4  morestuff  1345 NaN
 5        foo  1234 NaN
 6        bar  4567 NaN
 7      stuff  7894 NaN,             a     b   c
 9   morestuff  1345 NaN
 10        foo  1234 NaN
 11        bar  4567 NaN
 12      stuff  7894 NaN]
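
If you also want the rows after the last 'New Entry', one alternative (a different technique from the tuple ranges above, not part of the original answer) is to number the groups with a cumulative sum over the marker rows and let groupby do the slicing. This is a sketch under the same assumptions as above: column 'a' holds the marker and the sample is concatenated three times. The names is_marker and group_id are mine.

```python
import io

import pandas as pd

t = """foo,1234,
bar,4567
stuff,7894
New Entry,,
morestuff,1345"""
df = pd.read_csv(io.StringIO(t), header=None, names=['a', 'b', 'c'])
df = pd.concat([df] * 3, ignore_index=True)

# True on every marker row
is_marker = df['a'] == 'New Entry'
# running count of markers seen so far: increases by 1 at each marker row,
# so all rows between two markers share the same group number
group_id = is_marker.cumsum()

# drop the marker rows, then collect one dataframe per group number
df_list = [chunk for _, chunk in df[~is_marker].groupby(group_id[~is_marker])]
```

With this sample, df_list holds four dataframes, and the last one contains only row 14. Each chunk keeps its original index labels, as in the slicing version.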