
Find Max Frequency For Every Sequence_id

I have a Dataframe Like: Time Frq_1 Seq_1 Frq_2 Seq_2 Frq_3 Seq_3 12:43:04 - 30,668 - 30,670 4,620 30,671 12:46:05 -

Solution 1:

This turned out to be quite involved. Here we go anyway:

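import pandas as pd

# Reshape wide -> long: each (Seq_k, Frq_k) pair becomes its own row
long_df = pd.wide_to_long(df.reset_index(), stubnames=['Seq_', 'Frq_'],
                          suffix=r'\d+', i='index', j='j')
# Decimal comma -> dot, '-' placeholder -> NaN, then cast to float
long_df['Frq_'] = pd.to_numeric(long_df.Frq_.str.replace(',','.')
                                .replace('-',float('nan')))
long_df.reset_index(drop=True, inplace=True)
# Row label of the highest frequency within each sequence
ix = long_df.groupby('Seq_').Frq_.idxmax()

print(long_df.loc[ix[ix.notna()].values.astype(int)])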

        Time    Seq_  Frq_
34  12:43:04  30,671  4.62
16  12:49:29  30,690  4.16
42  12:46:38  30,700  4.60

It seems that for the sequence 30,700 the highest frequency is 4.60, not 4.20.


The first step is to reshape the dataframe from wide to long format, so that it has three columns: one for the Time, one for the sequence and one for the frequency. We can use pd.wide_to_long with the stubnames ['Seq_', 'Frq_']:

long_df = pd.wide_to_long(df.reset_index(), stubnames=['Seq_', 'Frq_'],
                              suffix=r'\d+', i='index', j='j')

print(long_df)

             Time    Seq_   Frq_
index j
0     1  12:43:04  30,668      -
1     1  12:46:05  30,699      -
2     1  12:46:17  30,700  4,200
3     1  12:46:18  30,700  3,060
4     1  12:46:18  30,700  3,060
5     1  12:46:19  30,700  3,060
6     1  12:46:20  30,700  3,060
7     1  12:46:37  30,698      -
8     1  12:46:38  30,699      -
9     1  12:47:19  30,668      -
10    1  12:47:20  30,667      -
11    1  12:47:20  30,667      -
12    1  12:47:21  30,667      -
13    1  12:47:21  30,665      -
14    1  12:47:22  30,665      -
15    1  12:48:35  30,688      -
16    1  12:49:29  30,690  4,160
...
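
To make the reshape easier to follow, here is a minimal sketch on a two-row frame. The data in `toy` is made up for illustration (it only mimics the layout in the question), and the names `toy` and `toy_long` are not part of the original answer.

```python
import pandas as pd

# Hypothetical two-row frame in the same wide layout as the question
toy = pd.DataFrame({
    "Time": ["12:43:04", "12:46:17"],
    "Frq_1": ["-", "4,200"],
    "Seq_1": ["30,668", "30,700"],
    "Frq_2": ["4,620", "-"],
    "Seq_2": ["30,671", "30,699"],
})

# Each (Seq_k, Frq_k) pair becomes its own row; the numeric suffix k
# ends up in the index level named by j
toy_long = pd.wide_to_long(toy.reset_index(), stubnames=["Seq_", "Frq_"],
                           suffix=r"\d+", i="index", j="j")
print(toy_long)
```

Two wide rows with two Seq_/Frq_ pairs each give four long rows, and Time is repeated for every pair that came from the same original row.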

The next step is to cast the frequencies to float, so that we can find the maximum values. The decimal comma is replaced with a dot and the '-' placeholder becomes NaN:

long_df['Frq_'] = pd.to_numeric(long_df.Frq_.str.replace(',','.')
                                    .replace('-',float('nan')))

print(long_df)

             Time    Seq_  Frq_
index j
0     1  12:43:04  30,668   NaN
1     1  12:46:05  30,699   NaN
2     1  12:46:17  30,700  4.20
3     1  12:46:18  30,700  3.06
4     1  12:46:18  30,700  3.06
5     1  12:46:19  30,700  3.06
6     1  12:46:20  30,700  3.06
7     1  12:46:37  30,698   NaN
... 
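
As a possible variation on this step (not what the answer itself does), `pd.to_numeric` can be asked to turn unparseable strings into NaN via `errors="coerce"`, which makes the explicit replacement of '-' unnecessary. The sketch below assumes, like the answer, that the comma is a decimal separator; the values in `raw` are made up.

```python
import pandas as pd

# Hypothetical raw values as they appear in the long frame
raw = pd.Series(["-", "4,200", "3,060", "-", "4,160"])

# Swap the decimal comma for a dot; errors="coerce" turns anything that
# still cannot be parsed (here the "-" placeholder) into NaN
freq = pd.to_numeric(raw.str.replace(",", ".", regex=False), errors="coerce")
print(freq.tolist())
```

The trade-off is that `errors="coerce"` silently hides any other malformed value as well, so the explicit replacement is the stricter choice when the input may contain surprises.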

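Then we can reset the index, so that every row has a plain integer label, group by Seq_ and use idxmax to find the label of the row with the highest frequency in each group. One could also think of using max, but that returns only the maximum frequencies and drops the Time column.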

long_df.reset_index(drop=True, inplace=True)
ix = long_df.groupby('Seq_').Frq_.idxmax()

And finally we select those rows, skipping the sequences that have no valid frequency:

print(long_df.loc[ix[ix.notna()].values.astype(int)])

        Time    Seq_  Frq_
34  12:43:04  30,671  4.62
16  12:49:29  30,690  4.16
42  12:46:38  30,700  4.60
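
The difference between max and idxmax in the last two steps can be sketched on a small long-format frame. The data is made up, and dropping the NaN rows before grouping is a swapped-in alternative to the `ix.notna()` filter used above, not the answer's own method.

```python
import pandas as pd

# Hypothetical long-format frame with frequencies already numeric
toy = pd.DataFrame({
    "Time": ["12:46:17", "12:46:38", "12:49:29", "12:43:04"],
    "Seq_": ["30,700", "30,700", "30,690", "30,668"],
    "Frq_": [4.20, 4.60, 4.16, float("nan")],
})

# max() returns only the winning value per group; Time is lost
print(toy.groupby("Seq_").Frq_.max())

# Dropping NaN rows first means every remaining group has a valid maximum,
# so idxmax() yields usable row labels and .loc recovers the full rows
valid = toy.dropna(subset=["Frq_"])
best = valid.loc[valid.groupby("Seq_").Frq_.idxmax()]
print(best)
```

With this ordering there is no need to filter or cast the result of idxmax, at the cost of sequences without any frequency disappearing from the grouping altogether.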
