Find Max Frequency For Every Sequence_id
I have a dataframe like:
    Time  Frq_1   Seq_1  Frq_2   Seq_2  Frq_3   Seq_3
12:43:04      -  30,668      -  30,670  4,620  30,671
12:46:05      -
Solution 1:
This turned out to be quite involved. Here we go anyway:
import pandas as pd

# df is the wide dataframe from the question
long_df = pd.wide_to_long(df.reset_index(), stubnames=['Seq_', 'Frq_'],
                          suffix=r'\d+', i='index', j='j')
# '4,620' -> 4.62 and '-' -> NaN
long_df['Frq_'] = pd.to_numeric(long_df.Frq_.str.replace(',', '.')
                                .replace('-', float('nan')))
long_df.reset_index(drop=True, inplace=True)
# row label of the highest frequency within each sequence
ix = long_df.groupby('Seq_').Frq_.idxmax()
print(long_df.loc[ix[ix.notna()].values.astype(int)])
        Time    Seq_  Frq_
34  12:43:04  30,671  4.62
16  12:49:29  30,690  4.16
42  12:46:38  30,700  4.60
It seems that for the sequence 30,700 the highest frequency is 4.60, not 4.20.
The first step is to collapse the dataframe into three columns: one for the Time, one for the sequence and one for the frequency. We can use pd.wide_to_long with the stubnames ['Seq_', 'Frq_']:
long_df = pd.wide_to_long(df.reset_index(), stubnames=['Seq_', 'Frq_'],
                          suffix=r'\d+', i='index', j='j')
print(long_df)
             Time    Seq_   Frq_
index j
0     1  12:43:04  30,668      -
1     1  12:46:05  30,699      -
2     1  12:46:17  30,700  4,200
3     1  12:46:18  30,700  3,060
4     1  12:46:18  30,700  3,060
5     1  12:46:19  30,700  3,060
6     1  12:46:20  30,700  3,060
7     1  12:46:37  30,698      -
8     1  12:46:38  30,699      -
9     1  12:47:19  30,668      -
10    1  12:47:20  30,667      -
11    1  12:47:20  30,667      -
12    1  12:47:21  30,667      -
13    1  12:47:21  30,665      -
14    1  12:47:22  30,665      -
15    1  12:48:35  30,688      -
16    1  12:49:29  30,690  4,160
...
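Since the full dataframe from the question is not reproduced here, the reshape can be tried on a small stand-in. The sketch below is only an illustration: the two-row frame and its values are made up in the same wide layout, and only two Frq_/Seq_ pairs are used.

```python
import pandas as pd

# Hypothetical two-row frame in the same wide layout as the question.
df = pd.DataFrame({
    "Time":  ["12:43:04", "12:46:17"],
    "Frq_1": ["-", "4,200"],
    "Seq_1": ["30,668", "30,700"],
    "Frq_2": ["4,620", "-"],
    "Seq_2": ["30,671", "30,701"],
})

# reset_index() adds an 'index' column that serves as the row identifier (i).
# The numeric suffix of each Frq_/Seq_ column ends up in the new level 'j'.
long_df = pd.wide_to_long(
    df.reset_index(),
    stubnames=["Seq_", "Frq_"],
    i="index",
    j="j",
    suffix=r"\d+",
)

print(long_df)
```

Each original row contributes one long row per suffix, so two wide rows with two Frq_/Seq_ pairs give four long rows, each keeping its Time.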
The next step is to cast the frequencies to float, so that we can find the maximum values:
long_df['Frq_'] = pd.to_numeric(long_df.Frq_.str.replace(',', '.')
                                .replace('-', float('nan')))
print(long_df)
             Time    Seq_  Frq_
index j
0     1  12:43:04  30,668   NaN
1     1  12:46:05  30,699   NaN
2     1  12:46:17  30,700  4.20
3     1  12:46:18  30,700  3.06
4     1  12:46:18  30,700  3.06
5     1  12:46:19  30,700  3.06
6     1  12:46:20  30,700  3.06
7     1  12:46:37  30,698   NaN
...
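As a possible variation on the conversion step (not what the answer above does), pd.to_numeric can be asked to coerce anything it cannot parse into NaN, which makes the explicit replacement of '-' unnecessary. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical frequency strings: decimal comma, '-' for missing values.
frq = pd.Series(["-", "4,620", "4,200", "-"])

# Swap the decimal comma for a point, then let to_numeric turn
# unparseable entries such as '-' into NaN.
as_float = pd.to_numeric(frq.str.replace(",", ".", regex=False), errors="coerce")

print(as_float)
```

The trade-off is that errors="coerce" silently turns every unparseable string into NaN, not only '-', so the explicit replace is the stricter choice when unexpected values should raise an error.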
Then we can group by Seq_ and find the indices of the rows with the highest values. One could also think of using max, but this would drop the Time column.
long_df.reset_index(drop=True, inplace=True)
ix = long_df.groupby('Seq_').Frq_.idxmax()
And finally index based on the above:
print(long_df.loc[ix[ix.notna()].values.astype(int)])
        Time    Seq_  Frq_
34  12:43:04  30,671  4.62
16  12:49:29  30,690  4.16
42  12:46:38  30,700  4.60
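To make the difference between max and idxmax concrete, here is a minimal sketch on a made-up long-format frame (the rows are hypothetical and already converted to floats). It also drops the NaN rows before grouping, which is an alternative to filtering the NaN labels out of ix afterwards.

```python
import pandas as pd

# Hypothetical long-format data, already converted to floats.
long_df = pd.DataFrame({
    "Time": ["12:46:17", "12:46:18", "12:46:38", "12:47:19", "12:49:29"],
    "Seq_": ["30,700", "30,700", "30,700", "30,668", "30,690"],
    "Frq_": [4.20, 3.06, 4.60, float("nan"), 4.16],
})

# Sequences that only have NaN frequencies disappear here.
valid = long_df.dropna(subset=["Frq_"])

# max() returns only the highest value per sequence: Time is lost.
highest = valid.groupby("Seq_")["Frq_"].max()

# idxmax() returns the row label of the highest value per sequence,
# which can be used to pull the complete rows, Time included.
ix = valid.groupby("Seq_")["Frq_"].idxmax()
best = long_df.loc[ix]

print(highest)
print(best)
```

Because the NaN rows are removed first, ix contains only integer labels and can be passed to loc directly.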