
Scikit-learn Labeled Dataset Creation From Segmented Time Series

INTRO I have a Pandas DataFrame that represents a segmented time series of different users (i.e., user1 & user2). I want to train a scikit-learn classifier with the mentioned D

Solution 1:

This is not trivial, and there may be several ways of formulating the problem for consumption by an ML algorithm. Try more than one formulation and see which gives the best results.

As you already found, you need two things: a matrix X of shape (n_samples, n_features) and a target vector y of length n_samples. Let's start with the target y.

Target:

As you want to predict a user from a discrete pool of usernames, you have a classification problem, and your target will be a vector with np.unique(y) == ['user1', 'user2', ...].

Features

Your features are the information that you provide to the ML algorithm for each label/user/target. Unfortunately, most algorithms require this information to have a fixed length, and variable-length time series don't fit that requirement. So if you want to stick to classic algorithms, you need some way to condense the time series information for a user into a fixed-length vector. Some possibilities are the mean, min, max, sum, first and last values, a histogram, spectral power, etc. You will need to come up with the ones that make sense for your given problem.

So if you ignore the SegID information your X matrix will look like this:

y/features
           min  max  ...  sum
user1      0.1  1.2  ...  1.1   # <- first time series for user 1
user1      0.0  1.3  ...  1.1   # <- second time series for user 1
user2      0.3  0.4  ...  13.0  # <- first time series for user 2
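A minimal sketch of how such a matrix could be built with pandas. The DataFrame layout and the column names `user`, `series_id`, and `value` are assumptions made for illustration, since the original data is not shown here; the numbers are made up.

```python
import pandas as pd

# Hypothetical long-format data: one row per observation.
# "user", "series_id" and "value" are assumed column names, not the original ones.
df = pd.DataFrame({
    "user":      ["user1"] * 5 + ["user2"] * 3,
    "series_id": [0, 0, 0, 1, 1, 2, 2, 2],
    "value":     [0.1, 1.2, 0.5, 0.0, 1.3, 0.3, 0.4, 0.35],
})

# Collapse each variable-length series into one fixed-length row
# of summary statistics (one row per time series, not per user).
features = (
    df.groupby(["user", "series_id"])["value"]
      .agg(["min", "max", "mean", "sum"])
      .reset_index()
)

X = features[["min", "max", "mean", "sum"]].to_numpy()  # shape (n_samples, n_features)
y = features["user"].to_numpy()                         # one label per time series
```

Each time series contributes exactly one row, so a user with several series appears several times in y.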

As SegID is itself a time series, you also need to encode it as fixed-length information, for example a histogram/counts of all possible values, the most frequent value, etc.
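As one possible sketch of the histogram/counts encoding, `pd.crosstab` can count how often each SegID value occurs in each series. The column name `series_id`, the sample values, and the list of possible SegID values (`all_seg_ids`) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: two time series with their SegID values.
df = pd.DataFrame({
    "series_id": [0, 0, 0, 1, 1, 1],
    "SegID":     [1, 1, 2, 3, 5, 5],
})

# Assumed set of all SegID values that can occur; fixing it up front
# guarantees the same number of columns for every dataset.
all_seg_ids = [1, 2, 3, 4, 5]

# One row per series, one column per possible SegID value, cells are counts.
counts = (
    pd.crosstab(df["series_id"], df["SegID"])
      .reindex(columns=all_seg_ids, fill_value=0)
)
```

Reindexing against a fixed list of SegID values keeps the feature vector the same length at training and prediction time, even when a value never appears in a given batch.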

With the most frequent value and the minimum of SegID added as features, you will have:

y/features
           min  max  ...  sum  segID_most_freq  segID_min
user1      0.1  1.2  ...  1.1  1                1
user1      0.3  0.4  ...  13   2                1
user2      0.3  0.4  ...  13   5                3

The algorithm will look at this data and "think": for user1 the minimum segID is always 1, so if at prediction time I see a user whose time series has a minimum segID of 1, it should be user1. If it is around 3, it is probably user2, and so on.
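Putting the pieces together, a rough end-to-end sketch might look like the following. The column names `user`, `series_id`, and `value`, the sample numbers, and the choice of `DecisionTreeClassifier` are all assumptions for illustration; any scikit-learn classifier could be swapped in.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical long-format data: four time series, two per user.
df = pd.DataFrame({
    "user":      ["user1"] * 6 + ["user2"] * 6,
    "series_id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "SegID":     [1, 1, 2, 1, 2, 2, 3, 5, 5, 3, 3, 4],
    "value":     [0.1, 1.2, 0.5, 0.0, 1.3, 0.4, 0.3, 0.4, 0.35, 0.2, 0.6, 0.5],
})

# One fixed-length feature row per time series.
features = df.groupby(["user", "series_id"]).agg(
    value_min=("value", "min"),
    value_max=("value", "max"),
    value_sum=("value", "sum"),
    segID_most_freq=("SegID", lambda s: s.mode().iloc[0]),
    segID_min=("SegID", "min"),
).reset_index()

feature_cols = ["value_min", "value_max", "value_sum",
                "segID_most_freq", "segID_min"]
X = features[feature_cols].to_numpy()
y = features["user"].to_numpy()

# Fit a simple classifier on the condensed representation.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# At prediction time, a new series must be condensed with the same features,
# in the same column order, before calling predict.
X_new = np.array([[0.2, 0.5, 1.1, 4, 3]])
predicted_user = clf.predict(X_new)
```

With only four samples this is purely illustrative; on real data you would hold out some series per user to check how well the chosen features generalize.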

Keep in mind that this is only one possible approach. It is often useful to ask: what information will I have at prediction time that lets me identify the user I am seeing, and why would this information point to that user?
