
Improve Flow Python Classifier And Combine Features

I am trying to create a classifier to categorize websites. I am doing this for the very first time, so it's all quite new to me. Currently I am trying to do some Bag of Words on a c…

Solution 1:

First, CountVectorizer and TfidfTransformer can be removed by using TfidfVectorizer (which is essentially a combination of both).

Second, the TfidfVectorizer and MultinomialNB can be combined in a Pipeline. A pipeline sequentially applies a list of transforms and a final estimator. When fit() is called on a Pipeline, it fits the transforms one after the other while transforming the data, then fits the final estimator on the transformed data. When score() or predict() is called, it only calls transform() on all the transformers and score() or predict() on the last one.

So the code will look like:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('vectorizer', TfidfVectorizer(encoding="cp1252",
                                                    stop_words="english",
                                                    use_idf=True)),
                     ('nb', MultinomialNB())])

accuracy={}
for item in ['text', 'title', 'headings']:

    # No need to save the return of fit(), it returns self
    pipeline.fit(tr_data[item], tr_data['class'])

    # Apply transforms, and score with the final estimator
    accuracy[item] = pipeline.score(te_data[item], te_data['class'])
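For context, tr_data and te_data are assumed throughout to be pandas DataFrames with one text column per feature plus a 'class' column. A minimal sketch with purely hypothetical toy rows, just to make the snippet self-contained:

import pandas as pd

# Hypothetical toy data; replace with your real train/test DataFrames
tr_data = pd.DataFrame({
    'text': ["buy cheap shoes online", "latest football scores"],
    'title': ["Shoe Shop", "Sports News"],
    'headings': ["Deals", "Results"],
    'class': ["shopping", "sports"],
})
te_data = pd.DataFrame({
    'text': ["discount sneakers sale"],
    'title': ["Sneaker Outlet"],
    'headings': ["Offers"],
    'class': ["shopping"],
})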

EDIT: Edited to include combining all the features to get a single accuracy score:

To combine the results, we can follow multiple approaches. One that is easy to understand (though it leans back toward the cluttered side) is the following:

# Using scipy to concatenate, because TfidfVectorizer returns sparse matrices
from scipy.sparse import hstack

def get_tfidf(tr_data, te_data, columns):

    train = None
    test = None

    tfidfVectorizer = TfidfVectorizer(encoding="cp1252",
                                      stop_words="english",
                                      use_idf=True)
    for item in columns:
        # fit_transform refits the vectorizer on each column's training text,
        # so the matching transform on the test column must happen before the next fit
        temp_train = tfidfVectorizer.fit_transform(tr_data[item])
        train = hstack((train, temp_train)) if train is not None else temp_train

        temp_test = tfidfVectorizer.transform(te_data[item])
        test = hstack((test, temp_test)) if test is not None else temp_test

    return train, test

train_tfidf, test_tfidf = get_tfidf(tr_data, te_data, ['text', 'title', 'headings']) 

nb = MultinomialNB()
nb.fit(train_tfidf, tr_data['class'])
nb.score(test_tfidf, te_data['class'])

The second (and preferable) approach is to include all of this in a Pipeline. But because of the need to select the different columns ('text', 'title', 'headings') and concatenate the results, it's not that straightforward: we need to use FeatureUnion for that, along the lines of the heterogeneous-data FeatureUnion example in the scikit-learn documentation. A rough sketch follows.
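Here is a minimal sketch, assuming tr_data and te_data are pandas DataFrames as above; FunctionTransformer is used as a simple stand-in for the custom column selector shown in Solution 2:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

def column_branch(col):
    # One branch per column: select the column, then vectorize it
    return Pipeline([
        ('select', FunctionTransformer(lambda df, c=col: df[c], validate=False)),
        ('tfidf', TfidfVectorizer(encoding="cp1252",
                                  stop_words="english",
                                  use_idf=True)),
    ])

pipeline = Pipeline([
    ('features', FeatureUnion([(col, column_branch(col))
                               for col in ['text', 'title', 'headings']])),
    ('nb', MultinomialNB()),
])

pipeline.fit(tr_data, tr_data['class'])
print(pipeline.score(te_data, te_data['class']))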

Third, if you are open to using other libraries, then DataFrameMapper from sklearn-pandas can simplify the use of the FeatureUnion shown in the previous example.
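As a rough illustration (a minimal sketch, assuming the sklearn-pandas package is installed; sparse=True keeps the stacked TF-IDF matrices sparse):

from sklearn_pandas import DataFrameMapper
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Map each DataFrame column to its own TfidfVectorizer
mapper = DataFrameMapper([
    ('text', TfidfVectorizer(encoding="cp1252", stop_words="english")),
    ('title', TfidfVectorizer(encoding="cp1252", stop_words="english")),
    ('headings', TfidfVectorizer(encoding="cp1252", stop_words="english")),
], sparse=True)

pipeline = Pipeline([('features', mapper), ('nb', MultinomialNB())])
pipeline.fit(tr_data, tr_data['class'])
pipeline.score(te_data, te_data['class'])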

If you do want to go the second or third way, feel free to ask if you run into any difficulties.

NOTE: I have not checked the code, but it should work (barring some syntax errors, if any). I will check as soon as I am at my PC.

Solution 2:

The snippet below is a possible way to simplify your code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

cv = CountVectorizer(encoding="cp1252", stop_words="english")
tt = TfidfTransformer(use_idf=True)
mnb = MultinomialNB()

accuracy = {}
for item in ['text', 'title', 'headings']:
    X_tr_counts = cv.fit_transform(tr_data[item])
    X_tr_tfidf = tt.fit_transform(X_tr_counts)
    mnb.fit(X_tr_tfidf, tr_data['class'])
    X_te_counts = cv.transform(te_data[item])
    X_te_tfidf = tt.transform(X_te_counts)
    accuracy[item] = mnb.score(X_te_tfidf, te_data['class'])

The classification success rates are stored in the dictionary accuracy under the keys 'text', 'title', and 'headings'.
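To inspect the per-feature scores afterwards, something like:

for item, acc in accuracy.items():
    print("%s: %.3f" % (item, acc))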

EDIT

A more elegant solution - not necessarily simpler though - would consist in using Pipeline and FeatureUnion as pointed out by @Vivek Kumar. This approach would also allow you to combine all the features into a single model and apply weighting factors to the features extracted from the different items of your dataset.

First we import the necessary modules.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

Then we define a transformer class (as suggested in the heterogeneous-data FeatureUnion example in the scikit-learn documentation) to select the different items of your dataset:

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, data_dict):
        # Return the single column named by self.key
        return data_dict[self.key]
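As a quick check (assuming tr_data is a pandas DataFrame), the selector simply returns one column:

titles = ItemSelector(key='title').fit_transform(tr_data)  # the 'title' Series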

We are now ready to define the pipeline:

pipeline = Pipeline([
  ('features', FeatureUnion(
    transformer_list=[
      ('text_feats', Pipeline([
        ('text_selector', ItemSelector(key='text')),
        ('text_vectorizer', TfidfVectorizer(encoding="cp1252", 
                                            stop_words="english", 
                                            use_idf=True))
        ])),
      ('title_feats', Pipeline([
        ('title_selector', ItemSelector(key='title')),
        ('title_vectorizer', TfidfVectorizer(encoding="cp1252", 
                                             stop_words="english", 
                                             use_idf=True))
        ])),
      ('headings_feats', Pipeline([
        ('headings_selector', ItemSelector(key='headings')),
        ('headings_vectorizer', TfidfVectorizer(encoding="cp1252", 
                                                stop_words="english", 
                                                use_idf=True))
        ])),
    ],
    transformer_weights={'text_feats': 0.5,   # change weights as appropriate
                         'title_feats': 0.3,
                         'headings_feats': 0.2}
    )),
  ('classifier', MultinomialNB())
])

And finally, we can classify data in a straightforward manner:

pipeline.fit(tr_data, tr_data['class'])
pipeline.score(te_data, te_data['class'])
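Once fitted, the same pipeline can label new pages directly; new_data below is a hypothetical DataFrame with the same 'text', 'title', and 'headings' columns:

predictions = pipeline.predict(new_data)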
