How To Choose Parameters In Tfidfvectorizer In Sklearn During Unsupervised Clustering

August 31, 2023

TfidfVectorizer provides an easy way to encode & transform texts into vectors. My question is how to choose the proper values for parameters such as min_df, max_features, smoot

Solution 1:

If you are, for instance, using these vectors in a classification task, you can vary these parameters (and of course also the parameters of the classifier) and see which values give you the best performance.

You can do that easily in sklearn with the GridSearchCV and Pipeline objects:

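from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# stop_words, train_x and train_y are assumed to be defined elsewhere.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(MultinomialNB(
        fit_prior=True, class_prior=None))),
])
# Keys follow the <step name>__<parameter> convention of Pipeline.
parameters = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'clf__estimator__alpha': (1e-2, 1e-3)
}

grid_search_tune = GridSearchCV(pipeline, parameters, cv=2, n_jobs=2, verbose=3)
grid_search_tune.fit(train_x, train_y)

print("Best parameters set:")
print(grid_search_tune.best_estimator_.steps)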
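
For the unsupervised case in the question there are no labels to score against, so the same vary-and-compare idea needs an internal measure instead. The sketch below is one possible way to do that, not part of the original answer: it assumes KMeans as the clustering step and the silhouette score (cosine distance) as the measure, and it uses a made-up toy corpus and an illustrative parameter grid. Because each setting changes the feature space, the scores are only roughly comparable, so treat the "best" setting as a hint and inspect the resulting clusters by hand.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid

# Toy corpus, purely illustrative; replace with your own documents.
docs = [
    "the cat sat on the mat",
    "a cat and a kitten played",
    "the cat chased mice at night",
    "the kitten chased the cat",
    "stocks fell as markets closed",
    "the stock market rallied today",
    "investors sold stocks and bonds",
    "bond markets and stock prices rose",
]

# Candidate TfidfVectorizer settings (values chosen only for illustration).
param_grid = ParameterGrid({
    "min_df": [1, 2],
    "ngram_range": [(1, 1), (1, 2)],
})

results = []
for params in param_grid:
    # Vectorize with the current settings.
    X = TfidfVectorizer(stop_words="english", **params).fit_transform(docs)
    # Cluster; n_clusters=2 is an assumption made for this toy corpus.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    # Internal quality measure: higher silhouette means tighter, better separated clusters.
    score = silhouette_score(X, labels, metric="cosine")
    results.append((score, params))

# Pick the setting with the highest silhouette score.
best_score, best_params = max(results, key=lambda r: r[0])
print("Best parameters:", best_params)
print("Silhouette score:", round(best_score, 3))
```

The number of clusters can be added to the loop in the same way if it is also unknown.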
