
Sklearn Tfidf On Large Corpus Of Documents

In the context of an internship project, I have to perform a TF-IDF analysis over a large set of files (~18,000). I am trying to use the TfidfVectorizer from sklearn, but I'm running into problems with a corpus of this size.

Solution 1:

Have you tried the input='filename' parameter of TfidfVectorizer? Something like this:

from sklearn.feature_extraction.text import TfidfVectorizer

# List containing the file paths of all the documents
raw_docs_filepaths = [...]

tfidf_vectorizer = TfidfVectorizer(input='filename')
tfidf_data = tfidf_vectorizer.fit_transform(raw_docs_filepaths)

This should work because, with input='filename', the vectorizer opens a single file at a time, only when that document is being processed. This can be confirmed by checking the decode method in the scikit-learn source code:

def decode(self, doc):
    ...
    if self.input == 'filename':
        with open(doc, 'rb') as fh:
            doc = fh.read()
    ...
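For completeness, here is a minimal end-to-end sketch of this approach. The corpus/ directory and the *.txt pattern are assumptions for illustration only; adjust them to wherever your files actually live.

from glob import glob
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical location of the ~18,000 documents; adjust path and pattern.
raw_docs_filepaths = sorted(glob('corpus/*.txt'))

# With input='filename', each file is read only while it is being processed,
# so the raw texts are never all held in memory at once.
tfidf_vectorizer = TfidfVectorizer(input='filename')
tfidf_data = tfidf_vectorizer.fit_transform(raw_docs_filepaths)

print(tfidf_data.shape)  # sparse matrix of shape (n_documents, n_features)

The result is a scipy sparse matrix, so even a large vocabulary over ~18,000 documents stays reasonably compact.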
