assembl.nlp.indexedcorpus module

An indexed corpus is a mechanism for random access to corpora.

While the standard corpus interface in gensim only allows iterating over a corpus (for doc in corpus: pass), an indexed corpus also allows accessing documents directly with corpus[docno], in O(1) look-up time.

This is achieved by writing an extra file (by default, the corpus file name plus an ‘.index’ suffix) that stores the byte offset of the beginning of each document.
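
The byte-offset idea can be illustrated with a minimal, standard-library-only sketch. It is not the gensim or assembl implementation: the helper names write_with_index and read_doc, the one-document-per-line layout, and the pickled offset list are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def write_with_index(path, docs):
    """Write one document per line and save byte offsets to path + '.index'.

    Hypothetical helper: the file layout here is an assumption, not the
    Matrix Market format that MmCorpus actually writes.
    """
    offsets = []
    with open(path, "wb") as fout:
        for doc in docs:
            offsets.append(fout.tell())  # byte offset where this document starts
            fout.write((doc + "\n").encode("utf-8"))
    with open(path + ".index", "wb") as fidx:
        pickle.dump(offsets, fidx)
    return offsets


def read_doc(path, docno):
    """Fetch document number docno by seeking straight to its stored offset."""
    with open(path + ".index", "rb") as fidx:
        offsets = pickle.load(fidx)
    with open(path, "rb") as fin:
        fin.seek(offsets[docno])  # jump to the document without scanning the file
        return fin.readline().decode("utf-8").rstrip("\n")


tmp_dir = tempfile.mkdtemp()
corpus_path = os.path.join(tmp_dir, "toy_corpus.txt")
offsets = write_with_index(corpus_path, ["first doc", "second doc", "third doc"])
second = read_doc(corpus_path, 1)
```

Because the reader seeks to a known offset instead of scanning from the start of the file, the cost of fetching one document does not grow with its position in the corpus.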

class assembl.nlp.indexedcorpus.IdMmCorpus(fname)

Bases: gensim.corpora.mmcorpus.MmCorpus

classmethod serialize(fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False, dockeys_fname=None)

Serialize the corpus together with offset metadata, which allows documents to be accessed directly by index after loading.

fname (str): Path to output file.
corpus (iterable of iterable of (int, float)): Corpus in BoW format.
id2word (dict of (int, str), optional): Mapping id -> word.
index_fname (str, optional): Where to save the resulting index; if None, the index is stored in fname.index.
progress_cnt (int, optional): Number of documents after which progress info is printed.
labels (bool, optional): If True, ignore the first column (class labels).
metadata (bool, optional): If True, ensure that serialize will write out article titles to a pickle file.

>>> from gensim.corpora import MmCorpus
>>> from gensim.test.utils import get_tmpfile
>>>
>>> corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
>>> output_fname = get_tmpfile("test.mm")
>>>
>>> MmCorpus.serialize(output_fname, corpus)
>>> mm = MmCorpus(output_fname)  # `mm` document stream now has random access
>>> print(mm[1])  # retrieve document no. 1 directly by index
[(1, 0.1)]

class assembl.nlp.indexedcorpus.IdSlicedCorpus(corpus, slice_)

Bases: gensim.utils.SlicedCorpus
