assembl.nlp.indexedcorpus module

An indexed corpus is a mechanism for random access to the documents of a corpus.

While the standard corpus interface in gensim only supports iterating over a corpus with for doc in corpus: pass, an indexed corpus also supports retrieving a single document with corpus[docno], in O(1) look-up time.

This is achieved by writing an extra index file (by default, the corpus file name plus a ‘.index’ suffix) that stores the byte offset at which each document begins.
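The byte-offset idea can be illustrated with a minimal, standard-library-only sketch. It is not gensim's or assembl's implementation: the helper names write_with_index and read_doc, the one-document-per-line text format, and the pickled list of offsets are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def write_with_index(path, docs, index_path=None):
    """Write one document per line and record where each document starts.

    Hypothetical helper: the file format and the pickled offset list are
    illustrative assumptions, not the on-disk format used by gensim.
    """
    index_path = index_path or path + ".index"
    offsets = []
    with open(path, "wb") as fout:
        for doc in docs:
            offsets.append(fout.tell())  # byte offset of this document's first byte
            fout.write((doc + "\n").encode("utf-8"))
    with open(index_path, "wb") as fidx:
        pickle.dump(offsets, fidx)
    return index_path


def read_doc(path, docno, index_path=None):
    """Fetch document number `docno` by seeking to its stored offset."""
    index_path = index_path or path + ".index"
    with open(index_path, "rb") as fidx:
        offsets = pickle.load(fidx)
    with open(path, "rb") as fin:
        fin.seek(offsets[docno])  # jump directly, without scanning earlier documents
        return fin.readline().decode("utf-8").rstrip("\n")


tmpdir = tempfile.mkdtemp()
corpus_path = os.path.join(tmpdir, "toy_corpus.txt")
docs = ["1:0.3 2:0.1", "1:0.1", "2:0.3"]

write_with_index(corpus_path, docs)
print(read_doc(corpus_path, 1))  # prints "1:0.1"
```

Because the reader seeks straight to the stored offset, the cost of fetching a document does not depend on how many documents precede it. This sketch reloads the index on every call for brevity; a real reader would load it once and keep it in memory.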

class assembl.nlp.indexedcorpus.IdMmCorpus(fname)[source]

Bases: gensim.corpora.mmcorpus.MmCorpus

classmethod serialize(fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False, dockeys_fname=None)[source]

Serialize the corpus together with offset metadata, which allows documents to be accessed directly by index after loading.

fname : str

Path to output file.

corpus : iterable of iterable of (int, float)

Corpus in BoW format.

id2word : dict of (int, str), optional

Mapping id -> word.

index_fname : str, optional

Where to save the resulting index. If None, the index is stored in fname.index.

progress_cnt : int, optional

Number of documents after which progress info is printed.

labels : bool, optional

If True, ignore the first column (class labels).

metadata : bool, optional

If True, serialize also writes article titles to a pickle file.

>>> from gensim.corpora import MmCorpus
>>> from gensim.test.utils import get_tmpfile
>>>
>>> corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
>>> output_fname = get_tmpfile("test.mm")
>>>
>>> MmCorpus.serialize(output_fname, corpus)
>>> mm = MmCorpus(output_fname)  # `mm` document stream now has random access
>>> print(mm[1])  # retrieve document no. 1 (zero-based), etc.
[(1, 0.1)]

class assembl.nlp.indexedcorpus.IdSlicedCorpus(corpus, slice_)[source]

Bases: gensim.utils.SlicedCorpus