Batches Utils

This page describes the BatchVectorizer class.

class artm.BatchVectorizer(batches=None, collection_name=None, data_path='', data_format='batches', target_folder=None, batch_size=1000, batch_name_type='code', data_weight=1.0, n_wd=None, vocabulary=None, gather_dictionary=True, class_ids=None, process_in_memory_model=None)
__init__(batches=None, collection_name=None, data_path='', data_format='batches', target_folder=None, batch_size=1000, batch_name_type='code', data_weight=1.0, n_wd=None, vocabulary=None, gather_dictionary=True, class_ids=None, process_in_memory_model=None)
Parameters:
  • collection_name (str) – the name of the text collection (required if data_format == ‘bow_uci’)
  • data_path (str) – the meaning depends on data_format:
    1) if data_format == ‘bow_uci’ => folder containing the ‘docword.collection_name.txt’ and ‘vocab.collection_name.txt’ files;
    2) if data_format == ‘vowpal_wabbit’ => file in Vowpal Wabbit format;
    3) if data_format == ‘bow_n_wd’ => not used;
    4) if data_format == ‘batches’ => folder containing batches
  • data_format (str) – the type of input data: 1) ‘bow_uci’: Bag-Of-Words in UCI format; 2) ‘vowpal_wabbit’: Vowpal Wabbit format; 3) ‘bow_n_wd’: result of CountVectorizer or a similar tool; 4) ‘batches’: the BigARTM data format
  • batch_size (int) – number of documents to be stored in each batch
  • target_folder (str) – full path to the folder where the generated batches will be stored; if not set, no batches will be kept for further work
  • batches (list of str) – if process_in_memory_model is None, a list of batch file names without the directory part (in this case the required parameters are batches, data_path and data_format == ‘batches’); otherwise a list of batches (messages.Batch objects) loaded in memory
  • batch_name_type (str) – how to name batches: in natural order (‘code’) or using random GUIDs (‘guid’)
  • data_weight (float) – weight for a group of batches from data_path; it can also be a list of floats, in which case data_path (and target_folder, if data_format != ‘batches’) should be lists as well, with one weight corresponding to one path from the data_path list
  • n_wd (array) – matrix with n_wd counters
  • vocabulary (dict) – the vocabulary as a dict: the key is an index of n_wd, the value is the corresponding token
  • gather_dictionary (bool) – whether to create the default dictionary in the vectorizer; automatically set to True if data_format == ‘bow_n_wd’, and automatically set to False if data_format == ‘batches’ or data_weight is a list
  • class_ids (list of str or str) – list of class_ids or single class_id to parse and include in batches
  • process_in_memory_model (artm.ARTM) – the ARTM instance that will use this vectorizer; required when batches stored on disk need to be processed in RAM (only if data_format == ‘batches’). NOTE: this makes the vectorizer model-specific.
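The n_wd and vocabulary parameters are easiest to understand on a toy collection. The following is a minimal sketch, not the library's own code: build_n_wd and make_vectorizer are hypothetical helper names, and the layout (rows of n_wd are tokens, columns are documents) is an assumption inferred from the description of vocabulary as "key - index of n_wd". The counters are built with the standard library only; the BatchVectorizer call is isolated in a function that needs BigARTM and NumPy installed.

```python
from collections import Counter


def build_n_wd(docs):
    """Return (n_wd, vocabulary) for a list of tokenized documents.

    Assumed layout: n_wd[w][d] is the count of token w in document d,
    and vocabulary maps each row index of n_wd to its token.
    """
    tokens = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(tokens)}
    n_wd = [[0] * len(docs) for _ in tokens]
    for d, doc in enumerate(docs):
        for tok, count in Counter(doc).items():
            n_wd[index[tok]][d] = count
    vocabulary = {i: tok for tok, i in index.items()}
    return n_wd, vocabulary


def make_vectorizer(n_wd, vocabulary, batch_size=1000):
    """Wrap the counters in a BatchVectorizer (requires BigARTM and NumPy)."""
    import numpy as np
    import artm
    return artm.BatchVectorizer(data_format='bow_n_wd',
                                n_wd=np.array(n_wd),
                                vocabulary=vocabulary,
                                batch_size=batch_size)


# Two tiny documents, already split into tokens.
docs = [['cat', 'sat', 'cat'], ['dog', 'sat']]
n_wd, vocabulary = build_n_wd(docs)
print(n_wd)        # one row per token, one column per document
print(vocabulary)  # row index -> token
```

With BigARTM installed, make_vectorizer(n_wd, vocabulary) would then produce a vectorizer; according to the parameter list above, gather_dictionary is set to True automatically for this data format.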
batch_size
Returns:the user-defined size of the batches
batches_ids
Returns:list of batch file names if process_in_memory == False, otherwise the list of in-memory batch ids
batches_list
Returns:list of batches if process_in_memory == False, otherwise the list of in-memory batch ids
data_path
Returns:the disk path of batches
dictionary
Returns:Dictionary object if the gather_dictionary parameter was True, otherwise None
num_batches
Returns:the number of batches
process_in_memory
Returns:whether the vectorizer processes batches in core memory (RAM)
weights
Returns:list of batch weights
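The read-only properties listed above can be collected into one dict for logging or debugging. This is a minimal sketch: summarize is a hypothetical helper, not part of artm, and it is exercised here on a stand-in object with made-up values so that it runs without BigARTM installed. A real artm.BatchVectorizer instance could be passed in the same way.

```python
from types import SimpleNamespace

# Property names taken from the list above; batches_list is left out
# to keep the summary small.
PROPERTIES = ('batch_size', 'batches_ids', 'data_path', 'dictionary',
              'num_batches', 'process_in_memory', 'weights')


def summarize(vectorizer):
    """Collect the documented properties of a vectorizer into a dict."""
    return {name: getattr(vectorizer, name) for name in PROPERTIES}


# Stand-in for a real vectorizer; every value below is made up.
fake_vectorizer = SimpleNamespace(
    batch_size=1000,
    batches_ids=['first.batch', 'second.batch'],
    data_path='my_batches',
    dictionary=None,
    num_batches=2,
    process_in_memory=False,
    weights=[1.0, 1.0],
)

summary = summarize(fake_vectorizer)
for name, value in summary.items():
    print(name, '=', value)
```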