Batches Utils
This page describes the BatchVectorizer class.
class artm.BatchVectorizer(batches=None, collection_name=None, data_path='', data_format='batches', target_folder=None, batch_size=1000, batch_name_type='code', data_weight=1.0, n_wd=None, vocabulary=None, gather_dictionary=True, class_ids=None, process_in_memory_model=None)
__init__(batches=None, collection_name=None, data_path='', data_format='batches', target_folder=None, batch_size=1000, batch_name_type='code', data_weight=1.0, n_wd=None, vocabulary=None, gather_dictionary=True, class_ids=None, process_in_memory_model=None)
Parameters:
- collection_name (str) – the name of the text collection (required if data_format == ‘bow_uci’)
- data_path (str) – 1) if data_format == ‘bow_uci’ => folder containing the ‘docword.collection_name.txt’ and ‘vocab.collection_name.txt’ files; 2) if data_format == ‘vowpal_wabbit’ => file in Vowpal Wabbit format; 3) if data_format == ‘bow_n_wd’ => not used; 4) if data_format == ‘batches’ => folder containing batches
- data_format (str) – the type of input data: 1) ‘bow_uci’ (Bag-Of-Words in UCI format); 2) ‘vowpal_wabbit’ (Vowpal Wabbit format); 3) ‘bow_n_wd’ (output of CountVectorizer or a similar tool); 4) ‘batches’ (the BigARTM data format)
- batch_size (int) – number of documents to be stored in each batch
- target_folder (str) – full path to the folder where the generated batches will be stored; if not set, no batches will be produced for further work
- batches (list of str) – if process_in_memory_model is None, a list of batch file names (not full paths); in this case batches, data_path and data_format == ‘batches’ are all required. Otherwise, a list of batches (messages.Batch objects) loaded in memory
- batch_name_type (str) – name batches in natural order (‘code’) or using random guids (‘guid’)
- data_weight (float) – weight for a group of batches from data_path; it can be a list of floats, in which case data_path (and target_folder, if data_format is not ‘batches’) should also be lists; each weight corresponds to one path in the data_path list
- n_wd (array) – matrix with n_wd counters
- vocabulary (dict) – dict with vocabulary, key - index of n_wd, value - token
- gather_dictionary (bool) – whether to create the default dictionary in the vectorizer; automatically set to True if data_format == ‘bow_n_wd’, and to False if data_format == ‘batches’ or data_weight is a list
- class_ids (list of str or str) – list of class_ids or single class_id to parse and include in batches
- process_in_memory_model (artm.ARTM) – ARTM instance that will use this vectorizer; required when batches from disk need to be processed in RAM (only if data_format == ‘batches’). NOTE: this makes the vectorizer model-specific.
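The ‘bow_n_wd’ inputs are the least self-explanatory of the formats above, so here is a minimal pure-Python sketch of how n_wd and vocabulary might be prepared. The helper name build_n_wd and the sample documents are hypothetical, and the tokens-by-documents orientation of n_wd is an assumption rather than something this page states. The BatchVectorizer call appears only in a comment, so the sketch runs without BigARTM installed.

```python
from collections import Counter


def build_n_wd(documents):
    """Return (n_wd, vocabulary) for a list of tokenized documents.

    n_wd is a list of rows, one per token, holding that token's count in
    each document (tokens x documents; this orientation is an assumption).
    vocabulary maps each row index of n_wd to its token.
    """
    tokens = sorted({token for doc in documents for token in doc})
    vocabulary = dict(enumerate(tokens))
    counts = [Counter(doc) for doc in documents]
    # Counter returns 0 for tokens that do not occur in a document.
    n_wd = [[doc_counts[token] for doc_counts in counts] for token in tokens]
    return n_wd, vocabulary


# Hypothetical toy collection of two tokenized documents.
docs = [["topic", "model", "topic"], ["model", "batch"]]
n_wd, vocabulary = build_n_wd(docs)

# With BigARTM and NumPy installed, these would presumably be passed as:
#   import numpy as np
#   import artm
#   bv = artm.BatchVectorizer(data_format='bow_n_wd',
#                             n_wd=np.array(n_wd), vocabulary=vocabulary)
```

In practice the counts would more likely come from CountVectorizer or a similar tool, as the data_format description notes; the hand-rolled counting here only makes the expected shapes visible.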
batch_size
Returns: the user-defined size of the batches
batches_ids
Returns: the list of batch filenames if process_in_memory == False; otherwise, the list of in-memory batch ids
batches_list
Returns: the list of batches if process_in_memory == False; otherwise, the list of in-memory batch ids
data_path
Returns: the disk path of batches
dictionary
Returns: Dictionary object, if parameter gather_dictionary was True, else None
num_batches
Returns: the number of batches
process_in_memory
Returns: whether the vectorizer processes batches in core memory
weights
Returns: list of batches weights
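The read-only attributes above can be gathered into a quick summary, which may help when checking how a vectorizer was configured. The sketch below is illustrative only: summarize_vectorizer is a hypothetical helper, and a SimpleNamespace stub with made-up values stands in for a real artm.BatchVectorizer so that the code runs without BigARTM installed.

```python
from types import SimpleNamespace


def summarize_vectorizer(vectorizer):
    """Collect the documented read-only attributes into a plain dict."""
    return {
        "batch_size": vectorizer.batch_size,
        "num_batches": vectorizer.num_batches,
        "data_path": vectorizer.data_path,
        "in_memory": vectorizer.process_in_memory,
        # dictionary is None when gather_dictionary was False.
        "has_dictionary": vectorizer.dictionary is not None,
        "weights": list(vectorizer.weights),
    }


# Hypothetical stand-in for a real artm.BatchVectorizer; the values are
# made up for illustration and do not come from an actual collection.
stub = SimpleNamespace(
    batch_size=1000,
    num_batches=2,
    data_path="my_batches",
    process_in_memory=False,
    dictionary=None,
    weights=[1.0, 1.0],
)
summary = summarize_vectorizer(stub)
```

Assuming the attribute names match those listed on this page, a real vectorizer could be passed to the same helper in place of the stub.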