Batches Utils

This page describes the BatchVectorizer class.

class artm.BatchVectorizer(batches=None, collection_name=None, data_path='', data_format='batches', target_folder=None, batch_size=1000, batch_name_type='code', data_weight=1.0, n_wd=None, vocabulary=None, gather_dictionary=True, class_ids=None, process_in_memory_model=None)

__init__(batches=None, collection_name=None, data_path='', data_format='batches', target_folder=None, batch_size=1000, batch_name_type='code', data_weight=1.0, n_wd=None, vocabulary=None, gather_dictionary=True, class_ids=None, process_in_memory_model=None)

Parameters:

collection_name (str) – the name of text collection (required if data_format == ‘bow_uci’)
data_path (str) – 1) if data_format == ‘bow_uci’ => folder containing the ‘docword.collection_name.txt’ and ‘vocab.collection_name.txt’ files; 2) if data_format == ‘vowpal_wabbit’ => file in Vowpal Wabbit format; 3) if data_format == ‘bow_n_wd’ => ignored; 4) if data_format == ‘batches’ => folder containing batches
data_format (str) – the type of input data: 1) ‘bow_uci’ - Bag-Of-Words in UCI format; 2) ‘vowpal_wabbit’ - Vowpal Wabbit format; 3) ‘bow_n_wd’ - result of CountVectorizer or a similar tool; 4) ‘batches’ - the BigARTM data format
batch_size (int) – number of documents to be stored in each batch
target_folder (str) – full path to the folder in which new batches will be stored; if not set, no batches will be produced for further work
batches (list of str) – if process_in_memory_model is None: list of batch file names without the folder path (in this case the necessary parameters are batches + data_path + data_format == ‘batches’); otherwise: list of batches (messages.Batch objects) loaded in memory
batch_name_type (str) – name batches in natural order (‘code’) or using random guids (‘guid’)
data_weight (float) – weight for a group of batches from data_path; it can be a list of floats, in which case data_path (and target_folder if data_format != ‘batches’) should also be lists; one weight corresponds to one path from the data_path list
n_wd (array) – matrix with n_wd counters
vocabulary (dict) – dict with vocabulary, key - index of n_wd, value - token
gather_dictionary (bool) – whether to create the default dictionary in the vectorizer; automatically set to True if data_format == ‘bow_n_wd’, and automatically set to False if data_format == ‘batches’ or data_weight is a list
class_ids (list of str or str) – list of class_ids or single class_id to parse and include in batches
process_in_memory_model (artm.ARTM) – ARTM instance that will use this vectorizer; required when batches from disk must be processed in RAM (only if data_format == ‘batches’). NOTE: this makes the vectorizer model-specific.
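
The relationship between n_wd and vocabulary can be sketched on a toy corpus. This is a minimal illustration, not the library's own code: the corpus, the helper name make_vectorizer and the matrix orientation (one row per token, one column per document) are assumptions made for the example. Only the parameter names data_format, n_wd and vocabulary come from the list above.

```python
# Toy corpus: each document is a list of tokens (illustrative data only).
docs = [
    ["topic", "model", "topic"],
    ["model", "batch"],
]

# vocabulary maps an index of n_wd to its token, as described in the parameter list.
tokens = sorted({tok for doc in docs for tok in doc})
vocabulary = {i: tok for i, tok in enumerate(tokens)}
index_of = {tok: i for i, tok in vocabulary.items()}

# Assumption: n_wd has one row per token and one column per document.
n_wd = [[0] * len(docs) for _ in tokens]
for d, doc in enumerate(docs):
    for tok in doc:
        n_wd[index_of[tok]][d] += 1


def make_vectorizer(n_wd, vocabulary):
    """Hypothetical helper: build a vectorizer from in-memory counters.

    Imports are local so the counting code above runs without BigARTM installed.
    """
    import numpy as np
    import artm

    return artm.BatchVectorizer(
        data_format="bow_n_wd",
        n_wd=np.array(n_wd),
        vocabulary=vocabulary,
    )
```

With BigARTM and NumPy installed, make_vectorizer(n_wd, vocabulary) would pass the counters to the constructor; according to the gather_dictionary description, the default dictionary is then created automatically.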

batch_size

Returns:	the user-defined size of the batches

batches_ids

Returns:	list of batch filenames if process_in_memory == False, otherwise the list of in-memory batch ids

batches_list

Returns:	list of batches if process_in_memory == False, otherwise the list of in-memory batch ids

data_path

Returns:	the disk path of batches

dictionary

Returns:	Dictionary object, if parameter gather_dictionary was True, else None
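
Whether this property returns a Dictionary or None depends on the overrides listed for gather_dictionary. The rule can be restated as a small function. This is a hypothetical helper written for illustration, not part of the artm package, and the order in which the two overrides are checked is an assumption, since the parameter description does not say which one wins when both apply.

```python
def effective_gather_dictionary(requested, data_format, data_weight=1.0):
    """Hypothetical helper restating the documented gather_dictionary overrides."""
    # Assumption: the "forced to False" cases are checked before the "forced to True" case.
    if data_format == "batches" or isinstance(data_weight, list):
        return False
    if data_format == "bow_n_wd":
        return True
    # No override applies, so the user's choice is kept.
    return requested


cases = {
    "batches": effective_gather_dictionary(True, "batches"),
    "bow_n_wd": effective_gather_dictionary(False, "bow_n_wd"),
    "weighted": effective_gather_dictionary(True, "vowpal_wabbit", [1.0, 2.0]),
    "bow_uci": effective_gather_dictionary(True, "bow_uci"),
}
```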

num_batches

Returns:	the number of batches

process_in_memory

Returns:	whether the vectorizer processes batches in core memory

weights

Returns:	list of batches weights
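
The properties above can be read together to inspect a vectorizer after construction. The sketch below assumes only the property names documented on this page. The function summarize_vectorizer and the stand-in object with made-up values are hypothetical and exist so that the example runs without BigARTM installed.

```python
from types import SimpleNamespace


def summarize_vectorizer(bv):
    """Collect the documented read-only properties of a vectorizer into a dict."""
    return {
        "batch_size": bv.batch_size,
        "num_batches": bv.num_batches,
        "data_path": bv.data_path,
        "process_in_memory": bv.process_in_memory,
        "weights": list(bv.weights),
        "has_dictionary": bv.dictionary is not None,
    }


# Stand-in object with made-up values; a real artm.BatchVectorizer
# exposes the same property names according to this page.
stub = SimpleNamespace(
    batch_size=1000,
    num_batches=3,
    data_path="my_batches",
    process_in_memory=False,
    weights=[1.0, 1.0, 1.0],
    dictionary=None,
)
summary = summarize_vectorizer(stub)
```

Passing a real BatchVectorizer instead of the stub should produce the same keys, with has_dictionary reflecting whether gather_dictionary was in effect.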