Dictionary
This page describes the Dictionary class.
class artm.Dictionary(name=None, dictionary_path=None, data_path=None)
__init__(name=None, dictionary_path=None, data_path=None)
Parameters: - name (str) – name of the dictionary
- dictionary_path (str) – if given, the constructor calls load() with this path
- data_path (str) – if given, the constructor calls gather() with this path
Note: all parameters are optional
create(dictionary_data)
Description: creates a dictionary using a DictionaryData object
Parameters: dictionary_data (DictionaryData instance) – configuration of the dictionary
filter(class_id=None, min_df=None, max_df=None, min_df_rate=None, max_df_rate=None, min_tf=None, max_tf=None, max_dictionary_size=None, recalculate_value=False)
Description: filters the BigARTM dictionary of the collection that was already loaded into the lib
Parameters: - dictionary_name (str) – name of the dictionary in the lib to filter
- dictionary_target_name (str) – name for the new filtered dictionary in the lib
- class_id (str) – class_id to filter
- min_df (float) – min df value to pass the filter
- max_df (float) – max df value to pass the filter
- min_df_rate (float) – min df rate to pass the filter
- max_df_rate (float) – max df rate to pass the filter
- min_tf (float) – min tf value to pass the filter
- max_tf (float) – max tf value to pass the filter
- max_dictionary_size (float) – an easy way to limit the dictionary size; rare tokens are excluded until the dictionary reaches the given size
- recalculate_value (bool) – whether to recalculate the value field in the dictionary after filtration, according to the new sum of tf values
Note: the current dictionary will be replaced with the filtered one
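The thresholds above can be illustrated with a small pure-Python sketch. It does not use BigARTM and is not the library's implementation. The `Entry` record and the `filter_entries` helper are hypothetical names introduced here. The sketch assumes that bounds are inclusive, that a df rate is df divided by the number of documents, that tokens of other modalities pass through when `class_id` is set, and that "rare" means lowest df. The library may differ on any of these points.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for one dictionary entry."""
    token: str
    class_id: str
    tf: float  # total number of occurrences in the collection
    df: float  # number of documents that contain the token


def filter_entries(entries, num_docs, class_id=None, min_df=None, max_df=None,
                   min_df_rate=None, max_df_rate=None, min_tf=None, max_tf=None,
                   max_dictionary_size=None):
    """Return the entries that pass every threshold that was given."""

    def in_range(value, low, high):
        # Assumption: both bounds are inclusive.
        return (low is None or value >= low) and (high is None or value <= high)

    kept = []
    for entry in entries:
        # Assumption: entries of other modalities are left untouched.
        if class_id is not None and entry.class_id != class_id:
            kept.append(entry)
            continue
        if (in_range(entry.df, min_df, max_df)
                and in_range(entry.df / num_docs, min_df_rate, max_df_rate)
                and in_range(entry.tf, min_tf, max_tf)):
            kept.append(entry)

    if max_dictionary_size is not None and len(kept) > max_dictionary_size:
        # Assumption: "rare" means lowest df; keep the most frequent tokens
        # while preserving their original order.
        ranked = sorted(kept, key=lambda e: e.df, reverse=True)
        survivors = {id(e) for e in ranked[:int(max_dictionary_size)]}
        kept = [e for e in kept if id(e) in survivors]
    return kept
```

With ten documents, `min_df=2` drops a token seen in a single document, and `max_df_rate=0.5` drops a token that appears in every document.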
gather(data_path, cooc_file_path=None, vocab_file_path=None, symmetric_cooc_values=False)
Description: creates the BigARTM dictionary of the collection, represented as batches, and loads it into the lib
Parameters: - data_path (str) – full path to batches folder
- cooc_file_path (str) – full path to the file with cooc info. Cooc info is a file with three columns: the first two are the zero-based indices of tokens in the vocab file, and the third is the value of their co-occurrence in the collection (or another pairwise statistic).
- vocab_file_path (str) – full path to the file with the vocabulary. If given, the dictionary tokens will have the same order as in this file; otherwise the order will be random. If given, tokens from batches that are not present in the vocab will be skipped.
- symmetric_cooc_values (bool) – whether the cooc matrix should be considered symmetric
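The three-column cooc layout described above can be read with a few lines of plain Python. This is only a sketch of the file format, not of how BigARTM parses it. The helper name `read_cooc` is hypothetical, columns are assumed to be separated by whitespace, and mirroring each pair when `symmetric` is true is one plausible reading of symmetric_cooc_values.

```python
def read_cooc(lines, vocab, symmetric=False):
    """Map (token, token) pairs to their co-occurrence value.

    lines: iterable of "index index value" strings (assumed whitespace-separated)
    vocab: list of tokens in vocab-file order, so vocab[i] is the token with index i
    """
    cooc = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        first = vocab[int(parts[0])]   # zero-based index into the vocab
        second = vocab[int(parts[1])]
        value = float(parts[2])
        cooc[(first, second)] = value
        if symmetric:
            # Assumption: a symmetric matrix stores the same value both ways.
            cooc[(second, first)] = value
    return cooc
```

For a vocab of `["cat", "dog", "fish"]`, the line `0 1 3.0` records a value of 3.0 for the pair (cat, dog).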
load(dictionary_path)
Description: loads the BigARTM dictionary of the collection into the lib
Parameters: dictionary_path (str) – full filename of the dictionary
load_text(dictionary_path, encoding='utf-8')
Description: loads the BigARTM dictionary of the collection from the disk in the human-readable text format
Parameters: - dictionary_path (str) – full file name of the text dictionary file
- encoding (str) – encoding of the text in the dictionary
save(dictionary_path)
Description: saves the BigARTM dictionary of the collection on the disk
Parameters: dictionary_path (str) – full file name for the dictionary
save_text(dictionary_path, encoding='utf-8')
Description: saves the BigARTM dictionary of the collection on the disk in the human-readable text format
Parameters: - dictionary_path (str) – full file name for the text dictionary file
- encoding (str) – encoding of the text in the dictionary
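A typical sequence of the methods on this page might look like the sketch below: gather a dictionary from batches, filter it, and save it as text for inspection. The wrapper function `build_filtered_dictionary`, its default thresholds, and any paths passed to it are illustrative assumptions. Only the `gather`, `filter` and `save_text` calls come from the signatures documented above.

```python
def build_filtered_dictionary(batches_dir, text_path, min_df=2, max_df_rate=0.5):
    """Gather a dictionary from batches, filter it, and save it as text."""
    # Imported lazily so this module can be loaded without BigARTM installed.
    import artm

    dictionary = artm.Dictionary()
    # Build the dictionary from the folder that holds the batches.
    dictionary.gather(data_path=batches_dir)
    # Thresholds here are illustrative; the current dictionary is replaced
    # with the filtered one.
    dictionary.filter(min_df=min_df, max_df_rate=max_df_rate)
    # Write a human-readable copy (the default encoding is 'utf-8').
    dictionary.save_text(dictionary_path=text_path)
    return dictionary
```

The saved file can later be read back with load_text(), using the same encoding.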