Dictionary

This page describes the Dictionary class.

class artm.Dictionary(name=None, dictionary_path=None, data_path=None)
__init__(name=None, dictionary_path=None, data_path=None)
Parameters:
  • name (str) – name of the dictionary
  • dictionary_path (str) – if given, the constructor calls load() with this path
  • data_path (str) – if given, the constructor calls gather() with this path

Note: all parameters are optional
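
As a minimal sketch of the constructor shortcut described above, the two helpers below should leave you with a loaded dictionary either way. The helper names and the idea of wrapping the calls in functions are illustrative assumptions; only artm.Dictionary and load() come from this page, and the import is done lazily so the snippet can be read without BigARTM installed.

```python
def load_via_constructor(path):
    """Let the constructor call load() for us (path is a saved dictionary file)."""
    import artm  # lazy import: requires BigARTM only when the function is called

    return artm.Dictionary(dictionary_path=path)


def load_explicitly(path):
    """Create an empty dictionary first, then call load() by hand."""
    import artm

    dictionary = artm.Dictionary()
    dictionary.load(dictionary_path=path)
    return dictionary
```

Passing data_path instead of dictionary_path follows the same pattern with gather() in place of load().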

create(dictionary_data)
Description: creates a dictionary from a DictionaryData object
Parameters: dictionary_data (DictionaryData instance) – configuration of the dictionary
filter(class_id=None, min_df=None, max_df=None, min_df_rate=None, max_df_rate=None, min_tf=None, max_tf=None, max_dictionary_size=None, recalculate_value=False)
Description:

filters the BigARTM dictionary of the collection, which was already loaded into the lib

Parameters:
  • dictionary_name (str) – name of the dictionary in the lib to filter
  • dictionary_target_name (str) – name for the new filtered dictionary in the lib
  • class_id (str) – class_id to filter
  • min_df (float) – min df value to pass the filter
  • max_df (float) – max df value to pass the filter
  • min_df_rate (float) – min df rate to pass the filter
  • max_df_rate (float) – max df rate to pass the filter
  • min_tf (float) – min tf value to pass the filter
  • max_tf (float) – max tf value to pass the filter
  • max_dictionary_size (float) – upper limit on the dictionary size; rare tokens are excluded until the dictionary reaches the given size
  • recalculate_value (bool) – whether to recalculate the value field in the dictionary after filtration, according to the new sum of tf values
Note:

the current dictionary will be replaced with the filtered one
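
To make the thresholds concrete, here is a pure-Python sketch of how such a filter could behave. It is not BigARTM's implementation: the TokenStats type, the toy counts, the inclusive bounds, the definition of df rate as df divided by the number of documents, and ranking "rare" tokens by df are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TokenStats:
    tf: float  # term frequency: total occurrences in the collection
    df: float  # document frequency: number of documents containing the token


def filter_tokens(stats, num_docs, min_df=None, max_df=None,
                  min_df_rate=None, max_df_rate=None,
                  min_tf=None, max_tf=None, max_dictionary_size=None):
    """Return the tokens that pass every threshold that was given."""

    def in_range(value, low, high):
        # A bound of None means "no limit"; bounds are inclusive (assumption).
        return (low is None or value >= low) and (high is None or value <= high)

    kept = {
        token: s for token, s in stats.items()
        if in_range(s.df, min_df, max_df)
        and in_range(s.df / num_docs, min_df_rate, max_df_rate)
        and in_range(s.tf, min_tf, max_tf)
    }
    if max_dictionary_size is not None and len(kept) > max_dictionary_size:
        # Drop the rarest tokens first; rarity is measured by df here (assumption).
        ranked = sorted(kept, key=lambda t: kept[t].df, reverse=True)
        kept = {t: kept[t] for t in ranked[:int(max_dictionary_size)]}
    return kept


# Invented toy statistics for a 10-document collection.
stats = {
    "the": TokenStats(tf=100, df=10),
    "topic": TokenStats(tf=12, df=4),
    "model": TokenStats(tf=9, df=3),
    "zyzzyva": TokenStats(tf=1, df=1),
}
```

With these toy counts, filter_tokens(stats, 10, min_df=2, max_df_rate=0.5) keeps "topic" and "model": "the" appears in every document and "zyzzyva" in only one.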

gather(data_path, cooc_file_path=None, vocab_file_path=None, symmetric_cooc_values=False)
Description:

creates the BigARTM dictionary of the collection, represented as batches, and loads it into the lib

Parameters:
  • data_path (str) – full path to batches folder
  • cooc_file_path (str) – full path to the file with cooc info. Cooc info is a file with three columns: the first two are the zero-based indices of tokens in the vocab file, and the third is the value of their co-occurrence in the collection (or another pairwise statistic).
  • vocab_file_path (str) – full path to the file with the vocabulary. If given, the dictionary tokens will have the same order as in this file; otherwise the order will be random. If given, tokens from batches that are not present in the vocab will be skipped.
  • symmetric_cooc_values (bool) – whether the cooc matrix should be considered symmetric
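
The three-column cooc layout described above can be illustrated with a small reader. This is only a sketch: the function name, the whitespace delimiter, the one-token-per-line vocab layout, and the reading of "symmetric" as mirroring each (i, j) entry to (j, i) are assumptions, not behaviour confirmed by this page.

```python
def read_cooc(vocab_lines, cooc_lines, symmetric=False):
    """Map (token, token) pairs to their co-occurrence value."""
    # Assumption: the token is the first whitespace-separated field of each vocab line.
    vocab = [line.split()[0] for line in vocab_lines if line.strip()]

    cooc = {}
    for line in cooc_lines:
        if not line.strip():
            continue  # skip blank lines
        first, second, value = line.split()
        # The first two columns are zero-based indices into the vocab file.
        pair = (vocab[int(first)], vocab[int(second)])
        cooc[pair] = float(value)
        if symmetric:
            cooc[pair[::-1]] = float(value)  # mirror the entry (assumption)
    return cooc


# Invented toy files, shown inline as strings.
vocab_text = "apple\nbanana\ncherry\n"
cooc_text = "0 1 3\n0 2 1.5\n"

pairs = read_cooc(vocab_text.splitlines(), cooc_text.splitlines())
mirrored = read_cooc(vocab_text.splitlines(), cooc_text.splitlines(), symmetric=True)
```

Here pairs holds two entries, ("apple", "banana") and ("apple", "cherry"), while mirrored also contains the reversed pairs with the same values.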
load(dictionary_path)
Description: loads the BigARTM dictionary of the collection into the lib
Parameters: dictionary_path (str) – full filename of the dictionary
load_text(dictionary_path, encoding='utf-8')
Description:

loads the BigARTM dictionary of the collection from the disk in the human-readable text format

Parameters:
  • dictionary_path (str) – full file name of the text dictionary file
  • encoding (str) – the encoding of the text in the dictionary
save(dictionary_path)
Description: saves the BigARTM dictionary of the collection on the disk
Parameters: dictionary_path (str) – full file name for the dictionary
save_text(dictionary_path, encoding='utf-8')
Description:

saves the BigARTM dictionary of the collection on the disk in the human-readable text format

Parameters:
  • dictionary_path (str) – full file name for the text dictionary file
  • encoding (str) – the encoding of the text in the dictionary
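
The methods on this page can be chained into a typical workflow. The sketch below gathers a dictionary from batches, filters it, writes it out as text, and reads it back. The helper name, its arguments, and the default min_df value are hypothetical; the method names and keyword arguments are taken from the signatures above, and the import is lazy so the snippet does not require BigARTM until the function is called.

```python
def build_text_dictionary(batches_dir, text_path, min_df=2):
    """Gather, filter, save as text, and reload a dictionary (illustrative helper)."""
    import artm  # lazy import: requires BigARTM only when the function is called

    dictionary = artm.Dictionary()
    dictionary.gather(data_path=batches_dir)  # build the dictionary from batches
    dictionary.filter(min_df=min_df)          # replaces the dictionary with the filtered one
    dictionary.save_text(dictionary_path=text_path, encoding='utf-8')

    # Read the human-readable file back into a fresh dictionary.
    reloaded = artm.Dictionary()
    reloaded.load_text(dictionary_path=text_path, encoding='utf-8')
    return reloaded
```

The text format is convenient for inspecting or hand-editing tokens between the save_text() and load_text() steps; save() and load() follow the same pattern.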