Master Component

This page describes the MasterComponent class.

class artm.MasterComponent(library, topic_names=None, class_ids=None, scores=None, regularizers=None, num_processors=None, pwt_name=None, nwt_name=None, num_document_passes=None, reuse_theta=None, cache_theta=False)
__init__(library, topic_names=None, class_ids=None, scores=None, regularizers=None, num_processors=None, pwt_name=None, nwt_name=None, num_document_passes=None, reuse_theta=None, cache_theta=False)
Parameters:
  • library – an instance of LibArtm
  • topic_names (list of str) – list of topic names to use in model
  • class_ids (dict) – key - class_id, value - class_weight
  • scores (dict) – key - score name, value - config
  • regularizers (dict) – key - regularizer name, value - tuple (config, tau) or triple (config, tau, gamma)
  • num_processors (int) – number of worker threads to use for processing the collection
  • pwt_name (str) – name of pwt matrix
  • nwt_name (str) – name of nwt matrix
  • num_document_passes (int) – number of passes through each document
  • reuse_theta (bool) – reuse Theta from previous iteration or not
  • cache_theta (bool) – save or not the Theta matrix
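
The argument shapes listed above can be illustrated with a small sketch. The helper check_regularizers, the regularizer names, and every value below are hypothetical and not part of the library; they only mirror the documented shapes (a dict of class weights, and a dict mapping each regularizer name to a (config, tau) tuple or a (config, tau, gamma) triple).

```python
def check_regularizers(regularizers):
    """Return names whose value is not a (config, tau) or (config, tau, gamma) tuple."""
    bad = []
    for name, value in regularizers.items():
        if not isinstance(value, tuple) or len(value) not in (2, 3):
            bad.append(name)
    return bad


# Hypothetical values; real configs would be regularizer config objects.
topic_names = ["topic_{}".format(i) for i in range(3)]
class_ids = {"@default_class": 1.0, "@labels": 5.0}  # class_id -> class_weight
regularizers = {
    "sparse_phi": ("<config placeholder>", -0.1),       # (config, tau)
    "decorrelate": ("<config placeholder>", 1e5, 0.5),  # (config, tau, gamma)
}

assert check_regularizers(regularizers) == []

# Keyword arguments matching the constructor signature documented above.
kwargs = dict(
    topic_names=topic_names,
    class_ids=class_ids,
    regularizers=regularizers,
    num_processors=2,
    num_document_passes=10,
    cache_theta=True,
)
# Given a LibArtm instance `lib` (not created here), these would be passed
# as artm.MasterComponent(lib, **kwargs).
```

Checking the dict shapes before constructing the component is only a convenience for catching malformed input early; the library itself may validate differently.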
attach_model(model)
Parameters: model (str) – name of matrix in BigARTM
Returns:
  • messages.TopicModel object with info about Phi matrix
  • numpy.ndarray with Phi data (i.e., p(w|t) values)
clear_score_array_cache()

Clears all entries from score array cache

clear_score_cache()

Clears all entries from score cache

clear_theta_cache()

Clears all entries from theta matrix cache

create_dictionary(dictionary_data, dictionary_name=None)
Parameters:
  • dictionary_data – an instance of DictionaryData with info about dictionary
  • dictionary_name (str) – name of exported dictionary
create_regularizer(name, config, tau, gamma=None)
Parameters:
  • name (str) – the name of the future regularizer
  • config – the config of the future regularizer
  • tau (float) – the coefficient of the regularization
create_score(name, config, model_name=None)
Parameters:
  • name (str) – the name of the future score
  • config – an instance of ***ScoreConfig
export_dictionary(filename, dictionary_name)
Parameters:
  • filename (str) – full name of dictionary file
  • dictionary_name (str) – name of exported dictionary
export_model(model, filename)
filter_dictionary(dictionary_name=None, dictionary_target_name=None, class_id=None, min_df=None, max_df=None, min_df_rate=None, max_df_rate=None, min_tf=None, max_tf=None, args=None)
Parameters:
  • dictionary_name (str) – name of the dictionary in the core to filter
  • dictionary_target_name (str) – name for the new filtered dictionary in the core
  • class_id (str) – class_id to filter
  • min_df (float) – min df value to pass the filter
  • max_df (float) – max df value to pass the filter
  • min_df_rate (float) – min df rate to pass the filter
  • max_df_rate (float) – max df rate to pass the filter
  • min_tf (float) – min tf value to pass the filter
  • max_tf (float) – max tf value to pass the filter
  • args – an instance of FilterDictionaryArgs
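
How the min/max bounds might combine can be sketched in plain Python. This is an assumption-laden illustration, not the library's implementation: the function passes_filter is hypothetical, bounds are treated as inclusive, the df rate is taken to be df divided by the number of documents, and a bound of None is assumed to mean "no constraint".

```python
def passes_filter(tf, df, num_docs,
                  min_df=None, max_df=None,
                  min_df_rate=None, max_df_rate=None,
                  min_tf=None, max_tf=None):
    """Return True if a token with these statistics satisfies every bound that is set."""
    df_rate = df / num_docs  # assumed definition of the df rate
    lower = [(min_df, df), (min_df_rate, df_rate), (min_tf, tf)]
    upper = [(max_df, df), (max_df_rate, df_rate), (max_tf, tf)]
    if any(bound is not None and value < bound for bound, value in lower):
        return False
    if any(bound is not None and value > bound for bound, value in upper):
        return False
    return True


# Hypothetical token statistics: token -> (tf, df), in a 100-document collection.
stats = {"the": (1000, 100), "topic": (40, 12), "zyx": (1, 1)}
kept = [
    token for token, (tf, df) in stats.items()
    if passes_filter(tf, df, num_docs=100, min_df=2, max_df_rate=0.5)
]
print(kept)
```

With these made-up numbers, the very frequent token is dropped by max_df_rate and the rare one by min_df.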
fit_offline(batch_filenames=None, batch_weights=None, num_collection_passes=None, batches_folder=None)
Parameters:
  • batch_filenames (list of str) – names of batches to process
  • batch_weights (list of float) – weights of batches to process
  • num_collection_passes (int) – number of outer iterations
  • batches_folder (str) – folder containing batches to process
fit_online(batch_filenames=None, batch_weights=None, update_after=None, apply_weight=None, decay_weight=None, async=None)
Parameters:
  • batch_filenames (list of str) – names of batches to process
  • batch_weights (list of float) – weights of batches to process
  • update_after (list of int) – number of batches to be passed for Phi synchronizations
  • apply_weight (list of float) – weight of applying new counters (len == len of update_after)
  • decay_weight (list of float) – weight of applying old counters (len == len of update_after)
  • async (bool) – whether to use the async implementation of the EM-algorithm or not
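
The interplay of update_after, apply_weight and decay_weight can be sketched for a single counter. The update rule used here (new value = decay_weight * old value + apply_weight * accumulated increment) and the reading of update_after as cumulative batch counts are assumptions made for illustration; the function online_updates is hypothetical and does not call the library.

```python
def online_updates(batch_counts, update_after, apply_weight, decay_weight, n_wt=0.0):
    """Simulate one (token, topic) counter across Phi synchronizations.

    batch_counts[i] is the increment contributed by batch i. At each
    synchronization point update_after[k], the increments accumulated since
    the previous point are merged into the counter.
    """
    if not (len(update_after) == len(apply_weight) == len(decay_weight)):
        raise ValueError("the three lists must have the same length")
    history = []
    start = 0
    for stop, apply_w, decay_w in zip(update_after, apply_weight, decay_weight):
        increment = sum(batch_counts[start:stop])
        n_wt = decay_w * n_wt + apply_w * increment  # assumed update rule
        history.append(n_wt)
        start = stop
    return history


# Four batches, synchronizing after the 2nd and the 4th.
history = online_updates(
    batch_counts=[1.0, 2.0, 3.0, 4.0],
    update_after=[2, 4],
    apply_weight=[1.0, 0.5],
    decay_weight=[0.0, 0.5],
)
print(history)
```

The length check mirrors the documented constraint that apply_weight and decay_weight have the same length as update_after.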
gather_dictionary(dictionary_target_name=None, data_path=None, cooc_file_path=None, vocab_file_path=None, symmetric_cooc_values=None, args=None)
Parameters:
  • dictionary_target_name (str) – name of the dictionary in the core
  • data_path (str) – full path to batches folder
  • cooc_file_path (str) – full path to the file with cooc info
  • vocab_file_path (str) – full path to the file with vocabulary
  • symmetric_cooc_values (bool) – whether the cooc matrix should be considered symmetric or not
  • args – an instance of GatherDictionaryArgs
get_dictionary(dictionary_name)
Parameters: dictionary_name (str) – name of dictionary to get
get_info()
get_phi_info(model)
Parameters: model (str) – name of matrix in BigARTM
Returns: messages.TopicModel object
get_phi_matrix(model, topic_names=None, class_ids=None, use_sparse_format=None)
Parameters:
  • model (str) – name of matrix in BigARTM
  • topic_names (list of str or None) – list of topics to retrieve (None means all topics)
  • class_ids (list of str or None) – list of class ids to retrieve (None means all class ids)
  • use_sparse_format (bool) – use sparse layout rather than dense
Returns:

numpy.ndarray with Phi data (i.e., p(w|t) values)

get_score(score_name)
Parameters:
  • score_name (str) – the user defined name of score to retrieve
  • score_config – reference to score data object
get_score_array(score_name)
Parameters:
  • score_name (str) – the user defined name of score to retrieve
  • score_config – reference to score data object
get_theta_info()
Returns: messages.ThetaMatrix object
get_theta_matrix(topic_names=None)
Parameters: topic_names (list of str or None) – list of topics to retrieve (None means all topics)
Returns: numpy.ndarray with Theta data (i.e., p(t|d) values)
import_dictionary(filename, dictionary_name)
Parameters:
  • filename (str) – full name of dictionary file
  • dictionary_name (str) – name of imported dictionary
import_model(model, filename)
Parameters:
  • model (str) – name of matrix in BigARTM
  • filename (str) – the name of file to load model from binary format
initialize_model(model_name=None, topic_names=None, dictionary_name=None, seed=None, args=None)
Parameters:
  • model_name (str) – name of pwt matrix in BigARTM
  • topic_names (list of str) – the list of names of topics to be used in model
  • dictionary_name (str) – name of imported dictionary
  • seed (unsigned int or -1, default None) – seed for random initialization, None means no seed
  • args – an instance of InitializeModelArgs
merge_model(models, nwt, topic_names=None)

Merge multiple nwt-increments together.

Parameters:
  • models (dict) – models with nwt-increments and their weights, key - nwt_source_name, value - source_weight.
  • nwt (str) – the name of target matrix to store combined nwt. The matrix will be created by this operation.
  • topic_names (list of str) – names of topics in the resulting model. By default topic names are taken from the first model in the list.
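
Merging nwt-increments with weights can be pictured as a weighted sum of matrices. The sketch below assumes exactly that (the source does not spell out the formula) and uses a hypothetical helper merge_nwt working on plain dicts rather than on matrices stored in the BigARTM core.

```python
def merge_nwt(models, matrices):
    """Combine nwt-increments as a weighted sum.

    models:   {nwt_source_name: source_weight}
    matrices: {nwt_source_name: {token: [count per topic]}}
    """
    merged = {}
    for name, weight in models.items():
        for token, counts in matrices[name].items():
            row = merged.setdefault(token, [0.0] * len(counts))
            for t, value in enumerate(counts):
                row[t] += weight * value
    return merged


# Two hypothetical increments over the same two tokens and two topics.
matrices = {
    "nwt_a": {"cat": [1.0, 0.0], "dog": [0.0, 2.0]},
    "nwt_b": {"cat": [3.0, 1.0], "dog": [1.0, 0.0]},
}
merged = merge_nwt({"nwt_a": 1.0, "nwt_b": 0.5}, matrices)
print(merged)
```

The dict passed as models has the same key/value layout as the models argument described above.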
normalize_model(pwt, nwt, rwt=None)
Parameters:
  • pwt (str) – name of pwt matrix in BigARTM
  • nwt (str) – name of nwt matrix in BigARTM
  • rwt (str) – name of rwt matrix in BigARTM
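
As a rough picture of what normalization produces, the sketch below turns counters into a column-stochastic p(w|t) matrix. It assumes the usual ARTM M-step form, p(w|t) proportional to max(n_wt + r_wt, 0), which the source does not state explicitly; the function normalize is hypothetical and works on nested lists instead of matrices in the BigARTM core.

```python
def normalize(nwt, rwt=None):
    """Return pwt from nwt (and optional rwt); rows are tokens, columns are topics."""
    num_topics = len(nwt[0])
    if rwt is None:
        rwt = [[0.0] * num_topics for _ in nwt]
    # Add the regularizer term and clip negative values to zero.
    clipped = [
        [max(n + r, 0.0) for n, r in zip(n_row, r_row)]
        for n_row, r_row in zip(nwt, rwt)
    ]
    # Normalize each topic column so that it sums to one over tokens.
    totals = [sum(row[t] for row in clipped) for t in range(num_topics)]
    return [
        [row[t] / totals[t] if totals[t] > 0 else 0.0 for t in range(num_topics)]
        for row in clipped
    ]


nwt = [[3.0, 1.0], [1.0, 1.0]]
rwt = [[0.0, -2.0], [0.0, 0.0]]  # hypothetical regularizer contribution
pwt = normalize(nwt, rwt)
print(pwt)
```

In this toy input the negative regularizer term pushes one entry below zero, so it is clipped and the remaining mass in that topic goes to the other token.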
process_batches(pwt, nwt=None, num_document_passes=None, batches_folder=None, batches=None, regularizer_name=None, regularizer_tau=None, class_ids=None, class_weights=None, find_theta=False, reuse_theta=False, find_ptdw=False, predict_class_id=None)
Parameters:
  • pwt (str) – name of pwt matrix in BigARTM
  • nwt (str) – name of nwt matrix in BigARTM
  • num_document_passes (int) – number of inner iterations during processing
  • batches_folder (str) – full path to data folder (alternative 1)
  • batches (list of str) – full file names of batches to process (alternative 2)
  • regularizer_name (list of str) – list of names of Theta regularizers to use
  • regularizer_tau (list of float) – list of tau coefficients for Theta regularizers
  • class_ids (list of str) – list of class ids to use during processing
  • class_weights (list of float) – list of corresponding weights of class ids
  • find_theta (bool) – find theta matrix for ‘batches’ (if alternative 2)
  • reuse_theta (bool) – initialize by theta from previous collection pass
  • find_ptdw (bool) – calculate and return Ptdw matrix or not (works if find_theta == False)
  • predict_class_id (str, default None) – class_id of a target modality to predict
Returns:

  • tuple (messages.ThetaMatrix, numpy.ndarray) — the info about Theta (if find_theta == True)
  • messages.ThetaMatrix — the info about Theta (if find_theta == False)
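
The role of num_document_passes (inner iterations per document) can be sketched with a plain PLSA-style update for one document. This is a simplified stand-in, not BigARTM's processing: it ignores regularizers, class weights and Theta reuse, and the function infer_theta is hypothetical.

```python
def infer_theta(doc_counts, phi, num_document_passes):
    """Estimate p(t|d) for one document given a fixed phi[w][t] = p(w|t).

    doc_counts maps a token index to its count in the document.
    """
    num_topics = len(phi[0])
    theta = [1.0 / num_topics] * num_topics  # uniform start
    for _ in range(num_document_passes):
        n_td = [0.0] * num_topics
        for w, n_dw in doc_counts.items():
            z = sum(phi[w][t] * theta[t] for t in range(num_topics))
            if z == 0.0:
                continue  # token has no support under the current theta
            for t in range(num_topics):
                n_td[t] += n_dw * phi[w][t] * theta[t] / z
        total = sum(n_td)
        if total == 0.0:
            break  # nothing to normalize; keep the previous theta
        theta = [value / total for value in n_td]
    return theta


# Two tokens, two topics; the document mostly uses token 0.
phi = [[0.9, 0.1], [0.1, 0.9]]
theta = infer_theta({0: 8, 1: 2}, phi, num_document_passes=10)
print(theta)
```

More passes let theta move further from its uniform starting point toward the topic that best explains the document's tokens.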

reconfigure(topic_names=None, class_ids=None, scores=None, regularizers=None, num_processors=None, pwt_name=None, nwt_name=None, num_document_passes=None, reuse_theta=None, cache_theta=None)
reconfigure_regularizer(name, config=None, tau=None, gamma=None)
reconfigure_score(name, config)
regularize_model(pwt, nwt, rwt, regularizer_name, regularizer_tau, regularizer_gamma=None)
Parameters:
  • pwt (str) – name of pwt matrix in BigARTM
  • nwt (str) – name of nwt matrix in BigARTM
  • rwt (str) – name of rwt matrix in BigARTM
  • regularizer_name (list of str) – list of names of Phi regularizers to use
  • regularizer_tau (list of double) – list of tau coefficients for Phi regularizers
transform(batches=None, batch_filenames=None, theta_matrix_type=None, predict_class_id=None)
Parameters:
  • batches – list of Batch instances
  • batch_filenames (list of str) – names of batches to transform
  • theta_matrix_type (int) – type of matrix to be returned
  • predict_class_id (str) – class_id of a target modality to predict
Returns:

messages.ThetaMatrix object