Master Component
This page describes the MasterComponent class.
-
class
artm.
MasterComponent
(library=None, topic_names=None, class_ids=None, scores=None, regularizers=None, num_processors=None, pwt_name=None, nwt_name=None, num_document_passes=None, reuse_theta=None, cache_theta=False, config=None, master_id=None)¶ -
__init__
(library=None, topic_names=None, class_ids=None, scores=None, regularizers=None, num_processors=None, pwt_name=None, nwt_name=None, num_document_passes=None, reuse_theta=None, cache_theta=False, config=None, master_id=None)¶ Parameters: - library – an instance of LibArtm
- topic_names (list of str) – list of topic names to use in model
- class_ids (dict) – key - class_id, value - class_weight
- scores (dict) – key - score name, value - config
- regularizers (dict) – key - regularizer name, value - tuple (config, tau) or triple (config, tau, gamma)
- num_processors (int) – number of worker threads to use for processing the collection
- pwt_name (str) – name of pwt matrix
- nwt_name (str) – name of nwt matrix
- num_document_passes (int) – number of passes through each document
- reuse_theta (bool) – whether to reuse Theta from the previous iteration
- cache_theta (bool) – whether to save the Theta matrix
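The dictionary-valued arguments above are easy to get wrong, so the following sketch shows only their shapes. It does not call BigARTM: `split_regularizer_entry` is a hypothetical helper, the config values are placeholder strings standing in for real config objects, and the class ids and weights are made up for illustration.

```python
def split_regularizer_entry(entry):
    """Return (config, tau, gamma) from a (config, tau) pair or a (config, tau, gamma) triple."""
    if len(entry) == 2:
        config, tau = entry
        return config, tau, None  # gamma is left unset for a pair
    if len(entry) == 3:
        return tuple(entry)
    raise ValueError("expected (config, tau) or (config, tau, gamma)")


# key: regularizer name, value: (config, tau) or (config, tau, gamma).
# The strings below are placeholders, not real regularizer configs.
regularizers = {
    "sparse_phi": ("<config placeholder>", -0.1),
    "decorrelate": ("<config placeholder>", 1000.0, 0.5),
}

# key: class_id, value: class_weight (illustrative names and weights).
class_ids = {"@default_class": 1.0, "@labels": 5.0}

unpacked = {name: split_regularizer_entry(entry)
            for name, entry in regularizers.items()}
```

Both forms end up as a triple, with `gamma` set to `None` when only `(config, tau)` was given.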
-
attach_model
(model)¶ Parameters: model (str) – name of matrix in BigARTM Returns: - messages.TopicModel object with info about Phi matrix
- numpy.ndarray with Phi data (i.e., p(w|t) values)
-
clear_score_array_cache
()¶ Clears all entries from score array cache
-
clear_score_cache
()¶ Clears all entries from score cache
-
clear_theta_cache
()¶ Clears all entries from theta matrix cache
-
create_dictionary
(dictionary_data, dictionary_name=None)¶ Parameters: - dictionary_data – an instance of DictionaryData with info about dictionary
- dictionary_name (str) – name of exported dictionary
-
create_regularizer
(name, config, tau, gamma=None)¶ Parameters: - name (str) – the name of the future regularizer
- config – the config of the future regularizer
- tau (float) – the coefficient of the regularization
-
create_score
(name, config, model_name=None)¶ Parameters: - name (str) – the name of the future score
- config – an instance of ***ScoreConfig
- model_name – pwt or nwt model name
-
export_dictionary
(filename, dictionary_name)¶ Parameters: - filename (str) – full name of dictionary file
- dictionary_name (str) – name of exported dictionary
-
export_model
(model, filename)¶ Parameters: - model (str) – name of matrix in BigARTM
- filename (str) – the name of file to save model into binary format
-
export_score_tracker
(filename)¶ Parameters: filename (str) – the name of file to save score tracker into binary format
-
filter_dictionary
(dictionary_name=None, dictionary_target_name=None, class_id=None, min_df=None, max_df=None, min_df_rate=None, max_df_rate=None, min_tf=None, max_tf=None, max_dictionary_size=None, recalculate_value=None, args=None)¶ Parameters: - dictionary_name (str) – name of the dictionary in the core to filter
- dictionary_target_name (str) – name for the new filtered dictionary in the core
- class_id (str) – class_id to filter
- min_df (float) – min df value to pass the filter
- max_df (float) – max df value to pass the filter
- min_df_rate (float) – min df rate to pass the filter
- max_df_rate (float) – max df rate to pass the filter
- min_tf (float) – min tf value to pass the filter
- max_tf (float) – max tf value to pass the filter
- max_dictionary_size (float) – an easy way to limit dictionary size; rare tokens are excluded until the dictionary reaches the given size.
- recalculate_value (bool) – whether to recalculate the value field in the dictionary after filtering, according to the new sum of tf values
- args – an instance of FilterDictionaryArgs
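As a rough illustration of how the df/tf thresholds and `max_dictionary_size` might interact, here is a pure-Python sketch. It is not the library's implementation: `filter_tokens` is a hypothetical function, bounds are assumed to be inclusive, and "rare" is assumed to mean low document frequency.

```python
def filter_tokens(tokens, min_df=None, max_df=None, min_tf=None, max_tf=None,
                  max_dictionary_size=None):
    """tokens: dict token -> {'df': number, 'tf': number}. Returns kept token names."""
    def in_range(value, low, high):
        # A bound of None means "no constraint"; bounds are inclusive (assumption).
        return (low is None or value >= low) and (high is None or value <= high)

    kept = [token for token, stats in tokens.items()
            if in_range(stats["df"], min_df, max_df)
            and in_range(stats["tf"], min_tf, max_tf)]

    if max_dictionary_size is not None and len(kept) > max_dictionary_size:
        # Assumption: rarity is measured by document frequency,
        # so the most frequent tokens are the ones that survive.
        kept.sort(key=lambda token: tokens[token]["df"], reverse=True)
        kept = kept[:int(max_dictionary_size)]
    return sorted(kept)


stats = {
    "the": {"df": 100, "tf": 900},
    "topic": {"df": 40, "tf": 120},
    "model": {"df": 35, "tf": 90},
    "zyx": {"df": 1, "tf": 1},
}

by_df = filter_tokens(stats, min_df=2, max_df=50)       # drops "the" and "zyx"
by_size = filter_tokens(stats, max_dictionary_size=2)   # keeps the two most frequent
```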
-
fit_offline
(batch_filenames=None, batch_weights=None, num_collection_passes=None, batches_folder=None)¶ Parameters: - batch_filenames (list of str) – name of batches to process
- batch_weights (list of float) – weights of batches to process
- num_collection_passes (int) – number of outer iterations
- batches_folder (str) – folder containing batches to process
-
fit_online
(batch_filenames=None, batch_weights=None, update_after=None, apply_weight=None, decay_weight=None, async=None)¶ Parameters: - batch_filenames (list of str) – name of batches to process
- batch_weights (list of float) – weights of batches to process
- update_after (list of int) – number of batches to be passed for Phi synchronizations
- apply_weight (list of float) – weight of applying new counters (len == len of update_after)
- decay_weight (list of float) – weight of applying old counters (len == len of update_after)
- async (bool) – whether to use the async implementation of the EM-algorithm or not
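The relationship between `update_after`, `apply_weight` and `decay_weight` can be sketched on a single counter. This is a simplified model, not BigARTM code: `online_updates` is a hypothetical function, `update_after` is assumed to hold cumulative batch counts, and each synchronization is assumed to combine old and new counters as `decay_weight * old + apply_weight * new`.

```python
def online_updates(batch_counters, update_after, apply_weight, decay_weight):
    """Simulate the synchronization schedule on one scalar counter.

    batch_counters: one number per batch (stands in for a batch's n_wt increment).
    update_after:   cumulative batch counts at which a synchronization happens.
    """
    n = 0.0       # accumulated counter, plays the role of one n_wt cell
    start = 0     # index of the first batch not yet merged
    history = []
    for stop, apply_w, decay_w in zip(update_after, apply_weight, decay_weight):
        new_counters = sum(batch_counters[start:stop])
        n = decay_w * n + apply_w * new_counters
        history.append(n)
        start = stop
    return history


history = online_updates(
    batch_counters=[1, 1, 3, 3, 5, 5],
    update_after=[2, 4, 6],          # synchronize after batches 2, 4 and 6
    apply_weight=[1.0, 0.5, 0.5],
    decay_weight=[0.0, 0.5, 0.5],
)
```

All three lists have the same length, matching the `len == len of update_after` note in the parameter descriptions.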
-
gather_dictionary
(dictionary_target_name=None, data_path=None, cooc_file_path=None, vocab_file_path=None, symmetric_cooc_values=None, args=None)¶ Parameters: - dictionary_target_name (str) – name of the dictionary in the core
- data_path (str) – full path to batches folder
- cooc_file_path (str) – full path to the file with cooc info
- vocab_file_path (str) – full path to the file with vocabulary
- symmetric_cooc_values (bool) – whether the cooc matrix should be considered symmetric or not
- args – an instance of GatherDictionaryArgs
-
get_dictionary
(dictionary_name)¶ Parameters: dictionary_name (str) – name of dictionary to get
-
get_info
()¶
-
get_phi_info
(model)¶ Parameters: model (str) – name of matrix in BigARTM Returns: messages.TopicModel object
-
get_phi_matrix
(model, topic_names=None, class_ids=None, use_sparse_format=None)¶ Parameters: - model (str) – name of matrix in BigARTM
- topic_names (list of str or None) – list of topics to retrieve (None means all topics)
- class_ids (list of str or None) – list of class ids to retrieve (None means all class ids)
- use_sparse_format (bool) – use sparse rather than dense layout
Returns: numpy.ndarray with Phi data (i.e., p(w|t) values)
-
get_score
(score_name)¶ Parameters: - score_name (str) – the user defined name of score to retrieve
- score_config – reference to score data object
-
get_score_array
(score_name)¶ Parameters: - score_name (str) – the user defined name of score to retrieve
- score_config – reference to score data object
-
get_theta_info
()¶ Returns: messages.ThetaMatrix object
-
get_theta_matrix
(topic_names=None)¶ Parameters: topic_names (list of str or None) – list of topics to retrieve (None means all topics) Returns: numpy.ndarray with Theta data (i.e., p(t|d) values)
-
import_batches
(batches=None)¶ Parameters: batches (list) – list of BigARTM batches loaded into RAM
-
import_dictionary
(filename, dictionary_name)¶ Parameters: - filename (str) – full name of dictionary file
- dictionary_name (str) – name of imported dictionary
-
import_model
(model, filename)¶ Parameters: - model (str) – name of matrix in BigARTM
- filename (str) – the name of file to load model from binary format
-
import_score_tracker
(filename)¶ Parameters: filename (str) – the name of file to load score tracker from binary format
-
initialize_model
(model_name=None, topic_names=None, dictionary_name=None, seed=None, args=None)¶ Parameters: - model_name (str) – name of pwt matrix in BigARTM
- topic_names (list of str) – the list of names of topics to be used in model
- dictionary_name (str) – name of imported dictionary
- seed (unsigned int or -1, default None) – seed for random initialization, None means no seed
- args – an instance of InitializeModelArgs
-
merge_model
(models, nwt, topic_names=None, dictionary_name=None)¶ Merge multiple nwt-increments together.
Parameters: - models (dict) – list of models with nwt-increments and their weights, key - nwt_source_name, value - source_weight.
- nwt (str) – the name of target matrix to store combined nwt. The matrix will be created by this operation.
- topic_names (list of str) – names of topics in the resulting model. By default topic names are taken from the first model in the list.
- dictionary_name – name of dictionary that defines which tokens to include in merged model
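Merging nwt-increments can be pictured as a weighted sum of counter matrices. The sketch below uses plain dictionaries instead of BigARTM matrices. `merge_counters` is a hypothetical function, tokens missing from a source are assumed to contribute zero, and the `topic_names` and `dictionary_name` options are not modeled.

```python
def merge_counters(models, sources):
    """Weighted sum of counter matrices.

    models:  dict nwt_source_name -> source_weight
    sources: dict nwt_source_name -> {token: [n_wt for each topic]}
    """
    merged = {}
    for name, weight in models.items():
        for token, row in sources[name].items():
            target = merged.setdefault(token, [0.0] * len(row))
            for topic_index, value in enumerate(row):
                target[topic_index] += weight * value
    return merged


sources = {
    "nwt_a": {"cat": [1.0, 0.0], "dog": [0.0, 2.0]},
    "nwt_b": {"cat": [3.0, 1.0]},
}
merged = merge_counters({"nwt_a": 1.0, "nwt_b": 0.5}, sources)
```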
-
normalize_model
(pwt, nwt, rwt=None)¶ Parameters: - pwt (str) – name of pwt matrix in BigARTM
- nwt (str) – name of nwt matrix in BigARTM
- rwt (str) – name of rwt matrix in BigARTM
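For intuition about how pwt relates to nwt and rwt, the following sketch applies the usual ARTM-style normalization: add the regularizer term, clip negatives to zero, and rescale each topic column to sum to one. This is an assumption about the semantics, written with plain dictionaries. `normalize_columns` is a hypothetical function, not a BigARTM call.

```python
def normalize_columns(nwt, rwt=None):
    """nwt, rwt: dict token -> list of per-topic values.

    Returns pwt in which every topic column sums to 1 (or is all zeros
    if the column had no positive mass).
    """
    tokens = list(nwt)
    num_topics = len(next(iter(nwt.values())))
    raw = {}
    for token in tokens:
        r_row = rwt[token] if rwt is not None else [0.0] * num_topics
        # Negative values after regularization are clipped to zero.
        raw[token] = [max(n + r, 0.0) for n, r in zip(nwt[token], r_row)]
    totals = [sum(raw[token][t] for token in tokens) for t in range(num_topics)]
    return {
        token: [raw[token][t] / totals[t] if totals[t] > 0 else 0.0
                for t in range(num_topics)]
        for token in tokens
    }


nwt = {"cat": [3.0, 0.0], "dog": [1.0, 2.0]}
pwt_plain = normalize_columns(nwt)

rwt = {"cat": [-1.0, 0.0], "dog": [-2.0, 0.0]}
pwt_regularized = normalize_columns(nwt, rwt)
```

In the second call the negative rwt entries push the "dog" counter for the first topic below zero, so it is clipped and the whole column mass moves to "cat".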
-
process_batches
(pwt, nwt=None, num_document_passes=None, batches_folder=None, batches=None, regularizer_name=None, regularizer_tau=None, class_ids=None, class_weights=None, find_theta=False, reuse_theta=False, find_ptdw=False, predict_class_id=None)¶ Parameters: - pwt (str) – name of pwt matrix in BigARTM
- nwt (str) – name of nwt matrix in BigARTM
- num_document_passes (int) – number of inner iterations during processing
- batches_folder (str) – full path to data folder (alternative 1)
- batches (list of str) – full file names of batches to process (alternative 2)
- regularizer_name (list of str) – list of names of Theta regularizers to use
- regularizer_tau (list of float) – list of tau coefficients for Theta regularizers
- class_ids (list of str) – list of class ids to use during processing
- class_weights (list of float) – list of corresponding weights of class ids
- find_theta (bool) – find theta matrix for ‘batches’ (if alternative 2)
- reuse_theta (bool) – initialize by theta from previous collection pass
- find_ptdw (bool) – calculate and return Ptdw matrix or not (works if find_theta == False)
- predict_class_id (str, default None) – class_id of a target modality to predict
Returns: - tuple (messages.ThetaMatrix, numpy.ndarray) — the info about Theta (if find_theta == True)
- messages.ThetaMatrix — the info about Theta (if find_theta == False)
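To make the role of `num_document_passes` concrete, here is a minimal sketch of the inner iterations for one document with a fixed Phi matrix. It uses the standard PLSA-style EM update without regularizers or class weights, which is an assumption and not necessarily what BigARTM executes. `infer_theta` is a hypothetical function, and every token of the document is assumed to be present in `phi`.

```python
def infer_theta(doc_counts, phi, num_document_passes):
    """Estimate p(t|d) for one document.

    doc_counts: dict token -> count of the token in the document (n_dw)
    phi:        dict token -> list of p(w|t) values, one per topic
    """
    num_topics = len(next(iter(phi.values())))
    theta = [1.0 / num_topics] * num_topics  # uniform starting point
    for _ in range(num_document_passes):
        n_td = [0.0] * num_topics
        for token, n_dw in doc_counts.items():
            p_dw = sum(phi[token][t] * theta[t] for t in range(num_topics))
            if p_dw == 0.0:
                continue  # token cannot be explained by any topic
            for t in range(num_topics):
                n_td[t] += n_dw * phi[token][t] * theta[t] / p_dw
        norm = sum(n_td)
        if norm == 0.0:
            break  # nothing to normalize, keep the previous theta
        theta = [value / norm for value in n_td]
    return theta


phi = {"cat": [0.9, 0.1], "dog": [0.1, 0.9]}
theta_one_pass = infer_theta({"cat": 4}, phi, num_document_passes=1)
theta_ten_passes = infer_theta({"cat": 4}, phi, num_document_passes=10)
```

More passes move the estimate further toward the topic that best explains the document, at the cost of extra computation per batch.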
-
reconfigure
(topic_names=None, class_ids=None, scores=None, regularizers=None, num_processors=None, pwt_name=None, nwt_name=None, num_document_passes=None, reuse_theta=None, cache_theta=None)¶
-
reconfigure_regularizer
(name, config=None, tau=None, gamma=None)¶
-
reconfigure_score
(name, config, model_name=None)¶
-
reconfigure_topic_name
(topic_names=None)¶
-
regularize_model
(pwt, nwt, rwt, regularizer_name, regularizer_tau, regularizer_gamma=None)¶ Parameters: - pwt (str) – name of pwt matrix in BigARTM
- nwt (str) – name of nwt matrix in BigARTM
- rwt (str) – name of rwt matrix in BigARTM
- regularizer_name (list of str) – list of names of Phi regularizers to use
- regularizer_tau (list of floats) – list of tau coefficients for Phi regularizers
-
remove_batch
(batch_id=None)¶ Parameters: batch_id (unicode) – id of batch, loaded in RAM
-
transform
(batches=None, batch_filenames=None, theta_matrix_type=None, predict_class_id=None)¶ Parameters: - batches – list of Batch instances
- batch_filenames (list of str) – names of batches to transform
- theta_matrix_type (int) – type of matrix to be returned
- predict_class_id (str, default None) – class_id of a target modality to predict
Returns: messages.ThetaMatrix object
-