ARTM model
This page describes the ARTM class.
-
class artm.ARTM(num_processors=0, topic_names=None, num_topics=10, class_ids=None, cache_theta=True, scores=None, regularizers=None, theta_columns_naming='id')
ARTM represents a topic model (public class)
Parameters: - num_processors (int) – number of threads used for model training; if not specified, the lib detects the number of threads itself
- topic_names (list of str) – names of topics in model, if not specified will be auto-generated by lib according to num_topics
- num_topics (int) – number of topics in model (is used if topic_names not specified), default=10
- class_ids (dict) – list of class_ids and their weights to be used in model, key — class_id, value — weight, if not specified then all class_ids will be used
- cache_theta (bool) – whether to save the Theta matrix in the model. Required if ARTM.get_theta() will be used, default=True
- scores (list) – list of scores (objects of artm.***Score classes), default=None
- regularizers (list) – list with regularizers (objects of artm.***Regularizer classes), default=None
- theta_columns_naming (string) – either 'id' or 'title', determines how to name columns (documents) in theta dataframe, default='id'
- Important public fields:
- regularizers — contains dict of regularizers, included into model
- scores — contains dict of scores, included into model
- score_tracker — contains dict of scoring results; key — score name, value — ScoreTracker object, which holds a list with the value of the score at each synchronization
Note
- Here and elsewhere in BigARTM, an empty topic_names or class_ids means that the model (or regularizer, or score) should use all topics or class_ids.
- If some fields of regularizers or scores are not defined by the user, internal lib defaults will be used.
- If the field 'topic_names' is None, it will be generated by BigARTM and will be available as ARTM.topic_names.
-
create_dictionary(dictionary_name=None, dictionary_data=None)
ARTM.create_dictionary() — create a BigARTM dictionary in the lib from the given dictionary data
Parameters: - dictionary_name (str) – the name of the dictionary in the lib, default=None
- dictionary_data (DictionaryData instance) – configuration of dictionary, default=None
-
filter_dictionary(dictionary_name=None, dictionary_target_name=None, class_id=None, min_df=None, max_df=None, min_df_rate=None, max_df_rate=None, min_tf=None, max_tf=None)
ARTM.filter_dictionary() — filter the BigARTM dictionary of the collection, which was already loaded into the lib
Parameters: - dictionary_name (str) – name of the dictionary in the lib to filter
- dictionary_target_name (str) – name for the new filtered dictionary in the lib
- class_id (str) – class_id to filter
- min_df (float) – min df value to pass the filter
- max_df (float) – max df value to pass the filter
- min_df_rate (float) – min df rate to pass the filter
- max_df_rate (float) – max df rate to pass the filter
- min_tf (float) – min tf value to pass the filter
- max_tf (float) – max tf value to pass the filter
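The way these thresholds combine can be illustrated with a small pure-Python sketch. It is not BigARTM's implementation: the helper `passes_filter`, the toy statistics, the inclusive bounds, and the reading of df as "number of documents containing the token", df rate as df divided by the number of documents, and tf as "total occurrences" are all assumptions made for illustration. The `class_id` restriction is left out.

```python
def passes_filter(df, tf, num_docs, min_df=None, max_df=None,
                  min_df_rate=None, max_df_rate=None,
                  min_tf=None, max_tf=None):
    """Return True if a token passes every threshold that is set (not None)."""
    # Assumption: df rate is the share of documents that contain the token.
    df_rate = df / num_docs
    checks = [
        (min_df, lambda lim: df >= lim),
        (max_df, lambda lim: df <= lim),
        (min_df_rate, lambda lim: df_rate >= lim),
        (max_df_rate, lambda lim: df_rate <= lim),
        (min_tf, lambda lim: tf >= lim),
        (max_tf, lambda lim: tf <= lim),
    ]
    # Thresholds left as None are skipped, mirroring the None defaults above.
    return all(check(lim) for lim, check in checks if lim is not None)


# Hypothetical statistics for a 100-document collection: token -> (df, tf)
stats = {"the": (100, 950), "topic": (40, 120), "zyzzyva": (1, 1)}

kept = [token for token, (df, tf) in stats.items()
        if passes_filter(df, tf, num_docs=100, min_df=2, max_df_rate=0.9)]
print(kept)
```

In this toy run the very frequent token fails max_df_rate and the very rare one fails min_df, so only the mid-frequency token remains.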
-
fit_offline(batch_vectorizer=None, num_collection_passes=20, num_document_passes=1, reuse_theta=True, dictionary_filename='dictionary.dict')
ARTM.fit_offline() — train the topic model in offline mode
Parameters: - batch_vectorizer – an instance of BatchVectorizer class
- num_collection_passes (int) – number of iterations over whole given collection, default=20
- num_document_passes (int) – number of inner iterations over each document for inferring theta, default=1
- reuse_theta (bool) – whether to reuse theta from the previous pass over the collection, default=True
- dictionary_filename (str) – the name of the file with the dictionary to use in inline initialization, default='dictionary.dict'
Note
ARTM.initialize() should be called before the first call to ARTM.fit_offline(); otherwise the model will be initialized from the dictionary during that first call.
-
fit_online(batch_vectorizer=None, tau0=1024.0, kappa=0.7, update_every=1, num_document_passes=10, reset_theta_scores=False, dictionary_filename='dictionary.dict')
ARTM.fit_online() — train the topic model in online mode
Parameters: - batch_vectorizer – an instance of BatchVectorizer class
- update_every (int) – the number of batches to process between model updates, default=1
- tau0 (float) – offset added to update_count when computing rho (see the formulas in the note), default=1024.0
- kappa (float) – exponent applied to (tau0 + update_count) when computing rho, default=0.7
- num_document_passes (int) – number of inner iterations over each document for inferring theta, default=10
- reset_theta_scores (bool) – reset accumulated Theta scores before learning, default=False
- dictionary_filename (str) – the name of the file with the dictionary to use in inline initialization, default='dictionary.dict'
Note
The formulas for decay_weight and apply_weight:
- update_count = current_processed_docs / (batch_size * update_every)
- rho = pow(tau0 + update_count, -kappa)
- decay_weight = 1-rho
- apply_weight = rho
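The formulas above can be traced in plain Python to see how the two weights change as more documents are processed. This is a minimal sketch of the arithmetic only, not BigARTM's internal code; the function name `online_weights` and the batch size of 1000 are hypothetical.

```python
def online_weights(current_processed_docs, batch_size, update_every=1,
                   tau0=1024.0, kappa=0.7):
    """Return (decay_weight, apply_weight) computed with the formulas in the note."""
    # Number of model updates made so far (can be fractional).
    update_count = current_processed_docs / (batch_size * update_every)
    rho = pow(tau0 + update_count, -kappa)
    return 1.0 - rho, rho


# Hypothetical run: batches of 1000 documents, model updated after every batch.
for docs in (0, 1000, 10000, 100000):
    decay_weight, apply_weight = online_weights(docs, batch_size=1000)
    print(docs, decay_weight, apply_weight)
```

With the default tau0=1024.0 and kappa=0.7, rho starts at about 0.0078 (1024 to the power -0.7) and shrinks as update_count grows, so each later update changes the model less.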
Note
ARTM.initialize() should be called before the first call to ARTM.fit_online(); otherwise the model will be initialized from the dictionary during that first call.
-
fit_transform(topic_names=None, remove_theta=False)
ARTM.fit_transform() — obsolete way of theta retrieval. Use ARTM.get_theta() instead.
-
gather_dictionary(dictionary_target_name=None, data_path=None, cooc_file_path=None, vocab_file_path=None, symmetric_cooc_values=False)
ARTM.gather_dictionary() — create the BigARTM dictionary of a collection represented as batches and load it into the lib
Parameters: - dictionary_target_name (str) – the name of the dictionary in the lib, default=None
- data_path (str) – full path to batches folder
- cooc_file_path (str) – full path to the file with cooc info
- vocab_file_path (str) – full path to the file with vocabulary. If given, the dictionary tokens will have the same order as in this file; otherwise the order will be random, default=None
- symmetric_cooc_values (bool) – whether the cooc matrix should be considered symmetric, default=False
-
get_phi(topic_names=None, class_ids=None, model_name=None)
ARTM.get_phi() — get a custom Phi matrix of the model. To extract the whole Phi matrix, use ARTM.phi_.
Parameters: - topic_names (list of str) – list with topics to extract, default=None (means all topics)
- class_ids (list of str) – list with class ids to extract, default=None (means all class ids)
- model_name (str) – name of the matrix to extract: self.model_pwt by default; self.model_nwt can also be used to extract unnormalized counters
Returns: pandas.DataFrame (data, columns, rows), where:
- columns — the names of topics in topic model
- rows — the tokens of topic model
- data — content of Phi matrix
-
get_theta(topic_names=None, remove_theta=False)
ARTM.get_theta() — get Theta matrix for training set of documents
Parameters: - topic_names (list of str) – list with topics to extract, default=None (means all topics)
- remove_theta (bool) – whether to remove Theta from the model after extraction (True) or keep it (False), default=False
Returns: pandas.DataFrame (data, columns, rows), where:
- columns — the ids of documents, for which the Theta matrix was requested
- rows — the names of topics in topic model that were used to create Theta
- data — content of Theta matrix
-
initialize(dictionary_name=None, seed=-1)
ARTM.initialize() — initialize topic model before learning
Parameters: - dictionary_name (str) – the name of loaded BigARTM collection dictionary, default=None
- seed (unsigned int or -1) – seed for random initialization, default=-1 (no seed)
-
load(filename)
ARTM.load() — load the topic model, saved by ARTM.save(), from disk
Parameters: filename (str) – the name of the file containing the model, no default
Note
The loaded model will overwrite the ARTM.topic_names and ARTM.num_topics fields. It will also empty ARTM.score_tracker.
-
load_dictionary(dictionary_name=None, dictionary_path=None)
ARTM.load_dictionary() — load the BigARTM dictionary of the collection into the lib
Parameters: - dictionary_name (str) – the name of the dictionary in the lib, default=None
- dictionary_path (str) – full file name of the dictionary, default=None
-
load_text_dictionary(dictionary_name=None, dictionary_path=None, encoding='utf-8')
ARTM.load_text_dictionary() — load the BigARTM dictionary of the collection from the disk in the human readable text format
Parameters: - dictionary_name (str) – the name for the dictionary in the lib, default=None
- dictionary_path (str) – full file name of the text dictionary file, default=None
- encoding (str) – the encoding of the text in the dictionary file, default='utf-8'
-
remove_dictionary(dictionary_name=None)
ARTM.remove_dictionary() — remove the loaded BigARTM dictionary from the lib
Parameters: dictionary_name (str) – the name of the dictionary in the lib, default=None
-
save(filename='artm_model')
ARTM.save() — save the topic model to disk
Parameters: filename (str) – the name of the file to store the model, default='artm_model'
-
save_dictionary(dictionary_name=None, dictionary_path=None)
ARTM.save_dictionary() — save the BigARTM dictionary of the collection on the disk
Parameters: - dictionary_name (str) – the name of the dictionary in the lib, default=None
- dictionary_path (str) – full file name for the dictionary, default=None
-
save_text_dictionary(dictionary_name=None, dictionary_path=None, encoding='utf-8')
ARTM.save_text_dictionary() — save the BigARTM dictionary of the collection on the disk in the human readable text format
Parameters: - dictionary_name (str) – the name of the dictionary in the lib, default=None
- dictionary_path (str) – full file name for the text dictionary file, default=None
- encoding (str) – the encoding of the text in the dictionary file, default='utf-8'
-
transform(batch_vectorizer=None, num_document_passes=1, predict_class_id=None)
ARTM.transform() — find Theta matrix for new documents
Parameters: - batch_vectorizer – an instance of BatchVectorizer class
- num_document_passes (int) – number of inner iterations over each document for inferring theta, default = 1
- predict_class_id (string) – class_id of a target modality to predict, default = None. When this option is enabled, the resulting columns of the theta matrix will correspond to unique labels of the target modality. The values will represent p(c|d), the probability of class label c for document d.
Returns: pandas.DataFrame (data, columns, rows), where:
- columns — the ids of documents, for which the Theta matrix was requested
- rows — the names of topics in topic model that were used to create Theta
- data — content of Theta matrix.
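To give a feel for the p(c|d) values mentioned under predict_class_id, the sketch below uses one common way of deriving them in a topic model: marginalizing over topics, p(c|d) = sum over t of p(c|t) * p(t|d). Whether BigARTM computes the prediction exactly this way is an assumption; the helper `predict_class_probs`, the labels, and all the numbers are hypothetical.

```python
def predict_class_probs(phi_class, theta_doc):
    """Combine p(c|t) and p(t|d) into p(c|d) by summing over topics.

    phi_class: {label: {topic: p(c|t)}}
    theta_doc: {topic: p(t|d)} for a single document
    """
    return {
        label: sum(p_ct.get(topic, 0.0) * p_td
                   for topic, p_td in theta_doc.items())
        for label, p_ct in phi_class.items()
    }


# Hypothetical two-topic model with a two-label target modality.
phi_class = {
    "sports": {"topic_0": 0.9, "topic_1": 0.2},
    "politics": {"topic_0": 0.1, "topic_1": 0.8},
}
theta_doc = {"topic_0": 0.25, "topic_1": 0.75}

probs = predict_class_probs(phi_class, theta_doc)
print(probs)
```

Because each topic's label probabilities and the document's topic probabilities both sum to 1 in this toy input, the resulting p(c|d) values also sum to 1.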