ARTM model
This page describes the ARTM class.
class artm.ARTM(num_topics=None, topic_names=None, num_processors=None, class_ids=None, scores=None, regularizers=None, num_document_passes=10, reuse_theta=False, dictionary=None, cache_theta=False, theta_columns_naming='id', seed=-1)
__init__(num_topics=None, topic_names=None, num_processors=None, class_ids=None, scores=None, regularizers=None, num_document_passes=10, reuse_theta=False, dictionary=None, cache_theta=False, theta_columns_naming='id', seed=-1)
Parameters:
- num_topics (int) – the number of topics in the model; overwritten if topic_names is set
- num_processors (int) – number of threads used for model training; if not specified, the library detects the number of threads itself
- topic_names (list of str) – names of topics in the model
- class_ids (dict) – class_ids and their weights to be used in the model: key is a class_id, value is its weight; if not specified, all class_ids are used
- cache_theta (bool) – whether to save the Theta matrix in the model. Required if you intend to call ARTM.get_theta()
- scores (list) – list of scores (objects of artm.*Score classes)
- regularizers (list) – list with regularizers (objects of artm.*Regularizer classes)
- num_document_passes (int) – number of inner iterations over each document
- dictionary (str or reference to Dictionary object) – dictionary to be used for initialization, if None nothing will be done
- reuse_theta (bool) – reuse Theta from previous iteration or not
- theta_columns_naming (str) – either ‘id’ or ‘title’, determines how to name columns (documents) in theta dataframe
- seed (unsigned int or -1) – seed for random initialization, -1 means no seed
Important public fields:
- regularizers: dict of the regularizers included in the model
- scores: dict of the scores included in the model
- score_tracker: dict of scoring results: key is the score name, value is a ScoreTracker object that holds a list of the score's values at each synchronization (e.g. each collection pass)
Note:
- Here and elsewhere in BigARTM, an empty topic_names or class_ids means that the model (or regularizer, or score) should use all topics or class_ids.
- If some fields of regularizers or scores are not defined by the user, internal library defaults are used.
- If the field ‘topic_names’ is None, the names are generated by BigARTM and are available via ARTM.topic_names().
dispose()
Description: free all native memory allocated for this model
Note:
- This method does not free memory occupied by dictionaries, because dictionaries are shared across all models.
- The ARTM class implements __exit__ and __del__ methods, which automatically call dispose.
fit_offline(batch_vectorizer=None, num_collection_passes=1)
Description: trains the topic model in offline mode
Parameters:
- batch_vectorizer (object_reference) – an instance of BatchVectorizer class
- num_collection_passes (int) – number of iterations over the whole given collection
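A minimal sketch of an offline training run, assuming the collection has already been converted to BigARTM batches. The helper name train_offline, the default topic count and the score name are hypothetical, chosen only for illustration. The sketch also assumes that artm.BatchVectorizer, artm.Dictionary.gather and artm.SparsityPhiScore are available in your installed BigARTM release; check them against its documentation.

```python
def train_offline(batches_path, num_topics=10, num_collection_passes=5):
    """Train an ARTM model on existing batches; return (model, sparsity, theta)."""
    # Imported lazily so that this sketch can be loaded without BigARTM installed.
    import artm

    # Assumption: `batches_path` is a directory that already contains batches.
    batch_vectorizer = artm.BatchVectorizer(data_path=batches_path,
                                            data_format='batches')

    # Assumption: the dictionary is gathered from the same batches.
    dictionary = artm.Dictionary()
    dictionary.gather(data_path=batches_path)

    model = artm.ARTM(
        num_topics=num_topics,
        dictionary=dictionary,   # used to initialize the model
        cache_theta=True,        # needed for get_theta() afterwards
        scores=[artm.SparsityPhiScore(name='sparsity_phi')],
    )

    model.fit_offline(batch_vectorizer=batch_vectorizer,
                      num_collection_passes=num_collection_passes)

    # One value per synchronization (here: per collection pass).
    sparsity_per_pass = model.score_tracker['sparsity_phi'].value
    theta = model.get_theta()
    return model, sparsity_per_pass, theta
```

Calling train_offline('my_batches') with a real batches directory would return the trained model, the list of Phi sparsity values and the cached Theta matrix; 'my_batches' is a placeholder path.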
fit_online(batch_vectorizer=None, tau0=1024.0, kappa=0.7, update_every=1, apply_weight=None, decay_weight=None, update_after=None, async=False)
Description: trains the topic model in online mode
Parameters:
- batch_vectorizer (object_reference) – an instance of BatchVectorizer class
- update_every (int) – the number of batches between model updates; the model is updated once per update_every batches
- tau0 (float) – coefficient (see the ‘Update formulas’ paragraph)
- kappa (float) – exponent applied to (tau0 + update_count) (see the ‘Update formulas’ paragraph)
- update_after (list of int) – number of batches to be passed for Phi synchronizations
- apply_weight (list of float) – weight of applying new counters
- decay_weight (list of float) – weight of applying old counters
- async (bool) – use or not the async implementation of the EM-algorithm
Note: with async=True, scores cannot be extracted via score_tracker. Use get_score() instead.
Update formulas:
- The formulas for decay_weight and apply_weight:
- update_count = current_processed_docs / (batch_size * update_every);
- rho = pow(tau0 + update_count, -kappa);
- decay_weight = 1 - rho;
- apply_weight = rho;
- If apply_weight, decay_weight and update_after are all set, they are used; otherwise the formulas above are used (with update_every, tau0 and kappa).
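The update formulas above can be sketched in plain Python to see how the weights evolve. This is an illustration only, not the library's implementation: the function name online_weights is hypothetical, and the division is assumed to be ordinary floating-point division. With the default tau0=1024.0 and kappa=0.7, the first update uses rho = 1024^(-0.7) = 2^(-7), roughly 0.0078.

```python
def online_weights(processed_docs, batch_size, update_every=1,
                   tau0=1024.0, kappa=0.7):
    """Return (decay_weight, apply_weight) for the given training progress."""
    # Assumption: true (floating-point) division, as in Python 3.
    update_count = processed_docs / (batch_size * update_every)
    rho = pow(tau0 + update_count, -kappa)
    decay_weight = 1.0 - rho   # weight of the old counters
    apply_weight = rho         # weight of the new counters
    return decay_weight, apply_weight


# Print the schedule for a few points of a run with 1000-document batches.
for docs in (0, 1000, 100000, 10000000):
    decay, apply_ = online_weights(docs, batch_size=1000)
    print(docs, decay, apply_)
```

The apply weight shrinks as more documents are processed, so later batches change the model less than earlier ones.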
get_phi(topic_names=None, class_ids=None, model_name=None)
Description: get a custom Phi matrix of the model. To extract the whole Phi matrix, use ARTM.phi_.
Parameters:
- topic_names (list of str) – list of topics to extract; None means all topics
- class_ids (list of str) – list of class ids to extract; None means all class ids
- model_name (str) – self.model_pwt by default; self.model_nwt is also a reasonable choice, to extract unnormalized counters
Returns:
- pandas.DataFrame: (data, columns, rows), where:
- columns — the names of topics in topic model;
- rows — the tokens of topic model;
- data — content of Phi matrix.
get_score(score_name)
Description: get score after fit_offline, fit_online or transform
Parameters: score_name (str) – the name of the score to return
get_theta(topic_names=None)
Description: get Theta matrix for the training set of documents
Parameters: topic_names (list of str) – list of topics to extract; None means all topics
Returns:
- pandas.DataFrame: (data, columns, rows), where:
- columns — the ids of documents, for which the Theta matrix was requested;
- rows — the names of topics in topic model, that was used to create Theta;
- data — content of Theta matrix.
info
Description: returns internal diagnostics information about the model
initialize(dictionary=None)
Description: initialize the topic model before learning
Parameters: dictionary (str or reference to Dictionary object) – loaded BigARTM collection dictionary
library_version
Description: the version of the BigARTM library in MAJOR.MINOR.PATCH format
load(filename, model_name='p_wt')
Description: loads from disk a topic model saved by ARTM.save()
Parameters:
- filename (str) – the name of the file containing the model
- model_name (str) – the name of the matrix to be loaded, ‘p_wt’ or ‘n_wt’
Note:
- The loaded model will overwrite the ARTM.topic_names and class_ids fields.
- All class_ids weights will be set to 1.0; specify them by hand if necessary.
- The method call will empty ARTM.score_tracker.
- All regularizers and scores will be forgotten.
- These and other settings are reset, so we strongly recommend re-specifying all important parameters of the ARTM model that were set earlier.
remove_theta()
Description: removes the cached theta matrix
save(filename, model_name='p_wt')
Description: saves one Phi-like matrix to disk
Parameters:
- filename (str) – the name of the file to store the model in
- model_name (str) – the name of the matrix to be saved, ‘p_wt’ or ‘n_wt’
transform(batch_vectorizer=None, theta_matrix_type='dense_theta', predict_class_id=None)
Description: find Theta matrix for new documents
Parameters:
- batch_vectorizer (object_reference) – an instance of BatchVectorizer class
- theta_matrix_type (str) – type of matrix to be returned; possible values: ‘dense_theta’, ‘dense_ptdw’, None; default: ‘dense_theta’
- predict_class_id (str) – class_id of a target modality to predict. When this option is enabled, the columns of the resulting theta matrix correspond to the unique labels of the target modality. The values represent p(c|d), the probability of class label c for document d.
Returns:
- pandas.DataFrame: (data, columns, rows), where:
- columns — the ids of documents, for which the Theta matrix was requested;
- rows — the names of topics in topic model, that was used to create Theta;
- data — content of Theta matrix.
Note:
- ‘dense_ptdw’ mode provides simple access to the values of p(t|w,d). The resulting pandas.DataFrame object contains a flat theta matrix (not 3D) in which each item has multiple columns, as many as the number of tokens in that document. These columns share the same item_id. The order of columns with equal item_id is the same as the order of tokens in the input data (batch.item.token_id).
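The flat ‘dense_ptdw’ layout described above can be illustrated without BigARTM. The sketch below uses made-up numbers and a hypothetical helper, group_ptdw_columns, to regroup columns that share an item_id into one list per document while keeping token order. With a real result you would pass the DataFrame's column labels and column values instead of the toy lists.

```python
from collections import OrderedDict


def group_ptdw_columns(item_ids, columns):
    """Group flat p(t|w,d) columns by item_id, preserving token order."""
    if len(item_ids) != len(columns):
        raise ValueError("item_ids and columns must have the same length")
    grouped = OrderedDict()
    for item_id, column in zip(item_ids, columns):
        # Columns with equal item_id arrive in the order of the document's tokens.
        grouped.setdefault(item_id, []).append(column)
    return grouped


# Toy data (made up): 2 topics; document 7 has three tokens, document 9 has two.
item_ids = [7, 7, 7, 9, 9]
columns = [[0.9, 0.1], [0.4, 0.6], [0.5, 0.5], [0.2, 0.8], [1.0, 0.0]]

per_document = group_ptdw_columns(item_ids, columns)
for item_id, token_columns in per_document.items():
    print(item_id, len(token_columns), token_columns)
```

Each entry of per_document then holds one topic distribution per token of that document, in the original token order.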