ARTM model

This page describes the ARTM class.

class artm.ARTM(num_topics=None, topic_names=None, num_processors=None, class_ids=None, scores=None, regularizers=None, num_document_passes=10, reuse_theta=False, dictionary=None, cache_theta=False, theta_columns_naming='id', seed=-1)
__init__(num_topics=None, topic_names=None, num_processors=None, class_ids=None, scores=None, regularizers=None, num_document_passes=10, reuse_theta=False, dictionary=None, cache_theta=False, theta_columns_naming='id', seed=-1)
Parameters:
  • num_topics (int) – the number of topics in the model; overwritten if topic_names is set
  • num_processors (int) – number of threads used for model training; if not specified, the library detects the number of threads automatically
  • topic_names (list of str) – names of topics in model
  • class_ids (dict) – list of class_ids and their weights to be used in model, key — class_id, value — weight, if not specified then all class_ids will be used
  • cache_theta (bool) – whether to store the Theta matrix in the model. Required if ARTM.get_theta() is going to be used
  • scores (list) – list of scores (objects of artm.*Score classes)
  • regularizers (list) – list with regularizers (objects of artm.*Regularizer classes)
  • num_document_passes (int) – number of inner iterations over each document
  • dictionary (str or reference to Dictionary object) – dictionary to be used for initialization, if None nothing will be done
  • reuse_theta (bool) – whether to reuse Theta from the previous iteration
  • theta_columns_naming (str) – either ‘id’ or ‘title’, determines how to name columns (documents) in theta dataframe
  • seed (unsigned int or -1) – seed for random initialization, -1 means no seed
Important public fields:
 
  • regularizers: dict of the regularizers included in the model
  • scores: dict of the scores included in the model
  • score_tracker: dict of scoring results; the key is the score name, the value is a ScoreTracker object that holds, in a list, the values of the score at each synchronization (e.g. collection pass)
Note:
  • Here and anywhere in BigARTM, an empty topic_names or class_ids means that the model (or regularizer, or score) should use all topics or class_ids.
  • If some fields of regularizers or scores are not defined by the user, internal library defaults are used.
  • If the field ‘topic_names’ is None, topic names will be generated by BigARTM and will be available using ARTM.topic_names().
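The precedence between num_topics and topic_names described above can be sketched in plain Python. This is only an illustration: resolve_topic_names is a hypothetical helper, not part of artm, and the generated naming scheme ("topic_0", "topic_1", ...) is an assumption that this page does not state.

```python
def resolve_topic_names(num_topics=None, topic_names=None):
    """Return the topic names a model would end up with (illustrative only)."""
    if topic_names:
        # Explicit names win; num_topics is effectively overwritten by their count.
        return list(topic_names)
    # Assumed naming scheme for generated names, not taken from this page.
    return ["topic_{}".format(i) for i in range(num_topics or 0)]


# num_topics=10 is ignored because topic_names is set.
explicit = resolve_topic_names(num_topics=10, topic_names=["sports", "politics"])

# No names given: three names are generated.
generated = resolve_topic_names(num_topics=3)
```

Here explicit holds the two user-supplied names, while generated holds three generated ones.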
dispose()
Description:

free all native memory, allocated for this model

Note:
  • This method does not free memory occupied by dictionaries, because dictionaries are shared across all models
  • The ARTM class implements the __exit__ and __del__ methods, which automatically call dispose.
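The automatic clean-up described above follows the usual Python context-manager pattern. The sketch below uses a hypothetical stand-in class, NativeResource, rather than artm.ARTM itself; it also defines __enter__ so that it can be used in a with statement, which is an assumption about how such a class would typically be written.

```python
class NativeResource:
    """Hypothetical stand-in for an object that owns native memory."""

    def __init__(self):
        self.disposed = False

    def dispose(self):
        # Safe to call more than once: repeated calls change nothing.
        self.disposed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Release the resource when the with-block ends, even on error.
        self.dispose()
        return False  # do not swallow exceptions

    def __del__(self):
        # Fallback for objects that were never used in a with-block.
        self.dispose()


with NativeResource() as resource:
    inside = resource.disposed   # not yet released inside the block

after = resource.disposed        # released once the block exits
```

Calling dispose explicitly is still useful when the release point has to be deterministic and a with-block is not convenient.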
fit_offline(batch_vectorizer=None, num_collection_passes=1)
Description:

trains the topic model in offline mode

Parameters:
  • batch_vectorizer (object_reference) – an instance of BatchVectorizer class
  • num_collection_passes (int) – number of iterations over whole given collection
fit_online(batch_vectorizer=None, tau0=1024.0, kappa=0.7, update_every=1, apply_weight=None, decay_weight=None, update_after=None, async=False)
Description:

trains the topic model in online mode

Parameters:
  • batch_vectorizer (object_reference) – an instance of BatchVectorizer class
  • update_every (int) – the number of batches between model updates; the model is updated once every update_every batches
  • tau0 (float) – coefficient (see ‘Update formulas’ paragraph)
  • kappa (float) – exponent in the rho formula (see ‘Update formulas’ paragraph)
  • update_after (list of int) – number of batches to be passed for Phi synchronizations
  • apply_weight (list of float) – weight of applying new counters
  • decay_weight (list of float) – weight of applying old counters
  • async (bool) – whether to use the async implementation of the EM-algorithm
Note:

With async=True, scores cannot be extracted via score_tracker. Use get_score() instead.

Update formulas:
 
  • The formulas for decay_weight and apply_weight are:
  • update_count = current_processed_docs / (batch_size * update_every);
  • rho = pow(tau0 + update_count, -kappa);
  • decay_weight = 1-rho;
  • apply_weight = rho;
  • if apply_weight, decay_weight and update_after are set, they will be used; otherwise the formulas above will be used (with update_every, tau0 and kappa)
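The update formulas above can be evaluated directly to see how the weights evolve. The sketch below is a plain-Python transcription, not library code: online_weights is a hypothetical helper, and true (floating-point) division is assumed for update_count.

```python
import math


def online_weights(processed_docs, batch_size, update_every=1,
                   tau0=1024.0, kappa=0.7):
    """Return (decay_weight, apply_weight) following the formulas above."""
    # Assumption: update_count is computed with floating-point division.
    update_count = processed_docs / (batch_size * update_every)
    rho = math.pow(tau0 + update_count, -kappa)
    return 1.0 - rho, rho


# Before any document has been processed, rho depends only on tau0 and kappa.
decay0, apply0 = online_weights(processed_docs=0, batch_size=1000)

# After 100 batches of 1000 documents, new counters get a smaller weight.
decay_later, apply_later = online_weights(processed_docs=100_000, batch_size=1000)
```

With the default tau0=1024.0 and kappa=0.7, the first apply_weight is about 1/128, and it decreases as more documents are processed, so older counters dominate more and more.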
get_phi(topic_names=None, class_ids=None, model_name=None)
Description:

gets a custom Phi matrix of the model. To extract the whole Phi matrix, use ARTM.phi_ instead.

Parameters:
  • topic_names (list of str) – list with topics to extract, None value means all topics
  • class_ids (list of str) – list with class ids to extract, None means all class ids
  • model_name (str) – self.model_pwt by default; self.model_nwt can be passed to extract unnormalized counters
Returns:

  • pandas.DataFrame: (data, columns, rows), where:
  • columns — the names of topics in topic model;
  • rows — the tokens of topic model;
  • data — content of Phi matrix.

get_score(score_name)
Description: get score after fit_offline, fit_online or transform
Parameters: score_name (str) – the name of the score to return
get_theta(topic_names=None)
Description: get Theta matrix for training set of documents
Parameters: topic_names (list of str) – list with topics to extract, None means all topics
Returns:
  • pandas.DataFrame: (data, columns, rows), where:
  • columns — the ids of documents, for which the Theta matrix was requested;
  • rows — the names of topics in topic model, that was used to create Theta;
  • data — content of Theta matrix.
info
Description: returns internal diagnostics information about the model
initialize(dictionary=None)
Description: initialize topic model before learning
Parameters: dictionary (str or reference to Dictionary object) – loaded BigARTM collection dictionary
library_version
Description: the version of BigARTM library in a MAJOR.MINOR.PATCH format
load(filename, model_name='p_wt')
Description:

loads from disk the topic model saved by ARTM.save()

Parameters:
  • filename (str) – the name of file containing model
  • model_name (str) – the name of matrix to be loaded, ‘p_wt’ or ‘n_wt’
Note:
  • Loaded model will overwrite ARTM.topic_names and class_ids fields.
  • All class_ids weights will be set to 1.0, you need to specify them by hand if it’s necessary.
  • The method call will empty ARTM.score_tracker.
  • All regularizers and scores will be forgotten.
  • Other settings of the model may be affected as well, so we strongly recommend that you set again all important parameters of the ARTM model that were used earlier.
remove_theta()
Description: removes cached theta matrix
save(filename, model_name='p_wt')
Description:

saves one Phi-like matrix to disk

Parameters:
  • filename (str) – the name of file to store model
  • model_name (str) – the name of matrix to be saved, ‘p_wt’ or ‘n_wt’
transform(batch_vectorizer=None, theta_matrix_type='dense_theta', predict_class_id=None)
Description:

find Theta matrix for new documents

Parameters:
  • batch_vectorizer (object_reference) – an instance of BatchVectorizer class
  • theta_matrix_type (str) – type of matrix to be returned, possible values: ‘dense_theta’, ‘dense_ptdw’, None, default=’dense_theta’
  • predict_class_id (str) – class_id of a target modality to predict. When this option is enabled, the resulting columns of the theta matrix will correspond to the unique labels of the target modality. The values will represent p(c|d), i.e. the probability of class label c for document d.
Returns:

  • pandas.DataFrame: (data, columns, rows), where:
  • columns — the ids of documents, for which the Theta matrix was requested;
  • rows — the names of topics in topic model, that was used to create Theta;
  • data — content of Theta matrix.
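To make the p(c|d) values returned with predict_class_id more concrete, the sketch below assumes the usual topic-model factorization p(c|d) = sum over t of p(c|t) * p(t|d). This page does not confirm that the library computes it exactly this way; the matrices, labels and the predict_classes helper are invented for illustration.

```python
# Toy p(c|t): for each class label, one probability per topic (2 topics).
p_c_given_t = {
    "sports":   [0.9, 0.2],
    "politics": [0.1, 0.8],
}

# Toy p(t|d): one topic distribution per document.
p_t_given_d = {
    "doc_1": [1.0, 0.0],
    "doc_2": [0.5, 0.5],
}


def predict_classes(p_c_given_t, p_t_given_d):
    """Combine class-topic and topic-document probabilities into p(c|d)."""
    return {
        doc: {
            label: sum(pc * pt for pc, pt in zip(row, theta))
            for label, row in p_c_given_t.items()
        }
        for doc, theta in p_t_given_d.items()
    }


p_c_given_d = predict_classes(p_c_given_t, p_t_given_d)
```

A document that belongs entirely to the first topic (doc_1) inherits that topic's label distribution, while a mixed document (doc_2) gets a weighted average of both.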

Note:
  • ‘dense_ptdw’ mode provides simple access to the values of p(t|w,d). The resulting pandas.DataFrame object will contain a flat theta matrix (not a 3D one) where each item has multiple columns, as many as the number of tokens in that document. These columns will have the same item_id. The order of columns with equal item_id is the same as the order of tokens in the input data (batch.item.token_id).
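The flat layout described for ‘dense_ptdw’ can be pictured with plain Python lists. The sketch below does not call the library: the p(t|w,d) numbers, the document ids and the flatten_ptdw helper are all invented, and only the resulting column structure is meant to mirror the note above.

```python
# Hypothetical p(t|w,d) values for a model with two topics.
# Each document maps to one topic distribution per token, in token order.
ptdw = {
    "doc_1": [[0.7, 0.3], [0.1, 0.9], [0.5, 0.5]],   # three tokens
    "doc_2": [[0.2, 0.8]],                           # one token
}


def flatten_ptdw(ptdw):
    """Return (column_ids, columns): one column per token occurrence."""
    column_ids, columns = [], []
    for item_id, token_distributions in ptdw.items():
        for distribution in token_distributions:
            column_ids.append(item_id)      # the item_id repeats once per token
            columns.append(distribution)    # one value per topic
    return column_ids, columns


column_ids, columns = flatten_ptdw(ptdw)
```

The result has four columns in total: three labelled doc_1 and one labelled doc_2, in the same order as the tokens appear in each document.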