ARTM model

This page describes the ARTM class.

class artm.ARTM(num_topics=None, topic_names=None, num_processors=None, class_ids=None, scores=None, regularizers=None, num_document_passes=10, reuse_theta=False, dictionary=None, cache_theta=False, theta_columns_naming='id', seed=-1, show_progress_bars=False, theta_name=None)
__init__(num_topics=None, topic_names=None, num_processors=None, class_ids=None, scores=None, regularizers=None, num_document_passes=10, reuse_theta=False, dictionary=None, cache_theta=False, theta_columns_naming='id', seed=-1, show_progress_bars=False, theta_name=None)
  • num_topics (int) – the number of topics in the model; overwritten if topic_names is set
  • num_processors (int) – number of threads used for model training; if not specified, the library detects the number of threads itself
  • topic_names (list of str) – names of topics in model
  • class_ids (dict) – list of class_ids and their weights to be used in model, key — class_id, value — weight, if not specified then all class_ids will be used
  • cache_theta (bool) – whether to save the Theta matrix in the model. Required if you plan to call ARTM.get_theta()
  • scores (list) – list of scores (objects of artm.*Score classes)
  • regularizers (list) – list with regularizers (objects of artm.*Regularizer classes)
  • num_document_passes (int) – number of inner iterations over each document
  • dictionary (str or reference to Dictionary object) – dictionary to be used for initialization, if None nothing will be done
  • reuse_theta (bool) – reuse Theta from previous iteration or not
  • theta_columns_naming (str) – either ‘id’ or ‘title’, determines how to name columns (documents) in theta dataframe
  • seed (unsigned int or -1) – seed for random initialization, -1 means no seed
  • show_progress_bars – a boolean flag indicating whether to show progress bar in fit_offline, fit_online and transform operations.
  • theta_name – string, name of ptd (theta) matrix
Important public fields:
  • regularizers: contains dict of regularizers, included into model
  • scores: contains dict of scores, included into model
  • score_tracker: contains dict of scoring results: key — score name, value — ScoreTracker object, which contains info about values of score on each synchronization (e.g. collection pass) in list
  • Here and anywhere in BigARTM empty topic_names or class_ids means that model (or regularizer, or score) should use all topics or class_ids.
  • If some fields of regularizers or scores are not defined by user — internal lib defaults would be used.
  • If field ‘topic_names’ is None, it will be generated by BigARTM and will be available using ARTM.topic_names().
  • Most arguments of the ARTM constructor have a corresponding setter and getter of the same name, which lets you change them later, after the ARTM object has been created.
  • Setting theta_name to a non-empty string activates an experimental mode where the cached theta matrix is stored internally as a phi matrix whose tokens correspond to item titles, so the user must guarantee that all items have unique titles. The theta_name argument specifies the name of this matrix (for example ‘ptd’ or ‘theta’, or any name you like). Later you can retrieve this matrix with ARTM.get_phi(model_name=ARTM.theta_name), change its values with ARTM.master.attach_model(model=ARTM.theta_name), and export/import it with ARTM.master.export_model(‘ptd’, filename) and ARTM.master.import_model(‘ptd’, file_name). In this mode you can also work with the theta matrix when using the ‘dump_artm_model’ method and the ‘load_artm_model’ function.

clone()

returns a deep copy of the artm.ARTM object

  • This method is equivalent to copy.deepcopy() of your artm.ARTM object. Both methods perform deep copy of the object, including a complete copy of its internal C++ state (e.g. a copy of all phi and theta matrices, scores and regularizers, as well as ScoreTracker information with history of the scores).
  • Attached phi matrices are copied as dense phi matrices.

dispose()

frees all native memory allocated for this model

  • This method does not free memory occupied by dictionaries, because dictionaries are shared across all models
  • The ARTM class implements __exit__ and __del__ methods, which automatically call dispose.
dump_artm_model(data_path)

Description: dumps all necessary model files into the given folder.
Parameters: data_path (str) – full path to the folder (must not already exist)
fit_offline(batch_vectorizer=None, num_collection_passes=1)

trains the topic model in offline mode

  • batch_vectorizer (object_reference) – an instance of BatchVectorizer class
  • num_collection_passes (int) – number of iterations over the whole given collection
fit_online(batch_vectorizer=None, tau0=1024.0, kappa=0.7, update_every=1, apply_weight=None, decay_weight=None, update_after=None, async=False)

trains the topic model in online mode

  • batch_vectorizer (object_reference) – an instance of BatchVectorizer class
  • update_every (int) – the number of batches between model updates; the model is updated once per update_every batches
  • tau0 (float) – coefficient (see ‘Update formulas’ paragraph)
  • kappa (float) – power for tau0 (see ‘Update formulas’ paragraph)
  • update_after (list of int) – number of batches to be passed for Phi synchronizations
  • apply_weight (list of float) – weight of applying new counters
  • decay_weight (list of float) – weight of applying old counters
  • async (bool) – whether to use the async implementation of the EM-algorithm

With async=True, scores cannot be extracted via score_tracker. Use get_score() instead.

Update formulas:
  • The formulas for decay_weight and apply_weight:
  • update_count = current_processed_docs / (batch_size * update_every);
  • rho = pow(tau0 + update_count, -kappa);
  • decay_weight = 1-rho;
  • apply_weight = rho;
  • If apply_weight, decay_weight and update_after are set, they are used; otherwise the formulas above are applied (with update_every, tau0 and kappa).
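
The update formulas above can be checked by hand with a few lines of plain Python. This is a minimal sketch rather than BigARTM's own code: the helper name online_update_weights and the sample document counts are assumptions made for illustration, while the default tau0 and kappa values are taken from the fit_online signature.

```python
def online_update_weights(current_processed_docs, batch_size,
                          update_every=1, tau0=1024.0, kappa=0.7):
    """Return (decay_weight, apply_weight) following the formulas above.

    Hypothetical helper for illustration; not part of the artm package.
    """
    update_count = current_processed_docs / (batch_size * update_every)
    rho = pow(tau0 + update_count, -kappa)
    return 1 - rho, rho


# Hypothetical stream: batches of 1000 documents, update after every batch.
for docs in (1000, 10_000, 100_000):
    decay_weight, apply_weight = online_update_weights(docs, batch_size=1000)
    print(f"docs={docs:>7} decay_weight={decay_weight:.6f} "
          f"apply_weight={apply_weight:.6f}")
```

For positive kappa, rho shrinks as more documents are processed, so new counters receive less weight and the accumulated ones receive more.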
get_phi(topic_names=None, class_ids=None, model_name=None)

gets a custom Phi matrix of the model. To extract the whole Phi matrix, use ARTM.phi_ instead.

  • topic_names (list of str or str or None) – list with topics or single topic to extract, None value means all topics
  • class_ids (list of str or str or None) – list with class_ids or single class_id to extract, None means all class ids
  • model_name (str) – name of the matrix to extract; self.model_pwt by default, self.model_nwt can be used to extract unnormalized counters

  • pandas.DataFrame: (data, columns, rows), where:
  • columns — the names of topics in topic model;
  • rows — the tokens of topic model;
  • data — content of Phi matrix.

get_phi_dense(topic_names=None, class_ids=None, model_name=None)

get phi matrix in dense format

  • topic_names (list of str or str or None) – list with topics or single topic to extract, None value means all topics
  • class_ids (list of str or str or None) – list with class_ids or single class_id to extract, None means all class ids
  • model_name (str) – name of the matrix to extract; self.model_pwt by default, self.model_nwt can be used to extract unnormalized counters

  • a 3-tuple of (data, rows, columns), where
  • data — numpy.ndarray with Phi data (i.e., p(w|t) values)
  • rows — the tokens of topic model;
  • columns — the names of topics in topic model;

get_phi_sparse(topic_names=None, class_ids=None, model_name=None, eps=None)

get phi matrix in sparse format

  • topic_names (list of str or str or None) – list with topics or single topic to extract, None value means all topics
  • class_ids (list of str or str or None) – list with class_ids or single class_id to extract, None means all class ids
  • model_name (str) – name of the matrix to extract; self.model_pwt by default, self.model_nwt can be used to extract unnormalized counters
  • eps (float) – threshold to consider values as zero

  • a 3-tuple of (data, rows, columns), where
  • data — scipy.sparse.csr_matrix with values
  • columns — the names of topics in topic model;
  • rows — the tokens of topic model;
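
The eps parameter can be pictured as a cutoff applied to each matrix entry. The sketch below is plain Python, not the BigARTM implementation: the helper name sparsify, the toy matrix, and the exact comparison rule (entries whose absolute value is below eps are dropped) are assumptions for illustration. The real method returns a scipy.sparse.csr_matrix, as described above.

```python
def sparsify(dense, eps):
    """Return (row, column, value) triplets for entries with |value| >= eps.

    Hypothetical helper; the comparison rule is an assumption.
    """
    return [
        (i, j, value)
        for i, row in enumerate(dense)
        for j, value in enumerate(row)
        if abs(value) >= eps
    ]


# Toy Phi-like matrix: 3 tokens (rows) by 2 topics (columns).
phi = [
    [0.70, 0.0001],
    [0.30, 0.4999],
    [0.00, 0.5000],
]
for triplet in sparsify(phi, eps=0.001):
    print(triplet)
```

A larger eps drops more near-zero entries, which makes the stored matrix smaller at the cost of discarding small probabilities.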

get_score(score_name)

Description: gets the score after fit_offline, fit_online or transform
Parameters: score_name (str) – the name of the score to return
get_theta(topic_names=None)

Description: gets the Theta matrix for the training set of documents (or the one cached after transform)
Parameters: topic_names (list of str or str or None) – list with topics or single topic to extract, None means all topics
  • pandas.DataFrame: (data, columns, rows), where:
  • columns — the ids of documents, for which the Theta matrix was requested;
  • rows — the names of topics in topic model, that was used to create Theta;
  • data — content of Theta matrix.
get_theta_sparse(topic_names=None, eps=None)

get Theta matrix in sparse format

  • topic_names (list of str or str or None) – list with topics or single topic to extract, None means all topics
  • eps (float) – threshold to consider values as zero

  • a 3-tuple of (data, rows, columns), where
  • data — scipy.sparse.csr_matrix with values
  • columns — the ids of documents;
  • rows — the names of topics in topic model;

info

Description: returns internal diagnostic information about the model
initialize(dictionary=None)

Description: initializes the topic model before learning
Parameters: dictionary (str or reference to Dictionary object) – loaded BigARTM collection dictionary
library_version

Description: the version of the BigARTM library in MAJOR.MINOR.PATCH format
load(filename, model_name='p_wt')

loads from disk a topic model saved by ARTM.save()

  • filename (str) – the name of file containing model
  • model_name (str) – the name of the matrix to be loaded, ‘p_wt’ or ‘n_wt’
  • Loaded model will overwrite ARTM.topic_names and class_ids fields.
  • All class_ids weights will be set to 1.0, you need to specify them by hand if it’s necessary.
  • The method call will empty ARTM.score_tracker.
  • All regularizers and scores will be forgotten.
  • etc.
  • We strongly recommend resetting all important parameters of the ARTM model that were set earlier.
remove_theta()

Description: removes the cached theta matrix
reshape_topics(topic_names)

Description: updates the topic names of the model.

Adds, removes, and reorders columns of phi matrices according to the new set of topic names. New topics are initialized with zeros.

save(filename, model_name='p_wt')

saves one Phi-like matrix to disk

  • filename (str) – the name of file to store model
  • model_name (str) – the name of matrix to be saved, ‘p_wt’ or ‘n_wt’

topic_names

Gets or sets the list of topic names of the model.

  • Setting topic name allows you to put new labels on the existing topics. To add, remove or reorder topics use ARTM.reshape_topics() method.
  • In ARTM, topic names are used just as string identifiers, which give a unique name to each column of the phi matrix. Typically you want to set topic names to something like “topic0”, “topic1”, etc. Later operations like get_phi() allow you to specify which topics you need to retrieve. Most regularizers allow you to limit the set of topics they act upon. If you configure a rich set of regularizers, it is important to design your topic names according to how they are regularized. For example, you may use names obj0, obj1, ..., objN for objective topics (those where you enable sparsity regularizers), and back0, back1, ..., backM for background topics (those where you enable smoothing regularizers).
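
The naming scheme described above is easy to generate programmatically. A minimal sketch, assuming 8 objective and 2 background topics (both counts are made up for illustration); the resulting lists are what you might pass as topic_names to the constructor, or use to restrict a regularizer to a subset of topics.

```python
num_objective = 8   # assumed count of objective topics
num_background = 2  # assumed count of background topics

objective_topics = [f"obj{i}" for i in range(num_objective)]
background_topics = [f"back{i}" for i in range(num_background)]

# Full list of topic names; each name must be unique.
topic_names = objective_topics + background_topics
assert len(set(topic_names)) == len(topic_names)

print(topic_names)
```

Keeping the two groups in separate lists makes it straightforward to hand each group to a different set of regularizers later.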
transform(batch_vectorizer=None, theta_matrix_type='dense_theta', predict_class_id=None)

find Theta matrix for new documents

  • batch_vectorizer (object_reference) – an instance of BatchVectorizer class
  • theta_matrix_type (str) – type of matrix to be returned, possible values: ‘dense_theta’, ‘dense_ptdw’, ‘cache’, None, default=’dense_theta’
  • predict_class_id (str) – class_id of a target modality to predict. When this option is enabled the resulting columns of theta matrix will correspond to unique labels of a target modality. The values will represent p(c|d), which give the probability of class label c for document d.

  • pandas.DataFrame: (data, columns, rows), where:
  • columns — the ids of documents, for which the Theta matrix was requested;
  • rows — the names of topics in topic model, that was used to create Theta;
  • data — content of Theta matrix.

  • ‘dense_ptdw’ mode provides simple access to values of p(t|w,d). The resulting pandas.DataFrame object will contain a flat theta matrix (no 3D) where each item has multiple columns - as many as the number of tokens in that document. These columns will have the same item_id. The order of columns with equal item_id is the same as the order of tokens in the input data (batch.item.token_id).
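
To make the ‘dense_ptdw’ column layout concrete, the sketch below builds the column labels for two made-up documents. It is plain Python that only mimics the layout described above; the item ids and tokens are hypothetical and no BigARTM call is involved.

```python
# Hypothetical documents: item_id -> tokens in their input order.
documents = {
    101: ["cat", "sat", "mat"],
    102: ["dog", "ran"],
}

# One column per token occurrence; columns of a document share its item_id.
columns = [item_id for item_id, tokens in documents.items() for _ in tokens]

# Pair each column with the token it refers to, preserving input order.
layout = [
    (item_id, token)
    for item_id, tokens in documents.items()
    for token in tokens
]

print(columns)
print(layout)
```

Because the column labels repeat, selecting by item_id in the resulting frame yields all token columns of that document, in token order.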
transform_sparse(batch_vectorizer, eps=None)

find Theta matrix for new documents as sparse scipy matrix

  • batch_vectorizer (object_reference) – an instance of BatchVectorizer class
  • eps (float) – threshold to consider values as zero

  • a 3-tuple of (data, rows, columns), where
  • data — scipy.sparse.csr_matrix with values
  • columns — the ids of documents;
  • rows — the names of topics in topic model;