ARTM model

This page describes the ARTM class.

class artm.ARTM(num_topics=None, topic_names=None, num_processors=None, class_ids=None, transaction_typenames=None, scores=None, regularizers=None, num_document_passes=10, reuse_theta=False, dictionary=None, cache_theta=False, theta_columns_naming='id', seed=-1, show_progress_bars=False, theta_name=None, parent_model=None, parent_model_weight=None)
__init__(num_topics=None, topic_names=None, num_processors=None, class_ids=None, transaction_typenames=None, scores=None, regularizers=None, num_document_passes=10, reuse_theta=False, dictionary=None, cache_theta=False, theta_columns_naming='id', seed=-1, show_progress_bars=False, theta_name=None, parent_model=None, parent_model_weight=None)
Parameters:
  • num_topics (int) – the number of topics in the model; overwritten if topic_names is set
  • num_processors (int) – how many threads to use for model training; if not specified, the number of threads is detected by the library
  • topic_names (list of str) – names of topics in model
  • class_ids (dict) – class_ids and their weights to be used in the model; the key is a class_id and the value is its weight. If not specified, all class_ids are used.
  • transaction_typenames (dict) – transaction_typenames and their weights to be used in the model; the key is a transaction_typename and the value is its weight. If not specified, all transaction_typenames are used. Specify the class_ids parameter when using a custom transaction_typenames parameter.
  • cache_theta (bool) – whether to save the Theta matrix in the model. Required if you intend to call ARTM.get_theta()
  • scores (list) – list of scores (objects of artm.*Score classes)
  • regularizers (list) – list with regularizers (objects of artm.*Regularizer classes)
  • num_document_passes (int) – number of inner iterations over each document
  • dictionary (str or reference to Dictionary object) – dictionary to be used for initialization, if None nothing will be done
  • reuse_theta (bool) – whether to reuse Theta from the previous iteration
  • theta_columns_naming (str) – either ‘id’ or ‘title’, determines how to name columns (documents) in theta dataframe
  • seed (unsigned int or -1) – seed for random initialization, -1 means no seed
  • show_progress_bars (bool) – whether to show a progress bar in fit_offline, fit_online and transform operations
  • theta_name (str) – name of the ptd (theta) matrix
  • parent_model (ARTM) – An instance of ARTM class to use as parent level of hierarchy
  • parent_model_weight (float) – weight of parent model (by default 1.0)
Important public fields:
 
  • regularizers: contains dict of regularizers, included into model
  • scores: contains dict of scores, included into model
  • score_tracker: contains dict of scoring results: key — score name, value — ScoreTracker object, which contains info about values of score on each synchronization (e.g. collection pass) in list
Note:
  • Here and anywhere in BigARTM, empty topic_names or class_ids means that the model (or regularizer, or score) should use all topics and class_ids. Do not confuse these with the topic_name and class_id fields!
  • If some fields of regularizers or scores are not defined by the user, internal library defaults are used.
  • If the field ‘topic_names’ is None, it will be generated by BigARTM and will be available via ARTM.topic_names.
  • Most arguments of the ARTM constructor have a corresponding setter and getter of the same name, which allow you to change them later, after the ARTM object has been created.
  • Setting theta_name to a non-empty string activates an experimental mode where the cached theta matrix is internally stored as a phi matrix with tokens corresponding to item titles, so the user must guarantee that all items have unique titles. With the theta_name argument you specify the name of this matrix (for example ‘ptd’ or ‘theta’, or whatever name you like). Later you can retrieve this matrix with ARTM.get_phi(model_name=ARTM.theta_name), change its values with ARTM.master.attach_model(model=ARTM.theta_name), and export/import this matrix with ARTM.master.export_model(‘ptd’, filename) and ARTM.master.import_model(‘ptd’, file_name). In this case you can also work with the theta matrix when using the ‘dump_artm_model’ method and the ‘load_artm_model’ function.
  • Setting the parent_model parameter or, alternatively, calling ARTM.set_parent_model() causes this ARTM instance to behave as a child level in a hierarchical topic model. This changes a few things. First, the fit_offline() method will respect the parent model’s topics, as specified by the parent_model_weight parameter. Larger values of parent_model_weight make your child model more consistent with the parent hierarchy. If you set parent_model_weight to 0, your child level will be effectively independent of its parent. Second, you may call ARTM.get_parent_psi() to retrieve a transition matrix, i.e. p(subtopic|topic). Third, you can no longer use ARTM.fit_online(), which will throw an exception. Fourth, you have to specify the seed parameter (otherwise the first topics in your child level will be initialized the same way as in the parent’s model). If you previously used the hARTM class, this functionality is fully equivalent. The hARTM class is now deprecated. Note that dump_artm_model and load_artm_model are only partly supported. After load_artm_model() you need to set the parent model manually via set_parent_model(), and also to specify a value for the ARTM.parent_model_weight property.
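As a rough illustration of how the constructor arguments above fit together, the sketch below collects a few of them into a plain dictionary. The topic names, the modality name '@default_class', its weight and the seed value are assumptions chosen for this example. make_model() is a hypothetical helper that is only defined, not called, so the snippet runs even where BigARTM is not installed.

```python
# Illustrative constructor arguments; every concrete value is an assumption.
topic_names = ['topic{}'.format(i) for i in range(5)]

model_kwargs = {
    'topic_names': topic_names,            # takes precedence over num_topics
    'class_ids': {'@default_class': 1.0},  # key: class_id, value: weight
    'num_document_passes': 10,             # inner iterations over each document
    'cache_theta': True,                   # needed for a later ARTM.get_theta() call
    'theta_columns_naming': 'title',       # either 'id' or 'title'
    'seed': 42,                            # -1 would mean "no seed"
}


def make_model(kwargs):
    """Hypothetical helper: build an artm.ARTM from keyword arguments."""
    import artm  # imported lazily so the sketch runs without BigARTM
    return artm.ARTM(**kwargs)
```

Since most constructor arguments also have setters of the same name, a dictionary like this is only one possible way to keep a configuration in one place.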
clone()
Description:

returns a deep copy of the artm.ARTM object

Note:
  • This method is equivalent to copy.deepcopy() of your artm.ARTM object. Both methods perform deep copy of the object, including a complete copy of its internal C++ state (e.g. a copy of all phi and theta matrices, scores and regularizers, as well as ScoreTracker information with history of the scores).
  • Attached phi matrices are copied as dense phi matrices.
dispose()
Description:

free all native memory, allocated for this model

Note:
  • This method does not free memory occupied by dictionaries, because dictionaries are shared across all models
  • The ARTM class implements __exit__ and __del__ methods, which automatically call dispose.
dump_artm_model(data_path)
Description:dumps all necessary model files into the given folder.
Parameters:data_path (str) – full path to the folder (must not already exist)
fit_offline(batch_vectorizer=None, num_collection_passes=1, reset_nwt=True)
Description:

runs training of the topic model in offline mode

Parameters:
  • batch_vectorizer (object_reference) – an instance of BatchVectorizer class
  • num_collection_passes (int) – number of iterations over whole given collection
  • reset_nwt (bool) – a flag indicating whether to reset n_wt matrix to 0.
fit_online(batch_vectorizer=None, tau0=1024.0, kappa=0.7, update_every=1, apply_weight=None, decay_weight=None, update_after=None, asynchronous=False)
Description:

runs training of the topic model in online mode

Parameters:
  • batch_vectorizer (object_reference) – an instance of BatchVectorizer class
  • update_every (int) – the number of batches after which the model is updated
  • tau0 (float) – coefficient (see ‘Update formulas’ paragraph)
  • kappa (float) – power for tau0 (see ‘Update formulas’ paragraph)
  • update_after (list of int) – number of batches to be passed for Phi synchronizations
  • apply_weight (list of float) – weight of applying new counters
  • decay_weight (list of float) – weight of applying old counters
  • asynchronous (bool) – whether to use the asynchronous implementation of the EM-algorithm
Note:

With asynchronous=True, scores cannot be extracted via score_tracker. Use get_score() instead.

Update formulas:
 
  • The formulas for decay_weight and apply_weight:
  • update_count = current_processed_docs / (batch_size * update_every);
  • rho = pow(tau0 + update_count, -kappa);
  • decay_weight = 1-rho;
  • apply_weight = rho;
  • if apply_weight, decay_weight and update_after are all set, they are used as given; otherwise the formulas above are applied (with update_every, tau0 and kappa)
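The update formulas above can be sketched in a few lines of plain Python. This is only a transcription of the formulas as written on this page, not the library's internal code; the function name online_weights and the batch size of 1000 documents are assumptions made for the example.

```python
def online_weights(current_processed_docs, batch_size,
                   update_every=1, tau0=1024.0, kappa=0.7):
    """Return (decay_weight, apply_weight) following the formulas above."""
    update_count = current_processed_docs / (batch_size * update_every)
    rho = pow(tau0 + update_count, -kappa)
    return 1.0 - rho, rho


# Weights after 0, 10 and 100 batches of 1000 documents each (hypothetical sizes).
schedule = [online_weights(n * 1000, batch_size=1000) for n in (0, 10, 100)]
for decay_w, apply_w in schedule:
    print('decay_weight={:.6f} apply_weight={:.6f}'.format(decay_w, apply_w))
```

With the default tau0 and kappa, the apply weight starts near 1/128 and shrinks as more batches are processed, so later updates change the model less.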
get_parent_psi()
Returns:p(subtopic|topic) matrix
get_phi(topic_names=None, class_ids=None, model_name=None)
Description:

gets a custom Phi matrix of the model. To extract the whole Phi matrix, use ARTM.phi_ instead.

Parameters:
  • topic_names (list of str or str or None) – list with topics or single topic to extract, None value means all topics
  • class_ids (list of str or str or None) – list with class_ids or single class_id to extract, None means all class ids
  • model_name (str) – self.model_pwt by default; use self.model_nwt to extract unnormalized counters
Returns:

  • pandas.DataFrame: (data, columns, rows), where:
  • columns — the names of topics in topic model;
  • rows — the tokens of topic model;
  • data — content of Phi matrix.

get_phi_dense(topic_names=None, class_ids=None, model_name=None)
Description:

get phi matrix in dense format

Parameters:
  • topic_names (list of str or str or None) – list with topics or single topic to extract, None value means all topics
  • class_ids (list of str or str or None) – list with class_ids or single class_id to extract, None means all class ids
  • model_name (str) – self.model_pwt by default; use self.model_nwt to extract unnormalized counters
Returns:

  • a 3-tuple of (data, rows, columns), where
  • data — numpy.ndarray with Phi data (i.e., p(w|t) values)
  • rows — the tokens of topic model;
  • columns — the names of topics in topic model;

get_phi_sparse(topic_names=None, class_ids=None, model_name=None, eps=None)
Description:

get phi matrix in sparse format

Parameters:
  • topic_names (list of str or str or None) – list with topics or single topic to extract, None value means all topics
  • class_ids (list of str or str or None) – list with class_ids or single class_id to extract, None means all class ids
  • model_name (str) – self.model_pwt by default; use self.model_nwt to extract unnormalized counters
  • eps (float) – threshold to consider values as zero
Returns:

  • a 3-tuple of (data, rows, columns), where
  • data — scipy.sparse.csr_matrix with values
  • columns — the names of topics in topic model;
  • rows — the tokens of topic model;

get_score(score_name)
Description:get score after fit_offline, fit_online or transform
Parameters:score_name (str) – the name of the score to return
get_theta(topic_names=None)
Description:get Theta matrix for training set of documents (or cached after transform)
Parameters:topic_names (list of str or str or None) – list with topics or single topic to extract, None means all topics
Returns:
  • pandas.DataFrame: (data, columns, rows), where:
  • columns — the ids of documents, for which the Theta matrix was requested;
  • rows — the names of topics in topic model, that was used to create Theta;
  • data — content of Theta matrix.
get_theta_sparse(topic_names=None, eps=None)
Description:

get Theta matrix in sparse format

Parameters:
  • topic_names (list of str or str or None) – list with topics or single topic to extract, None means all topics
  • eps (float) – threshold to consider values as zero
Returns:

  • a 3-tuple of (data, rows, columns), where
  • data — scipy.sparse.csr_matrix with values
  • columns — the ids of documents;
  • rows — the names of topics in topic model;

info
Description:returns internal diagnostics information about the model
initialize(dictionary=None)
Description:initialize topic model before learning
Parameters:dictionary (str or reference to Dictionary object) – loaded BigARTM collection dictionary
library_version
Description:the version of BigARTM library in a MAJOR.MINOR.PATCH format
load(filename, model_name='p_wt')
Description:

loads from disk the topic model saved by ARTM.save()

Parameters:
  • filename (str) – the name of file containing model
  • model_name (str) – the name of the matrix to be loaded, ‘p_wt’ or ‘n_wt’
Note:
  • Loaded model will overwrite ARTM.topic_names, class_ids and transaction_typenames fields.
  • All transaction_typenames (class_ids) weights will be set to 1.0, you need to specify them by hand if it’s necessary.
  • The method call will empty ARTM.score_tracker.
  • All regularizers and scores will be forgotten.
  • etc.
  • We strongly recommend you to reset all important parameters of the ARTM model, used earlier.
remove_theta()
Description:removes cached theta matrix
reshape(topic_names=None, dictionary=None)
Description:changes the shape of the model,

i.e. adds/removes topics, or adds/removes tokens.

Parameters:
  • topic_names (list of str) – names of topics in model
  • dictionary (str or reference to Dictionary object) – dictionary that defines the new set of tokens

Only one of the arguments (topic_names or dictionary) can be specified at a time. For further description see methods ARTM.reshape_topics() and ARTM.reshape_tokens().

reshape_tokens(dictionary)
Description:update tokens of the model.

Adds, removes, or reorders the tokens of the model according to a new dictionary. This operation changes n_wt matrix, but has no immediate effect on the p_wt matrix. You are expected to call ARTM.fit_offline() method to re-calculate p_wt matrix for the new set of tokens.

reshape_topics(topic_names)
Description:update topic names of the model.

Adds, removes, or reorders columns of phi matrices according to the new set of topic names. New topics are initialized with zeros.

save(filename, model_name='p_wt')
Description:

saves one Phi-like matrix to disk

Parameters:
  • filename (str) – the name of file to store model
  • model_name (str) – the name of matrix to be saved, ‘p_wt’ or ‘n_wt’
set_parent_model(parent_model, parent_model_weight=None)
Description:sets the parent model. For more details, see the note in ARTM.__init__.
Parameters:
  • parent_model (ARTM) – An instance of ARTM class to use as parent level of hierarchy
  • parent_model_weight (float) – weight of parent model (by default 1.0)
topic_names
Description:

Gets or sets the list of topic names of the model.

Note:
  • Setting topic names allows you to put new labels on the existing topics. To add, remove or reorder topics use the ARTM.reshape_topics() method.
  • In ARTM topic names are used just as string identifiers, which give a unique name to each column of the phi matrix. Typically you want to set topic names as something like “topic0”, “topic1”, etc. Later operations like get_phi() allow you to specify which topics you need to retrieve. Most regularizers allow you to limit the set of topics they act upon. If you configure a rich set of regularizers, it is important to design your topic names according to how they are regularized. For example, you may use names obj0, obj1, …, objN for objective topics (those where you enable sparsity regularizers), and back0, back1, …, backM for background topics (those where you enable smoothing regularizers).
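The naming scheme suggested above could be generated as follows. This is a minimal sketch in plain Python; the group sizes are hypothetical, and the variable names objective_topics and background_topics are introduced only for this example.

```python
num_objective, num_background = 8, 2  # hypothetical sizes

objective_topics = ['obj{}'.format(i) for i in range(num_objective)]
background_topics = ['back{}'.format(i) for i in range(num_background)]
topic_names = objective_topics + background_topics

# Each regularizer would then be limited to one of the two groups, e.g. by
# passing objective_topics to a sparsing regularizer and background_topics
# to a smoothing one (the exact regularizer arguments are not shown here).
print(topic_names)
```

Keeping the two groups as separate lists makes it easy to pass the same names to get_phi() later when only one group is of interest.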
transform(batch_vectorizer=None, theta_matrix_type='dense_theta', predict_class_id=None)
Description:

find Theta matrix for new documents

Parameters:
  • batch_vectorizer (object_reference) – an instance of BatchVectorizer class
  • theta_matrix_type (str) – type of matrix to be returned, possible values: ‘dense_theta’, ‘dense_ptdw’, ‘cache’, None, default=’dense_theta’
  • predict_class_id (str) – class_id of a target modality to predict. When this option is enabled the resulting columns of theta matrix will correspond to unique labels of a target modality. The values will represent p(c|d), which give the probability of class label c for document d.
Returns:

  • pandas.DataFrame: (data, columns, rows), where:
  • columns — the ids of documents, for which the Theta matrix was requested;
  • rows — the names of topics in topic model, that was used to create Theta;
  • data — content of Theta matrix.

Note:
  • ‘dense_ptdw’ mode provides simple access to values of p(t|w,d). The resulting pandas.DataFrame object will contain a flat theta matrix (no 3D) where each item has multiple columns - as many as the number of tokens in that document. These columns will have the same item_id. The order of columns with equal item_id is the same as the order of tokens in the input data (batch.item.token_id).
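To make the flat column layout of ‘dense_ptdw’ more concrete, the sketch below mimics only the column labelling described in the note; it does not call BigARTM. The two documents and their tokens are hypothetical, and column_ids and column_tokens are names introduced for this example.

```python
# Hypothetical input: item_id -> tokens in the order they appear in the batch.
documents = {
    'doc_a': ['topic', 'model', 'topic'],
    'doc_b': ['sparse', 'matrix'],
}

# One column per token occurrence; columns of one document share its item_id.
column_ids = []
column_tokens = []
for item_id, tokens in documents.items():
    for token in tokens:
        column_ids.append(item_id)
        column_tokens.append(token)

print(column_ids)
print(column_tokens)
```

Under these assumptions the resulting frame would have five columns, three labelled with the first item_id and two with the second, each holding p(t|w,d) for one token occurrence.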
transform_sparse(batch_vectorizer, eps=None)
Description:

find Theta matrix for new documents as sparse scipy matrix

Parameters:
  • batch_vectorizer (object_reference) – an instance of BatchVectorizer class
  • eps (float) – threshold to consider values as zero
Returns:

  • a 3-tuple of (data, rows, columns), where
  • data — scipy.sparse.csr_matrix with values
  • columns — the ids of documents;
  • rows — the names of topics in topic model;