ARTM model

This page describes the ARTM class.

class artm.ARTM(num_topics=None, topic_names=None, num_processors=None, class_ids=None, transaction_typenames=None, scores=None, regularizers=None, num_document_passes=10, reuse_theta=False, dictionary=None, cache_theta=False, theta_columns_naming='id', seed=-1, show_progress_bars=False, theta_name=None, parent_model=None, parent_model_weight=None)

__init__(num_topics=None, topic_names=None, num_processors=None, class_ids=None, transaction_typenames=None, scores=None, regularizers=None, num_document_passes=10, reuse_theta=False, dictionary=None, cache_theta=False, theta_columns_naming='id', seed=-1, show_progress_bars=False, theta_name=None, parent_model=None, parent_model_weight=None)

Parameters:	num_topics (int) – the number of topics in the model; overwritten if topic_names is set
	num_processors (int) – number of threads used for model training; if not specified, the library detects the number of threads
	topic_names (list of str) – names of topics in the model
	class_ids (dict) – class_ids and their weights to be used in the model (keys are class_ids, values are weights); if not specified, all class_ids are used
	transaction_typenames (dict) – transaction_typenames and their weights to be used in the model (keys are transaction_typenames, values are weights); if not specified, all transaction_typenames are used. Specify the class_ids parameter when using a custom transaction_typenames parameter.
	cache_theta (bool) – whether to save the Theta matrix in the model; required if you intend to call ARTM.get_theta()
	scores (list) – list of scores (objects of artm.Score classes)
	regularizers (list) – list of regularizers (objects of artm.Regularizer classes)
	num_document_passes (int) – number of inner iterations over each document
	dictionary (str or reference to Dictionary object) – dictionary to be used for initialization; if None, nothing is done
	reuse_theta (bool) – whether to reuse Theta from the previous iteration
	theta_columns_naming (str) – either ‘id’ or ‘title’; determines how columns (documents) are named in the theta dataframe
	seed (unsigned int or -1) – seed for random initialization; -1 means no seed
	show_progress_bars (bool) – whether to show a progress bar in fit_offline, fit_online and transform operations
	theta_name (str) – name of the ptd (theta) matrix
	parent_model (ARTM) – an instance of the ARTM class to use as the parent level of the hierarchy
	parent_model_weight (float) – weight of the parent model (1.0 by default)
Important public fields:	regularizers – dict of regularizers included in the model
	scores – dict of scores included in the model
	score_tracker – dict of scoring results: the key is the score name, the value is a ScoreTracker object that holds a list of the score’s values at each synchronization (e.g. each collection pass)
Note:	Here and elsewhere in BigARTM, an empty topic_names or class_ids means that the model (or regularizer, or score) should use all topics and all class_ids. Do not confuse these with the topic_name and class_id fields!
	If some fields of regularizers or scores are not defined by the user, internal library defaults are used.
	If topic_names is None, the names are generated by BigARTM and are available via ARTM.topic_names.
	Most arguments of the ARTM constructor have a corresponding getter and setter of the same name, which allow changing them later, after the ARTM object has been created.
	Setting theta_name to a non-empty string activates an experimental mode in which the cached theta matrix is stored internally as a phi matrix whose tokens correspond to item titles, so the user must guarantee that all items have unique titles. The theta_name argument specifies the name of this matrix (for example ‘ptd’ or ‘theta’, or any name you like). Later you can retrieve this matrix with ARTM.get_phi(model_name=ARTM.theta_name), change its values with ARTM.master.attach_model(model=ARTM.theta_name), and export or import it with ARTM.master.export_model(‘ptd’, filename) and ARTM.master.import_model(‘ptd’, file_name). In this mode you can also work with the theta matrix through the ‘dump_artm_model’ method and the ‘load_artm_model’ function.
	Setting the parent_model parameter or, alternatively, calling ARTM.set_parent_model() causes this ARTM instance to behave as a child level in a hierarchical topic model. This changes a few things. First, fit_offline() respects the parent model’s topics to the degree specified by the parent_model_weight parameter. Larger values of parent_model_weight make the child model more consistent with the parent hierarchy; with parent_model_weight set to 0 the child level is effectively independent of its parent. Second, you may call ARTM.get_parent_psi() to retrieve the transition matrix, i.e. p(subtopic|topic). Third, you can no longer use ARTM.fit_online(), which will throw an exception. Fourth, you have to specify the seed parameter (otherwise the first topics in your child level will be initialized the same way as in the parent model).
	If you previously used the hARTM class, this functionality is fully equivalent; the hARTM class is now deprecated. Note that dump_artm_model and load_artm_model are only partly supported: after load_artm_model() you need to set the parent model manually via set_parent_model() and also specify a value for the ARTM.parent_model_weight property.
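A minimal usage sketch of the constructor together with fit_offline() and score_tracker may help tie the parameters together. It relies on artm.BatchVectorizer, artm.Dictionary and artm.PerplexityScore, which belong to the BigARTM Python API but are not described on this page; the function name train_offline, the score name 'perplexity', the seed value and the default arguments are illustrative assumptions, not recommendations.

```python
def train_offline(batches_dir, num_topics=20, passes=10):
    """Fit an ARTM model on pre-built batches; return (model, perplexity history)."""
    # Imported lazily so this sketch can be loaded without BigARTM installed.
    import artm

    # Assumption: batches_dir already contains BigARTM batches.
    batch_vectorizer = artm.BatchVectorizer(data_path=batches_dir,
                                            data_format='batches')
    dictionary = artm.Dictionary()
    dictionary.gather(data_path=batches_dir)

    model = artm.ARTM(
        num_topics=num_topics,
        dictionary=dictionary,   # used to initialize the model
        cache_theta=True,        # needed if get_theta() will be called later
        seed=42,                 # arbitrary fixed seed for reproducibility
        scores=[artm.PerplexityScore(name='perplexity',
                                     dictionary=dictionary)],
    )
    model.fit_offline(batch_vectorizer=batch_vectorizer,
                      num_collection_passes=passes)

    # score_tracker maps the score name to a tracker with per-pass values.
    return model, model.score_tracker['perplexity'].value
```

Calling train_offline('my_batches') with a hypothetical folder of batches would return the fitted model and the list of perplexity values, one per collection pass.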

clone()

Description:	returns a deep copy of the artm.ARTM object
Note:	This method is equivalent to calling copy.deepcopy() on your artm.ARTM object. Both perform a deep copy of the object, including a complete copy of its internal C++ state (e.g. a copy of all phi and theta matrices, scores and regularizers, as well as the ScoreTracker information with the history of the scores). Attached phi matrices are copied as dense phi matrices.

dispose()

Description:	frees all native memory allocated for this model
Note:	This method does not free memory occupied by dictionaries, because dictionaries are shared across all models. The ARTM class implements the __exit__ and __del__ methods, which automatically call dispose.

dump_artm_model(data_path)

Description:	dumps all necessary model files into the given folder
Parameters:	data_path (str) – full path to the folder (must not already exist)

fit_offline(batch_vectorizer=None, num_collection_passes=1, reset_nwt=True)

Description:	trains the topic model in offline mode
Parameters:	batch_vectorizer (object_reference) – an instance of the BatchVectorizer class
	num_collection_passes (int) – number of iterations over the whole given collection
	reset_nwt (bool) – a flag indicating whether to reset the n_wt matrix to 0

fit_online(batch_vectorizer=None, tau0=1024.0, kappa=0.7, update_every=1, apply_weight=None, decay_weight=None, update_after=None, asynchronous=False)

Description:	trains the topic model in online mode
Parameters:	batch_vectorizer (object_reference) – an instance of the BatchVectorizer class
	update_every (int) – the number of batches between model updates; the model is updated once every update_every batches
	tau0 (float) – coefficient (see ‘Update formulas’ below)
	kappa (float) – power applied to tau0 (see ‘Update formulas’ below)
	update_after (list of int) – numbers of batches to be passed before each Phi synchronization
	apply_weight (list of float) – weights applied to new counters
	decay_weight (list of float) – weights applied to old counters
	asynchronous (bool) – whether to use the asynchronous implementation of the EM-algorithm
Note:	With asynchronous=True, scores cannot be extracted via score_tracker; use get_score() instead.
Update formulas:	If apply_weight, decay_weight and update_after are all set, they are used as given. Otherwise the weights are computed from update_every, tau0 and kappa as follows:
	update_count = current_processed_docs / (batch_size * update_every)
	rho = pow(tau0 + update_count, -kappa)
	decay_weight = 1 - rho
	apply_weight = rho
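The update formulas for decay_weight and apply_weight can be sketched in plain Python. This is an illustration of the formulas as written here, not the library's code: the function name online_weights is hypothetical, and the use of true (non-rounded) division for update_count is an assumption, since the text does not say whether the quotient is rounded.

```python
def online_weights(processed_docs, batch_size, update_every=1,
                   tau0=1024.0, kappa=0.7):
    """Return (decay_weight, apply_weight) for the current update step."""
    # Number of model updates performed so far (assumed: true division).
    update_count = processed_docs / (batch_size * update_every)
    # rho shrinks as more updates accumulate, so new counters weigh less.
    rho = (tau0 + update_count) ** (-kappa)
    return 1.0 - rho, rho
```

With the default tau0=1024.0 and kappa=0.7, the very first update has rho = 1024^(-0.7) = 2^(-7), roughly 0.0078, so old counters keep almost all of their weight from the start; larger tau0 makes early updates even more conservative, and larger kappa makes rho decay faster.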

get_parent_psi()

Returns:	p(subtopic|topic) matrix

get_phi(topic_names=None, class_ids=None, model_name=None)

Description:

gets a custom Phi matrix of the model. To extract the whole Phi matrix, use ARTM.phi_ instead.

Parameters:

topic_names (list of str or str or None) – list with topics or single topic to extract, None value means all topics
class_ids (list of str or str or None) – list with class_ids or single class_id to extract, None means all class ids
model_name (str) – self.model_pwt by default; use self.model_nwt to extract unnormalized counters

Returns:

pandas.DataFrame: (data, columns, rows), where:
columns — the names of topics in topic model;
rows — the tokens of topic model;
data — content of Phi matrix.

get_phi_dense(topic_names=None, class_ids=None, model_name=None)

Description:

get phi matrix in dense format

Parameters:

topic_names (list of str or str or None) – list with topics or single topic to extract, None value means all topics
class_ids (list of str or str or None) – list with class_ids or single class_id to extract, None means all class ids
model_name (str) – self.model_pwt by default; use self.model_nwt to extract unnormalized counters

Returns:

a 3-tuple of (data, rows, columns), where
data — numpy.ndarray with Phi data (i.e., p(w|t) values)
rows — the tokens of topic model;
columns — the names of topics in topic model;

get_phi_sparse(topic_names=None, class_ids=None, model_name=None, eps=None)

Description:

get phi matrix in sparse format

Parameters:

topic_names (list of str or str or None) – list with topics or single topic to extract, None value means all topics
class_ids (list of str or str or None) – list with class_ids or single class_id to extract, None means all class ids
model_name (str) – self.model_pwt by default; use self.model_nwt to extract unnormalized counters
eps (float) – threshold to consider values as zero

Returns:

a 3-tuple of (data, rows, columns), where
data — scipy.sparse.csr_matrix with values
columns — the names of topics in topic model;
rows — the tokens of topic model;
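To make the role of eps concrete, the sketch below thresholds a small dense matrix (rows are tokens, columns are topics) into the three arrays of the compressed sparse row layout, which correspond to the data, indices and indptr attributes of scipy.sparse.csr_matrix. It is a pure-Python illustration, not the library's implementation: the function name threshold_to_csr is hypothetical, and treating values with magnitude at most eps as zero (rather than strictly below eps) is an assumption.

```python
def threshold_to_csr(dense_rows, eps):
    """Convert a dense matrix (list of rows) to CSR arrays, dropping small values."""
    values, col_indices, row_ptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            # Assumption: entries with |value| <= eps are treated as zero.
            if abs(value) > eps:
                values.append(value)
                col_indices.append(col)
        # row_ptr[i + 1] marks where row i ends in `values`.
        row_ptr.append(len(values))
    return values, col_indices, row_ptr
```

A larger eps gives a sparser matrix at the cost of discarding small probabilities, which matters mostly for models trained with sparsing regularizers, where most entries are already near zero.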

get_score(score_name)

Description:	get score after fit_offline, fit_online or transform
Parameters:	score_name (str) – the name of the score to return

get_theta(topic_names=None)

Description:	get Theta matrix for training set of documents (or cached after transform)
Parameters:	topic_names (list of str or str or None) – list with topics or single topic to extract, None means all topics
Returns:	pandas.DataFrame: (data, columns, rows), where: columns — the ids of documents, for which the Theta matrix was requested; rows — the names of topics in topic model, that was used to create Theta; data — content of Theta matrix.

get_theta_sparse(topic_names=None, eps=None)

Description:

get Theta matrix in sparse format

Parameters:

topic_names (list of str or str or None) – list with topics or single topic to extract, None means all topics
eps (float) – threshold to consider values as zero

Returns:

a 3-tuple of (data, rows, columns), where
data — scipy.sparse.csr_matrix with values
columns — the ids of documents;
rows — the names of topics in topic model;

info

Description:	returns internal diagnostics information about the model

initialize(dictionary=None)

Description:	initialize topic model before learning
Parameters:	dictionary (str or reference to Dictionary object) – loaded BigARTM collection dictionary

library_version

Description:	the version of BigARTM library in a MAJOR.MINOR.PATCH format

load(filename, model_name='p_wt')

Description:	loads from disk the topic model saved by ARTM.save()
Parameters:	filename (str) – the name of the file containing the model
	model_name (str) – the name of the matrix to be loaded, ‘p_wt’ or ‘n_wt’
Note:	The loaded model overwrites the ARTM.topic_names, class_ids and transaction_typenames fields. All transaction_typenames (class_ids) weights are set to 1.0; specify them by hand if necessary. The method call empties ARTM.score_tracker, and all regularizers and scores are forgotten. We strongly recommend resetting all important parameters of the ARTM model that were used earlier.

remove_theta()

Description:	removes cached theta matrix

reshape(topic_names=None, dictionary=None)

Description:	changes the shape of the model, i.e. adds or removes topics, or adds or removes tokens.

Parameters:	topic_names (list of str) – names of topics in the model
	dictionary (str or reference to Dictionary object) – dictionary that defines the new set of tokens

Only one of the arguments (topic_names or dictionary) can be specified at a time. For further description see methods ARTM.reshape_topics() and ARTM.reshape_tokens().

reshape_tokens(dictionary)

Description:	update tokens of the model.

Adds, removes, or reorders the tokens of the model according to a new dictionary. This operation changes n_wt matrix, but has no immediate effect on the p_wt matrix. You are expected to call ARTM.fit_offline() method to re-calculate p_wt matrix for the new set of tokens.

reshape_topics(topic_names)

Description:	update topic names of the model.

Adds, removes, or reorders columns of phi matrices according to the new set of topic names. New topics are initialized with zeros.
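The column semantics described for reshape_topics() can be illustrated on a plain list-of-rows matrix. This is only a sketch of the stated behavior (keep, drop, reorder, zero-fill), not the library's code; the function name reshape_topic_columns and the example topic names are hypothetical.

```python
def reshape_topic_columns(phi_rows, old_names, new_names):
    """Rebuild each row of a phi-like matrix for a new list of topic names."""
    # Position of every existing topic in the old column order.
    position = {name: j for j, name in enumerate(old_names)}
    return [
        # Existing topics keep their values; new topics start at zero;
        # topics absent from new_names are dropped.
        [row[position[name]] if name in position else 0.0
         for name in new_names]
        for row in phi_rows
    ]
```

For instance, passing old names ['topic0', 'topic1'] and new names ['topic1', 'extra', 'topic0'] swaps the two existing columns and inserts a zero column between them. Because new topics start as all zeros, further training is presumably needed before they carry any meaningful distribution.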

save(filename, model_name='p_wt')

Description:	saves one Phi-like matrix to disk
Parameters:	filename (str) – the name of the file to store the model in
	model_name (str) – the name of the matrix to be saved, ‘p_wt’ or ‘n_wt’

set_parent_model(parent_model, parent_model_weight=None)

Description:	sets the parent model. For more details, see the note in ARTM.__init__.
Parameters:	parent_model (ARTM) – an instance of the ARTM class to use as the parent level of the hierarchy
	parent_model_weight (float) – weight of the parent model

topic_names

Description:

Gets or sets the list of topic names of the model.

Note:

Setting topic names allows you to put new labels on the existing topics. To add, remove or reorder topics, use the ARTM.reshape_topics() method.
In ARTM, topic names are used just as string identifiers that give a unique name to each column of the phi matrix. Typically you want to set topic names to something like “topic0”, “topic1”, etc. Later operations like get_phi() allow you to specify which topics you need to retrieve. Most regularizers allow you to limit the set of topics they act upon. If you configure a rich set of regularizers, it is important to design your topic names according to how the topics are regularized. For example, you may use names obj0, obj1, …, objN for objective topics (those where you enable sparsity regularizers), and back0, back1, …, backM for background topics (those where you enable smoothing regularizers).

transform(batch_vectorizer=None, theta_matrix_type='dense_theta', predict_class_id=None)

Description:	finds the Theta matrix for new documents
Parameters:	batch_vectorizer (object_reference) – an instance of the BatchVectorizer class
	theta_matrix_type (str) – type of matrix to be returned; possible values: ‘dense_theta’, ‘dense_ptdw’, ‘cache’, None; default is ‘dense_theta’
	predict_class_id (str) – class_id of a target modality to predict. When this option is enabled, the resulting columns of the theta matrix correspond to unique labels of the target modality. The values represent p(c|d), the probability of class label c for document d.
Returns:	pandas.DataFrame: (data, columns, rows), where: columns — the ids of documents, for which the Theta matrix was requested; rows — the names of topics in topic model, that was used to create Theta; data — content of Theta matrix.
Note:	The ‘dense_ptdw’ mode provides simple access to the values of p(t|w,d). The resulting pandas.DataFrame object contains a flat theta matrix (not 3D) in which each item has multiple columns, as many as the number of tokens in that document. These columns all have the same item_id. The order of columns with equal item_id is the same as the order of tokens in the input data (batch.item.token_id).
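The flat ‘dense_ptdw’ column layout can be pictured with a short sketch that builds the column labels for a few documents. It only illustrates the layout described in the note; the function name ptdw_column_labels and the (item_id, tokens) input representation are assumptions made for this example.

```python
def ptdw_column_labels(items):
    """Return the column labels of a flat ptdw matrix.

    `items` is a list of (item_id, tokens) pairs in input order.
    """
    labels = []
    for item_id, tokens in items:
        # One column per token occurrence, all labelled with the same item_id,
        # in the same order as the tokens appear in the input.
        labels.extend([item_id] * len(tokens))
    return labels
```

A document with three tokens therefore contributes three adjacent columns with the same label, so selecting all columns with a given item_id recovers the per-token topic distributions of that document in token order.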

transform_sparse(batch_vectorizer, eps=None)

Description:

find Theta matrix for new documents as sparse scipy matrix

Parameters:

batch_vectorizer (object_reference) – an instance of BatchVectorizer class
eps (float) – threshold to consider values as zero

Returns:

a 3-tuple of (data, rows, columns), where
data — scipy.sparse.csr_matrix with values
columns — the ids of documents;
rows — the names of topics in topic model;