ARTM model

This page describes the ARTM class.

class artm.ARTM(num_processors=0, topic_names=None, num_topics=10, class_ids=None, cache_theta=True, scores=None, regularizers=None, theta_columns_naming='id')

ARTM represents a topic model (public class)

Parameters:

num_processors (int) – number of threads used for model training; if not specified, the lib detects the number of threads automatically
topic_names (list of str) – names of topics in the model; if not specified, they are auto-generated by the lib according to num_topics
num_topics (int) – number of topics in the model (used only if topic_names is not specified), default=10
class_ids (dict) – class_ids and their weights to be used in the model, key — class_id, value — weight; if not specified, all class_ids are used
cache_theta (bool) – whether to save the Theta matrix in the model; necessary if ARTM.get_theta() will be used, default=True
scores (list) – list of scores (objects of artm.***Score classes), default=None
regularizers (list) – list of regularizers (objects of artm.***Regularizer classes), default=None
theta_columns_naming (str) – either 'id' or 'title'; determines how to name columns (documents) in the Theta dataframe, default='id'

Important public fields:

regularizers — dict of regularizers included in the model
scores — dict of scores included in the model
score_tracker — dict of scoring results; key — score name, value — ScoreTracker object, which holds a list of the score's values at each synchronization

Note

Here and everywhere in BigARTM, an empty topic_names or class_ids means that the model (or regularizer, or score) should use all topics or class_ids.
If some fields of regularizers or scores are not defined by the user, internal lib defaults are used.
If the field 'topic_names' is None, it will be generated by BigARTM and will be available via ARTM.topic_names().

create_dictionary(dictionary_name=None, dictionary_data=None)

ARTM.create_dictionary() — create the BigARTM dictionary in the lib from the given dictionary data

Parameters:

dictionary_name (str) – the name of the dictionary in the lib, default=None
dictionary_data (DictionaryData instance) – configuration of dictionary, default=None

filter_dictionary(dictionary_name=None, dictionary_target_name=None, class_id=None, min_df=None, max_df=None, min_df_rate=None, max_df_rate=None, min_tf=None, max_tf=None)

ARTM.filter_dictionary() — filter the BigARTM dictionary of the collection, which was already loaded into the lib

Parameters:

dictionary_name (str) – name of the dictionary in the lib to filter
dictionary_target_name (str) – name for the new filtered dictionary in the lib
class_id (str) – class_id to filter
min_df (float) – min df value to pass the filter
max_df (float) – max df value to pass the filter
min_df_rate (float) – min df rate to pass the filter
max_df_rate (float) – max df rate to pass the filter
min_tf (float) – min tf value to pass the filter
max_tf (float) – max tf value to pass the filter
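The thresholds above can be pictured with a small pure-Python sketch. It is not BigARTM code: the helper `passes_filter`, the inclusive treatment of the bounds, and the definition df_rate = df / number of documents are assumptions made for illustration, and the token statistics are invented.

```python
def passes_filter(tf, df, num_docs, min_df=None, max_df=None,
                  min_df_rate=None, max_df_rate=None,
                  min_tf=None, max_tf=None):
    """Return True if a token with these statistics survives the filter.

    Hypothetical helper: assumes inclusive bounds and
    df_rate = df / num_docs. A bound left as None is not checked.
    """
    def in_range(value, low, high):
        return (low is None or value >= low) and (high is None or value <= high)

    return (in_range(df, min_df, max_df)
            and in_range(df / num_docs, min_df_rate, max_df_rate)
            and in_range(tf, min_tf, max_tf))


# Toy statistics: token -> (tf, df) in a collection of 100 documents.
stats = {"the": (950, 100), "topic": (40, 25), "zyzzyva": (1, 1)}

# Drop tokens seen in fewer than 2 documents or in more than half of them.
kept = [tok for tok, (tf, df) in stats.items()
        if passes_filter(tf, df, num_docs=100, min_df=2, max_df_rate=0.5)]
print(kept)  # ['topic']
```

In this toy run "the" fails max_df_rate and "zyzzyva" fails min_df, so only "topic" remains. How the lib itself handles boundary values and tokens of other class_ids is not specified on this page.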

fit_offline(batch_vectorizer=None, num_collection_passes=20, num_document_passes=1, reuse_theta=True, dictionary_filename='dictionary.dict')

ARTM.fit_offline() — train the topic model in offline mode

Parameters:

batch_vectorizer – an instance of BatchVectorizer class
num_collection_passes (int) – number of iterations over whole given collection, default=20
num_document_passes (int) – number of inner iterations over each document for inferring theta, default=1
reuse_theta (bool) – whether to use Theta from the previous pass over the collection, default=True
dictionary_filename (str) – the name of the dictionary file to use for inline initialization, default='dictionary.dict'

Note

ARTM.initialize() should be called before the first call to ARTM.fit_offline(); otherwise the model will be initialized from the dictionary during the first call.

fit_online(batch_vectorizer=None, tau0=1024.0, kappa=0.7, update_every=1, num_document_passes=10, reset_theta_scores=False, dictionary_filename='dictionary.dict')

ARTM.fit_online() — train the topic model in online mode

Parameters:

batch_vectorizer – an instance of BatchVectorizer class
update_every (int) – the number of batches between model updates, i.e. the model is updated once every update_every batches, default=1
tau0 (float) – offset added to update_count in the rho formula (see the note on decay_weight and apply_weight), default=1024.0
kappa (float) – exponent in the rho formula, rho = pow(tau0 + update_count, -kappa), default=0.7
num_document_passes (int) – number of inner iterations over each document for inferring theta, default=10
reset_theta_scores (bool) – reset accumulated Theta scores before learning, default=False
dictionary_filename (str) – the name of the dictionary file to use for inline initialization, default='dictionary.dict'

Note

The formulas for decay_weight and apply_weight:

update_count = current_processed_docs / (batch_size * update_every)
rho = pow(tau0 + update_count, -kappa)
decay_weight = 1-rho
apply_weight = rho
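The formulas above can be traced numerically with a short sketch. The function name `online_weights` and the sample batch size and document counts are illustrative assumptions, not part of the BigARTM API; only the arithmetic is taken from the note.

```python
def online_weights(current_processed_docs, batch_size, update_every=1,
                   tau0=1024.0, kappa=0.7):
    """Return (decay_weight, apply_weight) following the formulas in the note.

    Hypothetical helper for illustration only; BigARTM computes these
    weights internally during ARTM.fit_online().
    """
    update_count = current_processed_docs / (batch_size * update_every)
    rho = pow(tau0 + update_count, -kappa)
    return 1.0 - rho, rho


# Assumed example: batches of 1000 documents, one update per batch.
for docs in (1000, 10000, 100000):
    decay, apply_ = online_weights(docs, batch_size=1000)
    print(f"docs={docs:>6}  decay_weight={decay:.6f}  apply_weight={apply_:.6f}")
```

For a positive kappa, rho shrinks as more documents are processed, so each later update contributes less to the model and more of the previous state is kept.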

Note

ARTM.initialize() should be called before the first call to ARTM.fit_online(); otherwise the model will be initialized from the dictionary during the first call.

fit_transform(topic_names=None, remove_theta=False)

ARTM.fit_transform() — obsolete way of theta retrieval. Use get_theta instead.

gather_dictionary(dictionary_target_name=None, data_path=None, cooc_file_path=None, vocab_file_path=None, symmetric_cooc_values=False)

ARTM.gather_dictionary() — create the BigARTM dictionary of the collection, represented as batches, and load it into the lib

Parameters:

dictionary_target_name (str) – the name of the dictionary in the lib, default=None
data_path (str) – full path to batches folder
cooc_file_path (str) – full path to the file with co-occurrence (cooc) info
vocab_file_path (str) – full path to the file with vocabulary. If given, the dictionary tokens will have the same order as in this file; otherwise the order will be random, default=None
symmetric_cooc_values (bool) – whether the cooc matrix should be considered symmetric, default=False

get_phi(topic_names=None, class_ids=None, model_name=None)

ARTM.get_phi() — get a custom Phi matrix of the model. To extract the whole Phi matrix, use ARTM.phi_ instead.

Parameters:

topic_names (list of str) – list with topics to extract, default=None (means all topics)
class_ids (list of str) – list with class ids to extract, default=None (means all class ids)
model_name (str) – self.model_pwt by default; self.model_nwt can also be used to extract unnormalized counters

Returns:

pandas.DataFrame (data, columns, rows), where:

columns — the names of topics in topic model
rows — the tokens of topic model
data — content of Phi matrix

get_theta(topic_names=None, remove_theta=False)

ARTM.get_theta() — get Theta matrix for training set of documents

Parameters:

topic_names (list of str) – list with topics to extract, default=None (means all topics)
remove_theta (bool) – whether to remove Theta from the model after extraction (False keeps it), default=False

Returns:

pandas.DataFrame (data, columns, rows), where:

columns — the ids of documents, for which the Theta matrix was requested
rows — the names of topics in the topic model that was used to create Theta
data — content of Theta matrix
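The orientation of the returned frame (topics as rows, documents as columns) is easy to get backwards, so here is a toy stand-in built directly with pandas. The topic names, document ids and values are invented for illustration; no BigARTM call is made.

```python
import pandas as pd

# Toy stand-in for the DataFrame returned by ARTM.get_theta():
# rows are topic names, columns are document ids (all values made up).
theta = pd.DataFrame(
    [[0.7, 0.1, 0.3],
     [0.2, 0.8, 0.3],
     [0.1, 0.1, 0.4]],
    index=["topic_0", "topic_1", "topic_2"],
    columns=[101, 102, 103],
)

# Each column is one document's distribution over topics.
doc_topics = theta[102]                 # topic weights for document 102
dominant_topic = theta.idxmax(axis=0)   # highest-weight topic per document
print(dominant_topic.to_dict())
# {101: 'topic_0', 102: 'topic_1', 103: 'topic_2'}
```

Transposing with `theta.T` gives the documents-as-rows layout that many downstream tools expect.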

initialize(dictionary_name=None, seed=-1)

ARTM.initialize() — initialize topic model before learning

Parameters:

dictionary_name (str) – the name of loaded BigARTM collection dictionary, default=None
seed (unsigned int or -1) – seed for random initialization, default=-1 (no seed)

load(filename)

ARTM.load() — load the topic model, saved by ARTM.save(), from disk

Parameters:

filename (str) – the name of file containing model, no default

Note

A loaded model will overwrite the ARTM.topic_names and ARTM.num_topics fields. It will also clear ARTM.score_tracker.

load_dictionary(dictionary_name=None, dictionary_path=None)

ARTM.load_dictionary() — load the BigARTM dictionary of the collection into the lib

Parameters:

dictionary_name (str) – the name of the dictionary in the lib, default=None
dictionary_path (str) – full file name of the dictionary, default=None

load_text_dictionary(dictionary_name=None, dictionary_path=None, encoding='utf-8')

ARTM.load_text_dictionary() — load the BigARTM dictionary of the collection from the disk in the human-readable text format

Parameters:

dictionary_name (str) – the name for the dictionary in the lib, default=None
dictionary_path (str) – full file name of the text dictionary file, default=None
encoding (str) – the encoding of the text in the dictionary, default='utf-8'

remove_dictionary(dictionary_name=None)

ARTM.remove_dictionary() — remove the loaded BigARTM dictionary from the lib

Parameters:

dictionary_name (str) – the name of the dictionary in the lib, default=None

save(filename='artm_model')

ARTM.save() — save the topic model to disk

Parameters:

filename (str) – the name of file to store model, default='artm_model'

save_dictionary(dictionary_name=None, dictionary_path=None)

ARTM.save_dictionary() — save the BigARTM dictionary of the collection on the disk

Parameters:

dictionary_name (str) – the name of the dictionary in the lib, default=None
dictionary_path (str) – full file name for the dictionary, default=None

save_text_dictionary(dictionary_name=None, dictionary_path=None, encoding='utf-8')

ARTM.save_text_dictionary() — save the BigARTM dictionary of the collection on the disk in the human-readable text format

Parameters:

dictionary_name (str) – the name of the dictionary in the lib, default=None
dictionary_path (str) – full file name for the text dictionary file, default=None
encoding (str) – the encoding of the text in the dictionary, default='utf-8'

transform(batch_vectorizer=None, num_document_passes=1, predict_class_id=None)

ARTM.transform() — find Theta matrix for new documents

Parameters:

batch_vectorizer – an instance of BatchVectorizer class
num_document_passes (int) – number of inner iterations over each document for inferring theta, default=1
predict_class_id (str) – class_id of a target modality to predict, default=None. When this option is set, the columns of the resulting theta matrix correspond to the unique labels of the target modality, and the values represent p(c|d), the probability of class label c for document d.

Returns:

pandas.DataFrame (data, columns, rows), where:

columns — the ids of documents, for which the Theta matrix was requested
rows — the names of topics in the topic model that was used to create Theta
data — content of Theta matrix.
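The p(c|d) values mentioned for predict_class_id can be read as a mixture over topics. Below is a minimal sketch, assuming the usual topic-model decomposition p(c|d) = sum over t of p(c|t) * p(t|d); this page does not state how BigARTM computes the values internally, and the helper `class_probabilities`, the labels and all numbers are invented for illustration.

```python
def class_probabilities(phi_labels, theta_doc):
    """Compute p(c|d) = sum over topics t of p(c|t) * p(t|d).

    phi_labels: {label: {topic: p(label | topic)}} for the target modality
    theta_doc:  {topic: p(topic | document)}
    Hypothetical helper; not part of the BigARTM API.
    """
    return {
        label: sum(p_ct * theta_doc.get(topic, 0.0)
                   for topic, p_ct in per_topic.items())
        for label, per_topic in phi_labels.items()
    }


# Toy values: two labels of a target modality, two topics, one document.
phi_labels = {
    "sports":   {"topic_0": 0.9, "topic_1": 0.2},
    "politics": {"topic_0": 0.1, "topic_1": 0.8},
}
theta_doc = {"topic_0": 0.75, "topic_1": 0.25}

p_cd = class_probabilities(phi_labels, theta_doc)
best = max(p_cd, key=p_cd.get)   # label with the highest p(c|d)
print(best)  # sports
```

Here the document leans toward topic_0, which in turn favours "sports", so that label gets the larger probability (0.725 against 0.275).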