ARTM model

This page describes the ARTM class.

class artm.ARTM(num_processors=0, topic_names=None, num_topics=10, class_ids=None, cache_theta=True, scores=None, regularizers=None, theta_columns_naming='id')

ARTM represents a topic model (public class)

Parameters:
  • num_processors (int) – number of threads to use for model training; if not specified, the number of threads is detected by the lib
  • topic_names (list of str) – names of topics in the model; if not specified, they are auto-generated by the lib according to num_topics
  • num_topics (int) – number of topics in the model (used only if topic_names is not specified), default=10
  • class_ids (dict) – class_ids and their weights to be used in the model, key — class_id, value — weight; if not specified, all class_ids are used
  • cache_theta (bool) – whether to save the Theta matrix in the model. Required if ARTM.get_theta() is going to be used, default=True
  • scores (list) – list of scores (objects of artm.***Score classes), default=None
  • regularizers (list) – list with regularizers (objects of artm.***Regularizer classes), default=None
  • theta_columns_naming (string) – either ‘id’ or ‘title’, determines how to name columns (documents) in theta dataframe, default=’id’
Important public fields:
  • regularizers — contains dict of regularizers, included into model
  • scores — contains dict of scores, included into model
  • score_tracker — contains dict of scoring results; key — score name, value — ScoreTracker object, which stores the list of score values, one per synchronization

Note

  • Here and everywhere in BigARTM, empty topic_names or class_ids means that the model (or regularizer, or score) should use all topics or class_ids.
  • If some fields of regularizers or scores are not defined by the user, internal lib defaults are used.
  • If the field ‘topic_names’ is None, it will be generated by BigARTM and will be available using ARTM.topic_names().
create_dictionary(dictionary_name=None, dictionary_data=None)

ARTM.create_dictionary() — create a BigARTM dictionary in the lib from the given DictionaryData

Parameters:
  • dictionary_name (str) – the name of the dictionary in the lib, default=None
  • dictionary_data (DictionaryData instance) – configuration of dictionary, default=None
filter_dictionary(dictionary_name=None, dictionary_target_name=None, class_id=None, min_df=None, max_df=None, min_df_rate=None, max_df_rate=None, min_tf=None, max_tf=None)

ARTM.filter_dictionary() — filter the BigARTM dictionary of the collection, which was already loaded into the lib

Parameters:
  • dictionary_name (str) – name of the dictionary in the lib to filter
  • dictionary_target_name (str) – name for the new filtered dictionary in the lib
  • class_id (str) – class_id to filter
  • min_df (float) – min df value to pass the filter
  • max_df (float) – max df value to pass the filter
  • min_df_rate (float) – min df rate to pass the filter
  • max_df_rate (float) – max df rate to pass the filter
  • min_tf (float) – min tf value to pass the filter
  • max_tf (float) – max tf value to pass the filter
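
As a rough illustration of how these thresholds might be combined, the sketch below gathers a dictionary from batches and then derives a filtered copy of it. It relies only on methods documented on this page. The helper name `build_filtered_dictionary`, the dictionary names and the threshold values are assumptions chosen for the example, and `model` is expected to be an already constructed artm.ARTM instance.

```python
def build_filtered_dictionary(model, batches_dir,
                              source_name="full_dict",
                              target_name="filtered_dict",
                              min_df=5, max_df_rate=0.5):
    """Gather a dictionary from batches, then create a filtered copy.

    `model` is assumed to be an artm.ARTM instance. All names and
    thresholds are illustrative values, not library defaults.
    """
    # Build the dictionary from the batches folder and load it into the lib.
    model.gather_dictionary(dictionary_target_name=source_name,
                            data_path=batches_dir)
    # Store the filtered result under a separate name.
    model.filter_dictionary(dictionary_name=source_name,
                            dictionary_target_name=target_name,
                            min_df=min_df,
                            max_df_rate=max_df_rate)
    return target_name
```

The returned name can then be passed to ARTM.initialize() as dictionary_name. Parameters left at None (for example min_tf and max_tf here) are simply not passed.
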
fit_offline(batch_vectorizer=None, num_collection_passes=20, num_document_passes=1, reuse_theta=True, dictionary_filename='dictionary.dict')

ARTM.fit_offline() — train the topic model in offline mode

Parameters:
  • batch_vectorizer – an instance of BatchVectorizer class
  • num_collection_passes (int) – number of iterations over the whole given collection, default=20
  • num_document_passes (int) – number of inner iterations over each document for inferring theta, default=1
  • reuse_theta (bool) – whether to reuse theta from the previous pass over the collection, default=True
  • dictionary_filename (str) – the name of the file with the dictionary to use in inline initialization, default=’dictionary.dict’

Note

ARTM.initialize() should be called before the first call to ARTM.fit_offline(); otherwise the model will be initialized from the dictionary during the first call.
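
The call order described in this note could look roughly like the sketch below. It uses only methods documented on this page. The helper name `train_offline` is hypothetical, `model` is assumed to be an artm.ARTM instance created with cache_theta=True, and `batch_vectorizer` is assumed to be a ready BatchVectorizer instance.

```python
def train_offline(model, batch_vectorizer, dictionary_name,
                  num_collection_passes=20):
    """Initialize the model explicitly, train it offline, return Theta.

    `model` is assumed to be an artm.ARTM instance and `dictionary_name`
    the name of a dictionary already loaded into the lib.
    """
    # Explicit initialization, so fit_offline() does not have to
    # initialize the model from the dictionary on its first call.
    model.initialize(dictionary_name=dictionary_name)
    model.fit_offline(batch_vectorizer=batch_vectorizer,
                      num_collection_passes=num_collection_passes)
    # get_theta() relies on the model having been created with cache_theta=True.
    return model.get_theta()
```
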

fit_online(batch_vectorizer=None, tau0=1024.0, kappa=0.7, update_every=1, num_document_passes=10, reset_theta_scores=False, dictionary_filename='dictionary.dict')

ARTM.fit_online() — train the topic model in online mode

Parameters:
  • batch_vectorizer – an instance of BatchVectorizer class
  • update_every (int) – number of batches between model updates (the model is updated once per update_every batches), default=1
  • tau0 (float) – offset added to update_count in the formula for rho (see the note below), default=1024.0
  • kappa (float) – exponent in the formula for rho (see the note below), default=0.7
  • num_document_passes (int) – number of inner iterations over each document for inferring theta, default=10
  • reset_theta_scores (bool) – reset accumulated Theta scores before learning, default=False
  • dictionary_filename (str) – the name of the file with the dictionary to use in inline initialization, default=’dictionary.dict’

Note

The formulas for decay_weight and apply_weight:

  • update_count = current_processed_docs / (batch_size * update_every)
  • rho = pow(tau0 + update_count, -kappa)
  • decay_weight = 1-rho
  • apply_weight = rho
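
To make the formulas above concrete, here is a small pure-Python sketch that evaluates them. The function name `online_update_weights` is hypothetical, and the sketch assumes update_count is treated as a real number (true division); the library may round it differently internally.

```python
def online_update_weights(current_processed_docs, batch_size,
                          update_every=1, tau0=1024.0, kappa=0.7):
    """Return (decay_weight, apply_weight) following the formulas above."""
    # Number of model updates performed so far (assumed real-valued here).
    update_count = current_processed_docs / (batch_size * update_every)
    rho = pow(tau0 + update_count, -kappa)
    decay_weight = 1.0 - rho
    apply_weight = rho
    return decay_weight, apply_weight
```

With the default tau0 and kappa and no documents processed yet, rho equals 1024^(-0.7) = 2^(-7), roughly 0.0078, so nearly all of the weight stays on the existing model. As more documents are processed, rho shrinks further and each new update contributes less.
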

Note

ARTM.initialize() should be called before the first call to ARTM.fit_online(); otherwise the model will be initialized from the dictionary during the first call.

fit_transform(topic_names=None, remove_theta=False)

ARTM.fit_transform() — obsolete way to retrieve Theta. Use ARTM.get_theta() instead.

gather_dictionary(dictionary_target_name=None, data_path=None, cooc_file_path=None, vocab_file_path=None, symmetric_cooc_values=False)

ARTM.gather_dictionary() — create the BigARTM dictionary of a collection represented as batches, and load it into the lib

Parameters:
  • dictionary_target_name (str) – the name of the dictionary in the lib, default=None
  • data_path (str) – full path to batches folder
  • cooc_file_path (str) – full path to the file with cooc info
  • vocab_file_path (str) – full path to the file with vocabulary. If given, the dictionary tokens will have the same order as in this file; otherwise the order will be random, default=None
  • symmetric_cooc_values (bool) – whether the cooc matrix should be considered symmetric, default=False
get_phi(topic_names=None, class_ids=None, model_name=None)

ARTM.get_phi() — get a custom Phi matrix of the model. To extract the whole Phi matrix, use ARTM.phi_ instead.

Parameters:
  • topic_names (list of str) – list with topics to extract, default=None (means all topics)
  • class_ids (list of str) – list with class ids to extract, default=None (means all class ids)
  • model_name (str) – name of the matrix to extract; self.model_pwt by default, self.model_nwt can be used to extract unnormalized counters
Returns:

pandas.DataFrame (data, columns, rows), where:

  • columns — the names of topics in topic model
  • rows — the tokens of topic model
  • data — content of Phi matrix
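
Given the layout described above (topics as columns, tokens as rows), the most probable tokens of each topic can be read off column by column. The sketch below is one possible way to do that; the helper name `top_tokens` is hypothetical. It works on a plain nested dict of the form {topic_name: {token: value}}, which is what pandas' `DataFrame.to_dict()` returns by default for such a frame, so it does not need BigARTM or pandas to run.

```python
def top_tokens(phi_columns, num_tokens=10):
    """Return the highest-valued tokens for every topic.

    `phi_columns` maps topic name -> {token: value}, for example the
    result of calling .to_dict() on the DataFrame returned by get_phi().
    """
    result = {}
    for topic_name, column in phi_columns.items():
        # Sort tokens of this topic by their Phi value, largest first.
        ranked = sorted(column, key=column.get, reverse=True)
        result[topic_name] = ranked[:num_tokens]
    return result
```

A typical call, assuming `model` is a trained artm.ARTM instance, would be `top_tokens(model.get_phi().to_dict(), num_tokens=10)`.
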

get_theta(topic_names=None, remove_theta=False)

ARTM.get_theta() — get Theta matrix for training set of documents

Parameters:
  • topic_names (list of str) – list with topics to extract, default=None (means all topics)
  • remove_theta (bool) – flag indicating whether to remove Theta from the model after extraction (True) or keep it (False), default=False
Returns:

pandas.DataFrame (data, columns, rows), where:

  • columns — the ids of documents, for which the Theta matrix was requested
  • rows — the names of topics in topic model, that was used to create Theta
  • data — content of Theta matrix
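
Because documents are columns and topics are rows, the dominant topic of a document is the row with the largest value in that document's column. The sketch below shows one way to compute it; the helper name `dominant_topics` is hypothetical. It takes a plain nested dict {document_id: {topic_name: value}}, the default output of pandas' `DataFrame.to_dict()` for a frame with this layout.

```python
def dominant_topics(theta_columns):
    """Map every document id to the topic with the largest Theta value.

    `theta_columns` maps document id -> {topic name: value}, for example
    the result of calling .to_dict() on the DataFrame from get_theta().
    """
    return {
        doc_id: max(column, key=column.get)
        for doc_id, column in theta_columns.items()
    }
```

Assuming `model` is an artm.ARTM instance trained with cache_theta=True, this could be called as `dominant_topics(model.get_theta().to_dict())`.
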

initialize(dictionary_name=None, seed=-1)

ARTM.initialize() — initialize topic model before learning

Parameters:
  • dictionary_name (str) – the name of loaded BigARTM collection dictionary, default=None
  • seed (unsigned int or -1) – seed for random initialization, default=-1 (no seed)
load(filename)

ARTM.load() — load the topic model, saved by ARTM.save(), from disk

Parameters:
  • filename (str) – the name of the file containing the model, no default

Note

The loaded model will overwrite the ARTM.topic_names and ARTM.num_topics fields. It will also empty ARTM.score_tracker.

load_dictionary(dictionary_name=None, dictionary_path=None)

ARTM.load_dictionary() — load the BigARTM dictionary of the collection into the lib

Parameters:
  • dictionary_name (str) – the name of the dictionary in the lib, default=None
  • dictionary_path (str) – full file name of the dictionary, default=None
load_text_dictionary(dictionary_name=None, dictionary_path=None, encoding='utf-8')

ARTM.load_text_dictionary() — load the BigARTM dictionary of the collection from disk in human-readable text format

Parameters:
  • dictionary_name (str) – the name for the dictionary in the lib, default=None
  • dictionary_path (str) – full file name of the text dictionary file, default=None
  • encoding (str) – the encoding of the text in the dictionary, default=’utf-8’
remove_dictionary(dictionary_name=None)

ARTM.remove_dictionary() — remove the loaded BigARTM dictionary from the lib

Parameters:
  • dictionary_name (str) – the name of the dictionary in the lib, default=None
save(filename='artm_model')

ARTM.save() — save the topic model to disk

Parameters:
  • filename (str) – the name of the file to store the model in, default=’artm_model’
save_dictionary(dictionary_name=None, dictionary_path=None)

ARTM.save_dictionary() — save the BigARTM dictionary of the collection on the disk

Parameters:
  • dictionary_name (str) – the name of the dictionary in the lib, default=None
  • dictionary_path (str) – full file name for the dictionary, default=None
save_text_dictionary(dictionary_name=None, dictionary_path=None, encoding='utf-8')

ARTM.save_text_dictionary() — save the BigARTM dictionary of the collection to disk in human-readable text format

Parameters:
  • dictionary_name (str) – the name of the dictionary in the lib, default=None
  • dictionary_path (str) – full file name for the text dictionary file, default=None
  • encoding (str) – the encoding of the text in the dictionary, default=’utf-8’
transform(batch_vectorizer=None, num_document_passes=1, predict_class_id=None)

ARTM.transform() — find Theta matrix for new documents

Parameters:
  • batch_vectorizer – an instance of BatchVectorizer class
  • num_document_passes (int) – number of inner iterations over each document for inferring theta, default=1
  • predict_class_id (string) – class_id of a target modality to predict, default=None. When this option is enabled, the resulting columns of the theta matrix will correspond to unique labels of the target modality. The values will represent p(c|d), the probability of class label c for document d.
Returns:

pandas.DataFrame (data, columns, rows), where:

  • columns — the ids of documents, for which the Theta matrix was requested
  • rows — the names of topics in topic model, that was used to create Theta
  • data — content of Theta matrix.