hARTM

This page describes the hARTM class.

class artm.hARTM(num_processors=None, class_ids=None, scores=None, regularizers=None, num_document_passes=10, reuse_theta=False, dictionary=None, cache_theta=False, theta_columns_naming='id', seed=-1, tmp_files_path='')
__init__(num_processors=None, class_ids=None, scores=None, regularizers=None, num_document_passes=10, reuse_theta=False, dictionary=None, cache_theta=False, theta_columns_naming='id', seed=-1, tmp_files_path='')
Description:

a class for constructing a topic hierarchy, which is a sequence of tied artm.ARTM() models (levels)

Parameters:
  • num_processors (int) – number of threads used for model training; if not specified, the library detects the number of threads
  • class_ids (dict) – class_ids and their weights to be used in the model: key is class_id, value is weight; if not specified, all class_ids are used
  • cache_theta (bool) – whether to save the Theta matrix in the model; necessary if ARTM.get_theta() is going to be used
  • scores (list) – list of scores (objects of artm.*Score classes)
  • regularizers (list) – list with regularizers (objects of artm.*Regularizer classes)
  • num_document_passes (int) – number of inner iterations over each document
  • dictionary (str or reference to Dictionary object) – dictionary to be used for initialization; if None, nothing is done
  • reuse_theta (bool) – reuse Theta from previous iteration or not
  • theta_columns_naming (str) – either ‘id’ or ‘title’, determines how to name columns (documents) in theta dataframe
  • seed (unsigned int or -1) – seed for random initialization, -1 means no seed
  • tmp_files_path (str) – a path where to save temporary files (temporary solution), default value: current directory
Usage:
  • to construct hierarchy you have to learn several ARTM models:

    hier = artm.hARTM()
    level0 = hier.add_level(num_topics=5)  # returns artm.ARTM() instance
    # work with level0 as with usual model
    level1 = hier.add_level(num_topics=25, parent_level_weight=1)
    # work with level1 as with usual model
    # …

  • to get the i-th level’s model, use

    level = hier[i]

    or

    level = hier.get_level(i)

  • to iterate through levels, use

    for level in hier:
        # some work with level

  • method hier.del_level(…) removes the i-th level and all levels after it

  • other hARTM methods correspond to those in the ARTM class and call them sequentially for all levels of the hierarchy, from 0 to the last one. For example, to fit levels offline you may call the fit_offline method of the hARTM instance or of each level individually.

add_level(num_topics=None, topic_names=None, parent_level_weight=1)
Description:

adds new level to the hierarchy

Parameters:
  • num_topics (int) – the number of topics in the level model; overwritten if the topic_names parameter is set
  • topic_names (list of str) – names of topics in the model
  • parent_level_weight (float) – the coefficient of smoothing n_wt by n_wa, where a enumerates parent topics
Returns:

ARTM or derived ARTM_Level instance

Notes:
  • the hierarchy structure assumes that each level has more topics than the previous one
  • work with the returned value as with a usual ARTM model
  • to access any level, use [] or the get_level method
  • Important! You cannot add the next level before the previous one is initialized and fit.
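The constraints in the notes above (a level must be initialized and fit before the next one is added, and topic counts grow from level to level) can be sketched as a small helper. This is a minimal sketch, not part of the library: the name build_levels and its parameters are hypothetical, and it assumes each level is initialized with initialize(dictionary=...) and trained with fit_offline(...) in the same way as a usual artm.ARTM model.

```python
def build_levels(hier, dictionary, batch_vectorizer, topics_per_level,
                 num_collection_passes=10):
    """Add levels to `hier` one at a time (hypothetical helper, not part of artm)."""
    # The hierarchy assumes every level has more topics than the previous one.
    if any(b <= a for a, b in zip(topics_per_level, topics_per_level[1:])):
        raise ValueError("topic counts must strictly increase from level to level")

    levels = []
    for num_topics in topics_per_level:
        # add_level returns an ARTM (or derived ARTM_Level) instance.
        level = hier.add_level(num_topics=num_topics)
        # Assumption: a level is initialized and trained like a usual ARTM model.
        level.initialize(dictionary=dictionary)
        level.fit_offline(batch_vectorizer,
                          num_collection_passes=num_collection_passes)
        # Only after fitting is it safe to add the next level.
        levels.append(level)
    return levels
```

With a real hierarchy this might be called as build_levels(artm.hARTM(), dictionary, batch_vectorizer, [5, 25]), where dictionary and batch_vectorizer have been prepared beforehand.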
clone()
Description:

returns a deep copy of the artm.hARTM object

Note:
  • This method is equivalent to copy.deepcopy() of your artm.hARTM object. For more information refer to artm.ARTM.clone() method.
del_level(level_idx)
Description: removes the i-th level and all following levels.
Parameters: level_idx (int) – the index of the level from which to start removing; if -1, the last level is removed
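The removal rule can be illustrated on a plain Python list standing in for the hierarchy's levels. This is only a sketch of the semantics described above, not the library's implementation; the function drop_levels is hypothetical.

```python
def drop_levels(levels, level_idx):
    """Return `levels` without the level at `level_idx` and everything after it.

    Hypothetical stand-in for the behaviour described for del_level:
    level_idx=-1 removes only the last level.
    """
    if level_idx == -1:
        return levels[:-1]
    return levels[:level_idx]


levels = ["level0", "level1", "level2"]
print(drop_levels(levels, 1))   # ['level0']
print(drop_levels(levels, -1))  # ['level0', 'level1']
```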
dispose()
Description:

frees all native memory allocated for this hierarchy

Note:
  • This method does not free memory occupied by models’ dictionaries, because dictionaries are shared across all models
  • hARTM class implements __exit__ and __del__ methods, which automatically call dispose.
fit_offline(batch_vectorizer, num_collection_passes=1)
Description:

trains all hierarchy levels, from 0 to the last one

Parameters:
  • batch_vectorizer (object_reference) – an instance of BatchVectorizer class
  • num_collection_passes (int) – number of iterations over whole given collection for each level
Note:
  • You cannot add the next level before the previous one is fit. So use this method only when all levels are already added, initialized and fit, for example, after adding one more regularizer or loading the hierarchy from disk.
get_level(level_idx)
Description: access level
Parameters: level_idx (int) – the number of level to return
Returns: specified level that is ARTM or derived ARTM_Level instance
get_phi(class_ids=None, model_name=None)
Description:

get level-wise horizontally stacked Phi matrices

Parameters:
  • class_ids (list of str or str) – list with class_ids or single class_id to extract, None means all class ids
  • model_name (str) – self.model_pwt by default, self.model_nwt is also reasonable to extract unnormalized counters
Returns:

  • pandas.DataFrame: (data, columns, rows), where:
  • columns — the names of topics in format level_X_Y
    where X is level index and Y is topic name;
  • rows — the tokens of topic model;
  • data — content of Phi matrix.

Note:
  • to extract specific topics, use the get_phi() method of an individual level model
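Labels in the level_X_Y format can be split back into a level index and a topic name. The following is a minimal sketch with a hypothetical helper, split_level_label; it assumes the label is literally the word "level", the level index and the topic name joined by underscores, and the example label level_1_topic_3 is made up for illustration.

```python
def split_level_label(label):
    """Split 'level_X_Y' into (X as int, Y); hypothetical helper."""
    # Split at the first two underscores only, so that topic names
    # containing underscores (for example 'topic_3') stay intact.
    prefix, level_idx, topic_name = label.split("_", 2)
    if prefix != "level":
        raise ValueError("unexpected label: %r" % label)
    return int(level_idx), topic_name


print(split_level_label("level_1_topic_3"))  # (1, 'topic_3')
```

Given a DataFrame returned by get_phi(), the helper could be applied to each entry of its columns to group topics by level.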
get_theta(topic_names=None)
Description: get level-wise vertically stacked Theta matrices for training set of documents
Parameters: topic_names (list of str) – list with topics to extract, None means all topics
Returns:
  • pandas.DataFrame: (data, columns, rows), where:
  • columns — the ids of documents, for which the Theta matrix was requested;
  • rows — the names of topics in format level_X_Y
    where X is level index and Y is topic name;
  • data — content of Theta matrix.
library_version
Description: the version of BigARTM library in a MAJOR.MINOR.PATCH format
load(path)
Description:

loads models of already constructed hierarchy

Parameters:

path (str) – a path where hierarchy was saved by hARTM.save method

Notes:
  • Loaded models will overwrite ARTM.topic_names and class_ids fields of each level.
  • All class_ids weights will be set to 1.0, you need to specify them by hand if it’s necessary.
  • The method call will empty ARTM.score_tracker of each level.
  • All regularizers and scores will be forgotten.
  • etc.
  • We strongly recommend that you reset all important parameters of the ARTM models and hARTM that were used earlier.
save(path)
Description: saves all levels
Parameters: path (str) – a path where to save hierarchy files. This must be an existing empty path, otherwise an exception is raised
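Because save expects an existing empty path, it may help to prepare the target location first. The sketch below uses only the standard library; ensure_empty_dir is a hypothetical helper, and it assumes that "existing empty path" means an existing empty directory.

```python
import os


def ensure_empty_dir(path):
    """Create `path` if needed and check that it is an empty directory."""
    # Assumption: hARTM.save expects an existing, empty directory.
    os.makedirs(path, exist_ok=True)
    if os.listdir(path):
        raise ValueError("directory is not empty: %s" % path)
    return path
```

The returned path could then be passed on, for example hier.save(ensure_empty_dir("hierarchy_dump")), where the directory name is only an illustration.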
transform(batch_vectorizer)
Description:

get level-wise vertically stacked Theta matrices for new documents

Parameters:

batch_vectorizer (object_reference) – an instance of BatchVectorizer class

Returns:

  • pandas.DataFrame: (data, columns, rows), where:
  • columns — the ids of documents, for which the Theta matrix was requested;
  • rows — the names of topics in format level_X_Y
    where X is level index and Y is topic name;
  • data — content of Theta matrix.

Note:
  • to access the p(t|d, w) matrix or to predict a class, use the transform method of an individual hierarchy level