hARTM
This page describes the hARTM class.
class artm.hARTM(num_processors=None, class_ids=None, scores=None, regularizers=None, num_document_passes=10, reuse_theta=False, dictionary=None, cache_theta=False, theta_columns_naming='id', seed=-1, tmp_files_path='')
__init__(num_processors=None, class_ids=None, scores=None, regularizers=None, num_document_passes=10, reuse_theta=False, dictionary=None, cache_theta=False, theta_columns_naming='id', seed=-1, tmp_files_path='')
Description: a class for constructing a topic hierarchy, that is, a sequence of tied artm.ARTM() models (levels)
Parameters: - num_processors (int) – how many threads will be used for model training; if not specified, the number of threads is detected by the library
- class_ids (dict) – class_ids and their weights to be used in the model, where the key is a class_id and the value is its weight; if not specified, all class_ids will be used
- cache_theta (bool) – whether to save the Theta matrix in the model. Required if you intend to call ARTM.get_theta()
- scores (list) – list of scores (objects of artm.*Score classes)
- regularizers (list) – list with regularizers (objects of artm.*Regularizer classes)
- num_document_passes (int) – number of inner iterations over each document
- dictionary (str or reference to Dictionary object) – dictionary to be used for initialization, if None nothing will be done
- reuse_theta (bool) – reuse Theta from previous iteration or not
- theta_columns_naming (str) – either ‘id’ or ‘title’, determines how to name columns (documents) in theta dataframe
- seed (unsigned int or -1) – seed for random initialization, -1 means no seed
- tmp_files_path (str) – a path where to save temporary files (temporary solution), default value: current directory
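A minimal sketch of passing a few non-default arguments to the constructor. The keyword names come from the signature above; the modality name `@labels`, the weights, and the helper names `hierarchy_kwargs` and `make_hierarchy` are illustrative assumptions. `artm` is imported inside the function, so the argument-building part can be read and run without BigARTM installed.

```python
def hierarchy_kwargs(seed=-1, cache_theta=True):
    """Collect keyword arguments for artm.hARTM (all values are illustrative)."""
    return {
        # key: class_id, value: weight (the modality names are assumptions)
        "class_ids": {"@default_class": 1.0, "@labels": 5.0},
        "num_document_passes": 10,
        # needed if Theta will be requested later via get_theta()
        "cache_theta": cache_theta,
        # name Theta columns by document title instead of document id
        "theta_columns_naming": "title",
        "seed": seed,
    }


def make_hierarchy(**overrides):
    """Create an empty hierarchy; requires the BigARTM Python package."""
    import artm  # imported lazily so the helper above works without BigARTM

    kwargs = hierarchy_kwargs()
    kwargs.update(overrides)
    return artm.hARTM(**kwargs)
```

Calling `make_hierarchy(seed=42)` would return a hierarchy with no levels yet; levels are added afterwards with add_level.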
Usage: - to construct hierarchy you have to learn several ARTM models:
hier = artm.hARTM()
level0 = hier.add_level(num_topics=5)  # returns artm.ARTM() instance
# work with level0 as with usual model
level1 = hier.add_level(num_topics=25, parent_level_weight=1)
# work with level1 as with usual model
# …
- to get the i-th level’s model, use
level = hier[i]
- or
level = hier.get_level(i)
- to iterate through levels, use
for level in hier:
    # some work with level
- method hier.del_level(…) removes the i-th level and all levels after it
- other hARTM methods correspond to those in the ARTM class and call them sequentially for all levels of the hierarchy, from 0 to the last one. For example, to fit levels offline you may call the fit_offline method of the hARTM instance or of each level individually.
add_level(num_topics=None, topic_names=None, parent_level_weight=1)
Description: adds a new level to the hierarchy
Parameters: - num_topics (int) – the number of topics in the level model, will be overwritten if parameter topic_names is set
- topic_names (list of str) – names of topics in the model
- parent_level_weight (float) – the coefficient of smoothing n_wt by n_wa, where a enumerates parent topics
Returns: ARTM or derived ARTM_Level instance
Notes: - the hierarchy structure assumes that each level has more topics than the previous one
- work with the returned value as with a usual ARTM model
- to access any level, use [] or the get_level method
- Important! You cannot add the next level before the previous one is initialized and fit.
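The notes above imply a level-by-level loop: add a level, initialize it, fit it, and only then add the next one. The sketch below assumes that `batch_vectorizer` is an artm.BatchVectorizer and `dictionary` is an artm.Dictionary prepared elsewhere. The helper names and the strict `ValueError` check are illustrative, not part of the library.

```python
def check_topic_counts(topic_counts):
    """Raise ValueError unless every level has more topics than the previous one.

    This is an illustrative pre-check of the documented assumption,
    not the library's own validation.
    """
    counts = list(topic_counts)
    for previous, current in zip(counts, counts[1:]):
        if current <= previous:
            raise ValueError(
                "each level needs more topics than the previous one: "
                "%d follows %d" % (current, previous)
            )
    return counts


def build_hierarchy(batch_vectorizer, dictionary, topic_counts=(5, 25),
                    num_collection_passes=10):
    """Build a hierarchy one level at a time; requires BigARTM."""
    import artm  # imported lazily so check_topic_counts works without BigARTM

    counts = check_topic_counts(topic_counts)
    hier = artm.hARTM(cache_theta=True)
    for num_topics in counts:
        # Each level is initialized and fit before the next one is added.
        level = hier.add_level(num_topics=num_topics, parent_level_weight=1)
        level.initialize(dictionary=dictionary)
        level.fit_offline(batch_vectorizer=batch_vectorizer,
                          num_collection_passes=num_collection_passes)
    return hier
```

Regularizers and scores for an individual level would be attached to `level` inside the loop, before the call to fit_offline.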
clone()
Description: returns a deep copy of the artm.hARTM object
Note: - This method is equivalent to copy.deepcopy() of your artm.hARTM object. For more information refer to artm.ARTM.clone() method.
del_level(level_idx)
Description: removes the i-th level and all following levels.
Parameters: level_idx (int) – the index of the level from which to start removing; if -1, the last level is removed
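To make the removal rule concrete, the following sketch models it on a plain Python list. It does not call the library. The function name `remaining_levels` and the bounds check are assumptions made for illustration.

```python
def remaining_levels(levels, level_idx):
    """Return the levels that would be left after del_level(level_idx).

    Models the documented rule on a plain list: the level at level_idx and
    all following levels are dropped, and -1 means "only the last level".
    """
    if level_idx == -1:
        level_idx = len(levels) - 1
    if not 0 <= level_idx < len(levels):
        # Illustrative guard; the library's behaviour here is not documented.
        raise IndexError("no level with index %d" % level_idx)
    return levels[:level_idx]


levels = ["level0", "level1", "level2"]
print(remaining_levels(levels, 1))   # level1 and level2 are dropped
print(remaining_levels(levels, -1))  # only level2 is dropped
```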
dispose()
Description: frees all native memory allocated for this hierarchy
Note: - This method does not free memory occupied by models’ dictionaries, because dictionaries are shared across all models
- hARTM class implements __exit__ and __del__ methods, which automatically call dispose.
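When native memory should be released at a known point rather than when the object is garbage-collected, dispose can be called explicitly in a `finally` block. This is a sketch. The helper name `run_and_dispose` and its `make_hierarchy` argument are assumptions, and `artm.hARTM` is used as the factory only when no other factory is given.

```python
def run_and_dispose(work, make_hierarchy=None):
    """Create a hierarchy, run work(hier), and always call hier.dispose().

    make_hierarchy is any zero-argument callable returning an object with a
    dispose() method; by default artm.hARTM is used (requires BigARTM).
    """
    if make_hierarchy is None:
        import artm  # imported lazily; only needed for the default factory
        make_hierarchy = artm.hARTM

    hier = make_hierarchy()
    try:
        return work(hier)
    finally:
        # Runs on success and on error, so native memory is always released.
        hier.dispose()
```

The value returned by `work` should be plain data (for example a pandas.DataFrame), since the levels themselves are no longer usable after dispose.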
fit_offline(batch_vectorizer, num_collection_passes=1)
Description: trains all hierarchy levels, from 0 to the last one
Parameters: - batch_vectorizer (object_reference) – an instance of BatchVectorizer class
- num_collection_passes (int) – number of iterations over the whole given collection for each level
Note: - You cannot add the next level before the previous one is fit. So use this method only when all levels are added, initialized and fit, for example, when you have added one more regularizer or loaded the hierarchy from disk.
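One situation the note mentions is refitting after a regularizer has been added. The sketch below assumes every level of `hier` has already been initialized and fit. The helper name and the `make_regularizer` callable are illustrative. A fresh regularizer object is created for each level rather than sharing one instance.

```python
def add_regularizer_and_refit(hier, make_regularizer, batch_vectorizer,
                              num_collection_passes=1):
    """Attach a new regularizer to every level, then refit the hierarchy.

    make_regularizer is a zero-argument callable returning a regularizer
    object, for example an instance of one of the artm.*Regularizer classes.
    """
    for level in hier:
        level.regularizers.add(make_regularizer())
    # Fits level 0 first, then each following level in order.
    hier.fit_offline(batch_vectorizer,
                     num_collection_passes=num_collection_passes)
    return hier
```

With BigARTM installed, `make_regularizer` could be, for instance, `lambda: artm.SmoothSparsePhiRegularizer(name='sparse_phi', tau=-0.1)`; the name and tau value here are arbitrary.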
get_level(level_idx)
Description: access level
Parameters: level_idx (int) – the number of level to return
Returns: specified level that is ARTM or derived ARTM_Level instance
get_phi(class_ids=None, model_name=None)
Description: get level-wise horizontally stacked Phi matrices
Parameters: - class_ids (list of str or str) – list with class_ids or single class_id to extract, None means all class ids
- model_name (str) – self.model_pwt by default, self.model_nwt is also reasonable to extract unnormalized counters
Returns: - pandas.DataFrame: (data, columns, rows), where:
- columns — the names of topics in format level_X_Y
- where X is level index and Y is topic name;
- rows — the tokens of topic model;
- data — content of Phi matrix.
Note: - if you need to extract specified topics, use get_phi() method of individual level model
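Because the stacked matrix encodes the level in each column name, it can help to split the names back into a level index and a topic name. The sketch below assumes the names follow the level_X_Y pattern described above. The pattern also tolerates a missing underscore after "level", as a precaution in case the actual names differ slightly. The helper names are illustrative.

```python
import re

# "level", optional underscore, level index, underscore, topic name
_LEVEL_COLUMN = re.compile(r"^level_?(\d+)_(.+)$")


def split_level_column(name):
    """Split a stacked column name into (level index, topic name)."""
    match = _LEVEL_COLUMN.match(name)
    if match is None:
        raise ValueError("unexpected column name: %r" % (name,))
    return int(match.group(1)), match.group(2)


def columns_by_level(columns):
    """Group topic names by level index, keeping their original order."""
    grouped = {}
    for name in columns:
        level_idx, topic_name = split_level_column(name)
        grouped.setdefault(level_idx, []).append(topic_name)
    return grouped
```

Given a hierarchy `hier`, `columns_by_level(hier.get_phi().columns)` would list the topics of each level; the same helpers apply to the row names of the stacked Theta matrices.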
get_theta(topic_names=None)
Description: get level-wise vertically stacked Theta matrices for training set of documents
Parameters: topic_names (list of str) – list with topics to extract, None means all topics
Returns: - pandas.DataFrame: (data, columns, rows), where:
- columns — the ids of documents, for which the Theta matrix was requested;
- rows — the names of topics in format level_X_Y
- where X is level index and Y is topic name;
- data — content of Theta matrix.
library_version
Description: the version of BigARTM library in a MAJOR.MINOR.PATCH format
load(path)
Description: loads models of already constructed hierarchy
Parameters: path (str) – a path where hierarchy was saved by hARTM.save method
Notes: - Loaded models will overwrite ARTM.topic_names and class_ids fields of each level.
- All class_ids weights will be set to 1.0, you need to specify them by hand if it’s necessary.
- The method call will empty ARTM.score_tracker of each level.
- All regularizers and scores will be forgotten.
- etc.
- We strongly recommend that you reset all important parameters of the ARTM models and of hARTM that were used earlier.
save(path)
Description: saves all levels
Parameters: path (str) – a path where to save hierarchy files. This must be an existing empty path, otherwise an exception is raised
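Since save expects an existing empty path, a small helper can prepare the directory before saving. This is a sketch. The helper names are assumptions, and creating the directory when it is missing is a convenience added here, not library behaviour.

```python
import os


def ensure_empty_dir(path):
    """Create path if it is missing and check that it contains no files."""
    os.makedirs(path, exist_ok=True)
    if os.listdir(path):
        raise ValueError("directory is not empty: %r" % (path,))
    return path


def save_and_reload(hier, path):
    """Save a fitted hierarchy and load it into a new hARTM object."""
    import artm  # imported lazily so ensure_empty_dir works without BigARTM

    hier.save(ensure_empty_dir(path))
    restored = artm.hARTM()
    restored.load(path)
    return restored
```

As the notes for load state, the restored levels have no regularizers or scores and all class_ids weights equal to 1.0, so these need to be set again before further training.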
transform(batch_vectorizer)
Description: get level-wise vertically stacked Theta matrices for new documents
Parameters: batch_vectorizer (object_reference) – an instance of BatchVectorizer class
Returns: - pandas.DataFrame: (data, columns, rows), where:
- columns — the ids of documents, for which the Theta matrix was requested;
- rows — the names of topics in format level_X_Y
- where X is level index and Y is topic name;
- data — content of Theta matrix.
Note: - to access p(t|d, w) matrix or to predict class use transform method of hierarchy level individually
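The difference between the hierarchy-wide call and the per-level call can be sketched as follows. The helper name is illustrative, and `theta_matrix_type='dense_ptdw'` is the ARTM.transform option assumed here for requesting the p(t|d, w) matrix.

```python
def infer_for_new_documents(hier, batch_vectorizer, level_idx=0):
    """Return (stacked Theta for all levels, p(t|d, w) for one level)."""
    # Hierarchy-wide call: Theta matrices of all levels, stacked vertically.
    stacked_theta = hier.transform(batch_vectorizer)

    # Per-level call: options of ARTM.transform are only available here.
    level = hier.get_level(level_idx)
    ptdw = level.transform(batch_vectorizer=batch_vectorizer,
                           theta_matrix_type="dense_ptdw")
    return stacked_theta, ptdw
```

Class prediction would similarly go through the transform method of the individual level rather than through the hierarchy.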