Python Interface

This document describes all classes in the Python interface of the BigARTM library.

Library

class artm.library.Library(artm_shared_library = "")

Creates a Library object that wraps the BigARTM shared library.

artm_shared_library is an optional argument that gives the full file name of the artm shared library (a disk path plus artm.dll on Windows or artm.so on Linux). When artm_shared_library is not specified, the shared library is searched for in the folders listed in the PATH system variable. You may also set the ARTM_SHARED_LIBRARY system variable to provide the full file name of the artm shared library.
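
The lookup order described above can be sketched roughly as follows. This is an illustration only, not the library's own code: the helper name resolve_library_name is hypothetical, and the precedence between the explicit argument and ARTM_SHARED_LIBRARY is an assumption (the explicit argument is taken to win).

```python
import os
import sys


def resolve_library_name(artm_shared_library="", environ=None):
    """Hypothetical helper: decide which shared library name to load."""
    environ = os.environ if environ is None else environ
    # 1. An explicit argument is assumed to take priority.
    if artm_shared_library:
        return artm_shared_library
    # 2. Otherwise fall back to the ARTM_SHARED_LIBRARY system variable.
    from_env = environ.get("ARTM_SHARED_LIBRARY", "")
    if from_env:
        return from_env
    # 3. Otherwise return the bare file name, so that the loader has to
    #    search for it (artm.so is used for every non-Windows platform here).
    return "artm.dll" if sys.platform.startswith("win") else "artm.so"
```

The returned string is what would be passed to Library(artm_shared_library = ...).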

CreateMasterComponent(config = None)

Creates and returns an instance of MasterComponent class. config defines an optional MasterComponentConfig parameter that may carry the configuration of the master component.

SaveBatch(batch, disk_path)

Saves a given Batch into a disk location specified by disk_path.

ParseCollection(collection_parser_config)

Parses a text collection as defined by collection_parser_config (CollectionParserConfig). Returns an instance of DictionaryConfig, which carries all unique words in the collection and their frequencies.

For more information refer to ArtmRequestParseCollection() and ArtmRequestLoadDictionary().

LoadDictionary(full_filename)

Loads a DictionaryConfig from the file, defined by full_filename argument.

For more information refer to ArtmRequestLoadDictionary().

LoadBatch(full_filename)

Loads a Batch from the file, defined by full_filename argument.

For more information refer to ArtmRequestLoadBatch().

ParseCollectionOrLoadDictionary(docword_file_path, vocab_file_path, target_folder)

A simple helper method that runs ParseCollection() when target_folder is empty, and otherwise tries to use LoadDictionary() to load the dictionary from target_folder.

The docword_file_path and vocab_file_path arguments should provide the disk location of docword and vocab files of the collection to be parsed.
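
A minimal sketch of this branching is shown below. The function parse_or_load and its parse and load parameters are hypothetical stand-ins for ParseCollection() and LoadDictionary(), passed in so that the sketch runs without BigARTM. Treating a missing folder like an empty one is an assumption, and the file name "dictionary" is only a placeholder.

```python
import os
import tempfile


def parse_or_load(target_folder, parse, load):
    """Hypothetical sketch: parse when the folder is empty, else load."""
    # A missing folder is treated the same way as an empty one (assumption).
    is_empty = not os.path.isdir(target_folder) or not os.listdir(target_folder)
    if is_empty:
        return parse(target_folder)
    return load(target_folder)


with tempfile.TemporaryDirectory() as folder:
    # First call: the folder is empty, so the collection is parsed.
    first = parse_or_load(folder, lambda f: "parsed", lambda f: "loaded")
    # Simulate the output of a previous parse with a placeholder file.
    open(os.path.join(folder, "dictionary"), "w").close()
    # Second call: the folder is no longer empty, so the dictionary is loaded.
    second = parse_or_load(folder, lambda f: "parsed", lambda f: "loaded")
```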

MasterComponent

class artm.library.MasterComponent(config = None, lib = None, disk_path = None)

Creates a master component.

config is an optional instance of MasterComponentConfig, providing an initial configuration of the master component.

lib is an optional argument pointing to Library. When not specified, a default library will be used. Check the constructor of Library for more details.

disk_path is an optional value providing the disk folder with batches to be processed by this master component. Changing disk_path is not supported (to do so, you must create a new instance of MasterComponent). InvokeIteration() processes all batches located under disk_path. Alternatively, use AddBatch() to add a specific batch to the processor queue.

Dispose()

Disposes the master component and releases all unmanaged resources.

config()

Returns current MasterComponentConfig of the master component.

CreateModel(config=None, topics_count=None, inner_iterations_count=None, class_ids=None, class_weights=None, topic_names=None, use_sparse_format=None, request_type=None)

Creates and returns an instance of Model class based on a given ModelConfig. Note that the model has to be further tuned by several iterative scans over the text collection. Use InvokeIteration() to perform such scans.

All other parameters override the corresponding values specified in config.

RemoveModel(model)

Removes an instance of Model from the master component. After this operation the model object becomes invalid and must not be used.

CreateRegularizer(name, type, config)

Creates and returns an instance of the Regularizer component. name can be any unique identifier that you can later use to identify the regularizer (for example, in ModelConfig.regularizer_name). type can be any regularizer type (for example, RegularizerConfig_Type_DirichletTheta). config can be any regularizer config (for example, a SmoothSparseThetaConfig).

CreateSmoothSparseThetaRegularizer(name = None, config = None)

Creates an instance of SmoothSparseThetaRegularizer. config is an optional argument of SmoothSparseThetaConfig type.

CreateSmoothSparsePhiRegularizer(name = None, config = None, topic_names=None, class_ids=None)

Creates an instance of SmoothSparsePhiRegularizer. config is an optional argument of SmoothSparsePhiConfig type.

CreateDecorrelatorPhiRegularizer(name = None, config = None, topic_names=None, class_ids=None)

Creates an instance of DecorrelatorPhiRegularizer. config is an optional argument of DecorrelatorPhiConfig type.

RemoveRegularizer(regularizer)

Removes an instance of Regularizer from the master component. After this operation the regularizer object becomes invalid and must not be used.

CreateScore(name, type, config)

Creates a score calculator inside the master component. name can be any unique identifier that you can later use to identify the score (for example, in ModelConfig.score_name). type can be any score type (for example, ScoreConfig_Type_Perplexity). config can be any score config (for example, a PerplexityScoreConfig).

CreatePerplexityScore(name = None, config = None, stream_name = None, class_ids=None)

Creates an instance of PerplexityScore. config is an optional argument of PerplexityScoreConfig type.

CreateSparsityThetaScore(name = None, config = None, topic_names=None)

Creates an instance of SparsityThetaScore. config is an optional argument of SparsityThetaScoreConfig type.

CreateSparsityPhiScore(name = None, config = None, topic_names=None, class_id=None)

Creates an instance of SparsityPhiScore. config is an optional argument of SparsityPhiScoreConfig type.

CreateItemsProcessedScore(name = None, config = None)

Creates an instance of ItemsProcessedScore. config is an optional argument of ItemsProcessedScoreConfig type.

CreateTopTokensScore(name = None, config = None, num_tokens = None, class_id = None, topic_names=None)

Creates an instance of TopTokensScore. config is an optional argument of TopTokensScoreConfig type.

CreateThetaSnippetScore(name = None, config = None)

Creates an instance of ThetaSnippetScore. config is an optional argument of ThetaSnippetScoreConfig type.

CreateTopicKernelScore(name = None, config = None, topic_names=None, class_id=None)

Creates an instance of TopicKernelScore. config is an optional argument of TopicKernelScoreConfig type.

RemoveScore(name)

Removes a score calculator with the specific name from the master component.

CreateDictionary(config)

Creates and returns an instance of the Dictionary class with the given DictionaryConfig.

RemoveDictionary(dictionary)

Removes an instance of Dictionary from the master component. After this operation the dictionary object becomes invalid and must not be used.

Reconfigure(config = None)

Updates the configuration of the master component with new MasterComponentConfig value, provided by config parameter. Remember that some changes of the configuration are not allowed (for example, the MasterComponentConfig.disk_path must not change). Such configuration parameters must be provided in the constructor of MasterComponent.

AddBatch(batch = None, batch_filename = None, timeout = None, reset_scores = False, args=None)

Adds an instance of the Batch class to the processor queue. The master component creates a copy of the batch, so any further changes to the batch object will not be picked up. Alternatively, batch_filename gives the name of a file with a binary-serialized batch (use either batch or batch_filename, but not both at the same time).

This operation waits until there is enough space in the processor queue. It returns True if the wait succeeded within the timeout, otherwise it returns False. The timeout is given in milliseconds. By default AddBatch() is allowed to wait indefinitely.

args is an optional argument of AddBatchArgs type.
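
The blocking behaviour of AddBatch() resembles putting an item into a bounded queue with a timeout. The sketch below is only an analogy built on Python's standard queue module, not BigARTM code. The helper add_with_timeout and the queue size of 1 are made up for illustration.

```python
import queue


def add_with_timeout(processor_queue, item, timeout=None):
    """Analogy for the AddBatch() contract; timeout is in milliseconds."""
    # None means "wait indefinitely", mirroring the documented default.
    seconds = None if timeout is None else timeout / 1000.0
    try:
        processor_queue.put(item, block=True, timeout=seconds)
        return True
    except queue.Full:
        # No space became available within the timeout.
        return False


q = queue.Queue(maxsize=1)
first = add_with_timeout(q, "batch-1", timeout=50)   # the queue has room
second = add_with_timeout(q, "batch-2", timeout=50)  # the queue stays full
```

A caller that receives False can retry later or drop the batch; which is appropriate depends on the application.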

InvokeIteration(iterations_count = 1, disk_path = None, args=None)

Invokes several iterations over the collection. The recommended value for iterations_count is 1. disk_path defines the disk location with batches to process on this iteration. For more iterations, use a for loop around the InvokeIteration() method. This operation is asynchronous. Use WaitIdle() to wait until all iterations have completed.

args is an optional argument of InvokeIterationArgs type.

WaitIdle(timeout = None, args=None)

Waits for ongoing iterations to finish. Returns True if the iterations finished within the timeout, otherwise returns False. The timeout is given in milliseconds. Use timeout = -1 to allow infinite time for the WaitIdle() operation. Remember to call Model.Synchronize() to synchronize each model that you are currently processing.

args is an optional argument of WaitIdleArgs type.
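
The recommended call order (InvokeIteration(), then WaitIdle(), then Model.Synchronize()) can be sketched as a simple loop. The function run_iterations and the _Recorder class are hypothetical: _Recorder is a stand-in that only records method names, so the sketch runs without BigARTM. With real objects, master would be a MasterComponent and model a Model.

```python
def run_iterations(master, model, num_iterations):
    """Run several passes over the collection, one iteration at a time."""
    for _ in range(num_iterations):
        master.InvokeIteration(1)  # asynchronous, returns before the pass ends
        master.WaitIdle()          # wait until the pass has finished
        model.Synchronize()        # apply the collected increments to Phi


class _Recorder:
    """Stand-in object that records the name of every method called on it."""

    def __init__(self, log):
        self._log = log

    def __getattr__(self, name):
        def call(*args, **kwargs):
            self._log.append(name)
            return True
        return call


calls = []
run_iterations(_Recorder(calls), _Recorder(calls), num_iterations=2)
```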

CreateStream(stream)

Creates a data stream based on the stream (Stream).

RemoveStream(stream_name)

Removes a stream with the specific name from the master component.

GetTopicModel(model = None, args = None)

Retrieves and returns an instance of TopicModel class, carrying all the data of the topic model (including the Phi matrix). Parameter model should be an instance of Model class. For more settings use args parameter (see GetTopicModelArgs for all available options).

GetRegularizerState(regularizer_name)

Retrieves and returns the internal state of a regularizer with the specific name.

GetThetaMatrix(model = None, batch = None, clean_cache = None, args = None)

Retrieves an instance of ThetaMatrix class. The content depends on batch parameter. When batch is provided, the resulting ThetaMatrix will contain theta values estimated for all documents in the batch. When batch is not provided, the resulting ThetaMatrix will contain theta values gathered during the last iteration.

Parameter model should be an instance of Model class. For more settings use args parameter (see GetThetaMatrixArgs for all available options).

When used without batch, this operation requires MasterComponentConfig.cache_theta to be set to True before starting the last iteration. In this case the entire ThetaMatrix must fit into CPU memory, and for this reason MasterComponentConfig.cache_theta is turned off by default.
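
To judge whether the cached ThetaMatrix would fit into memory, a rough estimate can help. The sketch below assumes dense storage with one 4-byte value per (item, topic) pair. Both the storage layout and the helper name theta_cache_bytes are assumptions made for illustration, not a description of BigARTM internals.

```python
def theta_cache_bytes(num_items, topics_count, bytes_per_value=4):
    """Rough size of a dense theta cache (assumes 4-byte values)."""
    return num_items * topics_count * bytes_per_value


# One million documents and 100 topics under the assumptions above.
estimate = theta_cache_bytes(1000000, 100)
estimate_mb = estimate / 1e6
```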

Model

class artm.library.Model

This constructor must not be used explicitly. The only correct way of creating a Model is through MasterComponent.CreateModel() method.

name()

Returns the string name of the model.

Reconfigure(config = None)

Updates the configuration of the topic model with a new ModelConfig value, provided by the config parameter. When config is not specified, the configuration is updated with the config() value. Remember that some changes of the configuration are not applied immediately after this call. For example, changes to ModelConfig.topics_count or ModelConfig.topic_name will be applied only during the next Synchronize() call.

Note that changes to ModelConfig.topics_count or ModelConfig.topic_name are only supported on an idle master component (e.g. in between iterations). Changing these values during an ongoing iteration may cause unexpected results.

topics_count()

Returns the number of topics in the model.

config()

Returns current ModelConfig of the topic model.

Synchronize(decay_weight = 0.0, apply_weight = 1.0, invoke_regularizers = True, args=None)

This operation updates the Phi matrix of the topic model with all model increments, collected since the last call to Synchronize() method. The Phi matrix is calculated according to decay_weight and apply_weight (refer to SynchronizeModelArgs.decay_weight for more details). Depending on invoke_regularizers parameter this operation may also invoke all regularizers.

Remember to call Synchronize() every time after calling MasterComponent.WaitIdle().

For more settings use args parameter (see SynchronizeModelArgs for all available options).
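
The role of the two weights can be illustrated with a small pure-Python sketch. It assumes that the token-topic counters are combined as decay_weight * old + apply_weight * increment and then normalized per topic. This formula is an assumption made for illustration and ignores regularizers; SynchronizeModelArgs.decay_weight remains the authoritative description. The helper names merge_counters and normalize_columns are hypothetical.

```python
def merge_counters(n_old, n_inc, decay_weight=0.0, apply_weight=1.0):
    """Combine counters (rows: tokens, columns: topics) under the assumed rule."""
    return [
        [decay_weight * old + apply_weight * inc
         for old, inc in zip(row_old, row_inc)]
        for row_old, row_inc in zip(n_old, n_inc)
    ]


def normalize_columns(n):
    """Turn counters into a Phi-like matrix: each topic column sums to 1."""
    totals = [sum(column) for column in zip(*n)]
    return [[v / t if t > 0 else 0.0 for v, t in zip(row, totals)]
            for row in n]


n_old = [[4.0, 0.0], [0.0, 4.0]]
n_inc = [[1.0, 3.0], [3.0, 1.0]]
# With the default weights the old counters are dropped entirely.
phi = normalize_columns(merge_counters(n_old, n_inc))
```

Under this assumption, the defaults (decay_weight = 0.0, apply_weight = 1.0) rebuild Phi from the latest increments alone, while a non-zero decay_weight keeps part of the previous state.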

Initialize(dictionary = None, args=None)

Generates a random initial approximation for the Phi matrix of the topic model.

dictionary must be an instance of Dictionary class.

For more settings use args parameter (see InitializeModelArgs for all available options).

Export(filename)

Exports topic model into a file.

Import(filename)

Imports topic model from a file.

Overwrite(topic_model, commit = True)

Updates the model with new Phi matrix, defined by topic_model (TopicModel). This operation can be used to provide an explicit initial approximation of the topic model, or to adjust the model in between iterations.

Depending on the commit flag, the change is either applied immediately (commit = True) or queued (commit = False). The default is commit = True. You may want to use commit = False if your model is too big to be updated in a single protobuf message. In this case you should split your model into parts, each part containing a subset of all tokens, and then submit each part in a separate Overwrite() operation with commit = False. After that, remember to call MasterComponent.WaitIdle() and Model.Synchronize() to propagate your change.
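
Splitting a large model into parts comes down to chunking its token list. The helper below is a hypothetical sketch of that step only. It does not build TopicModel messages, and the part size of 2 is arbitrary.

```python
def split_into_parts(tokens, part_size):
    """Yield consecutive slices of tokens, each at most part_size long."""
    for start in range(0, len(tokens), part_size):
        yield tokens[start:start + part_size]


tokens = ["alpha", "beta", "gamma", "delta", "epsilon"]
parts = list(split_into_parts(tokens, part_size=2))
```

Each slice would then be used to fill a TopicModel that contains only those tokens and be passed to Overwrite() with commit = False, followed by a single MasterComponent.WaitIdle() and Model.Synchronize() once all parts have been submitted.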

Enable()

Sets ModelConfig.enabled to True for the current topic model. This means that the model will be updated on MasterComponent.InvokeIteration().

EnableScore(score)

By default the model does not calculate any scores, even if they are created with MasterComponent.CreateScore(). The EnableScore() method tells the model that score should be applied to it. score must be an instance of the Score class.

EnableRegularizer(regularizer, tau)

By default the model does not use any regularizers, even if they are created with MasterComponent.CreateRegularizer(). The EnableRegularizer() method tells the model that regularizer should be applied to it. Parameter tau defines the regularization coefficient of the regularizer. regularizer must be an instance of the Regularizer class.

Disable()

Sets ModelConfig.enabled to False for the current topic model. This means that the model will not be updated on MasterComponent.InvokeIteration(), but the scores for the model will still be collected.

Regularizer

class artm.library.Regularizer

This constructor must not be used explicitly. The only correct way of creating a Regularizer is through MasterComponent.CreateRegularizer() method (or similar methods in MasterComponent class, dedicated to a particular type of the regularizer).

name()

Returns the string name of the regularizer.

Reconfigure(type, config)

Updates the configuration of the regularizer with new regularizer configuration, provided by config parameter. The config object can be, for example, of SmoothSparseThetaConfig type (or similar). The type must match the current type of the regularizer.

Score

class artm.library.Score

This constructor must not be used explicitly. The only correct way of creating a Score is through MasterComponent.CreateScore() method (or similar methods in MasterComponent class, dedicated to a particular type of the score).

name()

Returns the string name of the score.

GetValue(model = None, batch = None)

Retrieves the score for a specific model. For cumulative scores such as Perplexity or ThetaSparsity it is possible to use the batch argument.

Dictionary

class artm.library.Dictionary(master_component, config)

This constructor must not be used explicitly. The only correct way of creating a Dictionary is through MasterComponent.CreateDictionary() method.

name()

Returns the string name of the dictionary.

Reconfigure(config)

Updates the configuration of the dictionary with new DictionaryConfig value, provided by config parameter.

Visualizers

class artm.library.Visualizers

This class provides a set of static methods to visualize some scores.

PrintTopTokensScore(top_tokens_score)

Prints the TopTokensScore.

PrintThetaSnippetScore(theta_snippet_score)

Prints the ThetaSnippetScore.

Exceptions

exception artm.library.InternalError

An exception class corresponding to ARTM_INTERNAL_ERROR error code.

exception artm.library.ArgumentOutOfRangeException

An exception class corresponding to ARTM_ARGUMENT_OUT_OF_RANGE error code.

exception artm.library.InvalidMasterIdException

An exception class corresponding to ARTM_INVALID_MASTER_ID error code.

exception artm.library.CorruptedMessageException

An exception class corresponding to ARTM_CORRUPTED_MESSAGE error code.

exception artm.library.InvalidOperationException

An exception class corresponding to ARTM_INVALID_OPERATION error code.

exception artm.library.DiskReadException

An exception class corresponding to ARTM_DISK_READ_ERROR error code.

exception artm.library.DiskWriteException

An exception class corresponding to ARTM_DISK_WRITE_ERROR error code.

Constants

artm.library.Stream_Type_Global
artm.library.Stream_Type_ItemIdModulus
artm.library.RegularizerConfig_Type_DirichletTheta
artm.library.RegularizerConfig_Type_DirichletPhi
artm.library.RegularizerConfig_Type_SmoothSparseTheta
artm.library.RegularizerConfig_Type_SmoothSparsePhi
artm.library.RegularizerConfig_Type_DecorrelatorPhi
artm.library.ScoreConfig_Type_Perplexity
artm.library.ScoreData_Type_Perplexity
artm.library.ScoreConfig_Type_SparsityTheta
artm.library.ScoreData_Type_SparsityTheta
artm.library.ScoreConfig_Type_SparsityPhi
artm.library.ScoreData_Type_SparsityPhi
artm.library.ScoreConfig_Type_ItemsProcessed
artm.library.ScoreData_Type_ItemsProcessed
artm.library.ScoreConfig_Type_TopTokens
artm.library.ScoreData_Type_TopTokens
artm.library.ScoreConfig_Type_ThetaSnippet
artm.library.ScoreData_Type_ThetaSnippet
artm.library.ScoreConfig_Type_TopicKernel
artm.library.ScoreData_Type_TopicKernel
artm.library.PerplexityScoreConfig_Type_UnigramDocumentModel
artm.library.PerplexityScoreConfig_Type_UnigramCollectionModel
artm.library.CollectionParserConfig_Format_BagOfWordsUci