This document describes all classes in the Python interface of the BigARTM library.
- class artm.library.Library(artm_shared_library = "")
Creates an ArtmLibrary object, wrapping the BigARTM shared library.
artm_shared_library is an optional argument that provides the full file name of the ARTM shared library (a disk path plus artm.dll on Windows or artm.so on Linux). When artm_shared_library is not specified, the shared library is searched for in the folders listed in the PATH system variable. You may also set the ARTM_SHARED_LIBRARY system variable to provide the full file name of the ARTM shared library.
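The lookup described above can be sketched in plain Python. This is only an illustration and not the library's own code: the helper name resolve_artm_library is hypothetical, and the precedence of ARTM_SHARED_LIBRARY over the PATH search is an assumption, since the text does not state which one is consulted first.

```python
import os
import sys


def resolve_artm_library(artm_shared_library="", environ=None):
    """Return a candidate path for the ARTM shared library, or None.

    Hypothetical helper; the real lookup inside artm.library may differ.
    """
    environ = os.environ if environ is None else environ

    # 1. An explicitly passed file name is used as is.
    if artm_shared_library:
        return artm_shared_library

    # 2. Assumption: the ARTM_SHARED_LIBRARY variable is checked next.
    from_env = environ.get("ARTM_SHARED_LIBRARY", "")
    if from_env:
        return from_env

    # 3. Otherwise look for the default file name in every PATH folder.
    default_name = "artm.dll" if sys.platform.startswith("win") else "artm.so"
    for folder in environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(folder, default_name)
        if folder and os.path.isfile(candidate):
            return candidate
    return None
```

Passing the environment as a parameter keeps the sketch easy to try out without touching the real process environment.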
- CreateMasterComponent(config = messages_pb2.MasterComponentConfig())
Creates a node controller at a specific endpoint.
Returns an instance of NodeController class.
Loads a DictionaryConfig from the file, defined by full_filename argument.
For more information refer to ArtmRequestLoadDictionary().
Loads a Batch from the file, defined by full_filename argument.
For more information refer to ArtmRequestLoadBatch().
- ParseCollectionOrLoadDictionary(docword_file_path, vocab_file_path, target_folder)
The docword_file_path and vocab_file_path arguments should provide the disk location of docword and vocab files of the collection to be parsed.
- class artm.library.MasterComponent(config = messages_pb2.MasterComponentConfig(), lib = None, disk_path = None, proxy_endpoint = None)
Creates a master component.
config is an optional instance of MasterComponentConfig, providing an initial configuration of the master component.
disk_path is an optional value providing the disk folder with batches to be processed by this master component. Changing disk_path is not supported (you must create a new instance of MasterComponent to do so). InvokeIteration() processes all batches located under disk_path. Alternatively, use AddBatch() to add a specific batch to the processor queue.
proxy_endpoint is an optional string value that provides the connection endpoint of a remote node controller. When specified, the master component operates in proxy mode (that is, it redirects all commands to the remote master component instantiated in the remote node controller).
Disposes the master component and releases all unmanaged resources (like memory, network connections, etc).
- CreateModel(config = messages_pb2.ModelConfig(), topics_count = None, inner_iterations_count = None)
Creates and returns an instance of Model class based on a given ModelConfig. Note that the model has to be further tuned by several iterative scans over the text collection. Use InvokeIteration() to perform such scans.
The topics_count and inner_iterations_count parameters override the values specified in config.
Removes an instance of Model from the master component. After this operation the model object becomes invalid and must not be used.
- CreateRegularizer(name, type, config)
Creates and returns an instance of Regularizer component. name can be any unique identifier that you can later use to identify the regularizer (for example, in ModelConfig.regularizer_name). type can be any regularizer type (for example, the RegularizerConfig_Type_DirichletTheta). config can be any regularizer config (for example, a SmoothSparseThetaConfig).
- CreateDirichletThetaRegularizer(name = None, config = messages_pb2.DirichletThetaConfig())
Creates an instance of DirichletThetaRegularizer.
- CreateDirichletPhiRegularizer(name = None, config = messages_pb2.DirichletPhiConfig())
Creates an instance of DirichletPhiRegularizer.
- CreateSmoothSparseThetaRegularizer(name = None, config = messages_pb2.SmoothSparseThetaConfig())
Creates an instance of SmoothSparseThetaRegularizer.
- CreateSmoothSparsePhiRegularizer(name = None, config = messages_pb2.SmoothSparsePhiConfig())
Creates an instance of SmoothSparsePhiRegularizer.
- CreateDecorrelatorPhiRegularizer(name = None, config = messages_pb2.DecorrelatorPhiConfig())
Creates an instance of DecorrelatorPhiRegularizer.
Removes an instance of Regularizer from the master component. After this operation the regularizer object becomes invalid and must not be used.
- CreateScore(name, type, config)
Creates a score calculator inside the master component. name can be any unique identifier that you can later use to identify the score (for example, in ModelConfig.score_name). type can be any score type (for example, the ScoreConfig_Type_Perplexity). config can be any score config (for example, a PerplexityScoreConfig).
- CreatePerplexityScore(self, name = None, config = messages_pb2.PerplexityScoreConfig(), stream_name = None)
Creates an instance of PerplexityScore.
- CreateSparsityThetaScore(self, name = None, config = messages_pb2.SparsityThetaScoreConfig())
Creates an instance of SparsityThetaScore.
- CreateSparsityPhiScore(self, name = None, config = messages_pb2.SparsityPhiScoreConfig())
Creates an instance of SparsityPhiScore.
- CreateItemsProcessedScore(self, name = None, config = messages_pb2.ItemsProcessedScoreConfig())
Creates an instance of ItemsProcessedScore.
- CreateTopTokensScore(self, name = None, config = messages_pb2.TopTokensScoreConfig(), num_tokens = None, class_id = None)
Creates an instance of TopTokensScore.
- CreateThetaSnippetScore(self, name = None, config = messages_pb2.ThetaSnippetScoreConfig())
Creates an instance of ThetaSnippetScore.
- CreateTopicKernelScore(self, name = None, config = messages_pb2.TopicKernelScoreConfig())
Creates an instance of TopicKernelScore.
Removes a score calculator with the specific name from the master component.
Removes an instance of Dictionary from the master component. After this operation the dictionary object becomes invalid and must not be used.
- Reconfigure(config = None)
Updates the configuration of the master component with the new MasterComponentConfig value provided by the config parameter. Remember that some configuration changes are not allowed (for example, MasterComponentConfig.disk_path must not change). Such configuration parameters must be provided in the constructor of MasterComponent.
- AddBatch(self, batch = None, batch_filename = None, timeout = -1, reset_scores = False)
Adds an instance of Batch class to the processor queue. The master component creates a copy of the batch, so any further changes to the batch object will not be picked up. batch_filename is an alternative to batch: the name of a file with a binary-serialized batch (you must use either batch or batch_filename, but not both at the same time).
This operation waits until there is enough space in the processor queue. It returns True if the wait succeeded within the timeout, otherwise it returns False. The timeout is given in milliseconds. Use timeout = -1 to allow infinite time for the AddBatch() operation.
- InvokeIteration(iterations_count = 1)
Invokes several iterations over the collection. The recommended value for iterations_count is 1. For more iterations, use a for loop around the InvokeIteration() method. This operation is asynchronous. Use WaitIdle() to wait until all iterations have finished.
- WaitIdle(timeout = -1)
Waits for ongoing iterations to finish. Returns True if the iterations finished within the timeout, otherwise returns False. The timeout is given in milliseconds. Use timeout = -1 to allow infinite time for the WaitIdle() operation. Remember to call Model.Synchronize() to synchronize each model that you are currently processing.
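The recommended pattern (one InvokeIteration() call per pass, followed by WaitIdle() and Model.Synchronize()) might look like the sketch below. StubMaster and StubModel are hypothetical stand-ins that only record calls so the snippet runs without BigARTM installed; with the real library you would pass a MasterComponent and a Model instead. The helper name run_passes is also an assumption.

```python
class StubMaster:
    """Hypothetical stand-in for MasterComponent; records calls only."""

    def __init__(self, log):
        self.log = log

    def InvokeIteration(self, iterations_count=1):
        self.log.append("InvokeIteration(%d)" % iterations_count)

    def WaitIdle(self, timeout=-1):
        self.log.append("WaitIdle(%d)" % timeout)
        return True  # pretend the iteration finished within the timeout


class StubModel:
    """Hypothetical stand-in for Model; records calls only."""

    def __init__(self, log):
        self.log = log

    def Synchronize(self):
        self.log.append("Synchronize()")


def run_passes(master, model, passes, timeout_ms=-1):
    """Run several passes; return False as soon as a wait times out."""
    for _ in range(passes):
        master.InvokeIteration(1)           # asynchronous: returns at once
        if not master.WaitIdle(timeout_ms):
            return False                    # iteration still running
        model.Synchronize()                 # fold increments into the model
    return True


log = []
completed = run_passes(StubMaster(log), StubModel(log), passes=2)
```

Checking the return value of WaitIdle() matters only when a finite timeout is used; with timeout = -1 the call blocks until the iteration is done.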
Removes a stream with the specific name from the master component.
- GetTopicModel(model = None, args = messages_pb2.GetTopicModelArgs())
Retrieves and returns an instance of TopicModel class, carrying all the data of the topic model (including the Phi matrix). Parameter model should be an instance of Model class. For more settings use args parameter (see GetTopicModelArgs for all available options).
Retrieves and returns the internal state of a regularizer with the specific name.
- GetThetaMatrix(model = None, batch = None, args = messages_pb2.GetThetaMatrixArgs())
Retrieves an instance of ThetaMatrix class. The content depends on the batch parameter. When batch is provided, the resulting ThetaMatrix contains theta values estimated for all documents in the batch. When batch is not provided, the resulting ThetaMatrix contains theta values gathered during the last iteration.
When used without batch, this operation requires MasterComponentConfig.cache_theta to be set to True before starting the last iteration. In this case the entire ThetaMatrix must fit into CPU memory, and for this reason MasterComponentConfig.cache_theta is turned off by default.
This operation is not supported when MasterComponentConfig.modus_operandi is set to Network.
- class artm.library.NodeController(endpoint, lib = None)
Creates a node controller on a specific endpoint. See NodeControllerConfig for more details.
Disposes the node controller and releases all unmanaged resources (like memory, network connections, etc).
- class artm.library.Model
This constructor must not be used explicitly. The only correct way of creating a Model is through MasterComponent.CreateModel() method.
Returns the string name of the model.
- Reconfigure(config = None)
Updates the configuration of the topic model with the new ModelConfig value provided by the config parameter. When config is not specified, the configuration is updated with the config() value. Remember that some configuration changes are not applied immediately after this call. For example, changes to ModelConfig.topics_count or ModelConfig.topic_name are applied only during the next Synchronize() call.
Note that changes to ModelConfig.topics_count or ModelConfig.topic_name are only supported on an idle master component (e.g. in between iterations). Changing these values during an ongoing iteration may cause unexpected results.
Returns the number of topics in the model.
- Synchronize(decay_weight = 0.0, apply_weight = 1.0, invoke_regularizers = True)
This operation updates the Phi matrix of the topic model with all model increments, collected since the last call to Synchronize() method. The Phi matrix is calculated according to decay_weight and apply_weight (refer to SynchronizeModelArgs.decay_weight for more details). Depending on invoke_regularizers parameter this operation may also invoke all regularizers.
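The exact formula is deferred to SynchronizeModelArgs.decay_weight and is not given here. The sketch below assumes a common reading of the two weights: the old token-topic counters are scaled by decay_weight, the collected increments are scaled by apply_weight, and each topic column of the sum is normalized to obtain Phi. This is an assumption for illustration, not the library's code; regularizers are ignored, and the function names are hypothetical.

```python
def synchronize_counters(n_wt, n_wt_inc, decay_weight=0.0, apply_weight=1.0):
    """Combine old counters and increments (rows: tokens, columns: topics).

    Assumed rule: new = decay_weight * old + apply_weight * increment.
    """
    return [
        [decay_weight * old + apply_weight * inc
         for old, inc in zip(row_old, row_inc)]
        for row_old, row_inc in zip(n_wt, n_wt_inc)
    ]


def normalize_columns(n_wt):
    """Turn counters into Phi by making every topic column sum to 1."""
    topics = len(n_wt[0])
    totals = [sum(row[t] for row in n_wt) for t in range(topics)]
    return [
        [row[t] / totals[t] if totals[t] > 0 else 0.0 for t in range(topics)]
        for row in n_wt
    ]


old = [[4.0, 0.0], [4.0, 8.0]]   # counters before the call
inc = [[1.0, 2.0], [3.0, 2.0]]   # increments since the last Synchronize()
merged = synchronize_counters(old, inc, decay_weight=0.5, apply_weight=1.0)
phi = normalize_columns(merged)
```

Under this reading, the defaults (decay_weight = 0.0, apply_weight = 1.0) discard the old counters and rebuild the matrix from the latest increments alone.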
- Initialize(tokens, dictionary)
Generates a random initial approximation for the Phi matrix of the topic model.
dictionary must be an instance of Dictionary class.
- Overwrite(topic_model, commit = True)
Updates the model with a new Phi matrix, defined by topic_model (TopicModel). This operation can be used to provide an explicit initial approximation of the topic model, or to adjust the model in between iterations.
Depending on the commit flag the change can be applied immediately (commit = true) or queued (commit = false). The default setting is commit = true. You may want to use commit = false if your model is too big to be updated in a single protobuf message. In this case you should split your model into parts, each part containing a subset of all tokens, and then submit each part in a separate Overwrite operation with commit = false. After that remember to call MasterComponent.WaitIdle() and Model.Synchronize() to propagate your change.
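The queued update described above can be sketched as follows. How a TopicModel part is built from a subset of tokens is not covered in this document, so the sketch uses plain token lists as placeholders for the parts. The helpers split_tokens and overwrite_in_parts and the recording stubs are hypothetical; only Overwrite(), WaitIdle() and Synchronize() come from the text.

```python
def split_tokens(tokens, part_size):
    """Split the token list into consecutive chunks of at most part_size."""
    return [tokens[i:i + part_size] for i in range(0, len(tokens), part_size)]


def overwrite_in_parts(master, model, parts):
    """Queue every part, then propagate the change."""
    for part in parts:
        model.Overwrite(part, commit=False)  # queued, not applied yet
    master.WaitIdle()
    model.Synchronize()


class StubMaster:
    """Hypothetical stand-in for MasterComponent; records calls only."""

    def __init__(self, log):
        self.log = log

    def WaitIdle(self, timeout=-1):
        self.log.append("WaitIdle")
        return True


class StubModel:
    """Hypothetical stand-in for Model; records calls only."""

    def __init__(self, log):
        self.log = log

    def Overwrite(self, topic_model, commit=True):
        self.log.append(("Overwrite", len(topic_model), commit))

    def Synchronize(self):
        self.log.append("Synchronize")


log = []
parts = split_tokens(["a", "b", "c", "d", "e"], part_size=2)
overwrite_in_parts(StubMaster(log), StubModel(log), parts)
```

In real use each placeholder list would be replaced by a TopicModel message that carries only the rows for that subset of tokens.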
By default a model does not calculate any scores, even if they were created with MasterComponent.CreateScore(). The EnableScore method tells the model that the score should be applied to it. score must be an instance of Score class.
- EnableRegularizer(regularizer, tau)
By default a model does not use any regularizers, even if they were created with MasterComponent.CreateRegularizer(). The EnableRegularizer method tells the model that the regularizer should be applied to it. The tau parameter defines the regularization coefficient of the regularizer. regularizer must be an instance of Regularizer class.
- class artm.library.Regularizer
This constructor must not be used explicitly. The only correct way of creating a Regularizer is through MasterComponent.CreateRegularizer() method (or similar methods in MasterComponent class, dedicated to a particular type of the regularizer).
Returns the string name of the regularizer.
- class artm.library.Score
This constructor must not be used explicitly. The only correct way of creating a Score is through MasterComponent.CreateScore() method (or similar methods in MasterComponent class, dedicated to a particular type of the score).
Returns the string name of the score.
- GetValue(model = None, batch = None)
Retrieves the score for a specific model. For cumulative scores such as Perplexity or ThetaSparsity, it is possible to use the batch argument.
- class artm.library.Dictionary(master_component, config)
This constructor must not be used explicitly. The only correct way of creating a Dictionary is through MasterComponent.CreateDictionary() method.
Returns the string name of the dictionary.
- exception artm.library.InternalError
An exception class corresponding to ARTM_INTERNAL_ERROR error code.
- exception artm.library.ArgumentOutOfRangeException
An exception class corresponding to ARTM_ARGUMENT_OUT_OF_RANGE error code.
- exception artm.library.InvalidMasterIdException
An exception class corresponding to ARTM_INVALID_MASTER_ID error code.
- exception artm.library.CorruptedMessageException
An exception class corresponding to ARTM_CORRUPTED_MESSAGE error code.
- exception artm.library.InvalidOperationException
An exception class corresponding to ARTM_INVALID_OPERATION error code.
- exception artm.library.DiskReadException
An exception class corresponding to ARTM_DISK_READ_ERROR error code.
- exception artm.library.DiskWriteException
An exception class corresponding to ARTM_DISK_WRITE_ERROR error code.
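The correspondence between error codes and exception classes listed above can be pictured as a lookup table. The classes below are local stand-ins with the same names, defined only so the snippet runs without BigARTM. The numeric values of the error codes are not given in this document, so the sketch keys the table by code name. The helper raise_for_error is hypothetical and does not reflect how the library dispatches errors internally.

```python
class InternalError(Exception): pass
class ArgumentOutOfRangeException(Exception): pass
class InvalidMasterIdException(Exception): pass
class CorruptedMessageException(Exception): pass
class InvalidOperationException(Exception): pass
class DiskReadException(Exception): pass
class DiskWriteException(Exception): pass


# Error code name -> exception class, exactly as listed in the reference.
EXCEPTION_BY_CODE_NAME = {
    "ARTM_INTERNAL_ERROR": InternalError,
    "ARTM_ARGUMENT_OUT_OF_RANGE": ArgumentOutOfRangeException,
    "ARTM_INVALID_MASTER_ID": InvalidMasterIdException,
    "ARTM_CORRUPTED_MESSAGE": CorruptedMessageException,
    "ARTM_INVALID_OPERATION": InvalidOperationException,
    "ARTM_DISK_READ_ERROR": DiskReadException,
    "ARTM_DISK_WRITE_ERROR": DiskWriteException,
}


def raise_for_error(code_name, message=""):
    """Raise the exception class that corresponds to an error code name."""
    exception_class = EXCEPTION_BY_CODE_NAME.get(code_name)
    if exception_class is None:
        raise ValueError("unknown error code: %s" % code_name)
    raise exception_class(message)
```

Calling code can then catch a specific class, for example DiskReadException when a batch file cannot be read, instead of inspecting raw error codes.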