Messages¶
This document explains all protobuf messages that can be transferred between the user code and the BigARTM library.
Warning
Remember that all fields are marked as optional to enhance backwards compatibility of the binary protobuf format. Some fields will cause a run-time exception when not specified. Please refer to the documentation of each field for more details.
Note that we discourage any usage of fields marked as obsolete. Those fields will be removed in future releases.
DoubleArray¶
-
class
messages_pb2.
DoubleArray
¶
Represents an array of double-precision floating point values.
message DoubleArray {
repeated double value = 1 [packed = true];
}
FloatArray¶
-
class
messages_pb2.
FloatArray
¶
Represents an array of single-precision floating point values.
message FloatArray {
repeated float value = 1 [packed = true];
}
BoolArray¶
-
class
messages_pb2.
BoolArray
¶
Represents an array of boolean values.
message BoolArray {
repeated bool value = 1 [packed = true];
}
IntArray¶
-
class
messages_pb2.
IntArray
¶
Represents an array of integer values.
message IntArray {
repeated int32 value = 1 [packed = true];
}
Item¶
-
class
messages_pb2.
Item
¶
Represents a unit of textual information. A typical example of an item is a document that belongs to some text collection.
message Item {
optional int32 id = 1;
repeated Field field = 2;
optional string title = 3;
}
-
Item.
id
¶ An integer identifier of the item.
-
Item.
field
¶ A set of all fields within the item.
-
Item.
title
¶ An optional title of the item.
Field¶
-
class
messages_pb2.
Field
¶
Represents a field within an item. The idea behind fields is that each item might have its title, author, body, abstract, actual text, links, year of publication, etc. Each of these entities should be represented as a Field. The topic model defines how those fields should be taken into account when BigARTM infers a topic model. Currently each field is represented as a “bag-of-words”: each token is listed together with the number of its occurrences. Note that each Field is always part of an Item, an Item is part of a Batch, and a Batch always contains a list of tokens. Therefore, each Field just lists the indexes of tokens in the Batch.
message Field {
optional string name = 1 [default = "@body"];
repeated int32 token_id = 2;
repeated int32 token_count = 3;
repeated int32 token_offset = 4;
optional string string_value = 5;
optional int64 int_value = 6;
optional double double_value = 7;
optional string date_value = 8;
repeated string string_array = 16;
repeated int64 int_array = 17;
repeated double double_array = 18;
repeated string date_array = 19;
}
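The bag-of-words layout described above can be sketched in plain Python. This is only an illustration, not the BigARTM API: the helper name build_batch and the use of dictionaries in place of the generated protobuf classes are assumptions made for the example. The keys mirror the token_id and token_count fields of Field and the token field of Batch.

```python
from collections import Counter


def build_batch(documents):
    """Build a Batch-like dict from tokenized documents (lists of strings).

    Hypothetical helper: plain dicts stand in for the protobuf messages.
    """
    tokens = []   # mirrors Batch.token: every distinct token of the batch
    index = {}    # token -> its position in `tokens`
    items = []    # mirrors Batch.item
    for item_id, words in enumerate(documents):
        token_id, token_count = [], []
        for word, count in Counter(words).items():
            if word not in index:
                index[word] = len(tokens)
                tokens.append(word)
            token_id.append(index[word])   # index into the batch-level token list
            token_count.append(count)      # number of occurrences in this item
        field = {"name": "@body", "token_id": token_id, "token_count": token_count}
        items.append({"id": item_id, "field": [field]})
    return {"token": tokens, "item": items}


batch = build_batch([["cat", "sat", "cat"], ["dog", "sat"]])
```

Each field stores only integer indexes, so a token that occurs in many items is spelled out once per batch.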
Batch¶
-
class
messages_pb2.
Batch
¶
Represents a set of items. In BigARTM a batch is never split into smaller parts. When it comes to concurrency this means that each batch goes to a single processor. Two batches can be processed concurrently, but items in one batch are always processed sequentially.
message Batch {
repeated string token = 1;
repeated Item item = 2;
repeated string class_id = 3;
optional string description = 4;
optional string id = 5;
}
-
Batch.
token
¶ A set of values that defines all tokens that may appear in the batch.
-
Batch.
item
¶ A set of items of the batch.
-
Batch.
class_id
¶ A set of values that define the classes (modalities) of tokens. This repeated field must have the same length as token. This value is optional; use an empty list to indicate that all tokens belong to the default class.
-
Batch.
description
¶ An optional text description of the batch. You may describe for example the source of the batch, preprocessing technique and the structure of its fields.
-
Batch.
id
¶ Unique identifier of the batch in the form of a GUID (example:
4fb38197-3f09-4871-9710-392b14f00d2e
). This field is required.
Stream¶
-
class
messages_pb2.
Stream
¶
Represents a configuration of a stream. Streams provide a mechanism to split the entire collection into virtual subsets (for example, the ‘train’ and ‘test’ streams).
message Stream {
enum Type {
Global = 0;
ItemIdModulus = 1;
}
optional Type type = 1 [default = Global];
optional string name = 2 [default = "@global"];
optional int32 modulus = 3;
repeated int32 residuals = 4;
}
-
Stream.
type
¶ A value that defines the type of the stream.
Global
Defines a stream containing all items in the collection.
ItemIdModulus
Defines a stream containing all items whose ID matches modulus and residuals. An item belongs to the stream if and only if the remainder of its ID modulo modulus is contained in the residuals field.
-
Stream.
name
¶ A value that defines the name of the stream. The name must be unique across all streams defined in the master component.
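The membership rule for the two stream types can be sketched as follows. The function in_stream is a hypothetical helper written for this illustration and is not part of the BigARTM API; it only restates the rule given for Global and ItemIdModulus.

```python
def in_stream(item_id, stream_type="Global", modulus=None, residuals=()):
    """Return True if the item with this ID belongs to the stream (illustrative only)."""
    if stream_type == "Global":
        # A Global stream contains every item of the collection.
        return True
    if stream_type == "ItemIdModulus":
        # The item belongs to the stream iff (item_id mod modulus) is listed in residuals.
        return item_id % modulus in residuals
    raise ValueError("unknown stream type: %r" % (stream_type,))


# A 90% / 10% train/test split expressed with two ItemIdModulus streams.
train = dict(stream_type="ItemIdModulus", modulus=10, residuals=range(0, 9))
test = dict(stream_type="ItemIdModulus", modulus=10, residuals=[9])
```

With these settings every item ID falls into exactly one of the two streams, although streams are allowed to overlap when that is useful.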
MasterComponentConfig¶
-
class
messages_pb2.
MasterComponentConfig
¶
Represents a configuration of a master component.
message MasterComponentConfig {
optional string disk_path = 2;
repeated Stream stream = 3;
optional bool compact_batches = 4 [default = true];
optional bool cache_theta = 5 [default = false];
optional int32 processors_count = 6 [default = 1];
optional int32 processor_queue_max_size = 7 [default = 10];
optional int32 merger_queue_max_size = 8 [default = 10];
repeated ScoreConfig score_config = 9;
optional bool online_batch_processing = 13 [default = false]; // obsolete in BigARTM v0.5.8
optional string disk_cache_path = 15;
}
-
MasterComponentConfig.
disk_path
¶ A value that defines the disk location to store or load the collection.
-
MasterComponentConfig.
stream
¶ A set of all data streams to configure in master component. Streams can overlap if needed.
-
MasterComponentConfig.
compact_batches
¶ A flag indicating whether to compact batches in AddBatch() operation. Compaction is a process that shrinks the dictionary of each batch by removing all unused tokens.
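As a rough illustration of what compaction means, the sketch below drops unused tokens from a batch dictionary and renumbers the token indexes. It is an assumption about the general idea, not BigARTM's actual implementation; compact_batch and its argument layout are hypothetical.

```python
def compact_batch(tokens, items):
    """Drop tokens that no item refers to and renumber the remaining ones.

    tokens: list of strings (the batch dictionary).
    items:  list of (token_id, token_count) pairs, one pair per item.
    """
    # Collect every token index that is actually referenced by some item.
    used = sorted({tid for token_id, _ in items for tid in token_id})
    # Map old indexes to new, densely packed indexes.
    remap = {old: new for new, old in enumerate(used)}
    new_tokens = [tokens[old] for old in used]
    new_items = [([remap[tid] for tid in token_id], list(token_count))
                 for token_id, token_count in items]
    return new_tokens, new_items


compact_tokens, compact_items = compact_batch(
    ["a", "b", "c", "d"],
    [([0, 3], [1, 2]), ([3], [5])],
)
```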
-
MasterComponentConfig.
cache_theta
¶ A flag indicating whether to cache the theta matrix. The theta matrix defines the discrete probability distribution of each document across the topics in the topic model. By default BigARTM infers this distribution every time it processes the document. The ‘cache_theta’ option caches the theta matrix and reuses theta values when the same document is processed on the next iteration. This option must be set to ‘true’ before calling method
ArtmRequestThetaMatrix()
.
-
MasterComponentConfig.
processors_count
¶ A value that defines the number of concurrent processor components. The number of processors should normally not exceed the number of CPU cores.
-
MasterComponentConfig.
processor_queue_max_size
¶ A value that defines the maximal size of the processor queue. The processor queue contains batches prefetched from disk into memory. Recommendations regarding the maximal queue size are as follows:
- the queue size should be at least as large as the number of concurrent processors;
-
MasterComponentConfig.
merger_queue_max_size
¶ A value that defines the maximal size of the merger queue. The merger queue contains incremental updates of the topic model, produced by processor components. Try reducing this parameter if BigARTM consumes too much memory.
-
MasterComponentConfig.
score_config
¶ A set of all scores, available for calculation.
-
MasterComponentConfig.
online_batch_processing
¶ Obsolete in BigARTM v0.5.8.
-
MasterComponentConfig.
disk_cache_path
¶ A value that defines a writable disk location where this master component can store some temporary files. This can reduce memory usage, particularly when the cache_theta option is enabled. Note that on clean shutdown the master component will clean this folder automatically, but otherwise it is your responsibility to clean this folder to avoid running out of disk space.
ModelConfig¶
-
class
messages_pb2.
ModelConfig
¶
Represents a configuration of a topic model.
message ModelConfig {
optional string name = 1 [default = "@model"];
optional int32 topics_count = 2 [default = 32];
repeated string topic_name = 3;
optional bool enabled = 4 [default = true];
optional int32 inner_iterations_count = 5 [default = 10];
optional string field_name = 6 [default = "@body"]; // obsolete in BigARTM v0.5.8
optional string stream_name = 7 [default = "@global"];
repeated string score_name = 8;
optional bool reuse_theta = 9 [default = false];
repeated string regularizer_name = 10;
repeated double regularizer_tau = 11;
repeated string class_id = 12;
repeated float class_weight = 13;
optional bool use_sparse_bow = 14 [default = true];
optional bool use_random_theta = 15 [default = false];
optional bool use_new_tokens = 16 [default = true];
optional bool opt_for_avx = 17 [default = true];
}
-
ModelConfig.
name
¶ A value that defines the name of the topic model. The name must be unique across all models defined in the master component.
-
ModelConfig.
topics_count
¶ A value that defines the number of topics in the topic model.
-
ModelConfig.
topic_name
¶ A repeated field that defines the names of the topics. All topic names must be unique within each topic model. This field is optional, but either topics_count or topic_name must be specified. If both are specified, then topics_count will be ignored, and the number of topics in the model will be based on the length of the topic_name field. When topic_name is not specified, the names for all topics will be autogenerated.
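The interplay between topics_count and topic_name can be sketched as below. The helper resolve_topic_names is hypothetical, and the "topic_N" pattern for autogenerated names is an assumption chosen for the example; the naming scheme that BigARTM actually uses is not stated here.

```python
def resolve_topic_names(topics_count=32, topic_name=()):
    """Return the effective list of topic names (illustrative only)."""
    if topic_name:
        names = list(topic_name)
        # Topic names must be unique within a topic model.
        if len(set(names)) != len(names):
            raise ValueError("topic names must be unique")
        # topic_name wins: topics_count is ignored when both are given.
        return names
    # Assumed naming pattern for autogenerated names.
    return ["topic_%d" % i for i in range(topics_count)]


explicit = resolve_topic_names(topics_count=5, topic_name=["sports", "politics"])
generated = resolve_topic_names(topics_count=3)
```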
-
ModelConfig.
enabled
¶ A flag indicating whether to update the model during iterations.
-
ModelConfig.
inner_iterations_count
¶ A value that defines the fixed number of iterations, performed to infer the theta distribution for each document.
-
ModelConfig.
field_name
¶ Obsolete in BigARTM v0.5.8
-
ModelConfig.
stream_name
¶ A value that defines which stream the model should use.
-
ModelConfig.
score_name
¶ A set of names that defines which scores should be calculated for the model.
-
ModelConfig.
reuse_theta
¶ A flag indicating whether the model should reuse theta values cached on the previous iterations. This option requires the cache_theta flag to be set to ‘true’ in MasterComponentConfig.
-
ModelConfig.
regularizer_name
¶ A set of names that define which regularizers should be enabled for the model. This repeated field must have the same length as
regularizer_tau
.
-
ModelConfig.
regularizer_tau
¶ A set of values that define the regularization coefficients of the corresponding regularizer. This repeated field must have the same length as
regularizer_name
.
-
ModelConfig.
class_id
¶ A set of values that define the classes (modalities) for which to build the topic model. This repeated field must have the same length as
class_weight
.
-
ModelConfig.
class_weight
¶ A set of values that define the weights of the corresponding classes (modalities). This repeated field must have the same length as
class_id
. This value is optional, use an empty list to set equal weights for all classes.
-
ModelConfig.
use_sparse_bow
¶ A flag indicating whether to use a sparse representation of the bag-of-words data. The default setting (use_sparse_bow = true) is best suited for processing textual collections where every token appears in a small fraction of all documents. The dense representation (use_sparse_bow = false) is a better fit for non-textual collections (for example, for matrix factorization).
Note that class_weight and class_id must not be used together with use_sparse_bow = false.
-
ModelConfig.
use_random_theta
¶ A flag indicating whether to initialize the p(t|d) distribution with a random uniform distribution. The default setting (use_random_theta = false) sets p(t|d) = 1/T, where T stands for topics_count. Note that the reuse_theta flag takes priority over the use_random_theta flag: if reuse_theta = true and there is a cache entry from the previous iteration, the cache entry will be used regardless of the use_random_theta flag.
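The priority between reuse_theta and use_random_theta can be sketched as follows. The helper initial_theta is hypothetical, and normalizing the random values so that they sum to one is an assumption of this sketch rather than a documented detail.

```python
import random


def initial_theta(topics_count, use_random_theta=False, reuse_theta=False,
                  cached=None, rng=random):
    """Pick the starting p(t|d) vector for one document (illustrative only)."""
    if reuse_theta and cached is not None:
        # A cache entry from the previous iteration always wins.
        return list(cached)
    if use_random_theta:
        # Assumption: draw uniform random values and normalize them.
        raw = [rng.random() for _ in range(topics_count)]
        total = sum(raw)
        return [value / total for value in raw]
    # Default: the uniform distribution p(t|d) = 1/T.
    return [1.0 / topics_count] * topics_count


uniform = initial_theta(4)
from_cache = initial_theta(2, use_random_theta=True, reuse_theta=True, cached=[0.9, 0.1])
```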
-
ModelConfig.
use_new_tokens
¶ A flag indicating whether to automatically include new tokens into the topic model. This setting is set to True by default. As a result, every new token observed in batches is automatically incorporated into the topic model during the next model synchronization (ArtmSynchronizeModel()). The n_wt weights for new tokens are randomly generated from the [0..1] range.
-
ModelConfig.
opt_for_avx
¶ An experimental flag that disables AVX optimization in the processor when set to false. By default this option is enabled, as on average it adds ca. 40% speedup on physical hardware. You may want to disable this option if you are running on Windows inside a virtual machine, or in situations when BigARTM performance degrades from iteration to iteration.
This option does not affect the results, and is only intended for advanced users experimenting with BigARTM performance.
RegularizerConfig¶
-
class
messages_pb2.
RegularizerConfig
¶
Represents a configuration of a general regularizer.
message RegularizerConfig {
enum Type {
SmoothSparseTheta = 0;
SmoothSparsePhi = 1;
DecorrelatorPhi = 2;
LabelRegularizationPhi = 4;
}
optional string name = 1;
optional Type type = 2;
optional bytes config = 3;
}
-
RegularizerConfig.
name
¶ A value that defines the name of the regularizer. The name must be unique across all names defined in the master component.
-
RegularizerConfig.
type
¶ A value that defines the type of the regularizer.
SmoothSparseTheta
Smooth-sparse regularizer for theta matrix
SmoothSparsePhi
Smooth-sparse regularizer for phi matrix
DecorrelatorPhi
Decorrelator regularizer for phi matrix
LabelRegularizationPhi
Label regularizer for phi matrix
-
RegularizerConfig.
config
¶ A serialized protobuf message that describes regularizer config for the specific regularizer type.
SmoothSparseThetaConfig¶
-
class
messages_pb2.
SmoothSparseThetaConfig
¶
Represents a configuration of a SmoothSparse Theta regularizer.
message SmoothSparseThetaConfig {
repeated string topic_name = 1;
repeated float alpha_iter = 2;
}
-
SmoothSparseThetaConfig.
topic_name
¶ A set of topic names that defines which topics in the model should be regularized. This value is optional, use an empty list to regularize all topics.
-
SmoothSparseThetaConfig.
alpha_iter
¶ A field of the same length as ModelConfig.inner_iterations_count that defines the relative regularization weight for every inner iteration. The actual regularization value is calculated as the product of alpha_iter[i] and ModelConfig.regularizer_tau.
To specify different regularization weights for different topics, create multiple regularizers with different topic_name sets, and use different values of ModelConfig.regularizer_tau.
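The per-iteration coefficient described for alpha_iter is a simple product, sketched below. The helper name theta_regularization_schedule and the example values are hypothetical and serve only to illustrate the arithmetic.

```python
def theta_regularization_schedule(alpha_iter, regularizer_tau, inner_iterations_count):
    """Return the effective regularization value for every inner iteration."""
    if len(alpha_iter) != inner_iterations_count:
        raise ValueError("alpha_iter must have inner_iterations_count elements")
    # Effective value on inner iteration i is alpha_iter[i] * regularizer_tau.
    return [alpha * regularizer_tau for alpha in alpha_iter]


# Example: a negative tau whose effect is halved on the first inner iteration.
schedule = theta_regularization_schedule([0.5, 1.0, 1.0], -2.0, 3)
```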
SmoothSparsePhiConfig¶
-
class
messages_pb2.
SmoothSparsePhiConfig
¶
Represents a configuration of a SmoothSparse Phi regularizer.
message SmoothSparsePhiConfig {
repeated string topic_name = 1;
repeated string class_id = 2;
optional string dictionary_name = 3;
}
-
SmoothSparsePhiConfig.
topic_name
¶ A set of topic names that defines which topics in the model should be regularized. This value is optional, use an empty list to regularize all topics.
-
SmoothSparsePhiConfig.
class_id
¶ This set defines which classes in the model should be regularized. This value is optional, use an empty list to regularize all classes.
-
SmoothSparsePhiConfig.
dictionary_name
¶ An optional value defining the name of the dictionary to use. The entries of the dictionary are expected to have DictionaryEntry.key_token, DictionaryEntry.class_id and DictionaryEntry.value fields. The actual regularization value will be calculated as the product of DictionaryEntry.value and ModelConfig.regularizer_tau.
This value is optional; if no dictionary is specified then all tokens will be regularized with the same weight.
DecorrelatorPhiConfig¶
-
class
messages_pb2.
DecorrelatorPhiConfig
¶
Represents a configuration of a Decorrelator Phi regularizer.
message DecorrelatorPhiConfig {
repeated string topic_name = 1;
repeated string class_id = 2;
}
-
DecorrelatorPhiConfig.
topic_name
¶ A set of topic names that defines which topics in the model should be regularized. This value is optional, use an empty list to regularize all topics.
-
DecorrelatorPhiConfig.
class_id
¶ This set defines which classes in the model should be regularized. This value is optional, use an empty list to regularize all classes.
LabelRegularizationPhiConfig¶
-
class
messages_pb2.
LabelRegularizationPhiConfig
¶
Represents a configuration of a Label Regularization Phi regularizer.
message LabelRegularizationPhiConfig {
repeated string topic_name = 1;
repeated string class_id = 2;
optional string dictionary_name = 3;
}
-
LabelRegularizationPhiConfig.
topic_name
¶ A set of topic names that defines which topics in the model should be regularized.
-
LabelRegularizationPhiConfig.
class_id
¶ This set defines which classes in the model should be regularized. This value is optional, use an empty list to regularize all classes.
-
LabelRegularizationPhiConfig.
dictionary_name
¶ An optional value defining the name of the dictionary to use.
RegularizerInternalState¶
-
class
messages_pb2.
RegularizerInternalState
¶
Represents an internal state of a general regularizer.
message RegularizerInternalState {
enum Type {
MultiLanguagePhi = 5;
}
optional string name = 1;
optional Type type = 2;
optional bytes data = 3;
}
DictionaryConfig¶
-
class
messages_pb2.
DictionaryConfig
¶
Represents a static dictionary.
message DictionaryConfig {
optional string name = 1;
repeated DictionaryEntry entry = 2;
optional int32 total_token_count = 3;
optional int32 total_items_count = 4;
}
-
DictionaryConfig.
name
¶ A value that defines the name of the dictionary. The name must be unique across all dictionaries defined in the master component.
-
DictionaryConfig.
entry
¶ A list of all entries of the dictionary.
-
DictionaryConfig.
total_token_count
¶ A sum of DictionaryEntry.token_count across all entries in this dictionary. The value is optional and might be missing when the entries in the dictionary do not carry the DictionaryEntry.token_count attribute.
-
DictionaryConfig.
total_items_count
¶ A sum of DictionaryEntry.items_count across all entries in this dictionary. The value is optional and might be missing when the entries in the dictionary do not carry the DictionaryEntry.items_count attribute.
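The two totals can be sketched as plain sums over the dictionary entries. In this illustration the entries are ordinary Python dicts, the helper dictionary_totals is hypothetical, and returning None when no entry carries the attribute is an assumption about how a missing value might be modelled.

```python
def dictionary_totals(entries):
    """Return (total_token_count, total_items_count) for a list of entry dicts."""
    def total(key):
        values = [entry[key] for entry in entries if key in entry]
        # Assumption: the total is left unset when no entry carries the attribute.
        return sum(values) if values else None

    return total("token_count"), total("items_count")


totals = dictionary_totals([
    {"key_token": "cat", "token_count": 7, "items_count": 3},
    {"key_token": "dog", "token_count": 5, "items_count": 4},
])
missing = dictionary_totals([{"key_token": "cat", "value": 0.5}])
```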
DictionaryEntry¶
-
class
messages_pb2.
DictionaryEntry
¶
Represents one entry in a static dictionary.
message DictionaryEntry {
optional string key_token = 1;
optional string class_id = 2;
optional float value = 3;
repeated string value_tokens = 4;
optional FloatArray values = 5;
optional int32 token_count = 6;
optional int32 items_count = 7;
}
-
DictionaryEntry.
key_token
¶ A token that defines the key of the entry.
-
DictionaryEntry.
class_id
¶ The class of the
DictionaryEntry.key_token
.
-
DictionaryEntry.
value
¶ An optional generic value, associated with the entry. The meaning of this value depends on the usage of the dictionary.
-
DictionaryEntry.
token_count
¶ An optional value, indicating the overall number of token occurrences in some collection.
-
DictionaryEntry.
items_count
¶ An optional value, indicating the overall number of documents containing the token.
ScoreConfig¶
-
class
messages_pb2.
ScoreConfig
¶
Represents a configuration of a general score.
message ScoreConfig {
enum Type {
Perplexity = 0;
SparsityTheta = 1;
SparsityPhi = 2;
ItemsProcessed = 3;
TopTokens = 4;
ThetaSnippet = 5;
TopicKernel = 6;
}
optional string name = 1;
optional Type type = 2;
optional bytes config = 3;
}
-
ScoreConfig.
name
¶ A value that defines the name of the score. The name must be unique across all names defined in the master component.
-
ScoreConfig.
type
¶ A value that defines the type of the score.
Perplexity
Defines a config of the Perplexity score
SparsityTheta
Defines a config of the SparsityTheta score
SparsityPhi
Defines a config of the SparsityPhi score
ItemsProcessed
Defines a config of the ItemsProcessed score
TopTokens
Defines a config of the TopTokens score
ThetaSnippet
Defines a config of the ThetaSnippet score
TopicKernel
Defines a config of the TopicKernel score
-
ScoreConfig.
config
¶ A serialized protobuf message that describes score config for the specific score type.
ScoreData¶
-
class
messages_pb2.
ScoreData
¶
Represents a general result of score calculation.
message ScoreData {
enum Type {
Perplexity = 0;
SparsityTheta = 1;
SparsityPhi = 2;
ItemsProcessed = 3;
TopTokens = 4;
ThetaSnippet = 5;
TopicKernel = 6;
}
optional string name = 1;
optional Type type = 2;
optional bytes data = 3;
}
-
ScoreData.
name
¶ A value that describes the name of the score. This name will match the name of the corresponding score config.
-
ScoreData.
type
¶ A value that defines the type of the score.
Perplexity
Defines a Perplexity score data
SparsityTheta
Defines a SparsityTheta score data
SparsityPhi
Defines a SparsityPhi score data
ItemsProcessed
Defines an ItemsProcessed score data
TopTokens
Defines a TopTokens score data
ThetaSnippet
Defines a ThetaSnippet score data
TopicKernel
Defines a TopicKernel score data
-
ScoreData.
data
¶ A serialized protobuf message that provides the specific score result.
PerplexityScoreConfig¶
-
class
messages_pb2.
PerplexityScoreConfig
¶
Represents a configuration of a perplexity score.
message PerplexityScoreConfig {
enum Type {
UnigramDocumentModel = 0;
UnigramCollectionModel = 1;
}
optional string field_name = 1 [default = "@body"]; // obsolete in BigARTM v0.5.8
optional string stream_name = 2 [default = "@global"];
optional Type model_type = 3 [default = UnigramDocumentModel];
optional string dictionary_name = 4;
optional float theta_sparsity_eps = 5 [default = 1e-37];
repeated string theta_sparsity_topic_name = 6;
}
-
PerplexityScoreConfig.
field_name
¶ Obsolete in BigARTM v0.5.8
-
PerplexityScoreConfig.
stream_name
¶ A value that defines which stream should be used in perplexity calculation.
PerplexityScore¶
-
class
messages_pb2.
PerplexityScore
¶
Represents a result of calculation of a perplexity score.
message PerplexityScore {
optional double value = 1;
optional double raw = 2;
optional double normalizer = 3;
optional int32 zero_words = 4;
optional double theta_sparsity_value = 5;
optional int32 theta_sparsity_zero_topics = 6;
optional int32 theta_sparsity_total_topics = 7;
}
-
PerplexityScore.
value
¶ A perplexity value which is calculated as exp(-raw/normalizer).
-
PerplexityScore.
raw
¶ A numerator of perplexity calculation. This value is equal to the log-likelihood of the topic model.
-
PerplexityScore.
normalizer
¶ A denominator of perplexity calculation. This value is equal to the total number of tokens in all processed items.
-
PerplexityScore.
zero_words
¶ A number of tokens that have zero probability p(w|t,d) in a document. Such tokens are evaluated based on the unigram document model or the unigram collection model.
-
PerplexityScore.
theta_sparsity_value
¶ A fraction of zero entries in the theta matrix.
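The relation between value, raw and normalizer can be checked with a small numeric sketch. The function name perplexity and the toy numbers are assumptions for the example; the formula exp(-raw/normalizer) is the one stated for PerplexityScore.value, which implies that raw is on a logarithmic scale.

```python
import math


def perplexity(raw, normalizer):
    """Compute exp(-raw / normalizer), the formula given for PerplexityScore.value."""
    return math.exp(-raw / normalizer)


# Toy case: 50 processed tokens, each assigned probability 1/100 by the model.
normalizer = 50
raw = normalizer * math.log(1.0 / 100)
toy_value = perplexity(raw, normalizer)  # approximately 100
```

In this toy case the perplexity matches the size of a vocabulary over which the model is uniform, which is the usual way to read the score.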
SparsityThetaScoreConfig¶
-
class
messages_pb2.
SparsityThetaScoreConfig
¶
Represents a configuration of a theta sparsity score.
message SparsityThetaScoreConfig {
optional string field_name = 1 [default = "@body"]; // obsolete in BigARTM v0.5.8
optional string stream_name = 2 [default = "@global"];
optional float eps = 3 [default = 1e-37];
repeated string topic_name = 4;
}
-
SparsityThetaScoreConfig.
field_name
¶ Obsolete in BigARTM v0.5.8
-
SparsityThetaScoreConfig.
stream_name
¶ A value that defines which stream should be used in theta sparsity calculation.
-
SparsityThetaScoreConfig.
eps
¶ A small value that defines zero threshold for theta probabilities. Theta values below the threshold will be counted as zeros when calculating theta sparsity score.
-
SparsityThetaScoreConfig.
topic_name
¶ A set of topic names that defines which topics should be used for score calculation. The names correspond to
ModelConfig.topic_name
. This value is optional, use an empty list to calculate the score for all topics.
SparsityThetaScore¶
-
class
messages_pb2.
SparsityThetaScore
Represents a result of calculation of a theta sparsity score.
message SparsityThetaScore {
optional double value = 1;
optional int32 zero_topics = 2;
optional int32 total_topics = 3;
}
-
SparsityThetaScore.
value
¶ A value of theta sparsity that is calculated as zero_topics / total_topics.
-
SparsityThetaScore.
zero_topics
¶ A numerator of theta sparsity score. A number of topics that have zero probability in a topic-item distribution.
-
SparsityThetaScore.
total_topics
¶ A denominator of theta sparsity score. A total number of topics in the topic-item distributions that are used in theta sparsity calculation.
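A minimal sketch of how the three fields relate, assuming the theta matrix is given as one list of topic probabilities per item. The helper sparsity_theta is hypothetical; the eps default mirrors SparsityThetaScoreConfig.eps.

```python
def sparsity_theta(theta_rows, eps=1e-37):
    """Return (value, zero_topics, total_topics) for a list of per-item theta vectors."""
    # Entries below the threshold are counted as zeros.
    zero_topics = sum(1 for row in theta_rows for p in row if p < eps)
    total_topics = sum(len(row) for row in theta_rows)
    value = zero_topics / total_topics if total_topics else 0.0
    return value, zero_topics, total_topics


theta_sparsity = sparsity_theta([[0.0, 1.0], [0.5, 0.5]])
```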
SparsityPhiScoreConfig¶
-
class
messages_pb2.
SparsityPhiScoreConfig
¶
Represents a configuration of a sparsity phi score.
message SparsityPhiScoreConfig {
optional float eps = 1 [default = 1e-37];
optional string class_id = 2;
repeated string topic_name = 3;
}
-
SparsityPhiScoreConfig.
eps
¶ A small value that defines zero threshold for phi probabilities. Phi values below the threshold will be counted as zeros when calculating phi sparsity score.
-
SparsityPhiScoreConfig.
class_id
¶ A value that defines the class of tokens to use for score calculation. This value corresponds to
ModelConfig.class_id
field. This value is optional. By default the score will be calculated for the default class (‘@default_class’).
-
SparsityPhiScoreConfig.
topic_name
¶ A set of topic names that defines which topics should be used for score calculation. This value is optional, use an empty list to calculate the score for all topics.
SparsityPhiScore¶
-
class
messages_pb2.
SparsityPhiScore
¶
Represents a result of calculation of a phi sparsity score.
message SparsityPhiScore {
optional double value = 1;
optional int32 zero_tokens = 2;
optional int32 total_tokens = 3;
}
-
SparsityPhiScore.
value
¶ A value of phi sparsity that is calculated as zero_tokens / total_tokens.
-
SparsityPhiScore.
zero_tokens
¶ A numerator of phi sparsity score. A number of tokens that have zero probability in a token-topic distribution.
-
SparsityPhiScore.
total_tokens
¶ A denominator of phi sparsity score. A total number of tokens in the token-topic distributions that are used in phi sparsity calculation.
ItemsProcessedScoreConfig¶
-
class
messages_pb2.
ItemsProcessedScoreConfig
¶
Represents a configuration of an items processed score.
message ItemsProcessedScoreConfig {
optional string field_name = 1 [default = "@body"]; // obsolete in BigARTM v0.5.8
optional string stream_name = 2 [default = "@global"];
}
-
ItemsProcessedScoreConfig.
field_name
¶ Obsolete in BigARTM v0.5.8
-
ItemsProcessedScoreConfig.
stream_name
¶ A value that defines which stream should be used in calculation of processed items.
ItemsProcessedScore¶
-
class
messages_pb2.
ItemsProcessedScore
¶
Represents a result of calculation of an items processed score.
message ItemsProcessedScore {
optional int32 value = 1;
}
-
ItemsProcessedScore.
value
¶ A number of items that belong to the stream
ItemsProcessedScoreConfig.stream_name
and have been processed during iterations. Currently this number is aggregated throughout all iterations.
TopTokensScoreConfig¶
-
class
messages_pb2.
TopTokensScoreConfig
¶
Represents a configuration of a top tokens score.
message TopTokensScoreConfig {
optional int32 num_tokens = 1 [default = 10];
optional string class_id = 2;
repeated string topic_name = 3;
}
-
TopTokensScoreConfig.
num_tokens
¶ A value that defines how many top tokens should be retrieved for each topic.
-
TopTokensScoreConfig.
class_id
¶ A value that defines for which class of the model to collect top tokens. This value corresponds to the ModelConfig.class_id field.
This parameter is optional. By default tokens will be retrieved for the default class (‘@default_class’).
-
TopTokensScoreConfig.
topic_name
¶ A set of values that represent the names of the topics to include in the result. The names correspond to ModelConfig.topic_name.
This parameter is optional. By default top tokens will be calculated for all topics in the model.
TopTokensScore¶
-
class
messages_pb2.
TopTokensScore
¶
Represents a result of calculation of a top tokens score.
message TopTokensScore {
optional int32 num_entries = 1;
repeated string topic_name = 2;
repeated int32 topic_index = 3;
repeated string token = 4;
repeated float weight = 5;
}
The data in this score is represented in a table-like format, sorted on topic_index. The following code block gives a typical usage example. The loop below is guaranteed to process all top-N tokens for the first topic, then for the second topic, etc.
for (int i = 0; i < top_tokens_score.num_entries(); i++) {
  // Gives an index from 0 to (model_config.topics_size() - 1)
  int topic_index = top_tokens_score.topic_index(i);
  // Gives one of the top-N tokens for topic 'topic_index'
  std::string token = top_tokens_score.token(i);
  // Gives the weight of the token
  float weight = top_tokens_score.weight(i);
}
-
TopTokensScore.
num_entries
¶ A value indicating the overall number of entries in the score. All the remaining repeated fields in this score will have this length.
-
TopTokensScore.
token
¶ A repeated field of
num_entries
elements, containing tokens with high probability.
-
TopTokensScore.
weight
¶ A repeated field of
num_entries
elements, containing the p(t|w) probabilities.
-
TopTokensScore.
topic_index
¶ A repeated field of
num_entries
elements, containing integers between 0 and (ModelConfig.topics_count
- 1).
-
TopTokensScore.
topic_name
¶ A repeated field of num_entries elements, corresponding to the values of the ModelConfig.topic_name field.
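Because the repeated fields are parallel columns of one table, they can be regrouped per topic with a single pass. The sketch below uses plain Python lists in place of the protobuf repeated fields; group_top_tokens and the sample data are hypothetical.

```python
def group_top_tokens(topic_name, token, weight):
    """Regroup the parallel columns of a top tokens table by topic name."""
    grouped = {}
    for name, tok, w in zip(topic_name, token, weight):
        # Rows arrive sorted on topic_index, so each list keeps the table order.
        grouped.setdefault(name, []).append((tok, w))
    return grouped


top_tokens = group_top_tokens(
    ["topic_a", "topic_a", "topic_b"],
    ["cat", "dog", "stock"],
    [0.25, 0.125, 0.5],
)
```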
ThetaSnippetScoreConfig¶
-
class
messages_pb2.
ThetaSnippetScoreConfig
¶
Represents a configuration of a theta snippet score.
message ThetaSnippetScoreConfig {
optional string field_name = 1 [default = "@body"]; // obsolete in BigARTM v0.5.8
optional string stream_name = 2 [default = "@global"];
repeated int32 item_id = 3 [packed = true]; // obsolete in BigARTM v0.5.8
optional int32 item_count = 4 [default = 10];
}
-
ThetaSnippetScoreConfig.
field_name
¶ Obsolete in BigARTM v0.5.8
-
ThetaSnippetScoreConfig.
stream_name
¶ A value that defines which stream should be used in calculation of a theta snippet.
-
ThetaSnippetScoreConfig.
item_id
¶ Obsolete in BigARTM v0.5.8.
-
ThetaSnippetScoreConfig.
item_count
¶ The number of items to retrieve. ThetaSnippetScore will select last item_count processed items and return their theta vectors.
ThetaSnippetScore¶
-
class
messages_pb2.
ThetaSnippetScore
¶
Represents a result of calculation of a theta snippet score.
message ThetaSnippetScore {
repeated int32 item_id = 1;
repeated FloatArray values = 2;
}
-
ThetaSnippetScore.
item_id
¶ A set of item ids for which the theta snippet has been calculated. Items are identified by the item id.
-
ThetaSnippetScore.
values
¶ A set of values that define topic probabilities for each item. The length of these repeated values will match the number of item ids specified in
ThetaSnippetScore.item_id
. Each repeated field contains float array of topic probabilities in the natural order of topic ids.
TopicKernelScoreConfig¶
-
class
messages_pb2.
TopicKernelScoreConfig
¶
Represents a configuration of a topic kernel score.
message TopicKernelScoreConfig {
optional float eps = 1 [default = 1e-37];
optional string class_id = 2;
repeated string topic_name = 3;
optional double probability_mass_threshold = 4 [default = 0.1];
}
- Kernel of a topic model is defined as the list of all tokens such that the probability p(t | w) exceeds the probability mass threshold.
- Kernel size of a topic t is defined as the number of tokens in its kernel.
- Topic purity of a topic t is defined as the sum of p(w | t) across all tokens w in the kernel.
- Topic contrast of a topic t is defined as the sum of p(t | w) across all tokens w in the kernel divided by the size of the kernel.
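The kernel definitions above can be sketched as follows, assuming both p(w | t) and p(t | w) are already available as nested lists indexed by token and then by topic. The helper topic_kernel and the toy matrices are hypothetical, and returning a contrast of 0.0 for an empty kernel is an assumption of this sketch.

```python
def topic_kernel(p_wt, p_tw, topic, threshold=0.1):
    """Return (kernel size, purity, contrast) of one topic.

    p_wt[w][t] holds p(w | t) and p_tw[w][t] holds p(t | w).
    """
    # Kernel: tokens whose p(t | w) exceeds the probability mass threshold.
    kernel = [w for w in range(len(p_tw)) if p_tw[w][topic] > threshold]
    size = len(kernel)
    purity = sum(p_wt[w][topic] for w in kernel)
    contrast = sum(p_tw[w][topic] for w in kernel) / size if size else 0.0
    return size, purity, contrast


# Toy model with 3 tokens and 2 topics.
p_wt = [[0.7, 0.0], [0.3, 0.2], [0.0, 0.8]]   # each column sums to 1
p_tw = [[1.0, 0.0], [0.6, 0.4], [0.0, 1.0]]   # each row sums to 1
size, purity, contrast = topic_kernel(p_wt, p_tw, topic=0, threshold=0.5)
```

Here the first two tokens form the kernel of topic 0, so its size is 2, its purity is about 1.0 and its contrast is about 0.8.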
-
TopicKernelScoreConfig.
eps
¶ Defines the minimum threshold on kernel size. In most cases this parameter should be kept at the default value.
-
TopicKernelScoreConfig.
class_id
¶ A value that defines the class of tokens to use for score calculation. This value corresponds to
ModelConfig.class_id
field. This value is optional. By default the score will be calculated for the default class (‘@default_class’).
-
TopicKernelScoreConfig.
topic_name
¶ A set of topic names that defines which topics should be used for score calculation. This value is optional, use an empty list to calculate the score for all topics.
-
TopicKernelScoreConfig.
probability_mass_threshold
¶ Defines the probability mass threshold (see the definition of kernel above).
TopicKernelScore¶
-
class
messages_pb2.
TopicKernelScore
¶
Represents a result of calculation of a topic kernel score.
message TopicKernelScore {
optional DoubleArray kernel_size = 1;
optional DoubleArray kernel_purity = 2;
optional DoubleArray kernel_contrast = 3;
optional double average_kernel_size = 4;
optional double average_kernel_purity = 5;
optional double average_kernel_contrast = 6;
}
-
TopicKernelScore.
kernel_size
¶ Provides the kernel size for all requested topics. The length of this DoubleArray is always equal to the overall number of topics. The values of
-1
correspond to non-calculated topics. The remaining values carry the kernel size of the requested topics.
-
TopicKernelScore.
kernel_purity
¶ Provides the kernel purity for all requested topics. The length of this DoubleArray is always equal to the overall number of topics. The values of
-1
correspond to non-calculated topics. The remaining values carry the kernel purity of the requested topics.
-
TopicKernelScore.
kernel_contrast
¶ Provides the kernel contrast for all requested topics. The length of this DoubleArray is always equal to the overall number of topics. The values of
-1
correspond to non-calculated topics. The remaining values carry the kernel contrast of the requested topics.
-
TopicKernelScore.
average_kernel_size
¶ Provides the average kernel size across all the requested topics.
-
TopicKernelScore.
average_kernel_purity
¶ Provides the average kernel purity across all the requested topics.
-
TopicKernelScore.
average_kernel_contrast
¶ Provides the average kernel contrast across all the requested topics.
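Because the per-topic arrays use -1 as a placeholder for non-calculated topics, client code usually has to filter those entries out before using the values. The sketch below works on a plain Python list standing in for the `value` field of a DoubleArray; the helper names are hypothetical, and treating the average as the mean over the requested topics only is an assumption about how the `average_*` fields relate to the arrays.

```python
def requested_topic_values(per_topic_values):
    """Map topic index -> value, skipping the -1 placeholders."""
    return {i: v for i, v in enumerate(per_topic_values) if v != -1}


def average_over_requested(per_topic_values):
    """Mean over requested topics only (assumed semantics, see lead-in)."""
    values = list(requested_topic_values(per_topic_values).values())
    return sum(values) / len(values) if values else 0.0


kernel_size = [-1, 10.0, -1, 20.0]   # topics 1 and 3 were requested
selected = requested_topic_values(kernel_size)
average = average_over_requested(kernel_size)
```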
TopicModel¶
-
class
messages_pb2.
TopicModel
¶
Represents a topic model.
This message can contain data in either dense or sparse format.
The key idea behind sparse format is to avoid storing zero p(w|t)
elements of the Phi matrix.
Please refer to the description of TopicModel.topic_index
field for more details.
To distinguish between these two formats
check whether repeated field TopicModel.topic_index
is empty.
An empty field indicates a dense format,
otherwise the message contains data in a sparse format.
To request a topic model in a sparse format, set
GetTopicModelArgs.use_sparse_format
field to True
when calling ArtmRequestTopicModel()
.
message TopicModel {
enum OperationType {
Initialize = 0;
Increment = 1;
Overwrite = 2;
Remove = 3;
Ignore = 4;
}
optional string name = 1 [default = "@model"];
optional int32 topics_count = 2;
repeated string topic_name = 3;
repeated string token = 4;
repeated FloatArray token_weights = 5;
repeated string class_id = 6;
message TopicModelInternals {
repeated FloatArray n_wt = 1;
repeated FloatArray r_wt = 2;
}
optional bytes internals = 7; // obsolete in BigARTM v0.6.3
repeated IntArray topic_index = 8;
repeated OperationType operation_type = 9;
}
-
TopicModel.
name
¶ A value that describes the name of the topic model (
TopicModel.name
).
-
TopicModel.
topics_count
¶ A value that describes the number of topics in this message.
-
TopicModel.
topic_name
¶ A value that describes the names of the topics included in the given TopicModel message. These values represent a subset of topics, defined by the
GetTopicModelArgs.topic_name
field. When GetTopicModelArgs.topic_name
is empty, these values correspond to the entire set of topics, defined in the ModelConfig.topic_name
field.
-
TopicModel.
token
¶ The set of all tokens, included in the topic model.
-
TopicModel.
token_weights
¶ A set of token weights. The length of this repeated field will match the length of the repeated field
TopicModel.token
. The length of each FloatArray will match the TopicModel.topics_count
field (in dense representation), or the length of the corresponding IntArray from the TopicModel.topic_index
field (in sparse representation).
-
TopicModel.
class_id
¶ A set of values that specify the class (modality) of the tokens. The length of this repeated field will match the length of the repeated field
TopicModel.token
.
-
TopicModel.
internals
¶ Obsolete in BigARTM v0.6.3.
-
TopicModel.
topic_index
¶ A repeated field used for sparse topic model representation. This field has the same length as
TopicModel.token
, TopicModel.class_id
and TopicModel.token_weights
. Each element in topic_index is an instance of IntArray message, containing a list of zero-based values smaller than the length of the TopicModel.topic_name
field. These values correspond to the indices in the TopicModel.topic_name
array, and tell which topics have non-zero p(w|t)
probabilities for a given token. The actual p(w|t)
values can be found in the TopicModel.token_weights
field. The length of each IntArray message in the TopicModel.topic_index
field equals the length of the corresponding FloatArray message in the TopicModel.token_weights
field.
Warning
Be careful with
TopicModel.topic_index
when this message represents a subset of topics, defined by GetTopicModelArgs.topic_name
. In this case indices correspond to the selected subset of topics, which might not correspond to topic indices in the original ModelConfig message.
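The relationship between the sparse and dense layouts can be illustrated by expanding a sparse message back into dense rows. The sketch below uses plain Python lists in place of the repeated FloatArray and IntArray fields; `sparse_to_dense` is a hypothetical helper written for this example and is not part of BigARTM.

```python
def sparse_to_dense(token_weights, topic_index, topics_count):
    """Expand sparse rows into dense rows of length topics_count.

    token_weights[i] and topic_index[i] are parallel lists for token i:
    topic_index[i][k] is the topic that token_weights[i][k] belongs to.
    """
    dense = []
    for weights, indices in zip(token_weights, topic_index):
        if len(weights) != len(indices):
            raise ValueError("weights and indices must have equal length")
        row = [0.0] * topics_count      # omitted entries are zeros
        for index, weight in zip(indices, weights):
            row[index] = weight
        dense.append(row)
    return dense


dense = sparse_to_dense(
    token_weights=[[0.5, 0.25], [1.0]],
    topic_index=[[0, 2], [1]],
    topics_count=3,
)
```

The same expansion applies to a sparse ThetaMatrix, with `item_weights` taking the place of `token_weights`.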
-
TopicModel.
operation_type
¶ A set of values that define the operation to perform on each token when the topic model is used as an argument of
ArtmOverwriteTopicModel()
.
Initialize
Indicates that a new token should be added to the topic model. The initial n_wt counter will be initialized with a random value from the [0, 1] range. TopicModel.token_weights is ignored. This operation is ignored if the token already exists.
Increment
Indicates that the n_wt counter of the token should be increased by the values specified in the TopicModel.token_weights field. A new token will be created if it does not exist yet.
Overwrite
Indicates that the n_wt counter of the token should be set to the value specified in the TopicModel.token_weights field. A new token will be created if it does not exist yet.
Remove
Indicates that the token should be removed from the topic model. TopicModel.token_weights is ignored.
Ignore
Indicates no operation for the token. The effect is the same as if the token is not present in this message.
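The semantics of the five operation types can be mimicked on a toy model stored as a dictionary from token to a list of n_wt counters. This is a sketch of the behaviour described above, not the library's code: the function name, the dictionary layout, and the choice to start a newly created token from zeros under Increment are assumptions.

```python
import random

# Mirror the enum values of TopicModel.OperationType.
INITIALIZE, INCREMENT, OVERWRITE, REMOVE, IGNORE = range(5)


def apply_operations(n_wt, tokens, token_weights, operation_types,
                     topics_count, rng=random):
    """Apply one operation per token to the toy model n_wt (modified in place)."""
    for token, weights, op in zip(tokens, token_weights, operation_types):
        if op == INITIALIZE:
            # Ignored when the token already exists; weights are not used.
            if token not in n_wt:
                n_wt[token] = [rng.random() for _ in range(topics_count)]
        elif op == INCREMENT:
            # Assumption: a token created by Increment starts from zeros.
            current = n_wt.get(token, [0.0] * topics_count)
            n_wt[token] = [c + w for c, w in zip(current, weights)]
        elif op == OVERWRITE:
            n_wt[token] = list(weights)
        elif op == REMOVE:
            n_wt.pop(token, None)
        # IGNORE: leave the model untouched for this token.
    return n_wt


model = {"cat": [1.0, 2.0], "dog": [5.0, 5.0]}
apply_operations(
    model,
    tokens=["cat", "dog", "fish", "bird"],
    token_weights=[[0.5, 0.5], [], [3.0, 4.0], []],
    operation_types=[INCREMENT, REMOVE, OVERWRITE, INITIALIZE],
    topics_count=2,
)
```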
ThetaMatrix¶
-
class
messages_pb2.
ThetaMatrix
¶
Represents a theta matrix.
This message can contain data in either dense or sparse format.
The key idea behind sparse format is to avoid storing zero p(t|d)
elements of the Theta matrix.
Sparse representation of Theta matrix is equivalent to sparse representation
of Phi matrix. Please refer to TopicModel for a detailed description of the sparse format.
message ThetaMatrix {
optional string model_name = 1 [default = "@model"];
repeated int32 item_id = 2;
repeated FloatArray item_weights = 3;
repeated string topic_name = 4;
optional int32 topics_count = 5;
repeated string item_title = 6;
repeated IntArray topic_index = 7;
}
-
ThetaMatrix.
model_name
¶ A value that describes the name of the topic model. This name will match the name of the corresponding model config.
-
ThetaMatrix.
item_weights
¶ A set of item ID weights. The length of this repeated field will match the length of the repeated field
ThetaMatrix.item_id
. The length of each FloatArray will match the ThetaMatrix.topics_count
field (in dense representation), or the length of the corresponding IntArray from the ThetaMatrix.topic_index
field (in sparse representation).
-
ThetaMatrix.
topic_name
¶ A value that describes the names of the topics included in the given ThetaMatrix message. These values represent a subset of topics, defined by the
GetThetaMatrixArgs.topic_name
field. When GetThetaMatrixArgs.topic_name
is empty, these values correspond to the entire set of topics, defined in the ModelConfig.topic_name
field.
-
ThetaMatrix.
topics_count
¶ A value that describes the number of topics in this message.
-
ThetaMatrix.
item_title
¶ A set of item titles, corresponding to
Item.title
values. Beware that this field might be empty (i.e. of zero length) if none of the items had a title specified in Item.title
.
-
ThetaMatrix.
topic_index
¶ A repeated field used for sparse theta matrix representation. This field has the same length as
ThetaMatrix.item_id
, ThetaMatrix.item_weights
and ThetaMatrix.item_title
. Each element in topic_index is an instance of IntArray message, containing a list of zero-based values smaller than the length of the ThetaMatrix.topic_name
field. These values correspond to the indices in the ThetaMatrix.topic_name
array, and tell which topics have non-zero p(t|d)
probabilities for a given item. The actual p(t|d)
values can be found in the ThetaMatrix.item_weights
field. The length of each IntArray message in the ThetaMatrix.topic_index
field equals the length of the corresponding FloatArray message in the ThetaMatrix.item_weights
field.
Warning
Be careful with
ThetaMatrix.topic_index
when this message represents a subset of topics, defined by GetThetaMatrixArgs.topic_name
. In this case indices correspond to the selected subset of topics, which might not correspond to topic indices in the original ModelConfig message.
CollectionParserConfig¶
-
class
messages_pb2.
CollectionParserConfig
¶
Represents a configuration of a collection parser.
message CollectionParserConfig {
enum Format {
BagOfWordsUci = 0;
MatrixMarket = 1;
}
optional Format format = 1 [default = BagOfWordsUci];
optional string docword_file_path = 2;
optional string vocab_file_path = 3;
optional string target_folder = 4;
optional string dictionary_file_name = 5;
optional int32 num_items_per_batch = 6 [default = 1000];
optional string cooccurrence_file_name = 7;
repeated string cooccurrence_token = 8;
optional bool use_unity_based_indices = 9 [default = true];
}
-
CollectionParserConfig.
format
¶ A value that defines the format of a collection to be parsed.
BagOfWordsUci
A bag-of-words collection, stored in UCI format. UCI format must have two files - vocab.*.txt and docword.*.txt, defined by docword_file_path and vocab_file_path.
The format of the docword.*.txt file is 3 header lines, followed by NNZ triples:
D
W
NNZ
docID wordID count
docID wordID count
...
docID wordID count
The file must be sorted on docID. Values of wordID must be unity-based (not zero-based).
The format of the vocab.*.txt file is: line n contains the token with wordID=n. Note that words must not have spaces or tabs.
In the vocab.*.txt file it is also possible to specify Batch.class_id for tokens, as shown in this example:
token1 @default_class
token2 custom_class
token3 @default_class
token4
Use a space or tab to separate a token from its class. Tokens that are not followed by a class label automatically get ‘@default_class’ as a label (see ‘token4’ in the example).
MatrixMarket
See the description at http://math.nist.gov/MatrixMarket/formats.html
In this mode the parameter docword_file_path must refer to a file in Matrix Market format. The parameter vocab_file_path is also required and must refer to a dictionary file exported in gensim format (dictionary.save_as_text()).
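To make the UCI layout concrete, the following sketch reads both files from in-memory lines. It is independent of BigARTM's own parser: the function names `parse_vocab` and `parse_docword` and the returned data structures are assumptions chosen for illustration, and word ids are converted from unity-based to zero-based on the way in.

```python
DEFAULT_CLASS = "@default_class"


def parse_vocab(lines):
    """Return a list of (token, class_id); line n describes wordID = n."""
    vocab = []
    for line in lines:
        parts = line.split()            # splits on spaces and tabs
        if not parts:
            continue
        class_id = parts[1] if len(parts) > 1 else DEFAULT_CLASS
        vocab.append((parts[0], class_id))
    return vocab


def parse_docword(lines):
    """Return (D, W, NNZ, docs) where docs[docID][zero_based_wordID] = count."""
    rows = [line.split() for line in lines if line.strip()]
    num_docs, vocab_size, nnz = (int(row[0]) for row in rows[:3])
    docs = {}
    for doc_id, word_id, count in rows[3:]:
        # wordID is unity-based in the file; shift it to index into vocab.
        docs.setdefault(int(doc_id), {})[int(word_id) - 1] = int(count)
    return num_docs, vocab_size, nnz, docs


vocab = parse_vocab(["token1 @default_class", "token2\tcustom_class", "token4"])
parsed = parse_docword("2\n3\n3\n1 1 2\n1 3 1\n2 2 5\n".splitlines())
```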
-
CollectionParserConfig.
docword_file_path
¶ A value that defines the disk location of a
docword.*.txt
file (the bag of words file in sparse format).
-
CollectionParserConfig.
vocab_file_path
¶ A value that defines the disk location of a
vocab.*.txt
file (the file with the vocabulary of the collection).
-
CollectionParserConfig.
target_folder
¶ A value that defines the disk location where to store all the results after parsing the collection. Usually the resulting location will contain a set of batches, and a DictionaryConfig that contains all unique tokens that occurred in the collection. Such a location can be further passed to MasterComponent via
MasterComponentConfig.disk_path
.
-
CollectionParserConfig.
dictionary_file_name
¶ A file name where to save the DictionaryConfig message that contains all unique tokens that occurred in the collection. The file will be created in
target_folder
.
This parameter is optional. The dictionary is still collected when this parameter is not provided, but it is only returned as the result of ArtmRequestParseCollection and is not stored to disk.
In the resulting dictionary each entry will have the following fields:
- DictionaryEntry.key_token - the textual representation of the token,
- DictionaryEntry.class_id - the label of the default class (“@DefaultClass”),
- DictionaryEntry.token_count - the overall number of occurrences of the token in the collection,
- DictionaryEntry.items_count - the number of documents in the collection containing the token,
- DictionaryEntry.value - the ratio between token_count and total_token_count.
Use ArtmRequestLoadDictionary method to load the resulting dictionary.
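The per-entry statistics listed above can be reproduced on a toy collection. The sketch below is not the collection parser itself: documents are modelled as plain lists of tokens, entries as dictionaries, and `total_token_count` is assumed to mean the total number of token occurrences in the whole collection.

```python
from collections import Counter


def build_dictionary_entries(documents, class_id="@DefaultClass"):
    """Compute token_count, items_count and value for every unique token."""
    token_count = Counter()
    items_count = Counter()
    for doc in documents:
        token_count.update(doc)         # every occurrence counts
        items_count.update(set(doc))    # each document counts once per token
    total_token_count = sum(token_count.values())
    return {
        token: {
            "key_token": token,
            "class_id": class_id,
            "token_count": token_count[token],
            "items_count": items_count[token],
            "value": token_count[token] / total_token_count,
        }
        for token in sorted(token_count)
    }


entries = build_dictionary_entries([["a", "b", "a"], ["b"]])
```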
-
CollectionParserConfig.
num_items_per_batch
¶ A value indicating the desired number of items per batch.
-
CollectionParserConfig.
cooccurrence_file_name
¶ A file name where to save the DictionaryConfig message that contains information about co-occurrence of all pairs of tokens in the collection. The file will be created in
target_folder
.
This parameter is optional. No cooccurrence information will be collected if the file name is not provided.
In the resulting dictionary each entry will correspond to two tokens (‘<first>’ and ‘<second>’), and carry the information about co-occurrence of these tokens in the collection.
- DictionaryEntry.key_token - a string of the form ‘<first>~<second>’, produced by concatenating the two tokens with the tilde symbol (‘~’). The <first> token is guaranteed to be lexicographically less than the <second> token.
- DictionaryEntry.class_id - the label of the default class (“@DefaultClass”).
- DictionaryEntry.items_count - the number of documents in the collection containing both tokens (‘<first>’ and ‘<second>’).
Use ArtmRequestLoadDictionary method to load the resulting dictionary.
-
CollectionParserConfig.
cooccurrence_token
¶ A list of tokens to collect cooccurrence information. A cooccurrence of the pair <first>~<second> will be collected only when both tokens are present in
CollectionParserConfig.cooccurrence_token
.
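A short sketch of how co-occurrence keys of the form ‘<first>~<second>’ and their document counts could be derived, restricted to an explicit token list. It is illustrative only: the function name and the list-of-tokens document model are assumptions, and the lexicographic ordering of each pair is obtained by sorting before pairing.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_items_count(documents, cooccurrence_tokens):
    """Count, per token pair, the documents that contain both tokens."""
    allowed = set(cooccurrence_tokens)
    counts = Counter()
    for doc in documents:
        # Keep only listed tokens; sorting guarantees first < second.
        present = sorted(set(doc) & allowed)
        for first, second in combinations(present, 2):
            counts[first + "~" + second] += 1
    return dict(counts)


pairs = cooccurrence_items_count(
    documents=[["b", "a", "c"], ["a", "b"], ["c", "d"]],
    cooccurrence_tokens=["a", "b", "c"],
)
```

The pair involving "d" is never produced because "d" is absent from the token list.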
-
CollectionParserConfig.
use_unity_based_indices
¶ A flag indicating whether to interpret indices in the docword file as unity-based or as zero-based. By default use_unity_based_indices = True, as required by the UCI bag-of-words format.
SynchronizeModelArgs¶
-
class
messages_pb2.
SynchronizeModelArgs
¶
Represents an argument of synchronize model operation.
message SynchronizeModelArgs {
optional string model_name = 1;
optional float decay_weight = 2 [default = 0.0];
optional bool invoke_regularizers = 3 [default = true];
optional float apply_weight = 4 [default = 1.0];
}
-
SynchronizeModelArgs.
model_name
¶ The name of the model to be synchronized. This value is optional. When not set, all models will be synchronized with the same decay weight.
-
SynchronizeModelArgs.
decay_weight
¶ The decay weight and
apply_weight
define how to combine the existing topic model with all increments calculated since the last ArtmSynchronizeModel()
call. This is best described by the following formula:
n_wt_new = n_wt_old * decay_weight + n_wt_inc * apply_weight
where
n_wt_old
describes the current topic model, n_wt_inc
describes the increment calculated since the last ArtmSynchronizeModel()
call, and n_wt_new
defines the resulting topic model.
Expected values of both parameters are between 0.0 and 1.0. Here are some examples:
- Combination of decay_weight=0.0 and apply_weight=1.0 states that the previous Phi matrix of the topic model will be disregarded completely, and the new Phi matrix will be formed based on new increments gathered since last model synchronize.
- Combination of decay_weight=1.0 and apply_weight=1.0 states that new increments will be appended to the current Phi matrix without any decay.
- Combination of decay_weight=1.0 and apply_weight=0.0 states that new increments will be disregarded, and current Phi matrix will stay unchanged.
- To reproduce the Online variational Bayes for LDA algorithm by Matthew D. Hoffman set decay_weight = 1 - rho and apply_weight = rho, where parameter rho is defined as rho = pow(tau + t, -kappa). See Online Learning for Latent Dirichlet Allocation for further details.
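The update formula and the weight schedule from the examples above can be written out directly. This is a minimal sketch on flat Python lists of counters; the function names are hypothetical, and `tau`, `t` and `kappa` are simply the symbols used in the last example.

```python
def synchronize(n_wt_old, n_wt_inc, decay_weight, apply_weight):
    """n_wt_new = n_wt_old * decay_weight + n_wt_inc * apply_weight."""
    return [old * decay_weight + inc * apply_weight
            for old, inc in zip(n_wt_old, n_wt_inc)]


def online_weights(tau, t, kappa):
    """Return (decay_weight, apply_weight) with rho = (tau + t) ** -kappa."""
    rho = (tau + t) ** (-kappa)
    return 1.0 - rho, rho


old, inc = [1.0, 2.0], [3.0, 4.0]
replaced = synchronize(old, inc, decay_weight=0.0, apply_weight=1.0)
appended = synchronize(old, inc, decay_weight=1.0, apply_weight=1.0)
unchanged = synchronize(old, inc, decay_weight=1.0, apply_weight=0.0)
decay, apply_ = online_weights(tau=3.0, t=1, kappa=0.5)
blended = synchronize(old, inc, decay, apply_)
```

With `tau = 3`, `t = 1` and `kappa = 0.5` the schedule gives rho = 0.5, so the old counters and the increment are averaged.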
-
SynchronizeModelArgs.
apply_weight
¶ See
decay_weight
for the description.
-
SynchronizeModelArgs.
invoke_regularizers
¶ A flag indicating whether to invoke all phi-regularizers.
InitializeModelArgs¶
-
class
messages_pb2.
InitializeModelArgs
¶
Represents an argument of ArtmInitializeModel()
operation.
Please refer to
example14_initialize_topic_model.py
for further information.
message InitializeModelArgs {
enum SourceType {
Dictionary = 0;
Batches = 1;
}
message Filter {
optional string class_id = 1;
optional float min_percentage = 2;
optional float max_percentage = 3;
optional int32 min_items = 4;
optional int32 max_items = 5;
optional int32 min_total_count = 6;
optional int32 min_one_item_count = 7;
}
optional string model_name = 1;
optional string dictionary_name = 2;
optional SourceType source_type = 3 [default = Dictionary];
optional string disk_path = 4;
repeated Filter filter = 5;
}
-
InitializeModelArgs.
model_name
¶ The name of the model to be initialized.
-
InitializeModelArgs.
dictionary_name
¶ The name of the dictionary containing all tokens that should be initialized.
GetTopicModelArgs¶
Represents an argument of ArtmRequestTopicModel()
operation.
message GetTopicModelArgs {
enum RequestType {
Pwt = 0;
Nwt = 1;
}
optional string model_name = 1;
repeated string topic_name = 2;
repeated string token = 3;
repeated string class_id = 4;
optional bool use_sparse_format = 5;
optional float eps = 6 [default = 1e-37];
optional RequestType request_type = 7 [default = Pwt];
}
-
GetTopicModelArgs.
model_name
¶ The name of the model to be retrieved.
-
GetTopicModelArgs.
topic_name
¶ The list of topic names to be retrieved. This value is optional. When not provided, all topics will be retrieved.
-
GetTopicModelArgs.
token
¶ The list of tokens to be retrieved. The length of this field must match the length of
class_id
field. This field is optional. When not provided, all tokens will be retrieved.
-
GetTopicModelArgs.
class_id
¶ The list of classes corresponding to all tokens. The length of this field must match the length of
token
field. This field is only required together with token
, otherwise it is ignored.
-
GetTopicModelArgs.
use_sparse_format
¶ An optional flag that defines whether to use sparse format for the resulting
TopicModel
message. See TopicModel
message for additional information about the sparse format. Note that setting use_sparse_format = true results in an empty TopicModel.internals
field.
-
GetTopicModelArgs.
eps
¶ A small value that defines zero threshold for
p(w|t)
probabilities. This field is only used in sparse format. Values of p(w|t)
below the threshold will be excluded from the resulting Phi matrix.
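The effect of the `eps` threshold can be sketched as a conversion from dense rows to the sparse pair of weights and indices. The helper `dense_to_sparse` is hypothetical and works on plain lists; treating a value exactly equal to `eps` as retained is an assumption, since the text only says that values below the threshold are dropped.

```python
def dense_to_sparse(dense_rows, eps=1e-37):
    """Split dense rows into parallel (weights, indices) lists.

    Entries below eps are omitted; the rest keep their topic index.
    """
    token_weights, topic_index = [], []
    for row in dense_rows:
        kept = [(i, v) for i, v in enumerate(row) if v >= eps]
        topic_index.append([i for i, _ in kept])
        token_weights.append([v for _, v in kept])
    return token_weights, topic_index


weights, indices = dense_to_sparse([[0.5, 0.0, 0.25], [0.0, 1.0, 0.0]])
```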
-
GetTopicModelArgs.
request_type
¶ An optional value that defines what kind of data to retrieve in this operation.
Pwt
Indicates that the resulting TopicModel message should contain p(w|t) probabilities. These values are normalized to form a probability distribution (sum_w p(w|t) = 1 for all topics t).
Nwt
Indicates that the resulting TopicModel message should contain internal n_wt counters of the topic model. These values represent an internal state of the topic model.
The default setting is to retrieve p(w|t) probabilities. These probabilities are sufficient to infer p(t|d) distributions using this topic model.
n_wt counters allow you to restore the precise state of the topic model. By passing these values to the ArtmOverwriteTopicModel() operation you are guaranteed to get the model in the same state as you retrieved it. As a result you may continue topic model inference from the point where you stopped it last time.
p(w|t) values can also be restored via the ArtmOverwriteTopicModel() operation. The resulting model will give the same p(t|d) distributions, however you should consider this model as read-only, and should not call ArtmSynchronizeModel() on it.
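The normalization property sum_w p(w|t) = 1 can be illustrated by turning a table of counters into per-topic distributions. This sketch assumes the simplest case, in which p(w|t) is obtained by normalizing non-negative n_wt counters over tokens with no regularizer contribution; the function name and the `rows[w][t]` layout are hypothetical.

```python
def normalize_over_tokens(n_wt):
    """Return p_wt with p_wt[w][t] = n_wt[w][t] / sum_w n_wt[w][t]."""
    topics_count = len(n_wt[0]) if n_wt else 0
    totals = [sum(row[t] for row in n_wt) for t in range(topics_count)]
    return [
        [row[t] / totals[t] if totals[t] > 0 else 0.0
         for t in range(topics_count)]
        for row in n_wt
    ]


p_wt = normalize_over_tokens([[1.0, 0.0], [3.0, 2.0]])
column_sums = [sum(row[t] for row in p_wt) for t in range(2)]
```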
GetThetaMatrixArgs¶
Represents an argument of ArtmRequestThetaMatrix()
operation.
message GetThetaMatrixArgs {
optional string model_name = 1;
optional Batch batch = 2;
repeated string topic_name = 3;
repeated int32 topic_index = 4;
optional bool clean_cache = 5 [default = false];
optional bool use_sparse_format = 6 [default = false];
optional float eps = 7 [default = 1e-37];
}
-
GetThetaMatrixArgs.
model_name
¶ The name of the model to retrieve the theta matrix for.
-
GetThetaMatrixArgs.
topic_name
¶ The list of topic names, describing which topics to include in the Theta matrix. The values of this field should correspond to values in
ModelConfig.topic_name
. This field is optional, by default all topics will be included.
-
GetThetaMatrixArgs.
topic_index
¶ The list of topic indices, describing which topics to include in the Theta matrix. The values of this field should be integers between 0 and (
ModelConfig.topics_count
- 1). This field is optional; by default all topics will be included.
Note that this field acts similarly to
GetThetaMatrixArgs.topic_name
. It is not allowed to specify both topic_index and topic_name at the same time. The recommendation is to use topic_name.
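The selection rules for `topic_name` and `topic_index` can be summarized in a small client-side check. The helper below is hypothetical and only mirrors the constraints stated above (mutually exclusive arguments, indices within range, everything selected by default); it does not describe how the library validates the message internally.

```python
def select_topics(all_topic_names, topic_name=None, topic_index=None):
    """Resolve the requested topics to a list of topic names."""
    if topic_name and topic_index:
        raise ValueError("specify either topic_name or topic_index, not both")
    if topic_name:
        unknown = [name for name in topic_name if name not in all_topic_names]
        if unknown:
            raise ValueError("unknown topic names: %r" % unknown)
        return list(topic_name)
    if topic_index:
        for index in topic_index:
            if not 0 <= index < len(all_topic_names):
                raise ValueError("topic index out of range: %d" % index)
        return [all_topic_names[index] for index in topic_index]
    return list(all_topic_names)        # default: all topics


names = ["topic_0", "topic_1", "topic_2"]
by_index = select_topics(names, topic_index=[2, 0])
by_name = select_topics(names, topic_name=["topic_1"])
everything = select_topics(names)
```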
-
GetThetaMatrixArgs.
clean_cache
¶ An optional flag that defines whether to clear the theta matrix cache after this operation. Setting this value to True will clear the cache for a topic model, defined by
GetThetaMatrixArgs.model_name
. This value is only applicable when MasterComponentConfig.cache_theta
is set to True.
-
GetThetaMatrixArgs.
use_sparse_format
¶ An optional flag that defines whether to use sparse format for the resulting
ThetaMatrix
message. See ThetaMatrix
message for additional information about the sparse format.
-
GetThetaMatrixArgs.
eps
¶ A small value that defines zero threshold for
p(t|d)
probabilities. This field is only used in sparse format. Values of p(t|d)
below the threshold will be excluded from the resulting Theta matrix.
GetScoreValueArgs¶
Represents an argument of get score operation.
message GetScoreValueArgs {
optional string model_name = 1;
optional string score_name = 2;
optional Batch batch = 3;
}
-
GetScoreValueArgs.
model_name
¶ The name of the model to retrieve the score for.
-
GetScoreValueArgs.
score_name
¶ The name of the score to retrieve.
AddBatchArgs¶
Represents an argument of ArtmAddBatch()
operation.
message AddBatchArgs {
optional Batch batch = 1;
optional int32 timeout_milliseconds = 2 [default = -1];
optional bool reset_scores = 3 [default = false];
optional string batch_file_name = 4;
}
-
AddBatchArgs.
timeout_milliseconds
¶ Timeout in milliseconds for this operation.
-
AddBatchArgs.
reset_scores
¶ An optional flag that defines whether to reset all scores before this operation.
-
AddBatchArgs.
batch_file_name
¶ An optional value that defines the disk location of the batch to add. Exactly one of the parameters batch_file_name and batch has to be specified; setting both at the same time is not allowed.
InvokeIterationArgs¶
Represents an argument of ArtmInvokeIteration()
operation.
message InvokeIterationArgs {
optional int32 iterations_count = 1 [default = 1];
optional bool reset_scores = 2 [default = true];
optional string disk_path = 3;
}
-
InvokeIterationArgs.
iterations_count
¶ An integer value describing how many iterations to invoke.
-
InvokeIterationArgs.
reset_scores
¶ An optional flag that defines whether to reset all scores before this operation.
-
InvokeIterationArgs.
disk_path
¶ A value that defines the disk location with batches to process on this iteration.
WaitIdleArgs¶
Represents an argument of ArtmWaitIdle()
operation.
message WaitIdleArgs {
optional int32 timeout_milliseconds = 1 [default = -1];
}
-
WaitIdleArgs.
timeout_milliseconds
¶ Timeout in milliseconds for this operation.
ExportModelArgs¶
Represents an argument of ArtmExportModel()
operation.
message ExportModelArgs {
optional string file_name = 1;
optional string model_name = 2;
}
-
ExportModelArgs.
file_name
¶ A target file name where to store the topic model.
-
ExportModelArgs.
model_name
¶ A value that describes the name of the topic model. This name will match the name of the corresponding model config.
ImportModelArgs¶
Represents an argument of ArtmImportModel()
operation.
message ImportModelArgs {
optional string file_name = 1;
optional string model_name = 2;
}
-
ImportModelArgs.
file_name
¶ A source file name from which to load the topic model.
-
ImportModelArgs.
model_name
¶ A value that describes the name of the topic model. This name will match the name of the corresponding model config.