Changes in Protobuf Messages


  • added CollectionParserConfig.num_threads to control the number of threads that perform parsing. At the moment the feature is only implemented for VW-format.
  • added CollectionParserConfig.class_id (repeated string) to control which modalities should be parsed. If token’s class_id is not from this list, it will be excluded from the resulting batches. When the list is empty, all modalities are included (this is the default behavior, as before).
  • added CollectionParserInfo message to export diagnostics information from ArtmParseCollection
  • added FilterDictionaryArgs.max_dictionary_size to give user an easy option to limit his dictionary size
  • added MergeModelArgs.dictionary_name to define the set of tokens in the resulting matrix
  • added ThetaMatrix.num_values, TopicModel.num_values to define number of non-zero elements in sparse format



New batches, created in BigARTM v0.8, CAN NOT be used in the previous versions of the library. Old batches, created prior to BigARTM v0.8, can still be used. See below for details.

  • added token_id and token_weight field in Item message, and obsoleted Item.field. Internally the library will merge the content of Field.token_id and Field.token_weight across all fields, and store the result back into Item.token_id, Item.token_weight. New Item message is as follows:

    message Item {
      optional int32 id = 1;
      repeated Field field = 2;  // obsolete in BigARTM v0.8.0
      optional string title = 3;
      repeated int32 token_id = 4;
      repeated float token_weight = 5;
  • renamed topics_count into num_topics across multiple messsages (TopicModel, ThetaMatrix, etc)

  • renamed inner_iterations_count into num_document_passes in ProcessBatchesArgs

  • renamed passes into num_collection_passes in FitOfflineMasterModelArgs

  • renamed threads into num_processors in MasterModelConfig

  • renamed topic_index field into topic_indices in TopicModel and ThetaMatrix messages

  • added messages ScoreArray, GetScoreArrayArgs and ClearScoreArrayCacheArgs to bring score tracking functionality down into BigARTM core

  • added messages BackgroundTokensRatioConfig and BackgroundTokensRatio (new score)

  • moved model_name from GetScoreValueArgs into ScoreConfig; this is done to support score tracking functionality in BigARTM core; each Phi score needs to know which model to use in calculation

  • removed topics_count from InitializeModelArgs; users must specify topic names in InitializeModelArgs.topic_name field

  • removed topic_index from GetThetaMatrixArgs; users must specify topic names to retrieve in GetThetaMatrixArgs.topic_name

  • removed batch field in GetThetaMatrixArgs and GetScoreValueArgs.batch messages; users should use ArtmRequestTransformMasterModel or ArtmRequestProcessBatches to process new batches and calculate theta scores

  • removed reset_scores flag in ProcessBatchesArgs; users should use new API ArtmClearScoreCache

  • removed clean_cache flag in GetThetaMatrixArgs; users should use new API ArtmClearThetaCache

  • removed MasterComponentConfig; users should user ArtmCreateMasterModel and pass MasterModelConfig

  • removed obsolete fields in CollectionParserConfig; same arguments can be specified at GatherDictionaryArgs and passed to ArtmGatherDictionary

  • removed Filter message in InitializeModelArgs; same arguments can be specified at FilterDictionaryArgs and passed to ArtmFilterDictionary

  • removed batch_name from ImportBatchesArgs; the field is no longer needed; batches will be identified via their identifier

  • removed use_v06_api in MasterModelConfig

  • removed ModelConfig message

  • removed SynchronizeModelArgs, AddBatchArgs, InvokeIterationArgs, WaitIdleArgs messages; users should use new APIs based on MasterModel

  • removed GetRegularizerStateArgs, RegularizerInternalState, MultiLanguagePhiInternalState messages

  • removed model_name and model_name_cache in ThetaMatrix, GetThetaMatrixArgs and ProcessBatchesArgs; the code of master component is simplified to only handle one theta matrix, so there is no longer any reason to identify theta matrix with model_name

  • removed Stream message, field, and all stream_name fields across several messages; train/test streaming functionality is fully removed; users are expected to manage their train and test collections (for example as separate folders with batches)

  • removed use_sparse_bow field in several messages; the computation mode with dense matrices is no longer supported;

  • renamed item_count into num_items in ThetaSnippetScoreConfig

  • add global enum ScoreType as a replacement for enums Type from ScoreConfig and ScoreData messages

  • add global enum RegularizerType as a replacement for enum Type from RegularizerConfig message

  • add global enum MatrixLayout as a replacement for enum MatrixLayout from GetThetaMatrixArgs and GetTopicModelArgs messages

  • add global enum ThetaMatrixType as a replacement for enum ThetaMatrixType from ProcessBatchesArgs and TransformMasterModelArgs messages

  • renamed enum Type into SmoothType in SmoothPtdwConfig to avoid conflicts in C# messages

  • renamed enum Mode into SparseMode in SpecifiedSparsePhiConfig to avoid conflicts in C# messages

  • renamed enum Format into CollectionFormat in CollectionParserConfig to avoid conflicts in C# messages

  • renamed enum NameType into BatchNameType in CollectionParserConfig to avoid conflicts in C# messages

  • renamed field transform_type into type in TransformConfig to avoid conflicts in C# messages

  • remove message CopyRequestResultArgs; this is a breaking change; please check that

    • all previous calls to ArtmCopyRequestResult are changed to to ArtmCopyRequestedMessage
    • all previous calls to ArtmCopyRequestResultEx with request types GetThetaSecondPass and GetModelSecondPass are changed to ArtmCopyRequestedObject
    • all previous calls to ArtmCopyRequestResultEx with DefaultRequestType are changed to ArtmCopyRequestedMessage
  • remove field request_type in GetTopicModelArgs; to request only topics and/or tokens users should set GetTopicModelArgs.matrix_layout to MatrixLayout_Sparse, and GetTopicModelArgs.eps = 1.001 (any number greather that 1.0).

  • change optional FloatArray into repeated float in field coherence of TopTokensScore

  • change optional DoubleArray into repeated double in fields kernel_size, kernel_purity, kernel_contrast and coherence of TopicKernelScore

  • change optional StringArray into repeated string in field topic_name of TopicKernelScore