BigARTM v0.7.4 Release notes¶
BigARTM v0.7.4 is a big release that includes major rework of dictionaries and MasterModel.
bigartm/stable branch¶
Up until now BigARTM has had only one branch, master, containing the latest code.
This branch potentially includes untested code and unfinished features.
We are now introducing bigartm/stable branch, and encourage all users to
stop using master and start fetching from stable.
The stable branch will lag behind master and will be moved forward to master
as soon as the maintainers decide that it is ready.
At that point we will introduce a new tag (something like v0.7.3)
and produce a new release for Windows.
In addition, stable branch also might receive small urgent fixes in between releases,
typically to address critical issues reported by our users.
Such fixes will also be included in the master branch.
MasterModel¶
MasterModel is a new set of low-level APIs that allow users of the C interface to infer models and apply them to new data.
The APIs are ArtmCreateMasterModel, ArtmReconfigureMasterModel, ArtmFitOfflineMasterModel, ArtmFitOnlineMasterModel and ArtmRequestTransformMasterModel,
together with the corresponding protobuf messages. For a usage example see src/bigartm/srcmain.cc.
These APIs should be easy to understand for users who are familiar with the Python interface. Basically, we take the ARTM class in Python
and push it down to the core.
Now users can create their model via MasterModelConfig (protobuf message),
fit via ArtmFitOfflineMasterModel or ArtmFitOnlineMasterModel, and apply to the new data via ArtmRequestTransformMasterModel.
This means that the user no longer has to orchestrate low-level building blocks such as ArtmProcessBatches, ArtmMergeModel, ArtmRegularizeModel and ArtmNormalizeModel.
ArtmCreateMasterModel is similar to ArtmCreateMasterComponent in the sense that it returns a master_id,
which can later be passed to all other APIs. This means that most APIs will continue working as before.
This applies to ArtmRequestThetaMatrix, ArtmRequestTopicModel, ArtmRequestScore, and many others.
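To make the four building blocks mentioned above more concrete, here is a toy, pure-Python sketch of what one offline fit does conceptually: process each batch to accumulate counters, merge the counters, regularize them, and normalize the result. It is a simplified PLSA-style illustration and does not call BigARTM. All function names (`process_batch`, `merge`, `regularize`, `normalize`, `fit_offline`) and the constant-shift regularizer are assumptions made for this example, not the library's implementation.

```python
import random


def normalize(n_wt, num_topics):
    """Clip negative counters to zero, then scale each topic column to sum to 1."""
    clipped = {w: [max(v, 0.0) for v in row] for w, row in n_wt.items()}
    totals = [sum(row[t] for row in clipped.values()) for t in range(num_topics)]
    return {w: [row[t] / totals[t] if totals[t] > 0 else 0.0
                for t in range(num_topics)]
            for w, row in clipped.items()}


def topic_posterior(phi_w, theta):
    """p(topic | document, token), or None if the token has zero probability."""
    p = [a * b for a, b in zip(phi_w, theta)]
    z = sum(p)
    return [x / z for x in p] if z > 0 else None


def process_batch(batch, phi, num_topics, inner_iters=5):
    """Infer theta for every document in the batch and return counters n_wt."""
    n_wt = {w: [0.0] * num_topics for w in phi}
    for doc in batch:  # doc maps token -> count in that document
        theta = [1.0 / num_topics] * num_topics
        for _ in range(inner_iters):
            n_td = [0.0] * num_topics
            for w, n_dw in doc.items():
                post = topic_posterior(phi[w], theta)
                if post is not None:
                    for t in range(num_topics):
                        n_td[t] += n_dw * post[t]
            total = sum(n_td)
            if total > 0:
                theta = [x / total for x in n_td]
        for w, n_dw in doc.items():
            post = topic_posterior(phi[w], theta)
            if post is not None:
                for t in range(num_topics):
                    n_wt[w][t] += n_dw * post[t]
    return n_wt


def merge(counters, num_topics):
    """Sum the per-batch counters into one matrix."""
    merged = {}
    for n_wt in counters:
        for w, row in n_wt.items():
            acc = merged.setdefault(w, [0.0] * num_topics)
            for t in range(num_topics):
                acc[t] += row[t]
    return merged


def regularize(n_wt, tau):
    """Toy regularizer: shift every counter by tau (negative tau sparsifies)."""
    return {w: [v + tau for v in row] for w, row in n_wt.items()}


def fit_offline(batches, vocab, num_topics, passes, tau=0.0, seed=0):
    """Run the four steps for a fixed number of passes over all batches."""
    rng = random.Random(seed)
    phi = normalize({w: [rng.random() for _ in range(num_topics)] for w in vocab},
                    num_topics)
    for _ in range(passes):
        counters = [process_batch(b, phi, num_topics) for b in batches]
        phi = normalize(regularize(merge(counters, num_topics), tau), num_topics)
    return phi


batches = [[{"cat": 3, "dog": 1}, {"dog": 2, "fish": 1}], [{"fish": 4, "cat": 1}]]
phi = fit_offline(batches, vocab=["cat", "dog", "fish"], num_topics=2, passes=10)
```

With MasterModel, the loop inside `fit_offline` is what moves into the core, so a C-interface client only supplies the configuration and the batches.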
Rework of dictionaries¶
The previous implementation of dictionaries was really messy, and we are trying to clean it up. This effort is not finished yet; however, we decided to release the current version because
it is a major improvement compared to the previous one.
At the low-level (c_interface), we now have the following methods to work with dictionaries:
- ArtmGatherDictionary collects a dictionary based on a folder with batches,
- ArtmFilterDictionary filters tokens from the dictionary based on their term frequency or document frequency,
- ArtmCreateDictionary creates a dictionary from a custom DictionaryData object (protobuf message),
- ArtmRequestDictionary retrieves a dictionary as a DictionaryData object (protobuf message),
- ArtmDisposeDictionary deletes a dictionary object from BigARTM,
- ArtmImportDictionary imports a dictionary from a binary file,
- ArtmExportDictionary exports a dictionary into a binary file.
All dictionaries are identified by a string ID (dictionary_name).
Dictionaries can be used to initialize the model, in regularizers or in scores.
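As a rough illustration of what gathering and filtering mean, the following pure-Python sketch collects term frequency and document frequency from in-memory batches and then filters tokens by document frequency. It does not use BigARTM. The function names, the plain-dict data layout, and the inclusive treatment of the thresholds are assumptions made for this example.

```python
from collections import Counter


def gather_dictionary(batches):
    """Collect term frequency (tf) and document frequency (df) for every token."""
    tf, df = Counter(), Counter()
    num_docs = 0
    for batch in batches:
        for doc in batch:  # doc maps token -> count in that document
            num_docs += 1
            for token, count in doc.items():
                tf[token] += count
                df[token] += 1
    return {"tf": dict(tf), "df": dict(df), "num_docs": num_docs}


def filter_dictionary(dictionary, min_df=None, max_df_rate=None):
    """Keep tokens with df >= min_df and df / num_docs <= max_df_rate."""
    num_docs = dictionary["num_docs"]
    kept = []
    for token, doc_freq in dictionary["df"].items():
        if min_df is not None and doc_freq < min_df:
            continue  # too rare
        if max_df_rate is not None and doc_freq / num_docs > max_df_rate:
            continue  # too common
        kept.append(token)
    return {"tf": {t: dictionary["tf"][t] for t in kept},
            "df": {t: dictionary["df"][t] for t in kept},
            "num_docs": num_docs}


batches = [[{"the": 5, "cat": 1}, {"the": 3, "dog": 2}],
           [{"the": 1, "cat": 2}, {"the": 2, "fish": 1}]]
full = gather_dictionary(batches)
filtered = filter_dictionary(full, min_df=2, max_df_rate=0.75)
```

Here `the` is dropped because it occurs in every document, while `dog` and `fish` are dropped because each occurs in only one.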
Note that ArtmImportDictionary and ArtmExportDictionary now use a different format.
For this reason we require that all imported or exported files end with the .dict extension.
This limitation is only introduced to make users aware of the change in binary format.
Warning
Please note that you have to re-generate all dictionaries created in previous BigARTM versions.
To enforce this, ArtmImportDictionary and ArtmExportDictionary require
that all imported or exported files end with the .dict extension.
Please note that in the next version (BigARTM v0.8.0) we are planning to break the dictionary format once again.
This is because we will introduce boost.serialize library for all import and export methods.
From that point boost.serialize library will allow us to upgrade formats without breaking backwards compatibility.
The following example illustrates how to work with the new dictionaries from Python.
import artm

# Parse collection in UCI format from D:\Datasets\docword.kos.txt and D:\Datasets\vocab.kos.txt
# and store the resulting batches into D:\Datasets\kos_batches
batch_vectorizer = artm.BatchVectorizer(data_format='bow_uci',
                                        data_path=r'D:\Datasets',
                                        collection_name='kos',
                                        target_folder=r'D:\Datasets\kos_batches')

# Initialize the model. For now dictionaries exist within the model,
# but we will address this in the future.
model = artm.ARTM(...)

# Gather dictionary named `dict` from batches.
# The resulting dictionary will contain all distinct tokens that occur
# in those batches, and their term frequencies.
# Raw strings keep the backslashes in Windows paths from being read as escapes.
model.gather_dictionary("dict", r"D:\Datasets\kos_batches")

# Filter dictionary by removing tokens with too high or too low document frequency.
# Save the result as `filtered_dict`.
model.filter_dictionary(dictionary_name='dict',
                        dictionary_target_name='filtered_dict',
                        min_df=10, max_df_rate=0.4)

# Initialize model from `filtered_dict`
model.initialize("filtered_dict")

# Import/export functionality
model.save_dictionary("filtered_dict", r"D:\Datasets\kos.dict")
model.load_dictionary("filtered_dict2", r"D:\Datasets\kos.dict")
Changes in the infrastructure¶
- Static linkage for bigartm command-line executable on Linux.
  To disable static linkage use
  cmake -DBUILD_STATIC_BIGARTM=OFF ..
- Install BigARTM python API via
  python setup.py install
Changes in core functionality¶
- Custom transform function for KL-div regularizers
- Ability to initialize the model with custom seed
- TopicSelection regularizers
- PeakMemory score (Windows only)
- Different options to name batches when parsing collection
  (GUID as today, and CODE for sequential numbering)
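The difference between the two batch-naming options can be sketched in a few lines of Python. This is only an illustration of random versus sequential naming. The helper `batch_names` is hypothetical, and the zero-padded decimal pattern used for CODE is an assumption, since these notes do not show the exact file names BigARTM produces.

```python
import uuid


def batch_names(count, scheme="GUID", extension=".batch"):
    """Return `count` batch file names using the given naming scheme."""
    if scheme == "GUID":
        # Random, globally unique names; order carries no meaning.
        return [str(uuid.uuid4()) + extension for _ in range(count)]
    if scheme == "CODE":
        # Sequential names (assumed zero-padded format), so that sorting
        # by name reproduces the order in which batches were created.
        return ["{:06d}{}".format(i, extension) for i in range(count)]
    raise ValueError("unknown naming scheme: {!r}".format(scheme))


sequential = batch_names(3, scheme="CODE")
random_names = batch_names(3, scheme="GUID")
```

Sequential names are convenient when batches must be processed or inspected in a reproducible order.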
Changes in Python API¶
- ARTM.dispose() method for managing native memory
- ARTM.get_info() method to retrieve internal state
- Performance fixes
- Expose class prediction functionality
Changes in C++ interface¶
- Consume MasterModel APIs in C++ interface. Going forward this is the only C++ interface that we will support.
Changes in console interface¶
- Better options to work with dictionaries
- --write-dictionary-readable to export dictionary
- --force switch to let user overwrite existing files
- --help generates much better examples
- --model-v06 to experiment with old APIs (ArtmInvokeIteration / ArtmWaitIdle / ArtmSynchronizeModel)
- --write-scores switch to export scores into file
- --time-limit option to time-box model inference (as an alternative to --passes switch)
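The idea of time-boxing inference instead of fixing the number of passes can be sketched as a simple loop. This is a generic illustration, not the command-line tool's code. The function `run_passes`, its parameters, and the use of seconds as the unit are assumptions made for this example.

```python
import time


def run_passes(do_pass, passes=None, time_limit=None, clock=time.monotonic):
    """Call do_pass(i) until `passes` passes are done or `time_limit` has elapsed.

    Returns the number of completed passes. The time limit is checked only
    between passes, so a pass that has started always runs to completion.
    """
    if passes is None and time_limit is None:
        raise ValueError("specify passes, time_limit, or both")
    start = clock()
    completed = 0
    while True:
        if passes is not None and completed >= passes:
            break
        if time_limit is not None and clock() - start >= time_limit:
            break
        do_pass(completed)
        completed += 1
    return completed


# Example: stop after four passes, with no time limit.
log = []
done = run_passes(log.append, passes=4)
```

Passing both arguments stops at whichever limit is reached first.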