Changes in Python API
This page describes recent changes in BigARTM’s Python API. Note that the API might be affected by changes in the underlying protobuf messages. For this reason we recommend reviewing Changes in Protobuf Messages.
- New source type ‘bow_n_wd’ was added to the BatchVectorizer class. This type is intended for use with the output of the CountVectorizer and TfidfVectorizer classes from sklearn. The new parameters of BatchVectorizer are n_wd (numpy.array) and vocabulary (dict).
- LDA model was added as one of the public interfaces. It is a restricted ARTM model created to simplify BigARTM usage for new users with little experience in topic modeling.
- BatchVectorizer got a flag ‘gather_dictionary’, which has the default value ‘True’. This means that BatchVectorizer will create a dictionary and save it in the BatchVectorizer.dictionary field. For the ‘bow_n_wd’ format the dictionary will be gathered regardless of whether the flag is set to ‘False’ or ‘True’.
- Added relative regularization for the Phi matrix.
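To illustrate the ‘bow_n_wd’ input, here is a minimal pure-Python sketch of the data it expects. It assumes that n_wd holds one row per vocabulary entry and one column per document, and that vocabulary maps a row index to its token. The helper count_matrix is hypothetical and only imitates, on a toy scale, what a transposed CountVectorizer output would contain; in real code the nested list would be a numpy.array.

```python
from collections import Counter


def count_matrix(documents):
    """Return (n_wd, vocabulary) for whitespace-tokenized documents.

    n_wd[w][d] is the number of times token w occurs in document d
    (assumed layout); vocabulary maps row index -> token.
    """
    tokens = sorted({tok for doc in documents for tok in doc.split()})
    vocabulary = dict(enumerate(tokens))
    counts = [Counter(doc.split()) for doc in documents]
    n_wd = [[c[tok] for c in counts] for tok in tokens]
    return n_wd, vocabulary


docs = ["topic model topic", "model data"]
n_wd, vocabulary = count_matrix(docs)

# With BigARTM and numpy installed, the result would be handed over
# roughly like this (an assumption, not executed here):
#   bv = artm.BatchVectorizer(data_format='bow_n_wd',
#                             n_wd=numpy.array(n_wd),
#                             vocabulary=vocabulary)
```

Each row of n_wd lines up with one key of vocabulary, so the two structures must always be built together.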
Note that your script can be affected by our changes in the default values for the num_document_passes and reuse_theta parameters (see below). We recommend using our new default settings, num_document_passes = 10 and reuse_theta = False. However, if you choose to explicitly set num_document_passes = 1, then make sure to also set reuse_theta = True; otherwise you will experience very slow convergence.
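The recommendation above can be turned into a small guard in a training script. The function below is a hypothetical helper, not part of BigARTM; it only checks the combination of values described in this note before they are passed to the model.

```python
import warnings


def check_pass_settings(num_document_passes=10, reuse_theta=False):
    """Warn about the slow-converging combination described above.

    The defaults mirror the new recommended settings. Returns True when
    the combination is fine and False when a warning was issued.
    """
    if num_document_passes == 1 and not reuse_theta:
        warnings.warn(
            "num_document_passes = 1 without reuse_theta = True "
            "is expected to converge very slowly"
        )
        return False
    return True


settings = {"num_document_passes": 1, "reuse_theta": True}
ok = check_pass_settings(**settings)
```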
all operations for working with dictionaries were moved into a separate class,
artm.Dictionary (details in the documentation). The mapping between old and new methods is very straightforward:
ARTM.gather_dictionary is replaced with the Dictionary.gather method, which gathers a dictionary from a set of batches;
ARTM.filter_dictionary is replaced with the Dictionary.filter method, which filters a dictionary based on term frequency and document frequency;
ARTM.load_dictionary is replaced with the Dictionary.load method, which loads a dictionary previously exported to disk;
ARTM.create_dictionary is replaced with the Dictionary.create method, which creates a dictionary based on the custom protobuf message DictionaryData, containing a set of dictionary entries;
etc.
The following code snippet gives a basic example:
    import artm

    # Gather a dictionary from the batches and a vocabulary file
    my_dictionary = artm.Dictionary()
    my_dictionary.gather(data_path='my_collection_batches', vocab_file_path='vocab.txt')
    # Save the dictionary to disk, then load it back
    my_dictionary.save(dictionary_path='my_collection_batches/my_dictionary')
    my_dictionary.load(dictionary_path='my_collection_batches/my_dictionary.dict')
    # Pass the Dictionary object itself to the model and to the score
    model = artm.ARTM(num_topics=20, dictionary=my_dictionary)
    model.scores.add(artm.PerplexityScore(name='my_first_perplexity_score',
                                          use_unigram_document_model=False,
                                          dictionary=my_dictionary))
ARTM class to query for the version of the underlying BigARTM library; returns a string in MAJOR.MINOR.PATCH format;
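Because the version is reported as a MAJOR.MINOR.PATCH string, comparing two versions as plain strings can mislead: '0.10.0' sorts before '0.8.0'. A minimal sketch of parsing the string into a tuple of integers follows; parse_version is a hypothetical helper and the version strings are made-up sample values.

```python
def parse_version(version):
    """Split a MAJOR.MINOR.PATCH string into a tuple of three ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


# Tuples compare numerically, component by component.
is_at_least_0_8 = parse_version("0.8.0") >= (0, 8, 0)
newer = parse_version("0.10.0") > parse_version("0.8.0")
```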
dictionary_name argument has been renamed to
dictionary in many places across the Python interface, including scores and regularizers. This is done because those arguments can now accept not just a string, but also the
Dictionary class users no longer have to generate names for their dictionaries (e.g. the unique
dictionary_name identifier that references the dictionary). You may use the
Dictionary.name field to access the underlying name of the dictionary.
ARTM.__init__ constructor to let the user initialize the model; note that we’ve changed the behavior whereby the model was automatically initialized whenever the user called
fit_online. This is no longer the case, and we expect the user to either pass a dictionary to the
ARTM.__init__ constructor or manually call the
ARTM.initialize method. If neither is done, then
ARTM.fit_online will throw an exception.
ARTM.__init__ constructor to let the user randomly initialize the model;
added new score and score tracker
removed the default value from
ARTM.__init__ constructor, which previously defaulted to
num_topics = 10; now the user must always specify the desired number of topics;
ARTM.__init__ constructor; the argument is still used to indicate that the previous theta matrix should be re-used on the next pass over the collection; setting
reuse_theta = True in the constructor will now be applied to
fit_online, which previously did not have this option.
moved common argument
changed the default value of
ARTM.__init__ constructor); this is done to avoid excessive memory usage due to caching of the entire Theta matrix; if caching is indeed required, the user has to manually turn it on by setting
cache_theta = True.
changed the default value of
ARTM.__init__ constructor); the reason is the same as for changing the default for
changed the default value of
ARTM.fit_online method; each argument accepts a list of floats; setting all three arguments will override the default behavior of the online algorithm that relies on a specific formula with
async (boolean flag) in
ARTM.fit_online method for improved performance.
ARTM.transform method; potential values are:
None; default matrix type is
introduced a separate method
ARTM.remove_theta to clear the cached theta matrix; removed the corresponding boolean switch
ARTM.fit_transform method; note that the name was confusing because this method has never fitted the model; the purpose of
ARTM.fit_transform was to retrieve the Theta matrix after fitting the model (
ARTM.fit_online); the same functionality is now available via
ARTM.get_score method, which will exist in parallel to the score tracking functionality; the goal for
ARTM.get_score(score_name) is to always return the latest version of the score; for Phi scores this means calculating them on the fly; for Theta scores this means returning a score aggregated over the last call to
ARTM.transform methods; in contrast to
ARTM.get_score, the score tracking functionality returns the overall history of a score. For further details on score calculation refer to the Q&A section on our wiki page.
BatchVectorizer.__init__ constructor to let the user specify an individual weight for each batch
score tracker classes have been rewritten, so you should make minor changes in the code that retrieves scores; for example:
    # in v0.7.x
    print(model.score_tracker['Top100Tokens'].last_topic_info[topic_name].tokens)
    # in v0.8.0
    last_tokens = model.score_tracker['Top100Tokens'].last_tokens
    print(last_tokens[topic_name])
added an API to initialize logging with a custom logging directory, log level, etc. Search our wiki page Q&A for more details.