Changes in Python API
This page describes recent changes in BigARTM's Python API. Note that the API might be affected by changes in the underlying protobuf messages. For this reason we recommend reviewing Changes in Protobuf Messages.
For further reference on the Python API, see ARTM model, Q & A, or the tutorials.
v0.9.0

- Remove the `process_in_memory` flag from `artm.BatchVectorizer` and rename the `model` parameter to `process_in_memory_model`. Passing this parameter with `data_format='batches'` will trigger in-memory processing of the batches specified in the `batches` parameter. `artm.BatchVectorizer` can now receive sparse matrices (subclasses of `scipy.sparse.spmatrix`) for in-memory batch processing.
- Enable custom decorrelation by specifying separate decorrelation coefficients for every pair of topics in `artm.DecorrelatorPhiRegularizer` via the optional `topic_pairs` parameter.
- Add new regularizers: `artm.SmoothTimeInTopicsPhiRegularizer` and `artm.NetPlsaRegularizer`.
- Add support for saving and loading an entire model via the `dump_artm_model` and `load_artm_model` methods of `artm.ARTM` (see the sketch below).
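A minimal save/load sketch using the methods named above. The path is illustrative, `model` is assumed to be a fitted `artm.ARTM` instance, and we assume `load_artm_model` is callable at module level; adjust if your installed version attaches it elsewhere:

```python
import artm

# assume `model` is a fitted artm.ARTM instance; the path is illustrative
model.dump_artm_model('my_model_dir')

# restore the whole model later (assumption: load_artm_model is exposed
# at module level in the artm package)
restored_model = artm.load_artm_model('my_model_dir')
```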
v0.8.3

- Enable copying of `ARTM`, `LDA`, `hARTM` and `ARTM_Level` objects with the `clone()` method and `copy.deepcopy(obj)` (see the sketch after this list).
- Experimental support for import/export/editing of theta matrices; for more details see the Python reference of `ARTM.__init__(ptd_name='ptd')`.
- Add `ARTM.get_phi_dense()` method to extract the phi matrix without `pandas.DataFrame`; see #758.
- Bug fix in `ARTM.get_phi_sparse()`: it now returns tokens as rows and topic names as columns (previously it was the other way around).
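Both copying routes look like this (a minimal sketch; `model` stands for any existing object of the classes listed above):

```python
import copy

model_copy_1 = model.clone()         # dedicated clone() method
model_copy_2 = copy.deepcopy(model)  # standard-library deep copy works as well
```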
v0.8.2

Warning

BigARTM's 3rd-party dependency had been upgraded from protobuf 2.6.1 to protobuf 3.0.0. This may affect your upgrade from a previous version of BigARTM. Please report any issues at bigartm-users@googlegroups.com.

Warning

BigARTM now requires the tqdm library to visualize progress bars. To install it, use `pip install tqdm` or `conda install -c conda-forge tqdm`.
- Add support for Python 3
- Add `hARTM` class to support hierarchical models (see the sketch after this list)
- Add `HierarchySparsingTheta` for advanced inference of hierarchical models
- Enable replacing regularizers in ARTM-like models:

```python
# using operator[]-like style
model.regularizers['somename'] = SomeRegularizer(...)

# using the keyword argument `overwrite` in the add function
model.regularizers.add(SomeRegularizer(name='somename', ...), overwrite=True)
```

- Better error reporting: raise an exception in `fit_offline`, `fit_online` and `transform` if there is no data to process
- Better support for changes in topic names, with `reconfigure()`, `initialize()` and `merge_model()`
- Show progress bars in `fit_offline`, `fit_online` and `transform`
- Add `ARTM.reshape_topics` method to add/remove/reorder topics
- Add `max_dictionary_size` parameter to `Dictionary.filter()`
- Add `class_ids` parameter to `BatchVectorizer.__init__()`
- Add `dictionary_name` parameter to `MasterComponent.merge_model()`
- Add `ARTM.transform_sparse()` and `ARTM.get_theta_sparse()` for sparse retrieval of the theta matrix
- Add `ARTM.get_phi_sparse()` for sparse retrieval of the phi matrix
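As a sketch of the new hierarchy support (a hedged example: it assumes `add_level` as the level-creation call and a prepared `batch_vectorizer`; exact argument names may differ between versions):

```python
import artm

hier = artm.hARTM()                     # hierarchical ARTM model
level0 = hier.add_level(num_topics=10)  # top level: broad topics
level1 = hier.add_level(num_topics=50)  # child level: finer subtopics
hier.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)
```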
v0.8.1

- New source type 'bow_n_wd' was added to the BatchVectorizer class. This type is oriented towards using the output of the CountVectorizer and TfidfVectorizer classes from sklearn. The new parameters of BatchVectorizer are `n_wd` (numpy.array) and `vocabulary` (dict). See the sketch after this list.
- The LDA model was added as one of the public interfaces. It is a restricted ARTM model created to simplify BigARTM usage for new users with little experience in topic modeling.
- BatchVectorizer got a `gather_dictionary` flag, which defaults to `True`. This means that BatchVectorizer will create a dictionary and save it in the `BatchVectorizer.dictionary` field. For the 'bow_n_wd' format the dictionary is gathered regardless of whether the flag is set to `False` or `True`.
- Add relative regularization for the Phi matrix
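For example, the new source type plugs directly into sklearn's CountVectorizer output, and the resulting vectorizer can feed the new LDA interface. A sketch with made-up sample texts:

```python
import artm
from sklearn.feature_extraction.text import CountVectorizer

texts = ['the cat sat on the mat', 'the dog ate my homework']

cv = CountVectorizer()
n_dw = cv.fit_transform(texts)   # documents x words
n_wd = n_dw.toarray().T          # BigARTM expects words x documents
# use cv.get_feature_names() on older scikit-learn versions
vocabulary = dict(enumerate(cv.get_feature_names_out()))

bv = artm.BatchVectorizer(data_format='bow_n_wd', n_wd=n_wd, vocabulary=vocabulary)

# gather_dictionary=True (the default) stores a dictionary in bv.dictionary
lda = artm.LDA(num_topics=2, dictionary=bv.dictionary)
lda.fit_offline(batch_vectorizer=bv, num_collection_passes=10)
```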
v0.8.0

Warning

Note that your script can be affected by our changes in the default values of the `num_document_passes` and `reuse_theta` parameters (see below). We recommend using our new default settings, `num_document_passes = 10` and `reuse_theta = False`. However, if you choose to explicitly set `num_document_passes = 1`, then make sure to also set `reuse_theta = True`, otherwise you will experience very slow convergence.
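For instance, the recommended settings spelled out explicitly (a minimal sketch; `my_dictionary` is prepared as in the snippet below):

```python
import artm

# recommended v0.8.0 defaults, written out explicitly for clarity
model = artm.ARTM(num_topics=20,
                  num_document_passes=10,   # new default
                  reuse_theta=False,        # new default
                  dictionary=my_dictionary)
```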
- All operations for working with dictionaries were moved into a separate class, `artm.Dictionary` (details in the documentation). The mapping between old and new methods is very straightforward: `ARTM.gather_dictionary` is replaced with the `Dictionary.gather` method, which gathers a dictionary from a set of batches; `ARTM.filter_dictionary` is replaced with the `Dictionary.filter` method, which filters a dictionary based on term frequency and document frequency; `ARTM.load_dictionary` is replaced with the `Dictionary.load` method, which loads a dictionary previously exported to disk by the `Dictionary.save` method; `ARTM.create_dictionary` is replaced with the `Dictionary.create` method, which creates a dictionary from a custom protobuf message `DictionaryData` containing a set of dictionary entries; etc. The following code snippet gives a basic example:

```python
my_dictionary = artm.Dictionary()
my_dictionary.gather(data_path='my_collection_batches', vocab_file_path='vocab.txt')
my_dictionary.save(dictionary_path='my_collection_batches/my_dictionary')
my_dictionary.load(dictionary_path='my_collection_batches/my_dictionary.dict')

model = artm.ARTM(num_topics=20, dictionary=my_dictionary)
model.scores.add(artm.PerplexityScore(name='my_first_perplexity_score',
                                      use_unigram_document_model=False,
                                      dictionary=my_dictionary))
```
- Added `library_version` property to the `ARTM` class to query the version of the underlying BigARTM library; returns a string in MAJOR.MINOR.PATCH format.
- The `dictionary_name` argument had been renamed to `dictionary` in many places across the Python interface, including scores and regularizers. This is done because those arguments can now accept not just a string, but also the `artm.Dictionary` class itself. With the `Dictionary` class, users no longer have to generate names for their dictionaries (e.g. the unique `dictionary_name` identifier that references the dictionary). You may use the `Dictionary.name` field to access the underlying name of the dictionary.
- Added `dictionary` argument to the `ARTM.__init__` constructor to let the user initialize the model. Note that we've changed the behavior whereby the model was automatically initialized whenever the user called `fit_offline` or `fit_online`. This is no longer the case; we expect the user to either pass a dictionary to the `ARTM.__init__` constructor or call the `ARTM.initialize` method manually. If neither is done, `ARTM.fit_offline` and `ARTM.fit_online` will throw an exception.
- Added `seed` argument to the `ARTM.__init__` constructor to let the user randomly initialize the model.
- Added new score and score tracker `BackgroundTokensRatio`.
- Removed the default value of the `num_topics` argument in the `ARTM.__init__` constructor, which previously defaulted to `num_topics = 10`; now the user must always specify the desired number of topics.
- Moved the `reuse_theta` argument from the `fit_offline` method into the `ARTM.__init__` constructor. The argument is still used to indicate that the previous theta matrix should be re-used on the next pass over the collection; setting `reuse_theta = True` in the constructor now also applies to `fit_online`, which previously did not have this option.
- Moved the common argument `num_document_passes` from the `ARTM.fit_offline`, `ARTM.fit_online` and `ARTM.transform` methods into the `ARTM.__init__` constructor.
- Changed the default value of the `cache_theta` parameter from `True` to `False` (in the `ARTM.__init__` constructor). This is done to avoid excessive memory usage due to caching of the entire Theta matrix; if caching is indeed required, the user has to turn it on manually by setting `cache_theta = True`.
- Changed the default value of the `reuse_theta` parameter from `True` to `False` (in the `ARTM.__init__` constructor); the reason is the same as for the `cache_theta` default change.
- Changed the default value of the `num_document_passes` parameter from `1` to `10` (in the `ARTM.__init__` constructor).
- Added arguments `apply_weight`, `decay_weight` and `update_after` to the `ARTM.fit_online` method; each argument accepts a list of floats. Setting all three arguments overrides the default behavior of the online algorithm, which relies on a specific formula with `tau0`, `kappa` and `update_every`.
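As an illustration, a hedged sketch of the overridden online schedule (the weights are made up and `bv` stands for a prepared `BatchVectorizer`):

```python
# update the model after the 4th, 8th and 12th processed batch,
# with explicit apply/decay weights for each of the three updates
model.fit_online(batch_vectorizer=bv,
                 update_after=[4, 8, 12],
                 apply_weight=[0.6, 0.5, 0.4],
                 decay_weight=[0.4, 0.5, 0.6])
```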
- Added argument `async` (boolean flag) to the `ARTM.fit_online` method for improved performance.
- Added argument `theta_matrix_type` to the `ARTM.transform` method; potential values are `"dense_theta"`, `"dense_ptdw"` and `None`; the default matrix type is `"dense_theta"`.
- Introduced a separate method `ARTM.remove_theta` to clear the cached theta matrix; removed the corresponding boolean switch `remove_theta` from the `ARTM.get_theta` method.
- Removed the `ARTM.fit_transform` method; note that the name was confusing because this method never fitted the model. The purpose of `ARTM.fit_transform` was to retrieve the Theta matrix after fitting the model (`ARTM.fit_offline` or `ARTM.fit_online`); the same functionality is now available via the `ARTM.get_theta` method.
- Introduced the `ARTM.get_score` method, which exists in parallel with the score tracking functionality. The goal of `ARTM.get_score(score_name)` is to always return the latest version of the score: for Phi scores this means calculating them on the fly; for Theta scores this means returning a score aggregated over the last call to the `ARTM.fit_offline`, `ARTM.fit_online` or `ARTM.transform` methods. In contrast, the score tracking functionality returns the overall history of a score (see the sketch at the end of this section). For further details on score calculation refer to the Q&A section in our wiki.
- Added `data_weight` to the `BatchVectorizer.__init__` constructor to let the user specify an individual weight for each batch.
- Score tracker classes had been rewritten, so you should make minor changes in the code that retrieves scores; for example:

```python
# in v0.7.x
print(model.score_tracker['Top100Tokens'].last_topic_info[topic_name].tokens)

# in v0.8.0
last_tokens = model.score_tracker['Top100Tokens'].last_tokens
print(last_tokens[topic_name])
```

- Added an API to initialize logging with a custom logging directory, log level, etc. Search our wiki Q&A page for more details.
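Finally, a hedged sketch contrasting `ARTM.get_score` with the score tracker. It assumes a perplexity score registered under the name used in the dictionary example above, and that both objects expose a `value` field:

```python
# score tracker: the full history of the score across fit calls
print(model.score_tracker['my_first_perplexity_score'].value)  # list of values

# get_score: only the latest version of the score
perplexity = model.get_score('my_first_perplexity_score')
print(perplexity.value)
```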