BigARTM command line utilityΒΆ
This document provides an overview of cpp_client
,
a simple command-line utility shipped with BigARTM.
To run cpp_client you need to download input data (a textual collection represented in bag-of-words format). We recommend to download vocab and docword files by links provided in Downloads section of the tutorial. Then you can use cpp_client as follows:
cpp_client -d docword.kos.txt -v vocab.kos.txt
You may append the following options to customize the resulting topic model:
-t
or--num_topic
sets the number of topics in the resulting topic model.-i
or--num_iters
sets the number of iterative scans over the collection.--num_inner_iters
sets the number of updates of theta matrix performed on each iteration.--reuse_theta
enables caching of Theta matrix and re-uses last Theta matrix from the previous iteration as initial approximation on the next iteration. The default alternative without--reuse_theta
switch is to generate random approximation of Theta matrix on each iteration.--tau_phi
,--tau_theta
and--tau_decor
allows you to specify weights of different regularizers. Currently cpp_client does not allow you to customize regularizer weights for different topics and for different iterations. This limitation is only related to cpp_client, and you can simply achieve this by using BigARTM interface (either in Python or in C++).--update_every
is a parameter of the online algorithm. When specified, the model will be updated every update_every documents.
You may also apply the following optimizations that should not change the resulting model
-p
allows you to specify number of concurrent processors. The recommended value is to use the number of logical cores on your machine.--no_scores
disables calculation and visualization of all scores. This is a clean way of measuring pure performance of BigARTM, because at the moment some scores takes unnecessary long time to calculate.--disk_cache_folder
applies only together with--reuse_theta
. This parameter allows you to specify a writable disk location where BigARTM can cache Theta matrix between iterations to avoid storing it in main memory.--merger_queue_size
limits the size of the merger queue. Decrease the size of the queue might reduce memory usage, but decrease CPU utilization of the processors.
>cpp_client --help
BigARTM - library for advanced topic modeling (http://bigartm.org):
Basic options:
-h [ --help ] display this help message
-d [ --docword ] arg docword file in UCI format
-v [ --vocab ] arg vocab file in UCI format
-b [ --batch_folder ] arg If docword or vocab arguments are not
provided, cpp_client will try to read
pre-parsed batches from batch_folder
location. Otherwise, if both docword and
vocab arguments are provided, cpp_client
will parse data and store batches in
batch_folder location.
-t [ --num_topic ] arg (=16) number of topics
-p [ --num_processors ] arg (=2) number of concurrent processors
-i [ --num_iters ] arg (=10) number of outer iterations
--num_inner_iters arg (=10) number of inner iterations
--reuse_theta reuse theta between iterations
--dictionary_file arg (=dictionary) filename of dictionary file
--items_per_batch arg (=500) number of items per batch
--tau_phi arg (=0) regularization coefficient for PHI
matrix
--tau_theta arg (=0) regularization coefficient for THETA
matrix
--tau_decor arg (=0) regularization coefficient for topics
decorrelation (use with care, since this
value heavily depends on the size of the
dataset)
--paused wait for keystroke (allows to attach a
debugger)
--no_scores disable calculation of all scores
--update_every arg (=0) [online algorithm] requests an update of
the model after update_every document
--parsing_format arg (=0) parsing format (0 - UCI, 1 - matrix
market)
--disk_cache_folder arg disk cache folder
--merger_queue_size arg size of the merger queue
Examples:
cpp_client -d docword.kos.txt -v vocab.kos.txt
set GLOG_logtostderr=1 & cpp_client -d docword.kos.txt -v vocab.kos.txt
For further details please refer to the source code of cpp_client.