BigARTM command line utility¶

This document provides an overview of cpp_client, a simple command-line utility shipped with BigARTM.

To run cpp_client you need to download input data (a textual collection represented in bag-of-words format). We recommend to download vocab and docword files by links provided in Parse collection section of the tutorial. Then you can use cpp_client as follows:

cpp_client -d docword.kos.txt -v vocab.kos.txt

You may append the following options to customize the resulting topic model:

-t or --num_topic sets the number of topics in the resulting topic model.

-i or --num_iters sets the number of iterative scans over the collection.

--num_inner_iters sets the number of updates of theta matrix performed on each iteration.

--reuse_theta enables caching of Theta matrix and re-uses last Theta matrix from the previous iteration as initial approximation on the next iteration. The default alternative without --reuse_theta switch is to generate random approximation of Theta matrix on each iteration.

--tau_phi, --tau_theta and --tau_decor allows you to specify weights of different regularizers. Currently cpp_client does not allow you to customize regularizer weights for different topics and for different iterations. This limitation is only related to cpp_client, and you can simply achieve this by using BigARTM interface (either in Python or in C++).

--online_decay and --online_period are the parameters of online algorithm. online_period specifies time interval in milliseconds between model synchronization. Once per this time interval the topic model will be re-calculated as follows: n_wt counters in the old model are decreased according to online_decay coefficient, and increased by n_wt counters calculated by all processors since the last model synchronization. The suggested way to choose online_period is to measure single iteration time for cpp_client in offline mode (e.g. without online_period option). Then set online_period to lambda = 0.25 of the iteration time. For larger collections it may be more efficient to set online_period to lambda = 0.1 of the iteration time, but then remember to update online_decay parameter accordingly (1-lambda should be a reasonable value).

You may also apply the following optimizations that should not change the resulting model

--reuse_batches skips parsing of docword and vocab files, and tries to use batches located in --batch_folder. You may download pre-parsed batches by links provided in Parse collection section of the tutorial.

-p allows you to specify number of concurrent processors. The recommended value is to use the number of logical cores on your machine.

--no_scores disables calculation and visualization of all scores. This is a clean way of measuring pure performance of BigARTM, because at the moment some scores takes unnecessary long time to calculate.

--disk_cache_folder applies only together with --reuse_theta. This parameter allows you to specify a writable disk location where BigARTM can cache Theta matrix between iterations to avoid storing it in main memory.

>cpp_client --help
BigARTM - library for advanced topic modeling (http://bigartm.org):

Basic options:
  -h [ --help ]                         display this help message
  -d [ --docword ] arg                  docword file in UCI format
  -v [ --vocab ] arg                    vocab file in UCI format
  -t [ --num_topic ] arg (=16)          number of topics
  -p [ --num_processors ] arg (=2)      number of concurrent processors
  -i [ --num_iters ] arg (=10)          number of outer iterations
  --num_inner_iters arg (=10)           number of inner iterations
  --reuse_theta                         reuse theta between iterations
  --batch_folder arg (=batches)         temporary folder to store batches
  --dictionary_file arg (=filename of dictionary file)
  --reuse_batches                       reuse batches found in batch_folder
                                        (default = false)
  --items_per_batch arg (=500)          number of items per batch
  --tau_phi arg (=0)                    regularization coefficient for PHI
                                        matrix
  --tau_theta arg (=0)                  regularization coefficient for THETA
                                        matrix
  --tau_decor arg (=0)                  regularization coefficient for topics
                                        decorrelation (use with care, since
                                        this value heavily depends on the size
                                        of the dataset)
  --paused                              wait for keystroke (allows to attach a
                                        debugger)
  --no_scores                           disable calculation of all scores
  --online_period arg (=0)              period in milliseconds between model
                                        synchronization on the online algorithm
  --online_decay arg (=0.75)            decay coefficient [0..1] for online
                                        algorithm
  --parsing_format arg (=0)             parsing format (0 - UCI, 1 - matrix
                                        market)
  --disk_cache_folder arg               disk cache folder

Networking options (experimental):
  --nodes arg                  endpoints of the remote nodes (enables network
                               modus operandi)
  --localhost arg (=localhost) DNS name or the IP address of the localhost
  --port arg (=5550)           port to use for master node
  --proxy arg                  proxy endpoint
  --timeout arg (=1000)        network communication timeout in milliseconds

Examples:
        cpp_client -d docword.kos.txt -v vocab.kos.txt
        set GLOG_logtostderr=1 & cpp_client -d docword.kos.txt -v vocab.kos.txt

For further details please refer to the source code of cpp_client.