BigARTM command line utility¶
This document provides an overview of cpp_client, a simple command-line utility shipped with BigARTM.
To run cpp_client you need to download input data (a textual collection represented in bag-of-words format). We recommend to download vocab and docword files by links provided in Downloads section of the tutorial. Then you can use cpp_client as follows:
cpp_client -d docword.kos.txt -v vocab.kos.txt
You may append the following options to customize the resulting topic model:
- -t or --num_topic sets the number of topics in the resulting topic model.
- -i or --num_iters sets the number of iterative scans over the collection.
- --num_inner_iters sets the number of updates of theta matrix performed on each iteration.
- --reuse_theta enables caching of Theta matrix and re-uses last Theta matrix from the previous iteration as initial approximation on the next iteration. The default alternative without --reuse_theta switch is to generate random approximation of Theta matrix on each iteration.
- --tau_phi, --tau_theta and --tau_decor allows you to specify weights of different regularizers. Currently cpp_client does not allow you to customize regularizer weights for different topics and for different iterations. This limitation is only related to cpp_client, and you can simply achieve this by using BigARTM interface (either in Python or in C++).
- --update_every is a parameter of the online algorithm. When specified, the model will be updated every update_every documents.
You may also apply the following optimizations that should not change the resulting model
- --reuse_batches skips parsing of docword and vocab files, and tries to use batches located in --batch_folder. You may download pre-parsed batches by links provided in Downloads section.
- -p allows you to specify number of concurrent processors. The recommended value is to use the number of logical cores on your machine.
- --no_scores disables calculation and visualization of all scores. This is a clean way of measuring pure performance of BigARTM, because at the moment some scores takes unnecessary long time to calculate.
- --disk_cache_folder applies only together with --reuse_theta. This parameter allows you to specify a writable disk location where BigARTM can cache Theta matrix between iterations to avoid storing it in main memory.
- --merger_queue_size limits the size of the merger queue. Decrease the size of the queue might reduce memory usage, but decrease CPU utilization of the processors.
>cpp_client --help BigARTM - library for advanced topic modeling (http://bigartm.org): Basic options: -h [ --help ] display this help message -d [ --docword ] arg docword file in UCI format -v [ --vocab ] arg vocab file in UCI format -t [ --num_topic ] arg (=16) number of topics -p [ --num_processors ] arg (=2) number of concurrent processors -i [ --num_iters ] arg (=10) number of outer iterations --num_inner_iters arg (=10) number of inner iterations --reuse_theta reuse theta between iterations --batch_folder arg (=batches) temporary folder to store batches --dictionary_file arg (=filename of dictionary file) --reuse_batches reuse batches found in batch_folder (default = false) --items_per_batch arg (=500) number of items per batch --tau_phi arg (=0) regularization coefficient for PHI matrix --tau_theta arg (=0) regularization coefficient for THETA matrix --tau_decor arg (=0) regularization coefficient for topics decorrelation (use with care, since this value heavily depends on the size of the dataset) --paused wait for keystroke (allows to attach a debugger) --no_scores disable calculation of all scores --update_every arg (=0) [online algorithm] requests an update of the model after update_every document --parsing_format arg (=0) parsing format (0 - UCI, 1 - matrix market) --disk_cache_folder arg disk cache folder --merger_queue_size arg size of the merger queue Networking options (experimental): --nodes arg endpoints of the remote nodes (enables network modus operandi) --localhost arg (=localhost) DNS name or the IP address of the localhost --port arg (=5550) port to use for master node --proxy arg proxy endpoint --timeout arg (=1000) network communication timeout in milliseconds Examples: cpp_client -d docword.kos.txt -v vocab.kos.txt set GLOG_logtostderr=1 & cpp_client -d docword.kos.txt -v vocab.kos.txt
For further details please refer to the source code of cpp_client.