BigARTM command line utility

This document provides an overview of the bigartm command-line utility shipped with BigARTM.

For a detailed description of the bigartm command-line interface, refer to the bigartm.exe notebook (in Russian).

In brief, you need to download some input data (a textual collection represented in bag-of-words format). We recommend downloading the vocab and docword files via the links provided in the Downloads section of the tutorial. Then you can run bigartm as described by the output of bigartm --help (additional illustrative examples follow the help text):

>bigartm --help
BigARTM - library for advanced topic modeling (http://bigartm.org):

Input data:
  -c [ --read-vw-corpus ] arg         Raw corpus in Vowpal Wabbit format
  -d [ --read-uci-docword ] arg       docword file in UCI format
  -v [ --read-uci-vocab ] arg         vocab file in UCI format
  --batch-size arg (=500)             number of items per batch
  --use-batches arg                   folder with batches to use

Dictionary:
  --dictionary-min-df arg             filter out tokens present in less than N
                                      documents / less than P% of documents
  --dictionary-max-df arg             filter out tokens present in more than N
                                      documents / more than P% of documents
  --use-dictionary arg                filename of binary dictionary file to use

Model:
  --load-model arg                    load model from file before processing
  -t [ --topics ] arg (=16)           number of topics
  --use-modality arg                  modalities (class_ids) and their weights

Learning:
  -p [ --passes ] arg (=10)           number of outer iterations
  --inner-iterations-count arg (=10)  number of inner iterations
  --update-every arg (=0)             [online algorithm] requests an update of
                                      the model after every update_every documents
  --tau0 arg (=1024)                  [online algorithm] weight option from
                                      online update formula
  --kappa arg (=0.699999988)          [online algorithm] exponent option from
                                      online update formula
  --reuse-theta                       reuse theta between iterations
  --regularizer arg                   regularizers (SmoothPhi, SparsePhi,
                                      SmoothTheta, SparseTheta, Decorrelation)
  --threads arg (=0)                  number of concurrent processors (default:
                                      auto-detect)

Output:
  --save-model arg                    save the model to binary file after
                                      processing
  --save-batches arg                  batch folder
  --save-dictionary arg               filename of dictionary file
  --write-model-readable arg          output the model in a human-readable
                                      format
  --write-predictions arg             write prediction in a human-readable
                                      format
  --score-level arg (=2)              score level (0, 1, 2, or 3)
  --score arg                         scores (Perplexity, SparsityTheta,
                                      SparsityPhi, TopTokens, ThetaSnippet, or
                                      TopicKernel)
  --final-score arg                   final scores (same as scores)

Other options:
  -h [ --help ]                       display this help message
  --response-file arg                 response file
  --paused                            start paused and wait for a keystroke
                                      (allows attaching a debugger)
  --disk-cache-folder arg             disk cache folder
  --disable-avx-opt                   disable AVX optimization (makes the
                                      Processor component behave as in BigARTM
                                      v0.5.4)
  --use-dense-bow                     use dense representation of bag-of-words
                                      data in processors

Examples:
        bigartm -d docword.kos.txt -v vocab.kos.txt
        set GLOG_logtostderr=1 & bigartm -d docword.kos.txt -v vocab.kos.txt
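
The following examples expand on the two built-in ones. They are illustrative sketches rather than canonical recipes: all file and folder names (batches, dictionary.bin, model.bin, and so on) are placeholders, and exact option formats may differ between BigARTM versions.

Parse a UCI collection once, saving the resulting batches and the collection dictionary so that later runs can skip the parsing step:

>bigartm -d docword.kos.txt -v vocab.kos.txt --batch-size 1000 --save-batches batches --save-dictionary dictionary.bin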
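
Train a model from previously saved batches, dropping rare and overly frequent tokens. This sketch assumes --dictionary-min-df accepts an absolute document count and --dictionary-max-df accepts a percentage, as suggested by the option descriptions above:

>bigartm --use-batches batches --dictionary-min-df 5 --dictionary-max-df 50% -t 32 -p 20 --save-model model.bin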
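
Regularizers and scores can be combined in one run. The argument format of --regularizer is an assumption here (a plain regularizer name, passed once per regularizer); consult bigartm --help for your version before relying on it:

>bigartm --use-batches batches --use-dictionary dictionary.bin -t 32 -p 20 --regularizer SparsePhi --regularizer SparseTheta --score Perplexity --score TopTokens --save-model model.bin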
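
To switch to the online algorithm, request a model update after every N documents and control the learning-rate schedule with --tau0 (weight) and --kappa (exponent) from the online update formula; a single outer pass is typical for online learning:

>bigartm --use-batches batches -t 32 -p 1 --update-every 5000 --tau0 1024 --kappa 0.7 --save-model model.bin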
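
A saved binary model can be loaded back, exported to a human-readable format, and used to write per-document predictions; whether prediction requires an extra pass over the batches may depend on the version:

>bigartm --load-model model.bin --use-batches batches --write-model-readable model.txt --write-predictions predictions.txt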