BigARTM command line utility

This document provides an overview of bigartm command-line utility shipped with BigARTM.

For a detailed description of the bigartm command line interface, refer to the bigartm.exe notebook (in Russian).

In brief, you need to download some input data (a textual collection represented in bag-of-words format). We recommend downloading the vocab and docword files via the links provided in the Downloads section of the tutorial. Then you can use bigartm as described by bigartm --help:

BigARTM - library for advanced topic modeling (http://bigartm.org):

Input data:
  -c [ --read-vw-corpus ] arg         Raw corpus in Vowpal Wabbit format
  -d [ --read-uci-docword ] arg       docword file in UCI format
  -v [ --read-uci-vocab ] arg         vocab file in UCI format
  --read-cooc arg                     read co-occurrences format
  --batch-size arg (=500)             number of items per batch
  --use-batches arg                   folder with batches to use

Dictionary:
  --dictionary-min-df arg             filter out tokens present in less than N
                                      documents / less than P% of documents
  --dictionary-max-df arg             filter out tokens present in more than N
                                      documents / more than P% of documents
  --use-dictionary arg                filename of binary dictionary file to use

Model:
  --load-model arg                    load model from file before processing
  -t [ --topics ] arg (=16)           number of topics
  --use-modality arg                  modalities (class_ids) and their weights
  --predict-class arg                 target modality to predict by theta
                                      matrix

Learning:
  -p [ --passes ] arg (=0)            number of outer iterations
  --inner-iterations-count arg (=10)  number of inner iterations
  --update-every arg (=0)             [online algorithm] requests an update of
                                      the model after every update_every documents
  --tau0 arg (=1024)                  [online algorithm] weight option from
                                      online update formula
  --kappa arg (=0.699999988)          [online algorithm] exponent option from
                                      online update formula
  --reuse-theta                       reuse theta between iterations
  --regularizer arg                   regularizers (SmoothPhi, SparsePhi,
                                      SmoothTheta, SparseTheta, Decorrelation)
  --threads arg (=0)                  number of concurrent processors (default:
                                      auto-detect)
  --async                             invoke asynchronous version of the online
                                      algorithm
  --model-v06                         use legacy model from BigARTM v0.6.4

Output:
  --save-model arg                    save the model to binary file after
                                      processing
  --save-batches arg                  batch folder
  --save-dictionary arg               filename of dictionary file
  --write-model-readable arg          output the model in a human-readable
                                      format
  --write-dictionary-readable arg     output the dictionary in a human-readable
                                      format
  --write-predictions arg             write predictions in a human-readable
                                      format
  --write-class-predictions arg       write class predictions in a
                                      human-readable format
  --write-scores arg                  write scores in a human-readable format
  --force                             force overwrite existing output files
  --csv-separator arg (=;)            column separator for
                                      --write-model-readable and
                                      --write-predictions. Use \t or TAB to
                                      indicate tab.
  --score-level arg (=2)              score level (0, 1, 2, or 3)
  --score arg                         scores (Perplexity, SparsityTheta,
                                      SparsityPhi, TopTokens, ThetaSnippet, or
                                      TopicKernel)
  --final-score arg                   final scores (same as scores)

Other options:
  -h [ --help ]                       display this help message
  --response-file arg                 response file
  --paused                            start paused and wait for a keystroke
                                      (allows attaching a debugger)
  --disk-cache-folder arg             disk cache folder
  --disable-avx-opt                   disable AVX optimization (makes the
                                      Processor component behave similarly to
                                      BigARTM v0.5.4)
  --use-dense-bow                     use dense representation of bag-of-words
                                      data in processors
  --time-limit arg (=0)               limit execution time in milliseconds

Examples:

* Download input data:
  wget https://s3-eu-west-1.amazonaws.com/artm/docword.kos.txt
  wget https://s3-eu-west-1.amazonaws.com/artm/vocab.kos.txt
  wget https://s3-eu-west-1.amazonaws.com/artm/vw.mmro.txt

* Parse docword and vocab files in the UCI bag-of-words format; then fit a topic model with 20 topics:
  bigartm -d docword.kos.txt -v vocab.kos.txt -t 20 --passes 10
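
  For reference, the docword file read by -d is plain text in the UCI bag-of-words
  format: three header lines (D = number of documents, W = vocabulary size, NNZ =
  number of non-zero counts), followed by one "docID wordID count" triple per line;
  the vocab file read by -v lists one token per line, and the line number is the
  wordID. The tiny sample below is made up purely for illustration:

    3
    5
    4
    1 1 2
    1 4 1
    2 2 5
    3 5 1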

* Parse VW format; then save the resulting batches and dictionary:
  bigartm --read-vw-corpus vw.mmro.txt --save-batches mmro_batches --save-dictionary mmro.dict

* Parse VW format from standard input; note the use of the single dash '-' after --read-vw-corpus:
  cat vw.mmro.txt | bigartm --read-vw-corpus - --save-batches mmro2_batches --save-dictionary mmro2.dict
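
  In BigARTM's Vowpal Wabbit-like format each line describes one document: the
  first token is the document title, later tokens may carry an optional ":count"
  suffix, and "|@modality_name" switches the following tokens to another modality.
  The line below is a made-up illustration, not real data:

    doc1 alpha bravo:10 charlie:5 |@labels label1 label2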

* Load and filter the dictionary on document frequency; save the result into a new file:
  bigartm --use-dictionary mmro.dict --dictionary-min-df 5 --dictionary-max-df 40% --save-dictionary mmro-filter.dict

* Load the dictionary and export it in a human-readable format:
  bigartm --use-dictionary mmro.dict --write-dictionary-readable mmro.dict.txt

* Use batches to fit a model with 20 topics; then save the model in a binary format:
  bigartm --use-batches mmro_batches --passes 10 -t 20 --save-model mmro.model

* Load the model and export it in a human-readable format:
  bigartm --load-model mmro.model --write-model-readable mmro.model.txt

* Load the model and use it to generate predictions:
  bigartm --read-vw-corpus vw.mmro.txt --load-model mmro.model --write-predictions mmro.predict.txt

* Fit a model with two modalities (@default_class and @target), then use it to predict the @target label:
  bigartm --use-batches <batches> --use-modality @default_class,@target --topics 50 --passes 10 --save-model model.bin
  bigartm --use-batches <batches> --use-modality @default_class,@target --topics 50 --load-model model.bin
          --write-predictions pred.txt --csv-separator=tab
          --predict-class @target --write-class-predictions pred_class.txt --score ClassPrecision
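
  Assuming the Vowpal Wabbit input sketched earlier, a training document carrying
  both modalities could look like the made-up line below: tokens before the first
  "|" belong to @default_class, and the @target modality holds the label that
  --predict-class will try to recover:

    doc42 alpha bravo:3 charlie |@target sport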

* Fit a simple regularized model (increases sparsity up to 60-70%):
  bigartm -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
          --passes 10 --batch-size 50 --topics 20 --write-model-readable model.txt
          --regularizer "0.05 SparsePhi" "0.05 SparseTheta"

* Fit a more advanced regularized model, with 10 sparse objective topics and 2 smooth background topics:
  bigartm -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
          --passes 10 --batch-size 50 --topics obj:10;background:2 --write-model-readable model.txt
          --regularizer "0.05 SparsePhi #obj"
          --regularizer "0.05 SparseTheta #obj"
          --regularizer "0.25 SmoothPhi #background"
          --regularizer "0.25 SmoothTheta #background"

* Configure the logger to output to stderr (the example uses Windows cmd syntax):
  set GLOG_logtostderr=1 & bigartm -d docword.kos.txt -v vocab.kos.txt -t 20 --passes 10
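
  On Linux the same glog environment variable can be set inline for a single run:

    GLOG_logtostderr=1 bigartm -d docword.kos.txt -v vocab.kos.txt -t 20 --passes 10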