BigARTM Command Line Utility

This document provides an overview of the bigartm command-line utility shipped with BigARTM.

For a detailed description of the bigartm command-line interface, refer to the bigartm.exe notebook (in Russian).

In brief, you first need some input data: a text collection in bag-of-words format. We recommend downloading the sample collections in Vowpal Wabbit format from the links in the Downloads section of the tutorial. You can then run bigartm as described by bigartm --help. For more information about the built-in regularizers, run bigartm --help --regularizer.

BigARTM v0.8.2 - library for advanced topic modeling (http://bigartm.org):

Input data:
  -c [ --read-vw-corpus ] arg           Raw corpus in Vowpal Wabbit format
  -d [ --read-uci-docword ] arg         docword file in UCI format
  -v [ --read-uci-vocab ] arg           vocab file in UCI format
  --read-cooc arg                       read co-occurrences format
  --batch-size arg (=500)               number of items per batch
  --use-batches arg                     folder with batches to use

Dictionary:
  --dictionary-min-df arg               filter out tokens present in less than
                                        N documents / less than P% of documents
  --dictionary-max-df arg               filter out tokens present in more than
                                        N documents / more than P% of documents
  --dictionary-size arg (=0)            limit dictionary size by filtering out
                                        tokens with high document frequency
  --use-dictionary arg                  filename of binary dictionary file to
                                        use

Model:
  --load-model arg                      load model from file before processing
  -t [ --topics ] arg (=16)             number of topics
  --use-modality arg                    modalities (class_ids) and their
                                        weights
  --predict-class arg                   target modality to predict by theta
                                        matrix

Learning:
  -p [ --num-collection-passes ] arg (=0)
                                        number of outer iterations (passes
                                        through the collection)
  --num-document-passes arg (=10)       number of inner iterations (passes
                                        through the document)
  --update-every arg (=0)               [online algorithm] requests an update
                                        of the model after update_every
                                        documents
  --tau0 arg (=1024)                    [online algorithm] weight option from
                                        online update formula
  --kappa arg (=0.699999988)            [online algorithm] exponent option from
                                        online update formula
  --reuse-theta                         reuse theta between iterations
  --regularizer arg                     regularizers (SmoothPhi, SparsePhi,
                                        SmoothTheta, SparseTheta,
                                        Decorrelation)
  --threads arg (=-1)                   number of concurrent processors
                                        (default: auto-detect)
  --async                               invoke asynchronous version of the
                                        online algorithm

Output:
  --save-model arg                      save the model to binary file after
                                        processing
  --save-batches arg                    batch folder
  --save-dictionary arg                 filename of dictionary file
  --write-model-readable arg            output the model in a human-readable
                                        format
  --write-dictionary-readable arg       output the dictionary in a
                                        human-readable format
  --write-predictions arg               write prediction in a human-readable
                                        format
  --write-class-predictions arg         write class prediction in a
                                        human-readable format
  --write-scores arg                    write scores in a human-readable format
  --write-vw-corpus arg                 convert batches into plain text file in
                                        Vowpal Wabbit format
  --force                               force overwrite existing output files
  --csv-separator arg (=;)              columns separator for
                                        --write-model-readable and
                                        --write-predictions. Use \t or TAB to
                                        indicate tab.
  --score-level arg (=2)                score level (0, 1, 2, or 3)
  --score arg                           scores (Perplexity, SparsityTheta,
                                        SparsityPhi, TopTokens, ThetaSnippet,
                                        or TopicKernel)
  --final-score arg                     final scores (same as scores)

Other options:
  -h [ --help ]                         display this help message
  --rand-seed arg                       specify seed for random number
                                        generator, use system timer when not
                                        specified
  --guid-batch-name                     applies to save-batches and indicates
                                        that batch names should be guids (not
                                        sequential codes)
  --response-file arg                   response file
  --paused                              start paused and wait for a keystroke
                                        (allows attaching a debugger)
  --disk-cache-folder arg               disk cache folder
  --disable-avx-opt                     disable AVX optimization (gives similar
                                        behavior of the Processor component to
                                        BigARTM v0.5.4)
  --time-limit arg (=0)                 limit execution time in milliseconds
  --log-dir arg                         target directory for logging
                                        (GLOG_log_dir)
  --log-level arg                       min logging level (GLOG_minloglevel;
                                        INFO=0, WARNING=1, ERROR=2, and
                                        FATAL=3)

Examples:

* Download input data:
  wget https://s3-eu-west-1.amazonaws.com/artm/docword.kos.txt
  wget https://s3-eu-west-1.amazonaws.com/artm/vocab.kos.txt
  wget https://s3-eu-west-1.amazonaws.com/artm/vw.mmro.txt
  wget https://s3-eu-west-1.amazonaws.com/artm/vw.wiki-enru.txt.zip

* Parse docword and vocab files from UCI bag-of-words format; then fit topic model with 20 topics:
  bigartm -d docword.kos.txt -v vocab.kos.txt -t 20 --num-collection-passes 10

* Parse VW format; then save the resulting batches and dictionary:
  bigartm --read-vw-corpus vw.mmro.txt --save-batches mmro_batches --save-dictionary mmro.dict

* Parse VW format from standard input; note usage of single dash '-' after --read-vw-corpus:
  cat vw.mmro.txt | bigartm --read-vw-corpus - --save-batches mmro2_batches --save-dictionary mmro2.dict

* Re-save batches back into VW format:
  bigartm --use-batches mmro_batches --write-vw-corpus vw.mmro.txt

* Parse only specific modalities from VW file, and save them as a new VW file:
  bigartm --read-vw-corpus vw.wiki-enru.txt --use-modality @russian --write-vw-corpus vw.wiki-ru.txt

* Load and filter the dictionary on document frequency; save the result into a new file:
  bigartm --use-dictionary mmro.dict --dictionary-min-df 5 --dictionary-max-df 40% --save-dictionary mmro-filter.dict

* Load the dictionary and export it in a human-readable format:
  bigartm --use-dictionary mmro.dict --write-dictionary-readable mmro.dict.txt

* Use batches to fit a model with 20 topics; then save the model in a binary format:
  bigartm --use-batches mmro_batches --num-collection-passes 10 -t 20 --save-model mmro.model

* Load the model and export it in a human-readable format:
  bigartm --load-model mmro.model --write-model-readable mmro.model.txt

* Load the model and use it to generate predictions:
  bigartm --read-vw-corpus vw.mmro.txt --load-model mmro.model --write-predictions mmro.predict.txt

* Fit model with two modalities (@default_class and @target), and use it to predict @target label:
  bigartm --use-batches <batches> --use-modality @default_class,@target --topics 50 --num-collection-passes 10 --save-model model.bin
  bigartm --use-batches <batches> --use-modality @default_class,@target --topics 50 --load-model model.bin
          --write-predictions pred.txt --csv-separator=tab
          --predict-class @target --write-class-predictions pred_class.txt --score ClassPrecision

* Fit simple regularized model (increase sparsity up to 60-70%):
  bigartm -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
          --num-collection-passes 10 --batch-size 50 --topics 20 --write-model-readable model.txt
          --regularizer "0.05 SparsePhi" "0.05 SparseTheta"

* Fit a more advanced regularized model, with 10 sparse objective topics and 2 smooth background topics:
  bigartm -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
          --num-collection-passes 10 --batch-size 50 --topics "obj:10;background:2" --write-model-readable model.txt
          --regularizer "0.05 SparsePhi #obj"
          --regularizer "0.05 SparseTheta #obj"
          --regularizer "0.25 SmoothPhi #background"
          --regularizer "0.25 SmoothTheta #background"

* Upgrade batches in the old format (from folder 'old_folder' into 'new_folder'):
  bigartm --use-batches old_folder --save-batches new_folder

* Configure logger to output into stderr:
  set GLOG_logtostderr=1 & bigartm -d docword.kos.txt -v vocab.kos.txt -t 20 --num-collection-passes 10
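
The individual examples above can be chained into one run. The following is a minimal sketch on a made-up three-document corpus, not an official recipe. The file names (tiny.vw.txt, tiny_batches, tiny.model, tiny.model.txt) are hypothetical, and the corpus layout (document title first, then tokens with an optional :count suffix) is an assumption about the Vowpal Wabbit bag-of-words convention. Only options listed in the help output above are used, and bigartm is invoked only if it is found on PATH.

```shell
#!/bin/sh
# Hypothetical toy corpus: one document per line, title first, then tokens.
# The "token:count" form is assumed to mean that the token occurs count times.
cat > tiny.vw.txt <<'EOF'
doc1 topic model text:2 collection
doc2 model regularizer sparse:3
doc3 text collection batch dictionary
EOF

if command -v bigartm >/dev/null 2>&1; then
  # 1. Parse the corpus; save batches and a dictionary.
  bigartm --read-vw-corpus tiny.vw.txt \
          --save-batches tiny_batches --save-dictionary tiny.dict --force
  # 2. Fit a 2-topic model with 5 passes over the batches; save it in binary form.
  bigartm --use-batches tiny_batches -t 2 -p 5 --save-model tiny.model --force
  # 3. Export the fitted model in a human-readable format.
  bigartm --load-model tiny.model --write-model-readable tiny.model.txt --force
else
  echo "bigartm not found on PATH; only tiny.vw.txt was created"
fi
```

Splitting the run into separate parse, fit, and export steps means the corpus is parsed only once, so later experiments can start from the saved batches.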

Additional information about regularizers:

>bigartm.exe --regularizer --help
List of regularizers available in BigARTM CLI:

        --regularizer "tau SmoothTheta #topics"
        --regularizer "tau SparseTheta #topics"
        --regularizer "tau SmoothPhi #topics @class_ids !dictionary"
        --regularizer "tau SparsePhi #topics @class_ids !dictionary"
        --regularizer "tau Decorrelation #topics @class_ids"
        --regularizer "tau TopicSelection #topics"
        --regularizer "tau LabelRegularization #topics @class_ids !dictionary"
        --regularizer "tau ImproveCoherence #topics @class_ids !dictionary"
        --regularizer "tau Biterms #topics @class_ids !dictionary"

List of regularizers available in BigARTM, but not exposed in CLI:

        --regularizer "tau SpecifiedSparsePhi"
        --regularizer "tau SmoothPtdw"
        --regularizer "tau HierarchySparsingTheta"

If you would like to see any of these regularizers in BigARTM CLI, please send a message to
        bigartm-users@googlegroups.com.

By default all regularizers act on the full set of topics and modalities.
To limit the action to a specific set of topics, use the hash sign (#), followed by
the list of topics (for example, #topic1;topic2) or topic groups (#obj).
Similarly, to limit the action to a specific set of class ids, use the at sign (@),
followed by the list of class ids (for example, @default_class).
Some regularizers accept a dictionary. To specify the dictionary, use the exclamation mark (!),
followed by the path to the dictionary (.dict file in your file system).
Depending on the regularizer, the dictionary can be either optional or required.
Some regularizers expect a dictionary with tokens and their frequencies;
other regularizers expect a dictionary with token co-occurrences.
For more information about regularizers, refer to the wiki page:

        https://github.com/bigartm/bigartm/wiki/Implemented-regularizers

To get full help run `bigartm --help` without the --regularizer switch.
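
The "tau Name #topics @class_ids !dictionary" pattern described above can be assembled with a small shell helper. This is only an illustrative sketch: make_regularizer is a hypothetical function, not part of BigARTM, and it merely concatenates the parts in the order shown in the list of regularizers. It does not check whether a given regularizer accepts each part.

```shell
#!/bin/sh
# Hypothetical helper: build a --regularizer argument string.
#   $1 = tau, $2 = regularizer name,
#   $3 = topics or topic group (optional), $4 = class ids without '@' (optional),
#   $5 = path to a dictionary file (optional)
make_regularizer() {
  spec="$1 $2"
  if [ -n "$3" ]; then spec="$spec #$3"; fi
  if [ -n "$4" ]; then spec="$spec @$4"; fi
  # '!' is single-quoted so interactive shells do not treat it as history expansion.
  if [ -n "$5" ]; then spec="$spec "'!'"$5"; fi
  printf '%s\n' "$spec"
}

make_regularizer 0.05 SparsePhi obj
make_regularizer 0.25 SmoothPhi background default_class mmro.dict
```

The resulting string would be passed as a single quoted argument, for example --regularizer "$(make_regularizer 0.05 SparsePhi obj)", so that the shell does not split it at spaces or interpret # as a comment.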