BigARTM Command Line Utility

This document provides an overview of the bigartm command-line utility shipped with BigARTM.

For a detailed description of the bigartm command-line interface, refer to the bigartm.exe notebook (in Russian).

In brief, you need to download some input data (a textual collection represented in bag-of-words format). We recommend downloading the sample collections in Vowpal Wabbit format via the links provided in the Downloads section of the tutorial. Then you can use bigartm as described in the output of bigartm --help. You may also get more information about built-in regularizers by typing bigartm --help --regularizer.
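
For orientation, here is a minimal sketch of that workflow using the UCI "kos" sample collection referenced in the Examples section below (substitute your own files as needed):

  # download a sample collection in UCI bag-of-words format
  wget https://s3-eu-west-1.amazonaws.com/artm/docword.kos.txt
  wget https://s3-eu-west-1.amazonaws.com/artm/vocab.kos.txt

  # inspect the available options and the built-in regularizers
  bigartm --help
  bigartm --help --regularizer

  # fit a 20-topic model with 10 passes over the collection
  bigartm -d docword.kos.txt -v vocab.kos.txt -t 20 --num-collection-passes 10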

  • Gathering of Co-occurrence Statistics File

In order to gather co-occurrence statistics files you need two files: a collection in Vowpal Wabbit format and a file of tokens (a so-called “vocab”) in UCI format. The vocab is needed to filter the tokens of the collection, which means that a co-occurrence won’t be calculated if some pair isn’t present in the vocab. There are two types of co-occurrences available now: tf and df (see the description below). You may also want to calculate the positive PMI of the gathered co-occurrence values. This information can be useful if you want to use the co-occurrence dictionary in coherence computation. The utility produces the files with this pointwise information as a pseudo-collection in Vowpal Wabbit format.

Note

If you want to compute co-occurrences of tokens of a non-default modality, you should specify those modalities in the vocab file. For more information about the UCI file format, please visit Input Data Formats and Datasets.
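
As an illustration (a sketch, not an excerpt from a real dataset), a vocab file covering a non-default modality might look like this, assuming that each line lists a token optionally followed by its modality name:

  topic @default_class
  model @default_class
  тема @russian
  модель @russian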

Here is a combination of command line keys that allows you to build co-occurrence dictionaries:

bigartm -c vw -v vocab --cooc-window 10 --cooc-min-tf 200 --write-cooc-tf cooc_tf_ --cooc-min-df 200 --write-cooc-df cooc_df_ --write-ppmi-tf ppmi_tf_ --write-ppmi-df ppmi_df_

The numbers and file names can be changed; this is just an example. For a description of each key, see the table of all available keys below.

You can also look at launchable examples and common mistakes in bigartm-book (in Russian).

  • BigARTM CLI keys
BigARTM v0.9.0 - library for advanced topic modeling (http://bigartm.org):

Input data:
  -c [ --read-vw-corpus ] arg           Raw corpus in Vowpal Wabbit format
  -d [ --read-uci-docword ] arg         docword file in UCI format
  -v [ --read-uci-vocab ] arg           vocab file in UCI format
  --read-cooc arg                       read co-occurrences format
  --batch-size arg (=500)               number of items per batch
  --use-batches arg                     folder with batches to use

Dictionary:
  --cooc-min-tf arg (=0)                minimal value of cooccurrences of a
                                        pair of tokens that are saved in
                                        dictionary of cooccurrences
  --cooc-min-df arg (=0)                minimal value of documents in which a
                                        specific pair of tokens occurred
                                        together closely
  --cooc-window arg (=5)                number of tokens around specific token,
                                        which are used in calculation of
                                        cooccurrences
  --dictionary-min-df arg               filter out tokens present in less than
                                        N documents / less than P% of documents
  --dictionary-max-df arg               filter out tokens present in more than
                                        N documents / more than P% of documents
  --dictionary-size arg (=0)            limit dictionary size by filtering out
                                        tokens with high document frequency
  --use-dictionary arg                  filename of binary dictionary file to
                                        use

Model:
  --load-model arg                      load model from file before processing
  -t [ --topics ] arg (=16)             number of topics
  --use-modality arg                    modalities (class_ids) and their
                                        weights
  --predict-class arg                   target modality to predict by theta
                                        matrix

Learning:
  -p [ --num-collection-passes ] arg (=0)
                                        number of outer iterations (passes
                                        through the collection)
  --num-document-passes arg (=10)       number of inner iterations (passes
                                        through the document)
  --update-every arg (=0)               [online algorithm] requests an update
                                        of the model after update_every
                                        documents
  --tau0 arg (=1024)                    [online algorithm] weight option from
                                        online update formula
  --kappa arg (=0.699999988)            [online algorithm] exponent option from
                                        online update formula
  --reuse-theta                         reuse theta between iterations
  --regularizer arg                     regularizers (SmoothPhi,SparsePhi,Smoot
                                        hTheta,SparseTheta,Decorrelation)
  --threads arg (=-1)                   number of concurrent processors
                                        (default: auto-detect)
  --async                               invoke asynchronous version of the
                                        online algorithm

Output:
  --write-cooc-tf arg                   save dictionary of co-occurrences with
                                        frequencies of co-occurrences of every
                                        specific pair of tokens in whole
                                        collection
  --write-cooc-df arg                   save dictionary of co-occurrences with
                                        number of documents in which every
                                        specific pair occurred together
  --write-ppmi-tf arg                   save values of positive pmi of pairs of
                                        tokens from cooc_tf dictionary
  --write-ppmi-df arg                   save values of positive pmi of pairs of
                                        tokens from cooc_df dictionary
  --save-model arg                      save the model to binary file after
                                        processing
  --save-batches arg                    batch folder
  --save-dictionary arg                 filename of dictionary file
  --write-model-readable arg            output the model in a human-readable
                                        format
  --write-dictionary-readable arg       output the dictionary in a
                                        human-readable format
  --write-predictions arg               write prediction in a human-readable
                                        format
  --write-class-predictions arg         write class prediction in a
                                        human-readable format
  --write-scores arg                    write scores in a human-readable format
  --write-vw-corpus arg                 convert batches into plain text file in
                                        Vowpal Wabbit format
  --force                               force overwrite existing output files
  --csv-separator arg (=;)              columns separator for
                                        --write-model-readable and
                                        --write-predictions. Use \t or TAB to
                                        indicate tab.
  --score-level arg (=2)                score level (0, 1, 2, or 3)
  --score arg                           scores (Perplexity, SparsityTheta,
                                        SparsityPhi, TopTokens, ThetaSnippet,
                                        or TopicKernel)
  --final-score arg                     final scores (same as scores)

Other options:
  -h [ --help ]                         display this help message
  --rand-seed arg                       specify seed for random number
                                        generator, use system timer when not
                                        specified
  --guid-batch-name                     applies to save-batches and indicates
                                        that batch names should be guids (not
                                        sequential codes)
  --response-file arg                   response file
  --paused                              start paused and wait for a keystroke
                                        (allows attaching a debugger)
  --disk-cache-folder arg               disk cache folder
  --disable-avx-opt                     disable AVX optimization (gives similar
                                        behavior of the Processor component to
                                        BigARTM v0.5.4)
  --time-limit arg (=0)                 limit execution time in milliseconds
  --log-dir arg                         target directory for logging
                                        (GLOG_log_dir)
  --log-level arg                       min logging level (GLOG_minloglevel;
                                        INFO=0, WARNING=1, ERROR=2, and
                                        FATAL=3)

Examples:

* Download input data:
  wget https://s3-eu-west-1.amazonaws.com/artm/docword.kos.txt
  wget https://s3-eu-west-1.amazonaws.com/artm/vocab.kos.txt
  wget https://s3-eu-west-1.amazonaws.com/artm/vw.mmro.txt
  wget https://s3-eu-west-1.amazonaws.com/artm/vw.wiki-enru.txt.zip

* Parse docword and vocab files from UCI bag-of-words format; then fit topic model with 20 topics:
  bigartm -d docword.kos.txt -v vocab.kos.txt -t 20 --num-collection-passes 10

* Parse VW format; then save the resulting batches and dictionary:
  bigartm --read-vw-corpus vw.mmro.txt --save-batches mmro_batches --save-dictionary mmro.dict

* Parse VW format from standard input; note usage of single dash '-' after --read-vw-corpus:
  cat vw.mmro.txt | bigartm --read-vw-corpus - --save-batches mmro2_batches --save-dictionary mmro2.dict

* Re-save batches back into VW format:
  bigartm --use-batches mmro_batches --write-vw-corpus vw.mmro.txt

* Parse only specific modalities from VW file, and save them as a new VW file:
  bigartm --read-vw-corpus vw.wiki-enru.txt --use-modality @russian --write-vw-corpus vw.wiki-ru.txt

* Load and filter the dictionary on document frequency; save the result into a new file:
  bigartm --use-dictionary mmro.dict --dictionary-min-df 5 --dictionary-max-df 40% --save-dictionary mmro-filter.dict

* Load the dictionary and export it in a human-readable format:
  bigartm --use-dictionary mmro.dict --write-dictionary-readable mmro.dict.txt

* Use batches to fit a model with 20 topics; then save the model in a binary format:
  bigartm --use-batches mmro_batches --num-collection-passes 10 -t 20 --save-model mmro.model

* Load the model and export it in a human-readable format:
  bigartm --load-model mmro.model --write-model-readable mmro.model.txt

* Load the model and use it to generate predictions:
  bigartm --read-vw-corpus vw.mmro.txt --load-model mmro.model --write-predictions mmro.predict.txt

* Fit model with two modalities (@default_class and @target), and use it to predict @target label:
  bigartm --use-batches <batches> --use-modality @default_class,@target --topics 50 --num-collection-passes 10 --save-model model.bin
  bigartm --use-batches <batches> --use-modality @default_class,@target --topics 50 --load-model model.bin
          --write-predictions pred.txt --csv-separator=tab
          --predict-class @target --write-class-predictions pred_class.txt --score ClassPrecision

* Fit a simple regularized model (increases sparsity up to 60-70%):
  bigartm -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
          --num-collection-passes 10 --batch-size 50 --topics 20 --write-model-readable model.txt
          --regularizer "0.05 SparsePhi" "0.05 SparseTheta"

* Fit a more advanced regularized model, with 10 sparse objective topics and 2 smooth background topics:
  bigartm -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
          --num-collection-passes 10 --batch-size 50 --topics obj:10;background:2 --write-model-readable model.txt
          --regularizer "0.05 SparsePhi #obj"
          --regularizer "0.05 SparseTheta #obj"
          --regularizer "0.25 SmoothPhi #background"
          --regularizer "0.25 SmoothTheta #background"

* Upgrade batches in the old format (from folder 'old_folder' into 'new_folder'):
  bigartm --use-batches old_folder --save-batches new_folder

* Configure logger to output into stderr:
  set GLOG_logtostderr=1 & bigartm -d docword.kos.txt -v vocab.kos.txt -t 20 --num-collection-passes 10

Additional information about regularizers:

>bigartm.exe --regularizer --help
List of regularizers available in BigARTM CLI:

        --regularizer "tau SmoothTheta #topics"
        --regularizer "tau SparseTheta #topics"
        --regularizer "tau SmoothPhi #topics @class_ids !dictionary"
        --regularizer "tau SparsePhi #topics @class_ids !dictionary"
        --regularizer "tau Decorrelation #topics @class_ids"
        --regularizer "tau TopicSelection #topics"
        --regularizer "tau LabelRegularization #topics @class_ids !dictionary"
        --regularizer "tau ImproveCoherence #topics @class_ids !dictionary"
        --regularizer "tau Biterms #topics @class_ids !dictionary"

List of regularizers available in BigARTM, but not exposed in CLI:

        --regularizer "tau SpecifiedSparsePhi"
        --regularizer "tau SmoothPtdw"
        --regularizer "tau HierarchySparsingTheta"

If you are interested in seeing any of these regularizers in the BigARTM CLI, please send a message to
        bigartm-users@googlegroups.com.

By default all regularizers act on the full set of topics and modalities.
To limit the action to a specific set of topics, use the hash sign (#), followed by
a list of topics (for example, #topic1;topic2) or topic groups (#obj).
Similarly, to limit the action to a specific set of class ids, use the at sign (@), followed
by a list of class ids (for example, @default_class).
Some regularizers accept a dictionary. To specify the dictionary, use the exclamation mark (!),
followed by the path to the dictionary (a .dict file in your file system).
Depending on the regularizer, the dictionary can be either optional or required.
Some regularizers expect a dictionary with tokens and their frequencies;
other regularizers expect a dictionary with token co-occurrences.
For more information about regularizers, refer to the wiki page:

        https://github.com/bigartm/bigartm/wiki/Implemented-regularizers

To get the full help, run `bigartm --help` without the --regularizer switch.
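
As an illustrative sketch of how these syntax elements combine (the file names, topic groups, and tau values below are assumptions, not part of the built-in help), one could first save a dictionary and then restrict regularizers to specific topic groups, a modality, and that dictionary:

  # parse the collection once, saving batches and a binary dictionary
  bigartm -d docword.kos.txt -v vocab.kos.txt --save-batches kos_batches --save-dictionary kos.dict

  # fit a model with sparse objective topics and smooth background topics
  bigartm --use-batches kos_batches --topics "obj:18;background:2" --num-collection-passes 10
          --regularizer "0.05 SparsePhi #obj @default_class"
          --regularizer "0.25 SmoothPhi #background @default_class !kos.dict"
          --write-model-readable model.txt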