BigARTM Command Line Utility
This document provides an overview of the bigartm command-line utility shipped with BigARTM.
For a detailed description of the bigartm command-line interface, refer to the bigartm.exe notebook (in Russian).
In brief, you need to download some input data (a textual collection represented in bag-of-words format).
We recommend downloading the sample collections in Vowpal Wabbit format from the links in the Downloads section of the tutorial.
Then you can use bigartm as described by bigartm --help.
You can also get more information about the built-in regularizers by typing bigartm --help --regularizer.
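For readers without a sample collection at hand, the following sketch writes a toy corpus by hand. It assumes that the Vowpal Wabbit layout read by bigartm is one document per line: a document id, then one or more `|modality` sections holding tokens with an optional `:count` suffix. The file name, document ids, tokens, and the `@target` label are invented for illustration. The bigartm call uses only options that appear in the help text below, and it is skipped when bigartm is not installed.

```shell
# Write a three-document toy corpus in (assumed) Vowpal Wabbit layout:
#   <doc_id> |<modality> token[:count] token[:count] ...
# All names below are made up for illustration.
cat > toy.vw.txt <<'EOF'
doc1 |@default_class topic:3 model:2 corpus
doc2 |@default_class corpus:4 token batch:2
doc3 |@default_class model token:2 |@target news
EOF

# Convert the corpus into batches and a dictionary, only if bigartm is on PATH.
if command -v bigartm >/dev/null 2>&1; then
    bigartm --read-vw-corpus toy.vw.txt \
            --save-batches toy_batches \
            --save-dictionary toy.dict \
            --force
fi
```

The resulting `toy_batches` folder and `toy.dict` file can then be passed to `--use-batches` and `--use-dictionary` in the same way as the mmro files in the examples of the help text.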
BigARTM v0.8.2 - library for advanced topic modeling (http://bigartm.org):
Input data:
-c [ --read-vw-corpus ] arg Raw corpus in Vowpal Wabbit format
-d [ --read-uci-docword ] arg docword file in UCI format
-v [ --read-uci-vocab ] arg vocab file in UCI format
--read-cooc arg read co-occurrences format
--batch-size arg (=500) number of items per batch
--use-batches arg folder with batches to use
Dictionary:
--dictionary-min-df arg filter out tokens present in less than
N documents / less than P% of documents
--dictionary-max-df arg filter out tokens present in more than
N documents / more than P% of documents
--dictionary-size arg (=0) limit dictionary size by filtering out
tokens with high document frequency
--use-dictionary arg filename of binary dictionary file to
use
Model:
--load-model arg load model from file before processing
-t [ --topics ] arg (=16) number of topics
--use-modality arg modalities (class_ids) and their
weights
--predict-class arg target modality to predict by theta
matrix
Learning:
-p [ --num-collection-passes ] arg (=0)
number of outer iterations (passes
through the collection)
--num-document-passes arg (=10) number of inner iterations (passes
through the document)
--update-every arg (=0) [online algorithm] requests an update
of the model after update_every
documents
--tau0 arg (=1024) [online algorithm] weight option from
online update formula
--kappa arg (=0.699999988) [online algorithm] exponent option from
online update formula
--reuse-theta reuse theta between iterations
--regularizer arg regularizers (SmoothPhi,SparsePhi,
SmoothTheta,SparseTheta,Decorrelation)
--threads arg (=-1) number of concurrent processors
(default: auto-detect)
--async invoke asynchronous version of the
online algorithm
Output:
--save-model arg save the model to binary file after
processing
--save-batches arg batch folder
--save-dictionary arg filename of dictionary file
--write-model-readable arg output the model in a human-readable
format
--write-dictionary-readable arg output the dictionary in a
human-readable format
--write-predictions arg write prediction in a human-readable
format
--write-class-predictions arg write class prediction in a
human-readable format
--write-scores arg write scores in a human-readable format
--write-vw-corpus arg convert batches into plain text file in
Vowpal Wabbit format
--force force overwrite existing output files
--csv-separator arg (=;) columns separator for
--write-model-readable and
--write-predictions. Use \t or TAB to
indicate tab.
--score-level arg (=2) score level (0, 1, 2, or 3)
--score arg scores (Perplexity, SparsityTheta,
SparsityPhi, TopTokens, ThetaSnippet,
or TopicKernel)
--final-score arg final scores (same as scores)
Other options:
-h [ --help ] display this help message
--rand-seed arg specify seed for random number
generator, use system timer when not
specified
--guid-batch-name applies to save-batches and indicates
that batch names should be guids (not
sequential codes)
--response-file arg response file
--paused start paused and wait for a keystroke
(allows attaching a debugger)
--disk-cache-folder arg disk cache folder
--disable-avx-opt disable AVX optimization (gives similar
behavior of the Processor component to
BigARTM v0.5.4)
--time-limit arg (=0) limit execution time in milliseconds
--log-dir arg target directory for logging
(GLOG_log_dir)
--log-level arg min logging level (GLOG_minloglevel;
INFO=0, WARNING=1, ERROR=2, and
FATAL=3)
Examples:
* Download input data:
wget https://s3-eu-west-1.amazonaws.com/artm/docword.kos.txt
wget https://s3-eu-west-1.amazonaws.com/artm/vocab.kos.txt
wget https://s3-eu-west-1.amazonaws.com/artm/vw.mmro.txt
wget https://s3-eu-west-1.amazonaws.com/artm/vw.wiki-enru.txt.zip
* Parse docword and vocab files from UCI bag-of-words format; then fit topic model with 20 topics:
bigartm -d docword.kos.txt -v vocab.kos.txt -t 20 --num-collection-passes 10
* Parse VW format; then save the resulting batches and dictionary:
bigartm --read-vw-corpus vw.mmro.txt --save-batches mmro_batches --save-dictionary mmro.dict
* Parse VW format from standard input; note usage of single dash '-' after --read-vw-corpus:
cat vw.mmro.txt | bigartm --read-vw-corpus - --save-batches mmro2_batches --save-dictionary mmro2.dict
* Re-save batches back into VW format:
bigartm --use-batches mmro_batches --write-vw-corpus vw.mmro.txt
* Parse only specific modalities from VW file, and save them as a new VW file:
bigartm --read-vw-corpus vw.wiki-enru.txt --use-modality @russian --write-vw-corpus vw.wiki-ru.txt
* Load and filter the dictionary on document frequency; save the result into a new file:
bigartm --use-dictionary mmro.dict --dictionary-min-df 5 --dictionary-max-df 40% --save-dictionary mmro-filter.dict
* Load the dictionary and export it in a human-readable format:
bigartm --use-dictionary mmro.dict --write-dictionary-readable mmro.dict.txt
* Use batches to fit a model with 20 topics; then save the model in a binary format:
bigartm --use-batches mmro_batches --num-collection-passes 10 -t 20 --save-model mmro.model
* Load the model and export it in a human-readable format:
bigartm --load-model mmro.model --write-model-readable mmro.model.txt
* Load the model and use it to generate predictions:
bigartm --read-vw-corpus vw.mmro.txt --load-model mmro.model --write-predictions mmro.predict.txt
* Fit model with two modalities (@default_class and @target), and use it to predict @target label:
bigartm --use-batches <batches> --use-modality @default_class,@target --topics 50 --num-collection-passes 10 --save-model model.bin
bigartm --use-batches <batches> --use-modality @default_class,@target --topics 50 --load-model model.bin
--write-predictions pred.txt --csv-separator=tab
--predict-class @target --write-class-predictions pred_class.txt --score ClassPrecision
* Fit simple regularized model (increase sparsity up to 60-70%):
bigartm -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--num-collection-passes 10 --batch-size 50 --topics 20 --write-model-readable model.txt
--regularizer "0.05 SparsePhi" "0.05 SparseTheta"
* Fit more advanced regularized model, with 10 sparse objective topics, and 2 smooth background topics:
bigartm -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--num-collection-passes 10 --batch-size 50 --topics "obj:10;background:2" --write-model-readable model.txt
--regularizer "0.05 SparsePhi #obj"
--regularizer "0.05 SparseTheta #obj"
--regularizer "0.25 SmoothPhi #background"
--regularizer "0.25 SmoothTheta #background"
* Upgrade batches in the old format (from folder 'old_folder' into 'new_folder'):
bigartm --use-batches old_folder --save-batches new_folder
* Configure logger to output into stderr:
set GLOG_logtostderr=1 & bigartm -d docword.kos.txt -v vocab.kos.txt -t 20 --num-collection-passes 10
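The examples in the help text are wrapped for display, and the last one uses Windows cmd syntax. On a POSIX shell, the advanced regularized example combined with stderr logging might look like the sketch below. It assumes the kos sample files are in the current directory, uses the hyphenated option spellings from the option list, and wraps everything in a hypothetical function `run_advanced` so that nothing runs unless bigartm and the input files are present.

```shell
# POSIX-shell sketch of the advanced regularized example.
# - Backslashes continue the command across lines.
# - "obj:10;background:2" is quoted because an unquoted ';' ends a shell command.
# - GLOG_logtostderr=1 is set for this single invocation only.
run_advanced() {
    GLOG_logtostderr=1 bigartm \
        -d docword.kos.txt -v vocab.kos.txt \
        --dictionary-max-df 50% --dictionary-min-df 2 \
        --num-collection-passes 10 --batch-size 50 \
        --topics "obj:10;background:2" \
        --write-model-readable model.txt \
        --regularizer "0.05 SparsePhi #obj" \
        --regularizer "0.05 SparseTheta #obj" \
        --regularizer "0.25 SmoothPhi #background" \
        --regularizer "0.25 SmoothTheta #background"
}

# Run only when the tool and the sample inputs are actually available.
if command -v bigartm >/dev/null 2>&1 && [ -f docword.kos.txt ] && [ -f vocab.kos.txt ]; then
    run_advanced
else
    echo "bigartm or the kos sample files are missing; skipping" >&2
fi
```

Prefixing the command with `GLOG_logtostderr=1` keeps the setting local to that call, whereas the cmd form with `set` changes the variable for the rest of the session.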
Additional information about regularizers:
>bigartm.exe --regularizer --help
List of regularizers available in BigARTM CLI:
--regularizer "tau SmoothTheta #topics"
--regularizer "tau SparseTheta #topics"
--regularizer "tau SmoothPhi #topics @class_ids !dictionary"
--regularizer "tau SparsePhi #topics @class_ids !dictionary"
--regularizer "tau Decorrelation #topics @class_ids"
--regularizer "tau TopicSelection #topics"
--regularizer "tau LabelRegularization #topics @class_ids !dictionary"
--regularizer "tau ImproveCoherence #topics @class_ids !dictionary"
--regularizer "tau Biterms #topics @class_ids !dictionary"
List of regularizers available in BigARTM, but not exposed in CLI:
--regularizer "tau SpecifiedSparsePhi"
--regularizer "tau SmoothPtdw"
--regularizer "tau HierarchySparsingTheta"
If you would like to see any of these regularizers in the BigARTM CLI, please send a message to
bigartm-users@googlegroups.com.
By default, all regularizers act on the full set of topics and modalities.
To limit a regularizer to a specific set of topics, use the hash sign (#), followed by
a list of topics (for example, #topic1;topic2) or topic groups (#obj).
Similarly, to limit a regularizer to a specific set of class ids, use the at sign (@),
followed by the list of class ids (for example, @default_class).
Some regularizers accept a dictionary. To specify the dictionary, use the exclamation mark (!),
followed by the path to the dictionary (a .dict file in your file system).
Depending on the regularizer, the dictionary can be either optional or required.
Some regularizers expect a dictionary with tokens and their frequencies;
other regularizers expect a dictionary with token co-occurrences.
For more information about regularizers, refer to the wiki page:
https://github.com/bigartm/bigartm/wiki/Implemented-regularizers
To get full help run `bigartm --help` without --regularizer switch.
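The `tau Name #topics @class_ids !dictionary` pattern described above can be assembled mechanically. The sketch below defines `make_regularizer`, a hypothetical helper that is not part of BigARTM. It takes the names without their sigils and adds `#`, `@`, and `!` in the order shown in the list of regularizers. It handles a single class id only, since the separator for several class ids is not stated here.

```shell
# Hypothetical helper, not part of BigARTM: print one --regularizer value.
# Usage: make_regularizer TAU NAME [TOPICS] [CLASS_ID] [DICTIONARY]
# Pass "" for an optional field to skip it.
make_regularizer() {
    spec="$1 $2"
    # Topic names separated by ';', or a topic group name.
    if [ -n "${3:-}" ]; then spec="$spec #$3"; fi
    # Class id given without the leading '@'.
    if [ -n "${4:-}" ]; then spec="$spec @$4"; fi
    # Path to a .dict file; '!' is single-quoted so interactive bash
    # does not treat it as a history expansion.
    if [ -n "${5:-}" ]; then spec="$spec"' !'"$5"; fi
    printf '%s\n' "$spec"
}

make_regularizer 0.05 SparsePhi obj
make_regularizer 0.25 SmoothPhi "topic1;topic2" default_class mmro.dict
make_regularizer 0.1 Decorrelation "" default_class
```

The three calls print `0.05 SparsePhi #obj`, `0.25 SmoothPhi #topic1;topic2 @default_class !mmro.dict`, and `0.1 Decorrelation @default_class`. A value can be passed on as `--regularizer "$(make_regularizer 0.05 SparsePhi obj)"`. When typing a regularizer with a dictionary directly into an interactive bash session, single quotes around the whole value avoid history expansion of the `!`.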