Tutorial

This tutorial provides a basic Python programmer’s introduction to BigARTM. It demonstrates how to

install the BigARTM library on your computer,
configure basic BigARTM parameters,
load a text collection into BigARTM,
infer a topic model and retrieve the results.

Installation on Windows

Download and install Python 2.7 (https://www.python.org/downloads/).
Download and unpack the latest BigARTM release (https://github.com/bigartm/bigartm/releases). Choose carefully between the win32 and x64 versions: the BigARTM package must match the version of Python (32-bit or 64-bit) installed on your machine.
Add BigARTM to your PATH and PYTHONPATH system variables as follows:
```
set PATH=%PATH%;C:\BigARTM\bin
set PYTHONPATH=%PYTHONPATH%;C:\BigARTM\Python
```
Remember to change C:\BigARTM if you unpacked BigARTM to a different location.
Set up the Google Protocol Buffers library, which is included in the BigARTM release package. To do so, follow the instructions in protobuf/python/README.
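
Before choosing between the win32 and x64 packages, it can help to confirm which kind of interpreter you are running. The following is a minimal sketch that uses only the Python standard library; the helper name `python_bitness` is hypothetical and is not part of BigARTM. Presumably a 32-bit interpreter pairs with the win32 package and a 64-bit interpreter with the x64 package.

```python
import struct
import sys


def python_bitness():
    """Return 32 or 64, based on the pointer size of the running interpreter."""
    return struct.calcsize("P") * 8


# Report the interpreter version together with its bitness.
print("Python %d.%d, %d-bit" % (sys.version_info[0], sys.version_info[1], python_bitness()))
```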

The BigARTM package will contain the following files:

`bin/`	Precompiled binaries of BigARTM for Windows. This folder must be added to the `PATH` system variable.
`bin/artm.dll`	Core functionality of the BigARTM library.
`bin/node_controller.exe`	Executable that hosts BigARTM nodes in a distributed setting.
`bin/cpp_client.exe`	Command-line utility for running simple experiments with BigARTM. Not all BigARTM features are available through cpp_client, but it is a good starting point for learning the basic functionality. For further details refer to BigARTM command line utility.
`protobuf/`	A minimalistic version of the Google Protocol Buffers (https://code.google.com/p/protobuf/) library, required to run BigARTM from Python. To set up this package, follow the instructions in the `protobuf/python/README` file.
`python/artm/`	Python programming interface to the BigARTM library. This folder must be added to the `PYTHONPATH` system variable.
`library.py`	Implements all classes of the BigARTM Python interface.
`messages_pb2.py`	Contains all protobuf messages that can be transferred in and out of the BigARTM core library. The most common features are exposed through their own API methods, so normally you do not need the Python protobuf messages to operate BigARTM.
`python/examples/`	Python examples of how to use BigARTM: example01_synthetic_collection.py example02_parse_collection.py example03_concurrency.py example04_online_algorithm.py example05_train_and_test_stream.py example06_use_dictionaries.py example07_master_component_proxy.py example08_network_modus_operandi.py Files `docword.kos.txt` and `vocab.kos.txt` represent a simple collection of text files in Bag-Of-Words format. The files are taken from UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Bag+of+Words).
`src/`	Several programming interfaces to the BigARTM library.
`src/c_interface.h`	Low-level BigARTM interface in C.
`cpp_interface.h,cc`	C++ interface of BigARTM.
`messages.pb.h,cc`	Protobuf messages for the C++ interface.
`messages.proto`	Protobuf description of all messages that appear in the BigARTM API. Documented here.
`LICENSE`	License file of BigARTM.

Installation on Linux

Currently there is no distribution package of BigARTM for Linux. BigARTM has been tested on several Linux distributions and is known to work well, but you have to get the source code and compile it locally on your machine. Please refer to the Developer’s Guide for further instructions.

To get a live usage example of BigARTM you may check BigARTM’s .travis.yml script and the latest continuous integration build.

Intel Math Kernel Library

BigARTM can utilize Intel Math Kernel Library to achieve better performance.

To enable MKL on Windows, add the path to the MKL library to your PATH system variable:

```
set PATH=%PATH%;"C:\Program Files (x86)\Intel\Composer XE 2013 SP1\redist\intel64\mkl"
```

To enable MKL on Linux, create a new environment variable MKL_PATH and set it as follows:

```
export MKL_PATH="/opt/intel/mkl/lib/intel64/"
```

First steps

Run the example02_parse_collection.py script from the BigARTM distribution. It loads a text collection from disk and uses iterative scans over the collection to infer a topic model. It then outputs the top words in each topic and the topic classification of some random documents. Running the script produces the following output:

```
>python example02_parse_collection.py

 No batches found, parsing them from textual collection...  OK.
 Iter#0 : Perplexity = 6921.336 , Phi sparsity = 0.046  , Theta sparsity = 0.050
 Iter#1 : Perplexity = 2538.800 , Phi sparsity = 0.101  , Theta sparsity = 0.082
 Iter#2 : Perplexity = 2208.745 , Phi sparsity = 0.173  , Theta sparsity = 0.156
 Iter#3 : Perplexity = 1953.304 , Phi sparsity = 0.259  , Theta sparsity = 0.229
 Iter#4 : Perplexity = 1776.102 , Phi sparsity = 0.337  , Theta sparsity = 0.296
 Iter#5 : Perplexity = 1693.438 , Phi sparsity = 0.395  , Theta sparsity = 0.322
 Iter#6 : Perplexity = 1650.383 , Phi sparsity = 0.442  , Theta sparsity = 0.334
 Iter#7 : Perplexity = 1624.210 , Phi sparsity = 0.478  , Theta sparsity = 0.341

 Top tokens per topic:
 Topic#1: democratic campaign dean poll general edwards party voters john republicans
 Topic#2: iraq administration war white bushs officials time people attacks news
 Topic#3: military iraqi abu iraqis fallujah soldiers truth ghraib army forces
 Topic#4: state republican race elections district percent gop election candidate house
 Topic#5: planned soldier cities heart stolen city husband christopher view amp
 Topic#6: cheney debate union politics unions local endorsement space black labor
 Topic#7: president war states united years government jobs tax people health
 Topic#8: delay law court texas committee ballot donors investigation records federal
 Topic#9: november electoral account governor polls republicans senate vote poll contact

 Snippet of theta matrix:
 Item#1: 0.054 0.108  0.017  0.282  0.000  0.000  0.528  0.000  0.011
 Item#2: 0.174 0.060  0.686  0.000  0.000  0.000  0.000  0.081  0.000
 Item#3: 0.000 0.000  0.000  0.000  0.117  0.000  0.000  0.000  0.883
 Item#4: 0.225 0.128  0.058  0.078  0.012  0.455  0.010  0.027  0.008
 Item#5: 0.455 0.145  0.083  0.124  0.009  0.031  0.136  0.017  0.000
 Item#6: 0.455 0.000  0.000  0.518  0.027  0.000  0.000  0.000  0.000
 Item#7: 0.573 0.023  0.341  0.041  0.000  0.000  0.012  0.000  0.010
 Item#8: 0.759 0.000  0.229  0.013  0.000  0.000  0.000  0.000  0.000
 Item#9: 0.258 0.000  0.070  0.453  0.000  0.000  0.218  0.000  0.000
```
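
Each row of the theta snippet can be read as one document's distribution over topics, so the largest entry in a row suggests that document's dominant topic. The sketch below illustrates this on three rows copied from the output above. It assumes that the k-th column corresponds to Topic#k in the list of top tokens; the helper name `dominant_topic` is hypothetical and is not part of BigARTM.

```python
# Rows copied from the "Snippet of theta matrix" output above.
theta_snippet = {
    "Item#1": [0.054, 0.108, 0.017, 0.282, 0.000, 0.000, 0.528, 0.000, 0.011],
    "Item#3": [0.000, 0.000, 0.000, 0.000, 0.117, 0.000, 0.000, 0.000, 0.883],
    "Item#8": [0.759, 0.000, 0.229, 0.013, 0.000, 0.000, 0.000, 0.000, 0.000],
}


def dominant_topic(weights):
    """Return (1-based topic number, weight) of the largest entry in a theta row."""
    index = max(range(len(weights)), key=lambda t: weights[t])
    return index + 1, weights[index]


for item in sorted(theta_snippet):
    row = theta_snippet[item]
    topic, weight = dominant_topic(row)
    # The row sum should be close to 1, up to rounding in the printed output.
    print("%s: dominant Topic#%d (weight %.3f), row sum %.3f"
          % (item, topic, weight, sum(row)))
```

Under that assumption, Item#1 leans towards Topic#7 and Item#3 towards Topic#9.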

Parse collection

The following Python script parses the docword.kos.txt and vocab.kos.txt files and converts them into a set of binary-serialized batches stored on disk. In addition, the script creates a dictionary of all unique tokens in the collection and stores it on disk. The script also detects whether it has already been executed; in that case it just loads the dictionary and saves it in the unique_tokens variable.

The same logic is implemented in the ParseCollectionOrLoadDictionary helper method.

```
import glob
import os
import sys

import artm.library
import artm.messages_pb2

# Optional first argument: folder that contains the docword and vocab files.
data_folder = sys.argv[1] if (len(sys.argv) >= 2) else ''
target_folder = 'kos'
collection_name = 'kos'

batches_found = len(glob.glob(target_folder + "/*.batch"))
if batches_found == 0:
  print "No batches found, parsing them from textual collection...",
  parser_config = artm.messages_pb2.CollectionParserConfig()
  parser_config.format = artm.library.CollectionParserConfig_Format_BagOfWordsUci

  # os.path.join also works when data_folder lacks a trailing separator.
  parser_config.docword_file_path = os.path.join(data_folder, 'docword.' + collection_name + '.txt')
  parser_config.vocab_file_path = os.path.join(data_folder, 'vocab.' + collection_name + '.txt')
  parser_config.target_folder = target_folder
  parser_config.dictionary_file_name = 'dictionary'
  unique_tokens = artm.library.Library().ParseCollection(parser_config)
  print " OK."
else:
  print "Found " + str(batches_found) + " batches, using them."
  unique_tokens = artm.library.Library().LoadDictionary(target_folder + '/dictionary')
```
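
To make the input format more concrete, here is a minimal sketch of reading a collection in the UCI Bag-of-Words layout with plain Python. It assumes the usual layout of that format: the docword file starts with three header lines (number of documents, vocabulary size, number of non-zero entries) followed by `docID wordID count` triples with 1-based ids, and the vocab file lists one token per line. The function `parse_uci` and the tiny sample data are hypothetical; this is not how BigARTM's own parser is implemented.

```python
def parse_uci(docword_lines, vocab_lines):
    """Return (vocab, docs), where docs[doc_id] maps token -> count."""
    vocab = [line.strip() for line in vocab_lines if line.strip()]
    rows = [line.split() for line in docword_lines if line.strip()]
    # The first three rows hold D (documents), W (vocabulary size), NNZ (triples).
    doc_count, word_count, nnz = (int(row[0]) for row in rows[:3])
    if len(vocab) != word_count or len(rows) - 3 != nnz:
        raise ValueError("header does not match the file contents")
    docs = {}
    for doc_id, word_id, count in rows[3:]:
        token = vocab[int(word_id) - 1]  # ids in the file are 1-based
        docs.setdefault(int(doc_id), {})[token] = int(count)
    if len(docs) > doc_count:
        raise ValueError("more documents than declared in the header")
    return vocab, docs


# Tiny made-up collection, not taken from docword.kos.txt.
sample_docword = ["2", "3", "4", "1 1 2", "1 3 1", "2 2 5", "2 3 1"]
sample_vocab = ["campaign", "iraq", "vote"]

vocab, docs = parse_uci(sample_docword, sample_vocab)
print(docs)
```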

You may also download larger collections from the following links. For each collection you can get either the original files (docword file and vocab file) or precompiled batches and a dictionary.

Task	Source	#Words	#Items	Files
kos	UCI	6906	3430	docword.kos.txt.gz (1 MB) vocab.kos.txt (54 KB) kos_1k (700 KB) kos_dictionary
nips	UCI	12419	1500	docword.nips.txt.gz (2.1 MB) vocab.nips.txt (98 KB) nips_200 (1.5 MB) nips_dictionary
enron	UCI	28102	39861	docword.enron.txt.gz (11.7 MB) vocab.enron.txt (230 KB) enron_1k (7.1 MB) enron_dictionary
nytimes	UCI	102660	300000	docword.nytimes.txt.gz (223 MB) vocab.nytimes.txt (1.2 MB) nytimes_1k (131 MB) nytimes_dictionary
pubmed	UCI	141043	8200000	docword.pubmed.txt.gz (1.7 GB) vocab.pubmed.txt (1.3 MB) pubmed_10k (1 GB) pubmed_dictionary
wiki	Gensim	100000	3665223	wiki_10k (1.1 GB) wiki_dictionary

MasterComponent

The master component is your main entry point to all BigARTM functionality. The following script creates a master component and configures it with several regularizers and score calculators.

```
with artm.library.MasterComponent(disk_path = target_folder) as master:
  perplexity_score     = master.CreatePerplexityScore()
  sparsity_theta_score = master.CreateSparsityThetaScore()
  sparsity_phi_score   = master.CreateSparsityPhiScore()
  top_tokens_score     = master.CreateTopTokensScore()
  theta_snippet_score  = master.CreateThetaSnippetScore()

  dirichlet_theta_reg  = master.CreateDirichletThetaRegularizer()
  dirichlet_phi_reg    = master.CreateDirichletPhiRegularizer()
  decorrelator_reg     = master.CreateDecorrelatorPhiRegularizer()
```

The master component must be configured with a disk path, which should contain the set of batches produced in the previous step of this tutorial.

Score calculators allow you to retrieve important quality measures for your topic model. Perplexity, sparsity of the theta and phi matrices, and lists of the tokens with the highest probability within each topic are all examples of such scores. By default BigARTM does not calculate any scores, so you have to create them in the master component. The same is true for regularizers, which allow you to customize your topic model.

For further details about master component refer to MasterComponentConfig.
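
To give a feeling for what the perplexity and sparsity scores measure, the following sketch computes both on small hand-made matrices. It assumes the textbook definitions: perplexity is the exponent of the negative average log-likelihood per token, and sparsity is the fraction of zero entries in a matrix. The function names and the data are hypothetical, and BigARTM's score calculators may differ in details such as zero thresholds.

```python
import math


def perplexity(docs, phi, theta):
    """docs[d] maps word id -> count; phi[t][w] = p(w|t); theta[d][t] = p(t|d)."""
    log_likelihood = 0.0
    total_tokens = 0
    for d, doc in enumerate(docs):
        for w, count in doc.items():
            # Probability of word w in document d under the topic model.
            p_wd = sum(phi[t][w] * theta[d][t] for t in range(len(phi)))
            log_likelihood += count * math.log(p_wd)
            total_tokens += count
    return math.exp(-log_likelihood / total_tokens)


def sparsity(matrix):
    """Fraction of matrix entries that are exactly zero."""
    cells = [value for row in matrix for value in row]
    return sum(1 for value in cells if value == 0) / float(len(cells))


# Two topics, both uniform over a four-word vocabulary, so p(w|d) = 0.25 everywhere.
phi = [[0.25] * 4, [0.25] * 4]
theta = [[0.5, 0.5], [1.0, 0.0]]
docs = [{0: 2, 1: 1}, {2: 3}]

print("Perplexity = %.3f" % perplexity(docs, phi, theta))
print("Theta sparsity = %.3f" % sparsity(theta))
```

A model that spreads probability uniformly over four words has perplexity 4, which is why lower perplexity values indicate a model that predicts the collection better.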

Configure Topic Model

The topic model configuration defines the number of topics in the model, the list of scores to be calculated, and the list of regularizers to apply to the model. For further details about model configuration refer to ModelConfig.

```
model = master.CreateModel(topics_count = 10, inner_iterations_count = 10)
model.EnableScore(perplexity_score)
model.EnableScore(sparsity_phi_score)
model.EnableScore(sparsity_theta_score)
model.EnableScore(top_tokens_score)
model.EnableScore(theta_snippet_score)
model.EnableRegularizer(dirichlet_theta_reg, -0.1)
model.EnableRegularizer(dirichlet_phi_reg, -0.2)
model.EnableRegularizer(decorrelator_reg, 1000000)
model.Initialize(unique_tokens)    # Setup initial approximation for Phi matrix.
```

Note that the last step configures the initial approximation of the Phi matrix. This step is optional: BigARTM is able to collect all tokens dynamically during the first scan of the collection. However, a deterministic initial approximation helps to reproduce the same results from run to run.
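
The Dirichlet regularizers above are enabled with negative coefficients. In the ARTM formulation, as far as we understand it, such a regularizer shifts the expected counters by a constant before normalization, so a negative coefficient pushes small values to zero and makes the matrix sparser, while a positive one smooths it. The sketch below illustrates this idea on a single made-up topic, assuming the update rule phi_wt ∝ max(n_wt + tau, 0). The function name is hypothetical, and the actual BigARTM implementation may differ.

```python
def regularized_phi_column(counts, tau):
    """Shift the expected counts by tau, clip at zero, and renormalize."""
    shifted = [max(n + tau, 0.0) for n in counts]
    total = sum(shifted)
    if total == 0:
        return shifted  # every entry was clipped to zero
    return [value / total for value in shifted]


counts = [5.0, 0.1, 2.0]  # made-up n_wt values for one topic

print(regularized_phi_column(counts, -0.2))  # negative tau: the rare token drops to zero
print(regularized_phi_column(counts, 0.2))   # positive tau: all tokens stay positive
```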

Invoke Iterations

The following script performs several scans over the set of batches. Depending on the size of the collection, this step might be quite time-consuming, so it is a good idea to output some information after every scan.

```
for iteration in range(0, 8):      # "iteration" avoids shadowing the built-in iter().
  master.InvokeIteration(1)        # Invoke one scan of the entire collection...
  master.WaitIdle()                # and wait until it completes.
  model.Synchronize()              # Synchronize topic model.
  print "Iter#" + str(iteration),
  print ": Perplexity = %.3f" % perplexity_score.GetValue(model).value,
  print ", Phi sparsity = %.3f" % sparsity_phi_score.GetValue(model).value,
  print ", Theta sparsity = %.3f" % sparsity_theta_score.GetValue(model).value
```
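
To see why perplexity is expected to go down from scan to scan, it may help to look at a toy version of the underlying procedure. The sketch below runs a plain PLSA EM algorithm (no regularizers) on a tiny made-up collection and records the perplexity at the start of every iteration. It is an illustration written for this tutorial, not BigARTM's implementation; the function names and data are hypothetical.

```python
import math
import random


def normalize(values):
    total = sum(values)
    return [value / total for value in values]


def plsa_em(docs, vocab_size, topics_count, iterations, seed=0):
    """docs[d] maps word id -> count. Returns (phi, theta, perplexities)."""
    rng = random.Random(seed)
    # phi[t][w] = p(w|t), theta[d][t] = p(t|d); strictly positive random start.
    phi = [normalize([rng.random() + 0.1 for _ in range(vocab_size)])
           for _ in range(topics_count)]
    theta = [normalize([rng.random() + 0.1 for _ in range(topics_count)])
             for _ in docs]
    total_tokens = sum(sum(doc.values()) for doc in docs)
    perplexities = []
    for _ in range(iterations):
        n_wt = [[0.0] * vocab_size for _ in range(topics_count)]
        n_td = [[0.0] * topics_count for _ in docs]
        log_likelihood = 0.0
        for d, doc in enumerate(docs):
            for w, count in doc.items():
                # E-step: split each word occurrence between the topics.
                joint = [phi[t][w] * theta[d][t] for t in range(topics_count)]
                p_wd = sum(joint)
                log_likelihood += count * math.log(p_wd)
                for t in range(topics_count):
                    share = count * joint[t] / p_wd
                    n_wt[t][w] += share
                    n_td[d][t] += share
        perplexities.append(math.exp(-log_likelihood / total_tokens))
        # M-step: renormalize the expected counters.
        phi = [normalize(row) for row in n_wt]
        theta = [normalize(row) for row in n_td]
    return phi, theta, perplexities


# Made-up collection: two documents use words 0-2, the other two use words 3-5.
docs = [{0: 4, 1: 3, 2: 2}, {0: 3, 1: 4}, {3: 5, 4: 2, 5: 3}, {4: 4, 5: 4}]
phi, theta, perplexities = plsa_em(docs, vocab_size=6, topics_count=2, iterations=8)
for iteration, value in enumerate(perplexities):
    print("Iter#%d : Perplexity = %.3f" % (iteration, value))
```

Because every EM step cannot decrease the likelihood, the printed perplexity values form a non-increasing sequence, which mirrors the trend in the BigARTM output shown earlier.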

If your collection is very large, you may want to use the online algorithm, which updates the topic model several times during each iteration, as demonstrated by the following script:

```
master.InvokeIteration(1)        # Invoke one scan of the entire collection...
while True:
  done = master.WaitIdle(100)    # Wait 100 ms.
  model.Synchronize(0.9)         # Decay weights in the current topic model by 0.9,
                                 # append all increments and invoke all regularizers.
  if done:
    break
```
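
The decay-and-append step described in the comments above can be pictured with a few lines of plain Python. This is only a sketch of the arithmetic, assuming the topic model is stored as a matrix of counters and each synchronization receives a matrix of increments of the same shape; regularizers are left out. The function `synchronize` and the numbers are hypothetical and do not reflect BigARTM internals.

```python
def synchronize(n_wt, increments, decay):
    """Decay the current counters, then add the increments collected so far."""
    return [[decay * old + new for old, new in zip(old_row, new_row)]
            for old_row, new_row in zip(n_wt, increments)]


n_wt = [[10.0, 0.0], [0.0, 10.0]]  # made-up counters: 2 topics x 2 tokens
updates = [
    [[1.0, 2.0], [0.0, 3.0]],      # increments gathered before the first sync
    [[4.0, 0.0], [1.0, 1.0]],      # increments gathered before the second sync
]
for increments in updates:
    n_wt = synchronize(n_wt, increments, decay=0.9)
print(n_wt)
```

With a decay below 1, older batches gradually lose influence, so the model can follow the most recently processed part of a large collection.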

Retrieve and visualize scores

Finally, you will want to retrieve and visualize all collected scores.

```
artm.library.Visualizers.PrintTopTokensScore(top_tokens_score.GetValue(model))
artm.library.Visualizers.PrintThetaSnippetScore(theta_snippet_score.GetValue(model))
```
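
For intuition about what a top-tokens listing contains, the sketch below picks the highest-probability tokens of each topic from a small hand-made Phi matrix. It assumes Phi is available as a list of per-topic probability rows aligned with a vocabulary list; the helper `top_tokens` and the data are hypothetical and unrelated to the actual score objects returned by BigARTM.

```python
def top_tokens(phi_row, vocab, count):
    """Return the `count` tokens with the highest probability in one topic."""
    ranked = sorted(range(len(vocab)), key=lambda w: phi_row[w], reverse=True)
    return [vocab[w] for w in ranked[:count]]


vocab = ["iraq", "war", "vote", "poll", "tax"]  # made-up vocabulary
phi = [
    [0.40, 0.35, 0.05, 0.05, 0.15],             # made-up p(w|t) for topic 1
    [0.02, 0.03, 0.50, 0.40, 0.05],             # made-up p(w|t) for topic 2
]

for t, row in enumerate(phi):
    print("Topic#%d: %s" % (t + 1, " ".join(top_tokens(row, vocab, 3))))
```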