Formats¶

This page describes input data formats compatible with BigARTM. Currently all formats correspond to Bag-of-words representation, meaning that all linguistic processing (lemmatization, tokenization, detection of n-grams, etc) needs to be done outside BigARTM.

Vowpal Wabbit is a single-format file, based on the following principles:

each document is depresented in a single line
all tokens are represented as strings (no need to convert them into an integer identifier)
token frequency defaults to 1.0, and can be optionally specified after a colon (:)
namespaces (Batch.class_id) can be identified by a pipe (|)

Example 1

doc1 Alpha Bravo:10 Charlie:5 |author Ola_Nordmann
doc2 Bravo:5 Delta Echo:3 |author Ivan_Ivanov

Example 2

user123 |track-like track2 track5 track7 |track-play track1:10 track2:25 track3:2 track7:8 |track-skip track2:3 track8:1 |artist-like artist4:2 artist5:6 |artist-play artist4:100 artist5:20
user345 |track-like track2 track5 track7 |track-play track1:10 track2:25 track3:2 track7:8 |track-skip track2:3 track8:1 |artist-like artist4:2 artist5:6 |artist-play artist4:100 artist5:20

UCI Bag-of-words format consists of two files - vocab.*.txt and docword.*.txt. The format of the docword.*.txt file is 3 header lines, followed by NNZ triples:
```
D
W
NNZ
docID wordID count
docID wordID count
...
docID wordID count
```
The file must be sorted on docID. Values of wordID must be unity-based (not zero-based). The format of the vocab.*.txt file is line containing wordID=n. Note that words must not have spaces or tabs. In vocab.*.txt file it is also possible to specify the namespace (Batch.class_id) for tokens, as it is shown in this example:
```
token1 @default_class
token2 custom_class
token3 @default_class
token4
```
Use space or tab to separate token from its class. Token that are not followed by class label automatically get ''@default_class‘’ as a label (see ‘’token4’’ in the example).

Unicode support. For non-ASCII characters save vocab.*.txt file in UTF-8 format.

Batches (binary BigARTM-specific format).

This is compact and efficient format, based on several protobuf messages in public BigARTM interface (Batch, Item and Field).

A batch is a collection of several items
An item is a collection of several fields
A field is a collection of pairs (token_id, token_weight).

The following example shows a Python code that generates a synthetic batch.

import artm.messages, random, uuid

num_tokens = 60
num_items = 100
batch = artm.messages.Batch()
batch.id = str(uuid.uuid4())
for token_id in range(0, num_tokens):
    batch.token.append('token' + str(token_id))

for item_id in range(0, num_items):
    item = batch.item.add()
    item.id = item_id
    field = item.field.add()
    for token_id in range(0, num_tokens):
        field.token_id.append(token_id)
        background_count = random.randint(1, 5) if (token_id >= 40) else 0
        topical_count = 10 if (token_id < 40) and ((token_id % 10) == (item_id % 10)) else 0
        field.token_weight.append(background_count + topical_count)

Note that the batch has its local dictionary, batch.token. This dictionary which maps token_id into the actual token. In order to create a batch from textual files involve one needs to find all distinct words, and map them into sequential indices.

batch.id must be set to a unique GUID in a format of 00000000-0000-0000-0000-000000000000.