Input Data Formats and Datasets

Formats

This page describes the input data formats compatible with BigARTM. Currently all formats support the Bag-of-words representation, meaning that all linguistic processing (lemmatization, tokenization, detection of n-grams, etc.) needs to be done outside BigARTM.

Vowpal Wabbit is a single-file format, based on the following principles:

- each document is represented on a single line
- all tokens are represented as strings (no need to convert them into integer identifiers)
- token frequency defaults to 1.0, and can optionally be specified after a colon (:)
- namespaces (Batch.class_id) are introduced by a pipe (|)

Example 1

doc1 Alpha Bravo:10 Charlie:5 |author Ola_Nordmann
doc2 Bravo:5 Delta Echo:3 |author Ivan_Ivanov

Example 2

user123 |track-like track2 track5 track7 |track-play track1:10 track2:25 track3:2 track7:8 |track-skip track2:3 track8:1 |artist-like artist4:2 artist5:6 |artist-play artist4:100 artist5:20
user345 |track-like track2 track5 track7 |track-play track1:10 track2:25 track3:2 track7:8 |track-skip track2:3 track8:1 |artist-like artist4:2 artist5:6 |artist-play artist4:100 artist5:20

Note that putting the tokens of each document in their natural order, without specifying token frequencies, leads to a model over sequential texts (not Bag-of-words).

Example 3

doc1 this text will be processed not as bag of words | Some_Author
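
The principles above can be sketched as a small parser. This is a minimal illustration in plain Python, not BigARTM's own parser: the function name `parse_vw_line` is hypothetical, the sketch sums repeated tokens (so it produces a bag-of-words view and discards token order), and it assumes that a bare `|` switches back to the default namespace.

```python
from collections import defaultdict

# Label for tokens that appear outside any named namespace.
DEFAULT_CLASS = "@default_class"


def parse_vw_line(line):
    """Parse one document line into (title, {namespace: {token: weight}})."""
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    title, rest = fields[0], fields[1:]
    namespace = DEFAULT_CLASS
    doc = defaultdict(lambda: defaultdict(float))
    for field in rest:
        if field.startswith("|"):
            # A pipe switches the namespace. Assumption: a bare "|"
            # switches back to the default namespace.
            namespace = field[1:] or DEFAULT_CLASS
            continue
        token, sep, freq = field.rpartition(":")
        weight = 1.0  # frequency defaults to 1.0
        if sep and token:
            try:
                weight = float(freq)
            except ValueError:
                token = field  # no number after the colon: keep the field whole
        else:
            token = field
        doc[namespace][token] += weight
    return title, {ns: dict(tokens) for ns, tokens in doc.items()}


title, doc = parse_vw_line("doc1 Alpha Bravo:10 Charlie:5 |author Ola_Nordmann")
```

For the first line of Example 1, `doc` holds the three default-namespace tokens with weights 1.0, 10.0 and 5.0, plus one token in the `author` namespace.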

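The UCI Bag-of-words format consists of two files: vocab.*.txt and docword.*.txt. The docword.*.txt file starts with 3 header lines (D, the number of documents; W, the number of words in the vocabulary; NNZ, the number of non-zero counts), followed by NNZ triples: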
```
D
W
NNZ
docID wordID count
docID wordID count
...
docID wordID count
```
The file must be sorted on docID. Values of wordID must be one-based (not zero-based). In the vocab.*.txt file, line n contains the word with wordID=n. Note that words must not contain spaces or tabs. In the vocab.*.txt file it is also possible to specify the namespace (Batch.class_id) for tokens, as shown in this example:
```
token1 @default_class
token2 custom_class
token3 @default_class
token4
```
Use a space or a tab to separate a token from its class. Tokens that are not followed by a class label automatically get @default_class as a label (see token4 in the example).
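
The rules for both files can be checked with a short reader. The following is a sketch under stated assumptions rather than BigARTM's implementation: `parse_vocab` and `parse_docword` are hypothetical helper names, counts are assumed to be integers, and both functions take any iterable of text lines (for example an open file).

```python
DEFAULT_CLASS = "@default_class"


def parse_vocab(lines):
    """Return a list of (token, class_id); list index i holds wordID i + 1."""
    vocab = []
    for number, line in enumerate(lines, start=1):
        parts = line.split()  # splits on spaces and tabs
        if len(parts) == 1:
            vocab.append((parts[0], DEFAULT_CLASS))  # no label: default class
        elif len(parts) == 2:
            vocab.append((parts[0], parts[1]))
        else:
            raise ValueError(f"vocab line {number}: expected 'token [class]'")
    return vocab


def parse_docword(lines):
    """Return (D, W, NNZ, triples), checking the constraints described above."""
    it = iter(lines)
    d = int(next(it))
    w = int(next(it))
    nnz = int(next(it))
    triples = []
    prev_doc = 0
    for line in it:
        if not line.strip():
            continue
        doc_id, word_id, count = (int(x) for x in line.split())
        if doc_id < prev_doc:
            raise ValueError("file is not sorted on docID")
        if not 1 <= word_id <= w:
            raise ValueError("wordID must be one-based and at most W")
        triples.append((doc_id, word_id, count))
        prev_doc = doc_id
    if len(triples) != nnz:
        raise ValueError("number of triples does not match NNZ")
    return d, w, nnz, triples
```

A triple `(docID, wordID, count)` can then be resolved to a token with `vocab[wordID - 1]`, since wordID values start at 1.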

Unicode support. For non-ASCII characters, save the vocab.*.txt file in UTF-8 format.

Batches (binary BigARTM-specific format).

This is a compact and efficient format, based on several protobuf messages in the public BigARTM interface (Batch and Item).
- A batch is a collection of several items.
- An item is a collection of pairs (token_id, token_weight).
Note that the batch has its own local dictionary, batch.token, which maps each token_id to the actual token. To create a batch from text files, one needs to find all distinct words and map them to sequential indices.

batch.id must be set to a unique GUID in the format 00000000-0000-0000-0000-000000000000.
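
The mapping from distinct words to sequential indices can be illustrated without protobuf. The sketch below uses a plain Python dict as a stand-in for the Batch message, so it does not produce a file BigARTM can read. The function name `make_batch` and the `"items"` key are hypothetical; `"id"` and `"token"` echo batch.id and batch.token from the text above.

```python
import uuid


def make_batch(documents):
    """Build a dict that mirrors the batch layout described above.

    `documents` is a list of {token: weight} dicts, one per item.
    """
    token_to_id = {}  # local dictionary under construction
    items = []
    for doc in documents:
        pairs = []
        for token, weight in doc.items():
            # The first occurrence of a token gets the next sequential index.
            token_id = token_to_id.setdefault(token, len(token_to_id))
            pairs.append((token_id, weight))
        items.append(pairs)
    return {
        "id": str(uuid.uuid4()),     # GUID in 8-4-4-4-12 form
        "token": list(token_to_id),  # position i holds the token with token_id i
        "items": items,              # each item is a list of (token_id, token_weight)
    }


batch = make_batch([
    {"Alpha": 1.0, "Bravo": 10.0},
    {"Bravo": 5.0, "Delta": 1.0},
])
```

Here `Bravo` appears in both items but is stored once in the local dictionary, and both items refer to it by the same index.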

Datasets

Download one of the following datasets to start experimenting with BigARTM. Note that docword.* and vocab.* files are in the UCI BOW format, while vw.* files are in the Vowpal Wabbit format.

| Task | Source | #Words | #Items | Files |
|---|---|---|---|---|
| kos | UCI | 6906 | 3430 | docword.kos.txt.gz (1 MB), vocab.kos.txt (54 KB) |
| nips | UCI | 12419 | 1500 | docword.nips.txt.gz (2.1 MB), vocab.nips.txt (98 KB) |
| enron | UCI | 28102 | 39861 | docword.enron.txt.gz (11.7 MB), vocab.enron.txt (230 KB) |
| nytimes | UCI | 102660 | 300000 | docword.nytimes.txt.gz (223 MB), vocab.nytimes.txt (1.2 MB) |
| pubmed | UCI | 141043 | 8200000 | docword.pubmed.txt.gz (1.7 GB), vocab.pubmed.txt (1.3 MB) |
| wiki | Gensim | 100000 | 3665223 | vw.wiki-en.txt.zip (1.8 GB) |
| wiki_enru | Wiki | 196749 | 216175 | vw.wiki_enru.txt.zip (285 MB) |
| eurlex | eurlex | 19800 | 21000 | vw.eurlex.txt.zip (13 MB), vw.eurlex-test.txt.zip (13 MB) |
| lastfm | lastfm | | 1k, 360k | vw.lastfm_1k.txt.zip (100 MB), vw.lastfm_360k.txt.zip (330 MB) |
| mmro | mmro | 7805 | 1061 | docword.mmro.txt.gz (500 KB), vocab.mmro.txt (150 KB), pPMI_w100.mmro.txt.7z (23 MB), vw.mmro.txt.7z (1.4 MB) |