Input Data Formats and Datasets¶
This page describes input data formats compatible with BigARTM. Currently all formats support Bag-of-words representation, meaning that all linguistic processing (lemmatization, tokenization, detection of n-grams, etc) needs to be done outside BigARTM.
Vowpal Wabbit is a single-format file, based on the following principles:
- each document is depresented in a single line
- all tokens are represented as strings (no need to convert them into an integer identifier)
- token frequency defaults to
1.0, and can be optionally specified after a colon (:)
- namespaces (
Batch.class_id) can be identified by a pipe (|)
doc1 Alpha Bravo:10 Charlie:5 |author Ola_Nordmann doc2 Bravo:5 Delta Echo:3 |author Ivan_Ivanov
user123 |track-like track2 track5 track7 |track-play track1:10 track2:25 track3:2 track7:8 |track-skip track2:3 track8:1 |artist-like artist4:2 artist5:6 |artist-play artist4:100 artist5:20 user345 |track-like track2 track5 track7 |track-play track1:10 track2:25 track3:2 track7:8 |track-skip track2:3 track8:1 |artist-like artist4:2 artist5:6 |artist-play artist4:100 artist5:20
- putting tokens in each document in their natural order without specifying token frequencies will lead to model with sequential texts (not Bag-of-words)
doc1 this text will be processed not as bag of words | Some_Author
UCI Bag-of-words format consists of two files -
docword.*.txt. The format of the
docword.*.txtfile is 3 header lines, followed by NNZ triples:
D W NNZ docID wordID count docID wordID count ... docID wordID count
The file must be sorted on docID. Values of wordID must be unity-based (not zero-based). The format of the
vocab.*.txtfile is line containing wordID=n. Note that words must not have spaces or tabs. In
vocab.*.txtfile it is also possible to specify the namespace (
Batch.class_id) for tokens, as it is shown in this example:
token1 @default_class token2 custom_class token3 @default_class token4
Use space or tab to separate token from its class. Token that are not followed by class label automatically get ''@default_class‘’ as a label (see ‘’token4’’ in the example).
Unicode support. For non-ASCII characters save
vocab.*.txtfile in UTF-8 format.
Batches (binary BigARTM-specific format).
- A batch is a collection of several items
- An item is a collection of several fields
- A field is a collection of pairs
The following example shows a Python code that generates a synthetic batch.
import artm.messages, random, uuid num_tokens = 60 num_items = 100 batch = artm.messages.Batch() batch.id = str(uuid.uuid4()) for token_id in range(0, num_tokens): batch.token.append('token' + str(token_id)) for item_id in range(0, num_items): item = batch.item.add() item.id = item_id field = item.field.add() for token_id in range(0, num_tokens): field.token_id.append(token_id) background_count = random.randint(1, 5) if (token_id >= 40) else 0 topical_count = 10 if (token_id < 40) and ((token_id % 10) == (item_id % 10)) else 0 field.token_weight.append(background_count + topical_count)
Note that the batch has its local dictionary,
batch.token. This dictionary which maps
token_idinto the actual token. In order to create a batch from textual files involve one needs to find all distinct words, and map them into sequential indices.
batch.idmust be set to a unique GUID in a format of
Download one of the following datasets to start experimenting with BigARTM. Note that
UCI BOWformat, while