Input Data Formats and Datasets

  • Formats

This page describes the input data formats that BigARTM accepts. Currently all formats use a bag-of-words representation, so all linguistic processing (lemmatization, tokenization, detection of n-grams, etc.) must be done outside BigARTM.

  1. Vowpal Wabbit is a single-file format based on the following principles:

    • each document is represented on a single line
    • all tokens are represented as strings (no need to convert them into integer identifiers)
    • token frequency defaults to 1.0, and can optionally be specified after a colon (:)
    • namespaces (Batch.class_id) are introduced by a pipe (|) followed by the namespace name

    Example 1

    doc1 Alpha Bravo:10 Charlie:5 |author Ola_Nordmann
    doc2 Bravo:5 Delta Echo:3 |author Ivan_Ivanov
    

    Example 2

    user123 |track-like track2 track5 track7 |track-play track1:10 track2:25 track3:2 track7:8 |track-skip track2:3 track8:1 |artist-like artist4:2 artist5:6 |artist-play artist4:100 artist5:20
    user345 |track-like track2 track5 track7 |track-play track1:10 track2:25 track3:2 track7:8 |track-skip track2:3 track8:1 |artist-like artist4:2 artist5:6 |artist-play artist4:100 artist5:20
    
    • listing the tokens of each document in their natural order, without specifying token frequencies, leads to a model with sequential texts (not bag-of-words)

    Example 3

    doc1 this text will be processed not as bag of words | Some_Author
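
The rules above can be illustrated with a small parser. This is only a sketch of the format as described on this page, not BigARTM's own parser: the function name parse_vw_line, the DEFAULT_CLASS constant, and the choice to treat a bare pipe as the default namespace are assumptions made for illustration.

```python
from collections import defaultdict

# Assumed label for tokens that appear outside any named namespace.
DEFAULT_CLASS = "@default_class"


def parse_vw_line(line):
    """Split one Vowpal Wabbit line into (title, {class_id: {token: weight}})."""
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    title, rest = fields[0], fields[1:]

    class_id = DEFAULT_CLASS
    counts = defaultdict(dict)
    for field in rest:
        if field.startswith("|"):
            # A pipe switches the namespace for the tokens that follow.
            # Assumption: a bare "|" falls back to the default namespace.
            class_id = field[1:] or DEFAULT_CLASS
            continue
        # The frequency, if any, follows the last colon; default is 1.0.
        token, sep, weight = field.rpartition(":")
        value = 1.0
        if sep and token:
            try:
                value = float(weight)
            except ValueError:
                token = field  # the colon was part of the token itself
        else:
            token = field
        bucket = counts[class_id]
        bucket[token] = bucket.get(token, 0.0) + value
    return title, dict(counts)


if __name__ == "__main__":
    print(parse_vw_line("doc1 Alpha Bravo:10 Charlie:5 |author Ola_Nordmann"))
```

Repeated tokens are summed here, which matches a bag-of-words reading; a sequential-text reading would keep the tokens in a list instead.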
    
  2. UCI Bag-of-words format consists of two files: vocab.*.txt and docword.*.txt. The docword.*.txt file starts with 3 header lines (the number of documents D, the vocabulary size W, and the number of non-zero counts NNZ), followed by NNZ triples:

    D
    W
    NNZ
    docID wordID count
    docID wordID count
    ...
    docID wordID count
    

    The file must be sorted by docID. Values of wordID are one-based (not zero-based). In the vocab.*.txt file, line n contains the token with wordID=n. Note that tokens must not contain spaces or tabs. The vocab.*.txt file can also specify the namespace (Batch.class_id) of each token, as shown in this example:

    token1 @default_class
    token2 custom_class
    token3 @default_class
    token4
    

    Use a space or tab to separate a token from its class. Tokens that are not followed by a class label automatically get @default_class as their label (see token4 in the example).

    Unicode support. For non-ASCII characters, save the vocab.*.txt file in UTF-8 encoding.
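
To make the relationship between the two files concrete, here is a minimal reader for the layout described above. It is a sketch rather than BigARTM code: the function name load_uci and the returned structure are hypothetical, and counts are assumed to be integers.

```python
DEFAULT_CLASS = "@default_class"


def load_uci(docword_lines, vocab_lines):
    """Return (D, {docID: {(token, class_id): count}}) from UCI-style lines."""
    # vocab: line n (one-based) holds the token with wordID=n,
    # optionally followed by its class label.
    vocab = []
    for line in vocab_lines:
        parts = line.split()
        if not parts:
            raise ValueError("blank vocab line would shift wordID values")
        class_id = parts[1] if len(parts) > 1 else DEFAULT_CLASS
        vocab.append((parts[0], class_id))

    # docword: three header lines (D, W, NNZ), then NNZ triples.
    it = iter(docword_lines)
    num_docs, num_words, nnz = (int(next(it)) for _ in range(3))

    docs = {}
    seen = 0
    for line in it:
        if not line.strip():
            continue
        doc_id, word_id, count = (int(x) for x in line.split())
        if not 1 <= word_id <= len(vocab):
            raise ValueError(f"wordID {word_id} is outside the vocabulary")
        docs.setdefault(doc_id, {})[vocab[word_id - 1]] = count
        seen += 1
    if seen != nnz:
        raise ValueError(f"expected {nnz} triples, found {seen}")
    return num_docs, docs


if __name__ == "__main__":
    vocab = ["token1 @default_class", "token2 custom_class", "token4"]
    docword = ["2", "3", "3", "1 1 5", "1 2 1", "2 3 2"]
    print(load_uci(docword, vocab))
```

When reading real files, opening vocab.*.txt with encoding="utf-8" keeps non-ASCII tokens intact.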

  3. Batches (binary BigARTM-specific format).

    This is a compact and efficient format based on several protobuf messages in the public BigARTM interface (Batch and Item).

    • A batch is a collection of several items
    • An item is a collection of pairs (token_id, token_weight).

    Note that each batch has its own local dictionary, batch.token, which maps token_id to the actual token. To create a batch from text files, one needs to find all distinct words and map them to sequential indices.

    batch.id must be set to a unique GUID in the format 00000000-0000-0000-0000-000000000000.
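
The mapping from distinct words to sequential indices can be sketched in plain Python. The dictionary below only mirrors the shape described above (a local token list, items made of (token_id, token_weight) pairs, and a GUID); it is not the protobuf Batch message itself, and the function name build_batch and the dictionary keys are hypothetical.

```python
import uuid


def build_batch(documents):
    """Build a batch-like dict from tokenized documents (lists of strings)."""
    token_to_id = {}  # token -> sequential index within this batch
    tokens = []       # local dictionary: index -> token
    items = []

    for doc in documents:
        weights = {}
        for token in doc:
            # Assign the next free index the first time a token is seen.
            token_id = token_to_id.setdefault(token, len(tokens))
            if token_id == len(tokens):
                tokens.append(token)
            weights[token_id] = weights.get(token_id, 0.0) + 1.0
        # An item is a collection of (token_id, token_weight) pairs.
        items.append(list(weights.items()))

    return {
        # str(uuid.uuid4()) has the 8-4-4-4-12 hexadecimal layout.
        "id": str(uuid.uuid4()),
        "token": tokens,
        "items": items,
    }


if __name__ == "__main__":
    batch = build_batch([["alpha", "bravo", "alpha"], ["bravo", "charlie"]])
    print(batch["token"])
    print(batch["items"])
```

Because the indices are local to the batch, the same token may receive a different token_id in another batch.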