BigARTM as a Service

The following diagram shows a suggested topology for a query service that involves topic modelling on Big Data.

[Figure: cloud_service]

Here the main use of Hadoop / MapReduce is to process your Big Unstructured Data into a compact bag-of-words representation. Thanks to its out-of-core design and high performance, BigARTM will be able to handle this data on a single compute-optimized node. The resulting topic model should be replicated to all query instances that serve user requests.
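As a rough illustration of the bag-of-words step, the sketch below turns one raw document into word counts. It is not BigARTM or Hadoop code: the function name bag_of_words and the tokenization rule (lowercase runs of letters, digits and apostrophes) are assumptions made for this example, and a real pipeline would run equivalent logic inside a MapReduce job.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return word counts for a single document.

    Hypothetical helper: the tokenization rule below is an assumption
    chosen for illustration, not something BigARTM prescribes.
    """
    # Lowercase the text and keep runs of letters, digits and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Counter maps each distinct word to its number of occurrences.
    return Counter(tokens)


counts = bag_of_words("The cat saw the other cat.")
print(counts["cat"], counts["the"], counts["saw"])
```

Each document then reduces to a small word-to-count mapping, which is the kind of compact representation that can be handed to a single node for model training.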

To avoid a query-time dependency on the BigARTM component, you may want to infer the topic distributions theta_{td} for new documents in your own code. This can be done as follows. Start from the uniform topic assignment theta_{td} = 1 / |T| and update it in the following loop:

[Equation: theta_update]

where n_dw is the number of occurrences of word w in document d, and phi_wt is an element of the Phi matrix. In BigARTM the loop is repeated ModelConfig.inner_iterations_count times (defaults to 10). To precisely replicate BigARTM behavior, one needs to account for class weights and include regularizers. Please contact us if you need more details.
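A minimal, dependency-free sketch of such a loop is given below. It assumes the update is the standard EM step for PLSA, theta_{td} <- (1 / n_d) * sum_w n_dw * phi_wt * theta_{td} / (sum_s phi_ws * theta_{sd}), where n_d is the total word count of document d. This is an assumption based on the definitions above, not a copy of BigARTM's implementation. The function name infer_theta and the dictionary layout of Phi are hypothetical, and class weights and regularizers are deliberately left out.

```python
def infer_theta(n_dw, phi, num_iterations=10):
    """Infer the topic distribution theta_d for one document.

    n_dw: dict mapping word -> number of occurrences in the document.
    phi:  dict mapping word -> list of |T| values phi_wt (assumed non-empty).
    num_iterations: loop count; 10 mirrors the default quoted in the text.
    Assumes a plain PLSA update, with no class weights or regularizers.
    """
    num_topics = len(next(iter(phi.values())))
    # Start from the uniform assignment theta_td = 1 / |T|.
    theta = [1.0 / num_topics] * num_topics
    # Words missing from the model carry no information and are skipped.
    rows = [(phi[w], n) for w, n in n_dw.items() if w in phi and n > 0]

    for _ in range(num_iterations):
        new_theta = [0.0] * num_topics
        for phi_w, n in rows:
            # p(w | d) = sum_s phi_ws * theta_sd
            p_dw = sum(p * th for p, th in zip(phi_w, theta))
            if p_dw <= 0.0:
                continue
            for t in range(num_topics):
                new_theta[t] += n * phi_w[t] * theta[t] / p_dw
        total = sum(new_theta)
        if total <= 0.0:
            break  # no usable words: keep the current estimate
        # Dividing by the sum equals dividing by n_d when every p_dw > 0.
        theta = [x / total for x in new_theta]
    return theta


# Toy model with two topics and two words (made-up numbers).
toy_phi = {"cat": [0.9, 0.1], "dog": [0.1, 0.9]}
print(infer_theta({"cat": 10}, toy_phi))
```

Because the loop only needs the rows of Phi for the words present in the document, each query instance can keep Phi in a simple key-value structure and answer requests without calling BigARTM.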