Programming Topic Models

In this post, the fourth in our Topic Model series, we will briefly demonstrate how you can create a Topic Model using the BigML API. As mentioned in our introductory post, Topic Modeling is an unsupervised learning method for discovering the topics underlying a collection of documents. You can also read about the detailed process of creating Topic Models with the BigML Dashboard, and about a real use case predicting the sentiment of movie reviews, in the second and third posts respectively.

The API workflow to create a Topic Model is composed of four steps:

1. Upload your data to create a source.
2. Create a dataset from the source.
3. Create a Topic Model from the dataset.
4. Make predictions (Topic Distributions).

Any resource created with the API will automatically appear in your Dashboard too, so you can take advantage of BigML’s intuitive visualizations at any time. If you have never used the BigML API before, note that all requests to manage your resources must use HTTPS and be authenticated with your username and API key to verify your identity. For instance, here is a base URL example to manage Topic Models.

https://bigml.io/topicmodel?username=$BIGML_USERNAME;api_key=$BIGML_API_KEY

For more details, check the API documentation related to Topic Models here.
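As a quick illustration, the authenticated base URL above can be assembled programmatically. The sketch below assumes the same `BIGML_USERNAME` and `BIGML_API_KEY` environment variables used by the shell examples (`$BIGML_AUTH` expands to the same pair); the helper name is ours, not part of the BigML bindings.

```python
import os

def topic_model_url(username=None, api_key=None):
    """Build the authenticated Topic Model base URL shown above."""
    # Fall back to the environment variables the shell examples rely on.
    username = username or os.environ.get("BIGML_USERNAME", "")
    api_key = api_key or os.environ.get("BIGML_API_KEY", "")
    return f"https://bigml.io/topicmodel?username={username};api_key={api_key}"

print(topic_model_url("alice", "secret"))
# https://bigml.io/topicmodel?username=alice;api_key=secret
```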

1. Upload your Data

Upload your data in your preferred format from a local file, a remote file (using a URL), or from a cloud repository (e.g., AWS, Azure). This will automatically create a source in your BigML account.

To do this, you need to open up a terminal with curl or any other command-line tool that implements standard HTTPS methods. In the example below we are creating a source from a remote file containing almost 110,000 Airbnb reviews of accommodations in Portland, Oregon.

curl "https://bigml.io/source?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"remote":"http://data.insideairbnb.com/united-states/or/portland/2016-07-04/data/reviews.csv.gz"}'

Topic Models only accept text fields as inputs so your source should always contain at least one text field. To find out how Topic Models tokenize and analyze the text in your data, please read our previous post.
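To give a feel for what happens to a text field, here is a toy tokenizer. It is illustrative only and is NOT BigML's actual text analysis (see the previous post for that), but it shows the kind of preprocessing a topic model applies: lowercasing and splitting the text into terms.

```python
import re

def toy_tokenize(text):
    """Toy example: lowercase and split on non-letter characters.
    Not BigML's real analyzer; for illustration only."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

print(toy_tokenize("Lovely hosts, very accommodating!"))
# ['lovely', 'hosts', 'very', 'accommodating']
```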

2. Create a Dataset

After the source is created, you need to build a dataset, which computes basic statistics for your fields and gets them ready for the Machine Learning algorithm to take over.

curl "https://bigml.io/dataset?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"source":"source/68b5627b3c1920186f123478900"}'

3. Create a Topic Model

When your dataset has been created, you need its ID to create your Topic Model. Once again, although you may have many different field types in your dataset, the Topic Model will only use the text fields.

curl "https://bigml.io/topicmodel?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"dataset":"dataset/98b5527c3c1920386a000467"}'

If you don’t want to use all the text fields in your dataset, you can use either the input_fields argument (to indicate which fields to use as inputs) or the excluded_fields argument (to indicate which fields to leave out). In this case, since we don’t want to use the text field that contains the reviewer’s name, we specify only the field containing the review text as input.

curl "https://bigml.io/topicmodel?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"dataset":"dataset/98b5527c3c1920386a000467",
            "input_fields":"comments"}'

Apart from the dataset and the input fields, you can also include additional arguments like the parameters that we explained in our previous post to configure your Topic Model.
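For instance, a configured request body might look like the sketch below. The parameter names (number_of_topics, bigrams, case_sensitive) are taken from the BigML API documentation for Topic Models, but treat them as an assumption and check the docs for the authoritative list and defaults.

```python
import json

# Sketch of a configured Topic Model request body.
payload = {
    "dataset": "dataset/98b5527c3c1920386a000467",
    "input_fields": "comments",
    "number_of_topics": 16,   # how many topics to discover
    "bigrams": True,          # also consider two-word terms
    "case_sensitive": False,  # fold "Clean" and "clean" together
}
body = json.dumps(payload)
print(body)
```

You would then pass this JSON string as the `-d` argument of the curl request, exactly as in the examples above.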

4. Make Predictions

The main goal of creating a Topic Model is to find the topics in your dataset instances. Predictions for Topic Models are called Topic Distributions in BigML since they return a set of probabilities (one per topic) for a given instance. The sum of all topic probabilities for a given instance is always 100%.
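The shape of such a result can be sketched as follows. The topic names and probabilities below are hypothetical, not an actual API response, but they illustrate the invariant just described: one probability per topic, summing to 100%.

```python
# Hypothetical topic distribution for one instance (illustrative values).
distribution = [
    {"name": "Topic 00", "probability": 0.52},
    {"name": "Topic 01", "probability": 0.31},
    {"name": "Topic 02", "probability": 0.17},
]

total = sum(t["probability"] for t in distribution)
top = max(distribution, key=lambda t: t["probability"])
print(f"total={total:.2f}, dominant topic={top['name']}")
# total=1.00, dominant topic=Topic 00
```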

BigML allows you to compute a Topic Distribution for a single instance or for several instances at once.

Topic Distribution

To get the topic probability distributions for a single instance, you just need the ID of the Topic Model and the values for the input fields used to create the Topic Model. In most cases, it may be just a text fragment.

curl "https://bigml.io/topicdistribution?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"topicmodel":"topicmodel/58231122983efc15d400002a",
            "input_data":{
              "000005": "Lovely hosts, very accommodating - I was unable to meet at the original check-in time so were flexible and let me come an hour earlier. Clean, tidy, very cute room! Perfect - thanks very much!"
            }
           }'

Batch Topic Distribution

To get the topic probability distributions for multiple instances, you need the ID of the Topic Model and the ID of the dataset containing the values for the instances you want to predict.

curl "https://bigml.io/batchtopicdistribution?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"topicmodel":"topicmodel/58231122983efc15d400002a",
            "dataset":"dataset/98b5527c3c1920386a000467"}'

When the Batch Topic Distribution has been performed, you can download it as a CSV file simply by appending “download” to the Batch Topic Distribution URL.
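Once downloaded, the CSV is easy to post-process. The excerpt below is a hypothetical example of its shape (one probability column per topic, one row per instance; the exact column names in the real export may differ), used here to pick the dominant topic of each review.

```python
import csv
import io

# Hypothetical excerpt of a downloaded Batch Topic Distribution CSV.
csv_text = """id,Topic 00,Topic 01,Topic 02
r1,0.52,0.31,0.17
r2,0.10,0.70,0.20
"""

topics = ["Topic 00", "Topic 01", "Topic 02"]
rows = list(csv.DictReader(io.StringIO(csv_text)))
# Pick the highest-probability topic for each row.
dominant = {r["id"]: max(topics, key=lambda t: float(r[t])) for r in rows}
print(dominant)
# {'r1': 'Topic 00', 'r2': 'Topic 01'}
```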

If you want to use the topics as inputs to build another model (as we explain in the third post of our series), we recommend that you create a dataset from the Batch Topic Distribution. You can easily do so by using the argument output_dataset at the time of the Batch Topic Distribution creation as indicated in the snippet below.

curl "https://bigml.io/batchtopicdistribution?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"topicmodel":"topicmodel/58231122983efc15d400002a",
            "dataset":"dataset/98b5527c3c1920386a000467",
            "output_dataset": true}'

In the next post, we will explain how to use Topic Models with WhizzML.

Would you like to know more about Topic Models? Visit the Topic Models release page and check our documentation to create Topic Models, interpret them, and predict with them through the BigML Dashboard and API, as well as the six blog posts of this series.
