
BigML Fall 2016 Release Webinar Video is Here!

Thank you to all webinar attendees for the active feedback and questions about BigML’s Fall 2016 Release that includes Topic Models, our latest resource that helps you find thematically related terms in your text data. Our implementation of the underlying Latent Dirichlet Allocation (LDA) technique, one of the most popular probabilistic methods for topic modeling tasks, is now available from the BigML Dashboard and API. As is the case for any Machine Learning workflow, you can also automate your Topic Model workflows with WhizzML!

If you missed the webinar, it’s not a problem. Now you can watch the complete session on the BigML Youtube channel.

Please visit our dedicated Fall 2016 Release page for more resources, including:

  • The Topic Models documentation to learn how to create, interpret and make predictions from the BigML Dashboard and the BigML API.

  • The series of six blog posts that explain Topic Models step by step, starting with the basics and wrapping up with the mathematical insights of the LDA algorithm.

Many thanks for your time and attention. We are looking forward to bringing you our next release!

Who Wants to Know the Inner Workings of LDA?

In our recent series of blog posts on Topic Models, we’ve explored this powerful new resource in the BigML Dashboard, in the API, and with WhizzML, and we’ve also suggested some uses for it. But we’ve saved a nuts-and-bolts description of how Latent Dirichlet Allocation (LDA) works for the end. In this post, the last of the series of six, we’ll give you exactly that: a high-level overview of the internal mathematics that underlies Topic Models, and what that mathematics might imply for you, the modeler.


David M. Blei – one of the creators of LDA, together with Andrew Y. Ng and Michael I. Jordan.

While I’ll explain a few things here, a more precise and technical explanation is available from David Blei, one of the creators of the technique. Where there seems to be conflict between his explanation and mine, rest assured, his is correct!

My Generation

A crucially important aspect of the topic models learned by Latent Dirichlet Allocation is that they are generative models. This is different from many of our models at BigML: decision trees and logistic regressions are discriminative models. Essentially, this means they spend their effort using the data to model the boundary between classes, directly approximating the function of interest without much concern about exactly why the data is the way it is.

Generative models are a bit different. Typically, generative models posit a statistical structure that is said to have generated the data. The modeling process is then a process of using the data to fit the parameters of that structure so that the structure is likely to have generated the data that we see.

More concretely, Latent Dirichlet Allocation imagines that each document is a distribution over topics in your dataset, like {Topic 1: 0.8, Topic 2: 0.0, Topic 3: 0.1, Topic 4: 0.1}. Each of those topics, as we know, is a distribution over words, like {President: 0.5, united: 0.25, states: 0.25}. Now, to generate a document from that topic distribution, we first choose a random topic according to those probabilities (so we’d be very likely to choose topic 1, unlikely to choose topics 3 or 4, and would never choose topic 2). Once we’ve chosen our topic, we choose a word from that topic (so we’d choose “President” with high probability, and “united” or “states” with lower probability). That gives us a single word in our document. Then we repeat the process over and over again until we have a document as long as we’d like. Et voilà, we’ve generated a document from a topic distribution!

The astute reader will notice we’ve glossed over where the topic distribution for our document came from. There’s actually another meta-distribution from which we choose these topic distributions. So each document in our collection is a random choice of a topic distribution from the overlying distribution over topic distributions. With that, we’ve got a structure that could possibly generate our entire document collection, if all of those distributions are just right.
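
For those who like notation, here is the same generative story in the standard textbook formulation of LDA (a sketch in the usual symbols, not anything BigML-specific): each topic k draws a term distribution from a Dirichlet prior, each document d draws its topic distribution from another Dirichlet prior (the “meta-distribution” above), and each word position n is filled by sampling a topic assignment and then a term:

    \varphi_k \sim \mathrm{Dirichlet}(\eta), \qquad \theta_d \sim \mathrm{Dirichlet}(\alpha)

    z_{d,n} \sim \mathrm{Categorical}(\theta_d), \qquad w_{d,n} \sim \mathrm{Categorical}(\varphi_{z_{d,n}})

Fitting the model means inverting this story: finding the θ’s and φ’s that make the observed documents likely.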

Another Tricky Day

As we mentioned earlier, learning proceeds in a somewhat backwards way from how we think about the generative structure: that is, the model gives a procedure to generate documents given the distributions, but we already have the documents and need to infer the parameters of the model that probably generated them.

This is a tricky proposition so I won’t bore you with the details, but there are techniques like collapsed Gibbs sampling and variational methods that allow you to do this sort of “reverse inference” in generative models, and that’s exactly what we use in practice for our Topic Models.

The Seeker

So how can this knowledge affect the way you use BigML topic models? One of the key things to remember when using Topic Models is that the learning process is trying to use the generative structure above to explain how your document collection came to be. This means that if there’s a certain bunch of terms that occur all the time in your documents, the model will spend a lot of effort trying to explain how they got there, and might end up ignoring terms that are less frequent.

Why might this be bad? Mostly because text data often tends to be “dirty”, with a lot of cruft that you don’t care about modeling. A great example is web pages. There’s a lot of nice information on web pages, but you’ll find that if you pass raw web page source to Topic Models, the terms it finds most important will be things like “html”, “span”, “div” and “href”. Because these formatting directives appear all the time in web pages and in different quantities, the model will spend lots of effort trying to explain the differences in the occurrence rates of these tokens. Maybe that’s what you’re looking for; if you’ve got a dataset that is half web pages and half e-mail messages, the tokens that denote web pages might be useful indeed. It might also be nice to know which pages have more links. Then again, maybe you don’t care about HTML tags at all.

BigML attempts to remove tokens that occur “too frequently to be useful” when doing topic modeling, but it’s often specific to the use case whether or not something that’s frequently occurring is very important or just noise. But this is a problem that’s easy enough for you to fix: You can just exclude those useless terms from your dataset, either by pre-processing before you upload it, using Flatline, or using the excluded_terms parameter during the modeling process. You’ll often find that eliminating the half-dozen or so most common terms will yield a very different model; you’ll have changed the composition of your collection so much that other terms will have become far more relevant on a relative basis than they were the first time around.
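
As a rough sketch of that last option (not code from the original post; it assumes ds1 already holds the ID of your dataset, and the term list is purely illustrative), excluding such cruft with WhizzML might look like:

    ;; drop common HTML formatting tokens before fitting the Topic Model
    (define topic-model
      (create-topicmodel {"dataset" ds1
                          "excluded_terms" ["html" "span" "div" "href"]}))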

Eyesight to the Blind

With a little understanding of how topic models work, you can create them with open eyes and maybe even improve your results by tweaking the data just a bit. Give them a try and, as usual, please drop us a line at support@bigml.com if you’re having trouble, or even if you just have some interesting data and want to show off the results.  Good luck!

If you want to know more about this new resource, please visit the Topic Models release page and check our documentation to create Topic Models, interpret them, and predict with them through the BigML Dashboard and API, as well as the six blog posts of this series.

P.S. – Extra credit trivia question: What is notable about the names of all the subsections in this post?

BigML and CICE Join Forces to Revolutionize Machine Learning Education

Democratizing Machine Learning has always been BigML’s founding mission, so we are continually searching for new opportunities. As such, when a company is interested in our technology and is willing to help us further our cause of “Machine Learning for everyone”, we feel the urge to collaborate. This is exactly what happened with our new education partner. Today we are happy to announce our educational collaboration with CICE, the Leading School in New Technologies Training in Madrid, Spain.

CICE, the only Official Training Center in Spain for more than 20 multinational companies, is already a community of 70,000+ students from 30 different countries. With 35 years of experience, the school provides high-quality, officially accredited training programs backed by leading companies, thanks to certified teaching professionals recruited from the most prestigious Spanish production companies. In fact, the CICE team is currently going through BigML’s new certification program, and they will soon be among the first batch of BigML Certified Engineers.


CICE aims to drive constant educational innovation, acting as a protagonist and agent of the deep digital transformation our society is going through. They have already set exemplary standards for the fast-growing market segment of New Technologies Education through a mix of:

  • The best instructors: certified professionals with wide experience and proven ability to teach.
  • The best facilities: a heterogeneous system of professional Apple and DELL workstations along with a set of cloud services that provide the best educational experience.
  • The best homologations: a set of alliances that allows CICE to provide a unique learning environment, guaranteeing a learning process that fulfills students’ expectations in a way only the leaders of the educational sector can.
  • The greatest quantity and quality of production: objective proof that an excellent education at CICE, plus the effort and talent of the students they train, turns into a harvest of projects that have made CICE a winner of high-profile national and international competitions.

CICE’s enthusiasm for the BigML platform and their eagerness to improve Machine Learning education were big reasons why we agreed to collaborate. CICE has already joined our education program, which has more than 100 ambassadors and 620+ universities affiliated around the world that actively promote BigML.  They have started spreading the word through their students, social media, and at the education events they run in Madrid.


We are excited about the official partnership and are looking forward to making a big difference together in graduating the future data-driven leaders of the 21st century’s global digital economy.

Automated Topic Modeling Workflows Done Right


This series of posts started by introducing Topic Models as BigML’s implementation of Latent Dirichlet Allocation (LDA) to help discover thematically related terms in unstructured text data. We later explained how to use it through the BigML Dashboard, showed how to apply Topic Models in a real-life use case, and how to program Topic Models using the BigML API. This post will focus on automating LDA workflows by using WhizzML, a DSL for Machine Learning that helps automate workflows, program high-level algorithms, and share workflows and algorithms with others.

Let’s dive in by creating a Topic Model and making a prediction with it. In BigML, you can perform single-instance predictions (referred to as a Topic Distribution) or batch predictions (called a Batch Topic Distribution).

Firstly, we will create a Topic Model without specifying any particular configuration option, that is, relying on default settings. For that, you just need to create a script with the source code below:

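The original post showed this code as a screenshot; a minimal WhizzML sketch of the same idea (assuming ds1 already holds the ID of an existing dataset) would be:

    ;; create a Topic Model from dataset ds1, relying on default settings
    (define topic-model (create-topicmodel {"dataset" ds1}))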

BigML’s API is mostly asynchronous, so the above creation function will return a response before the Topic Model creation is completed. This implies that the Topic Model is not ready to make predictions right after the above code snippet is executed, so you should wait for its completion before predicting with it. You can do just that by using the directive “create-and-wait-topicmodel”. See the example below:

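Again, a sketch standing in for the original screenshot:

    ;; create the Topic Model and wait until it is ready before moving on
    (define topic-model (create-and-wait-topicmodel {"dataset" ds1}))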

Now let’s try configuring a Topic Model via WhizzML. The properties to configure can be easily added to the creation map as pairs of <property_name> and <property_value>. For instance, when building a Topic Model from a dataset that contains one or more text fields, BigML automatically determines the number of topics, but if you prefer the maximum number of topics that BigML allows, you should add the property “number_of_topics” and set it to 64. Additionally, if you want your Topic Model to be case sensitive, you need to set the “case_sensitive” property to true. Property names always need to be between quotes, and the value should be expressed in the appropriate type. The code for our example can be seen below:

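A sketch of such a configured call, in place of the original screenshot:

    ;; configured Topic Model: maximum topic count plus case sensitivity
    (define topic-model
      (create-and-wait-topicmodel {"dataset" ds1
                                   "number_of_topics" 64
                                   "case_sensitive" true}))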

For more details about all the available properties, please check the API documentation.

Now that you know how to create a Topic Model, let’s see how to make predictions with it. The code is similar to the one used to create any resource. You just need the ID of the Topic Model you want to predict with, and you provide the input data as a map with the new text (or texts) that you want to predict for. The input_data property is a map that uses the field ID as its key. Here’s an example:

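In place of the original screenshot, a sketch along these lines (create-topicdistribution follows WhizzML’s usual create-<resource> naming, and the field ID “000001” is purely illustrative):

    ;; single prediction: input_data maps a field ID to the new text
    (define topic-distribution
      (create-topicdistribution {"topicmodel" topic-model
                                 "input_data" {"000001" "Text you want to analyze"}}))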

This is one of the exceptions to the asynchronous behavior of BigML’s API: the Topic Distribution is created synchronously, so in this case the script doesn’t need to wait before reading the result.

In many working scenarios, a batch prediction that allows predictions from a set of new data is more useful than a single prediction. In other words, a Batch Topic Distribution is usually preferred to a Topic Distribution.

It is pretty straightforward to create a Batch Topic Distribution from an existing Topic Model, where the dataset named ds1 represents a set of rows with the text data to analyze:

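A sketch standing in for the original screenshot:

    ;; batch prediction over every row of dataset ds1
    (define batch-topic-distribution
      (create-batchtopicdistribution {"topicmodel" topic-model
                                      "dataset" ds1}))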

Before creating a Batch Topic Distribution, you will likely need to configure some properties, such as the field mapping between the Topic Model and the dataset with the input data, or another property to include the importance of each field as columns in your results. These properties are configured the same way as when creating a Topic Model; the full list of available properties can be found in the API documentation. It’s also important to know that, contrary to single predictions, batch predictions are asynchronous in nature. Below is such an example:

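A configured sketch (the “fields_map” and “all_fields” property names are our assumptions here, so check the API documentation for the exact batch options); it uses the create-and-wait variant precisely because batch predictions are asynchronous:

    ;; configured batch prediction: map model field IDs to dataset field
    ;; IDs and keep all input fields in the output ("fields_map" and
    ;; "all_fields" are assumptions; see the API docs)
    (define batch-topic-distribution
      (create-and-wait-batchtopicdistribution
        {"topicmodel" topic-model
         "dataset" ds1
         "fields_map" {"000001" "000005"}
         "all_fields" true}))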

Up to this point, we have been covering the source code of scripts. To write this code in the BigML Dashboard, you can also use a handy editor that supports syntax highlighting, auto-completion, and code formatting. Take it for a spin and build your scripts quickly.


Nevertheless, if you are more at home with APIs, you can find more answers about script creation here. To obtain results from WhizzML scripts we need to first run them, so let’s see how to execute a script in the Dashboard, and how to carry out the same process by calling the right endpoint directly in the API.

First, the BigML Dashboard option: Look for your new script in the scripts list and click on it.


You will see a page like the one below, showing the inputs you need to fill in before you run the script. For instance, for the Topic Model creation script described at the beginning of this post, you just need to select the dataset you want to use to build your Topic Model from the dropdown.

[Screenshot: script execution page with its input fields]

You must fill in all the input fields marked with a grey icon so that they are validated with a green icon (empty values are not accepted, except for text inputs).


Finally, we will focus on how to execute a script through the API. To do this, you need to compose a POST request with JSON content to the /execution endpoint with two parameters: one is the ID of the script you previously created, and the other is “inputs”, a list of pairs that follow the schema <input_name> <input_value>. These include all the inputs without a defined default value in your script. Let’s see an example to showcase this idea: for the first script we used above, the input you need to fill is “ds1”, the identifier of the dataset you want to use to create a Topic Model. The complete request to the BigML API should be similar to this curl example:

 curl "https://bigml.io/execution?$BIGML_AUTH"
              -X POST
              -H 'content-type: application/json'
              -d '{"script": "script/55f007d21f386f5199000003",
                 "inputs": [["ds1", "dataset/55f007d21f386f5199000000"]]}'

We hope you enjoyed this quick tour of executing a script using the BigML API. For a more extensive list of execution parameters and how to access the execution results, please visit the corresponding section in the API documentation. Notice that we didn’t dive into the details of authentication, but it is described here. Finally, for an extensive description of WhizzML, you can visit the WhizzML page.

In the next blog post we will discover the internal mathematics that underlies Topic Models and what these mathematics might imply for you, the modeler.

Would you like to know more about Topic Models? Visit the Topic Models release page and check our documentation to create Topic Models, interpret them, and predict with them through the BigML Dashboard and API, as well as the six blog posts of this series.

Programming Topic Models

In this post, the fourth one of our Topic Model series, we will briefly demonstrate how you can create a Topic Model by using the BigML API. As mentioned in our introductory post, Topic Modeling is an unsupervised learning method to discover the different topics underlying a collection of documents. You can also read about the detailed process to create Topic Models using the BigML Dashboard and about a real use case predicting the sentiment of movie reviews in the second and third posts, respectively.

The API workflow to create a Topic Model is composed of four steps:

[Workflow: 1. upload your data, 2. create a dataset, 3. create a Topic Model, 4. make predictions]

Any resource created with the API will automatically be created in your Dashboard too, so you can take advantage of BigML’s intuitive visualizations at any time. If you have never used the BigML API before, note that all requests to manage your resources must use HTTPS and be authenticated with your username and API key to verify your identity. For instance, here is a base URL example to manage Topic Models.

https://bigml.io/topicmodel?username=$BIGML_USERNAME;api_key=$BIGML_API_KEY

For more details, check the API documentation related to Topic Models here.

1. Upload your Data

Upload your data in your preferred format from a local file, a remote file (using a URL) or your cloud repository (e.g., AWS, Azure, etc.). This will automatically create a source in your BigML account.

To do this, you need to open up a terminal with curl or any other command-line tool that implements standard HTTPS methods. In the example below we are creating a source from a remote file containing almost 110,000 Airbnb reviews of accommodations in Portland, Oregon.

curl "https://bigml.io/source?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"remote":"http://data.insideairbnb.com/united-states/or/portland/2016-07-04/data/reviews.csv.gz"}'

Topic Models only accept text fields as inputs so your source should always contain at least one text field. To find out how Topic Models tokenize and analyze the text in your data, please read our previous post.

2. Create a Dataset

After the source is created, you need to build a dataset, which computes basic statistics for your fields and gets them ready for the Machine Learning algorithm to take over.

curl "https://bigml.io/dataset?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"source":"source/68b5627b3c1920186f123478900"}'

3. Create a Topic Model

When your dataset has been created, you need its ID to create your Topic Model. Once again, although you may have many different field types in your dataset, the Topic Model will only use the text fields.

curl "https://bigml.io/topicmodel?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"dataset":"dataset/98b5527c3c1920386a000467"}'

If you don’t want to use all the text fields in your dataset you can use either the argument input_fields (to indicate which fields you want to use as inputs) or the argument excluded_fields (to indicate the fields that you don’t want to use). In this case, since we don’t want to use the text field that contains the reviewer name, we define as our input data just the field containing the text of the reviews.

curl "https://bigml.io/topicmodel?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"dataset":"dataset/98b5527c3c1920386a000467", 
            "input_fields":"comments"}'

Apart from the dataset and the input fields, you can also include additional arguments like the parameters that we explained in our previous post to configure your Topic Model.

4. Make Predictions

The main goal of creating a Topic Model is to find the topics in your dataset instances. Predictions for Topic Models are called Topic Distributions in BigML since they return a set of probabilities (one per topic) for a given instance. The sum of all topic probabilities for a given instance is always 100%.

BigML also allows you to perform a Topic Distribution for one single instance or for several instances simultaneously.

Topic Distribution

To get the topic probability distributions for a single instance, you just need the ID of the Topic Model and the values for the input fields used to create the Topic Model. In most cases, it may be just a text fragment.

curl "https://bigml.io/topicdistribution?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"topicmodel":"topicmodel/58231122983efc15d400002a",
            "input_data":{
            "000005": "Lovely hosts, very accommodating - I was unable to 
            meet at the original check-in time so were flexible and let me 
            come an hour earlier. Clean, tidy, very cute room! Perfect - 
            thanks very much!"
            }
           }'

Batch Topic Distribution

To get the topic probability distributions for multiple instances, you need the ID of the Topic Model and the ID of the dataset containing the values for the instances you want to predict.

curl "https://bigml.io/batchtopicdistribution?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"topicmodel":"topicmodel/58231122983efc15d400002a",
            "dataset":"dataset/98b5527c3c1920386a000467"}'

When the Batch Topic Distribution has been performed, you can download it as a CSV file simply by appending “download” to the Batch Topic Distribution URL.

If you want to use the topics as inputs to build another model (as we explain in the third post of our series), we recommend that you create a dataset from the Batch Topic Distribution. You can easily do so by using the argument output_dataset at the time of the Batch Topic Distribution creation as indicated in the snippet below.

curl "https://bigml.io/batchtopicdistribution?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"topicmodel":"topicmodel/58231122983efc15d400002a",
            "dataset":"dataset/98b5527c3c1920386a000467", 
            "output_dataset": true}'

In the next post, we will explain how to use Topic Models with WhizzML.

Would you like to know more about Topic Models? Visit the Topic Models release page and check our documentation to create Topic Models, interpret them, and predict with them through the BigML Dashboard and API, as well as the six blog posts of this series.

Predicting Movie Review Sentiment with Topic Models

In this blog post, the third one of our Topic Models series, we are showcasing how you can use BigML Topic Models to improve your model performance. We are using movie reviews extracted from the IMDb database to predict whether a given review has a positive or a negative sentiment. Notice that in this post we will not dive into all the configuration options that BigML offers for Topic Models; for that, we recommend that you read our previous post.


The Data

The dataset contains 50,000 highly polarized movie reviews labeled with their sentiment class: positive or negative. It was built by Stanford researchers for their 2011 paper, which achieved an accuracy of 88.89%. The reviews belong to many different movies (a movie can have at most 30 reviews) to avoid situations where the algorithm might learn non-generalizable patterns for specific plots, directors, actors or characters. The dataset is first split in two: 50% for training and the other 50% for testing.

As with any other dataset, the IMDb dataset needs a bit of pre-processing before it can properly train a Machine Learning model. Fortunately, BigML handles the major text pre-processing tasks at Topic Model creation time, so you don’t need to worry about tokenization, stemming, casing or stop words. When you create the Topic Model, the text is automatically tokenized so that each single word becomes a token in the model vocabulary. Non-word tokens such as symbols and punctuation marks are automatically removed. One interesting task for further study would be to replace symbols that may indicate sentiment to test whether that improves the model (e.g., “:-)” becomes “SMILE”). BigML also removes stop words by default; however, in this case negative stop words may indicate sentiment, so we can opt to keep them. Finally, stemming is applied to all the words so that terms with the same lexeme are considered the same unique term (e.g., “loving” and “love”).

The only data cleansing task that we need to perform beforehand for this dataset is removing the HTML tag “<br>”, which appears frequently in the review content, as seen below. You can do it at model creation time using the configuration option “Excluded terms” explained below.

[Screenshot: a review containing “<br>” tags]

Single Decision Tree

To get a first sense of whether the text in the reviews has any power to predict the sentiment, we are going to build a single decision tree by selecting “sentiment” as the objective field and using the reviews as input.

We assume in this post that you already know the typical process to build any model in BigML, i.e., how to upload the data to BigML and how to create a dataset. If you are not familiar with it, take a peek here.

When you are done creating your dataset, you can see that the movie reviews dataset is composed of two fields: sentiment (positive or negative) and reviews (the review text). Not surprisingly, the words “movie” and “film” are the most frequent ones in the collection (see image below).

[Image: tag cloud of the reviews field, where “movie” and “film” are the most frequent terms]

If we perform a 1-click model, we obtain the decision tree in a few seconds. As you mouse over the root node and go down the tree until you reach the leaf nodes, you can see the different prediction paths. For example, in the image below you can see that if the review contains the terms “bad” and “worst”, then it is a “negative” review with 92.99% confidence. As expected, we can find words in the nodes such as “awful”, “boring”, “wonderful” and “amazing”, which are the terms that best split the data to predict sentiment.

[Image: a prediction path of the decision tree]

Confidence values seem pretty high for most prediction paths in this tree, but to measure its predictive power we need to evaluate it using data it has not seen before. We use the 50% of the dataset previously set aside, which contains the remaining 25,000 movie reviews that have not been used to train our model.


The evaluation yields an overall accuracy of 75.52%, which is not that bad for a 1-click model. However, we can definitely improve on this!

Discovering the Topics in the Reviews

We already saw that single terms are decent predictors of movie review sentiment. But a particular term may not appear in every review, so the model gets quite complex as it uses lots of possible term combinations to get to the best prediction. This is because each person expresses ideas with his/her own vocabulary, even though the reviews may convey very similar concepts. What if we could group thematically related terms so that the model wouldn’t have to always look for single terms, but for groups of terms instead? You can now implement that approach in BigML by using Topic Models!

In the interest of time, we will neither go over the step-by-step process to create a Topic Model in the BigML Dashboard nor cover all Topic Model configuration options here. We will just explain the relevant options that suffice to solve our problem of predicting sentiment.

As we mentioned at the beginning of this post, the only data cleansing needed is the removal of the HTML tag “<br>”, so instead of using the 1-click Topic Model option we need to configure the model first. From the dataset, we select the “Configure Topic Model” option.


When the configuration panel is displayed, use the “Excluded terms” option to type in the term “<br>”, and click on the green “Create Topic Model” button below.


When our Topic Model is created, we can filter and inspect it by using the two visualizations provided by BigML.

In the first view, you can see all the topics at a glance, represented by circles sized in proportion to the topic importances. Furthermore, the topics are displayed in a map layout that depicts the relationship between them: closer topics are more thematically related.


However, probably the best way to get an overview of your topics is to take a look at the second visualization: the bar chart. In this chart, you can view the top terms (up to 15) by topic. Each term is represented by a bar whose length signifies the importance of the term within that topic (i.e., the term probability). You can use the top menu filters to better visualize the most important terms by topic.

The Topic Model finds the terms that are more likely to appear together and groups them into different topics. This probabilistic method yields quite accurate groupings of terms that are highly thematically related. If we inspect the model topic by topic, we can easily see that Topics 35 and 38 reveal terms like “bad”, “poor”, “boring”, “stupid”, “waste”, etc., all of which are clearly related to negative reviews. On the other hand, we can see positive sentiment topics like Topic 20 that includes words like “love”, “beautiful”, “perfect”, “excellent”, “recommend”, etc. Then there are other more neutral topics that indicate the genre of the film or its target audience e.g., Topic 21 containing “kids”, “animation”, “cartoons”, etc. We expect such topics that are not apparently correlated with any kind of sentiment to be identified as less important in our predictive model.

Including Topics in our Model

Now that we have seen that the topics discovered in our dataset seem to follow a general logic that may be useful to predict sentiment, we can include them as predictors. You can easily achieve this by performing a Batch Topic Distribution: simply click on the corresponding option found within the 1-click menu and select the dataset used to train the Topic Model.


The Batch Topic Distribution calculates the probability of each topic for a given instance, so we will obtain a distribution of probabilities, one per topic, for all our dataset instances.

When the Batch Topic Distribution is created, we can access the dataset that contains the topic distributions. As seen below, a new field per topic is created, whose values represent the topic probabilities for each instance.

[Screenshot: output dataset with one new probability field per topic]

Now we can recreate the model by using the topics as input fields. Remarkably, the resulting tree is very different from the one created before: the most important inputs for predicting sentiment are no longer single terms, but topics!


If we click on the Summary Report to see the field importances for our model, we can see that topics are much more important than the full review texts. Not only that, we can also see that topics containing terms that may convey sentiment are more important than neutral ones related to film genres or plots.


Time to evaluate our model!

We first need to perform a Batch Topic Distribution over the test dataset so it contains the same topic fields that the model uses to calculate predictions. We need to follow the same steps explained before, but this time we will select our test dataset instead of the training dataset. Once the Batch Topic Distribution has been performed, we can use the output dataset to evaluate our model.

[Screenshot: confusion matrix for the model with topics]

As you can see, by including the topics in our model, we are able to bump up model accuracy to 80.04%, an improvement of almost five percentage points over the previous model without topics. This may sound small, but in real terms it means that about 1,130 additional reviews are now correctly classified. Add to that the fact that the increase in performance has been realized without any fancy model configuration or complex feature engineering.

Using Random Decision Forests

So far we have used a single tree to be able to easily visualize the differences between the modeling approaches we took (i.e., with or without topics). But it is well known that ensembles of trees perform significantly better in the great majority of cases. Thus, using the same dataset, we build a Random Decision Forest and evaluate how it does. As expected, a Random Decision Forest with 100 trees swiftly reaches 84.05% accuracy.


Conclusions

We have seen how anyone working with text data can achieve significant performance enhancements by using topics as additional input fields in their models. With more time, one can still improve on our Random Decision Forests, but maybe that is best left for another day.

In the next post, we will cover how to program Topic Models.

Would you like to know more about Topic Models? Visit the Topic Models release page and check our documentation to create Topic Models, interpret them, and predict with them through the BigML Dashboard and API, as well as the six blog posts of this series.

Discover and Analyze Relevant Topics in Any Text

BigML is bringing you a new resource called Topic Models to help you discover the topics underlying a collection of documents. The main goal of Topic Modeling is finding the significant, thematically related terms (“topics”) in your unstructured text data. One example would be the topics found in a newspaper article about unemployment. BigML Topic Models can analyze any type of text in seven different languages: English, Spanish, Catalan, French, Portuguese, German and Dutch.

The resulting list of topics can be used as a final output for information retrieval tasks, collaborative filtering, or for assessing document similarity, among others. Topics can also be very useful as additional input features for other modeling tasks (e.g., classification, regression, clustering, anomaly detection).

In the first blog post of our series of six posts about Topic Models, we gave a high-level introduction to Topic Modeling and the BigML implementation. In this post we will cover in more detail the fundamental steps required to find the topics hidden in your text fields by using the BigML Dashboard:

[Workflow: 1. upload your data, 2. create the dataset, 3. create a model, 4. analyze your Topic Model, 5. make predictions]

1. Upload your Data

Upload your data to your BigML account. You can drag and drop a local file, connect BigML to your cloud repository (e.g., S3 buckets) or copy and paste a URL pointing to your data. Once you have uploaded your data, BigML will automatically recognize the type of each field in your dataset.

Topic Models will only use the text fields in your dataset to find the relevant topics, so you need at least one text field. If you have a dataset with several types of fields, only the text fields will be considered as inputs to build the Topic Model. If multiple text fields are given as inputs they will be automatically concatenated so the content for each instance can be considered as a “bag of words” by the Topic Model.


BigML provides several configuration parameters for Text Analysis at the source level, so you can decide the tokenization strategy, e.g., whether you want to keep stop words. This configuration is used to create the vocabulary that will be used to build your models (see image below). However, Topic Models don’t take this configuration into account; you can configure most of these options at model creation time from the Topic Model configuration panel explained in the third step below. Tokenization and stemming are the only two options not configurable for Topic Models: Topic Models always tokenize your text fields by terms, and stemming is always active (e.g., the words “play”, “played” and “playing” will always be considered one unique term, “play”).

[Screenshot: source-level text analysis configuration options]

2. Create the Dataset

From your Source view, use the 1-click Dataset option to create a dataset. This is a structured version of your data ready to be used by a Machine Learning algorithm.

For each of your text fields you will get a distribution histogram of your terms ordered by their frequency in the dataset. Next to the histogram, you will find an option to open the tag cloud of your text field as shown in the image below.

[Screenshot: dataset view with the term histogram and the tag cloud option]

3. Create a Model

If you want to use the default parameter values, you can create a Topic Model from the dataset by using the 1-click Topic Model option. Alternatively, you can tune the parameters by using the Configure Topic Model option.


BigML allows you to configure the following parameters:

  • Number of topics: the total number of topics to be discovered in your dataset. You can set this number manually or you can let the algorithm find the optimal number of topics according to the number of instances in your dataset. The maximum number of topics allowed is 64.
  • Number of top terms: the total number of top terms within each topic to be displayed in your Topic Model. By default it is set to 10; the maximum is 128 terms.
  • Term limit: the total number of unique terms to be considered for the Topic Model vocabulary. By default it is set to 4,096; the maximum is 16,384 unique terms.
  • Text analysis options:
    • Case sensitivity: whether the Topic Model should differentiate terms with lower and upper cases. If you activate this option, “Grace” and “grace” will be considered two different terms. By default, Topic Models are case insensitive.
    • Bigrams: whether the Topic Model, apart from single terms, should include pairs of terms that typically go together (e.g.: “United States” or “mobile phone”). By default this option is not active.
    • Stop words: whether the Topic Model should remove stop words (i.e., words such as articles, prepositions, conjunctions, etc.). By default stop words are removed.
  • Excluded Terms: you can select specific terms to be excluded from the model vocabulary.
  • Sampling options: if you have a very large dataset, you may not need all the instances to create the model. BigML allows you to easily sample your dataset at the model creation time.


4. Analyze your Topic Model

BigML provides two different views so you can analyze the topics discovered in your text.

Topic Map

The topic map shows the topics in a map view, where you can get a sense of both the topic importances and the relationships between them. Each circle represents a different topic, and its size depicts the “Topic probability”, i.e., the average probability of the topic appearing in a given instance of the original dataset. The distance between the circles represents how thematically close or far the topics are.

By mousing over each topic you can see its most important terms (see “Number of top terms” in the third step above). The terms are ordered by their importance within each topic as measured by the “Term probability”. All term probabilities for a given topic sum up to 100%. A given term can be attributed to more than one topic, e.g., the word “bank” may be found in a topic related to finance, but also in a topic related to geology (river bank).


If you mouse over each term within a topic, you will find all the stemmed forms for that term. Stemming is the process of reducing terms to their lexeme, so, e.g., the terms “great”, “greatness” and “greater” are considered the same term, since they all share the same lexeme: “great”.


The map has the following options found in the top menu:

  • Filtering options:
    • Topic probability slider to filter the topic circles by size (importance) i.e., the average probability of a topic in a given instance in the original dataset used to create the model.
    • Search box to filter topics and terms.
    • Include/exclude labels for topics.
    • Reset filters option.
  • Export options:
    • Export map in PNG format.
    • Export model in CSV format.
  • Edit topic names: by clicking on a topic circle, you can change its name using the edit icon. This is very useful for interpreting the main theme of each topic when you make predictions afterwards.
  • Tag cloud option to see the terms composing a topic in a tag cloud view.


Term Chart

This view shows the topics and their terms in a bar chart to give you a quick overview of the terms composing the topics as well as their importances. On the horizontal axis, you can see the probability of each of the top terms of a given topic (you can display up to 15 terms).


The chart has the following options found in the top menu:

  • Filtering options:
    • Term probability slider to filter terms by their probability in a given topic.
    • Search box to filter topics and terms.
    • Number of terms to select the maximum terms shown per topic in the chart.
    • Dynamic axis option to adjust the axis scale to the current filters.
    • Reset filters option.
  • Export options:
    • Export chart in PNG format (with or without legends).
    • Export model in CSV format.


5. Make Predictions

The main objective of creating a Topic Model is to find the relevant topics for your dataset instances. Topic Model predictions are called Topic Distributions in BigML. For each instance you will get a set of probabilities (one per topic) indicating their relevance for that instance. For any given instance, topic probabilities should sum up to 100%.

BigML allows you to make predictions either for a single instance (Topic Distribution) or for several instances simultaneously (Batch Topic Distribution).

Topic Distribution

To get one single prediction for an input text, click the Topic Distribution option as shown in the image below.

[Screenshot: Topic Distribution option in the 1-click menu]

You will get a form containing the input fields used to create the Topic Model. Include any text in the corresponding input field boxes. BigML automatically computes each topic probability for that input text. You can see topic probabilities in a histogram like the one shown in the image below. You can also see the terms within each topic by mousing over each topic.

[Screenshot: topic probability histogram for the input text]

Batch Topic Distribution

If you want to make predictions for several instances simultaneously, click the Batch Topic Distribution option as shown in the image below.

[Screenshot: Batch Topic Distribution option in the 1-click menu]

Then you need to select a dataset containing the instances for which you want to make the predictions. It can be the same dataset used to create the Topic Model or a different dataset.


When your batch prediction finishes, you will be able to download the CSV file and see the output dataset.


In the next post, we will cover a real Topic Modeling use case: uncovering the hidden topics of movie reviews from the IMDb database to predict the sentiment behind the reviews.

Would you like to know more about Topic Models? Visit the Topic Models release page and check our documentation to create Topic Models, interpret them, and predict with them through the BigML Dashboard and API, as well as the six blog posts of this series.

Introduction to Topic Models

At BigML we’re fond of celebrating every season by launching brand new Machine Learning resources, and our Fall 2016 creation will be headlined by our Latent Dirichlet Allocation (LDA) implementation. After months of meticulous work, BigML’s Development Team is making the LDA primitive available in the Dashboard and API simultaneously under the name of Topic Models. With Topic Models, words in your text data that often occur together are grouped into different “topics”. With the model, you can assign a given instance a score for each topic, which indicates the relevance of that topic to the instance. You can then use these topic scores as input features to train other models, as a starting point for collaborative filtering, or for assessing document similarity, among many other uses.

This post gives you a general overview of how LDA has been integrated into our platform. This will be the first of a series of six posts about Topic Models that will provide you with a gentle introduction to the new resource. First, we’ll get started with Topic Models through the BigML Dashboard. We’ll follow that up with posts on how to apply Topic Models in a real-life use case, how to create Topic Models and make predictions with the API, how to automate this process using WhizzML, and finally a deeper, slightly more technical explanation of what’s going on behind the scenes.


Why implement the Latent Dirichlet Allocation algorithm?

There are plenty of valuable insights hidden in your text data. Plain text data can be very useful for content recommendation, information retrieval tasks, segmenting your data, or training predictive models. The standard “bag of words” analysis BigML performs when it creates your dataset is often useful, but sometimes it doesn’t go far enough as there may be hidden patterns in text data that are difficult to discover when you’re only considering occurrences of a single word at a time. Often, the Latent Dirichlet Allocation algorithm is able to organize your text data in such a way that it causes some of this hidden information to spring to the fore.

There are three key vocabulary words we need to know when we’re trying to understand the basics of Topic Models: documents, terms, and topics. Latent Dirichlet Allocation (LDA) is an unsupervised learning method that discovers different topics underlying a collection of documents, where each document is a collection of words, or terms. LDA assumes that any document is a combination of one or more topics, and each topic is associated with certain high probability terms.


Here are some additional useful pointers on our Topic Models:

  • Topic Models work with text fields only. BigML is able to analyze any type of text in seven different languages: English, Spanish, Catalan, French, Portuguese, German and Dutch.
  • Each instance of your dataset will be considered a document (a collection of terms), and the text input field will be the content of that document.
  • A term is a single lexical token (usually one or more words, but can be any arbitrary string).
  • A topic is a distribution over terms. Each term has a different probability within a topic: the higher the probability, the more relevant the term is for that topic.
  • Several topics may have a high probability associated with the same term. For example, the word “house” may be found in a topic related to properties but also in a topic related to holiday accommodation.
  • Each document will have a probability associated with each topic, according to the terms in the document.

Professor David Blei, one of the inventors of LDA, gives a very nice tutorial on it here.

To use Topic Models, you can specify a number of topics to discover, or let BigML pick a reasonable number of topics according to the amount of training data that you have. Another parameter allows you to also use consecutive pairs of words (bigrams) in addition to single words when fitting the topics. You may also specify case sensitivity. By default, Topic Models automatically discard stop words and high-frequency words that occur in almost all of the documents, as they typically do not help determine the boundaries between topics. You can also make predictions with your Topic Models, both for individual instances and in batch, in which case BigML will assign topic probabilities to each instance you provide; the higher the probability, the greater the association between that topic and the given instance.

For instance, imagine that you run a telecommunications company and you want to predict customer churn at the end of the month. For that, we can use all the information available in the customer service data collected when customers call or send emails asking for help. Thanks to Topic Models, you can automatically organize your data in a way that lets you define exactly what a client’s correspondence was “about”. Your Topic Model will then return a list of top terms for each topic found in the data. You can also use these topics as input features to better cluster your customer correspondence into distinct groups, for which you can devise actionable relationship management strategies. The table below shows three potential topics that might be extracted from a dataset of such correspondence. One may easily name the first topic “complaints”, the second “technical issues”, and the third “pricing concerns”.

Topic_1     Topic_2     Topic_3
Mistrust    Issue       Tax
Tired       Antenna     Cost
Terrible    Technical   Free
Doubt       Power       Dollars
Complaint   Break       Expensive
Trouble     Device      Bill

Want to know all about Topic Models?

Visit the Topic Models release page and check our documentation to create Topic Models, interpret them, and predict with them through the BigML Dashboard and API, as well as the six blog posts of this series.

BigML Fall 2016 Release and Webinar: Topic Models and More!

BigML’s Fall 2016 Release is here! Join us on Tuesday, November 29, at 10:00 AM PST (Portland, Oregon / GMT -08:00) / 07:00 PM CET (Valencia, Spain / GMT +01:00) for a FREE live webinar to get a first look at the latest version of BigML! We’ll be focusing on Topic Models, the latest resource that helps you find thematically related terms in your unstructured text data.


The BigML Team has been working hard to implement the underlying Latent Dirichlet Allocation (LDA) technique into the BigML Dashboard and API. With Topic Models you can use your identified topics as final output for information retrieval tasks, collaborative filtering, or for assessing document similarity, among other use cases. You can also use those topics as input features to train other models, such as Classification or Regression models, Cluster Analysis, Anomaly Detection, or Association Discovery.


Topic Models come with two visualizations to better analyze the topics discovered in your text data: a Topic Map and a Term Chart. The Topic Map presents a map view for you to both see the topic importances at a glance and the relationship between them, whereas the Term Chart displays the topics and their terms in a bar chart view to give you a quick overview of the terms composing the topics as well as their importances. While BigML’s Topic Models only use the text fields in your dataset, they can analyze any type of text in seven different languages: English, Spanish, Catalan, French, Portuguese, German and Dutch.


BigML offers Topic Distributions to make predictions for a given single instance, and Batch Topic Distributions to predict several instances simultaneously. In either case, BigML provides a set of probabilities for each instance (one probability per topic), which indicate the relevance of each topic for that instance.


BigML has been democratizing Machine Learning since 2011, and today marks an important milestone in doing so more systematically: we are happy to announce BigML Certifications for partners that want to master BigML to successfully deliver real-life Machine Learning projects on behalf of their customers. Regardless of whether you are a software developer, systems integrator, analyst, or scientist, our certification programs pay for themselves by taking your skill set and your ability to deliver data-driven solutions to a whole new level. We invite all of you to register for one of our upcoming BigML Engineer and BigML Architect certification waves.

Would you like to know more about Topic Models? Join us on Tuesday, November 29, at 10:00 AM PST (Portland, Oregon / GMT -08:00) / 07:00 PM CET (Valencia, Spain / GMT +01:00). Be sure to reserve your free spot today as space is limited! Following our tradition, we will also be giving away BigML t-shirts to those who submit questions during the webinar. Don’t forget to request yours!

PreSeries goes to the Test Lab at WIRED2016 in London

Innovative technology, leading-edge startups and brilliant entrepreneurship were on display at London’s Tobacco Dock for the 2016 edition of the WIRED conference this week, on November 3 and 4. The inspiring venue and a great lineup of speakers made this two-day event a truly special gathering, along with the Test Lab organized by Telefónica. PreSeries, the joint venture between Telefónica Open Future_ and BigML, was invited to participate, and made the best of it by interactively demoing its capabilities thanks to our Alexa integration. Nothing better to capture the moment than a few pictures!


In addition to the PreSeries booth, Telefónica presented an impressive array of seven more companies from its portfolio:

  • Saffe, a mobile payment app that leverages world-class facial recognition technology
  • Knomo, a company that creates fashionable accessories to get life organized
  • Pzizz, an app that helps people beat insomnia and get great sleep
  • Cru Kafe, a company that offers one of the best ethical and organic coffee labels in the world
  • Voicemod, a platform that lets you add real time audio modification to your app with a simple and free SDK
  • 52 MasterWorks, the first crowdfunding company investing in art
  • Pulsar, an audience intelligence platform that provides insights on customer behaviour from social media data

Over 700 people visited the Test Lab, which was even further expanded the evening of the first exhibition day, when 13 additional companies joined the event.


The Test Lab and the WIRED2016 event will close their doors today (Friday, November 4, at 06:45 PM GMT). We sincerely thank the WIRED community and Telefónica for this great opportunity to showcase the value of PreSeries to fellow early-stage startups and investors that are thirsty for quantifiable and objective feedback about their growth potential.

