
Programming Topic Models

In this post, the fourth one of our Topic Model series, we will briefly demonstrate how you can create a Topic Model by using the BigML API. As mentioned in our introductory post, Topic Modeling is an unsupervised learning method to discover the different topics underlying a collection of documents. You can also read about the detailed process of creating Topic Models with the BigML Dashboard and about a real use case predicting the sentiment of movie reviews in the second and third posts of the series, respectively.

The API workflow to create a Topic Model is composed of four steps:

[Figure: the four-step API workflow]

Any resource created with the API will automatically appear in your Dashboard too, so you can take advantage of BigML's intuitive visualizations at any time. If you have never used the BigML API before, note that all requests to manage your resources must use HTTPS and be authenticated with your username and API key to verify your identity. For instance, here is the base URL to manage Topic Models.

https://bigml.io/topicmodel?username=$BIGML_USERNAME;api_key=$BIGML_API_KEY

For more details, check the API documentation related to Topic Models here.
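
The snippets in this post abbreviate the authentication query string as $BIGML_AUTH. A minimal shell setup, following the convention used in the BigML documentation, looks like the sketch below (my_username and my_api_key are placeholders for your own credentials):

# Replace the placeholder values with your own BigML credentials.
export BIGML_USERNAME=my_username
export BIGML_API_KEY=my_api_key

# BIGML_AUTH holds the query string appended to every request below.
export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY"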

1. Upload your Data

Upload your data in your preferred format from a local file, a remote file (using a URL), or your cloud repository (e.g., AWS, Azure, etc.). This will automatically create a source in your BigML account.

To do this, you need to open up a terminal with curl or any other command-line tool that implements standard HTTPS methods. In the example below we are creating a source from a remote file containing almost 110,000 Airbnb reviews of accommodations in Portland, Oregon.

curl "https://bigml.io/source?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"remote":"http://data.insideairbnb.com/united-states/or/portland/2016-07-04/data/reviews.csv.gz"}'

Topic Models only accept text fields as inputs so your source should always contain at least one text field. To find out how Topic Models tokenize and analyze the text in your data, please read our previous post.
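
If you want to double-check how BigML typed your fields before moving on, you can retrieve the source with a plain GET request. This is just a sketch; the source ID is the illustrative one used in the next step:

curl "https://bigml.io/source/68b5627b3c1920186f123478900?$BIGML_AUTH"

In the JSON response, each entry in the "fields" map reports an "optype"; only the fields whose optype is "text" will be used by the Topic Model.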

2. Create a Dataset

After the source is created, you need to build a dataset, which computes basic statistics for your fields and gets them ready for the Machine Learning algorithm to take over.

curl "https://bigml.io/dataset?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"source":"source/68b5627b3c1920186f123478900"}'
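
Keep in mind that BigML resources are built asynchronously. Before using the dataset you can poll it until it is finished; in BigML's status codes, 5 means FINISHED. A sketch, reusing the illustrative dataset ID from the next step:

curl "https://bigml.io/dataset/98b5527c3c1920386a000467?$BIGML_AUTH"

When the "status" object in the response shows "code": 5, the dataset is ready for the next step.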

3. Create a Topic Model

When your dataset has been created, you need its ID to create your Topic Model. Once again, although you may have many different field types in your dataset, the Topic Model will only use the text fields.

curl "https://bigml.io/topicmodel?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"dataset":"dataset/98b5527c3c1920386a000467"}'

If you don't want to use all the text fields in your dataset, you can use either the argument input_fields (to indicate which fields you want to use as inputs) or the argument excluded_fields (to indicate the fields that you don't want to use). In this case, since we don't want to use the text field that contains the reviewer name, we define just the field containing the review text as our input.

curl "https://bigml.io/topicmodel?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"dataset":"dataset/98b5527c3c1920386a000467",
            "input_fields":["comments"]}'

Apart from the dataset and the input fields, you can also include additional arguments like the parameters that we explained in our previous post to configure your Topic Model.
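
As a sketch, a configured creation request could look like the one below. The argument names (number_of_topics, bigrams, case_sensitive, term_limit, and excluded_terms) follow the BigML API documentation, but the values here are purely illustrative:

curl "https://bigml.io/topicmodel?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"dataset":"dataset/98b5527c3c1920386a000467",
            "input_fields":["comments"],
            "number_of_topics": 16,
            "bigrams": true,
            "case_sensitive": false,
            "term_limit": 4096,
            "excluded_terms": ["Portland"]}'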

4. Make Predictions

The main goal of creating a Topic Model is to find the topics in your dataset instances. Predictions for Topic Models are called Topic Distributions in BigML since they return a set of probabilities (one per topic) for a given instance. The sum of all topic probabilities for a given instance is always 100%.

BigML allows you to compute a Topic Distribution for one single instance or a Batch Topic Distribution for several instances simultaneously.

Topic Distribution

To get the topic probability distributions for a single instance, you just need the ID of the Topic Model and the values for the input fields used to create the Topic Model. In most cases, it may be just a text fragment.

curl "https://bigml.io/topicdistribution?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"topicmodel":"topicmodel/58231122983efc15d400002a",
            "input_data":{
              "000005": "Lovely hosts, very accommodating - I was unable to meet at the original check-in time so were flexible and let me come an hour earlier. Clean, tidy, very cute room! Perfect - thanks very much!"
            }
           }'

Batch Topic Distribution

To get the topic probability distributions for multiple instances, you need the ID of the Topic Model and the ID of the dataset containing the values for the instances you want to predict.

curl "https://bigml.io/batchtopicdistribution?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"topicmodel":"topicmodel/58231122983efc15d400002a",
            "dataset":"dataset/98b5527c3c1920386a000467"}'

When the Batch Topic Distribution has been performed, you can download it as a CSV file simply by appending “download” to the Batch Topic Distribution URL.
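
For instance, following that rule (a sketch; <batch-id> stands for the ID returned by the creation call above):

curl "https://bigml.io/batchtopicdistribution/<batch-id>/download?$BIGML_AUTH" > topic_distributions.csv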

If you want to use the topics as inputs to build another model (as we explain in the third post of our series), we recommend that you create a dataset from the Batch Topic Distribution. You can easily do so by using the argument output_dataset at the time of the Batch Topic Distribution creation as indicated in the snippet below.

curl "https://bigml.io/batchtopicdistribution?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"topicmodel":"topicmodel/58231122983efc15d400002a",
            "dataset":"dataset/98b5527c3c1920386a000467",
            "output_dataset": true}'

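From there, the output dataset behaves like any other BigML dataset. As a sketch, you could train a model on it as shown below; the dataset ID placeholder and the objective field name are hypothetical and depend on your data (in the movie review use case of our third post, for example, the objective would be the sentiment label):

curl "https://bigml.io/model?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"dataset":"dataset/<output-dataset-id>",
            "objective_field":"sentiment"}'
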
In the next post, we will explain how to use Topic Models with WhizzML.

Would you like to know more about Topic Models? Visit the Topic Models release page and check our documentation to create Topic Models, interpret them, and predict with them through the BigML Dashboard and API, as well as the six blog posts of this series.

Predicting Movie Review Sentiment with Topic Models

In this blog post, the third one of our Topic Models series, we showcase how you can use BigML Topic Models to improve your model performance. We use movie reviews extracted from the IMDb database to predict whether a given review has a positive or a negative sentiment. Note that in this post we will not dive into all the configuration options that BigML offers for Topic Models; for that, we recommend that you read our previous post.


The Data

The dataset contains 50,000 highly polarized movie reviews labeled with their sentiment class: positive or negative. It was built by Stanford researchers for a 2011 paper, which achieved an accuracy of 88.89%. The reviews belong to many different movies (a movie can have at most 30 reviews) to avoid situations where the algorithm may learn non-generalizable patterns for specific plots, directors, actors, or characters. The dataset is split in two halves: 50% for training and the other 50% for testing.

As with any other dataset, the IMDb dataset needs a bit of pre-processing before it can be used to properly train a Machine Learning model. Fortunately, BigML handles the major text pre-processing tasks at Topic Model creation time, so you don't need to worry about tokenization, stemming, letter case, or stop words. When you create the Topic Model, the text is automatically tokenized so that each single word becomes a token in the model vocabulary. Non-word tokens such as symbols and punctuation marks are automatically removed. One interesting task for further study would be to replace symbols that may indicate sentiment, e.g., turning ":-)" into "SMILE", to see whether it improves the model. BigML also removes stop words by default; however, in this case negative stop words may indicate sentiment, so we can opt to keep them. Finally, stemming is applied to all the words so that terms with the same lexeme are treated as one unique term (e.g., "loving" and "love").

The only data cleansing task that we need to perform beforehand is removing the HTML tag "<br>", which appears frequently in the review content, as in the example below. You can do this at model creation time using the "Excluded terms" configuration option explained later in this post.

[Image: a sample review containing <br> tags]

Single Decision Tree

To get a first sense of whether the text in the reviews has any power to predict sentiment, we build a single decision tree, selecting "sentiment" as the objective field and using the reviews as input.

We assume in this post that you already know the typical process to build any model in BigML, i.e., how to upload the data and how to create a dataset. If you are not familiar with it, take a peek here.

When you are done creating your dataset, you can see that the movie reviews dataset is composed of two fields: sentiment (positive or negative) and reviews (the review text). Not surprisingly, the words “movie” and “film” are the most frequent ones in the collection (see image below).

[Figure: tag cloud of the reviews field]

If we perform a 1-click model, we obtain the decision tree in a few seconds. As you mouse over the root node and descend until you reach the leaf nodes, you can see the different prediction paths. For example, in the image below you can see that if the review contains the terms "bad" and "worst", then it is a "negative" review with 92.99% confidence. As expected, the nodes contain words such as "awful", "boring", "wonderful", and "amazing", the terms that best split the data to predict sentiment.

[Figure: a prediction path in the decision tree]

Confidence values seem pretty high for most prediction paths in this tree, but to measure its predictive power we need to evaluate it on data it has not seen before. We use the 50% of the dataset previously set aside, which contains the remaining 25,000 movie reviews not used to train our model.

[Figure: confusion matrix for the single-tree evaluation]

The evaluation yields an overall accuracy of 75.52%, which is not that bad for a 1-click model. However, we can definitely improve on this!

Discovering the Topics in the Reviews

We already saw that single terms are decent predictors of movie review sentiment. But a particular term will not appear in every review that carries the same sentiment, so the model becomes quite complex, relying on many possible term combinations to reach the best prediction. This is because each person expresses very similar concepts with different vocabulary. What if we could group thematically related terms, so that the model would not have to look for single terms but for groups of terms instead? You can now implement that approach in BigML by using Topic Models!

In the interest of time, we will neither go over the step-by-step process to create a Topic Model in the BigML Dashboard nor cover all Topic Model configuration options here. We will only explain the options relevant to our problem of predicting sentiment.

As we mentioned at the beginning of this post, the only data cleansing needed is the removal of the HTML tag “<br>”, so instead of using the 1-click Topic Model option we need to configure the model first. From the dataset, we select the “Configure Topic Model” option.

[Figure: the Configure Topic Model option]

When the configuration panel is displayed, use the "Excluded terms" option to type in the term "<br>" and click the green "Create Topic Model" button below.

[Figure: the Excluded terms option]

When our Topic Model is created, we can filter and inspect it by using the two visualizations provided by BigML.

In the first view, you will be able to see at-a-glance all the topics represented by circles sized in proportion to the topic importances. Furthermore, the topics are displayed in a map layout, which depicts the relationship between them such that closer topics are more thematically related.

[Figure: the topic map view]

However, probably the best way to get an overview of your topics is to take a look at the second visualization: the bar chart. In this chart, you can view the top terms (up to 15) by topic. Each term is represented by a bar whose length signifies the importance of the term within that topic (i.e., the term probability). As you can see in the image below, you can use the top menu filters to better visualize the most important terms by topic.

The Topic Model finds the terms that are more likely to appear together and groups them into different topics. This probabilistic method yields quite accurate groupings of terms that are highly thematically related. If we inspect the model topic by topic, we can easily see that Topics 35 and 38 reveal terms like “bad”, “poor”, “boring”, “stupid”, “waste”, etc., all of which are clearly related to negative reviews. On the other hand, we can see positive sentiment topics like Topic 20 that includes words like “love”, “beautiful”, “perfect”, “excellent”, “recommend”, etc. Then there are other more neutral topics that indicate the genre of the film or its target audience e.g., Topic 21 containing “kids”, “animation”, “cartoons”, etc. We expect such topics that are not apparently correlated with any kind of sentiment to be identified as less important in our predictive model.

Including Topics in our Model

Now that we have seen that the topics discovered in our dataset follow a general logic that may be useful to predict sentiment, we can include them as predictors. You can easily achieve this by performing a Batch Topic Distribution: simply click the corresponding option in the 1-click menu and select the dataset used to train the Topic Model.

[Figure: the Batch Topic Distribution option]

The Batch Topic Distribution calculates the probability of each topic for a given instance, so we will obtain a distribution of probabilities, one per topic, for all our dataset instances.

When the Batch Topic Distribution is created, we can access the dataset that contains the topic distributions. As seen below, a new field is created per topic, whose values are that topic's probabilities for each instance.

[Figure: the output dataset with one new field per topic]

Now we can recreate a model by using the topics as input fields. Amazingly, the resulting tree is very different from the one created before. Now the most important values to predict sentiment are not the single terms anymore, but the topics!

[Figure: the decision tree built with topics as inputs]

If we click on the Summary Report to see the field importances for our model, we can see that the topics are much more important than the full review text. Not only that, topics containing sentiment-bearing terms are more important than neutral ones related to film genres or plots.

[Figure: field importances in the Summary Report]

Time to evaluate our model!

We first need to perform a Batch Topic Distribution over the test dataset so it contains the same topic fields that the model uses to calculate predictions. We need to follow the same steps explained before, but this time we will select our test dataset instead of the training dataset. Once the Batch Topic Distribution has been performed, we can use the output dataset to evaluate our model.

[Figure: confusion matrix for the model with topics]

As you can see, by including the topics in our model we bump up accuracy to 80.04%, an improvement of roughly 4.5 percentage points over the previous model without topics. This may sound small, but in real terms it means that about 1,100 more of the 25,000 test reviews are now correctly classified. Add to that the fact that the performance increase required no fancy model configuration or complex feature engineering.

Using Random Decision Forests

So far we have used a single tree so we could easily visualize the differences between the two modeling approaches (i.e., with and without topics). But it is well known that ensembles of trees perform significantly better in the great majority of cases. Thus, using the same dataset, we build a Random Decision Forest and evaluate it. As expected, a Random Decision Forest with 100 trees swiftly reaches 84.05% accuracy.

[Figure: confusion matrix for the Random Decision Forest]

Conclusions

We have seen how anyone working with text data can achieve significant performance enhancements by using topics as additional input fields in their models. With more time, one can still improve on our Random Decision Forests, but maybe that is best left for another day.

In the next post, we will cover programming Topic Models with the BigML API.

Would you like to know more about Topic Models? Visit the Topic Models release page and check our documentation to create Topic Models, interpret them, and predict with them through the BigML Dashboard and API, as well as the six blog posts of this series.

Discover and Analyze Relevant Topics in Any Text

BigML is bringing a new resource called Topic Models to help you discover the topics underlying a collection of documents. The main goal of Topic Modeling is finding significant thematically related terms (“topics”) in your unstructured text data. You can find an English example in the image below, which shows the topics found in a newspaper article about unemployment. BigML Topic Models can analyze any type of text in seven different languages: English, Spanish, Catalan, French, Portuguese, German and Dutch.

The resulting list of topics can be used as a final output for information retrieval tasks, collaborative filtering, or assessing document similarity, among other uses. Topics can also be very useful as additional input features for other modeling tasks (e.g., classification, regression, clustering, anomaly detection).

In the first post of our six-post series about Topic Models, we gave a high-level introduction to Topic Modeling and the BigML implementation. In this post we will cover in more detail the fundamental steps required to find the topics hidden in your text fields by using the BigML Dashboard:

[Figure: the five-step Dashboard workflow]

1. Upload your Data

Upload your data to your BigML account. You can drag and drop a local file, connect BigML to your cloud repository (e.g., S3 buckets), or copy and paste a URL pointing to your data. Once your data is uploaded, BigML will automatically recognize the type of each field.

Topic Models will only use the text fields in your dataset to find the relevant topics, so you need at least one text field. If you have a dataset with several types of fields, only the text fields will be considered as inputs to build the Topic Model. If multiple text fields are given as inputs they will be automatically concatenated so the content for each instance can be considered as a “bag of words” by the Topic Model.

[Figure: the source view with recognized field types]

BigML provides several configuration parameters for text analysis at the source level so you can decide the tokenization strategy, e.g., whether you want to keep stop words. This configuration is used to create the vocabulary for your models (see image below). However, Topic Models don't take this source-level configuration into account: you can set most of these options at model creation time from the Topic Model configuration panel explained in step 3 below. Tokenization and stemming are the only two options not configurable for Topic Models: Topic Models always tokenize your text fields by terms, and stemming is always active (e.g., the words "play", "played", and "playing" will always be considered one unique term, "play").

[Figure: source-level text analysis configuration]

2. Create the Dataset

From your Source view, use the 1-click Dataset option to create a dataset. This is a structured version of your data ready to be used by a Machine Learning algorithm.

For each of your text fields you will get a distribution histogram of your terms ordered by their frequency in the dataset. Next to the histogram, you will find an option to open the tag cloud of your text field as shown in the image below.

[Figure: the dataset view with the tag cloud option]

3. Create a Model

If you want to use the default parameter values, you can create a Topic Model from the dataset by using the 1-click Topic Model option. Alternatively you can tune the parameters by using the Configure Topic Model option.

[Figure: the 1-click Topic Model option]

BigML allows you to configure the following parameters:

  • Number of topics: the total number of topics to be discovered in your dataset. You can set this number manually or let the algorithm find the optimal number of topics according to the number of instances in your dataset. The maximum number of topics allowed is 64.
  • Number of top terms: the total number of top terms within each topic to be displayed in your Topic Model. By default it is set to 10; the maximum is 128 terms.
  • Term limit: the total number of unique terms to be considered for the Topic Model vocabulary. By default it is set to 4,096; the maximum is 16,384 unique terms.
  • Text analysis options:
    • Case sensitivity: whether the Topic Model should differentiate terms by letter case. If you activate this option, "Grace" and "grace" will be considered two different terms. By default Topic Models are case insensitive.
    • Bigrams: whether the Topic Model should include, apart from single terms, pairs of terms that typically go together (e.g., "United States" or "mobile phone"). By default this option is not active.
    • Stop words: whether the Topic Model should remove stop words (i.e., words such as articles, prepositions, conjunctions, etc.). By default stop words are removed.
  • Excluded Terms: you can select specific terms to be excluded from the model vocabulary.
  • Sampling options: if you have a very large dataset, you may not need all the instances to create the model. BigML allows you to easily sample your dataset at model creation time.

[Figure: the Topic Model configuration panel]

4. Analyze your Topic Model

BigML provides two different views so you can analyze the topics discovered in your text.

Topic Map

The topic map shows the topics in a map view, where you can get a sense of both the topic importances and the relationships between them. Each circle represents a different topic, and its size depicts the topic probability, i.e., the average probability of the topic appearing in a given instance of the original dataset. The distance between the circles represents how thematically close or far apart the topics are.

By mousing over each topic you can see its most important terms (see "Number of top terms" in the third step above). The terms are ordered by their importance within each topic as measured by the term probability. All term probabilities for a given topic sum up to 100%. A given term can be attributed to more than one topic, e.g., the word "bank" may be found in a topic related to finances, but also in a topic related to geology (river bank).

[Figure: the topic map]

If you mouse over a term within a topic, you will find all the stemmed forms for that term. Stemming is the process of reducing each term to its lexeme, so that, e.g., the terms "great", "greatness", and "greater" are considered the same term, since they all share the lexeme "great".

[Figure: stemmed forms shown on mouse-over]

The map has the following options found in the top menu:

  • Filtering options:
    • Topic probability slider to filter the topic circles by size (importance) i.e., the average probability of a topic in a given instance in the original dataset used to create the model.
    • Search box to filter topics and terms.
    • Include/exclude labels for topics.
    • Reset filters option.
  • Export options:
    • Export map in PNG format.
    • Export model in CSV format.
  • Edit topic names: click a topic circle and change its name using the edit icon. Naming topics after their main theme is very useful for interpretation when you make predictions afterwards.
  • Tag cloud option to see the terms composing a topic as a tag cloud.

[Figure: topic map menu options]

Term Chart

This view shows the topics and their terms in a bar chart to give you a quick overview of the terms composing each topic as well as their importances. On the horizontal axis, you can see the term probability for each of the top terms of a given topic (you can display up to 15 terms).

[Figure: the term chart]

The chart has the following options found in the top menu:

  • Filtering options:
    • Term probability slider to filter terms by their probability in a given topic.
    • Search box to filter topics and terms.
    • Number of terms to select the maximum terms shown per topic in the chart.
    • Dynamic axis option to adjust the axis scale to the current filters.
    • Reset filters option.
  • Export options:
    • Export chart in PNG format (with or without legends).
    • Export model in CSV format.

[Figure: term chart menu options]

5. Make Predictions

The main objective of creating a Topic Model is to find the relevant topics for your dataset instances. Topic Model predictions are called Topic Distributions in BigML. For each instance you will get a set of probabilities (one per topic) indicating their relevance for that instance. For any given instance, topic probabilities should sum up to 100%.

BigML allows you to make predictions either for a single instance (Topic Distribution) or for several instances simultaneously (Batch Topic Distribution).

Topic Distribution

To get a single prediction for an input text, click the Topic Distribution option as shown in the image below.

[Figure: the Topic Distribution option]

You will get a form containing the input fields used to create the Topic Model. Include any text in the corresponding input field boxes. BigML automatically computes each topic probability for that input text. You can see topic probabilities in a histogram like the one shown in the image below. You can also see the terms within each topic by mousing over each topic.

[Figure: the resulting topic probability histogram]

Batch Topic Distribution

If you want to make predictions for several instances simultaneously, click the Batch Topic Distribution option as shown in the image below.

[Figure: the Batch Topic Distribution option]

Then you need to select a dataset containing the instances for which you want to make the predictions. It can be the same dataset used to create the Topic Model or a different dataset.

[Figure: selecting a dataset for the batch prediction]

When your batch prediction finishes, you will be able to download the CSV file and see the output dataset.

[Figure: the Batch Topic Distribution output]

In the next post, we will cover a real Topic Modeling use case: uncovering the hidden topics of movie reviews from the IMDb database to predict the sentiment behind the reviews.

Would you like to know more about Topic Models? Visit the Topic Models release page and check our documentation to create Topic Models, interpret them, and predict with them through the BigML Dashboard and API, as well as the six blog posts of this series.

Introduction to Topic Models

At BigML we’re fond of celebrating every season by launching brand new Machine Learning resources, and our Fall 2016 creation will be headlined by our Latent Dirichlet Allocation (LDA) implementation. After months of meticulous work, BigML’s Development Team is making the LDA primitive available in the Dashboard and API simultaneously under the name of Topic Models. With Topic Models, words in your text data that often occur together are grouped into different “topics”. With the model, you can assign a given instance a score for each topic, which indicates the relevance of that topic to the instance. You can then use these topic scores as input features to train other models, as a starting point for collaborative filtering, or for assessing document similarity, among many other uses.

This post gives you a general overview of how LDA has been integrated into our platform. This will be the first of a series of six posts about Topic Models that will provide you with a gentle introduction to the new resource. First, we’ll get started with Topic Models through the BigML Dashboard. We’ll follow that up with posts on how to apply Topic Models in a real-life use case, how to create Topic Models and make predictions with the API, how to automate this process using WhizzML, and finally a deeper, slightly more technical explanation of what’s going on behind the scenes.


Why implement the Latent Dirichlet Allocation algorithm?

There are plenty of valuable insights hidden in your text data. Plain text data can be very useful for content recommendation, information retrieval tasks, segmenting your data, or training predictive models. The standard "bag of words" analysis BigML performs when it creates your dataset is often useful, but sometimes it doesn't go far enough, as there may be hidden patterns in text data that are difficult to discover when you're only considering occurrences of a single word at a time. Often, Latent Dirichlet Allocation can organize your text data in a way that brings this hidden information to the fore.

There are three key vocabulary words we need to know when we’re trying to understand the basics of Topic Models: documents, terms, and topics. Latent Dirichlet Allocation (LDA) is an unsupervised learning method that discovers different topics underlying a collection of documents, where each document is a collection of words, or terms. LDA assumes that any document is a combination of one or more topics, and each topic is associated with certain high probability terms.
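
For readers who want the textbook formulation, here is a sketch of the standard LDA generative process (this is the classical model, not a BigML-specific detail). With Dirichlet priors $\alpha$ and $\beta$, each topic $k$ draws a term distribution $\varphi_k$, each document $d$ draws a topic mixture $\theta_d$, and every term position $n$ in a document is filled by first picking a topic and then picking a term from it:

\varphi_k \sim \mathrm{Dirichlet}(\beta), \qquad \theta_d \sim \mathrm{Dirichlet}(\alpha),
z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \qquad w_{d,n} \sim \mathrm{Multinomial}(\varphi_{z_{d,n}}).

Here $z_{d,n}$ is the hidden topic assignment of the $n$-th term in document $d$, and $w_{d,n}$ is the observed term; fitting the model means inferring the $\theta$ and $\varphi$ distributions from the observed terms alone.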


Here are some additional useful pointers on our Topic Models:

  • Topic Models work with text fields only. BigML is able to analyze any type of text in seven different languages: English, Spanish, Catalan, French, Portuguese, German and Dutch.
  • Each instance of your dataset will be considered a document (a collection of terms) and the text input field the content of that document.
  • A term is a single lexical token (usually a word, but it can be any arbitrary string).
  • A topic is a distribution over terms. Each term has a different probability within a topic: the higher the probability, the more relevant the term is for that topic.
  • Several topics may have a high probability associated with the same term. For example, the word “house” may be found in a topic related to properties but also in a topic related to holiday accommodation.
  • Each document will have a probability associated with each topic, according to the terms in the document.

Professor David Blei, the inventor of LDA, gives a very nice tutorial on it here.

To use Topic Models, you can specify the number of topics to discover, or let BigML pick a reasonable number according to the amount of training data that you have. Another parameter allows you to use consecutive pairs of words (bigrams) in addition to single words when fitting the topics. You may also specify case sensitivity. By default, Topic Models automatically discard stop words and high-frequency words that occur in almost all of the documents, as they typically do not help determine the boundaries between topics. You can also make predictions with your Topic Models, both for individual instances and in batch; in either case BigML will assign topic probabilities to each instance you provide. The higher the probability, the greater the association between that topic and the given instance.

For instance, imagine that you run a telecommunications company and you want to predict customer churn at the end of the month, using all the information available in the customer service data collected when customers call or email asking for help. Thanks to Topic Models you can automatically organize your data in a way that lets you define exactly what a client's correspondence was "about". Your Topic Model will return a list of top terms for each topic found in the data. You can then use these topics as input features to cluster your customer correspondence into distinct groups for which you can devise actionable relationship management strategies. The table below shows three potential topics that might be extracted from a dataset of such correspondence. One may easily name the first topic "complaints", the second "technical issues", and the third "pricing concerns".

Topic_1      Topic_2      Topic_3
Mistrust     Issue        Tax
Tired        Antenna      Cost
Terrible     Technical    Free
Doubt        Power        Dollars
Complaint    Break        Expensive
Trouble      Device       Bill

Want to know all about Topic Models?

Visit the Topic Models release page and check our documentation to create Topic Models, interpret them, and predict with them through the BigML Dashboard and API, as well as the six blog posts of this series.

BigML Fall 2016 Release and Webinar: Topic Models and More!

BigML’s Fall 2016 Release is here! Join us on Tuesday, November 29, at 10:00 AM PST (Portland, Oregon / GMT -08:00) / 07:00 PM CET (Valencia, Spain / GMT +01:00) for a FREE live webinar to get a first look at the latest version of BigML! We’ll be focusing on Topic Models, the latest resource that helps you find thematically related terms in your unstructured text data.

[Figure: a Topic Model topic map]

The BigML Team has been working hard to implement the underlying Latent Dirichlet Allocation (LDA) technique into the BigML Dashboard and API. With Topic Models you can use your identified topics as final output for information retrieval tasks, collaborative filtering, or for assessing document similarity, among other use cases. You can also use those topics as input features to train other models, such as Classification or Regression models, Cluster Analysis, Anomaly Detection, or Association Discovery.

[Figure: a Topic Model term chart]

Topic Models come with two visualizations to better analyze the topics discovered in your text data: a Topic Map and a Term Chart. The Topic Map presents a map view for you to both see the topic importances at a glance and the relationship between them, whereas the Term Chart displays the topics and their terms in a bar chart view to give you a quick overview of the terms composing the topics as well as their importances. While BigML’s Topic Models only use the text fields in your dataset, they can analyze any type of text in seven different languages: English, Spanish, Catalan, French, Portuguese, German and Dutch.

[Figure: topic distribution predictions]

BigML offers Topic Distributions to make predictions for a single instance, and Batch Topic Distributions to predict several instances simultaneously. In either case, BigML provides a set of probabilities for each instance (one per topic), indicating the relevance of each topic for that instance.

[Image: BigML Certifications]

BigML has been democratizing Machine Learning since 2011, and today marks an important milestone in doing so more systematically: we are happy to announce BigML Certifications for partners that want to master BigML to successfully deliver real-life Machine Learning projects on behalf of their customers. Regardless of whether you are a software developer, systems integrator, analyst, or scientist, our certification programs pay for themselves by taking your skill set and your ability to deliver data-driven solutions to a whole new level. We invite all of you to register for one of our upcoming BigML Engineer and BigML Architect certification waves.

Would you like to know more about Topic Models? Join us on Tuesday, November 29, at 10:00 AM PST (Portland, Oregon / GMT -08:00) / 07:00 PM CET (Valencia, Spain / GMT +01:00). Be sure to reserve your free spot today as space is limited! Following our tradition, we will also be giving away BigML t-shirts to those who submit questions during the webinar. Don't forget to request yours!

PreSeries goes to the Test Lab at WIRED2016 in London

Innovative technology, leading-edge startups, and brilliant entrepreneurship were on display at London's Tobacco Dock for the 2016 edition of the WIRED conference this week, on November 3 and 4. The inspiring venue and a great lineup of speakers made this two-day event a truly special gathering, along with the Test Lab organized by Telefónica. PreSeries, the joint venture between Telefónica Open Future_ and BigML, was invited to participate, and made the most of it by interactively demoing its capabilities thanks to our Alexa integration. Nothing better to capture the moment than a few pictures!


In addition to the PreSeries booth, Telefónica presented an impressive array of seven more companies from its portfolio:

  • Saffe, a mobile payment app that leverages world-class facial recognition technology
  • Knomo, a company that creates fashionable accessories to get life organized
  • Pzizz, an app that helps people beat insomnia and get great sleep
  • Cru Kafe, a company that offers one of the best ethical and organic coffee labels in the world
  • Voicemod, a platform that lets you add real time audio modification to your app with a simple and free SDK
  • 52 MasterWorks, the first crowdfunding company investing in art
  • Pulsar, an audience intelligence platform that provides insights on customer behaviour from social media data

Over 700 people visited the Test Lab, which was even further expanded the evening of the first exhibition day, when 13 additional companies joined the event.


The Test Lab and the WIRED2016 event close their doors today (Friday, November 4, at 06:45 PM GMT). We sincerely thank the WIRED community and Telefónica for this great opportunity to showcase the value of PreSeries to fellow early-stage startups and investors who are thirsty for quantifiable and objective feedback about their growth potential.


A Few (More) Words of Advice

In part one of this blog post, I began trying to apply some of the wisdom of E. W. Dijkstra to the field of machine learning. We'll continue with that discussion here, and offer a few concluding thoughts.

Avoid involvement in projects so vague that their failure could remain invisible: such involvement tends to corrupt one’s scientific integrity.

As professionals trying to draw conclusions from data, we are scientists in a very important and concrete sense.  Thus, we ought to bear, at least to some degree, the mantle foisted upon us by that title. This means in part that we have a responsibility to be skeptical of our own successes.

A big part of this skepticism in Machine Learning comes from one’s choice of evaluation metric. A few years ago I wrote a paper about evaluation metrics for binary classification which showed empirically that you could often make one solution look better than another just by a clever choice of metric, even when all of your choices were well-accepted metrics.

Of course, as I mentioned above, the metric you choose for success should depend greatly on the context in which the model is to be deployed. What if you don't? It's usually not hard to make a project look somewhat successful even if you know the results aren't robust enough to be deployed into a production environment. This is the definition of an invisible failure.

A great perspective on metrics choice comes from Dr. Kiri Wagstaff, who argues that Machine Learning scientists would be better served by measuring success not by accuracy or AUC, but by real world criteria like number of people better served and lives improved. The opposite of “invisible failure” is “conspicuous success” and this should be our goal. Unless we choose projects and metrics that allow for that goal, we’ll never achieve it.

Write as if your work is going to be studied by a thousand people.

I think Dijkstra meant this as an admonishment to maintain both pride in and quality of one’s work even in the absence of immediate notoriety, but I’m going to take this comment in a slightly different direction.

It’s easy to get caught up with the idea of supervised Machine Learning and process automation as the only goals of working with data. While those goals are very nice, sometimes the most powerful and transformative things that come out of the data are stories. I’ve interviewed for data engineering-type roles at many different businesses, including a Fortune 500 company. The department in which I interviewed was focused entirely on telling stories from the data: “The highest value users for the website have this behavior pattern. Here are some efforts to get more users to follow that pattern. How successful were those efforts? Why?” There might not be a predictive model anywhere in that list of steps, but those are stories that can only be told by data, and give information that might dramatically change efforts throughout the company.

On a lighter note, I was once working in a situation where we needed to test an implementation of topic modeling. To do this, we went behind the company to their recycling dumpster, pulled out a few thousand documents, did OCR on them, and fed them to our topic modeling software, our idea being that this was a great example of a messy and diverse document collection. Among the topics that were found in the documents was one containing the words “position”, “salary”, “full-time”, and “contract”. It turned out that a lot of the documents were job postings that had been printed out by employees who were “testing the waters” so to speak; an interesting comment on employee morale at that moment.

Many datasets have a story to tell, if only we will listen. Part of a data engineer’s job is to write that story, and if done right it might indeed be read by a thousand people.

Raise your standards as high as you can live with, avoid wasting your time on routine problems, and always try to work as closely as possible at the boundary of your abilities. Do this because it is the only way of discovering how that boundary should be moved forward.

Readers of this blog post are just as likely as anyone to fall victim to the classic maxim, "When all you have is a hammer, everything is a nail." I remember a job interview where my interrogator appeared uninterested in talking further after I wasn't able to solve a certain optimization using Lagrange multipliers. The mindset isn't uncommon: "I have my toolbox. It's worked in the past, so everything else must be irrelevant."

Even more broadly, Machine Learning practitioners tend to obsess over the details of model choice and parameter fine tuning, when the conventional wisdom in the field is that vastly more improvement comes from feature engineering than from fiddling with your model. Why the disconnect? Part of it is that feature engineering is usually comparatively boring and non-technical. Another part is that it’s hard, requiring the modeler to explore knowledge outside of their field and (heaven forbid) talk to people.

You should always be looking to make the effort that most improves your chances of success on your current or future projects. Often, that will require moving outside of your comfort zone or doing things that you normally wouldn’t. But the more often you do these things, the larger your toolbox gets, and the more likely success will follow.

This idea and many of the other ideas in these posts apply throughout life for everyone, but for those working on the frontier of a new field they are especially applicable. At this stage in the game, a win for one of us is really a win for all of us; every victory further legitimizes the field and expands the field for everyone. If we’re all careful about the work we choose, are creative in our approaches, and ensure that our successes are both numerous and meaningful in the real world, then the current promise surrounding Machine Learning is sure to become a reality.

A Few Words of Advice

Computer scientists are fortunate enough to live in a time where the founders of our field, our Newtons and Galileos, are still among the living. Sadly, many of these brilliant pioneers have begun to leave us as this century progresses, but even as that inevitably continues, we can still reflect on the products of their genius and vision. Earlier this year, Marvin Minsky passed away, but his musings on the nature of human intelligence are still as relevant today as (perhaps even more so than) they were 30 years ago.

One such luminary was the great Edsger W. Dijkstra, who died in 2002. His list of accomplishments that we consider foundational to the field is spectacular in both its length and breadth. Luckily for us, he was also a prolific writer, and much of his correspondence has made its way into a permanent archive available to the public.

One of my favorite pieces from that archive is EWD 1055A, sometimes referred to as his “Advice to a Young Scientist”.  Those of us who work with data for a living are just beginning to carve out our professional niche in the broader world, so it’s worth considering how Dijkstra’s advice might apply to us.  Here, in two parts, are my thoughts.  I’ll be reaching past his initial intent occasionally, but I hope he’d be proud of how well some of his arguments generalize.

Before embarking on an ambitious project, try to kill it.

Dijkstra was a big fan of knowing your limitations. He gave us another notorious quote when asked by a Ph.D. student what he should study for his dissertation topic. Dijkstra’s response was, “Do only what you can do”.

It's still early days for machine learning. The bounds and guidelines about what is possible or likely are still unknown in a lot of places, and bigger projects that test more of those limitations are more likely to fail. As a fledgling data engineer, especially in industry, it's almost certainly the more prudent course to go for the low-hanging fruit: easy-to-find optimizations that have real-world impact for your organization. This is the way to build trust among skeptical colleagues, and also the way to figure out where those boundaries are, both for the field and for yourself.

As a personal example, I was once on a project where we worked with failure data from large machines with many components. The obvious and difficult problem was to use regression analysis to predict the time to failure for a given part. I had some success with this, but nothing that ever made it to production. However, a simple clustering analysis that grouped machines by frequency of replacement for all parts had some lasting impact: it enabled the organization to red-flag machines in the "high replacement" group, where the users may have been misusing the machines, and bring those users in for training.

So if you are a new or newly-hired machine learning practitioner, before you embark on that huge, transformative project, consider smaller, quicker, surer efforts first. If you're trying to find the boundaries of what you can do, you might as well start with what you can do.

Don’t get enamored with the complexities you have learned to live with (be they of your own making or imported). The lurking suspicion that something could be simplified is the world’s richest source of rewarding challenges.

Remember, deploying machine learning in the real world is about simplification. If you have a machine learning model that automates some process, but acquiring the data, learning the model, and deployment take more time, money, and/or human effort than simply executing the process by hand, the model is useless.

This provides a useful way of looking for new machine learning projects. Is there any drudgery machine learning could automate away? Are there time-consuming practices that careful examination of data might be able to prove are unnecessary or counter-productive? A great example comes from Google itself, which famously turned its considerable data analysis expertise on its own interview practices and found that they were basically worthless.

BigML, too, was founded on this premise: Putting machine learning into practice doesn’t have to be a massive exercise in complexity. Machine learning itself can be made simpler, without extra software libraries or languages or even a line of code. Simplicity is a value near and dear to our hearts, and it should be to yours as well.

Never tackle a problem of which you can be pretty sure that (now or in the near future) it will be tackled by others who are, in relation to that problem, at least as competent and well-equipped as you are.

Machine-learned models are usually deployed for one of just a few reasons:

  1. Only a small number of humans can do it (medical diagnosis)
  2. Humans can do it, but computers are faster (optical character recognition)
  3. Humans can do it, but computers are better (automated vehicles)

This is crucial to keep in mind when evaluating your models. Suppose your model gets an F1-score of 0.98. Wow, congratulations! But if the human currently doing this job gets a 0.99, what good is the model? Your model must always be evaluated in the context in which it is to be deployed.

This point is something of a corollary to the point about ambitious projects. It's often fairly easy to make a machine-learned model that does as well as, or even 10% or 20% better than, a human or a hand-programmed expert system. But getting to the 2x or 3x improvement that will make a measurable difference in your employer's business can be difficult or impossible. Said another way, you'll find that when people have to optimize, they're usually not terrible at it, and beating them by a lot tends to be hard.

One of our machine learning experts here at BigML has some experience classifying credit card transactions into rough categories. His first thought when given the task was, of course, “machine learning!” However, he quickly found that a set of hand-coded rules that he could hack together in a few hours gave near-perfect accuracy, without the need to format data and learn a model and so on. A lot of times machine learning is the right thing. Sometimes it isn’t.

Your model is only as good as it is in context. And if that context includes an existing or easily implemented solution, you’d better evaluate your model against it. If you don’t, someone else with broader vision surely will.

PreSeries Receives Partnership Award at Telco Data Analytics Conference

The Telco Data Analytics conference series is the perfect occasion to witness leading industry players showcasing their innovations, success stories and strategies for the future of telecommunications. It is the place to go to uncover new trends, network with potential partners and stay abreast of opportunities and challenges facing the market. The latest edition of the European tour took place in Madrid on October 25 and 26. Major operators like Telefónica, Orange, Swisscom and other industry players such as Huawei, EMC, ip.access, Netscout, Solvatio and SAP were in attendance.

One of the highlights was the award ceremony, where BigML won the Partnership Award for its collaboration with Telefónica Open Future_ in creating PreSeries. PreSeries' mission is to take advantage of the latest innovations in Machine Learning to transform startup financing from its current subjective form into a highly objective, data-driven practice.


Amir Tabakovic (VP Business Development at BigML) holding the prize

Thanks to Telco Data Analytics for this award. We look forward to establishing even more fruitful partnerships in the future!

First Summer School in Machine Learning in São Paulo!

A bit of context

Machine Learning is making its presence felt on the worldwide stage as a major driver of digital business success. Good proof of that was the recently completed second edition of the Valencian Summer School in Machine Learning, held in September 2016 in Spain. Over 140 attendees representing 53 companies and 21 academic organizations from 19 countries travelled to Valencia for a crash course in Machine Learning, and it was a great success!


What are the next steps?

Encouraged by the level of interest and motivated by our mission to democratize Machine Learning, we continue spreading Machine Learning concepts with this series of courses, this time in São Paulo, Brazil, on December 8 and 9. BigML, in collaboration with VIVO and Telefónica Open Future_, will be holding a two-day hands-on summer school, perfect for business leaders, advanced undergraduates, graduate students, and industry practitioners who are interested in boosting their productivity by applying Machine Learning techniques.


All lectures will take place at the VIVO Auditorium from 8:30 AM to 6:00 PM BRST on December 8 and 9. You will be guided through this Machine Learning journey, starting with basic concepts and techniques and proceeding to the more advanced topics that you need to know to become the master of your data with BigML. Check out the program here!

Special closure

To complete this Summer School, at the closure of the event on Friday, December 9, we will showcase real-world companies built on Machine Learning technology. We will also present the Artificial Intelligence Startup Battle, where the jury is a Machine Learning algorithm that predicts the probability of success of early-stage startups, with no human intervention.

If you are a startup with applied Artificial Intelligence and Machine Learning as a core component of your offering, submit your application to compete in the AI Startup Battle! Read this blog post for more details.

Join us at the Brazilian Summer School in São Paulo!

The Brazilian Summer School in Machine Learning is FREE, but by invitation only. The application deadline is Friday, December 2, at 9:00 PM BRST. Applications will be processed as they are received, and invitations will be granted right after individual confirmations to allow for travel plans. Make sure that you register soon, since space is limited!

