
10 Offbeat Predictions for Machine Learning in 2017

As each year wraps up, experts pull their crystal balls from their drawers and peer into them for a glimpse of what’s to come in the next one. At BigML, we have been following such clairvoyance carefully this past holiday season to compare and contrast it with our own take on what 2017 has in store, which may come across as quite unorthodox to some experts out there.

Enterprise Machine Learning Predictions Nobody is Talking About

For the TL;DR crowd, our crystal ball is showing us a cloudy (no pun intended) 2017 Machine Learning market forecast, with some sunshine behind the clouds for good measure. To put it more directly, enterprises need to look beyond the AI hype for practical ways to incorporate Machine Learning into their operations. This starts with the right choice of internal platform, one that will help them build on smaller, low-hanging-fruit projects that leverage their proprietary datasets. In due time, those projects add up to create positive feedback effects that not only introduce decision automation on the edges, but ultimately help agile Machine Learning teams transform their industries.

Jumping back to our regularly scheduled programming, let’s start with a quick synopsis of the road traveled so far:

  • Machine Learning is already set on an irreversible path to becoming impactful (VERY impactful) in how we’ll do our jobs across many sectors, eventually touching the whole economy.

  • Machine Learning Use Cases by Industry

  • But digesting, adopting and profiting from 36 years of Machine Learning advances and best practices has been a very bumpy ride that few businesses have managed to navigate so far.

  • There are many “New Experts” who read a couple of books or take a few online classes and are suddenly in a position to “alter” things just because they have access to cheap capital. While top technology companies have been “collecting” as much experienced Machine Learning talent as possible to get ready for the up-and-coming AI economy, other businesses are at the mercy of Machine Learning-newbie investors and inexperienced recent graduates with unicorn ambitions. It is wishfully assumed that versatile, affordable and scalable solutions based on a magical new algorithm will materialize out of these ventures.

  • In 2017, we suspect that the ecosystem will start converging around the right approach, albeit after some otherwise avoidable roadkill.

Before we get to the specific predictions, we must note that 2016 was a special year in that it presented us with a watershed event: for the first time in history, the planet’s Top 5 most valuable companies are all technology companies. All five share the common traits of large-scale network effects, highly data-centric company cultures and new economic value-added services built atop sophisticated analytics. What’s more, they have been heavily publicizing their intent to make Machine Learning the fulcrum of their future evolution. With the addition of revenue-generating unicorns like Uber and Airbnb, the dominance of the tech sector, which will benefit immensely from the wholesale digitization of the world economy, is likely to continue in the coming years.

Changing of the Guard?

However, the trillion-dollar question is how legacy companies (i.e., non-tech firms with rich data, plus smaller technology companies) can counteract and become an integral part of the newly forming value chains, so as to not only survive but thrive in the remainder of the decade. Today, these firms are stuck with rigid, rear-view-mirror business intelligence systems and archaic workstation-based traditional statistical systems running simplistic regression models that fail to capture the complexity of many real-life predictive use cases.

At the same time, they sit on growing heaps of hard-to-replicate proprietary datasets that go underutilized. The latest McKinsey Global Institute report, named The Age of Analytics: Competing in a Data-driven World, reveals that less than 30% of the potential of the modern analytics technologies outlined in their 2011 report has been realized, not even counting the new opportunities made possible by the advent of the same technologies in the last five years. To make matters worse, the progress looks very unbalanced across industries (i.e., as low as 10% in U.S. Healthcare vs. up to 60% in the case of Smartphones) at a time when analytics prowess is more correlated with competitive differentiation than ever.

Machine Learning Industry Adoption

Even if it may be hidden behind polished marketing speak pushed by major vendors and research firms (e.g., “Cognitive Computing”, “Machine Intelligence” or even the doomsday-like “Smart Machines”), the Machine Learning genie is out of the bottle without a doubt, as its wide-ranging potential across the enterprise has already made it part of the business lexicon. This newfound appetite for all things Machine Learning means many more legacy firms and startups will begin their Machine Learning journeys in 2017. The smart ones will separate themselves from the bunch by learning from others’ mistakes. Nonetheless, some old bad habits are hard to kick cold turkey, so let’s dive in with the gloomier predictions and end on a higher note:

  • PREDICTION #1:

    “Big Data” soul searching leads to the gates of #MachineLearning.

    Tweet: PREDICTION#1: “Big Data” soul searching leads to the gates of #MachineLearning. @BigMLcom https://ctt.ec/a1fa8+

  • The soul searching in the “Big Data” movement will continue as experts recognize the level of technical complexity that aspiring companies must navigate to piece together useful “Big Data” solutions that fit their needs. At the end of the day, “Big Data” is tomorrow’s data, but nothing else. The recent removal of the “Big Data” entry from the Gartner Hype Cycle is further testament to the same realization. All this will only hasten the pivot to analytics, and specifically to Machine Learning, as the center of attention, so as to recoup the sunk costs from those projects via customer-touching smart applications. Moreover, the much-maligned technique of sampling remains a great tool to rapidly explore the new predictive use cases that will support such applications.

Big Data vs. Machine Learning Trends

  • PREDICTION #2:

    VCs investing in algorithm-based startups are in for a surprise.

    Tweet: PREDICTION #2: VCs investing in algorithm-based startups are in for a surprise. @BigMLcom https://ctt.ec/r3SnA+

  • The education process of VCs will continue, albeit slowly and through hard lessons. They will keep investing in algorithm-based startups with marketable academic founder resumes, while perpetuating myths and creating further confusion, e.g., portraying Machine Learning as synonymous with Deep Learning, or completely misrepresenting the differences between Machine Learning algorithms and machine-learned models, or between model training and predicting from trained models¹. A deeper understanding of the discipline, with the proper historical perspective, will remain elusive in the majority of the investment community, which is on the lookout for quick blockbuster hits. On a slightly more positive note, a small subset of the VC community seems to be waking up to the huge platform opportunity Machine Learning presents.

    Benedict Evans on Machine Learning

  • PREDICTION #3:

    #MachineLearning talent arbitrage will continue at full speed.

    Tweet: PREDICTION #3: #MachineLearning talent arbitrage will continue at full speed. @BigMLcom https://ctt.ec/8Q43c+

  • The media frenzy around AI and Machine Learning will continue at full steam, as evidenced by Rocket AI type parties, where young academics will be courted and ultimately funded by the aforementioned investors. The ensuing portfolio companies will find it hard to compete on algorithms, as few algorithms are really widely useful in practice, although some do slightly better than others for very niche problems. Most will be cast as brides at shotgun weddings with corporate development teams looking to beef up on Machine Learning talent strictly for internal initiatives. In some nightmare scenarios, the acquirers will have no clear analytics charter, yet they will be in a frantic hunt to grab headlines and generate the illusion that they too are on the AI/Machine Learning bandwagon.

Machine Learning Talent Arbitrage

  • PREDICTION #4:

    Top down #MachineLearning initiatives built on Powerpoint slides will end with a whimper.

    Tweet: PREDICTION #4: Top down #MachineLearning initiatives built on Powerpoint slides will end with a whimper. @BigMLcom https://ctt.ec/_I589+

  • Legacy company executives that opt for getting expensive help from consulting companies in forming their top-down analytics strategy, and/or for making complex “Big Data” technology components work together before doing their homework on low-hanging predictive use cases, will find that actionable insights and game-changing ROI are hard to show. This is partially due to the requirement of having the right data architecture and flexible computing infrastructure already in place; more importantly, outperforming 36 years of collective achievements by the Machine Learning community with some novel approach is just a tall order, regardless of how relatively cheap computing has become.

Top Down Data Science Consulting Fail

  • PREDICTION #5:

    #DeepLearning commercial success stories will be few and far between.

    Tweet: PREDICTION #5: #DeepLearning commercial success stories will be few and far between. @BigMLcom https://ctt.ec/8f0ac+

  • Deep Learning’s notable research achievements such as the AlphaGo challenge will continue generating media interest. Nevertheless, its advances in certain practical use cases such as speech recognition and image understanding will be the real drivers for it to find a proper spot in the enterprise Machine Learning toolbox alongside other proven techniques. Interpretability issues, a dearth of experienced specialists, its reliance on very large labeled training datasets and significant computational resource provisioning will limit mass corporate adoption in 2017. In its current form, think of it as the Polo of Machine Learning techniques: a fun time, perhaps, that will let you rub elbows with the rich and famous, provided that you can afford a well-trained horse, the equestrian services and upkeep, the equipment and a pricey club membership to go along with those. Nevertheless, not quite an Olympic sport. So short of a significant research breakthrough in the unsupervised flavors of Deep Learning, most legacy companies experimenting with it are likely to conclude that they can get better results faster by paying more attention to areas like Reinforcement Learning and bread-and-butter Machine Learning techniques such as ensembles.

Deep Learning Hype

  • PREDICTION #6:

    Exploration of reasoning and planning under uncertainty will pave the way to new #MachineLearning heights.

    Tweet: PREDICTION #6: Exploration of reasoning and planning under uncertainty will pave the way to new #MachineLearning heights. @BigMLcom https://ctt.ec/1GAi3+

  • Of course, Machine Learning is only a small part of AI. More attention to research, and to the resulting applications from startups, in the fields of reasoning and planning under uncertainty, and not only learning, will help cover truly new ground beyond the better understood pattern recognition. Not surprisingly, Facebook’s Mark Zuckerberg reached similar conclusions in his assessment of the state of AI/Machine Learning after spending nearly a year coding his intelligent personal assistant “Jarvis”, loosely modeled after its namesake in the Iron Man series.

Mark Zuckerberg's Jarvis AI

  • PREDICTION #7:

    Humans will still be central to decision making despite further #MachineLearning adoption.

    Tweet: PREDICTION #7: Humans will still be central to decision making despite further #MachineLearning adoption. @BigMLcom https://ctt.ec/iBjl8+

  • Some businesses will see early shoots of faster, evidence-based decision making powered by Machine Learning; however, humans will remain central to the decision making. Early examples of smart applications will emerge in certain industry pockets, adding to the uneven distribution of capabilities due to differences in regulatory frameworks, innovation management approaches, competitive pressures, end customer sophistication and demand for higher quality experiences, as well as conflicting economic incentives in some value chains. Despite the talk about the upcoming singularity and robots taking over the world, cooler heads in the space point out that it will take a while to create truly intelligent systems. In the meantime, businesses will slowly learn to trust models and their predictions as they realize that algorithms can outperform humans in many tasks.

Humans vs. Machine Intelligence

  • PREDICTION #8:

    Agile #MachineLearning will quietly take hold beneath the cacophony of AI marketing speak.

    Tweet: PREDICTION #8: Agile #MachineLearning will quietly take hold beneath the cacophony of AI marketing speak. @BigMLcom https://ctt.ec/5eO8B+

  • A more practical and agile approach to adopting Machine Learning will quietly take hold next year. Teams of doers not afraid to get their hands dirty with unruly yet promising corporate data will completely bypass the “Big Data” noise and carefully pick low-hanging predictive problems that they can solve with well-proven algorithms in the cloud, using smaller sampled datasets that have a favorable signal-to-noise ratio. As they build confidence in their abilities, the desire to deploy what they have built into products, as well as to add more use cases, will mount. No longer bound by data access issues or complex, hard-to-deploy tools, these practitioners will not only start improving their core operations but also start thinking about predictive use cases with higher risk-reward profiles that can serve as enablers of brand new revenue streams.

Lean, Agile Data Science Stack

  • PREDICTION #9:

    MLaaS platforms will emerge as the “AI-backbone” for enterprise #MachineLearning adoption by legacy companies.

    Tweet: PREDICTION #9: MLaaS platforms will emerge as the “AI-backbone” for enterprise #MachineLearning adoption by legacy companies. @BigMLcom https://ctt.ec/6RuU9+

    MLaaS platforms will emerge as the “AI Backbone” in accelerating the adoption of Agile Machine Learning practices. Consequently, commercial Machine Learning will get cheaper and cheaper thanks to a new wave of applications built on MLaaS infrastructure. Cloud Machine Learning platforms in particular will democratize Machine Learning by

    • significantly lowering costs by eliminating complexity or front-loaded vendor contracts
    • offering preconfigured frameworks that package the most effective algorithms
    • abstracting the complexities of infrastructure setup and management from the end user
    • providing easy integration, workflow automation and deployment options through REST APIs and bindings.

Machine Learning Platforms for Developers

  • PREDICTION #10:

    Data Scientists or not, more Developers will introduce #MachineLearning into their companies.

    Tweet: PREDICTION #10: Data Scientists or not, more Developers will introduce #MachineLearning into their companies. @BigMLcom https://ctt.ec/LCRKX+

  • 2017 will be the year when developers start carrying the Machine Learning banner, easing the talent bottleneck for thousands of businesses that cannot compete with the Googles of the world in attracting top research scientists with over a decade of experience in AI/Machine Learning; such experience doesn’t automagically translate into smart business applications that deliver business value anyway. Developers will start rapidly building and scaling such applications on MLaaS platforms that abstract painful details (e.g., cluster configuration and administration, job queuing, monitoring and distribution) that are better kept underground in the plumbing. Developers just need a well-designed and well-documented API: they no more need to know what Information Gain or the Wilson Score is to solve a predictive use case with a decision tree than they need to know what an LR(1) parser is to compile and execute their Java code.

Developer-driven Machine Learning

We are still in the early innings of “The Age of Analytics”, so there is much more to feel excited about than to dwell on bruises from past false starts. Here’s to keeping calm and carrying on with this exciting endeavor that will take business as we know it through a storm by perfecting the alchemy between mathematics, software and management best practices. Happy 2017 to you all!

1: The A16Z presenter seems to think every self-driving car has to learn what a stop sign is by itself, thus reinventing the wheel many times over, instead of relying on tons of historical sensor data from an entire fleet of such vehicles. In reality, few Machine Learning use cases require a continuously trained algorithm (e.g., handwriting recognition).

Fourth Edition of the Startup Battle at 4YFN

Four Years From Now (4YFN), the startup business platform of Mobile World Congress that enables startups, investors and corporations to connect and launch new ventures together, comes to Barcelona, Spain, from February 27 to March 1, 2017. We could not think of a better context for the fourth edition of our series of Startup Battles.

Telefónica has invited PreSeries, the joint venture between Telefónica Open Future_ and BigML, to participate at the 4YFN event and showcase its early stage venture investing platform on the main stage on February 28 in front of an audience of over 500 technologists. In a nutshell, PreSeries provides insights and many other metrics to help investors make objective, data-driven decisions in investing in tomorrow’s leading technology companies.

battle_-4yfn

In rapid-fire execution mode, Valencia was the first city to witness the world premiere of the Artificial Intelligence Startup Battle on March 15, 2016. On October 12, the PreSeries team travelled to Boston to celebrate the second edition at the PAPIs ‘16 conference. Less than two months later, we celebrated the third edition in São Paulo, as part of BSSML16. The fourth edition of our series of startup battles will be hosted in Barcelona, Spain, where the distinguished audience and press members in Catalonia will discover how an algorithm is able to predict the success of a startup without any human intervention.

To recap the process: five startups from the Wayra Academy, Telefónica’s startup accelerator, will present their projects on stage in five-minute, to-the-point pitches. Afterwards, PreSeries will ask each contender a number of questions in order to produce a score between 0 and 100. The startup with the highest score wins the battle. The opportunity to participate is key for the startups, as it gives them excellent exposure to potential corporate sponsors, strategic partners and the venture investment community. Stay tuned for future announcements, where we will reveal the contenders of the fourth edition of our Startup Battle, which may just prove to be the most competitive one so far.

Even Easier Machine Learning for Everyday Tasks


Recently, the “Machine Learning for Everyday Tasks” post drew our attention as it suddenly rose to the top of Hacker News. In that post, Sergey Obukhov, a software developer at San Francisco-based startup Mailgun, tries to debunk the myth that Machine Learning is a hard task:

I have always felt like we can benefit from using machine learning for simple tasks that we do regularly.

This certainly rings true to our ears at BigML, since we aim to democratize Machine Learning by providing developers, scientists, and business users powerful higher-level tools that package the most effective algorithms that Machine Learning has produced in the last two decades.

In this post, we are going to show how much easier it is to solve the same problem tackled in the Hacker News post by using BigML. To this end, we have created a test dataset with similar properties to the one used in the original post so that we can replicate the same steps with analogous results.

Predicting processing time

The objective of this analysis is predicting how long it will take to parse HTML quotations embedded within e-mail conversations. Most messages are processed in a very short time, while some of them take much longer. Identifying those lengthier messages in advance is useful for several purposes, including load-balancing and giving more precise feedback to users.

Our analysis is based on a CSV file containing a number of fictitious track records of our system performance when handling email messages:

HTML length, Tag count, Processing Time

We would like to classify a given incoming message as either slow or fast, given its length and tag count, based on previously collected data.

Finding a definition for slow and fast through clustering

The first step in our analysis is defining what slow and fast actually mean. The approach in the original post is clustering, which identifies groups of relatively homogeneous data points. Ideally, we would hope that this algorithm is able to collect all slow executions in one cluster and all fast executions in another.
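For readers who want to see the mechanics of clustering itself, here is a minimal one-dimensional k-means sketch on made-up processing times. It is illustrative only (BigML’s clustering handles multi-dimensional data, scaling and missing values for you), and every name and number below is invented for this example:

```python
# Minimal 1-D k-means, for intuition only.
def kmeans_1d(points, k, iterations=20):
    # Spread the initial centroids across the sorted points
    step = max(1, len(points) // k)
    centroids = sorted(points)[::step][:k]
    clusters = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Made-up processing times: mostly fast, with a few slow outliers
times = [0.5, 0.6, 0.7, 0.8, 14.0, 15.0, 16.0, 40.0, 42.0]
centroids, clusters = kmeans_1d(times, 3)
```

With such a toy series, the slow outliers end up grouped away from the bulk of fast executions, which is exactly the separation we are hoping the clustering step gives us.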

In the original post, the author has written a small program to calculate the optimal number of clusters. Then he uses that number as a parameter to actually build the clusters.

The task of estimating the number of clusters to look for is so common that BigML provides a ready-made WhizzML script that does exactly that: Compute best K-means Clustering. Alternatively, BigML also provides the G-means algorithm for clustering, which is able to automatically identify the optimal number of clusters. For our analysis, we will use the Compute best K-means Clustering script, following these steps:

  1. Create a dataset in BigML from your CSV file
  2. Execute the Compute best K-means clustering script using that dataset.

We can carry out those steps in a variety of ways, including:

  • Using the BigML Dashboard, which makes it really easy to investigate a problem and build a Machine Learning solution for it in a point-and-click manner.
  • Writing a program that uses the BigML REST API and the bindings that BigML provides for a number of popular languages, such as Python, Node.js, C#, Java, etc.
  • Using bigmler, a command-line tool, which makes it easier to automate ML tasks.
  • Using WhizzML, a server-side DSL (Domain Specific Language) that makes it possible to extend the BigML platform with your custom ML workflows.

We are going to use the BigML bindings for Python as follows:

import webbrowser
from bigml.api import BigML

api = BigML()

source = api.create_source('./post-data.csv')
api.ok(source)
dataset = api.create_dataset(source)
api.ok(dataset)
print("dataset: " + dataset['resource'])
execution = api.create_execution('script/57f50fb57e0a8d5dd200729f',
                                 {"inputs": [
                                     ["dataset", dataset['resource']],
                                     ["k-min", 2],
                                     ["k-max", 10],
                                     ["logf", True],
                                     ["clean", True]
                                 ]})
api.ok(execution)
best_cluster = execution['object']['execution']['result']
webbrowser.open("https://bigml.com/dashboard/" + best_cluster)

The result tells us that we have:

  • Two clusters (green and orange) that contain definitely slow instances.

post-2

post-3

  • The blue cluster includes the majority of instances, both fast and not-so-fast, as its statistical distribution in the cluster detail panel indicates:

post-1

Seemingly, our threshold to distinguish fast tasks from slow tasks points to the green cluster.

At this point, the original post gives up on using clustering as a means to determine a sensible threshold, and reverts to plotting time percentiles against tag count. Luckily for them, the percentile distribution shows a nice bubbling up at the 78th percentile, but in general this kind of analysis may not always yield such obvious distributions. As a matter of fact, detecting such abnormalities can be even harder with multidimensional data.

BigML makes it very simple to further investigate the properties of the above green cluster. We can simply create a dataset including only the data instances belonging to that cluster and then build a simple model to better understand its characteristics:

centroid_dataset = api.create_dataset(best_cluster, { 'centroid' : '000000' })
api.ok(centroid_dataset)

centroid_model = api.create_model(centroid_dataset)
api.ok(centroid_model)

webbrowser.open("https://bigml.com/dashboard/" + centroid_model['resource'])

This, in turn, produces the following model:

post-4

If you inspect the properties of the tree nodes, you can see that the tree quickly splits into two subtrees, with all nodes in the left-hand subtree having processing times lower than 14.88 sec and all nodes in the right-hand subtree having processing times greater than 16.13 sec.

post-5

This suggests that a good choice for the threshold between fast and slow can be approximately 15.5 sec.
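That choice is simply the midpoint of the gap the tree exposes, as a quick back-of-the-envelope check shows (the two bounds below are read off the model above):

```python
# Midpoint of the gap between the two subtrees found by the model
fast_upper = 14.88   # largest processing time in the left subtree
slow_lower = 16.13   # smallest processing time in the right subtree
threshold = (fast_upper + slow_lower) / 2
print(threshold)   # roughly 15.5 sec
```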

If we follow along the same steps as in the original post and apply the percentile analysis to our data instances here, we arrive at the following distribution:

percentiles

This distribution clearly starts growing faster from the 88th to the 89th percentile, confirming our choice of threshold:

table

To summarize, we have found a comparable result by applying a much more generalizable analysis approach.
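As a sketch of what that percentile analysis looks like in plain Python, here is a fabricated series in which 88% of the tasks are fast; the `percentile` helper is a simple nearest-rank implementation written just for this example:

```python
def percentile(values, p):
    # Nearest-rank percentile over a sorted copy of the values
    s = sorted(values)
    idx = min(len(s) - 1, int(round(p / 100 * (len(s) - 1))))
    return s[idx]

# Fabricated processing times: 88% fast, 12% slow
times = [0.5] * 88 + [20.0] * 12

# Look for the largest jump between consecutive percentiles
jumps = [(p, percentile(times, p + 1) - percentile(times, p))
         for p in range(80, 95)]
boundary = max(jumps, key=lambda t: t[1])[0]
print(boundary)
```

On this toy data the largest jump sits at the 88th percentile, mirroring the shape of the distribution above.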

Feature engineering

With the proper threshold identified, we can mark all data instances with running times lower than 15.5 sec as fast and the rest as slow. This is another task that BigML can tackle easily via its built-in feature engineering capabilities on the BigML Dashboard:

post-10

Alternatively, we can do the same in Python:

extended_dataset = api.create_dataset(dataset, {
    "new_fields" : [
        { "field" : '(if (< (f "time") 15.5) "fast" "slow")',
          "name" : "processing_speed" }]})

webbrowser.open("https://bigml.com/dashboard/" + extended_dataset['resource'])

Which produces the following dataset:

post-11

Predicting slow and fast instances

Once we have all our instances labeled as fast or slow, we can finally build a model to predict whether an unseen instance will be fast or slow to process. The following code re-creates our extended_dataset so that such a model can be built from it:

extended_dataset = api.create_dataset(dataset, {
    "excluded_fields": ["000002"],
    "new_fields": [
        {"field": '(if (< (f "time") 15.5) "fast" "slow")',
         "name": "processing_speed"}]})
api.ok(extended_dataset)
webbrowser.open("https://bigml.com/dashboard/" + extended_dataset['resource'])

Notice that we excluded the original time field from our model, since we now rely on our new feature to tell fast instances apart from slow ones. This step yields the following result, which shows a nice split between fast and slow instances at around 9,589 tags (let’s call this _MAX_TAGS_COUNT):

post-12

Admittedly, our example here is pretty trivial. As was the case in the original post, our prediction boils down to this conditional:

def html_too_big(s):
    return s.count('<') > _MAX_TAGS_COUNT

But, what if our dataset were more complex and/or the prediction involved more intricate calculations? This is another situation, where using a Machine Learning platform such as BigML provides an advantage over an ad-hoc solution. With BigML, predicting is just a matter of calling another function provided by our bindings:

from bigml.model import Model

final_model = "model/583dd8897fa04223dc000a0c"
prediction_model = Model(final_model)
prediction1 = prediction_model.predict({
    "html length": 3000,
    "tag count": 1000})
prediction2 = prediction_model.predict({
    "html length": 30000,
    "tag count": 500})

What’s more, predictions are fully local, which means no access to BigML servers is required!

Conclusion

Machine Learning can be used to solve everyday programming tasks. There are certainly different ways to do that, including tools like R and various Python libraries. However, those options have a steeper learning curve: you must master the details of the algorithms inside as well as the glue code to make them work together. One must also take into account the need to maintain such glue code, which can result in considerable technical debt.

BigML, on the other hand, provides practitioners all the tools of the trade in one place in a tightly integrated fashion. BigML covers a wide range of analytics scenarios including initial data exploration, fully automated custom Machine Learning workflows, and production deployment of those solutions on large-scale datasets. A BigML workflow that solves a predictive problem can be easily embedded into a data pipeline, which unlike R or Python libraries does not require any desktop computational resources and can be reproduced on demand.

Predicting the Publication Year of NIPS Papers using Topic Modeling

The Neural Information Processing Systems (NIPS) conference is one of the most important events in Machine Learning. It receives hundreds of papers from researchers all over the world each year. On the occasion of the NIPS conference held in Barcelona last week, Kaggle published a dataset containing all NIPS papers between 1987 and 2016.

We found it an excellent opportunity to put BigML’s latest addition in practice: Topic Models.

Assuming that paper topics evolve gradually over the years, our main goal in this post will be to predict the decade in which the papers were published by using their topics as inputs. Then, by examining the resulting model, we can get a rough idea of which research topics are popular now, but not in the past, and vice-versa.

We will accomplish this in four steps: first, we will transform the data into a Machine Learning-ready format; second, we will create a Topic Model and inspect the results; then, we will add the resulting topics as input fields; finally, we will build the predictive model using the decade as the objective field.

1. The Data

We start by uploading the CSV file “papers” to BigML. BigML creates a source while automatically recognizing the field types and showing a sample of the instances so we can check that the data has been processed correctly. As you can see below, the source contains the title, the authors, the year, the abstracts and the full text for each paper.

source.png

Notice that BigML supports compressed files such as .zip files, so we don’t need to decompress the source file first. Moreover, BigML automatically performs some text analysis that also aids Topic Models (e.g., tokenization, stemming, stop words and case sensitivity cleaning), so you don’t need to worry about any text pre-processing. You can read more about the text options for Topic Models here.

When the source is ready, we can create a dataset, which is a structured version of your data interpretable by a Machine Learning model. We do this by using the 1-click option shown below.

1-click dataset.png

Since we want to predict the publication decade of the NIPS papers, we need to transform the “year” into a categorical field. This field will include four different categories: 80s, 90s, 2000s and 2010s. We can easily do so by clicking the option “Add fields to dataset”.

add_fields.png

Then we need to select “Lisp s-expression” and use the Flatline editor to calculate the decade using the “year” field. We will not cover all the steps to create a field using the Flatline editor, but you can find a detailed explanation in Section 9.1.7 of the datasets document.

flatline.png

The formula we need to insert contains several “If…” statements to group years into decades:

(if (< (f "year") 1990) "80s" (if (< (f "year") 2000) "90s" (if (< (f "year") 2010) "2000s" "2010s")))
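The nested “if” expression above is easy to mistype, so here is the same bucketing rule as a small Python helper (a hypothetical equivalent, just to sanity-check the Flatline logic):

```python
def decade(year):
    """Mirror of the Flatline expression: bucket a year into a decade label."""
    if year < 1990:
        return "80s"
    if year < 2000:
        return "90s"
    if year < 2010:
        return "2000s"
    return "2010s"

# boundary years fall into the later bucket, as in the Flatline version
print([decade(y) for y in (1987, 1990, 1999, 2000, 2010, 2016)])
# → ['80s', '90s', '90s', '2000s', '2010s', '2010s']
```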

When the new field is created, we can find it as the last field in the dataset. By mousing over the histogram we can see the different decades:

dataset2.png

2. Discovering the Topics Underlying the NIPS Papers

Creating a Topic Model in BigML is very easy. You can either use the 1-click option or configure the parameters yourself. To discover the topics for the NIPS papers, we are going to configure the following parameters:

  • Number of top terms: by default, BigML shows the top 10 terms per topic. We prefer to set a higher limit this time, 30 terms, so we have more terms from which to glean the topic themes.
  • Bigrams: we include bigrams in the Topic Model vocabulary since we expect the NIPS reports to show a high number of them, e.g., neural networks, reinforcement learning or computer vision.
  • Excluded terms: we exclude terms such as numbers and variables since they are not significant in delimiting the papers’ thematic boundaries over time and can generate some noise.

topic model conf.png

When the Topic Model is created, you can inspect the topic terms using two different visualizations: the topic map and the term chart. See both in the images below.

topics

chart-bar

You can see the resulting Topic Model and play with the associated BigML interactive visualizations here!

The discovered topics provide a nice overview of most of the major subtopics in Machine Learning research, and we’ve renamed them to make them readable at a glance.  In the “north” of the topic model map, we have topics related to Bayesian and Probabilistic modeling, along with Text/Speech processing and computer vision, which represent domains where those techniques are popular. In the “south”, we get the topics that are heavily tilted towards matrix mathematics, including PCA and the specification of multivariate Gaussian probabilities. In the “west” we have supervised learning and optimization, with topics containing theorem proving along with various occurrences of numbers in this quadrant. In the “east” we have two rather isolated topics corresponding to data structures, specifically trees and graphs. Finally, in the center of the map, we have topics that occur across every discipline:  General AI terms (like “robot”), people talking about the real-world domain that they’re working in, and acknowledgements for collaborators and funding.

With the topics discovered, let’s try to predict the topic distribution for a new document.  A good way to visually analyze the Topic Model predictions is to use BigML Topic Distributions. You can use the corresponding option within the 1-click menu:

topic dist.png

A form containing the fields used to create the Topic Model will be displayed so you can insert any text and get the topic distributions.

We input the following data for our first Topic Distribution:

  • Title: “Deep Learning Models of the Retinal Response to Natural Scenes”
  • Abstract: “A central challenge in sensory neuroscience is to understand neural computations and circuit mechanisms that underlie the encoding of ethologically relevant, natural stimuli. (..) Here we demonstrate that deep convolutional neural networks (CNNs) capture retinal responses to natural scenes nearly to within the variability of a cell’s response, and are markedly more accurate than linear-nonlinear (LN) models and Generalized Linear Models (GLMs). (…) the injection of latent noise sources in intermediate layers enables our model to capture the sub-Poisson spiking variability observed in retinal ganglion cells. (..) Overall, this work demonstrates that CNNs not only accurately capture sensory circuit responses to natural scenes, but also can yield information about the circuit’s internal structure and function.”

The resulting topics, in order of importance, include: Human Visual System (22.15%), Neurobiology (19.89%), Neural Networks (10.77%), Human Cognition (8.72%), Computer Vision (6.96%), and Noise (5.17%), among others with lower probabilities. You can see the resulting probability histogram in the image below:

topic dist 1.png

After making several predictions for different papers, we’re pretty confident that the predictions map fairly well to the judgements a human expert might make. Give it a try yourself with this Topic Model link!

3. Including Topics as Input Fields

At this point, we know that the resulting topics are consistent and the model satisfactorily calculates the different Topic Distributions for the papers. Now, we can try using the topic distributions to predict when each paper was written.

In order to incorporate the different Topic Distributions for all the papers in the dataset, we need to click on the “Batch Topic Distribution” option and select the dataset that contains the field “decade” (the field we created in the first step above).

topic dist 2.png

When the Batch Topic Distribution is created, we can find the resulting dataset containing all the topic distributions as fields.

topic dist 3.png

4. Predicting a Paper’s Decade

Finally, we are ready to build a model to predict any paper’s decade using the topics as inputs.

We first need to split the dataset into training and testing subsets randomly. In this case, we are going to use 80% of the dataset to build a Logistic Regression. For this, we remove all fields except the topics and the paper abstract and select the decade as the objective field.
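The random 80/20 split itself is framework-agnostic; here is a minimal sketch of the idea in Python (the helper name is ours, not part of BigML):

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut them at the given fraction."""
    shuffled = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(100)))
print(len(train), len(test))  # → 80 20
```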

BigML visualizations for Logistic Regression allow us to interpret the most influential topics for predicting the decade. By selecting a topic for the x-axis of the Logistic Regression chart, we can distinguish the topics whose prevalence evolves over time from the more stable ones. The fluctuating topics will be better predictors of the decades than the steadier topics, which will be mostly irrelevant for our supervised model.

For example, we can see in the image below that as the probability of the topic “Circuits/Hardware” increases, it is more likely to appear in papers from the 80s and 90s than in papers from the 21st century. Therefore, it can be an important topic in determining which decade a paper was written in.

LR circuits.png

The topic “Support Vector Machines”, for example, tends to be very frequent in papers from the 2000s while it is less probable in other decades.

LR SVM.png

Other topics like “Small numbers” (which includes all the numbers found in the papers) or “Probabilistic Distributions” tend to have a stable probability throughout the decades. You can observe this in the image below, where the graph lines are pretty flat, i.e., the predicted probabilities for the decades do not change as the topic probability varies.

The results match our expectations nicely, but to objectively measure the overall predictive power of this model, we need to evaluate it against the remaining 20% of the data.

The Logistic Regression evaluation shows around 80% accuracy, which is not bad. However, after trying other classification models, we find that the best performing model is a Bagging ensemble of 300 trees, which achieves an accuracy of 84%. You can see its confusion matrix below.

confusion matrix.png

Here we see that the most difficult decade to predict is the 80s, very likely due to the smaller number of papers (57 in total) in the sample as compared to the other decades.
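For reference, the accuracy figures quoted above come straight from the diagonal of a confusion matrix; a quick sketch with made-up counts (illustrative only, not the actual BigML results):

```python
def accuracy(confusion):
    """Fraction of correct predictions: the diagonal over the grand total."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# illustrative counts for the four decades: 80s, 90s, 2000s, 2010s
matrix = [[30,  20,   5,   2],
          [10, 300,  40,  10],
          [ 2,  30, 500,  60],
          [ 1,   5,  70, 600]]
print(round(accuracy(matrix), 2))  # → 0.85
```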

To improve the model performance further, we can try some more feature engineering, such as adding the length of the text, the authors, the number of papers per author, or various extracted entities like the university or country of publication.

We encourage you to delve into this fun dataset, and let us know of ways to improve the model. If you haven’t got a BigML account yet, it’s high time you got one. Sign up now, it’s free!

Brazilian Summer School in Machine Learning and AI Startup Battle in the Books!

The BigML Team has travelled to São Paulo, Brazil, to conduct another edition of our series of Machine Learning schools as part of our ongoing educational activities to help democratize Machine Learning not only across industries and job functions but also across geographies. Despite the fact that this was not our first such experience, we keep being positively surprised by the enthusiasm with which these hands-on training sessions are received. The Brazilian Summer School in Machine Learning was no exception in this regard.

It’s safe to say that Brazilian techies answered our call in a big way. We received more than 450 applications from 9 different countries to join this event. However, due to space and travel/visa constraints, we had to cap attendance at 202 attendees representing 6 different states in Brazil (173 from São Paulo, 11 from Minas Gerais, 10 from Rio de Janeiro, 4 from Santa Catarina, 2 from Paraná, and 2 from Rio Grande do Sul). There was a full house at the VIVO Auditorium in São Paulo. Check out all the pictures on Flickr, Google+, and Facebook.

group

The two-day program was packed with topics such as supervised and unsupervised learning techniques, feature engineering, how to get your data ready for Machine Learning, as well as automating Machine Learning workflows.

Artificial Intelligence Startup Battle

Following in the footsteps of the inaugural AI Startup Battle that took place in Valencia, Spain, on March 15, 2016, and the second edition in Boston on October 12, 2016, BSSML16 closed with the third edition of the Artificial Intelligence Startup Battle. Brazilian media outlets covered the battle, since it was the first time in history that Brazil held a contest where the jury was a Machine Learning algorithm that predicted the probability of success of an early stage startup. The four competitors (Dataholics, PayGo Energy, Prognoos, and Sppin-Kapputo) and Saffe Payments took to the stage, although Saffe Payments did not participate in the competition because they are already part of the Wayra academy.

_s3x3051

Contenders of the AI Startup Battle at the BSSML16. From left to right: Guilherme Paiva, Co-Founder and CEO of Sppin-Kapputo; Mark O’Keefe, Co-Founder of PayGo Energy; Daniel Mendes, Founder and CEO of Dataholics; Raul Magno, Co-Founder and CEO of Prognoos; and Renato Valente, Country Manager of Telefónica Open Future_ in Brazil.

The PreSeries Machine Learning algorithm interviewed the contenders until it had enough information to provide a score between 0 and 100. The winning company, with a score of 92.33, was Prognoos, a startup from São Paulo that has built an artificial intelligence platform applying e-commerce user interaction and browsing data to personalize their buying experience through its proprietary algorithm.  This startup is being invited to Telefónica Open Future_’s accelerator to enjoy access to the Wayra Academy (for up to six months) and to Wayra services and network of contacts e.g., training, coaching, a global network of talent, as well as the opportunity to reach many Telefónica enterprises in Brazil and abroad. After six months, the winning company will be evaluated and may apply to run for a full Wayra acceleration process, including up to USD $50,000 convertible note loan (versus a possible 7 to 10% equity).

_s3x3029

Prognoos, winner of the AI Startup Battle at the BSSML16, represented by Raul Magno, its Co-Founder and CEO (left). Renato Valente, Country Manager of Telefónica Open Future_ in Brazil (right).

Second place went to PayGo Energy, from Nairobi, Kenya, with a score of 71.90; they seek to democratize LPG (Liquefied Petroleum Gas or Propane) for the 2.9 billion people worldwide who lack access to clean cooking fuel. Third place went to Dataholics, from São Paulo, with a score of 39.72; they focus on providing a solution to detect the products and services that fit a given consumer profile based on their social media and demographic information. And finally, fourth place went to Sppin-Kapputo, from Belo Horizonte in Brazil, with a score of 28.14; this company is an information broker that uses Machine Learning to allow real estate investors and construction companies to make better decisions by relying on analytics tools and prediction models that evaluate the impact of construction on a real estate market.

At the end of the event, BigML’s CEO and President of PreSeries, Francisco J. Martin, highlighted: “We already knew from the growing number of active BigML users in Brazil that the region holds tremendous potential due to an abundance of young minds hungry to learn, as well as world-class academics in Machine Learning and AI. This week was further testament that geographic barriers are no longer strong enough to prevent the spread of innovative and ambitious ideas that consider not only their local market but the whole world as the target audience for their data-driven smart applications.”

The next edition of our Machine Learning schools and AI Startup Battles will take place soon, so stay tuned for new announcements on Twitter (@bigmlcom) and other social media channels: LinkedIn, Facebook, and Google+.

Looking forward to seeing you again in future editions of our Machine Learning training events and AI Startup Battles around the world.

Brazilian AI Startup Battle: Meet the Contenders!

As a thriving innovation hub, São Paulo’s economic impact can be felt all across Brazil, South America and even the world. The city that is colloquially known as Sampa or Terra da Garoa (Land of Drizzle) will also be the host to the third installment of the AI Startup Battle powered by PreSeries and Telefónica Open Future_.

battle_brazil

The AI Startup Battle is a one-of-a-kind startup contest where five early-stage ventures compete on stage and are judged and ranked by Artificial Intelligence (AI). The AI is completely autonomous: no human intervention compromises the bias-free algorithm. How does it work? The contenders first pitch their startup and are then asked questions by the AI live on stage. All the information gathered is then processed by the AI to generate a score from 0 to 100, which represents the startup’s estimated long-term likelihood of success. Paving the way for more quantifiable early-stage investing, previous editions of the battle were held in Valencia, Spain and Boston, USA as part of the PAPIs conferences.

The upcoming edition of the AI Startup Battle will be part of the Brazilian Summer School in Machine Learning 2016 (BSSML16). The contest is taking place on December 9 at the VIVO auditorium in São Paulo and is part of a series of Machine Learning courses organized by BigML in collaboration with VIVO and Telefónica Open Future_. BSSML16 is a two-day course for industry practitioners, advanced undergraduates, and graduate students seeking a fast-paced, practical, and hands-on introduction to Machine Learning. The Summer School will also serve as an ideal introduction to the kind of work that students can expect if they enroll in an advanced Machine Learning masters program.

Meet the contenders of the 1st Brazilian edition

PayGo Energy

screen-shot-2016-12-06-at-1-03-59-pm

PayGo Energy from Nairobi, Kenya believes in equal opportunities and seeks to democratize LPG (Liquefied Petroleum Gas or Propane) for the 2.9 billion people worldwide who lack access to clean cooking fuel. PayGo Energy allows families to purchase gas in small amounts, making LPG more affordable than ever. They operate on a pay-as-you-go basis. Their micro-payment structure critically aligns with existing consumer spending habits to overcome current cost barriers and enable access to a stable supply of cooking gas.

Dataholics

screen-shot-2016-12-06-at-1-05-50-pm

DATAHOLICS from São Paulo, Brazil provides a solution to detect the products and services that fit a given consumer profile based on his/her social media and demographic information. This rich data significantly improves the targeting of direct marketing campaigns, product recommendations for e-commerce, market research, database enrichment and the generation of highly qualified trade leads.

Prognoos

screen-shot-2016-12-06-at-1-18-32-pm

Prognoos built an artificial intelligence platform with a very low operational cost. Their first product presents a unique browsing experience with matchmaking algorithms. It uses e-commerce user interaction and browsing data to personalize their buying experience through its proprietary algorithm, assuring a good match between the ideal product to the right customer with the aim of decreasing the churn rate.

Kapputo

screen-shot-2016-12-06-at-1-15-21-pm

Kapputo is an information broker that uses Big Data and Machine Learning to allow real estate investors and construction companies to make better decisions by relying on analytics tools and prediction models that evaluate the impact of construction on a real estate market.

Saffe

Screen Shot 2016-12-07 at 5.07.56 PM.png

What if you could pay with just a selfie? Saffe Payments is a mobile payment app that leverages world-class facial recognition technology to make your life easier and more secure.

Good luck to all participants, and stay tuned for the results of Brazil’s first-ever AI Startup Battle!

BigML Fall 2016 Release Webinar Video is Here!

Thank you to all webinar attendees for the active feedback and questions about BigML’s Fall 2016 Release that includes Topic Models, our latest resource that helps you find thematically related terms in your text data. Our implementation of the underlying Latent Dirichlet Allocation (LDA) technique, one of the most popular probabilistic methods for topic modeling tasks, is now available from the BigML Dashboard and API. As is the case for any Machine Learning workflow, you can also automate your Topic Model workflows with WhizzML!

If you missed the webinar, it’s not a problem. Now you can watch the complete session on the BigML Youtube channel.

Please visit our dedicated Fall 2016 Release page for more resources, including:

  • The Topic Models documentation to learn how to create, interpret and make predictions from the BigML Dashboard and the BigML API.

  • The series of six blog posts that explain Topic Models step by step, starting with the basics and wrapping up with the mathematical insights of the LDA algorithm.

Many thanks for your time and attention. We are looking forward to bringing you our next release!

Who Wants to Know the Inner Workings of LDA?

In our recent series of blog posts on Topic Models, we’ve explored this powerful new resource in the BigML Dashboard, in the API, and using WhizzML, and we have also suggested some uses for it. But we’ve left a nuts-and-bolts description of how Latent Dirichlet Allocation (LDA) works until the end. In this post, the last of the series of six, we’ll give you exactly that: a high-level overview of the internal mathematics that underlies Topic Models, and what that mathematics might imply for you, the modeler.

david-blei-lda

David M. Blei – one of the creators of LDA, together with Andrew Y. Ng and Michael I. Jordan.

While I’ll explain a few things here, a more precise and technical explanation is available from David Blei, one of the inventors of the technique. Where there seems to be a conflict between his explanation and mine, rest assured, his is correct!

My Generation

A crucially important aspect of the topic models learned by Latent Dirichlet Allocation is that they are generative models. This is different from many of our models at BigML.  Decision trees and logistic regressions are discriminative models.  This means essentially that they spend their effort using the data to model the boundary between classes, directly approximating the function of interest without much concern about exactly why the data is the way it is.

Generative models are a bit different. Typically, generative models posit a statistical structure that is said to have generated the data. The modeling process is then a process of using the data to fit the parameters of that structure so that the structure is likely to have generated the data that we see.

More concretely, Latent Dirichlet Allocation imagines that each document is a distribution over topics in your dataset, like {Topic 1: 0.8, Topic 2: 0.0, Topic 3: 0.1, Topic 4: 0.1}. Each of those topics, as we know, is a distribution over words, like {President: 0.5, united: 0.25, states: 0.25}.  Now, to generate a document from that topic distribution, we first choose a random topic according to those probabilities (so we’d be very likely to choose topic 1, unlikely to choose topics 3 or 4, and would never choose topic 2). Once we’ve chosen our topic, we choose a word from that topic (so we’d choose “President” with high probability, and “united” or “states” with lower probability).  That gives us a single word in our document. Then we repeat the process over and over again until we have however long of a document we’d like. Et Voilà, we’ve generated a document from a topic distribution!

The astute reader will notice we’ve glossed over where the topic distribution for our document came from. There’s actually another meta-distribution from which we choose these topic distributions. So each document in our collection is a random choice of a topic distribution from the overlying distribution over topic distributions. With that, we’ve got a structure that could possibly generate our entire document collection, if all of those distributions are just right.
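The generative story above can be sketched in a few lines of Python (the distributions below are toy values assumed for illustration, not learned by any model):

```python
import random

# a document's distribution over topics, and each topic's distribution over words
doc_topics = {"Topic 1": 0.8, "Topic 2": 0.0, "Topic 3": 0.1, "Topic 4": 0.1}
topic_words = {
    "Topic 1": {"president": 0.5, "united": 0.25, "states": 0.25},
    "Topic 2": {"neural": 0.6, "network": 0.4},
    "Topic 3": {"matrix": 0.7, "gaussian": 0.3},
    "Topic 4": {"robot": 0.5, "sensor": 0.5},
}

def sample(dist):
    """Draw one key from a {item: probability} mapping."""
    return random.choices(list(dist), weights=dist.values(), k=1)[0]

def generate_document(n_words):
    # for each word: first pick a topic, then pick a word from that topic
    return [sample(topic_words[sample(doc_topics)]) for _ in range(n_words)]

print(generate_document(10))
```

Note that "Topic 2" has probability zero in this document, so its words ("neural", "network") can never appear, exactly as described above.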

Another Tricky Day

As we mentioned earlier, learning proceeds in a somewhat backwards way from how we think about the generative structure: that is, the model gives us a procedure to generate documents given the distributions, but we already have the documents and need to infer the parameters of the model that probably generated them.

This is a tricky proposition so I won’t bore you with the details, but there are techniques like collapsed Gibbs sampling and variational methods that allow you to do this sort of “reverse inference” in generative models, and that’s exactly what we use in practice for our Topic Models.

The Seeker

So how can this knowledge affect the way you use BigML topic models? One of the key things to remember when using Topic Models is that the learning process is trying to use the generative structure above to explain how your document collection came to be. This means that if there’s a certain bunch of terms that occur all the time in your documents, the model will spend a lot of effort trying to explain how they got there, and might end up ignoring terms that are less frequent.

Why might this be bad?  Mostly because text data often tends to be “dirty”, with a lot of cruft that you don’t care about modeling. A great example is web pages. There’s a lot of nice information on web pages, but you’ll find if you pass raw web page source to Topic Models, the terms it finds most important will be things like “html”, “span”, “div” and “href”: Because these formatting directives appear all the time in web pages and in different quantities, the model will spend lots of effort trying to explain the differences in the occurrence rates of these tokens. Maybe that’s what you’re looking for; if you’ve got a dataset that is half web pages and half e-mail messages, the tokens that denote web pages might be useful indeed. It might also be nice to know which pages have more links. Then again, maybe you don’t care about HTML tags at all.

BigML attempts to remove tokens that occur “too frequently to be useful” when doing topic modeling, but it’s often specific to the use case whether or not something that’s frequently occurring is very important or just noise. But this is a problem that’s easy enough for you to fix: You can just exclude those useless terms from your dataset, either by pre-processing before you upload it, using Flatline, or using the excluded_terms parameter during the modeling process. You’ll often find that eliminating the half-dozen or so most common terms will yield a very different model; you’ll have changed the composition of your collection so much that other terms will have become far more relevant on a relative basis than they were the first time around.
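The pre-processing route can be as simple as dropping the most common tokens across the collection before uploading; a minimal sketch (the helper name is hypothetical, and real tokenization would be more involved):

```python
from collections import Counter

def drop_top_terms(documents, n=6):
    """Remove the n most frequent tokens across a tokenized collection."""
    counts = Counter(tok for doc in documents for tok in doc)
    stop = {term for term, _ in counts.most_common(n)}
    return [[tok for tok in doc if tok not in stop] for doc in documents]

docs = [["html", "span", "div", "machine", "learning"],
        ["html", "div", "href", "neural", "network"],
        ["span", "div", "html", "gibbs", "sampling"]]
cleaned = drop_top_terms(docs, n=3)
print(cleaned)
# → [['machine', 'learning'], ['href', 'neural', 'network'], ['gibbs', 'sampling']]
```

The same effect can be achieved without pre-processing by passing the offending terms to BigML’s excluded_terms parameter, as mentioned above.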

Eyesight to the Blind

With a little understanding of how topic models work, you can create them with open eyes and maybe even improve your results by tweaking the data just a bit. Give them a try and, as usual, please drop us a line at support@bigml.com if you’re having trouble, or even if you just have some interesting data and want to show off the results.  Good luck!

If you want to know more about this new resource, please visit the Topic Models release page and check our documentation to create Topic Models, interpret them, and predict with them through the BigML Dashboard and API, as well as the six blog posts of this series.

P.S. – Extra credit trivia question: What is notable about the names of all the subsections in this post?

BigML and CICE Join Forces to Revolutionize Machine Learning Education

Democratizing Machine Learning has always been BigML’s founding mission, so we are continually searching for new opportunities. As such, when a company is interested in our technology and is willing to help us further our cause of “Machine Learning for everyone”, we feel the urge to collaborate. This is exactly what happened with our new education partner. Today we are happy to announce our educational collaboration with CICE, the Leading School in New Technologies Training in Madrid, Spain.

CICE, the only Official Training Center in Spain for more than 20 multinational companies, already hosts a community of 70,000+ students from 30 different countries. With 35 years of experience, the school provides high-quality, official training programs and certifications from leading companies, thanks to certified teaching professionals recruited from the most prestigious Spanish production companies. In fact, the CICE team is currently going through BigML’s new certification program, and they will soon be among the first batch of BigML Certified Engineers.

cice-team-ok

CICE aims to drive constant educational innovation as a protagonist and agent of the deep digital transformation our society is going through. They have already set exemplary standards for the fast-growing market segment of New Technologies Education through a mix of:

  • The best instructors: certified professionals with wide experience and proven ability to teach.
  • The best facilities: a heterogeneous system of professional Apple and DELL workstations along with a set of cloud services that provide the best educational experience.
  • The best homologations: a set of alliances that enables CICE to provide a unique learning environment guaranteeing a learning process that fulfills students’ expectations, in a way only the leaders of the educational sector can.
  • The greatest quantity and quality of production: an excellent education at CICE, combined with the effort and talent of the students they train, turns into a harvest of projects that have made CICE a winner of high-profile national and international competitions.

CICE’s enthusiasm for the BigML platform and their eagerness to improve Machine Learning education were big reasons why we agreed to collaborate. CICE has already joined our education program, which has more than 100 ambassadors and 620+ universities affiliated around the world that actively promote BigML.  They have started spreading the word through their students, social media, and at the education events they run in Madrid.

foto-2

We are excited about the official partnership and look forward to making a big difference together in graduating the future data-driven leaders of the 21st century’s global digital economy.

Automated Topic Modeling Workflows Done Right

This series of posts started by introducing Topic Models as BigML’s implementation of Latent Dirichlet Allocation (LDA) to help discover thematically related terms in unstructured text data. We later explained how to use it through the BigML Dashboard, showed how to apply Topic Models in a real-life use case, and how to program Topic Models using the BigML API. This post will focus on automating LDA workflows by using WhizzML, a DSL for Machine Learning that helps automate workflows, program high-level algorithms, and share workflows and algorithms with others.

Let’s dive in by creating a Topic Model and making a prediction with it. In BigML, you can perform single instance predictions (referred to as a Topic Distribution) or in batch mode, which is called Batch Topic Distribution. 

Firstly, we will create a Topic Model without specifying any particular configuration option, that is, relying on default settings. For that, you just need to create a script with the source code below:

screen-shot-2016-11-18-at-08-44-13

BigML’s API is mostly asynchronous, so the above creation function will return a response before the Topic Model creation is complete. This implies that the Topic Model is not ready to make predictions right after the above code snippet is executed, so it is convenient to wait for its completion before predicting with it. You can do just that by using the directive “create-and-wait-topicmodel”. See the example below:

screen-shot-2016-11-18-at-08-47-07

Now let’s try configuring a Topic Model via WhizzML. The properties to configure can easily be added to the mapping as property pairs of the form <property_name> and <property_value>. For instance, when calculating a Topic Model for a dataset that contains one or more text fields, BigML automatically determines the number of topics, but if you prefer the maximum number of topics that BigML allows, you should add the property “number_of_topics” and set it to 64. Additionally, if you want your Topic Model to be case sensitive, you need to set the “case_sensitive” property to true. Property names always need to be between quotes and the value should be expressed in the appropriate type. The code for our example can be seen below:

screen-shot-2016-11-18-at-08-56-47

For more details about all available properties please check the API documentation.

Now that you know how to create a Topic Model, let’s see how to create predictions with it. The code is similar to the one used to create any resource. You just need the ID of the Topic Model you want to predict with, and you provide the input data as a map with the new text (or texts) that you want to predict for. The input_data property is a map that uses the field ID as its key. Here’s an example:

Screen Shot 2016-11-18 at 09.19.27.png

This is one of the exceptions to the asynchronous behavior of BigML’s API: the Topic Distribution is ready immediately, without blocking the progress of the script.

In many working scenarios, a batch prediction that allows predictions from a set of new data is more useful than a single prediction. In other words, a Batch Topic Distribution is usually preferred to a Topic Distribution.

It is pretty straightforward to create a Batch Topic Distribution from an existing Topic Model, where the dataset with name ds1 represents a set of rows with the text data to analyze:

screen-shot-2016-11-18-at-09-36-48

It’s likely that before creating a Batch Topic Distribution you will need to configure properties, such as the field mapping between the Topic Model and the dataset with the input data, or you may need to set another property to include the importance of each field as columns in your results. These properties are configured the same way as when creating a Topic Model. The full list of available properties can be found in the API documentation. It’s also important to know that, contrary to single predictions, batch predictions are asynchronous in nature. Below is such an example:

[Screenshot: WhizzML code configuring a Batch Topic Distribution]
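As a hedged sketch of a configured, awaited batch call (the create-and-wait variant and the "fields_map" / "all_fields" property names are assumptions based on BigML's batch prediction options; verify them against the API documentation):

```
;; Sketch: configured Batch Topic Distribution. Batch resources are
;; asynchronous, so we wait for completion before using the result.
;; "fields_map" pairs dataset field IDs with Topic Model field IDs.
(define batch-distribution
  (create-and-wait-batchtopicdistribution
    {"topicmodel" tm-id
     "dataset" ds1
     "fields_map" {"000001" "000001"}
     "all_fields" true}))
```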

Up to this point we have been covering the source code of scripts. To write that code in the BigML Dashboard, you can use a handy editor that supports syntax highlighting, auto-completion, and code formatting. Take it for a spin and build your scripts quickly.

[Screenshot: the WhizzML script editor in the BigML Dashboard]

Nevertheless, if you are more at home with APIs, you can find more answers about script creation here. To obtain results from WhizzML scripts we first need to run them, so let’s see how to execute a Script in the Dashboard, and then how to carry out the same process by calling the right endpoint directly in the API.

First, the BigML Dashboard option: Look for your new script in the scripts list and click on it.

[Screenshot: the scripts list in the BigML Dashboard]

You will see a page like the one below, showing the inputs you need to fill in before running the script. For instance, for the Topic Model creation script we described at the beginning of this post, you just need to select from the dropdown the dataset you want to build your Topic Model from.

[Screenshot: the script execution page with its inputs]

You must fill in all the input fields marked with a grey icon until they are validated with a green icon (empty values are only accepted for text inputs).

[Screenshot: validated script inputs]

Finally, we will focus on how to execute a script through the API. To do this, you need to compose a POST request with JSON content to the /execution endpoint with two parameters: one is the ID of the script you previously created, and the other is “inputs”, a list of pairs that follow the schema <input_name> <input_value>. It must include every input that doesn’t have a default value defined in your script. Let’s see an example to showcase this idea: for the first script we used above, the input you need to fill in is “ds1”, the identifier of the dataset you want to use to create a Topic Model. The complete request to the BigML API should be similar to this curl example:

 curl "https://bigml.io/execution?$BIGML_AUTH" \
      -X POST \
      -H 'content-type: application/json' \
      -d '{"script": "script/55f007d21f386f5199000003",
           "inputs": [["ds1", "dataset/55f007d21f386f5199000000"]]}'

We hope you enjoyed this quick tour of executing a script through the BigML API. For a more extensive list of execution parameters and how to access the execution results, please visit the corresponding section of the API documentation. Note that we didn’t dive into authentication, but it is described here. Finally, for an extensive description of WhizzML, you can visit the WhizzML page.

In the next blog post we will explore the mathematics underlying Topic Models and what it might imply for you, the modeler.

Would you like to know more about Topic Models? Visit the Topic Models release page and check our documentation to create Topic Models, interpret them, and predict with them through the BigML Dashboard and API, as well as the six blog posts of this series.
