The promise of voice recognition has been around for a long time, but it has always been quite miserable. In fact, just back in 2012 my daughter and I were helping my mother purchase a new car. I paired my phone to the in-car audio and tried to dial home. After several attempts, we were nearly in tears from laughing at how impossibly bad it was at recognizing the spoken phone number.
However, in the last few years, advances in Machine Learning have improved the capability of voice recognition dramatically; see for example the section about the history of Siri here. Even more importantly, the availability of voice recognition APIs like Amazon’s Alexa Voice Service has made the rapid adoption of voice controlled applications possible.
But what about that moment in Star Trek IV, The Voyage Home, when Scotty not only expects to be able to speak to the computer, but to have the computer reply intelligently? To get there we need to not only rely on machine learning for voice recognition, but to bring voice recognition to machine learning applications!
As of today, we are one step further along that path with the introduction of the BigML for Alexa skill.
The BigML for Alexa skill combines the predictive power of BigML with the voice processing capabilities of the Alexa Voice Service. Using an Alexa enabled device like an Amazon Echo or Dot, this integration makes it possible to use spoken questions and answers to generate predictions using your own models trained in BigML.
For example, if you have data regarding wine sales with features like the sale month and grape variety, you could build a model which predicts the sales for a given month, variety, etc. With this model loaded into the BigML for Alexa skill, you could generate a sales prediction by answering questions vocally.
If you already have an AVS device like the Amazon Echo, you can quickly get a feel for the capabilities of the BigML Alexa skill in two steps:
First, enable the skill with:
“Alexa, enable the Big M. L. skill”
Then you can run a demo with:
“Alexa, ask Big M. L. to give me a demo”
This will load a model which asks questions about a patient’s diagnostic measurements, like the 2-hour plasma glucose and BMI, and uses your answers to make a prediction about the likelihood of that individual having diabetes. Of course, keep in mind that this is only a demo and is not medical advice!
If you want to try the BigML Alexa skill with your own BigML models, you just need to link the skill to your BigML account:
And then ask to load the latest model with
“Alexa, ask Big M. L. to load the latest model”.
This will load your most recently created model and launch a prediction.
As you start to play with your own models, you may run into some quirks with how field names are spoken, especially if they have punctuation or abbreviations. No worries – you can control how the fields are spoken using the labels and descriptions in your BigML dataset.
How to do this, along with lots of other tips and tricks, can be found in the BigML for Alexa documentation.
Now we just need the formula for transparent aluminum!
With this Summer 2016 Release, BigML is bringing Logistic Regression to the Dashboard. Logistic Regression is a very popular supervised Machine Learning method for solving classification problems, and this upcoming release is the perfect occasion to guide you through it step by step. That is why we are presenting several blog posts to introduce you to this Machine Learning method.
Within this first post you will have a general overview of what Logistic Regression is. In the coming days, we will be complementing this post with five more entries: a second post that will take you through the six necessary steps to get started with Logistic Regression, a third post about how to make predictions with BigML’s Logistic Regression, a fourth blog post about how to create a Logistic Regression using the BigML API and a fifth one that will explain the same process using WhizzML instead, and finally, a sixth post that will analyze the difference between Logistic Regression and Decision Trees.
Let’s get started with Logistic Regression!
Why Logistic Regression?
Before machine learning hit the scene, the go-to tool for statistical modelling was regression analysis. Regressions aim to model the behavior of an objective variable as a combination of effects from a number of predictor variables. Among these tried-and-true techniques is Logistic Regression, which was originally developed by statistician David Cox in 1958. Logistic regression is used to solve classification problems, where the objective is a categorical variable. Let’s see a simple example.
The dataset we’ll work with contains health statistics from 768 people belonging to the Pima Native American ethnic group. Our objective is to model the effect of an individual’s plasma glucose level on whether that individual contracts diabetes. The above scatterplot shows the plasma glucose levels of individuals with and without diabetes. At a glance, we can see that some relationship exists, where higher levels of plasma glucose are indicative of having diabetes. Note that while the x-axis is numeric in the scatterplot, the y-axis is categorical. This means we’ll need to apply a transformation before we can encode this relationship numerically. Rather than relating glucose levels directly to true/false values, we model the probability of diabetes as a function of plasma glucose. Speaking in terms of probability is appropriate because, as we can see in the above graph, there is no clear-cut threshold on plasma glucose beyond which we can say a person will have diabetes. Most individuals with plasma glucose below 80 mg/dL do not have diabetes, while most with levels above 180 mg/dL do. Between those two values, however, there is a significant amount of overlap. We need a way to express this interval of fuzziness flanked by two zones of certainty. For this purpose, we’ll use a function called the logistic function.
In this single-predictor example, our regression function is characterized by only two parameters: the slope of the transition and where the transition point is located along the x-axis. Fitting a logistic regression is simply learning the values of these parameters. The ability to encapsulate the model in only two numbers is one of the main selling points of logistic regression. Check out these slides from the Valencian Summer School in Machine Learning 2016 for more details and examples.
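To make the two parameters concrete, here is a minimal Python sketch of a single-predictor logistic function. The slope and midpoint values are hypothetical, chosen only for illustration, and are not the values fitted to the Pima dataset.

```python
import math

def logistic(x, slope, midpoint):
    """Probability given by a logistic curve with the given slope and transition point."""
    return 1.0 / (1.0 + math.exp(-slope * (x - midpoint)))

# Hypothetical parameter values for illustration only (not fitted to real data).
SLOPE = 0.04       # how sharp the transition is
MIDPOINT = 140.0   # glucose level where the probability crosses 0.5

for glucose in (80, 140, 180):
    print(glucose, round(logistic(glucose, SLOPE, MIDPOINT), 3))
```

A larger slope narrows the zone of overlap, while moving the midpoint shifts the whole curve along the x-axis.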
Having said that, in this first post we won’t go too deep into Logistic Regression. Sometimes simplicity is the key to understanding the basic concepts:
- The aim of a Logistic Regression is to model the probability of an event occurring as a function of the values of the independent variables.
- A Logistic Regression estimates the probability that an event occurs for a randomly selected observation versus the probability of this event not occurring at all.
- A Logistic Regression classifies observations by estimating the probability that an observation is in a particular category.
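The three points above can be sketched in a few lines of Python. The intercept, coefficient and 0.5 threshold below are hypothetical values picked for illustration; the sketch only shows how a probability, its odds, and a class label relate to each other.

```python
import math

def probability(x, intercept, coefficient):
    """Probability of the event for a single predictor value x."""
    return 1.0 / (1.0 + math.exp(-(intercept + coefficient * x)))

def odds(p):
    """Probability of the event occurring versus not occurring."""
    return p / (1.0 - p)

def classify(p, threshold=0.5):
    """Assign a category by comparing the probability with a threshold."""
    return "true" if p >= threshold else "false"

# Hypothetical parameters for illustration only.
INTERCEPT, COEFFICIENT = -5.6, 0.04

p = probability(180, INTERCEPT, COEFFICIENT)
# The log of the odds is linear in the predictor: INTERCEPT + COEFFICIENT * x.
print(round(p, 3), round(odds(p), 3), round(math.log(odds(p)), 3), classify(p))
```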
These videos from Brandon Foltz offer a deeper dive into the essence of Logistic Regression:
Want to know more about Logistic Regression?
Check out our release page for documentation on how to use Logistic Regression with the BigML Dashboard and the BigML API. You can also watch the webinar, see the slideshow and read the other blog posts of this series about Logistic Regression.
BigML’s Summer 2016 Release is here! Join us on Wednesday, September 28, 2016 at 10:00 AM PDT (Portland, Oregon. GMT -07:00) / 7:00 PM CEST (Valencia, Spain. GMT +02:00) for a FREE live webinar to learn about the newest version of BigML. We’ll be diving into Logistic Regression, one of the most popular supervised Machine Learning methods for solving classification problems.
Last Fall we launched Logistic Regressions in the BigML API to let you easily create and download models to your environment for fast, local predictions. With this Summer Release, we go a step further by bringing Logistic Regression to the BigML Dashboard. This new and intuitive Dashboard visualization includes a chart and a coefficients table. The chart lets you analyze the impact of an input field on the objective field predictions, whereas the table shows all the coefficients learned for each of the logistic function variables, which is ideal for inspecting model results and for debugging.
You can plot the impact of the input fields in either one or two dimensions; simply select the desired option with the green slider. On the right-hand-side legend you will see the class probabilities change according to the input fields selected as the axes. You can also set the values for the rest of the input fields using the form below the legend.
The ultimate goal of creating a Logistic Regression is to make predictions with it. You can easily predict single instances using the BigML prediction form: just input the values for the fields used by the Logistic Regression and you will get an immediate response with the predicted class along with its probability. BigML also provides the probabilities for the rest of the classes in the objective field in a visual histogram that changes in real time as you configure the input field values.
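As a rough illustration of how a coefficients table can turn input values into per-class probabilities, here is a small Python sketch. It assumes one logistic function per class whose outputs are normalized to sum to one; that scheme, the field names and all coefficient values are assumptions made for this example, not a description of BigML’s internal computation.

```python
import math

# Hypothetical coefficients table: one intercept and one weight per input field for each class.
COEFFICIENTS = {
    "true": {"intercept": -8.0, "plasma glucose": 0.04, "bmi": 0.09},
    "false": {"intercept": 8.0, "plasma glucose": -0.04, "bmi": -0.09},
}

def class_probabilities(inputs, coefficients):
    """Score each class with its own logistic function, then normalize the scores."""
    scores = {}
    for label, row in coefficients.items():
        z = row["intercept"] + sum(row[field] * value for field, value in inputs.items())
        scores[label] = 1.0 / (1.0 + math.exp(-z))
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

probabilities = class_probabilities({"plasma glucose": 180, "bmi": 35}, COEFFICIENTS)
predicted = max(probabilities, key=probabilities.get)
print(predicted, probabilities)
```

Changing a single input value and recomputing the dictionary mirrors what the histogram in the prediction form does as you move the field controls.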
In addition to commercial activities, BigML plays an active role in promoting Machine Learning for education. With special offers, our education program is rapidly expanding thanks to the participation of top universities all around the world. We would like to spread the word to more students, professors and academic researchers with your help, so please feel free to refer your fellow educators.
Are you ready to discover all you can do with Logistic Regressions? Join us on Wednesday, September 28, 2016 at 10:00 AM PDT (Portland, Oregon. GMT -07:00) / 7:00 PM CEST (Valencia, Spain. GMT +02:00). Be sure to reserve your free spot today as space is limited! We will also be giving away BigML t-shirts to those who submit questions during the webinar. Don’t forget to request yours!
This week over 140 attendees representing 53 companies and 21 academic organizations from 19 countries gathered in Valencia to get their hands dirty with a curriculum jam-packed with practical Machine Learning techniques and case studies that they can put to good use where they work or teach.
The regularly scheduled sessions were capped with an additional surprise talk from BigML’s Strategic Advisor Professor Geoff Webb about Multiple Test Correction for Streams and Cascades of Statistical Hypothesis Tests that he has been developing as part of his recent academic research.
The diverse backgrounds of the attendees, together with their active participation and willingness to absorb Machine Learning knowledge, have shown that the proverbial Machine Learning genie is out of the bottle, never again to be confined solely to small academic and scientific circles. The writing is already on the wall. In today’s knowledge economy, driven increasingly by smart applications, Machine Learning is no longer an elective. Rather, it’s one of the main courses to be mastered by developers, engineers, information technology professionals, analysts, and even hands-on functional specialists from areas as varied as marketing, sales, supply chain, operations, finance or human resources. We thank all of our graduates for their enthusiasm as well as for their valuable feedback, which taught us a few new things in the process.
As the BigML family, we wish to stay connected for new editions of our training events to be held in larger and larger venues! THANK YOU VERY MUCH!
Telefónica Open Future_, Telefónica’s startup accelerator that helps the best entrepreneurs grow and build successful businesses, and PAPIs.io invite you to participate in the Artificial Intelligence Startup Battle of PAPIs ‘16, the 3rd International Conference on Predictive Applications and APIs, to be held in Boston on October 12 at the Microsoft New England Research and Development Center.
Artificial Intelligence (AI) has a track-record of improving the way we make decisions. So why not use it to decide which startups to invest in, and take advantage of all the startup data that is available? The AI Startup Battle, powered by PreSeries (a joint venture between BigML and Telefónica Open Future_), is a unique experience you don’t want to miss, where you’ll witness real-world and high-stakes AI.
As an early stage startup, you will enjoy a great opportunity to secure seed investment, and get press coverage in one of the technology capitals of the world. On the other hand, attendees will discover disruptive innovation from the most promising startups in AI, as the winner will be chosen by an impartial algorithm that evaluates startups’ chances of success based on signals derived from decades of entrepreneurial undertakings.
Want to compete in the AI Startup Battle?
If you are a startup with applied AI and Machine Learning as a core component of your offering, then we’ll be happy to meet you! Submit your application and if you are selected, you’ll be able to pitch on stage, make connections at the conference, and get unique exposure among a highly distinguished audience.
Five Artificial Intelligence startups will be selected to present their projects on stage on October 12 at the closing of the PAPIs ‘16 conference. They will be automatically judged by an application that uses a Machine Learning algorithm to predict the probability of success of a startup, without human intervention.
The five startups selected to present will get a free exhibitor package at PAPIs worth $4,000 each.
The winner of the battle will be invited to Telefónica Open Future_’s acceleration program and will receive funding of up to $50,000. The winner will enjoy not only an incredible place to work but also access to mentors, business partners and a global network of talent, as well as the opportunity to reach millions of Telefónica customers.
NOTE: During the acceleration program, part of the startup team should be working from one of the countries where Telefónica operates (i.e., Argentina, Brazil, Chile, Colombia, Germany, Mexico, Peru, Spain, United Kingdom, or Venezuela).
Disrupting early stage technology investments
It seems like only yesterday that Telefónica Open Future_ sponsored the world’s first AI Startup Battle last March in Valencia, where we first introduced PreSeries to the world. The winner of that world-premiere AI Startup Battle was the Madrid-based Novelti, which scored a competition-high 86 points. The event was the warm-up round to the upcoming one in Boston. This time we will discover some of the most promising early stage startups in Artificial Intelligence and Machine Learning in North America, Europe and the rest of the world.
As a bonus, the minds behind PreSeries will take center stage to speak about their technology architecture, e.g., the supporting data, the training of the model, and its evaluation framework. It is a rare opportunity to find out what goes on behind the scenes in delivering this innovative real-life predictive application.
Startup Battle highlights:
- AI Startup Battle.
- Wednesday, October 12, 2016 from 5:00 PM to 6:30 PM (EDT).
- Microsoft New England Research and Development Center – 1 Memorial Dr #1 1st floor, Cambridge, Massachusetts, 02142. USA.
- To apply to present at the battle, please fill out the application form before September 29th. Spots are limited and will be awarded on a first-come, first-served basis.
- To attend the AI battle please register to get your FREE ticket.
The setting of the AI Startup Battle could not be more innovative. The battle is part of PAPIs ’16, the 3rd International Conference on Predictive Applications and APIs. It is a community conference dedicated to real-world Machine Learning and related intelligent applications. Subject-matter experts and leading practitioners from around the world will fly to Boston to discuss new developments, opportunities and challenges in this rapidly evolving space. The conference features tutorials and talks for all levels of experience, and networking events to help you connect with speakers, exhibitors, and other attendees.
As the PAPIs conference series makes its debut in the United States, there are also some changes, including a pre-conference training day on October 10, 2016. The curriculum includes Operational Machine Learning with Open Source & Cloud Platforms, where participants will learn about the possibilities of Machine Learning, how to create predictive models from data, and how to operationalize and evaluate them.
Registration to PAPIs ’16 is separate. You can find out more details here.
BigML and Las Naves are getting ready to host the 2nd Machine Learning Summer School in Valencia (September 8-9), which is fully booked. Although we are not able to extend any new invitations for the Summer School, we are happy to share that BigML’s Strategic Advisor Professor Geoff Webb (Monash University, Melbourne) will be giving an open talk on September 8th at the end of the first day of the Summer School. All MLVLC meetup members are cordially invited to attend this talk, which will start promptly at 6:30 PM CEST, in Las Naves. After Professor Webb’s talk, there will be time allocated for free drinks and networking. Below are the details of this unique talk.
A multiple test correction for streams and cascades of statistical hypothesis tests
Statistical hypothesis testing is a popular and powerful tool for inferring knowledge from data. For every such test performed, there is always a non-zero probability of making a false discovery, i.e. rejecting a null hypothesis in error. Family-wise error rate (FWER) is the probability of making at least one false discovery during an inference process. The expected FWER grows exponentially with the number of hypothesis tests that are performed, almost guaranteeing that an error will be committed if the number of tests is big enough and the risk is not managed; a problem known as the multiple testing problem. State-of-the-art methods for controlling FWER in multiple comparison settings require that the set of hypotheses be predetermined. This greatly hinders statistical testing for many modern applications of statistical inference, such as model selection, because neither the set of hypotheses that will be tested, nor even the number of hypotheses, can be known in advance.
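The growth of the FWER is easy to see numerically. The Python sketch below assumes independent tests that each use the same significance level, which is a simplification made for this example. It also shows the classic Bonferroni correction purely as a familiar instance of a method that needs the number of tests up front; it is not the Subfamilywise Multiple Testing procedure discussed in the talk.

```python
def family_wise_error_rate(alpha, num_tests):
    """Probability of at least one false discovery among independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests

def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level that keeps the FWER at or below alpha.

    Note that num_tests must be known before testing starts.
    """
    return alpha / num_tests

for num_tests in (1, 10, 100):
    uncorrected = family_wise_error_rate(0.05, num_tests)
    corrected = family_wise_error_rate(bonferroni_threshold(0.05, num_tests), num_tests)
    print(num_tests, round(uncorrected, 3), round(corrected, 3))
```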
Subfamilywise Multiple Testing is a multiple-testing correction that can be used in applications for which there are repeated pools of null hypotheses from each of which a single null hypothesis is to be rejected and neither the specific hypotheses nor their number are known until the final rejection decision is completed.
To demonstrate the importance and relevance of this work to current machine learning problems, Professor Webb and co-authors further refine the theory to the problem of model selection and show how to use Subfamilywise Multiple Testing for learning graphical models.
They assess its ability to discover graphical models on more than 7,000 datasets, studying the ability of Subfamilywise Multiple Testing to outperform the state-of-the-art on data with varying size and dimensionality, as well as with varying density and power of the present correlations. Subfamilywise Multiple Testing provides a significant improvement in statistical efficiency, often requiring only half as much data to discover the same model, while strictly controlling FWER.
Please RSVP for this talk soon and be sure to take advantage of this unique chance to learn more about this cutting edge technique, while joining our Summer School attendees from around the world for a stimulating session of networking afterwards.
With no lack of drama both on and off the track, the 31st Summer Olympics and its 39 events have recently wrapped up. As the city of Rio is preparing for the first Paralympic Games to take place in South America, some are experiencing Synchronized Swimming, Canoe Slalom and Modern Pentathlon withdrawal symptoms. While Usain Bolt, Michael Phelps and Simone Biles stole the show, Silicon Valley has not just quietly sat and watched the proceedings. Not at all. In fact, VCs, investment bankers and tech giants active in the Machine Learning space have been in a race of their own that goes on unabated even if they don’t get the benefit of prime time NBC TV coverage.
Machine Learning as strategic weapon
It is fair to say that we have been witnessing the unfolding of the ghost Olympic event of Machine Learning startup acquisition. The business community’s scores are not fully revealed yet and the acquisition amounts are mostly being kept under wraps, albeit in a leaky kind of way. Regardless, the most recent deals include Apple acquiring Turi and Gliimpse, Salesforce purchasing BeyondCore, Intel picking up Nervana Systems, and Genee being scooped up by Microsoft. So what is driving this recent surge?
The bulk of the M&A activity has been led by household B2C names like Google, Apple and Facebook that are sitting on top of piles of consumer data that can result in a new level of innovation when coupled with existing as well as emerging Machine Learning techniques like Deep Learning. The dearth of talent to make this opportunity a reality has resulted in a very uneven distribution of that talent, as those deep pocketed “acquihirers” outbid other suitors to the tune of $10M per FTE for early stage startups (and even higher in the case of accomplished academic brains).
The emerging need for a platform approach
As great as having some of the brightest minds work on complex problems is, it is no guarantee of success without the right tools and processes to maximize the collaboration with and the learning among developers, analysts and resident subject-matter experts. Indeed, the best way to scale and amplify the impact from the efforts of these highly capable, centralized yet still relatively tiny teams is adopting a Machine Learning platform approach.
It turns out that those who started early on the path of prioritizing Machine Learning as a key innovation enabler have already poured countless developer man-years into building their own platforms from scratch. Facebook’s FBLearner Flow, which the company recently revealed, is a great example of this trend. The platform is reported to have supported over 1 million modeling experiments to date, which make 6 million predictions per second possible for various Facebook modules such as the news feed. But perhaps the most impressive statistic is that 25% of Facebook engineers have become users of the platform over the years. This is very much in line with Google’s current efforts to train more developers to help themselves when it comes to building Machine Learning powered smart features and applications.
Machine Learning haves (1%) and have nots (99%)
Examples like the above are inspirational, but they raise the question of how many companies can realistically afford to build their own platform from scratch. The short answer is “Not too many!”
Left to their own devices, these firms face the following options:
- Hire a few Data Scientists, who may each bring their own open source tools and libraries of varying levels of complexity, potentially limiting the adoption of Machine Learning in other functions of the organization, where the ownership of mission critical applications and core industry expertise reside.
- Turn to commercial point solution providers with a few built-in black-box Machine Learning driven use cases per function, e.g., HR, Marketing, Sales, etc.
- Count on the larger B2B players’ recently launched Machine Learning platforms to catch up and mature in a way that can not only engage highly experienced Machine Learning specialists, but also serve the needs of developers and analysts alike, e.g., IBM, Microsoft (Azure), Amazon (AWS), etc.
Although these options may be acceptable ways to dip your toes in the water or stop the bleeding when going to market with a very specific use case, they are not satisfactory longer term approaches. They do not strike the optimal balance between time to market, return on investment, and a collaborative transformation that leads to a data driven culture of continuous innovation, one that transcends what can be achieved with small teams of PhDs. As a result, despite the recent advances in data collection, storage and processing, many industries are stuck with an environment that is rich in data but poor in insights (and business outcomes), awash with a cacophony of buzzwords.
Luckily, there’s still an incipient industry of independent Machine Learning platforms like BigML, H2O and Skytree (no more Turi) that can supply this unfulfilled demand from the so far lagging 99%. However, we must remember that replacing those platforms with new complete ones may require years of arduous work by highly specialized teams, which runs counter to the present day two co-founder, Silicon Valley accelerator startup recipe targeting a quick exit despite little to no Intellectual Property.
Regardless of whether any tech bellwether is able to create a monopoly, it seems safe to assume that for the foreseeable future the race for Machine Learning talent is only going to get hotter as more companies get a taste of its value. We will all see whether this game of inverse musical chairs will last long enough to make it to the official program of Tokyo 2020!
The good folks over at the innovative web application integration and event based automation startup Zapier have recently prepared a comprehensive list of Top 50 Google Sheets Add-Ons to Supercharge Your Spreadsheets. We were flattered to find out that they listed BigML for Google Sheets (aka BigML-GAS) as a top add-on for “number crunching”.
Like all of our BigML-Tools, BigML-GAS is super-easy to use and it is a great way to expose your predictive models externally to your workgroup or other partners regardless of their experience with Machine Learning. For example, imagine that you have a model predicting the likelihood of a prospect to become a customer based on your historical sales pipeline results. Your Field Sales team that may collect data on prospects can easily score each of their prospects without needing a BigML account of their own or knowing anything about predictive models or Machine Learning for that matter! The beauty of it is that you need no complex IT integration with any CRM tools to get going with such a use case.
But why stop there? With a vast ecosystem of apps, Google Sheets offers users a wide array of options to piece together functionality without relying on programming skills or internal developers. Besides BigML-GAS, there are other noteworthy add-ons for the analytically minded, in case they decide to get serious about automating the gathering and analysis of data in Google Sheets as well as implementing new processes based on the resulting actionable insights. Here are a few from Zapier’s eBook:
- Blockspring: Lets you import and analyze data with APIs. For example, you can extract pricing data for your competitors’ products from the Amazon API. Once you import the data, you can analyze it manually or on a schedule.
- Aylien: This is a great add-on for text analytics tasks such as sentiment analysis. For instance, you can pull in Tweets about your products and analyze their sentiment scores to see which ones are being received more favorably in which geographies.
- import.io: With this service, you can turn any website into a spreadsheet. If you would like to augment your Sheet with data from external sites, use this handy tool to scrape the data and combine it with what you have. import.io is free for up to 10k queries per month.
- Geocode by Awesome Table: This add-on helps you get latitudes & longitudes from addresses in a Google Sheet to display them on a map that you can share.
In the coming weeks, we will present some interesting use cases that leverage the Google Sheets ecosystem along with BigML. In the meantime, do you have a cool idea that you have already implemented on BigML? Are you planning to extend it to Google Sheets with BigML-GAS? If so, let us know and we will present it on the BigML blog to spread the learning.
Real world Machine Learning is not just about the application of an algorithm to a dataset but a workflow, which involves a sequence of steps such as adding new features, sampling, removing anomalies, applying a few algorithms in cascade, and stacking a few others. The exact steps are often arrived at during iterative experiments performed by the practitioner. In other words, when it comes to the real life Machine Learning process, not everything is as automatic as various business media may make you believe.
Usually, one starts by playing around a bit with the data to assess its quality and to get more familiar with it. Then, a significant amount of time is spent in feature engineering datasets, configuring models, evaluating them, and iterating or combining resources to improve results. Finally, when the right workflow is found, traceability and replicability become must-have properties for bringing the workflow to a production environment. Without them, one can’t ensure that errors are eliminated, or that workflows can be rerun and otherwise improved by everyone in a workgroup.
You are probably asking, “That all sounds great, but how does one achieve this without creating even more complexity?” That’s precisely why today BigML has launched a new game-changing feature, which can create a WhizzML script for any Machine Learning workflow in a single click: Scriptify.
Auto-scripting workflows to generate your resources
With this update, all BigML resources (datasets, models, evaluations, etc.) have a new menu option named Scriptify that allows you to automatically generate a script that can regenerate that resource (or generate a similar one if it is used with different inputs with the same field structure)! BigML resources have always been downloadable as JSON objects, where the user-specified properties at the time of the creation or update of the resource are stored as attributes. This white-box approach is crucial to ensure that everything you do with BigML is traceable, reusable and replicable. As a result, we can use our recently launched scripting language, WhizzML, to inspect any existing resource in BigML and easily extract the entire workflow needed to generate it.
Let’s start by explaining an example use case. Say you created an evaluation some time ago, but you don’t remember which model was evaluated, whether it was a balanced model or not, or which test dataset was used. No problem! This information is already stored in each of the resources. You don’t need to track it down manually or document it in a separate repo. Clicking the Scriptify your evaluation link in the actions menu of your evaluation view screen will unravel it, and swiftly generate a script, which can reproduce the original evaluation.
This new script will now be available as another resource. Like any other WhizzML script, you will be able to execute it to recreate your evaluation on demand.
To create a resource in BigML, you usually provide the starting resource (e.g., if you want to build a model, you’ll need to decide from which dataset) and some configuration. The corresponding Scriptify action retrieves this information, and recursively does the same for all resources used throughout your entire workflow.
Following the example in the previous section, first the evaluation is analyzed to find out the IDs of both the model that was evaluated and the dataset used as the test dataset, since these are the origin resources for the evaluation. Then, each of them is analyzed recursively to find out the origin resources that were used to build them. The model was built from a dataset and the test dataset from a source. Finally, it turns out that the dataset used to build the model was built from the same source as the test dataset by using an 80%-20% split. In general, any Scriptify call will bubble up through the hierarchy of parent objects until it finds the source object for every resource involved in the analysis. As the Scriptify process needs to explore every object in the hierarchy, it will be stopped if any intermediate resource has been deleted. For each of the resources, the script extracts the attributes used in the create and/or update calls and generates a new script which contains the WhizzML code able to replicate them.
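The recursive walk described above can be sketched in plain Python. This is only an illustration of the idea, not Scriptify’s actual implementation: the resource IDs, the `parents` and `args` keys, and the in-memory dictionary standing in for stored resources are all hypothetical, and the hierarchy is a simplified version of the evaluation example.

```python
# Hypothetical in-memory stand-in for stored resources and their creation arguments.
RESOURCES = {
    "evaluation/1": {"parents": ["model/1", "dataset/test"], "args": {}},
    "model/1": {"parents": ["dataset/train"], "args": {}},
    "dataset/train": {"parents": ["dataset/full"], "args": {"sample_rate": 0.8}},
    "dataset/test": {"parents": ["dataset/full"],
                     "args": {"sample_rate": 0.8, "out_of_bag": True}},
    "dataset/full": {"parents": ["source/1"], "args": {}},
    "source/1": {"parents": [], "args": {"remote": "s3://bigml-public/csv/diabetes.csv"}},
}

def creation_order(resource_id, resources, ordered=None):
    """List resource IDs so that every parent comes before the resources built from it."""
    if ordered is None:
        ordered = []
    if resource_id in ordered:
        return ordered  # already visited through another branch
    if resource_id not in resources:
        # Mirrors the rule that the process stops when an intermediate resource is missing.
        raise KeyError(f"{resource_id} is missing, so the workflow cannot be rebuilt")
    for parent in resources[resource_id]["parents"]:
        creation_order(parent, resources, ordered)
    ordered.append(resource_id)
    return ordered

steps = creation_order("evaluation/1", RESOURCES)
for step in steps:
    print(step, RESOURCES[step]["args"])
```

Replaying the steps in this order, with the stored arguments for each one, is what allows the whole workflow to be rebuilt from a single starting input.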
As you can see in the code example, all the resources derive from the data in a file named s3://bigml-public/csv/diabetes.csv, which was initially uploaded to build a source object. This URL is kept as input for the script, but you can change it if need be. In a production environment, you periodically need to repeat the same process on new data. Using this script, you only need to provide the new URL as input to rebuild the evaluation on new data with the same procedure and configuration.
Scriptify as the building block for automation
Another interesting property of scripts in BigML is that you can modify them to create new scripts. The link create a new script using this one opens an editor screen, where you can modify the code in the original script. Following the example, if you find out that your model was not balanced and you want to try the evaluation on a balanced model, you can do so by adding the balance_objective flag to the model creation call attributes.
Clicking the validate button checks your changes and points out any errors. You can also change the outputs of the script. This view searches all the possible outputs in the script and offers you a list to select from. In this case, we set the model ID to be returned as output in addition to the original evaluation ID.
As you can see, you don’t really need to know WhizzML to start scripting your workflows. Rather, you can create any workflow on BigML’s Dashboard and let Scriptify magically transform it into WhizzML code! You can even make it a 1-click automation option on your menu. Finally, this code can also be modified and composed to create, share, and execute new workflows in a scalable, parallelizable and reproducible way. Now it’s your turn to transform your custom Machine Learning resources into blazingly fast, traceable and reproducible workflows ready for production environments!
Raw information is useless if you don’t understand what it means, but sometimes there is just so much of it that it’s hard to get a handle on what is going on. One way to better understand your data is through cluster analysis – grouping similar data in “clusters”. At BigML we use centroid-based clustering to group your data with just one click. While this is terribly convenient, it can obscure how the clustering decisions are actually made. When a dataset has dozens of input fields (or more!), how can you tell which ones were actually important in grouping your data?
What is important anyway?
This is a really big question, but at BigML we specialize in turning big questions into answers. Here something is important to a process if it affects the outcome of that process. For example, consider the importance of an input field when building a decision tree. A BigML decision tree automatically finds the importance of each input field by finding every time an input field was used to make a split in the tree and then averaging how different the prediction would have been without that split. With this definition of importance, more is better. While a single tree may give some understanding of which fields are important, a whole forest of trees would give even more certainty that the most important fields are identified.
How can we apply this definition of importance to the case of clusters? If one of the input fields recorded which cluster each datapoint belonged to, we could build a model to predict that field. The importance of the other input fields in this model would then give their importance in deciding cluster membership.
In fact, there is already a one-click way to grow a tree based on a cluster just this way. When viewing a cluster in the BigML dashboard, if you shift click on a particular cluster, below on the right you can click to create a model (a single tree) from this cluster. This will give you the importance of each input field, but only to this cluster. But until now, there was no easy way to see the overall importance of each field considering all the clusters.
Global importance is here!
This new BigML script creates not just one, but an ensemble of trees designed to find the importance of your input fields. With just a few clicks, you will know which fields contribute to how your clusters were decided. And because the script uses an ensemble, you can be more confident that these fields really are the ones you want.
You can import this script directly from our gallery here. Now you are ready to analyze your cluster! Pick any of your clusters from the dropdown menu. Whether it was a fast one-click cluster, or you spent a long time carefully tailoring your cluster options, this script will be able to tell which input fields were the important ones. Once you’ve got your cluster, click “Execute” and let BigML do its thing. When complete, the output is importance-list, a map of input field id, field name and importance, ordered from most to least important.
Cluster Classification in more detail
This script is written in WhizzML, BigML’s new domain-specific language. If you have a complicated task to do, just a few lines of WhizzML can replace the repetitive clicking needed to massage data in the dashboard. The cluster classification script takes a cluster ID and uses WhizzML to:
- Create an extension of the cluster’s source dataset, adding a new field ‘cluster’
- Create an ensemble from those resources.
- Put it all together to report each field’s importance.
This only takes a few steps because the script exploits the features of two BigML resources: the batchcentroid and the ensemble. Many BigML resources automatically contain calculated information that would be too much to show in the dashboard. But with a little WhizzML we can reveal their secrets.
Here’s the function that creates the extension of the cluster’s source dataset:
It begins with define, which is how all WhizzML functions are defined. Here, a function label-by-cluster is defined to take as an input the cluster id cl-id. Next, a let expression assigns some variables to objects pulled out of that cluster. Here’s where BigML resources really shine. We ultimately want a dataset resource, and we could create the dataset we are after using Flatline to edit every row of the original dataset. But instead we will create a batchcentroid. Set the output_dataset parameter to true, and a batchcentroid resource automatically creates a dataset where each row is labeled by its cluster. But we want this dataset to have all the same fields as the original in addition to this new cluster field. So we set the parameter output_fields to be the same fields as the original, and we’ve got exactly what we want!
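To picture what the labeled dataset looks like, here is a small local Python sketch of the same idea: every row keeps its original fields and gains a cluster field naming its nearest centroid. This is only an illustration under simplifying assumptions (numeric fields and plain Euclidean distance); the function name `label_by_cluster` and the sample data are hypothetical, and the actual batchcentroid resource handles scaling and non-numeric fields in its own way.

```python
import math


def label_by_cluster(rows, centroids, fields):
    """Return copies of the rows with an added 'cluster' field.

    rows: list of dicts holding numeric values for each name in `fields`
    centroids: dict mapping a cluster name to a dict of field values
    """
    labeled = []
    for row in rows:
        point = [row[f] for f in fields]
        # Pick the centroid with the smallest Euclidean distance to the row.
        nearest = min(
            centroids,
            key=lambda name: math.dist(point, [centroids[name][f] for f in fields]),
        )
        # Keep every original field and append the new cluster label.
        labeled.append({**row, "cluster": nearest})
    return labeled


rows = [{"x": 1.0, "y": 1.0}, {"x": 9.0, "y": 8.0}]
centroids = {
    "Cluster 0": {"x": 0.0, "y": 0.0},
    "Cluster 1": {"x": 10.0, "y": 10.0},
}
labeled = label_by_cluster(rows, centroids, ["x", "y"])
print(labeled)
```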
Now that we have this extended dataset, we can figure out how important each field is in determining cluster membership by building an ensemble of trees with cluster membership as the objective. BigML automatically calculates the importance of each field in a model; we just have to know where to look to get those numbers.
Here’s the function that creates a map of field ids and their importance:
Just as before, define creates the function make-importance-map, which takes an ensemble id and a list of input field ids as inputs. In the let statement, we go into the ensemble and pull out a list of all the models it contains, then go into each model to pull out the list of field importances. Now we just have to put everything together. Without getting too lost in details, the helper function list-to-map turns our lists into maps and merge-with combines all the maps by addition. One final map to divide by the number of models, and we have a map of field importance averaged over all the models of the ensemble.
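The averaging step can be sketched in plain Python as follows. The functions below are hypothetical Python counterparts of the helpers named above, not the WhizzML code itself; the field ids and importance values are made up, and the sketch assumes the ensemble contains at least one model and that each model's importances arrive as (field id, importance) pairs.

```python
def list_to_map(pairs):
    """Turn a list of (field_id, importance) pairs into a dict."""
    return {field_id: importance for field_id, importance in pairs}


def merge_with(combine, maps):
    """Merge dicts, combining values that share a key with `combine`."""
    merged = {}
    for m in maps:
        for key, value in m.items():
            merged[key] = combine(merged[key], value) if key in merged else value
    return merged


def make_importance_map(model_importances, input_field_ids):
    """Average each field's importance over all models of an ensemble."""
    maps = [list_to_map(pairs) for pairs in model_importances]
    totals = merge_with(lambda a, b: a + b, maps)
    n_models = len(model_importances)  # assumed to be at least 1
    # Fields a model never split on simply contribute 0 for that model.
    return {fid: totals.get(fid, 0.0) / n_models for fid in input_field_ids}


def importance_list(importance_map):
    """Order fields from most to least important."""
    return sorted(importance_map.items(), key=lambda kv: kv[1], reverse=True)


# Made-up importances for an ensemble of two models.
model_importances = [
    [("000000", 0.75), ("000001", 0.25)],
    [("000000", 0.25), ("000001", 0.5), ("000002", 0.25)],
]
averaged = make_importance_map(model_importances, ["000000", "000001", "000002"])
ranked = importance_list(averaged)
print(ranked)
```

Dividing the summed importances by the number of models, rather than by the number of models that used a field, means a field ignored by most trees ends up with a low average.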
That’s it in a nutshell! If you have ever spent a lot of time carefully setting cluster parameters, only to find you aren’t really sure why your clusters were chosen as they were, this is the script for you. It will tell you exactly which fields are important not just to a single cluster, but across all of them. More understanding is just a few clicks away.