
PAPIs.io 2015 Preliminary Program Announced

As we have blogged about before, PAPIs.io 2015 is taking place in Sydney this year on the 6th and 7th of August, right before the KDD conference. As a founding member and sponsor, BigML is looking forward to this year’s event. PAPIs.io is unique in that it has been able to bring together data scientists, developers and practitioners from large tech companies, leading startups and prominent educational institutions around the globe to discuss all aspects of Predictive APIs and Predictive Apps. The hands-on, interactive agenda centers on the challenges of building real-world predictive applications based on a growing number of Predictive APIs that are making Machine Learning more and more accessible to developers. As a bonus, this year’s event will also introduce a technical track. Our enthusiasm was only elevated further by today’s preliminary conference program announcement, which exhibits great diversity in the speakers and the topics to be covered.

PAPIs.io 2015

Here are some preliminary program highlights:

  • Big Wins with Small Data: PredictionIO in Ecommerce (David Jones, Resolve Digital)
    There’s a lot of noise about big data and cutting-edge algorithm optimisations. Returning to the basics, this presentation shows that you might not need as much data as you think to get real-world benefits. Learn about machine learning in ecommerce, PredictionIO, and how we used off-the-shelf, well-implemented algorithms to get a 71% increase in revenue with an online wine retailer.
  • Open Sourcing a Predictive API (Alex Housley, Seldon)
    After operating for three years as a “black box” predictive API, Seldon recently open-sourced its entire predictive stack. Alex will talk about Seldon’s journey from closed to open: the challenges and pitfalls, architectural considerations, case studies, changes to business models, and new opportunities for partnership across the full stack, between both open and closed technology providers.
  • Deploying Predictive Models with the Actor Framework (Brian Gawalt, Upwork)
    Build a better, faster, more efficient predictive API with the Actor model of programming. Latency, logging, and full utilization are all easily handled with this framework. Upwork’s (formerly Elance-oDesk) freelancer availability model, which anticipates who’s looking for work right now, is now a real-time service, without a costly or complicated build-out of our stack or our datacenter, thanks to the Actor model.
  • Protocols and Structures for Inference: A RESTful API for Machine Learning (James Montgomery, University of Tasmania)
    Diversity in machine learning APIs works against realizing machine learning’s full potential by making it difficult to compose multiple algorithms. This paper introduces the Protocols and Structures for Inference (PSI) service architecture and specification for presenting learning algorithms and data as RESTful web resources that are accessible via a common but flexible and extensible interface. This is joint work with Dr. Mark Reid of the Australian National University and NICTA and Dr. Barry Drake of Canon Information Systems Research Australia.
  • Large scale predictive analytics for anomaly detection (Nicolas Hohn, Guavus Inc.)
    The focus will be on anomaly detection for network data streams, where the aim is to predict a distribution of future values and flag unlikely situations. Challenges both in terms of data science and engineering will be discussed, such as the accuracy, robustness and scalability of the prediction API. An example of a production deployment will also be discussed.
  • AzureML: Anatomy of a machine learning service (Sharat Chikkerur, Microsoft)
    A description of AzureML: a web service enabling software developers and data scientists to build predictive applications. The talk will outline the design principles, system design, and lessons learned in building such a system.
  • Building Machine Learning Models for Predictive Maintenance Applications (Yan Zhang, Microsoft)
    This talk introduces the landscape and challenges of predictive maintenance applications in the industry, illustrates how to formulate the problem (data labeling and feature engineering) with three machine learning models (regression, binary classification, multi-class classification), and showcases how the models can be conveniently trained and compared with different algorithms in Azure ML.

There will also be a panel discussion moderated by Mark Reid of ANU/NICTA.

If you are also attending KDD and want to kill two birds with one stone while you’re all the way Down Under, there is no better way to do it than attending PAPIs.io 2015 and rubbing shoulders with some of the most notable movers and shakers of the Machine Learning world in a cozier, more relaxed setting. You can follow subsequent announcements from PAPIs.io on Twitter. Hope to see you all in Sydney!

How to Perform Clustering in a Single Command Line

It’s been a while since we last wrote about the latest changes in our command line tool, BigMLer. In the meantime, two unsupervised learning approaches have been added to the BigML toolset: Clusters and Anomaly Detectors. Clusters are useful to group together instances that are similar to each other and dissimilar to those in other groups, according to their features. Anomaly Detectors, by contrast, try to reveal which instances are dissimilar from the global pattern. Clusters and anomaly detectors can be used in market segmentation and fraud detection, respectively. Unlike trees and ensembles, they don’t need your training data to contain a field that you must previously label. Rather, they work from the raw data alone, which is why they’re called unsupervised models.

In this post, we’ll see how easily you can build a cluster from your data, and a forthcoming post will do the same for anomaly detectors. Using the command line tool BigMLer, these machine learning techniques will easily pop out of their shells, and you will be able to use them either from the cloud or from your own computer, no latencies involved.

Clustering

Clustering your data

There are many scenarios where you can get new insights from clustering your data; customer segmentation might be the most popular one. Being able to identify which instances share similar characteristics has always been a coveted capability, whether for marketing campaigns, spotting groups at risk of loan default, or health diagnosis. Clusters do this job by defining a distance between your data instances based on the values of their features. The distance must ensure that similar instances end up closer together than dissimilar ones. In BigML, all kinds of features contribute to this distance: numeric features, obviously, but categorical and text features are taken into account as well. Using this distance, instances in your dataset that lie close together and far from the rest of the points are grouped into a cluster. The central point of each cluster, called the centroid, gives us the average features of the group. You can optionally label each cluster with a descriptive name based on the features of its centroid. Each label then acts as a cluster-based class name: you could build a new global dataset with an extra column holding the label of the cluster each instance is assigned to, so the label works as a kind of category. Then, when new data appears, you can tell which cluster it belongs to by checking which centroid it is closest to.

So, how can BigMLer help cluster your data? Just type:
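A minimal sketch of the command, where the training file path is a placeholder for your own CSV:

    bigmler cluster --train data/my_data.csv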

and the process begins: first your data is uploaded to BigML’s servers, and BigML infers the fields that your source contains, their types, the missing values these fields might have, and builds a source object. This source is then summarized into a dataset, where all statistical information is computed. After that, the dataset is used to look for the clusters in your data and a cluster object is created. The command’s output shows the steps of the process and the IDs for the created resources. Those resources are also stored in files in the output directory.

The default clustering algorithm is G-means, but what does that mean? We talked about it extensively in a prior post, but in essence it builds on k-means, which groups your data around a small number of instances (the number is usually denoted by k) by computing the distance from the rest of the points in the dataset to those instances. Each point in the dataset is assigned to the group of the closest of the k instances selected at the beginning. This process results in k clusters, and their centroids (the central point of each group) are used as the starting set of instances for a new iteration of the algorithm. This carries on until there is no further improvement in the separation between clusters. G-means eliminates the need for the user to divine what the value of k must be: it compares the results found using different values of k and chooses the one that achieves the most Gaussian-shaped clusters. That’s the algorithm used in BigMLer if no k value is specified, but you can always set the k of your choice in the command if you please:
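For instance, a sketch that fixes the number of centroids at 3 (assuming BigMLer’s --k option; the path is again a placeholder):

    bigmler cluster --train data/my_data.csv --k 3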

In this case, the created cluster will have exactly 3 centroids, while in the first case the final cluster count is determined automatically. The initial k instances are picked at random, so you must specify a seed using the modifier --cluster-seed in your command if you want to ensure deterministic results. To run the clustering from your previously created dataset with a seed, use the dataset id as the starting point:
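A sketch along these lines, where the dataset id stands in for the one created in the previous step and my_new_instances holds the data to assign to centroids:

    bigmler cluster --dataset dataset/<your-dataset-id> --cluster-seed my_seed --test my_new_instances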

This command will create a new cluster from the data enclosed in the existing dataset, reproducible in a deterministic way. It will also generate predictions for the new data enclosed in the my_new_instances file, finding which centroid each instance is closest to. But then, how do you know which instances are grouped together?

Profiling your clustered data

To learn more about the characteristics of the obtained set of clusters, you can create a dataset for each of them with --cluster-datasets. Refer to your recently created cluster using its id.
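A sketch, with a placeholder cluster id:

    bigmler cluster --cluster cluster/<your-cluster-id> --cluster-datasets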

and you’ll obtain a new dataset per centroid. Each of them will contain the instances of your original dataset associated with that centroid. If you are not interested in all of the groups, you can choose which ones to generate by using the names of the associated centroids.
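For example, a command like the following sketch (placeholder cluster id again)

    bigmler cluster --cluster cluster/<your-cluster-id> --cluster-datasets "Cluster 1,Cluster 2"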

will only generate the datasets associated with the clusters labeled Cluster 1 and Cluster 2. The generated datasets will show the statistical information associated with the instances that fall into a certain cluster-defined class.

Can you learn more from your cluster? Well, yes! You can also see which features are most helpful to separate the clustered datasets, so that you can infer the attributes that “define” your clusters. In order to do that, you should create a model for each clustered dataset that you are interested in using --cluster-models:
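A sketch restricted to the cluster of interest (placeholder id):

    bigmler cluster --cluster cluster/<your-cluster-id> --cluster-models "Cluster 1"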

How does BigMLer build a tree model for a particular cluster, say Cluster 1? By adding a new field to every instance in your dataset indicating whether or not that instance is grouped there, and modeling that field. This brings an additional advantage: each model has a related field importance histogram that shows how much each field contributes to the classification. Knowing which fields are important to classify new data as belonging to a centroid can give you an insight into what features define the cluster.

And these are basically the magic commands you need to identify the groups of instances your data contains, profile each group, and discover the features that make them alike and different from the others. Similarly, in a forthcoming post, we’ll talk about another set of commands that will help you pinpoint the outliers in your data using Anomaly Detectors. Stay tuned!

Inside BigMLKit: A Sample Predictive App for Apple ResearchKit


Following up on our post announcing the availability of BigMLKit, we are now going to introduce the BigMLKit API and present a sample app that can be used as a playground to experiment with BigMLKit.

BigMLKit

As already mentioned in the previous post, BigMLKit brings the capability of “one-click-to-predict” to iOS and OS X developers. This is accomplished through the notion of a task, which is basically a sequence of steps. Each step has traditionally required a certain amount of work, such as preparing the data, calling BigML’s REST API, waiting for the operation to complete, collecting the right data to prepare the next step, and so on. BigMLKit takes care of all of this “glue logic” for you in a streamlined manner, while also providing an abstracted way to interact with BigML and build complex tasks on top of our platform.

BigMLKit Classes

BigMLKit’s classes fall into three groups:

  • Foundation
  • Tasks
  • Configuration.

Foundation

Everything in BigML is associated with resources, such as datasets, clusters, sources, etc. A resource’s identity in BigMLKit is defined through a name and a UUID (universally unique identifier), which are encapsulated in the BMLResourceProtocol protocol. A concrete implementation of BMLResourceProtocol will additionally provide more properties and/or methods according to the specific application being built. If you want to be able to filter available resources locally, you will likely need to use Core Data and define a model whose entities contain the attributes you want to filter on. On the other hand, if you only want to support remote behavior for your entities, then their UUID is enough information for the REST API to handle them.

BigMLKit defines three basic types to build resource UUIDs:

  • BMLResourceType
  • BMLResourceUuid
  • BMLResourceFullUuid.

The three types are typedef’ed NSStrings. Following the way the BigML REST API identifies resources, a BMLResourceFullUuid is made up of a BMLResourceType and a BMLResourceUuid joined by a slash, e.g. “model/de305d54-75b4-431b-adb2-eb6b9e546014”. The class BMLResourceUtils defines convenience methods to extract a BMLResourceType or BMLResourceUuid from a BMLResourceFullUuid.

Tasks and Workflows

Tasks and Workflows are what makes BigMLKit useful.

A workflow is a collection of BigML operations. It can be as simple as a single call to BigML’s REST API or it can include multiple steps, e.g., when creating a sequence of BigML resources starting with a dataset and ending with a prediction.

BigMLKit provides several classes to define and use tasks and workflows, as detailed below.

BMLWorkflow is an abstract base class that is used to build composite workflows by combining lower-level workflows together. The simplest form of BMLWorkflow is a BMLWorkflowTask, which corresponds to a single-step workflow. BigMLKit provides several BMLWorkflowTask-derived classes that represent the basic operations the BigML REST API can execute:

  • BMLWorkflowTaskCreateSource
  • BMLWorkflowTaskCreateDataset
  • BMLWorkflowTaskCreateModel
  • BMLWorkflowTaskCreateCluster
  • BMLWorkflowTaskCreatePrediction.

BMLWorkflowTaskSequence is a higher-level workflow that is able to execute a sequence of workflows.

BMLWorkflowTaskContext provides the context for task execution, where input, output, and intermediate results can be stored. The context also acts as a monitor for remote operations: it polls the BigML API to check a resource’s progress and handles it according to its semantics. The storage mechanism is exposed through an NSMutableDictionary. The key/value associations are an implementation detail of the workflows that use the context to carry out their operations. A context also hosts a connector object, which is responsible for handling the communication with BigML through its API interface. Currently, the connector object is an instance of ML4iOS.

Configuration

BigML’s REST API offers a lot of options to configure the available machine learning algorithms. For each resource type, BigMLKit provides a plist file that describes which options are available and what their types are, so a program can easily handle them, e.g., to display a list of available options or allow users to set values for them. There are three main classes at play here:

  • BMLWorkflowTaskConfiguration, which allows for collecting all options in a common place and accessing them in an organized way; e.g., by getting all option definitions, or their values, etc.
  • BMLWorkflowTaskConfigurationOption, which is the atomic option. It basically provides a way to set whether the option should be effectively used in a given execution of the workflow, and to retrieve its current value.
  • BMLWorkflowConfigurator, which is a container for all the BMLWorkflowTaskConfiguration instances associated with a user session. A configurator can be shared across multiple executions of the same workflow, or even different workflows.

In many cases, it is enough to use BigML’s default values for configuration options so there is no need to tweak them. The topic of BigMLKit configuration will be explored in a further post.

Running a Workflow

Running a workflow requires two preliminary steps:

  • creating the workflow
  • creating and setting up the context for its execution.

As mentioned above, at the moment BigMLKit provides single-operation workflows and a task sequence workflow, but you can easily implement any kind of specific workflow that you might need. Creating and setting up a context is workflow specific. You can see an example in the sample app introduced below.

Once the preliminary steps are done, you can run a workflow by calling the runInContext:completionBlock: method. The completion block will be called at the end of the workflow execution with an NSError argument in case an error occurred.

BigMLKitSampleApp

BigMLKitSampleApp is a simple iOS app that shows how you can integrate BigMLKit into your apps. The sample app is available on GitHub and allows you to create a prediction from a data file, which is used to train a model. All the steps from datasource creation up to model creation are executed remotely on BigML servers, while the prediction step is executed locally based on the calculated model and does NOT require access to BigML services. Three sample source files are provided: iris.csv, diab.csv, and wines.csv.

To keep the sample app code simple enough, it defaults to creating a decision tree, although BigMLKit also provides support for training clusters and, in the near future, anomaly detectors and other Machine Learning algorithms provided by BigML. Furthermore, the app uses static data files to train the models, but in a real application you could just as easily read the data to train your models from iOS HealthKit and/or ResearchKit, or you could use HealthKit/ResearchKit data to make predictions based on an existing reference model.

To understand how BigMLKit is integrated into the app, you can inspect the BMLPredictionViewController class, and in particular its two methods called setupFromModel: and startWorkflow.

The setupFromModel: method is called whenever the app delegate detects that the user tapped on any of the three available source files. On the other hand, startWorkflow is responsible for enabling the UI to provide the user with some visual feedback about the workflow being executed. It also handles the display of workflow results. In greater detail:

  • When a tap on a resource file is detected, the app delegate stores the current source file in the shared view model, then calls setupFromModel:.
  • setupFromModel: creates a new workflow, properly initializes a context, and finally calls startWorkflow.
  • startWorkflow will update the UI and then it will start the workflow and provide a callback.
  • The callback, if no errors are found, will use the workflow results (available in the workflow context) to build a prediction form, so the user can try different combinations of input arguments to make new predictions.

This is all that is required! As you can see, BigMLKit makes it really straightforward to run a simple workflow and use the power of machine learning in your apps, and we hope that you will find great applications for our technology and create extensions to BigMLKit that will make it even more convenient.

If you have any questions on how to get started with BigMLKit, feel free to contact us at info@bigml.com.

Detecting numeric irregularities with Benford’s Law

This is the first post in a series of statistics primers to inaugurate the arrival of BigML’s new advanced statistics feature. Depending on your background as a reader, the theory portion of this post may cover ideas you already understand; if that’s the case, feel free to skip ahead to how to access these stats in BigML. Today’s topic is Benford’s law, which can be used to detect irregularities in numeric data. It applies to collections of numeric values that satisfy the following criteria:

  1. Have a wide distribution, spanning several orders of magnitude.
  2. Are generated by some “natural” process, rather than, say, being arbitrarily chosen by a human.

When those conditions are met, Benford’s law states that the first significant digits (FSDs) will be distributed in a very specific pattern. In other words, we can take each of the digits from 1 to 9 and look at the relative proportion with which they appear in the first significant position among values in the data (e.g. the FSDs for the values 122.4, -54.01, and 0.0048 are 1, 5, and 4 respectively). If these proportions match the ones predicted by Benford’s law, then our data are consistent with the two criteria. Otherwise, the data may have been tampered with, or may simply cover too narrow a range for Benford’s law to apply. If we denote by p_d the proportion of the data in which the digit d is in the first significant position, Benford’s law states that these proportions take on the following values:

p_d = \log_{10}(d+1) - \log_{10}(d)
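Plugging in the extreme digits shows how lopsided this distribution is:

p_1 = \log_{10}(2) \approx 0.301, \qquad p_9 = \log_{10}(10/9) \approx 0.046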

In the plots that follow, p_1 through p_9 are drawn as the green line. We see that 1 should be the FSD in about 30% of the data, while 9 should account for less than 5% of the FSDs. The first two plots are examples of numeric data which conform to Benford’s law: the Fibonacci numbers and US county populations both satisfy the criteria given above. The gray bars denote the relative proportions of FSDs in the data.

[Figure: FSD distributions (gray bars) vs. the Benford proportions (green line) for the Fibonacci numbers and for US county populations]

The next two plots are examples of non-conforming data. The first example is data from the ubiquitous Iris dataset. Although it is undeniably a natural dataset, it fails the first criterion, since its values span only the narrow range from 4 to 8 cm. The second example is an instance of fraudulent data. As chronicled in State of Arizona v. Wayne James Nelson (CV92-18841), Mr. Nelson, a manager in the Arizona state treasurer’s office, attempted to embezzle nearly $2 million through bogus vendor payments. Since Nelson started small and worked his way up to larger amounts, the values do satisfy the first criterion. However, as all the amounts were artificially invented, the second criterion is not satisfied, and the final FSD distribution is very far from the one given by Benford’s law, with the digits 1-6 being too scarce and 7-9 much more common than expected.

[Figure: FSD distributions (gray bars) vs. the Benford proportions (green line) for the Iris measurements and for the Nelson fraud amounts]

The last of these examples highlights the potential usefulness of this phenomenon in detecting suspicious numbers, and indeed there are many documented cases where fraudulent data have been exposed through application of Benford’s law. Multiple analyses of results from the 2009 Iranian presidential election have used Benford’s law to provide statistical evidence suggesting vote rigging. A post-mortem Benford’s law analysis of the accounts of several bankrupt US municipalities revealed inconsistent figures, which could be indicative of the fiscal dishonesty that led to the municipalities’ financial ruin. A team of German economists applied a Benford’s law analysis to the accounting statistics reported by European Union member and candidate nations during the years leading up to the 2010 EU sovereign debt crisis. They found that the numbers released by Greece showed the highest degree of deviation from the expected Benford’s law distribution. As Greek national debt was one of the main drivers of the crisis, we can draw the conclusion that the Greek government was fudging the numbers to hide its fiscal instability. Interestingly, while researching this topic we found that the Greek source data for this analysis is now conspicuously absent from the EUROSTAT website.

Testing Benford’s Law

Having seen that deviation from Benford’s law can be a useful indicator of anomalous data, we are left with the question of actually quantifying that deviation.  This brings us to the topic of statistical hypothesis testing, in which we seek to confirm or reject some hypotheses about a random process, given a finite number of observations from that process. For the purposes of our current discussion, the random process in question is the population from which our numeric data are drawn, and the hypotheses we consider are as follows:

H0 (null hypothesis): The population’s FSD distribution conforms to Benford’s Law

H1 (alternate hypothesis): The population’s FSD distribution is different from Benford’s Law

Depending on the outcome of the test, we either accept the null hypothesis, or reject it in favor of the alternate hypothesis. In the latter case, we may have grounds for applying more scrutiny to the values, as failure to fit Benford’s law can be a sign of questionable data. The second piece of a statistical test is the significance level, which is closely tied to the p-value. In statistics, the results we obtain are not concrete facts; rather, our conclusions hold with some level of certainty less than 100%. The precise definition of the p-value is rather nuanced, but we can think of it as measuring how extreme the calculated test statistic is under the assumption that the null hypothesis is true. The workflow of a statistical test is thus as follows:

  1. Calculate a test statistic from the sample data, using the method prescribed for the specific test.
  2. Choose a desired significance level, which determines a critical value for the test statistic.
  3. If the calculated statistic is greater than the critical value, then the null hypothesis is rejected at the chosen significance level. Otherwise, the null hypothesis is accepted.

For Benford’s Law hypothesis testing, commonly employed tests are Pearson’s Chi Square test-of-fit, and the Cho-Gaines d statistic. Let’s work these tests out using our four example datasets.

Chi Square Test-of-fit

This test is a general purpose test for verifying whether data are distributed according to any arbitrary distribution. The test statistic is computed from counts rather than proportions. Let \hat{p}_d be the observed proportion of digit d in the data’s FSD distribution, and p_d be the expected Benford’s law proportion defined previously. For a data set containing N observations, the observed and expected frequencies are given by O_d = N\hat{p}_d and E_d = Np_d respectively. The Chi-square statistic is defined as follows:

\chi^2 = \sum_{d=1}^9 \frac{(O_d - E_d)^2}{E_d}

The critical value for this test comes from a chi-square distribution with (9-1) = 8 degrees of freedom. For a significance level of 0.01, we get a critical value of 20.09. If the value of χ² is greater than this value, then we can reject a fit to Benford’s law with 99% certainty. In the Nelson check fraud dataset, we have the following observed frequencies:

O_1,\dotsc,O_9 = [1, 1, 1, 0, 0, 0, 3, 9, 8]

In other words, 1 was the first significant digit in one of the entries, while 9 was the FSD in 8 entries. For this 22 point dataset, our expected Benford’s law frequencies are:

E_1,\dotsc,E_9 = [6.622, 3.874, 2.749, 2.132, 1.742, 1.473, 1.276, 1.125, 1.007]

Computing the chi-square statistic is a simple matter of plugging in the values:

\chi^2 = \frac{(1-6.622)^2}{6.622} + \frac{(1-3.874)^2}{3.874} + \dotsb + \frac{(8-1.007)^2}{1.007} = 121.0169

The obtained value is greater than the critical value, so we can indeed say that the fraudulent check data do not fit Benford’s law. Iris, our other non-conforming dataset, also produces a chi-square statistic larger than the critical value (506.3930), while the Fibonacci and US county datasets produce values less than the critical value (0.1985 and 10.6314, respectively).
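If you would like to test numbers of your own, here is a minimal Python sketch (our illustration, not BigML code) that extracts FSDs and computes the chi-square statistic against the Benford proportions:

    import math

    # Benford's law proportions p_d = log10(d+1) - log10(d), for d = 1..9
    BENFORD = [math.log10(1 + 1.0 / d) for d in range(1, 10)]

    def first_significant_digit(x):
        """First nonzero digit of |x|: 122.4 -> 1, -54.01 -> 5, 0.0048 -> 4."""
        x = abs(x)
        if x == 0:
            raise ValueError("FSD is undefined for zero")
        # Scale |x| into [1, 10) and keep the integer part.
        return int(x / 10 ** math.floor(math.log10(x)))

    def fsd_counts(values):
        """Observed frequencies O_1..O_9 of the first significant digits."""
        counts = [0] * 9
        for v in values:
            counts[first_significant_digit(v) - 1] += 1
        return counts

    def chi_square(values):
        """Pearson chi-square statistic of the FSD counts vs. Benford's law.
        Values above 20.09 reject the fit at the 0.01 significance level."""
        n = len(values)
        return sum((o - n * p) ** 2 / (n * p)
                   for o, p in zip(fsd_counts(values), BENFORD))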

Cho-Gaines d

For small sample sizes, the chi-square test can have difficulty discriminating between data which do and do not fit Benford’s law. The Cho-Gaines d statistic is an alternative test formulated to be less sensitive to sample size. It is defined as follows:

d = \sqrt{N \sum_{d=1}^9 (\hat{p}_d - p_d)^2}

For a significance level of 0.01, the critical value for d is 1.569. The values of d for our example data are 0.114, 1.066, 7.124, and 2.789 for the Fibonacci, US county, Iris, and Nelson datasets, respectively. The first two values are less than the critical value, whereas the last two are greater, a result consistent with the chi-square test and with visual comparison of the FSD distributions. Rather than coming from a well parameterized distribution as in the chi-square test, the critical values for the Cho-Gaines d test are obtained from Monte Carlo simulations and are only available for a few select significance levels. This means that it is not possible to know the exact p-value for an arbitrary value of d, which is a tradeoff compared to the chi-square test.
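Reusing fsd_counts and BENFORD from the sketch above, the d statistic needs only one more function:

    def cho_gaines_d(values):
        """Cho-Gaines d statistic; compare against the 1.569 critical value
        at the 0.01 significance level."""
        n = len(values)
        return math.sqrt(n * sum((o / float(n) - p) ** 2
                                 for o, p in zip(fsd_counts(values), BENFORD)))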

Wrap-Up

In this post, we’ve explored First Significant Digit analysis with Benford’s Law. This straightforward concept, when combined with simple statistical tests, can be a useful indicator for rooting out anomalous numeric data. Benford’s law analysis is one of the many statistical analysis tools that are being incorporated into BigML. So stay tuned for a follow-up post on how to perform this handy task and more on BigML.

BigML at Informatica World 2015

It’s been a few months since BigML was named a winner of Informatica’s Connect-a-Thon competition at their Data Mania event, where we first announced and showcased capabilities of the Informatica Connector for BigML.  We’re excited to build on this relationship by being a sponsor of the Cloud Innovation Summit at Informatica World 2015.  We’ll also be speaking at the Summit and demonstrating in the Informatica World Solutions Expo (booth 326).


As BigML continues to expand our base of enterprise customers around the world, we see this partnership as a critical component – helping companies quickly leverage data from their on-premise systems and cloud-based apps to perform an array of advanced analytics and/or to build predictive applications through BigML.

So if you’ll be attending Informatica World in Las Vegas next week, please be sure to stop by our booth and/or reach out to us (info@bigml.com) to arrange a 1-1 meeting!

The Past, Present, and Future of Machine Learning APIs

Today our very own José A. “jao” Ortega will be presenting on ‘The Past, Present, and Future of Machine Learning APIs’ at the APIDays Mediterranea & API Words event taking place in Barcelona, Spain.  This event is part of an independent conference series dedicated to APIs, Natural Language Processing and Language Technology.  During the two-day conference representatives from startups, corporations and those involved in the API industry will have a chance to discuss, learn and share about the future and business of APIs.

APIDays Mediterranea

As the event organizers suggest, Web 1.0 was readable, Web 2.0 was social, and now the web is PROGRAMMABLE through APIs. Since 2011, BigML has worked to implement a similar vision of a programmable web powered by a seamless machine learning layer in the cloud. Jao’s presentation delves deep into the origins of machine learning, including current success stories and the challenges it faces in making an impact in the “real world”, as well as bold ideas on the future direction of the space. The presentation’s emphasis is on machine learning being easily embedded in future smart apps that can adapt themselves to their context in real time as new information arrives. Simplicity, programmability, importability/exportability, composability, specialization and standardization will all play big parts in making this future vision come alive. We truly believe this is the dawn of a new era for the Internet and the digital economy it has bred. Machine Learning APIs will be the disruptive force behind this new movement, and innovators from all corners of the world are already in on the secret.

As it’s likely that many of you won’t have a chance to be in Barcelona in person, we are posting Jao’s presentation on Slideshare. Let us know your thoughts, and better yet, let us know how BigML can be part of your dream!

How BigML Finds Important Variables in Wide Datasets

This blog post is based on a talk I gave at the Dare2Data conference in Madrid.

I recently found a fascinating sociology survey with more than 39,000 responses to almost 400 questions. The survey, which has been given in the United States since 1972, covers a wide range of topics. Besides demographic info like age, gender, race and income, the survey also covers personal beliefs (“Should racists be allowed to teach college?”), living situation (“Have you been too tired to do housework recently?”) and life experience (“Have you ever injected illicit drugs?”).

While it’s great to have a dataset that’s so, um, rich, most of the variables are simply not relevant to whatever it is I want to predict. If I’m predicting whether your income is higher or lower than the United States median of $50,000, it doesn’t really matter if you’ve received a traffic ticket for a moving violation, or if you think marriage counseling is scientific. (Yes, those are actual questions.)

This is where BigML comes in. Because our algorithm does a “greedy” search through the data, examining every input individually to see how well it predicts the output, it excels at finding the needle of insight in a haystack of irrelevance. BigML actually does check whether moving violations predict income, but quickly learns that marital status, education, employment and age are much more useful.

Top 10 Variables for Predicting Income

Of course, if you change what you’re trying to predict, the list of important variables changes too. At Dare2Data, I tried predicting political beliefs instead of income, with interesting results. (Since I excluded moderates from the training set, it’s more accurate to say that I’m predicting strongly held political beliefs.)

For example, if you meet these five criteria, then you identify as conservative more than 85% of the time:

  1. You disapprove of homosexuality (or don’t respond to the question);
  2. You disapprove of sex before marriage (or don’t respond to the question);
  3. You are white;
  4. You go to church almost every week;
  5. You live in a single-family detached house (a proxy for living in the suburbs).

Of the 2,550 people who meet these five criteria, 2,224 (more than 85%) identify as conservative. This group, who might call themselves “social conservatives”, are an impressive 19% of conservatives in the entire dataset.

The model even finds a sixth factor: if you are also Protestant, but not United Methodist, then you are even more likely to be conservative.  At first I thought this was just noise, but there is actually a large liberal wing within the United Methodist Church that supports same-sex marriage. Amazingly, BigML is able to find this nuance in the data—talk about a needle in a haystack!

On the liberal side, there’s a group that doesn’t disapprove of homosexuality, does disapprove of the death penalty, and is strongly pro-choice. This group is about 85% liberal, accounting for 12% of all liberals in the dataset. Again, it’s remarkable that BigML can find groups of people that behave in such recognizable ways, even though it knows nothing about politics, religion, or other touchy subjects.

Once again, only a small subset of the 400 variables actually matters for prediction:

Top 10 Variables for Predicting Political Beliefs

Hopefully I’ve conveyed how great BigML is at sifting through a dataset with lots of variables.  This type of “wide” dataset pops up all the time in business, especially when examining customer behavior, and traditional tools like Excel or Tableau simply aren’t designed to handle the analysis. By examining the full richness of your data, BigML helps you focus on what’s really important—even if it’s traffic tickets.

Advancing Machine Learning integration with Apple ResearchKit and HealthKit


At BigML we are excited to announce BigMLKit, a new open source framework for iOS and OS X that blends the power of BigML’s best-in-class Machine Learning platform with the ease and immediacy of Apple technologies.

BigMLKit

BigMLKit brings the capability of “one-click-to-predict” to iOS and OS X developers in that it makes it really easy to interact with BigML’s REST API through a higher-level view of a “task.” A task is, in its most basic version, a sequence of steps that is carried out through BigML’s API. Each step has traditionally required a certain amount of work, such as preparing the data, launching the remote operation, waiting for it to complete, collecting the right data to prepare the next step, and so on. BigMLKit takes care of all of this “glue logic” for you in a streamlined manner, while also providing an abstracted way to interact with BigML and build complex tasks on top of our platform.

BigML already offers a variety of tools and libraries to make it easy to integrate BigML with whatever environment you might be working in. This includes a REST API, as well as bindings that provide a higher-level view of it from the most popular programming languages, including Python, Node.js, Objective-C, and so on. We also provide more advanced tools such as our powerful BigMLer, a veritable command-line Swiss Army knife for machine learning, and we have many more surprises in the works that will make machine learning capabilities ever more accessible.

The introduction of HealthKit put the iPhone into the rapidly growing field of health tracking devices that can be used to monitor daily activities that impact one’s health. The Apple Watch will certainly fuel the trend towards health-oriented applications, and the recent open-sourcing of ResearchKit by Apple is providing further momentum for this to extend into medical research.

All of this surely creates a powerful constellation, but it leaves out a key ingredient that is not included in the solution Apple provides with HealthKit and ResearchKit: an easy way to make sense of the collected data. This is where BigML is happy to enter the picture with BigMLKit, which we believe will be a key enabler for a new class of applications in health care and medical research, empowering researchers, doctors, hospitals and health professionals to learn from health data collected via HealthKit and ResearchKit.

BigMLKit thus reaffirms BigML’s commitment to enabling new machine-learning-powered applications on any platform, with a special focus on the Apple ecosystem, where the combination of existing and emerging devices and solutions (such as the iPhone, HealthKit, Apple Watch and ResearchKit) promises to revolutionize health care and health research.

BigMLKit is still a very young project that can be found on GitHub. We welcome your feedback and we really appreciate your pull requests. Stay tuned for more updates, including a follow-up post with more information about the way you can integrate BigMLKit in your app.

Democratizing Machine Learning: The More, The Merrier!

The machine learning marketplace is heating up. The latest news on the machine learning front was Amazon’s launch of Amazon Machine Learning, which follows a few months on the heels of the commercial release of Azure Machine Learning from Microsoft. These forays from technology stalwarts (along with IBM Watson) show that the marketplace is ready for machine learning at scale, which certainly reflects the growing business imperative to make smarter decisions from Big Data backends. And more companies providing machine learning solutions is good for the industry at large: it provides customers with more choices, and it will further hasten the pace of innovation from machine learning providers, including BigML.

While BigML clearly isn’t as big as Microsoft, Amazon and the like, we do have the benefit of perspective, as we were the first company to bet on democratizing machine learning way back in 2011. (At that time the Google Prediction API existed, but it was oriented only to developers and hasn’t evolved much since.) Rather than pointing out that imitation is the sincerest form of flattery (and yes, we are flattered!), we think this is a good opportunity to highlight some top attributes of BigML in relation to emerging solutions on the marketplace.


BigML provides a robust, full-featured and scalable platform, informed by feedback from over 17,000 users who have created tens of millions of predictive models and machine learning tasks supporting countless predictions.

  1. Key differentiators of the BigML platform include:
  • Support for both supervised and unsupervised learning techniques: in addition to classification and regression tasks solved by interpretable decision trees or ensembles with top-tier performance, BigML supports cluster analysis and anomaly detection. And our 2015 roadmap is chock full of added algorithms and techniques for data exploration.
  • Best-of-market interface and visualizations: “Beautiful,” “wow,” and “amazing” are typical reactions I’ve heard while presenting BigML to customers and at conferences. Check it out for yourself and let us know of another interface that is as rich, enjoyable and intuitive as BigML’s.
  • Full-featured REST API for programmatic access to advanced ML capabilities, with bindings in several languages: as beautiful as our interface may be, the brawn and brains of BigML rest in our open API that developers and analysts alike can use to quickly create predictive workflows and other machine learning tasks.
  • Easy sharing of resources and models, including the ability to export models from BigML locally and/or for incorporation into related systems & services:  want to export a model from Azure or Amazon ML?  Good luck with that.  BigML makes it easy to export your models via the interface or API, and you’re free to use your models wherever you wish.
  • BigML Private Deployments can be implemented in any cloud and/or on premise: As BigML penetrates deeper into the enterprise, our willingness and ability to run in a corporate datacenter has become a critical differentiator. In addition, we’ve implemented BigML not just on AWS, but also in Azure and other public and private clouds.
  • In-platform feature engineering and data transformations: BigML’s Flatline makes it easy to extend and create new features for your dataset without having to go back to your source, both in the BigML interface and programmatically, using a rich set of predefined, ML-aware functions or building your own.
  2. BigML is suitable for developers and enterprises alike:
  • Pricing starts at $30/mo for individual users & developers – and you can actually use BigML for free in our Developer mode for tasks under 16MB.
  • Enterprises can purchase fully loaded “custom” subscriptions (bundled with training, support and more) and/or implement a BigML Private Deployment – either in the cloud or behind their firewall
  • All of these approaches (subscriptions or Private Deployments) include unlimited machine learning tasks along with the ability to export models.
  • BigML never charges subscribers for predictions against their own models (in contrast to Azure and Amazon)
  • With BigML subscriptions you can train models as many times as you want — and in parallel — at no extra fee
  3. BigML offers customers both an advanced analytics platform as well as a foundation for development and deployment of predictive applications:
  • It was almost two years ago when Mike Gualtieri at Forrester stated “predictive apps are the next big thing” – and we here at BigML are seeing the reality of that vision on a daily basis both with ISVs and with enterprise developers.
  • As BigML models can be exported, they can easily be incorporated into apps and services, enabling developers to focus on their solution rather than on creating and maintaining ML algorithms
  • BigML offers expert services (directly and through our partners) to help with development and deployment of predictive apps

Beyond the tangible differences listed above, as a nimble, hungry company BigML will constantly innovate at a furious pace to meet and exceed our customers’ needs.  We’re passionate about supporting our users and engage with our enterprise customers on a very integrated basis to ensure not only the success of their implementations, but also that our platform evolves according to current and emerging business requirements.

Want to learn more about BigML and/or get an update on our latest & greatest features?  Contact us and we’ll be happy to run you through a demonstration and discuss our various engagement options.  Or, you can simply get started today!

PAPIs Connect: Europe’s First Machine Learning Event for Decision Makers

A few weeks ago we told you about PAPIs’15, the 2nd International Conference on Predictive APIs and Apps, taking place on August 6-7, 2015 in Sydney, Australia. BigML was a proud sponsor of PAPIs’14 and we look forward to meeting the community again in August.

We’ll also have more opportunities to meet predictive API and predictive app enthusiasts with the new PAPIs Connect series of events. PAPIs Connect complements the annual PAPIs conference by focusing more on business cases and applications, with the aim of educating decision makers about the possibilities of machine learning. BigML will be sponsoring the first edition of PAPIs Connect, which will take place on May 21, 2015 in Paris, France.

PAPIs Connect'15

For the predictive revolution to happen, it is essential to have tools like BigML that lower the barrier to entry for machine learning. Knowing how to use this new technology is not enough, though: we also need to connect it to the domains in which it can have an impact. To do this, it is important to know how to target the right problems that will allow us to create business value from data through machine learning.

PAPIs Connect attendees will gain a business understanding of machine learning and of its importance for their organisations. They will discover what others are doing with predictive technologies, which will likely inspire them to develop their own use cases. Connect is also a great opportunity to meet thought leaders and experts who have used data to deliver an impact on their organizations. Moreover, BigML’s VP of Data Science David Gerster will be showcasing the unique automatic anomaly detection capability that was recently introduced by BigML!

You can see a preliminary version of the program on Lanyrd and can register for the Paris event at the early bird rate until April 17th. In addition, if you have an interesting case study or application built using BigML that you’d like to share with the rest of the world, please let us know and we’ll get you invited to PAPIs Connect in Paris or PAPIs’15 in Sydney!
