Following up on our post announcing the availability of BigMLKit, we are now going to introduce the BigMLKit API and present a sample app that can be used as a playground to experiment with BigMLKit.
As already mentioned in the previous post, BigMLKit brings the capability of “one-click-to-predict” to iOS and OS X developers. This is accomplished through the notion of task, which is basically a sequence of steps. Each step has traditionally required a certain amount of work such as preparing the data, calling BigML’s REST API, waiting for the operation to complete, collecting the right data to prepare the next step and so on. BigMLKit takes care of all of this “glue logic” for you in a streamlined manner, while also providing an abstracted way to interact with BigML and build complex tasks on top of our platform.
BigMLKit’s classes fall into three groups: resources, tasks and workflows, and configuration.
Everything in BigML is associated with resources, such as datasets, clusters, sources, etc. A resource’s identity in BigMLKit is defined through a name and a UUID (universally unique identifier), which are encapsulated in the BMLResourceProtocol protocol. A concrete implementation of BMLResourceProtocol will additionally provide more properties and/or methods according to the specific application that is being built. If you want to be able to filter available resources locally, you will probably need to use Core Data and define a model whose entities contain the attributes you want to filter on. On the other hand, if you only want to support remote behavior for your entities, then their UUID is enough information for the REST API to handle them.
BigMLKit defines three basic types to build resource UUIDs: BMLResourceType, BMLResourceUuid, and BMLResourceFullUuid.
The three types are typedef’ed NSStrings. According to how BigML’s REST API identifies resources, a BMLResourceFullUuid is made up of a BMLResourceType and a BMLResourceUuid joined through a slash, e.g. “model/de305d54-75b4-431b-adb2-eb6b9e546014”. The class BMLResourceUtils defines convenience methods to extract a BMLResourceType or BMLResourceUuid from a BMLResourceFullUuid.
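The type/UUID naming scheme can be illustrated with a short, language-neutral sketch. It is written in Python rather than Objective-C and is not BigMLKit code: the function names `split_full_uuid` and `make_full_uuid` are hypothetical stand-ins for the kind of helpers BMLResourceUtils provides.

```python
def split_full_uuid(full_uuid):
    """Split a 'type/uuid' string into (resource_type, uuid)."""
    resource_type, sep, uuid = full_uuid.partition("/")
    if not sep or not resource_type or not uuid:
        raise ValueError(f"not a full resource UUID: {full_uuid!r}")
    return resource_type, uuid


def make_full_uuid(resource_type, uuid):
    """Join a resource type and a UUID with a slash."""
    return f"{resource_type}/{uuid}"


# The example identifier used in the text above.
resource_type, uuid = split_full_uuid("model/de305d54-75b4-431b-adb2-eb6b9e546014")
print(resource_type)  # model
print(uuid)           # de305d54-75b4-431b-adb2-eb6b9e546014
```

Rejecting strings without a slash keeps a bare UUID from being mistaken for a full one.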
Tasks and Workflows
Tasks and Workflows are what make BigMLKit useful.
A workflow is a collection of BigML operations. It can be as simple as a single call to BigML’s REST API or it can include multiple steps, e.g., when creating a sequence of BigML resources starting with a dataset and ending with a prediction.
BigMLKit provides several classes to define and use tasks and workflows, as detailed below.
BMLWorkflow is an abstract base class that is used to build composite workflows by combining lower-level workflows. The simplest form of BMLWorkflow is a BMLWorkflowTask, which corresponds to a single-step workflow. BigMLKit provides several BMLWorkflowTask-derived classes that represent the basic operations that BigML’s REST API lets you execute:
BMLWorkflowTaskSequence is a higher-level workflow that is able to execute a sequence of workflows.
BMLWorkflowTaskContext provides the context for task execution, where input, output, and intermediate results can be stored. The context also acts as a monitor for remote operations: it polls BigML’s API to check the progress of a resource’s state and handles it according to its semantics. The storage mechanism is exposed through an NSMutableDictionary. The key/value association is an implementation detail of the workflows that use the context to carry out their operation. A context also hosts a connector object, which is responsible for handling the communication with BigML through its API interface. Currently, the connector object is an instance of ML4iOS.
BigML’s REST API offers a lot of options to configure the available machine learning algorithms. For each resource type, BigMLKit provides a plist file that describes which options are available and what their type is, so a program can easily handle them, e.g., to display a list of available options or to let users set values for them. There are three main classes at play here:
- BMLWorkflowTaskConfiguration, which allows for collecting all options in a common place and accessing them in an organized way; e.g., by getting all option definitions, or their values, etc.
- BMLWorkflowTaskConfigurationOption, which is the atomic option. It provides a way to set whether the option should actually be used in a given execution of the workflow, and to retrieve the current option value.
- BMLWorkflowConfigurator, which is a container for all the BMLWorkflowTaskConfiguration instances associated with a user session. A configurator can be shared across multiple executions of the same workflow, or even different workflows.
In many cases, it is enough to use BigML’s default values for configuration options so there is no need to tweak them. The topic of BigMLKit configuration will be explored in a further post.
Running a Workflow
Running a workflow requires two preliminary steps:
- creating the workflow
- creating and setting up the context for its execution.
As mentioned above, at the moment BigMLKit provides single-operation workflows and a task sequence workflow, but you can easily implement any kind of specific workflow that you might need. Creating and setting up a context is workflow specific. You can see an example in the sample app introduced below.
Once the preliminary steps are done, you can run a workflow by calling the runInContext:completionBlock: method. The completion block will be called at the end of the workflow execution with an NSError argument in case an error occurred.
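The run-with-completion pattern described above can be sketched in a language-neutral way. The following Python is not BigMLKit code: `Task`, `TaskSequence`, `run_in_context`, and the two step functions are hypothetical names that only mirror the roles of BMLWorkflowTask, BMLWorkflowTaskSequence, and runInContext:completionBlock:, with a plain dictionary standing in for the context and no network calls.

```python
class Task:
    """A single-step workflow that reads from and writes to a shared context."""

    def __init__(self, name, step):
        self.name = name
        self.step = step  # callable taking the context dictionary

    def run_in_context(self, context, completion):
        try:
            self.step(context)
        except Exception as error:
            completion(error)  # report the failure to the caller
        else:
            completion(None)   # None signals success


class TaskSequence:
    """Runs workflows in order and stops at the first error."""

    def __init__(self, workflows):
        self.workflows = list(workflows)

    def run_in_context(self, context, completion):
        def run_from(index):
            if index == len(self.workflows):
                completion(None)
                return

            def on_done(error):
                if error is not None:
                    completion(error)
                else:
                    run_from(index + 1)

            self.workflows[index].run_in_context(context, on_done)

        run_from(0)


# Stand-in steps: each stores its result in the context for the next one.
def create_dataset(context):
    context["dataset"] = f"dataset-from-{context['source']}"


def create_model(context):
    context["model"] = f"model-from-{context['dataset']}"


errors = []
context = {"source": "iris.csv"}
workflow = TaskSequence([Task("dataset", create_dataset),
                         Task("model", create_model)])
workflow.run_in_context(context, errors.append)
print(context["model"], errors)
```

Passing intermediate results through the context is what lets each step stay unaware of the steps around it.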
BigMLKitSampleApp is a simple iOS app that shows how you can integrate BigMLKit into your apps. The sample app is available on GitHub and lets you create a prediction from a data file, which is used to train a model. All the steps from datasource creation up to the model creation are executed remotely on BigML servers, while the prediction step is executed locally based on the calculated model and does NOT require access to BigML services. Three sample source files are provided: iris.csv, diab.csv, and wines.csv.
To keep the sample app code simple enough, it defaults to creating a decision tree, although BigMLKit also provides support for training clusters, and, in the near future, anomaly detectors and other Machine Learning algorithms provided by BigML. Furthermore, the app uses static data files to train the models, but in a real application you could as easily read the data to train your models from iOS HealthKit and/or ResearchKit, or you could use HealthKit/ResearchKit data to make predictions based on an existing reference model.
To understand how BigMLKit is integrated into the app, you can inspect the BMLPredictionViewController class, and in particular its two methods setupFromModel: and startWorkflow. The setupFromModel: method is called whenever the app delegate detects that the user tapped on any of the three available source files. On the other hand, startWorkflow is responsible for enabling the UI to provide the user with some visual feedback about the workflow being executed. It also handles the display of workflow results. In greater detail:
- When a tap on a resource file is detected, the app delegate stores the current source file in the shared view model, then calls setupFromModel:.
- setupFromModel: creates a new workflow, properly initializes a context, and finally calls startWorkflow.
- startWorkflow updates the UI, then starts the workflow and provides a callback.
- The callback, if no errors are found, will use the workflow results (available in the workflow context) to build a prediction form, so the user can try different combinations of input arguments to make new predictions.
This is all that is required! As you can see, BigMLKit makes it really straightforward to run a simple workflow and use the power of machine learning in your apps, and we hope that you will find great applications for our technology and create extensions to BigMLKit that will make it even more convenient.
If you have any questions on how to get started with BigMLKit, feel free to contact us at email@example.com.
This is the first post in a series of statistics primers to inaugurate the arrival of BigML’s new advanced statistics feature. Depending on your background as a reader, the theory portion of this post may cover ideas which you already understand. If that’s the case, go ahead and skip to how to access these stats in BigML. Today’s topic is Benford’s law, which can be applied to detect irregularities in numeric data. It applies to collections of numeric data whose values satisfy the following criteria:
- Have a wide distribution, spanning several orders of magnitude.
- Generated by some “natural” process, rather than, say, arbitrarily chosen by a human.
Given that those conditions are met, Benford’s law states that the first significant digits (FSDs) will be distributed in a very specific pattern. In other words, we can take each of the digits from 1 to 9 and look at the relative proportion with which they appear in the first significant position among values in the data (e.g. the FSDs for the values 122.4, -54.01, and 0.0048 are 1, 5, and 4 respectively). If these proportions match the ones predicted by Benford’s law, then the data are consistent with our two criteria. Otherwise, the data may have been tampered with, or may simply cover too narrow a range for Benford’s law to apply. If we denote $p_d$ as the proportion of the data in which the digit d is in the first significant position, Benford’s law states that these proportions will take on the values $p_d = \log_{10}(1 + 1/d)$ for d = 1, ..., 9.
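As a minimal sketch, the expected proportions and the FSD extraction can be computed in a few lines of Python. The proportions use the standard Benford formula, log10(1 + 1/d); the function names are illustrative and not part of BigML.

```python
import math


def first_significant_digit(value):
    """Return the first non-zero digit of a non-zero number, ignoring its sign."""
    if value == 0:
        raise ValueError("zero has no significant digits")
    if isinstance(value, int):
        # Exact for arbitrarily large integers.
        return int(str(abs(value))[0])
    # Scientific notation puts the first significant digit in front.
    return int(f"{abs(value):.15e}"[0])


def benford_proportions():
    """Expected proportion of each first significant digit 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


# The three example values from the text.
print([first_significant_digit(v) for v in (122.4, -54.01, 0.0048)])  # [1, 5, 4]

for digit, proportion in benford_proportions().items():
    print(digit, f"{proportion:.3f}")
```

The nine proportions sum to 1, since the product of (1 + 1/d) for d from 1 to 9 is exactly 10.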
In the plots that follow, $p_1$ through $p_9$ are drawn as the green line. We see that 1 should be the FSD in about 30% of the data while 9 should only be about 4.6% of the FSDs. The first two plots are examples of numeric data which conform to Benford’s law. The Fibonacci numbers and US county populations both satisfy the criteria given above. The gray bars denote the relative proportions of FSDs in the data.
The next two plots are examples of non-conforming data. The first example is data from the ubiquitous Iris dataset. Although it is undeniably a natural dataset, it fails the first criterion, since its values span only the narrow range from 4 to 8 cm. The second example is an instance of fraudulent data. As chronicled in the State of Arizona v. Wayne James Nelson (CV92-18841), Mr. Nelson, a manager in the Arizona state treasurer’s office, attempted to embezzle nearly $2 million through bogus vendor payments. Since Nelson started small and worked his way up to larger amounts, the values do satisfy the first criterion. However, as all the amounts were artificially invented, the second criterion is not satisfied and the final FSD distribution is very far from the one given by Benford’s law, with the digits 1-6 being too scarce and 7-9 being much more common than expected.
The last of these examples highlights the potential usefulness of this phenomenon in detecting suspicious numbers, and indeed there are many documented cases where fraudulent data have been exposed through application of Benford’s law. Multiple analyses of results from the 2009 Iranian presidential elections have used Benford’s law to provide statistical evidence suggesting vote rigging. A post-mortem Benford’s law analysis of the accounts for several bankrupt US municipalities revealed inconsistent figures, which could be indicative of the fiscal dishonesty which led to the municipalities’ financial ruin. A team of German economists applied a Benford’s law analysis to the accounting statistics reported by European Union member and candidate nations during the years leading up to the 2010 EU sovereign debt crisis. They found that the numbers released by Greece showed the highest degree of deviation from the expected Benford’s law distribution. As Greek national debt was one of the main drivers of the crisis, we can draw the conclusion that the Greek government was fudging the numbers to hide its fiscal instability. Interestingly, while researching this topic we found that the Greek source data for this analysis is now conspicuously absent from the EUROSTAT website.
Testing Benford’s Law
Having seen that deviation from Benford’s law can be a useful indicator of anomalous data, we are left with the question of actually quantifying that deviation. This brings us to the topic of statistical hypothesis testing, in which we seek to confirm or reject some hypotheses about a random process, given a finite number of observations from that process. For the purposes of our current discussion, the random process in question is the population from which our numeric data are drawn, and the hypotheses we consider are as follows:
H0 (null hypothesis): The population’s FSD distribution conforms to Benford’s Law
H1 (alternate hypothesis): The population’s FSD distribution is different from Benford’s Law
Depending on the outcome of the test, we either fail to reject the null hypothesis (informally, we “accept” it), or reject it in favor of the alternate hypothesis. In the latter case, we may have grounds for applying more scrutiny to the values, as failure to fit Benford’s law can be a sign of questionable data. The second piece of a statistical test is a significance level, the threshold against which a p-value is compared. In statistics, the results we obtain are not concrete facts; rather, our conclusions are parameterized by some level of certainty less than 100%. The precise definition of the p-value is rather nuanced, but we can think of it as the probability of seeing a test statistic at least as extreme as the calculated one, under the assumption that the null hypothesis is true. The workflow of a statistical test is thus as follows:
- Calculate a test statistic from the sample data, using the method prescribed for the specific test.
- Choose a desired significance level, which determines a critical value for the test statistic.
- If the calculated statistic is greater than the critical value, then the null hypothesis is rejected at the chosen significance level. Otherwise, the null hypothesis is not rejected.
Chi-Square Goodness-of-Fit Test
This test is a general purpose test for verifying whether data are distributed according to any arbitrary distribution. The test statistic is computed from counts rather than proportions. Let $\hat{p}_d$ be the observed proportion of digit d in the data’s FSD distribution, and $p_d$ be the expected Benford’s law proportion defined previously. For a data set containing N observations, the observed and expected frequencies are given by $O_d = N\hat{p}_d$ and $E_d = N p_d$ respectively. The Chi-square statistic is defined as follows:

$$\chi^2 = \sum_{d=1}^{9} \frac{(O_d - E_d)^2}{E_d}$$
The critical value for this test comes from a chi-square distribution with (9-1) = 8 degrees of freedom. For a significance level of 0.01, we get a critical value of 20.09. If the value of $\chi^2$ is greater than this value, then we can reject a fit to Benford’s law at the 0.01 significance level. In the Nelson check fraud dataset, we have the following observed frequencies:
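In other words, 1 was the first significant digit in one of the entries, while 9 was the FSD in 8 entries. For this 22-point dataset, the expected Benford’s law frequencies (22 times each Benford proportion) for digits 1 through 9 are approximately 6.62, 3.87, 2.75, 2.13, 1.74, 1.47, 1.28, 1.13, and 1.01.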
Computing the chi-square statistic is a simple matter of plugging in the values:
The obtained value is greater than the critical value, so we can indeed say that the fraudulent check data do not fit Benford’s Law. Iris, our other non-conforming dataset, also produces a chi-square statistic larger than the critical value (506.3930), while the Fibonacci and US Census datasets produce values less than the critical value (0.1985 and 10.6314 respectively).
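The test can be sketched in plain Python as follows. This is an illustration under stated assumptions, not BigML’s implementation: the statistic is the standard Pearson chi-square over the nine FSD counts, the critical value 20.09 is taken from the text, and `made_up_amounts` is a hypothetical list invented for this example (it is not the Nelson data). The choice of the first 1,000 Fibonacci numbers is likewise an assumption, so the result need not match the figure quoted above.

```python
import math
from collections import Counter

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]
CRITICAL_VALUE_0_01 = 20.09  # chi-square with 8 degrees of freedom, from the text


def first_digit(n):
    """First digit of a positive integer."""
    return int(str(n)[0])


def chi_square_benford(values):
    """Pearson chi-square statistic of the FSD counts against Benford's law."""
    counts = Counter(first_digit(v) for v in values)
    n = len(values)
    statistic = 0.0
    for digit, proportion in enumerate(BENFORD, start=1):
        expected = n * proportion
        statistic += (counts[digit] - expected) ** 2 / expected
    return statistic


def fibonacci(count):
    """The first `count` Fibonacci numbers, starting 1, 1, 2, ..."""
    a, b = 1, 1
    numbers = []
    for _ in range(count):
        numbers.append(a)
        a, b = b, a + b
    return numbers


# Hypothetical amounts whose first digits are all 7, 8, or 9.
made_up_amounts = [7100, 7250, 7400, 7600, 7800, 7950, 8100, 8200, 8350, 8500,
                   8650, 8800, 8950, 9100, 9200, 9350, 9500, 9650, 9800, 9950]

for name, data in (("fibonacci", fibonacci(1000)), ("made up", made_up_amounts)):
    statistic = chi_square_benford(data)
    verdict = "reject H0" if statistic > CRITICAL_VALUE_0_01 else "do not reject H0"
    print(f"{name}: chi-square = {statistic:.4f} -> {verdict}")
```

With only about twenty values, most expected counts fall well below 5, which is the usual caution against relying on the chi-square approximation for small samples.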
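For small sample sizes, the chi-square test can encounter difficulty in discriminating between data which do and do not fit Benford’s Law. The Cho-Gaines d statistic is an alternative test which is formulated to be less sensitive to sample size. It is defined as $d = \sqrt{N \sum_{i=1}^{9} (\hat{p}_i - p_i)^2}$, where N is the number of observations and $\hat{p}_i$ and $p_i$ are the observed and Benford’s law proportions for first digit i.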
For a significance level of 0.01, the critical value for d is 1.569. The values for d from our example data are 0.114, 1.066, 7.124, and 2.789 for the Fibonacci, US Counties, Iris, and Nelson datasets respectively. The first two values are less than the critical value, whereas the last two are greater, thus producing a result which is consistent with the chi-square test and visual comparison of the FSD distributions. Rather than being computed from a well parameterized distribution like the chi-square test, these critical values for the Cho-Gaines’ d test are obtained from Monte Carlo simulations, and are only available for a few select significance levels. This means that it is not possible to know the exact p-value for any arbitrary value of d, and thus represents a tradeoff compared to the chi-square test.
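A minimal Python sketch of the d statistic follows, assuming the usual Cho-Gaines form: the square root of N times the sum of squared differences between observed and Benford proportions. The critical value 1.569 comes from the text; the datasets are the same illustrative assumptions as before (the first 1,000 Fibonacci numbers and a hypothetical list of amounts, not the Nelson data).

```python
import math
from collections import Counter

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]
CRITICAL_D_0_01 = 1.569  # Monte Carlo critical value quoted in the text


def cho_gaines_d(values):
    """Cho-Gaines d for the first digits of a list of positive integers."""
    counts = Counter(int(str(v)[0]) for v in values)
    n = len(values)
    squared_gap = sum((counts[digit] / n - proportion) ** 2
                      for digit, proportion in enumerate(BENFORD, start=1))
    return math.sqrt(n * squared_gap)


def fibonacci(count):
    """The first `count` Fibonacci numbers, starting 1, 1, 2, ..."""
    a, b = 1, 1
    numbers = []
    for _ in range(count):
        numbers.append(a)
        a, b = b, a + b
    return numbers


# Hypothetical amounts whose first digits are all 7, 8, or 9.
made_up_amounts = [7100, 7250, 7400, 7600, 7800, 7950, 8100, 8200, 8350, 8500,
                   8650, 8800, 8950, 9100, 9200, 9350, 9500, 9650, 9800, 9950]

for name, data in (("fibonacci", fibonacci(1000)), ("made up", made_up_amounts)):
    d = cho_gaines_d(data)
    verdict = "reject H0" if d > CRITICAL_D_0_01 else "do not reject H0"
    print(f"{name}: d = {d:.3f} -> {verdict}")
```

Because d works on proportions and scales only with the square root of N, it depends less on sample size than the chi-square statistic does.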
In this post, we’ve explored First Significant Digit analysis with Benford’s Law. This straightforward concept, when combined with simple statistical tests, can be a useful indicator for rooting out anomalous numeric data. Benford’s law analysis is one of the many statistical analysis tools that are being incorporated into BigML. So stay tuned for a follow-up post on how to perform this handy task and more on BigML.
It’s been a few months since BigML was named a winner of Informatica’s Connect-a-Thon competition at their Data Mania event, where we first announced and showcased capabilities of the Informatica Connector for BigML. We’re excited to build on this relationship by being a sponsor of the Cloud Innovation Summit at Informatica World 2015. We’ll also be speaking at the Summit and demonstrating in the Informatica World Solutions Expo (booth 326).
As BigML continues to expand our base of enterprise customers around the world, we see this partnership as a critical component – helping companies quickly leverage data from their on-premise systems and cloud-based apps to perform an array of advanced analytics and/or to build predictive applications through BigML.
So if you’ll be attending Informatica World in Las Vegas next week, please be sure to stop by our booth and/or reach out to us (firstname.lastname@example.org) to arrange a 1-1 meeting!
Today our very own José A. “jao” Ortega will be presenting on ‘The Past, Present, and Future of Machine Learning APIs’ at the APIDays Mediterranea & API Words event taking place in Barcelona, Spain. This event is part of an independent conference series dedicated to APIs, Natural Language Processing and Language Technology. During the two-day conference representatives from startups, corporations and those involved in the API industry will have a chance to discuss, learn and share about the future and business of APIs.
As the event organizers suggest, Web 1.0 was readable, Web 2.0 was social, and now the web is PROGRAMMABLE through APIs. Since 2011, BigML has worked to implement our similar vision of a programmable web powered by a seamless machine learning layer in the cloud. Jao’s presentation delves deep into the origins of machine learning, including current success stories and the challenges it faces in making an impact in the “real world”, as well as bold ideas on the future direction of the space. The presentation’s emphasis is on machine learning being easily embedded in future smart apps able to adapt themselves to their context in real time as new information arrives. Simplicity, programmability, importability / exportability, composability, specialization and standardization will all play big parts in making this future vision come alive. We truly believe this is the dawn of a new era for the Internet and the digital economy that it has bred. Machine Learning APIs will be the disruptive force behind this new movement and innovators from all corners of the world are already in on the secret.
As it’s likely that many of you won’t have a chance to be in Barcelona in person, we are posting Jao’s presentation on Slideshare. Let us know your thoughts and, better yet, let us know how BigML can be part of your dream!
This blog post is based on a talk I gave at the Dare2Data conference in Madrid.
I recently found a fascinating sociology survey with more than 39,000 responses to almost 400 questions. The survey, which has been given in the United States since 1972, covers a wide range of topics. Besides demographic info like age, gender, race and income, the survey also covers personal beliefs (“Should racists be allowed to teach college?”), living situation (“Have you been too tired to do housework recently?”) and life experience (“Have you ever injected illicit drugs?”).
While it’s great to have a dataset that’s so, um, rich, most of the variables are simply not relevant to whatever it is I want to predict. If I’m predicting whether your income is higher or lower than the United States median of $50,000, it doesn’t really matter if you’ve received a traffic ticket for a moving violation, or if you think marriage counseling is scientific. (Yes, those are actual questions.)
This is where BigML comes in. Because our algorithm does a “greedy” search through the data, examining every input individually to see how well it predicts the output, it excels at finding the needle of insight in a haystack of irrelevance. BigML actually does check whether moving violations predict income, but quickly learns that marital status, education, employment and age are much more useful.
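To make the idea of a greedy search concrete, here is a small Python sketch that scores each input field on its own and picks the most predictive one. It is an illustration only: the source does not say which scoring criterion BigML uses, so information gain is an assumption swapped in here, and the four toy rows and their field names are invented for the example.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(rows, field, target):
    """Reduction in target entropy after splitting the rows on one field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[field]].append(row[target])
    remainder = sum(len(group) / len(rows) * entropy(group)
                    for group in groups.values())
    return entropy([row[target] for row in rows]) - remainder


def best_field(rows, fields, target):
    """Greedy step: examine every field individually and keep the best one."""
    return max(fields, key=lambda field: information_gain(rows, field, target))


# Invented toy data: education separates the incomes, traffic tickets do not.
rows = [
    {"education": "college", "traffic_ticket": "yes", "income": "high"},
    {"education": "college", "traffic_ticket": "no", "income": "high"},
    {"education": "high school", "traffic_ticket": "yes", "income": "low"},
    {"education": "high school", "traffic_ticket": "no", "income": "low"},
]

fields = ["education", "traffic_ticket"]
for field in fields:
    print(field, information_gain(rows, field, "income"))
print("best:", best_field(rows, fields, "income"))
```

A tree learner repeats this step on each resulting subset, which is why irrelevant fields with near-zero gain are rarely chosen for a split.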
Of course, if you change what you’re trying to predict, the list of important variables changes too. At Dare2Data, I tried predicting political beliefs instead of income, with interesting results. (Since I excluded moderates from the training set, it’s more accurate to say that I’m predicting strongly held political beliefs.)
For example, if you meet these five criteria, then you identify as conservative more than 85% of the time:
- You disapprove of homosexuality (or don’t respond to the question);
- You disapprove of sex before marriage (or don’t respond to the question);
- You are white;
- You go to church almost every week;
- You live in a single-family detached house (a proxy for living in the suburbs).
Of the 2,550 people who meet these five criteria, 2,224 (more than 85%) identify as conservative. This group, who might call themselves “social conservatives”, are an impressive 19% of conservatives in the entire dataset.
The model even finds a sixth factor: if you are also Protestant, but not United Methodist, then you are even more likely to be conservative. At first I thought this was just noise, but there is actually a large liberal wing within the United Methodist Church that supports same-sex marriage. Amazingly, BigML is able to find this nuance in the data—talk about a needle in a haystack!
On the liberal side, there’s a group that doesn’t disapprove of homosexuality, does disapprove of the death penalty, and is strongly pro-choice. This group is about 85% liberal, accounting for 12% of all liberals in the dataset. Again, it’s remarkable that BigML can find groups of people that behave in such recognizable ways, even though it knows nothing about politics, religion, or other touchy subjects.
Once again, only a small subset of the 400 variables actually matters for prediction:
Hopefully I’ve conveyed how great BigML is at sifting through a dataset with lots of variables. This type of “wide” dataset pops up all the time in business, especially when examining customer behavior, and traditional tools like Excel or Tableau simply aren’t designed to handle the analysis. By examining the full richness of your data, BigML helps you focus on what’s really important—even if it’s traffic tickets.
At BigML we are excited to announce BigMLKit, a new open source framework for iOS and OS X that blends the power of BigML’s best-in-class Machine Learning platform with the ease and immediacy of Apple technologies.
BigMLKit brings the capability of “one-click-to-predict” to iOS and OS X developers in that it makes it really easy to interact with BigML’s REST API through a higher-level view of a “task.” A task is, in its most basic version, a sequence of steps that is carried out through BigML’s API. Each step has traditionally required a certain amount of work such as preparing the data, launching the remote operation, waiting for it to complete, collecting the right data to prepare the next step and so on. BigMLKit takes care of all of this “glue logic” for you in a streamlined manner, while also providing an abstracted way to interact with BigML and build complex tasks on top of our platform.
BigML already offers a variety of tools and libraries to make it easy to integrate BigML with whatever environment you might be working in. This includes a REST API, as well as bindings that provide a higher-level view of it from the most popular programming languages, including Python, Node.js, Objective-C, and so on. We also provide more advanced tools such as our powerful bigmler, a veritable command-line Swiss Army knife for machine learning, and we have many more surprises in the works that will make machine learning capabilities ever more accessible.
The introduction of HealthKit put the iPhone into the rapidly growing field of health tracking devices that can be used to monitor daily activities that impact one’s health. The Apple Watch will certainly fuel the trend towards health-oriented applications, and the recent open-sourcing of ResearchKit by Apple is providing further momentum for this to extend into medical research.
All of this surely creates a powerful constellation, but it leaves behind a key factor which is not included in the solution that Apple provides with HealthKit and ResearchKit: an easy way to make sense of the collected data. This is where BigML is happy to enter the picture with BigMLKit, which we believe will be a key enabler for a new class of applications in health care and medical research that will empower researchers, doctors, hospitals and health professionals to learn from health data collected via HealthKit and ResearchKit.
BigMLKit thus reaffirms BigML’s commitment to enable new machine-learning-powered applications on any platform – and adds a special focus on the Apple ecosystem, where the combination of existing and emerging devices and solutions (such as the iPhone, HealthKit, Apple Watch and ResearchKit) is promising to revolutionize health care and health research.
BigMLKit is still a very young project that can be found on GitHub. We welcome your feedback and we really appreciate your pull requests. Stay tuned for more updates, including a follow-up post with more information about the way you can integrate BigMLKit in your app.
The machine learning marketplace is heating up. The latest news in the machine learning front was Amazon’s launch of Amazon Machine Learning, which follows a few months on the heels of the commercial release of Azure Machine Learning from Microsoft. These forays from technology stalwarts (along with IBM Watson) show that the marketplace is ready for machine learning at scale, which certainly reflects the growing business imperative to be able to make smarter decisions from Big Data backends. And more companies providing machine learning solutions is good for the industry at large: it provides customers with more choices, and will further hasten the pace of innovation from machine learning providers, including BigML.
While BigML clearly isn’t as big as Microsoft, Amazon and the like, we do have the benefit of perspective, as we were the first company to bet on democratizing machine learning way back in 2011. (At that time Google Prediction API existed but was only oriented to developers, and hasn’t evolved much since.) Rather than pointing out that imitation is the sincerest form of flattery (and yes, we are flattered!), we think this is a good opportunity to highlight some top attributes of BigML in relation to emerging solutions on the marketplace.
BigML provides a robust, full-featured and scalable platform which has been informed by feedback from over 17,000 users who have created tens of millions of predictive models and machine learning tasks that have supported a countless number of predictions.
Key differentiators of the BigML platform include:
- Support for both supervised and unsupervised learning techniques: in addition to classification and regression tasks solved by interpretable decision trees or ensembles for top tier performance, BigML supports cluster analysis and anomaly detection. And our 2015 roadmap is chock full of added algorithms and techniques for data exploration.
- Best-of-market interface and visualizations: “Beautiful” “wow” and “amazing” are typical reactions I’ve heard while presenting BigML to customers and conferences. Check it out for yourself and let us know of another interface that is as rich, enjoyable and intuitive as BigML.
- Full-featured REST API for programmatic access to advanced ML capabilities, with bindings in several languages: as beautiful as our interface may be, the brawn and brains of BigML rest in our open API that developers and analysts alike can use to quickly create predictive workflows and other machine learning tasks.
- Easy sharing of resources and models, including the ability to export models from BigML locally and/or for incorporation into related systems & services: want to export a model from Azure or Amazon ML? Good luck with that. BigML makes it easy to export your models via the interface or API, and you’re free to use your models wherever you wish.
- BigML Private Deployments can be implemented in any cloud and/or on premise: As BigML penetrates deeper into the enterprise, our willingness and ability to run in a corporate datacenter has become a critical differentiator. In addition, we’ve implemented BigML not just on AWS, but also in Azure and other public and private clouds.
- In-platform feature engineering and data transformations: BigML’s Flatline makes it easy to extend and create new features for your dataset, without having to go back to your source – both in the BigML interface and programmatically, using a rich set of predefined, ML-aware functions or building your own.
BigML is suitable for developers and enterprises alike:
- Pricing starts at $30/mo for individual users & developers – and you can actually use BigML for free in our Developer mode for tasks under 16MB.
- Enterprises can purchase fully loaded “custom” subscriptions (bundled with training, support and more) and/or implement a BigML Private Deployment – either in the cloud or behind their firewall
- All of these approaches (subscriptions or Private Deployments) include unlimited machine learning tasks along with the ability to export models.
- BigML never charges subscribers for predictions against your own models (in contrast to Azure and Amazon)
- With BigML subscriptions you can train models as many times as you want — and in parallel — at no extra fee
BigML offers customers both an advanced analytics platform as well as a foundation for development and deployment of predictive applications:
- It was almost two years ago when Mike Gualtieri at Forrester stated “predictive apps are the next big thing” – and we here at BigML are seeing the reality of that vision on a daily basis both with ISVs and with enterprise developers.
- As BigML models can be exported, they can easily be incorporated into apps and services – enabling developers to focus on their solution rather than on creating and maintaining ML algorithms
- BigML offers expert services (directly and through our partners) to help with development and deployment of predictive apps
Beyond the tangible differences listed above, as a nimble, hungry company BigML will constantly innovate at a furious pace to meet and exceed our customers’ needs. We’re passionate about supporting our users and engage with our enterprise customers on a very integrated basis to ensure not only the success of their implementations, but also that our platform evolves according to current and emerging business requirements.
Want to learn more about BigML and/or get an update on our latest & greatest features? Contact us and we’ll be happy to run you through a demonstration and discuss our various engagement options. Or, you can simply get started today!
A few weeks ago we told you about PAPIs’15, the 2nd International Conference on Predictive APIs and Apps, taking place on August 6-7, 2015 in Sydney, Australia. BigML was a proud sponsor of PAPIs’14 and we look forward to meeting the community again in August.
We’ll also have more opportunities to meet with predictive APIs and predictive apps enthusiasts with the new PAPIs Connect series of events. PAPIs Connect complements the annual PAPIs conference by focusing more on business cases and applications with the aim of educating decision makers about the possibilities of machine learning. BigML will be sponsoring the first edition of PAPIs Connect, which will take place on May 21, 2015 in Paris, France.
For the predictive revolution to happen, it is essential to have tools like BigML that lower the barrier to entry for machine learning. Knowing how to use this new technology is not enough, though: we also need to connect it to the domains in which it can have an impact. To do this, it is important to know how to target the right problems that will allow us to create business value from data through machine learning.
PAPIs Connect attendees will gain a business understanding of machine learning and of its importance for their organisations. They will discover what others are doing with predictive technologies, which will likely inspire them to develop their own use cases. Connect is also a great opportunity to meet thought leaders and experts who have used data to deliver an impact on their organizations. Moreover, BigML’s VP of Data Science David Gerster will be showcasing the unique automatic anomaly detection capability that was recently introduced by BigML!
You can see a preliminary version of the program on Lanyrd and can register for the Paris event at the early bird rate until April 17th. In addition, if you have an interesting case study or application built using BigML that you’d like to share with the rest of the world, please let us know and we’ll get you invited to PAPIs Connect in Paris or PAPIs’15 in Sydney!
This week, I had the honor of presenting at AIIA’s cross-industry luncheon here in Melbourne, thanks to the support of BigML’s local partner GCS Agile. The Australian Information Industry Association (AIIA) is the peak representative body and advocacy group for the ICT (Information and Communications Technology) industry in Australia. For over 35 years, its mission as a not-for-profit organization has been to advocate for, promote, represent and grow the ICT industry in Australia. It has over 400 member organizations spanning hardware, software, and services companies.
This year, the AIIA is running three cross-industry events under the umbrella theme of ‘Building the Digital Economy’. The events feature the utility, airport, transport, logistics, retail and finance sectors, with separate sessions exploring intelligent operations, connectedness and digitization of the customer experience. These themes align with the Victorian government’s themes of achieving sustainability, productivity and citizen engagement through technology, as discussed in its 2014 ICT Strategy and Digital Strategy.
‘Intelligent Operations’ is a mouthful, but it simply describes how intelligent technologies, including machine learning and predictive analytics, can be used by businesses to drive operational efficiency, employee productivity and improved customer service. Guest speakers at our luncheon included Paul Bunker, Manager, Business Systems & ICT, Melbourne Airport; Sue O’Connor, Deputy Chair, Goulburn Valley Water Corporation; and myself (Atakan Cetinsoy, V.P. – Predictive Applications, BigML). After Rebecca Campbell-Burns of AIIA set the stage for the afternoon, Mr. Bunker took the podium and made a strong case for how Melbourne International Airport’s track record of operational excellence has added to the continued economic vibrancy of the state of Victoria. He stressed that the airport runs a 24/7 operation, in which cargo planes carry precious commodities to Asian destinations every night after the passenger airliner traffic subsides. Managing physical assets efficiently in this fast-paced context, while targeting a world-class traveler experience from arrival until departure, requires an analytical blanket that can adapt to sudden changes caused by inclement weather or tightened security, which makes for a very interesting predictive analytics challenge.
Sue O’Connor’s presentation focused on Goulburn Valley Water’s efforts to keep drinking water, the most basic of human needs, at a very affordable price point at a time of environmental challenges, all the while making the infrastructure investments necessary to meet growing demand now and in the future despite tight capital and operational expenditure budgets. Sue went on to stress that they intend to invest in Internet-enabled sensor networks to the extent that there is a clear business case and an attractive ROI.
As I alluded to in my presentation, the utility and aviation industries have a huge economic upside in efficiency terms ($95 billion USD in savings as per a recent GE study) from being able to better manage their existing infrastructure with the help of real-time sensor measurements. As long as there is a way to analyze and interpret this tsunami of data in order to detect key signals, business value can be drawn in multiple ways. For instance, it may be wise to prioritize big data initiatives targeting cost savings first, because their return on investment is clear. Predictive maintenance schemes can avoid unnecessary dispatches of field maintenance personnel, saving utilities significant costs. However, sensor data can also be interpreted in ways that help launch completely new, context-sensitive value-added services that can create new revenue streams altogether. Luckily, machine learning is here to help with all these use cases. BigML’s “API first” approach to massively scaling carefully curated and well-proven machine learning algorithms has been designed to streamline the process from raw data ingestion to real predictive insights. If you are interested in the topic, you can view my presentation deck on Slideshare.
Up next for us is a trip to Sydney, where we will be presenting at two different events on Wednesday (March 25, 2015). Feel free to come by and join us at either forum by following the links below.
- Advanced Analytics Institute Event at The University of Technology, Sydney
- Big Data Analytics Meetup, Sydney
We will do BigML demos followed by interactive discussions on the promise of machine learning in Australia. It should be fun!
Machine learning and data mining play very nicely with data in a row-column format, where each row represents a data point and each column represents information about that point. It’s a natural format, and is of course the basis for things like spreadsheets, databases, and CSV files.
But what if your data isn’t so conveniently formatted? Let’s say you have an arbitrary pile of documents, like product reviews, and you’d like to classify each one. A simple thing to do would be to use word counts as features, but then you’re forced to make arbitrary decisions about which words are important. If you just use all words, you end up with thousands or maybe tens of thousands of features, which generally decreases the efficiency of machine learning. Moreover, simply counting words gives no information about the context in which the word was used.
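To make the word-count approach concrete, here is a minimal sketch in plain Python. The sample reviews and the `tokenize` and `bag_of_words` helpers are made up for illustration and are not part of any BigML API.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(documents):
    """Return (vocabulary, rows): one count vector per document, one column per word."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows

# Hypothetical product reviews, invented for this example.
reviews = [
    "Great phone, great battery.",
    "The battery died after a week.",
    "Not great.",
]
vocabulary, rows = bag_of_words(reviews)
print(len(vocabulary), "features for", len(reviews), "documents")
for review, row in zip(reviews, rows):
    print(row, review)
```

Even three tiny reviews produce nine columns, and the negative review "Not great." still gets a positive count in the "great" column, which is exactly the loss of context described above.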
Does machine learning know a better way?
Thankfully, there are technologies called topic models which take some steps towards solving this problem. The general idea is to look for “topics” in the data, which are essentially groups of words that often occur together (this is a gross oversimplification, but gives the correct flavor). For example, in a collection of news articles, you may discover a topic that has the words “Obama”, “Congress”, and “President”, which would correspond to the real-world topic of politics. We can then assign each document a score for each topic, indicating how well the document is “explained” by that topic. When we do this, we transform possibly tens of thousands of words into a small number (~10-100) of features, each one packed with information.
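For the curious, here is a rough sketch of one well-known topic model, latent Dirichlet allocation (LDA), fitted with collapsed Gibbs sampling in plain Python. This is only an illustration of the idea: the post does not say which algorithm is used behind the scenes, and the function name, hyperparameters (`alpha`, `beta`) and toy documents are all assumptions chosen for the example.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA. docs is a list of token lists.
    Returns (per-document topic scores, topic-word counts, vocabulary)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]        # tokens of doc d in topic k
    topic_word = [[0] * V for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                      # tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token, then resample its topic given all the others.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed share of each topic in each document; every row sums to 1.
    scores = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return scores, topic_word, vocab

# Invented toy documents, already tokenized.
docs = [
    "obama congress president vote congress".split(),
    "president obama senate vote".split(),
    "goal match striker goal league".split(),
    "league match coach striker".split(),
]
scores, topic_word, vocab = lda_gibbs(docs, num_topics=2)
for row in scores:
    print([round(s, 2) for s in row])
```

Each document ends up with just `num_topics` numbers instead of one column per word, and those numbers are the compact features you would hand to a classifier.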
This is a fairly general way of thinking about this problem. For example, you could use the same technology on shopping baskets (arbitrary lists of product serial numbers, say), and the “topics” would be groups of serial numbers that are often purchased together. The main limitation on the usefulness of this is the average length of each document. Because we’re relying on word co-occurrence, we’d like our documents to be as long as possible so that we have lots of co-occurrences to work with. Twitter-length documents are around the point where this stops being very useful.
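The co-occurrence signal itself is easy to picture with a few lines of Python. The baskets and serial numbers below are invented, and counting pairs is only the raw ingredient a topic model builds on, not a topic model in its own right.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    pairs = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair is counted once per basket.
        for a, b in combinations(sorted(set(basket)), 2):
            pairs[(a, b)] += 1
    return pairs

# Hypothetical shopping baskets of product serial numbers.
baskets = [
    ["SN-100", "SN-200", "SN-300"],
    ["SN-100", "SN-200"],
    ["SN-300", "SN-400"],
    ["SN-100", "SN-200", "SN-400"],
]
pairs = cooccurrence_counts(baskets)
print(pairs.most_common(1))  # [(('SN-100', 'SN-200'), 3)]
```

A basket with n distinct items contributes n(n-1)/2 pairs, so a two-item basket yields a single pair while a three-item basket yields three. That is why very short documents give the model so little to work with.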
All in all, topic modeling is basically just a fancy, automated form of feature engineering that often works nicely on arbitrarily structured documents.
As a proof of concept, I’ve developed a small service called AdFormare. To work with the website, you upload a collection of documents and we do some processing to figure out the topics in the datasets, and the topic scores for each document. As a bonus, we produce a nice visualization that shows you things like which topics often occur together, and shows you examples of documents with high scores for each topic.
Without going too deeply into it, here’s a sample visualization produced from a large collection of movie reviews:
And here’s a little tutorial that tells you what you’re looking at:
Coming Soon To BigML
We’re going to integrate the guts of this technology into BigML, so you can do topic modeling on the text fields in your BigML datasets, but I’m soliciting people to try this out on their own document collections so we can work out the bugs before we deploy. If you’ve got a collection of documents you’d like to see processed like this, by all means e-mail me (email@example.com).