
BigML Roadshow Down Under!

This week, I had the honor of presenting at AIIA’s cross-industry luncheon here in Melbourne, thanks to the support of BigML’s local partner GCS Agile. The Australian Information Industry Association (AIIA) is the peak representative body and advocacy group for the ICT (Information and Communications Technology) industry in Australia. For over 35 years, this not-for-profit organization has pursued its mission to advocate for, promote, represent, and grow the ICT industry in Australia, with over 400 member organizations spanning hardware, software, and services companies.

AIIA Event Melbourne

This year, the AIIA is running three cross-industry events under the umbrella theme of ‘Building the Digital Economy’. The events feature the utility, airport, transport, logistics, retail, and finance sectors, with separate sessions exploring intelligent operations, connectedness, and the digitization of the customer experience. These themes align with the Victorian government’s goals of achieving sustainability, productivity, and citizen engagement through technology, as laid out in its 2014 ICT Strategy and Digital Strategy.

‘Intelligent Operations’ is a mouthful, but it really describes how intelligent technologies, including machine learning and predictive analytics, can be used by businesses to drive operational efficiency, employee productivity, and improved customer service. Guest speakers at our luncheon included Paul Bunker, Manager, Business Systems & ICT, Melbourne Airport; Sue O’Connor, Deputy Chair, Goulburn Valley Water Corporation; and myself (Atakan Cetinsoy, V.P. of Predictive Applications, BigML). After Rebecca Campbell-Burns of AIIA set the stage for the afternoon, Mr. Bunker took the podium and made a strong case that Melbourne International Airport’s track record of operational excellence has contributed to the continued economic vibrancy of the state of Victoria. He stressed that they run a 24/7 operation, where cargo planes carry precious commodities to Asian destinations every night after the passenger airliner traffic subsides. Managing physical assets efficiently in this fast-paced context, while targeting a world-class traveler experience from arrival to departure, requires an analytical blanket that can adapt to sudden changes caused by inclement weather or tightened security, which makes for very interesting predictive analytics challenges.

Sue O’Connor’s presentation focused on Goulburn Valley Water’s efforts to maintain a very affordable price point for drinking water, the most basic of human needs, at a time of environmental challenges, all while making the infrastructure investments necessary to meet growing demand now and in the future despite tight capital and operational expenditure budgets. Sue went on to stress that they intend to invest in Internet-enabled sensor networks to the extent that there is a clear business case and attractive ROI.

As I alluded to in my presentation, the utility and aviation industries have a huge economic upside ($95 billion USD in savings per a recent GE study) in efficiency terms from better managing their existing infrastructure with the help of real-time sensor measurements. As long as there is a way to analyze and interpret this tsunami of data and detect the key signals, business value can be drawn in multiple ways. For instance, it may be wise to prioritize big data initiatives targeting cost savings first, given their clear return on investment. Predictive maintenance schemes can avoid unnecessary dispatches of field maintenance personnel, saving utilities significant costs. However, sensor data can also be interpreted in ways that help launch completely new context-sensitive value-added services that create new revenue streams altogether. Luckily, machine learning is here to help with all of these use cases. BigML’s “API first” approach to massively scaling carefully curated and well-proven machine learning algorithms has been designed to streamline the process from raw data ingestion to real predictive insights. If you are interested in the topic, you can view my presentation deck on Slideshare.

Up next for us is a trip to Sydney, where we will be presenting at two different events on Wednesday (March 25, 2015). Feel free to come by and join us at either forum by following the links below.

We will do BigML demos followed by interactive discussions on the promise of machine learning in Australia.  It should be fun!

Topic Modeling Coming to BigML

Machine learning and data mining play very nicely with data in a row-column format, where each row represents a data point and each column represents information about that point.  It’s a natural format, and is of course the basis for things like spreadsheets, databases, and CSV files.

But what if your data isn’t so conveniently formatted?  Let’s say you have an arbitrary pile of documents, like product reviews, and you’d like to classify each one.  A simple thing to do would be to use word counts as features, but then you’re forced to make arbitrary decisions about which words are important.  If you just use all words, you end up with thousands or maybe tens of thousands of features, which generally decreases the efficiency of machine learning.  Moreover, simply counting words gives no information about the context in which the word was used.

Gettysburg Topic Markup

Does machine learning know a better way?

Thankfully, there are technologies called topic models which take some steps towards solving this problem.  The general idea is to look for “topics” in the data, which are essentially groups of words that often occur together (this is a gross oversimplification, but gives the correct flavor).  For example, in a collection of news articles, you may discover a topic that has the words “Obama”, “Congress”, and “President”, which would correspond to the real-world topic of politics.  We can then assign each document a score for each topic, indicating that this document is “well-explained” by that topic.  When we do this, we transform possibly tens of thousands of words into a small number (~10-100) of features, each one packed with information.

This is a fairly general way of thinking about this problem.  For example, you could use the same technology on shopping baskets (arbitrary lists of product serial numbers, say), and the “topics” would be groups of serial numbers that are often purchased together.  The main limitation on the usefulness of this is the average length of each document.  Because we’re relying on word co-occurrence, we’d like our documents to be as long as possible so that we have lots of co-occurrences to work with.  Twitter-length documents are around the point where this stops being very useful.

All in all, topic modeling is basically just a fancy, automated form of feature engineering that often works nicely on arbitrarily structured documents.
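If you would like to experiment with the general idea before it lands in BigML, here is a minimal sketch using scikit-learn’s LDA implementation. It is purely illustrative: the toy documents, topic count, and variable names are my own assumptions, and this is not the algorithm running behind AdFormare or BigML.

```python
# A minimal sketch of topic modeling with scikit-learn's LDA; illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "the president addressed congress about the budget",
    "the team won the championship game last night",
    "congress debated the president's new budget proposal",
    "the quarterback threw three touchdowns in the game",
]

# Turn raw documents into word counts (potentially tens of thousands of sparse features).
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

# Compress the counts into a handful of dense topic scores per document.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_scores = lda.fit_transform(counts)  # shape: (n_documents, n_topics)

# Inspect the top words in each discovered topic.
words = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[-4:][::-1]]
    print(f"Topic {topic_idx}: {top_words}")
```

The key payoff is the shape of `topic_scores`: thousands of word-count columns collapse into a few topic columns that a downstream classifier can actually use.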

Enter AdFormare

As a proof of concept, I’ve developed a small service called AdFormare.  To work with the website, you upload a collection of documents, and we do some processing to figure out the topics in the dataset and the topic scores for each document.  As a bonus, we produce a nice visualization that shows you things like which topics often occur together, along with examples of documents that score highly for each topic.

AdFormare Topics

Without going too deeply into it, here’s a sample visualization produced from a large collection of movie reviews:

https://www.adformare.com/view/movie_reviews

And here’s a little tutorial that tells you what you’re looking at:

https://www.adformare.com/tutorial

Coming Soon To BigML

We’re going to integrate the guts of this technology into BigML, so you can do topic modeling on the text fields in your BigML datasets, but I’m soliciting people to try this out on their own document collections so we can work out the bugs before we deploy.  If you’ve got a collection of documents you’d like to see processed like this, by all means e-mail me (parker@bigml.com).

The Need for Machine Learning is Everywhere!


In my work, which is predominantly information technology, the need for Machine Learning is everywhere. And I don’t mean just in the somewhat obvious ways like security or log file analysis. Consider that my work experience goes back to the tail end of the mainframe era. In fact, one of my first jobs was networking together what were at the time considered powerful desktop computers as part of a pilot project to replace a mainframe. Since then I’ve seen the birth of the internet and the gradual transition of computing resources to the cloud. The difference between those two extremes is that adding capacity used to mean messing around in ceiling tiles for hours, whereas now, when I need additional capacity, I run a few lines of code and instantiate machines nearly instantly in the cloud.

This ability to scale up quickly to meet rising demand, and I would argue even more importantly, to scale down to save costs is a huge advantage. Everyone who is familiar with the cloud knows this, but there is more. Once the allocation of resources is programmatic, the next logical step is to make that allocation intelligent. To not just respond to current demand, but to be able to predict demand in real time and have an optimal infrastructure running at the exact instant that it is needed.

Real-time Machine Learning

Indeed, the need for Machine Learning is everywhere!

But what is driving this demand for Machine Learning, why now? One factor, as in the previous example, is the ready availability of cheap machine power, in particular the cloud, which is continually making computation cheaper. As computation gets cheaper, the impossible becomes possible. And as things become possible, people tend to start doing them and you end up with people like me making intelligent infrastructure.

But another related factor is the explosive growth of data in the last few years. The interesting thing is that there has been a lot of focus on collecting, storing, and even doing computation with so-called Big Data, but much less discussion of how to actually derive useful insights from it. As it turns out, Machine Learning is particularly well suited to the task.

As companies with big data initiatives realize this, it is driving a big demand for Machine Learning. How big? To answer that, I think it is important to realize that this is not just about mining data to improve the company bottom line, although that’s certainly a probable outcome. No, this is a question of survival. We often see startups using a data-centric approach to disrupt existing markets. Entrenched companies that don’t understand this trend risk extinction.

So we have a growing demand for Machine Learning, but what work needs to be done? Do we need new and better algorithms? Probably – I mean, it’s always possible that someone will invent a new algorithm tomorrow that blows everything else away, and it would be really hard to argue that this is a bad thing. On the other hand, there are already a lot of really good algorithms available, and lots of problems ready to be solved.

In fact, I recently read a paper titled “Do we need hundreds of classifiers to solve real world problems?”. In this paper, the authors evaluated the performance of 179 classifiers on a variety of datasets. That’s 179 different algorithms available just to handle a classification problem. Interestingly, the best performer in that paper was the random decision forest (RDF), which is hardly new.

Now don’t get me wrong, the need for newer algorithms to advance the science of Machine Learning will never go away. But there is a huge amount of work that needs to be done right now to move existing algorithms out of the lab and into the practical world: to make the existing algorithms more robust and consumable. This work is more important than the elusive perfect model algorithm, because for most projects a perfect model will not be the final product; the model will form only one piece of the entire application that most companies need to implement, and it’s often more important to deliver results. After all, the value of data is often perishable.

The need for Machine Learning is everywhere, and BigML is here to deliver it.

BigML + Informatica = Connected Machine Learning for the Enterprise

We’re very excited to share news that we’ve collaborated with Informatica to release the Informatica Connector for BigML. The connector debuted yesterday at Informatica’s Data Mania event, where BigML was named a winner of their Connect-a-Thon competition!

BigML INFA Connector

The connector, which is being made available through the Informatica Cloud Marketplace, can be used to access, blend, structure, and connect data from any app in the Informatica Cloud directly to a source in BigML through a simple visual wizard flow. This opens the gates for data stored in your ERP, CRM, HR, Marketing Automation, Business Intelligence, Procurement, Financial, or many other mission-critical enterprise systems from marquee vendors (e.g. Salesforce, Oracle, NetSuite, Workday, Birst, Concur, Coupa) to be pulled directly into BigML.  This of course augments our current source import options such as pulling from local files, using web links for remote resources, and our web-based connectors with Dropbox and the recently announced Google Cloud integration.

In addition, you will be able to use the connector in conjunction with BigML’s API to incorporate actionable models created in BigML into your enterprise applications and services.  As you know, creation of such end-to-end machine learning workflows is a critical component for development and deployment of predictive apps. With BigML Private Deployments available either on-premises or in the cloud, BigML is a highly versatile machine learning platform for enterprises of all shapes and sizes; the Informatica Connector makes our offering even more powerful and flexible.

We’ll be sharing more details on the connector and our partnership with Informatica in the near future, so stay tuned. In the interim, feel free to contact us to learn how you can tap into this exciting new functionality today!

Machine Learning Coming to your Mac OS X

Here at BigML our mission is to make machine learning easy, beautiful and understandable to everyone. We work hard to make sure that BigML’s REST API and web-based interface offer very intuitive workflows for many types of machine learning tasks. Today, we are proud to announce the BigML native app for Mac OS X, which streamlines machine learning workflows even further. With BigML for Mac OS X you can now generate predictions from your data by just dragging and dropping a file. It’s as simple as that: you do not even need to click once!

There are several key points we’d like to emphasize before giving you more details:

  • Optimized for Cloud Computing: Building machine-learned models is a computationally expensive task, because it tends to go through many iterations in an effort to achieve a higher accuracy model. BigML leverages the Cloud’s power, and the native Mac OS X app is capable of managing all of the necessary cloud resources for you.

  • Fast and Cost-free Predictions: Using our Mac OS X client, you can get predictions off of your local data not only faster than using cloud-only alternatives, but as a bonus, it will also cost you a sum total of NOTHING! On top of that, BigML models are white-box, meaning that you can interact with them to better explain why a specific prediction was made.

  • Workflow Templates: Creating Machine Learning models from your data involves some basic steps going from raw data to a final model structure. The app takes care of the workflow and creates all intermediate resources. It will also let you access, modify or reuse those resources to perform other evaluations or optimizations.

Introducing BigML for Mac OS X

BigML for Mac OS X goes to great lengths to ensure that you have all the resources you need in a single, comprehensive view.

BigML for Mac OS X

As seen above, BigML for Mac OS X’s main window has three areas:

  • Project Area: Allows you to create a new project or select an existing one that you can modify.
  • Workflow Area: This section lets you monitor the current state of any ongoing operation, select a workflow type, and start an operation. The Workflow Area is an “active” area, in that you can drag and drop files or BigML resources onto it to have them processed.
  • Resource Browser: Makes it possible to look up any existing resource (e.g. datasets, models, clusters), inspect its state, and reuse it to start new workflows.

The most important part of BigML’s main window is the Workflow Area. If desired, it is possible to shrink down the Resource Browser area so it does not take any space on your desktop.

Your First Prediction

Creating your first prediction takes only a few seconds. Just locate a data file and drag it onto the central workflow area. After dragging the data file onto the workflow area, you will observe how BigML connects to your remote account to create all intermediate resources such as a data source, a dataset, and finally a model, which will eventually be used to generate your predictions. The advantage of BigML for Mac OS X is that you do NOT necessarily need to understand how a model is created and what intermediate steps are taken (or, for that matter, how predictions are computed using the model): the app takes care of that for you.

Wine Sales Predictions

Eventually, the prediction window will be displayed. There, you will be able to change your model input fields to produce a new prediction. BigML is able to recalculate a prediction “on the fly” while you change the values for the variables shown in the prediction window. It is important to keep in mind that your models are created remotely in the cloud, but they are also replicated locally so that the predictions based on your inputs can be computed locally. That means no network access is required to make new predictions, nor are any BigML credits spent.
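For the curious, the same remote-model, local-prediction pattern can be sketched with BigML’s open-source Python bindings; the model ID and input fields below are placeholders, and the Mac app of course uses its own native bindings rather than this code.

```python
# A rough sketch of local predictions with BigML's Python bindings; placeholders only.
from bigml.api import BigML
from bigml.model import Model

api = BigML()  # reads BIGML_USERNAME / BIGML_API_KEY from the environment

# Download the remote model once...
remote_model = api.get_model("model/550a123456789012345678901")  # placeholder ID
local_model = Model(remote_model)

# ...then predict locally as many times as you like, with no further network calls.
prediction = local_model.predict({"Grape": "Pinot Noir", "Rating": 89, "Price": 22.0})
print(prediction)
```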

Beyond Simple Workflows

BigML for Mac OS X not only supports tree-based models but also enables cluster analysis (and very soon anomaly detection). Switching from creating models to creating clusters is very easy: you just have to select the corresponding workflow. Now, if you drag the same file as before onto BigML’s main window, you will be able to make new predictions using BigML’s unsupervised clustering algorithms. BigML for Mac OS X is smart enough to automatically update all the related resources and workflows in the background whenever you make UI selection changes.  This saves you the headache of manually maintaining the proper state of each resource and instead lets you concentrate on getting valuable insights from your data.  In the near future, you will be able not only to configure your own workflows, but also to personalize them.

BigML Cluster Workflow

Remote and Local Resources

As mentioned above, when you drag a source file over to BigML, a set of resources is created remotely that is also mirrored locally. The Resource Browser allows you to keep an eye on the resources as you create, delete or rename them, and even when you use them to create new predictions, so you do not have to start over every time.

To access the Resource Browser, just click the down-arrow in the bottom-right corner of the BigML main window. You can select a specific resource type in the resource browser to see what resources of that type you have created. If you right-click on a resource, you will access a pop-up menu offering a few options such as renaming the resource, deleting it from the server, and so on. Finally, you can drag any resource onto the workflow area, and it will be used as a starting point to create a new model or cluster. This gives you an alternate way to create a prediction.

How Can I Get It?

We are about to kick off the BigML for Mac OS X private beta. If you are interested, drop a note to bigmlx@bigml.com and we’ll include you in our beta testers list as soon as the private beta starts.

A Note for Developers

BigML for Mac OS X has been implemented using the BigML iOS bindings. That is, just plain vanilla BigML REST API calls.  So if you are in a hacking mood, imagine how easy powering your Mac OS X application with machine learning can be.
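To give a flavor of how thin that layer is, here is a rough sketch of a prediction request made straight against the REST API using Python’s requests library; the endpoint, parameters, and response field follow my reading of the public API docs and may differ slightly, and the model ID is a placeholder.

```python
# A back-of-the-envelope sketch of calling BigML's REST API directly; illustrative only.
import os
import requests

AUTH = {"username": os.environ["BIGML_USERNAME"], "api_key": os.environ["BIGML_API_KEY"]}

# Ask the server for a prediction from an existing model (placeholder model ID).
response = requests.post(
    "https://bigml.io/prediction",
    params=AUTH,
    json={
        "model": "model/550a123456789012345678901",
        "input_data": {"Grape": "Pinot Noir", "Rating": 89},
    },
)
response.raise_for_status()
print(response.json().get("output"))  # field name assumed from the API docs
```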

What’s Next? Tell Me about the ‘Big Picture’ Already

You bet! It is our conviction that as soon as 5 years from now, we will be living in a world where interfacing with machine-learned models will be as natural and seamless for a business analyst as it is for any of us to interface with an iPhone today.  If we succeed, it’ll all be as natural (and taken for granted) as the air around us. BigML’s native Mac OS X client may be one small step in this direction, but make no mistake about it: in hindsight, it may also prove to be one giant step for 21st century business in the not so distant future.  Will you be there with us?  Or will you still be staring at an Excel spreadsheet? The choice is yours!

Divining the ‘K’ in K-means Clustering

The venerable K-means algorithm is a well-known and popular approach to clustering. It does, of course, have some drawbacks, the most obvious one being the need to choose a pre-determined number of clusters (the ‘k’). So BigML has now released a new feature for automatically choosing ‘k’ based on Hamerly and Elkan’s G-means algorithm.

The G-means algorithm takes a hierarchical approach to detecting the number of clusters. It repeatedly tests whether the data in the neighborhood of a cluster centroid looks Gaussian, and if not, it splits the cluster. A strength of G-means is that it deals well with non-spherical data (stretched-out clusters). We’ll walk through a short example using a 2-dimensional dataset with two clusters, each with its own covariance (stretched in different directions).

G-means starts with a single cluster. The cluster’s centroid will be the same as you’d get if you ran K-means with k=1.

projection0

G-means then tests the quality of that cluster by first finding the points in its neighborhood (nearest to the centroid).  Since we only have one cluster right now, that’s everything. Using those points, it runs K-means with k=2 and finds two candidate clusters. It then creates a vector between those two candidates. G-means considers this vector to be the most important one for clustering the neighborhood, and projects all the points in the neighborhood onto that vector.

projection1

Finally, G-means uses the Anderson-Darling test to determine whether the projected points have a Gaussian distribution. If they do, the original cluster is kept and the two candidates are rejected. Otherwise, the candidates replace the original. In our example the distribution is clearly bimodal and fails the test, so we throw away the original and adopt the two candidate clusters.

histogram1

After G-means decides whether to replace each cluster with its candidates, it runs the K-means update step over the new set of clusters until their positions converge. In our example, we now have two clusters and two neighborhoods (the orange points and the blue points). We repeat the previous process of finding candidates for each neighborhood, making a vector, projecting the points, and testing for a Gaussian distribution.

projection2

This time, however, the distributions for both clusters look fairly Gaussian.

histogram2

When all clusters appear to be Gaussian, no new clusters are added and G-means is done.
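To make the procedure concrete, here is a condensed sketch of the split test described above, assuming scikit-learn for K-means and SciPy for the Anderson-Darling test. It is a simplification, not BigML’s implementation, and the thresholding detail is illustrative.

```python
# A condensed sketch of the G-means split test; not BigML's production code.
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans


def should_split(points, significance_index=2):
    """Return True if the cluster's neighborhood does not look Gaussian."""
    # Find two candidate children with K-means (k=2).
    candidates = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    c1, c2 = candidates.cluster_centers_

    # Project every point onto the vector joining the two candidate centroids.
    v = c1 - c2
    projected = points @ v / np.linalg.norm(v)

    # Anderson-Darling normality test on the standardized projections.
    # significance_index=2 picks SciPy's 5% critical value; BigML exposes its own
    # "critical value" parameter instead.
    result = anderson((projected - projected.mean()) / projected.std(), dist="norm")
    return result.statistic > result.critical_values[significance_index]
```

Splitting would then proceed exactly as in the walkthrough: keep the parent when `should_split` returns False, otherwise replace it with the two candidates and run the K-means update step again.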

The original G-means has a single parameter which determines how strict the Anderson-Darling test is when deciding whether a distribution is Gaussian. In BigML this is the critical value parameter, which we allow to range from 1 to 20. The smaller the critical value, the stricter the test, which generally means more clusters.

Our version of G-means has a few changes from the original as we built on top of our existing K-means implementation. The alterations include a sampling/gradient-descent technique called mini-batch k-means to more efficiently handle large datasets. We also reused a technique called K-means|| to quickly pick quality initial points when selecting the candidate clusters for each neighborhood.

BigML’s version of G-means also alters the stopping criteria. In addition to stopping when all clusters pass the Anderson-Darling test, we stop if multiple iterations introduce new clusters without any improvement in cluster quality. The intent is to prevent situations where G-means struggles on datasets without clearly differentiated clusters, which can result in many low-utility clusters. This part of our algorithm is tentative, however, and likely to change. We also plan to offer a ‘classic’ mode that stops only when all clusters pass the Anderson-Darling test.

All that said, we’ve been happy with how well G-means handles datasets with complicated underlying structure. We hope you’ll find it useful too!


PAPIs 2015 – Call for Proposals Begins!

Following up on the success of its inaugural event last year, PAPIs.io 2015 is fast approaching. This year’s event will take place down under in the beautiful “harbour city” of Sydney.  It is conveniently scheduled for the Thursday and Friday (6-7 August, 2015) immediately preceding KDD, the ACM conference on knowledge discovery and data mining, which attracts 2000+ Big Data practitioners and researchers. As a founding member and initial sponsor of PAPIs.io, BigML will be participating in this year’s event too.

PAPIs.io

PAPIs.io is a unique event in that it has been able to bring together data scientists, developers, and practitioners from 20+ countries representing many different industries and educational institutions. Past participants have included large tech companies such as Amazon, AXA, Banc Sabadell, BBVA, IBM, ING, Intel, Microsoft, Samsung, and SAP, as well as leading startups in the field (e.g. BigML, Dataiku, Indico, RapidMiner), all there to discuss all things Predictive APIs and Predictive Apps. The very hands-on and interactive agenda centers on addressing the challenges of building real-world predictive applications on top of a growing number of Predictive APIs that are making Machine Learning more and more accessible to developers.  As a bonus, this year’s event will also introduce a technical track.

Here is a reminder of some of the real-life predictive applications that were showcased in great detail at last year’s event:

  • Real-time Online Ad Bidding Optimization
  • Overcoming Challenges in Sentiment Analysis
  • Winning Kaggle’s Yandex Personalized Web Search Challenge
  • Forecasting Bitcoin Exchange Rates
  • Paris Area Transportation Optimization via Predictive Analytics
  • Personalized Card-linked Offers for Consumers
  • Bikesharing Optimization and Balancing
  • Office 365 Infrastructure Health Engine

We urge you to consider presenting at this year’s event and to follow PAPIs.io on Twitter for further updates on the matter.

Visualize Your Data with Dynamic Scatterplots

We have recently announced our Dynamic Scatterplot capability, which is one of many goodies to have come out of BigML Labs. You can utilize dynamic scatterplots to do a deeper dive into and interact with your multidimensional data points before or after your modeling.

In this post, we will visualize the clusters we built based on Numbeo’s Quality of Life metrics per country. This dataset has only 86 observations, each recording the following quality of life metrics for a country:

  • Country
  • Quality of Life Index
  • Purchasing Power Index
  • Safety Index
  • Health Care Index
  • Consumer Price Index
  • Property Price to Income Ratio
  • Traffic Commute Time Index
  • Pollution Index

Quality of Life Index is a proprietary value calculated by Numbeo based on a weighted average of the other indicator fields in the dataset (e.g. Safety Index, Pollution Index). Therefore, we removed this field prior to our clustering to get a better sense of how clusters are formed without any subjective weighting measures.

We used G-means clustering with default settings to analyze all 86 records in the dataset. The process ended up with 2 clusters: 45 countries in Cluster 0 and 41 in Cluster 1, a pretty even split all in all. A quick glance at the descriptive stats on the side panel shows that Cluster 1 tends to have a higher representation of wealthier, more developed nations, whereas Cluster 0 mainly consists of developing nations.
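If you prefer to reproduce this setup programmatically, a rough sketch with the BigML Python bindings might look like the following; the file name is a placeholder and the excluded_fields argument is my assumption of how to drop the index field, so double-check it against the API documentation.

```python
# A rough sketch of the clustering setup through the API; argument names assumed.
from bigml.api import BigML

api = BigML()  # credentials from BIGML_USERNAME / BIGML_API_KEY

source = api.create_source("quality_of_life.csv")  # placeholder file name
api.ok(source)

# Drop the proprietary Quality of Life Index before clustering.
# "excluded_fields" may need the field's ID rather than its name; check the docs.
dataset = api.create_dataset(source, {"excluded_fields": ["Quality of Life Index"]})
api.ok(dataset)

# With no k supplied, BigML defaults to G-means and picks the number of clusters.
cluster = api.create_cluster(dataset, {"name": "Numbeo G-means clusters"})
api.ok(cluster)
```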

Cluster

This is a great start, but what if you want to dive deeper into the makeup of each cluster? Well, in that case, BigML already offers you the option to build a separate decision tree model from each cluster, either before or even after you create your clusters. As your clusters are created, so are the corresponding trees, which you can traverse to better understand which variables best explain the grouping of instances in a given cluster.

For example, the screenshot below reveals that Purchasing Power Index had the most influence for Cluster 0: any country with PPI less than 42 (the short right branch) was automatically classified as belonging to Cluster 0, among other more complex rules (shown on the left branch).

Tree

Now, we have a better idea about the method behind our clusters. However, at times, we may need to dive even deeper into the data and see how individual records are laid out on a plane in relation to each other much like the cluster visualization itself, but applied to individual instances. This is especially useful if there are thousands or more data points to be analyzed.

Our brand new Dynamic Scatterplot feature lets you do just that. Once you navigate to the Dynamic Scatterplot screen, BigML asks you to specify which dataset it needs to use for plotting. As you type letters, matching datasets appear in the dropdown. After you select your dataset, you can pick the dimensions you would like to visualize: up to three at a time, covering the X axis, the Y axis, and the color coding.

DSP Menu

The example image below depicts how each country in our Numbeo dataset is positioned according to Purchasing Power Index (X axis), Health Care Index (Y axis) and Cluster identifier (Color dimension). The familiar Data Inspector panel on the right hand side shows the values for a particular data point you can mouse over.

PPI vs. Healthcare

As you can see, even though our cluster analysis took into account all available fields, the dispersion in this visualization still shows a pretty obvious concentration of Cluster 0 (dark blue) in the bottom-left quadrant and Cluster 1 (light blue) in the top-right quadrant. This confirms our gut-feel expectation that countries with higher purchasing power would also have higher quality healthcare.
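If you would like to recreate a similar static view offline, a quick pandas and matplotlib sketch along these lines would do; the exported CSV and its column names are assumptions.

```python
# A quick local approximation of the scatterplot; file and column names are assumed.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("numbeo_with_clusters.csv")
colors = df["cluster"].map({"Cluster 0": "darkblue", "Cluster 1": "lightblue"})

plt.scatter(df["Purchasing Power Index"], df["Health Care Index"], c=colors)
plt.xlabel("Purchasing Power Index")
plt.ylabel("Health Care Index")
plt.title("Numbeo countries colored by cluster")
plt.show()
```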

However, there are interesting exceptions to note. For instance, the dark blue dot near the coordinate (40, 80) is Thailand. (Please note that we have manually superimposed relevant country flags on the actual output.) Thailand is a developing nation; nevertheless, it is punching well above its weight in terms of health care services. A little research reveals that there is a growing healthcare tourism industry in Bangkok drawing many foreigners seeking more affordable care. Similarly, the Dominican Republic presents us with an interesting case.

We then get curious about the group of dots that have relatively high purchasing power (PPI >= 80), yet not as high a healthcare score (HCI <= 60) as one would expect at that level of purchasing power. The zoom-in feature of Dynamic Scatterplots comes in handy for this. Marking the aforementioned area with our mouse, we can instantly visualize just that portion of our chart, as follows. (Please note that we have manually superimposed relevant country flags on the actual output.)

Zoom in

The 4 light blue (Cluster 1) dots here represent Puerto Rico, United Arab Emirates, Saudi Arabia and Ireland. These turn out to be wealthier nations with subpar healthcare.

As seen in this straightforward example, playing with the Dynamic Scatterplot is both easy and instructive. One cannot always find easy explanations when utilizing Machine Learning techniques, but effective visualizations can help provide additional “color” and confidence to our findings where other methods may fail.

We hope that you will give this cool new offering from BigML a try as part of your next data mining project. As always, please let us know how we can improve it further. The best part is, it comes FREE with all existing subscription levels, so have at it!

Filling the Blanks in Your Google Sheets with Machine Learning

It is no surprise by now that we all have to deal with lots of data in many different formats in our everyday lives. From database managers taming growing quantities of data in large companies to the handy spreadsheet that works just fine for small tasks or personal use, BigML has been on a mission to bring Machine Learning predictions to every dataset.  In that spirit, we continue this week’s Google integration theme with news on our upcoming Google Sheets add-on.

Google Sheets is a truly wonderful tool to store your datasets. It is fully functional as a spreadsheet, but it turns out that you can still improve its utility by taking advantage of add-ons. The add-ons are macro-like Google Apps Scripts that can interact with the contents of your Google Sheet (or Docs or Forms) and automate repetitive tasks, or connect to other services. At BigML, we’ve built our own add-on that will let you use your models or clusters to add predictions to your data in Google Sheets.

BigML users already know how easy it is to start making predictions in BigML. Basically, you register, upload your data to BigML (it can come from local or remote CSV files, Excel spreadsheets, inline data, etc.), and in one click build a dataset, where all the statistical information is summarized. With a second click, you can build a model, where the hidden patterns in your data are unveiled. Those rules can later be used to predict the contents of empty fields in new data instances. With our new add-on, it’s now possible to perform those predictions directly in your Google Sheet.
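For readers who prefer code to clicks, that same two-click workflow can be sketched with the open-source BigML Python bindings; the file name, field names, and input values below are placeholders.

```python
# A sketch of the source -> dataset -> model -> prediction workflow; placeholders only.
from bigml.api import BigML

api = BigML()  # credentials from BIGML_USERNAME / BIGML_API_KEY

source = api.create_source("wine_sales.csv")
api.ok(source)
dataset = api.create_dataset(source)   # first click: summarize the data
api.ok(dataset)
model = api.create_model(dataset)      # second click: find the patterns
api.ok(model)

# Fill in a blank: predict total sales for a new wine.
prediction = api.create_prediction(model, {"Grape": "Cabernet Sauvignon", "Rating": 92})
api.ok(prediction)
print(prediction["object"]["output"])
```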

The wine shop use case

The first time you log in to BigML, you land in a development area with a bunch of sample data sources available for you to play with at no cost. Let’s use one of these to build an example: the fictional wine sales dataset. It contains historical wine sales figures and the related features for each wine, such as the country, type of grape, rating, origin, and price. Imagine you want to carry new wines in your store. It would be great to have an estimate of the total sales you can expect from each new wine, so that you can choose the ones that will sell better, right?

wines_list

Using the above dataset, you can easily create a BigML decision tree model that can predict the total sales for a wine given its features. Thus, for every new wine, you can use the model to compute the expected total sales and choose the new wines most likely to maximize your revenue.  But what if your list of new wines is in a Google Sheet? Good news! You can also use your BigML model from within your Google sheet to quickly compute the predicted sales values for the new wines.

Using BigML models from Google Sheets

To use this new functionality, you’ll need to first install the BigML add-on (coming soon).  Once installed, it will appear under the add-ons menu as seen below. You can now choose the ‘Predict’ submenu item, which will display the form needed to access all your models and clusters in BigML (provided that you’ve authenticated with your BigML credentials). In this case, you’ll sort through your list of models and select the one that was built on your historical wine sales data. Finally, you’ll be ready to add predictions to the empty cells in your Google Sheet.

wines_predicted

To do this, select the range of cells that contain the information available for your new wines list. When you press the ‘Predict’ button on the right-hand side panel, the prediction for each row will be placed in the next empty cell on the right, and the associated error will be appended in a second column. In this example the prediction was a number, but you can add predictions for categorical fields just as easily:

iris_post_predict

So how does the BigML add-on work behind the scenes? The add-on code is executed on Google Apps Script servers. The Google Apps Script code can connect to BigML and download your models to those servers (after validating your credentials in BigML). It can also interact with your Google Sheet. The BigML model you choose is downloaded to the Google Apps Script server environment, where the script runs each row in your selected range through the model and updates the cells in your sheet with the computed predictions. Thus, no data in your sheet has to reach BigML to obtain predictions; it stays on Google servers the whole time.  This video shows the basic steps for this and other examples dealing with categorical models or clusters.

Our add-on will be visible under the add-ons menu in Google Sheets as soon as the Google add-ons approval process is completed.  We will update this post accordingly, however if you want to be an early adopter just let us know today!

Google Cloud + BigML = Easier Machine Learning!


Attention Google power users: we have made a number of improvements to make BigML more compatible with Google services for your convenience.  Google Cloud is becoming the fastest-growing cloud provider, and we have been receiving requests from users all over the world, so we finally took the hint.

For starters, in addition to Amazon and GitHub, you can now log in to BigML with your Google ID. Click the Google option under the Login button and you will be authenticated right away, ready to start on your machine learning project.

Login Google

Since our aim is to make it super easy to upload your data to BigML regardless of your cloud provider, we have added both Google Drive and Google Storage support as well. Similar to our integrations with Azure Marketplace and Dropbox, connecting to your cloud storage takes only a few clicks, starting from the cloud icon located on the Sources tab.

Google Drive I

The first time you go through this flow, you will be asked to allow BigML to access your Google Drive or Storage, which automatically generates and displays an access token.

Google Drive II

The next time you want to access one of your data sources stored on Google, you can use the same menu on the Sources tab and it will bring up all your folders in a modal window, as shown below.

Google Drive III

Select the one you are interested in and it will be uploaded as a new source on BigML right away.  So you are off to the races with your machine learning project just like that.

Let us know how this works out for you. If you like it, please give a shout out to other fellow Google users, so they too can take advantage of it.
