
Predicting Box Office Performance

This is a guest post by Ravi Teja Kankanala, Senior Technical Lead at Xtream IT Labs (XIT Labs).

At XIT Labs, the innovation division of Xtream IT People Inc, we use machine learning and big data to more accurately predict whether any movie is likely to be successful at the box office upon its release. We have developed a tool which we call Trailer Momentum Score that makes use of BigML.

From Raw Data to Opening Weekend Prediction

We set out to build a model that would predict the opening weekend box office takings of any given movie from a raw dataset. Before using BigML, we tried applying other algorithms to the raw data. These certainly made improvements, but our predictions were still off too often. After being introduced to BigML, we found that not only was the prediction process greatly simplified, but the prediction error was also greatly reduced. While using BigML, we’ve been able to request fixes and improvements in the service from the support team. Francisco Martin, the CEO of BigML, made sure the issues we raised were immediately fixed so we were able to successfully move into production.


Data and Results

The data-gathering process began in September 2013, pulling from across various social media platforms and data feeds for the movie industry. Unfortunately, we cannot elaborate on the types of structured and unstructured data we collected and processed because of confidentiality agreements with our clients. The salient point is that we have used BigML machine-learned models to make more than 200,000 predictions so far, which generally had 90%+ accuracy.  These results helped the client to decide optimal timing and segmentation for marketing campaigns in order to improve their box office results.

How Has BigML Made a Difference?

BigML is very simple and intuitive. From the very first moment that we started using it, we were able to understand how we could leverage the tool. Its simple user interface makes it easy to construct models, which has made our data scientists’ lives much simpler. We are also currently using BigML for sentiment analysis for various retailers and food chains across the world. The end results using BigML speak for themselves.

July 30 Webinar: BigML + Tableau in action!

In the few months since we announced a collaboration and technical partnership with Tableau, we’ve heard from many users about how they’re benefiting from the ability to incorporate and visualize predictive models built using BigML within their Tableau environments.  We’ve also built some videos that show you exactly how easy it is to use this killer functionality.

On July 30 Tableau is hosting a webinar where an analyst from interactive firm AKQA will discuss how he’s used both BigML and Tableau to gain predictive insight for clients’ media and marketing campaigns.

Learn more and register here today!


Machine Learning in the classroom: Training the next generation of data analysts

The next generation of business analysts will be required to not just understand advanced analytics, but to utilize them as part of their job function. While this traditionally has been impossible due to the complexity and cost of advanced statistical packages, BigML now makes it very viable. And we’re very excited by the fact that many universities are bringing BigML into their classrooms as part of their curricula — thereby helping educate tomorrow’s business leaders on the power and benefit of predictive modeling.

Two such universities that are using BigML are the University of Georgia and the University of Maryland University College (UMUC).

Creating Analytics Bulldogs at University of Georgia

Dr. Hugh Watson is a Professor of MIS at University of Georgia’s Terry College of Business, and has a deep professional and academic history of working with statistics and analytics. Dr. Watson has taught various classes on Information Systems for over thirty years so has excellent perspective on how IT tools have evolved–as have the expectations of employers for matriculating students.

Starting with this year’s Spring Semester, Dr. Watson brought BigML into his undergraduate class on Business Intelligence. The typical incoming student has familiarity with information systems and data models, but likely hasn’t utilized machine learning and/or built advanced predictive models. Over the course of the semester, Dr. Watson used BigML along with related BI tools such as MicroStrategy and Tableau. For the BigML portion of the class, the UGA students walked through several use cases based on models and datasets published in BigML’s gallery, with the aim of learning how to build models and gleaning predictive insights from data. By the end of the class, Watson’s students were able to complete a project that required the students to build their own models and run predictions in BigML.

Draft quote from Hugh Watson: “A deep understanding of data systems and analysis is becoming an imperative in today’s job market, which is why MIS is a key focus area for our program at the Terry College of Business. BigML helped my students understand how historical data can be utilized to predict future outcomes–all in a fun and easy-to-understand workflow.”

BigML helping power Master’s Degree in Data Analytics at UMUC

Dr. Steve Knode is a professor at UMUC’s Graduate School who specializes in instruction in emerging technologies, decision support systems and artificial intelligence. UMUC offers a Master’s Degree in Data Analytics, which has seen a huge spike in interest over the past few years as more and more professionals in the workforce are seeing the benefit of bolstering their analytics skill sets so as to enhance their productivity and marketability.

Dr. Knode is using BigML as part of UMUC’s Decision Management Systems class with a focus on combining decision making with the use of analytics tools. BigML’s ease of use makes it a great fit for the course as students can spend time implementing decision models rather than trying to learn a new software package. For Dr. Knode’s class, students uploaded data that was generated from decision requirements models and quickly were able to perform a variety of predictive analyses. In future semesters Dr. Knode plans to leverage BigML’s newest features, including our cluster analysis functionality.

Draft quote from Steve Knode: “Students in UMUC’s Master’s Degree in Data Analytics are typically full-time professionals who want to learn and leverage the latest tools and methodologies for making data-driven decisions and analyses. BigML has been a hit with our students as it clearly shows them predictive relationships in data, and its ease-of-use is a huge benefit to students and instructors alike. I suspect that many of my students will utilize BigML in their jobs in the immediate future.”

Bring BigML into your classroom today!

Aside from these business-oriented programs, BigML’s API and wealth of underlying functionality make it ideal for the next generation of developers and computer scientists.

In addition to BigML’s ease of use for undergraduate and graduate students alike, BigML has the added benefit of being a hosted solution–meaning that students can simply open a web browser or command line shell rather than having to install software on disparate systems. BigML works with professors to grant their students complimentary subscriptions, and is also available to help out with appropriate courseware, lesson plans and syllabi.

If you’re interested in bringing BigML into your classroom, please contact us today!

The Missing V in Big Data for Healthcare

My name is Dean Hudson and I’m the President of EngageHi², a healthcare IT solutions provider and service delivery partner to BigML.

The phrase “Transforming Healthcare” looms as large as “Big Data” or “Business Intelligence” in the healthcare industry. For years the industry has been throwing around jargon and buzzwords to drive awareness, marketing and sales. We have all heard about the three V’s of Big Data; oops, wait, now there are four! When we talk about transforming healthcare by providing insight into data, more is involved than just volume, veracity, velocity, and variety. Let’s define what I feel the missing V really is: Value!


The true value of your data, the insight delivered from that data, and the evidence-based decisions derived from that data is what will transform healthcare!  Everyone seems to be focusing on the glitz and buzzwords but not the true building blocks. Those building blocks are the data quality, governance, provenance, and data model – along with other key enablers such as the EKW (enterprise knowledge warehouse). Once we have successfully ironed out the building blocks we can then move to a critical step that achieves true Value from data: the predictive model.

Critics in the marketplace have been less than enthusiastic as to the value of predictive modeling in healthcare. Some feel that it has minimal impact on improving quality of care, reducing costs, and improving population health. In my opinion, however, it will offer the greatest value and impact for improving outcomes and driving patient engagement. And if you look at what has occurred with IBM’s Watson and related innovations, I think it’s fair to say that the industry is finally beginning to gravitate towards the predictive and prescriptive as powered by machine learning. Let’s explore why…

Machine learning and predictive modeling will change how we build insight and relationships in healthcare. Predictive analytics is the use of artificial intelligence modeling, statistics, and pattern detection to sift through mounds of data, identify behavior patterns, and use those patterns to gain new insights. The barrier to adopting predictive technologies in healthcare has been cost and complexity – two things that BigML’s machine learning platform uniquely addresses. As someone who has been providing advanced IT solutions to the healthcare industry for 20+ years, I can state unequivocally that tools like BigML stand to revolutionize the way that health networks, hospitals and clinics run their businesses, and – most importantly – provide superior patient care.

EngageHi² is proud to partner with BigML to provide the real V (value!) to healthcare organizations of all shapes and sizes. We’re actively working on implementations that will leverage BigML’s intuitive workflow and rich visualizations for solutions ranging from patient diagnoses to readmission predictions to quality of care analysis. Healthcare will transform and outcomes will improve, but the industry requires solutions that accurately deliver the value, and not just buzzwords!

Predicting Startup Success

My name is Rahul Desai and I’m the CEO and co-founder of Trendify, a meta-startup that uses machine learning and big data to more reliably determine whether any given startup will succeed or not. I’d like to recount the Trendify story, and elaborate on where BigML fits in.


From News Analytics to Startup Prediction

At first, I wanted to create a news analytics platform that could predict stock activity through social sentiment analysis. Not long after, I gave up on it because of my non-tech background, a seemingly insurmountable obstacle. After getting a job at a local startup, Encore Alert, I sought mentorship from the CEO there. He knew what I was trying to do and sent me some interesting leads (iSentium social sentiment, and a neural network for stock-picking), at which point I realized people were doing stock prediction but no one was doing startup prediction. We focus on startup prediction because it’s a fairly open market and one where risk management tools have the potential for incredible impact. By helping both investors and entrepreneurs, we can bring to life some incredible technologies that otherwise might not see the light of day.

The Data

In February, I began gathering open data from various sources on the internet, building a set of 10,000 data points regarding 130 companies: founder, company, and funding data. Although I’d love to elaborate more on the types of data I’ve collected, we feel that we owe it to our clients and potential clients that we protect their privacy and the confidentiality of their data. However, in broad strokes, I can mention that we do an extremely thorough data collection that matches or even surpasses Bloomberg in its scope, drawing from news, social media, and business databases.

A Predictive Model of Startup Success

After collecting this data, I ended up building a model on my own; this is where I used BigML. They have an easy-to-use, gorgeous interface that’s also incredibly powerful. We created an ensemble using training data regarding 65 companies. After training and testing, this model was able to predict the eventual success of Dell, Beats, and Box, as well as the failure of Fisker, with only the data that would be available during the first few years of operation. That initial model was built just to prove that we can accomplish our mission.

In the near future, we intend to create an ensemble around our new dataset: 1,000,000+ data points spanning thousands of companies. At that scale, we can show statistical significance. The output of this model will be directly actionable for our clients, indicating success/failure with confidence levels, and offering a print-out of the most contentious factors that led to a particular decision. After our beta launch later this year, we’re going to integrate real-time analytics so that investors and entrepreneurs can monitor milestones’ effects on their companies. With easily usable platforms like BigML, companies like Trendify can be viable. Team Trendify is very thankful to BigML for helping us prove that it’s actually possible to do what we set out to, and we look forward to continuing our relationship.

Enhancing your Web Browsing Experience with Machine Learned Models (part I)

The other day, I was showing off the Kiva model from BigML’s gallery to my wife. I got the comment that while it’s super easy to do the predictions in the BigML dashboard, it would be even better if the model results could appear directly in the Kiva loan page, without needing to flip between browser tabs. This got me thinking: this sounds like a job for a browser extension. I had never created a browser extension before, and it struck me as an intriguing project. Turns out, injecting predictions into the Kiva webpages was a piece of cake, thanks to BigML’s feature-ful API and export capabilities. In this blog post, I’ll walk you through two versions of the BigML-powered browser extension: the first using actionable models, and the second using the BigML API to grab predictions from a live model.

The Vision

Kiva is a micro-financing website which connects lenders to individuals and groups in developing countries who are seeking loans to improve their quality of life. Historical records about the loans created throughout the site’s history can be accessed via Kiva’s REST-ful API. The vast majority of Kiva loans are successfully repaid, but using data gathered through the Kiva API, we can train a BigML model to identify the circumstances which are more likely to result in a defaulted loan. This model was discussed at length in a previous blog post. Loans which are currently in the fundraising stage can also be queried by the Kiva API, and the resulting response can be fed to the BigML model to give a prediction of whether the final status of the loan will be paid or defaulted. Our goal is to create a browser extension which runs while viewing the page for an individual loan or the list of loans, and which will insert some sort of indicator for the predicted outcome of the loan.

Laying the Groundwork

We’ll be creating a Chrome browser extension here, but the same code could easily be ported to a Greasemonkey script for Firefox. The Chrome developer guide can tell you all about building extensions, but hopefully I can communicate everything you need to know in this tutorial. Also, the source code for this project can be found on Github. Feel free to grab a copy and follow along.

Every Chrome extension starts with a JSON manifest file, which gives the browser some info about what kind of extension we have, and which files it will need access to. In our case, the manifest is pretty short and sweet. Here are its contents in their entirety:

The first four items are just metadata about the extension. The next item is the more interesting bit. Chrome extensions are split into multiple categories depending on their behavior. We want an extension that modifies the content of a webpage, so we need to create a content script. Within the definition of the content script, we give a regular expression to specify the URLs at which we want the script to run. This pattern matches both the pages for viewing individual loans, and for browsing a list of loans. Next we state the scripts and stylesheets which comprise the extension. Note that I’ve bundled jQuery with the extension, as I lean on it for DOM manipulation and AJAX calls. The last item in the manifest specifies that the extension will need to access some image files located in the given directory.

With the manifest squared away, we can move on to writing the script.

Making Predictions

One of the strengths of the classification and regression tree models created by BigML is that they can be readily represented by a series of nested if statements. This is precisely what we get when we export a model as an actionable model.
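To make the "nested if statements" idea concrete, here is a minimal sketch of what such a function looks like. It is written in Python rather than the node.js export used in this post, and the field names, split values and function name are invented for illustration; they are not taken from the Kiva model in the gallery.

```python
def predict_status(loan_amount, sector, repayment_term):
    """Toy decision tree as nested ifs.

    The splits below are made up for illustration only; a real exported
    model would contain the splits learned from the training data.
    """
    if loan_amount > 1000:
        if repayment_term > 12:
            return "defaulted"
        return "paid"
    if sector == "Retail":
        if repayment_term > 24:
            return "defaulted"
        return "paid"
    return "paid"


# Each call walks exactly one root-to-leaf path of the tree.
print(predict_status(1500, "Food", 18))
print(predict_status(500, "Retail", 6))
```

Because the whole model is just branching on input values, it needs no library and no network call at prediction time, which is what makes it easy to paste into a content script.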

We will be using the publicly available Kiva model that is in the BigML gallery (link). Selecting node.js as the export language gives us a JavaScript function which we can copy and paste, and call from within our content script. Here is the signature:

The actionable model accepts as parameters some data about a loan and returns the predicted status as a string. Our job now is to find the loan data that the model needs to do its thing. Kiva loans all have a unique ID number, so we’ll create a function which looks up a particular loan ID with the Kiva API, and uses the returned information to make a prediction, and create a status indicator which we will insert into the loan page’s DOM.

Finally, we need to differentiate the script’s behavior between the pages for viewing individual loans and those for browsing a list of loans. The URLs for individual loan pages end with the loan ID, which we can extract with a regular expression. In the list view, each item in the list contains a hyperlink to the individual loan page; we can grab that link with jQuery and again get the loan ID from the destination URL. From that point, it’s just a simple matter of calling predictStatus and inserting our indicator next to the “Lend” button.

Running the extension

To run the extension, you must first install it through Chrome’s extension configuration screen. The easiest way to get there is to navigate to chrome://extensions/. Once there, ensure that Developer mode is selected, then click “Load unpacked extension”, select the directory which contains your manifest.json,  and you’re good to go. If all goes according to plan, you will now see a green or red indicator icon beside every “Lend” button on


If you decide to do any tinkering with the extension’s source code, you will need to reload the extension to see the effect of your changes.

Coming up:  Using the BigML API

Using actionable models is arguably the easiest way to include BigML models in a browser extension, but having the model baked into the script can be inconvenient if the model is frequently changing. New loans are constantly being posted on Kiva, and so new data is always available through the Kiva API. With BigML’s multi-dataset capabilities, we can continuously refine our model with a growing body of training data. Keeping our browser extension up to date with our model building efforts would involve pasting in a new version of predictStatus every time we create a new model. In the next installment of this tutorial, I’ll show how we can use BigML’s REST-ful API to ensure that our extension is always using the freshest models. Stay tuned!

Instant Machine Learning for your Dropbox Files with BigML!

Dropbox has become the de facto mechanism to store and transfer large files. In fact, it’s amazing how many customers send us Dropbox links for the purpose of analyzing their data. Now in BigML you can do all of this in an integrated fashion by simply granting BigML permission to access your Dropbox files (this permission can be revoked at any time).

Once you’ve granted BigML access, you can browse your Dropbox account for files that you’d like to bring into BigML for machine learning analyses. BigML automatically only enables download of file types that we can process–so don’t worry about accidentally asking BigML to ingest your pictures, movies and presentations!

See how easy it is to tap into this new functionality through the video below:

We hope that you enjoy this new capability!

BigML Clusters in Action!

After a few weeks in Beta, we’ve now released BigML Clusters in final form into the BigML interface and also as part of our Python bindings—many thanks to all of you who gave feedback on this important new piece of BigML functionality!

Curious to see what you can do with BigML clusters? Check out our archived webinar where Poul Petersen gives a high-level overview of clustering concepts in general, and then digs deeper to walk through several use cases ranging from grouping similar tasting whiskies (Item Discovery) to identifying high-value mobile gaming users (Customer Segmentation) to disease diagnoses (Active Learning). The webinar showcases how to leverage clusters both in the BigML interface, as well as through our underlying API via an iPython Notebook.

You can find the YouTube video here, the SlideShare presentation here, and the iPython notebook (from the Active Learning use case) here.


Got more questions about clustering? Feel free to email us, or better yet, join one of our weekly Google Hangouts on Wednesday at 10:30 AM Pacific where we interact with BigML customers to answer all sorts of machine learning and predictive analytics questions.

The People Who Would Teach Machines to Learn

Have you ever had an idea about how the human mind works? Douglas Hofstadter has had that idea. He’s also thought of all of the arguments against it and all of the counter arguments to those arguments. He’s refined it, polished it, and if it was really special, it’s in one of his books, expressed with impeccable clarity and sparkling wit. If you haven’t attempted to read his book Gödel, Escher, Bach, try now. For sheer intellectual scope, it’s a singular experience.

A few months ago, there was an article in the Atlantic profiling Dr. Hofstadter and his approach to AI research. It was well written and I thought it gave unfamiliar readers a reasonable sense of Hofstadter’s work. It contrasted this work with machine learning research, however, in a way that minimized the scope and quality of the work being done in that area.

In this little riposte, I’ll present my case that the machine learning research community is doing work that is just as high-minded as Hofstadter’s when viewed from the proper angle.  One caveat: It may at times sound like I am unnecessarily minimizing Hofstadter’s work. I can assure you I have neither the inclination nor the intelligence to offer that or any such opinion. All I contend here is that his approach to the problem he’s trying to solve isn’t the only one, and there are reasons to believe the machine learning approach is also valid.


Everybody Loves A Good Story

The Atlantic story goes something like this: Dr. Hofstadter has a singular ambition, to build a computer that thinks like a human. This leads to a lot of speculation, research, and programming, but little in the way of what academics call “results”. He doesn’t publish papers or attend conferences. He is content to work with his small group of graduate students on the “big questions” in artificial intelligence, trying to replicate mind-level cognition in computers, and have his work largely ignored by the rest of the AI community.

Meanwhile, the rest of the AI community has enjoyed considerable scientific and commercial success, albeit on smaller questions. Techniques born in AI planning save billions of dollars and countless person-hours by automating logistics for many industries. Computer vision systems allow individual faces to be picked out from a database of tens of thousands. And machine learning has turned raw data into useful systems for speech recognition, autonomous vehicle navigation, medical diagnosis, and on and on.

But none of these systems come close to mimicking the human mind, especially in its ability to generalize. And here, some would say, we have settled for less. We have used techniques inspired by cognition to craft some nice solutions, but we’ve fallen short of the original goal: To make a system that thinks.

So Dr. Hofstadter, unimpressed, continues to work on the big questions in relative solitude, while the rest of AI research, lured by the siren’s song of easy success, attacks lesser problems.

It’s a nice, clean story:  The eccentric but brilliant loner versus the successful but small-minded establishment, and I suppose if you’re reading a magazine you’d prefer such a story to a messy reality. But while there’s certainly some truth to it, I don’t see it in nearly such black and white terms. Hofstadter’s way of working is to observe the human mind at work, then write a computer program that you can observe doing the same thing. This is, of course, a very direct approach, but I’m not convinced it’s the only or even necessarily the best one.

In the other corner, we have machine learning, frequently implied in the article as being one of the easier problems that Hofstadter avoids. Surely research into such a (relatively) simple problem would never contribute directly towards creating a human-level intelligence, would it?

I’m going to say that it just might. Here are three reasons why.

Accidental Intelligence

Remember where cognition came from: a system (evolution) did its best to produce a solution to the following optimization problem:

  1. Survive long enough to have as many children as possible
  2. (Optional) Assist your children in executing step 1

Out of this (this!) came a solution that, as a side effect, produced things like Brahms’ Fourth Symphony and the Riemann hypothesis.

With this in mind, it doesn’t seem unreasonable to think that a really marvelous solution to problems as general and useful as those in machine learning might lead to a system that exhibits some awfully interesting ancillary behaviors, one of which might even be something like creativity.

Am I saying that this will happen, or that I have any idea how it might? Absolutely not. But look at how the human mind was created in the first place, and tell me honestly that you would have seen it coming. The question is whether the set of problems that machine learning researchers tackle are interesting enough to inspire solutions with mind-level complexity.

There are suggestions in recent research that they might be. In the right hands, the process of solving “simple” machine learning problems can produce deep insights about how information in our world is structured and how it can be used. I’m thinking of, for example, Josh Tenenbaum and Tom Griffiths’ ideas about inductive bias, Andrew Ng’s work on efficient sparse coding, and Carlos Guestrin’s formulation of submodular optimizations. These ideas say fundamental things about many problems that used to require human intelligence to solve. Do these things have any relationship to human cognition? Given the solutions they’ve led to, it would certainly be hasty to dismiss the idea out of hand.

The Opacity of the Mind

Several years ago, a colleague and I were developing a machine learning solution that could figure out if a photograph was oriented right-side up or not.  After much algorithmic toying and feature engineering, we reached a point where I said, “Well, it works, but this is totally unrelated to the way people do it”.

My colleague’s reply was “How do you know? You’re imagining that you can keep track of everything that happens in your own mind, but you could just be inventing a story after the fact that explains it in a simpler way.  When you recognize that a photo is upside-down, it probably happens far too fast for you to recognize all of the steps that happened in between.”

There’s a fundamental problem in trying to model human cognition by observing cognitive processes, which is that the observations themselves are results of the process.  If we create a computer program consistent with those observations, have we modeled the actual underlying process, or just the process we used to make the observations?  Hofstadter himself admits the difficulty of the problem, and says that his understanding of his own mind’s operation is based on “clues” like linguistic errors and game-playing strategies.  As of now, our understanding of our own subconscious is vague at best.

Rather than trying to match observations that might be red herrings, why not use actual performance as the similarity metric? Machine learning algorithms can learn to recognize handwritten digits nearly as well as humans can and often make the same mistakes. Isn’t it at least possible that the inner workings of such an algorithm bear some relationship to human cognition, regardless of whether that relationship is easily observed?

To Learn is to Become Intelligent

Hofstadter, while working on the big questions, generally works on them in toy domains, like word games, trying to make his machines solve them like humans might. Machine learning research focuses on a more universal problem:  How to make sense out of large amounts of data with limited resources.

Consider the data that has been fed into the brain of a five-year-old. We can start with several years of high resolution stereo video and audio. Hours of speech. Sensory input from her nose, mouth, and hands. Perhaps hundreds of thousands of actions taken and their consequences. Using this data, a newborn who cannot understand language is transformed into an intelligence that can play basic word games. Can there be any doubt that life is at least in part a big data problem?

Moreover, consider how important learning is to our perception of intelligence. Suppose you had two computer systems that could answer questions in fluent Chinese (with a nod to John Searle). The first was built by a fleet of programmers and turned on yesterday. The second was programmed five years ago, had no initial ability to speak Chinese, but slowly learned how by interacting with hundreds of Chinese speakers. Which system would you say has more in common with the human mind?

Big Answers from Small Questions

Is the machine learning research community suffering from a lack of ambition? I don’t buy it. Just because the community isn’t setting out, in general, to build a human-level intelligence doesn’t mean they’re headed in a different direction. After all, the systems we’ve built so far can translate between languages, recognize your speech, face, and handwriting, pronounce words they have never seen, and do it all by learning from examples.

If the machine learning community does create a human-level intelligence, it may not be any one person who had the “aha!” idea that allowed it to happen. It might be more like Minsky envisioned in “Society of Mind”: not a single trick but a collection of specialized processes, intelligent only in combination.

If that sounds like a cop-out, like an excuse not to look for big answers, consider the path followed by children’s cancer research: Science has yet to find a single wonder drug or treatment that cures all childhood cancers, yet decades of piecemeal research on the parts of the problem has driven cure rates from about 10% to nearly 90%. A bunch of people working on smaller problems may not look like much in situ, but the final product is what matters most.

Follow your data’s inner voice! Evaluation-guided techniques for Machine Learning

Spring has come, and the steady work of gardening data is starting to bloom in BigML. We’ve repeatedly stressed in our blog posts the importance of listening to what data has to tell us through evaluations. A couple of months ago, we published a post explaining how you could achieve accuracy improvements in your models by carefully selecting subsets of features used to build them. In BigMLer we stick to our evaluations to show us the way, and we’d like to introduce you to the ready-to-use evaluation-guided techniques that we’ve recently added to BigMLer’s wide list of capabilities: smart feature selection and node threshold selection. Both are available via a new BigMLer subcommand, analyze, which gives access to these new features through two new command line options. Now, the power of evaluation-directed modeling starts in BigMLer, your command line ML tool of choice.

k-fold cross-validation

To measure the performance of the models used in both procedures, we have included k-fold cross-validation in BigMLer. In k-fold cross-validation, the training dataset is divided into k subsets of equal size. One of the subsets is reserved as holdout data to evaluate the model generated using the rest of the training data. The procedure is repeated for each of the k subsets and finally, the cross-validation result is computed as the average of the k evaluations generated. The syntax for the k-fold cross-validation in BigMLer is:

where dataset/536653050af5e86d9c01549e is the id of the training dataset (if you need help creating a dataset in BigML with BigMLer, you can read our previous posts).
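To make the averaging step concrete, here is a minimal sketch of k-fold cross-validation in plain Python. It does not use BigMLer or the BigML API: the `train` and `evaluate` callables, the interleaved fold assignment, and the toy majority-class model are all assumptions made for illustration.

```python
from statistics import mean


def k_fold_indices(n_rows, k):
    """Split row positions 0..n_rows-1 into k folds of (nearly) equal size."""
    return [list(range(start, n_rows, k)) for start in range(k)]


def cross_validate(rows, k, train, evaluate):
    """Average the k hold-out evaluations.

    train(rows) -> model and evaluate(model, rows) -> float are
    caller-supplied placeholders, not BigML calls.
    """
    scores = []
    for holdout in k_fold_indices(len(rows), k):
        held = set(holdout)
        # Train on every row outside the fold, evaluate on the fold itself.
        training = [row for i, row in enumerate(rows) if i not in held]
        test = [rows[i] for i in holdout]
        scores.append(evaluate(train(training), test))
    return mean(scores)


# Toy usage: a majority-class "model" scored by accuracy.
def train_majority(rows):
    labels = [label for _, label in rows]
    return max(set(labels), key=labels.count)


def fold_accuracy(model, rows):
    return sum(1 for _, label in rows if label == model) / len(rows)


data = [(i, "a") for i in range(8)] + [(i, "b") for i in range(8, 10)]
result = cross_validate(data, 5, train_majority, fold_accuracy)
print(result)
```

Each row is used for testing exactly once, so the averaged score depends less on one lucky or unlucky split than a single holdout evaluation would.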

Smart feature selection

Those who read our previous post on this matter will remember that, following the article by Ron Kohavi and George H. John, the idea was to find the subset of available features that will produce a model with better performance. The key thing here is being clever about the way of searching for this feature subset, because the number of possibilities grows exponentially with the number of features. Kohavi and John use an algorithm that finds a shorter path by starting with one-feature models and scoring their performance to select the best one. Then, new subsets are repeatedly created by adding each remaining feature to the best subset, and their scores are used to select again. The process ends when the score stops improving or you end up with the entire feature set. In that post, we implemented a command tool using BigML’s Python bindings to help you do that process using the accuracy (or r-squared for regressions) of BigML’s evaluations, but now BigMLer has been extended to include it as a regular option.
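The search described above can be sketched as a simple greedy loop. This is an illustration of the idea only, not BigMLer's implementation: `score` is a hypothetical callable standing in for the cross-validated evaluation of a model built on a given feature subset, and the weights in the toy example are made up.

```python
def forward_selection(features, score):
    """Greedy search: grow the best subset one feature at a time.

    score(subset) -> float is a caller-supplied placeholder for the
    evaluation of a model built with that subset of features.
    """
    best_subset, best_score = [], float("-inf")
    remaining = list(features)
    while remaining:
        # Score every subset obtained by adding one remaining feature.
        candidates = [(score(best_subset + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda c: c[0])
        if top_score <= best_score:
            break  # no candidate improves the score, so stop
        best_subset.append(top_feature)
        best_score = top_score
        remaining.remove(top_feature)
    return best_subset, best_score


# Toy score: each feature adds (or subtracts) a fixed, invented amount.
weights = {"a": 0.5, "b": 0.3, "c": -0.1}
subset, value = forward_selection(
    ["a", "b", "c"], lambda s: sum(weights[f] for f in s)
)
print(subset, value)
```

With n features this loop scores on the order of n² subsets in the worst case, instead of the 2ⁿ an exhaustive search would need.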

So let’s say you have a dataset in BigML and you want to select the dataset’s features that produce better models. You just type:
and the magic begins. BigMLer:

  • Creates a 5-fold split of the dataset to be used in evaluations.
  • Uses the smart feature selection algorithm to choose the subsets of features for model construction.
  • Creates the 5-fold evaluation of the models created with each subset of features and uses its accuracy to score the results, choosing the best subset.
  • Outputs the optimal feature subset and its accuracy.

You probably realize that this procedure generates a large number of resources in BigML. Each k-fold evaluation generates k datasets and, for each feature subset that is being tested, k more models and evaluations are created. Thus, you will probably want to tune the number of folds, as well as other parameters like the penalty per feature (used to avoid overfitting) or the number of iterations with no score improvement that causes the algorithm to stop. This is easy:

will use a penalty of 0.002 per feature and stop the search the third time that the score does not improve in 2-fold evaluations. You can even speed up the computation by parallelizing the k models (and evaluations) creation. Using

all models and evaluations will be created in parallel.
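To see how a per-feature penalty and a staleness limit interact, here is a hedged sketch in Python. How BigMLer combines them internally is not described in this post, so two things are assumed here: the penalty is subtracted once per feature in the subset, and staleness counts consecutive iterations without improvement. The names `penalized_search` and `raw_score`, and the toy gains, are hypothetical.

```python
def penalized_search(features, raw_score, penalty, staleness):
    """Forward search with a per-feature penalty and a staleness limit.

    Assumed scoring rule: raw_score(subset) - penalty * len(subset).
    Assumed stopping rule: `staleness` consecutive non-improving steps.
    """
    current, remaining = [], list(features)
    best_subset, best_score = [], float("-inf")
    stale = 0
    while remaining and stale < staleness:
        scored = [
            (raw_score(current + [f]) - penalty * (len(current) + 1), f)
            for f in remaining
        ]
        top_score, top_feature = max(scored, key=lambda c: c[0])
        # Keep expanding even when the score stalls, up to the limit.
        current = current + [top_feature]
        remaining.remove(top_feature)
        if top_score > best_score:
            best_subset, best_score, stale = list(current), top_score, 0
        else:
            stale += 1
    return best_subset, best_score


# Toy gains: "b" helps only marginally, "c" and "d" not at all.
gains = {"a": 0.6, "b": 0.001, "c": 0.0, "d": 0.0}
raw = lambda s: sum(gains[f] for f in s)

with_penalty = penalized_search(list(gains), raw, penalty=0.002, staleness=3)
no_penalty = penalized_search(list(gains), raw, penalty=0.0, staleness=3)
print(with_penalty, no_penalty)
```

In this toy run the penalty drops the marginal feature "b", because its 0.001 gain does not pay for the 0.002 cost of one more feature. Without the penalty the search keeps it.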

But sometimes good accuracy in evaluations will not lead you to a really good model for your Machine Learning problem. Such is the case for spam or anomaly detection problems, where you are interested in detecting the instances that correspond to very sparse classes. The trivial classifier that always predicts false for this class will have high accuracy in these cases, but of course won’t be of much help. That’s why it may sometimes be better to focus on precision, recall or composite metrics, such as phi or f-measure. BigMLer is ready to help here too:

will extract the subset of features that maximizes the evaluations’ recall.
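A small worked example shows why accuracy can mislead on sparse classes. The sketch below computes the standard metrics from a confusion matrix in plain Python. The dataset (10 spam messages out of 1,000) is invented for illustration, and treating an undefined precision or f-measure as 0.0 is an assumption of this sketch.

```python
def confusion_counts(actual, predicted, positive):
    """Return (tp, fp, fn, tn) for the given positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn


def metrics(actual, predicted, positive):
    """Accuracy, precision, recall and f-measure for one class.

    Ratios with a zero denominator are reported as 0.0 (an assumption
    of this sketch, not necessarily what BigML reports).
    """
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Invented data: 10 spam messages among 1,000, and a trivial
# classifier that never predicts "spam".
actual = ["spam"] * 10 + ["ham"] * 990
always_ham = ["ham"] * 1000
scores = metrics(actual, always_ham, positive="spam")
print(scores)
```

The trivial classifier reaches 99% accuracy while its recall for the spam class is zero, which is exactly the situation where maximizing recall or f-measure is the better guide.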

Node threshold selection

Another parameter that can be used to tune your models is their node threshold (that is, the maximum number of nodes in the final tree). Decision tree models usually grow a large number of nodes to fit the training data by maximizing some sort of information gain on each split. Sometimes, though, you may be interested in growing them until they maximize a different evaluation measure. BigMLer’s analyze subcommand can look for the node threshold that leads to the best score. The command will:

  • Generate a 5-fold split of a dataset to be used in evaluations
  • Grow models using node thresholds going from 3 (the minimum allowed value) to 2000, in steps of 100
  • Create the 5-fold evaluation of the models and use its accuracy as score
  • Stop when scores don’t improve or the node threshold limit is reached, and output the optimal node threshold and its accuracy

As shown in the previous section, the associated parameters, i.e. --k-folds, --min-nodes, --max-nodes and --nodes-step, can be configured to your liking. The --maximize option is also available to choose the evaluation metric that you prefer as score. The full-fledged version looks like this:

where you can choose the value of every parameter.
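The node threshold search listed above boils down to a stepped scan with early stopping. Here is a minimal sketch in Python, assuming a hypothetical `score` callable that stands in for the k-fold evaluation of a model grown with a given node threshold. The defaults mirror the values mentioned in the list (3, 2000 and steps of 100), and the toy score curve is invented.

```python
def node_threshold_search(score, min_nodes=3, max_nodes=2000, nodes_step=100):
    """Scan node thresholds and stop as soon as the score stops improving.

    score(node_threshold) -> float is a caller-supplied placeholder for
    the cross-validated evaluation of a model with that threshold.
    """
    best_threshold, best_score = None, float("-inf")
    for threshold in range(min_nodes, max_nodes + 1, nodes_step):
        current = score(threshold)
        if current <= best_score:
            break  # score did not improve: keep the previous best
        best_threshold, best_score = threshold, current
    return best_threshold, best_score


# Invented score curve that peaks at a threshold of 303 nodes.
def toy_score(threshold):
    return 1 - abs(threshold - 303) / 2000


best = node_threshold_search(toy_score)
print(best)
```

Because the scan stops at the first non-improving step, it can settle on a local optimum when the score curve is bumpy. A smaller step or a different starting point are the knobs to try in that case.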

These are the first evaluation-based tools available in BigMLer to tend to your data, but we plan to add some more to help you get the most out of your models. Spring has only just begun, and more of BigML’s buds are about to burst into blossoms, providing colorful new features for you and your data. Stay tuned!
