
BigML Summer 2017 Release and Webinar: Deepnets!

BigML’s Summer 2017 Release is here! Join us on Thursday October 5, 2017, at 10:00 AM PDT (Portland, Oregon. GMT -07:00) / 07:00 PM CEST (Valencia, Spain. GMT +02:00) for a FREE live webinar to discover the latest update of the BigML platform. We will be presenting Deepnets, a highly effective supervised learning method that solves classification and regression problems in a way that can match or exceed human performance, especially in domains where effective feature engineering is difficult.

Deepnets, the new resource that we bring to the BigML Dashboard, API and WhizzML, are an optimized version of Deep Neural Networks, the machine-learned models loosely inspired by the neural circuitry of the human brain. Deepnets are state-of-the-art in speech recognition, text classification, image classification, and object detection tasks, among other use cases. To avoid the difficult and time-consuming work of hand-tuning the algorithm, BigML’s unique implementation of Deep Neural Networks offers first-class support for automatic network topology search and parameter optimization. BigML makes it easier for you by searching over all possible networks for your dataset and returning the best network found to solve your problem.

As with any other supervised learning model, you need to evaluate the performance of your Deepnets to get an estimate of how good your model will be at making predictions for new data. To do this, prior to training your model, you will need to split your dataset into two different subsets (one for training and the other one for testing). When your Deepnets model is trained, you can use your pre-built test dataset to evaluate its performance and easily interpret the results with BigML’s evaluation comparison tool.

One of the main goals of any BigML resource is making predictions, and Deepnets are no exception. As Deepnets have more than one layer of nodes between the input and the output layers, the output is the network’s prediction: an array of per-class probabilities for classification problems, or a single, real value for regression problems. Moreover, BigML provides a prediction explanation whereby you can request a list of human-readable rules that explain why the network predicted a particular class or value.

Want to know more?

Find out how Deepnets work on Thursday October 5, 2017, at 10:00 AM PDT (Portland, Oregon. GMT -07:00) / 07:00 PM CEST (Valencia, Spain. GMT +02:00) in our live webinar. Be sure to reserve your FREE spot today as space is limited!

Sending off the BigML #Summer2017 Interns

During the summer of 2017, a group of interns joined the BigML Team. They came from different backgrounds, different countries, and we wanted to let them briefly share their experiences interning at BigML.

Barbara Martin Summer 2017 Intern

My name is Barbara Martin, and I had the opportunity to complete a 5-month internship at BigML, as a part of my 2nd year Engineer School program at ISIMA in France. I spent the first four months of my internship in Valencia, Spain working on several Machine Learning projects. For the last month of my internship, I went to Corvallis, USA to do a thrilling project with other interns. This topic suits my major in Computer Science, and also brought me to the interesting area of Machine Learning. During my internship, I gained a lot of knowledge and had a great chance to sharpen my skills in a professional working environment. I learned new technologies and had opportunities to practice my communication skills by giving presentations and having discussions with my supervisors, experts in the field, and the other staff within and outside BigML.

Jeremiah Lin Summer 2017 Intern

My name is Jeremiah Lin. I had the opportunity to work for BigML this summer as a marketing intern. I am going into my third year as an undergraduate at the University of Oregon with a major in Business Administration and a minor in Product Design. During my internship, I worked on translating and reviewing BigML resources that will be released in other languages. I also worked on a research project about how Machine Learning can be implemented to improve the efficiency of personalized marketing. I had little knowledge about Machine Learning going into my internship, but I have learned a lot about it throughout my time at BigML, and it was eye-opening to see its power to revolutionize the marketing world and many other industries. I loved working with the interns and other team members, and I am grateful for the supportive environment that we had at the office. This internship has been a great opportunity to learn about Machine Learning, but more importantly, to learn from the hard-working, team-oriented culture that is embedded in this company.

María Peña Summer 2017 Intern

My name is María Peña and I am a final-year student of Industrial Design at the UPV. During my internship at BigML, I have discovered and learned how Machine Learning can change our lives and help people to develop their companies. I have also had the opportunity to meet the BigML Team, which includes people from all over the world with the same dream working together. I do everything related to design and I usually spend my time on web design and graphic design, so thanks to BigML I have improved my skills in HTML and CSS. Therefore, working at BigML is the perfect experience to be part of the change and start your career in an amazing way.

Mohan Kumar Janapareddi Summer 2017 Intern

My name is Mohan Kumar Janapareddi, and I graduated from Ferris State University in the Information Security and Intelligence program. During my internship at BigML, I completed the BigML Certification Program, which helped me learn all the major concepts of Machine Learning in a short period of time. I also got to be a part of the interns’ group project. We worked on building a dynamic website where users can easily make predictions for selected features by simply uploading their data to this website. My portion of the project involved learning how to work with the Django web framework to build a website from scratch. Overall, working at BigML has been a great opportunity to strengthen my technical knowledge of Machine Learning and programming.

Ryan Alder Summer 2017 Intern

My name is Ryan Alder, and I worked at BigML as an intern for just over a month this summer. I am a second-year student at Oregon State University studying Computer Science. I joined BigML because I plan to write my thesis for the OSU Honors College on Machine Learning. I did not know much about Machine Learning when I first joined, and this internship has taught me a great deal regarding what actually happens behind the scenes, and how companies such as BigML are able to take a dataset and convert it accurately into predictions. During my internship, I became a Certified Engineer through the BigML Certification Program, and I worked on an application with other interns. My responsibility for this project was working on the front end, mainly the website. I incorporated HTML, CSS, and JavaScript to make a working prototype of the website. I had a great deal of fun working with the people here at BigML, and they taught me a significant amount as well.

INTERNSML

Interested in becoming a BigML intern next summer? Let us know at openings@bigml.com and we’d love to hear if you have project ideas. We are always looking for energetic, team-oriented, self-starters to join our team and help bring Machine Learning to everyone!

3rd Valencian Summer School in Machine Learning: More Graduates!

The third edition of our Valencian Summer School in Machine Learning broke our records across the board. A big thank you to all our attendees who made this event one for the books! Over 200 BigMLers from 14 countries representing 92 companies and 28 academic organizations gathered in Valencia for two fun, intense days of Machine Learning training.

VSSML17 Group Day 1

The Valencian Summer School gave BigMLers a quick and practical introduction to Machine Learning, all the while enjoying beautiful ocean views from the elegant Veles e Vents at the Valencian harbor. Not a bad way to dive into the world of Machine Learning, eh?

VSSML17 Networking

Based on our exchanges during the breaks and the BigML Genius Bar session, we’ve found that many businesses are very willing to incorporate Machine Learning in their operations as they see the value it can deliver on a continual basis. However, they are often struggling to formulate their problems in a way that can be addressed with a combination of ML tasks and algorithms. This style of new and more “scientific” thinking requires a bit of an out-of-the-box, critical perspective in the early stages before it becomes well-entrenched in the organization. This change in mindset is a big part of the commitment to create an ML-literate knowledge worker class.

During his keynote, perhaps not so coincidentally, Prof. Enrique Dans also chimed in on the same theme of commitment to go through an organizational transformation. Asking that the audience momentarily step away from the technological aspect of Machine Learning so as to step into the human aspect, he highlighted the need for an organization to “Unlearn” its old processes to make room for the kind of creative thinking that can’t otherwise flourish.

Enrique Dans Featured Talk

He went on to define the sociological concept of Institutional Isomorphism, which pertains to the emergence of familiar, copycat business processes in various industries due to external and internal forces (professional accreditation requirements, government regulations and plain old imitation of successful business practices). You may be asking why it makes sense to care about some esoteric social sciences concept all of a sudden. That’s because, with Machine Learning in the mix, managers finally have a much more evidence-driven, objective tool in their hands to break from the norms and make new rules that can fuel new, hard-to-replicate, sustainable competitive moats that have a compounding effect over time. Of course, success is not guaranteed, and it still takes a good amount of courage and a bolder risk-taking culture. With that said, we’ll increasingly witness that companies not willing to take such calculated risks will join their brethren in the dust bins of corporate history.

VSSML17 Classes

Don’t get the impression that it was all algorithms and intellectual musings. The BigML Team works hard and plays hard, so we also invited our attendees to join us for some fun activities before and after the lectures. The event packed in coffee breaks, cocktails, and plenty of other chances for the attendees to mingle with one another. Those who could survive the information binge were impressively able to make it to the 6:30 AM morning runs along the well-regarded Turia Gardens.

The BigML Team extends a huge “Thank You!” to our event collaborators at Veles e Vents, and our co-organizers at VIT Emprende, València Activa, and Ajuntament de València. VSSML would not be possible without your support. We look forward to breaking new records and bringing Machine Learning to everyone one event at a time…

The presentation slides from the course are now available on the VSSML17 event page as well as on BigML’s SlideShare account. Want to see more pictures from the event? Please check out our VSSML17 photo albums on Facebook and Google+.

 

On your Marks: Kicking off the 3rd Valencian Summer School in Machine Learning

The dates are set, the applications are in, the attendee list is finalized, travel plans are made, the curriculum is ready to rock-and-roll and so are we for this week’s VSSML17. At BigML, we believe in the power of education when it comes to widening the impact zone of Machine Learning across the global economy. As cliché as it sounds, Machine Learning is truly changing the world in front of our eyes, except it is doing so in a few corners of the planet, beige cubicles, and data centers hidden away from our everyday stomping grounds. So what to do to make it mainstream? Easy, just pack all the basics into a 2-day crash course and invite the whole world to it! That’s precisely what we’ll be doing later this week for the 3rd time in Valencia, Spain.

VSSML'17 Veles e Vents

Every year we find new reasons to get excited about what can be achieved by bringing people together from far flung places of the “3rd rock”, giving them a solid understanding of how they can set up a business problem, tame relevant data into a proper structure, and apply best-practices Machine Learning to arrive at unique and actionable insights that can be used not only to better interpret what happened in the past but also to predict the future.

A highlight of this year’s event will be a special talk given by Enrique Dans, BigML’s Strategic Advisor, prolific Spanish blogger, and IE Business School Professor. On September 14 at 06:00 PM CEST, Enrique Dans will explain how Machine Learning is transforming business organizations around the world. 

This year’s edition also holds a few changes to the format including:

  • The BigML Genius Bar: we will be setting up a separate area staffed with our resident experts, who will gladly tailor the discussion to your business needs and your use cases for 30 minutes.
  • Morning Runs: runners will meet before lectures start on Thursday 14th and Friday 15th at the Hemisfèric IMAX, in the City of Arts and Sciences of Valencia, Spain at 06:30 AM CEST. We will go for a pleasant 30-minute run along the Turia Gardens, one of the largest urban parks in Spain!
  • “Graduation” Cocktail: as the event wraps up, we will celebrate over some drinks close to the venue. This is a great opportunity to say goodbyes and provide feedback on ways to improve next year’s event.

We’re very excited to meet over 200 BigMLers this Thursday, who are coming from 14 countries and are representing 92 companies and 28 academic institutions.

 

50,000 Customers and Counting!

We’re thrilled to announce that in August we reached a new milestone of 50,000 registered customers on our multi-tenant platform thanks to the accelerating demand for practical Machine Learning worldwide. Back in 2011, we set out with the singular focus of making Machine Learning beautifully simple for everyone. In sharp contrast with when we launched the first version of our Machine Learning offering, creating an ML-literate professional class has finally become a business imperative across many industries. The growth prospects for BigML look great, so we fully expect to onboard the next 50,000 an order of magnitude faster since the platform is more complete, the learning tools are in place and the market demand is at an all-time high.

BigML Reaches 50K Registered Users

Best-in-Class Machine Learning Algorithms

The first version of BigML only featured decision trees as part of a very simple workflow that supported file imports and the ability to make form-based single predictions. Over time, BigML has evolved to support not only more algorithms but also multiple options for workflow automation, all the while abstracting infrastructure concerns away from the analytical end-user in a scalable manner. BigML’s current version supports highly optimized ensembles (bagging, random decision forest, boosted trees), logistic regression, cluster analysis, anomaly detection, association discovery, topic modeling, and the latest addition, time series forecasting. All of these were implemented from scratch in Clojure (including the Magnum Opus algorithms for association discovery), as opposed to gluing together disparate open source libraries, to avoid a fragmented and broken user experience.

Leading the Machine Learning Platform Market

Building a platform has proven very difficult and time-consuming even for those with access to talent and deep pockets. We’ve seen many technology companies attempt to crack the commercial Machine Learning tools opportunity over the years. However, the ambitious press releases and polished on-stage claims and presentations at developer conferences were seldom followed by complete, widely adopted, easy-to-use products that really gained traction in the market.

The smaller, distributed yet highly devoted BigML team’s unadulterated, “hype-free” best practices approach to Machine Learning is especially attractive to companies that prioritize the cost effective delivery of real-life custom Machine Learning solutions above all else. These trailblazing businesses possess proprietary data sources that are transformed into insights that improve operational efficiencies or enable brand new products or services. 

Notably, this is achieved by training their own knowledge workers on BigML instead of relying on expensive “hired guns” that may build working systems for isolated use cases, but fail to deliver the longer term transformational impact that Machine Learning promises. Despite little or no Machine Learning experience, BigML trainees reach a level of proficiency that allows them to make Machine Learning part of their everyday problem-solving skills. Over time, this approach introduces a compounding effect through many different predictive use cases that were initially overlooked.

In fact, this self-sufficient path built on top of a standardized Machine Learning framework is nothing new. Some of the leading corporations such as Facebook, Uber and AT&T have already heavily invested in their own platforms (FBLearner Flow, Michelangelo, and AT&T Machine Learning System, respectively) and achieved widespread Machine Learning usage that goes way beyond small teams of scientists or researchers. However, tens of thousands of other businesses can’t afford to follow the same strategy, which makes it imperative to evaluate alternatives such as BigML to get a head start.

Given the mandate to open Machine Learning to many more employees, BigML provides an ideal platform to initiate them by teaching the fundamentals of Machine Learning. BigML’s distributed framework is unique in offering Serverless Machine Learning that is accessible in the cloud, in a Virtual Private Cloud, or on-premises. Machine Learning is made easy through BigML’s:

  • Dashboard: intuitive web UI with interactive, easy-to-understand visualizations
  • API: full programmatic access for developers complete with bindings for popular languages such as Python as well as BigMLer, our command-line interface for the platform.
  • WhizzML: our Domain Specific Language handles more complex Machine Learning workflows and the creation of higher level algorithms
  • Tools: integration options with other platforms such as Google Sheets, Alexa, Mac OS X, and Zapier

BigML Education Programs

In addition to new features, our educational programs have also been engines accelerating the pick up in users in 2017. Our resident Machine Learning experts have put together many assets to help aspiring professionals start their learning journey in multiple modalities, e.g., Machine Learning Summer Schools, custom workshops, free subscriptions for active educators and students, online tutorials, partner webinars, and professional certifications. In continuing the trend, BigML’s 3rd edition of the Valencian Summer School in Machine Learning (September 14-15) was upgraded to a larger venue this year in anticipation of record demand.

50k BigML Users

We’re delighted to bring Machine Learning to more people every day, especially given that the BigML community includes everyone from big name brands to individual users around the world. The remarkably strong growth we have been experiencing further validates the importance of our mission to make Machine Learning accessible for everyone. Here’s to the next 50,000!

BigML Advisor Prof. Dr. Ramon López de Mántaras receives IJCAI Award

We’re proud to share that long-time BigML Advisor Ramon López de Mántaras, Ph.D., has been awarded the Donald E. Walker Distinguished Service Award at the 26th International Joint Conference on Artificial Intelligence (IJCAI-17) recently held in Melbourne, Australia.

Prof. Dr. Ramon Lopez de Mantaras

Ramon López de Mántaras (Ph.D. in Physics, University of Toulouse III, 1974) is the developer of one of the first expert systems in Spain and one of the earliest in Europe. He is currently serving as Research Professor of the Spanish National Research Council (CSIC) and Director of the Artificial Intelligence Research Institute (IIIA). Previously, he served as Editor-in-Chief of Artificial Intelligence Communications and President of the Board of Trustees of the International Joint Conferences on Artificial Intelligence. Additionally, he was the first European to receive the AAAI Robert S. Engelmore Award in 2011. Thanks to his deep perspective on the field of Machine Learning, his contributions have been material in shaping BigML’s overall vision and the evolution of our platform roadmap since the early years of our journey.

The IJCAI Distinguished Service Award was presented to Professor López de Mántaras by Michael Wooldridge, President of the IJCAI Board of Trustees, for his substantial contributions and service to the field of Artificial Intelligence throughout his career. This award was established in 1979 by the IJCAI Trustees to honor senior scientists in AI for contributions and service to the field during their careers. Dr. López de Mántaras joins a very remarkable field of past winners: Bernard Meltzer (1979), Arthur Samuel (1983), Donald Walker (1989), Woodrow Bledsoe (1991), Daniel G. Bobrow (1993), Wolfgang Bibel (1999), Barbara Grosz (2001), Alan Bundy (2003), Raj Reddy (2005), Ronald J. Brachman (2007), Luigia Carlucci Aiello (2009), Raymond C. Perrault (2011), Wolfgang Wahlster (2013), Anthony G. Cohn (2015), and Erik Sandewall (2016).

We look forward to further strengthening our collaboration with this source of world-changing innovation, and wish him continued success in his academic endeavors!

Using a Customized Cost Function to deal with Unbalanced Data

As pointed out in this KDnuggets article, it’s often the case that we only have a few examples of the thing that we want to predict in our data. The use cases are countless: only a small fraction of our website visitors eventually make a purchase, only a few of our transactions are fraudulent, etc. This is a real problem when using Machine Learning. That’s because the algorithms usually need many examples of each class to extract the general rules in your data, and the instances in minority classes can be discarded as noise, causing some useful rules to never be found.

Unbalanced Dataset WhizzML

The Kdnuggets article explained several techniques that can be used to address this problem. Almost all those techniques rely on resampling the data so that all the possible outcomes (or classes) are uniformly represented. However, the last suggested method takes a different approach: adapting the algorithm to the data by designing a function, which penalizes the more abundant classes and favors the less populated ones using a per-instance cost function.

In BigML, we already have solutions that can be applied out-of-the-box to balance your unbalanced datasets thus improving your classification models. The model configuration panel offers different options for this purpose:

  • a balance objective option, that will weight the instances according to the inverse of the frequency of their class.
  • an objective weight option, where you can associate the weight of your choice to each particular class in your objective field (the one to be predicted when classifying)
  • a weight field option, where you can use the contents of any field in your dataset to define the weight of each particular instance.

By using any of these options, you are telling the model how to compensate for the lack of instances in each class.

The first two options offer a way of increasing the importance that the model gives to the instances in the minority class in a uniform way. However, the article in Kdnuggets goes one step further and introduces the technique of using a cost function to also penalize the instances that lead to bad predictions. This means that we need to tell our model when it’s not performing well, either because it’s not finding the less common classes or because it’s failing in the prediction of any of its results. For starters, we can add to our dataset a new field containing the quantity to be used in penalizing (or increasing the importance of) each row according to our cost function. We can then check if the model results improve as we introduce this field as the weight field.

Fortunately, we have WhizzML, BigML’s domain-specific language that allows the creation of customized Machine Learning solutions. And it’s perfect for this task. So we’ll apply it to build a model that depends on a cost function and check whether it performed better than the models built from raw (or automatically balanced) data.

Scripting automatic balancing

The way to prove that balancing our instances improves our model is to evaluate its results and compare them to the ones you’d obtain from a model built on the original data. Therefore, we’ll start by splitting our original data into two datasets and keeping one of them to test the different models we’re going to compare. 80% of the data will form a training dataset that will be used to build the models, and we will hold out the remaining 20% to evaluate their performance. Doing this in WhizzML is a one-liner.

ds-split

The create-dataset-split procedure in WhizzML takes a dataset ID, a sample rate and a seed, which is simply a string value of our choice that will be used to randomly select the instances that go into the test dataset. Having the same seed will ensure that, even if the selection of instances is random, it will be deterministic and the same instances will be used every time you run the code.
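To illustrate why a fixed seed makes a random split reproducible, here is a minimal sketch in plain Python. It is not WhizzML and does not call the BigML API; the function name `split_dataset`, the hashing scheme, and the seed string are assumptions made up for this illustration, and BigML’s actual sampling mechanism may work differently.

```python
import hashlib


def split_dataset(rows, rate, seed):
    """Return (train, test), sending each row to train with probability ~rate.

    The decision for a row depends only on the seed string and the row's
    position, so the same seed always produces the same split.
    (Hypothetical helper for illustration, not a BigML function.)
    """
    train, test = [], []
    for index, row in enumerate(rows):
        digest = hashlib.sha256(f"{seed}:{index}".encode("utf-8")).digest()
        # Map the first 8 bytes of the hash to a number in [0, 1).
        draw = int.from_bytes(digest[:8], "big") / 2**64
        (train if draw < rate else test).append(row)
    return train, test


rows = list(range(1000))
train, test = split_dataset(rows, 0.8, "my-seed")
train_again, test_again = split_dataset(rows, 0.8, "my-seed")
print(train == train_again, len(train), len(test))
```

Running the split twice with the same seed yields identical training and test sets, which is what lets every model in the comparison be evaluated on exactly the same held-out instances.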

Once we have our separate training data, we can build a simple decision tree from it.

create-model

The model-options argument is a map that can contain any configuration option we want to set when we create the model. The first attempt creates the model by using default settings, so model-options is just an empty map. This gives us the baseline for the behavior of models with raw unbalanced data.

Then we evaluate how our models perform using the test dataset. This is very easy too:

 create-eval

The model-id variable contains the ID of any model we evaluate.

We’re interested in predicting a concrete class (when evaluating, we name this the positive class). If the dataset is unbalanced, the positive class is usually the minority class. In this case, the default model tends to perform poorly. As a first step to improve our basic model, we try to create another model that uses automatic balancing of instances. This method assigns a weight to each instance that is inversely proportional to the frequency of the class it belongs to. This assigns a constant higher weight to all instances of the minority class and a lower one for instances in the abundant classes. In WhizzML, you can easily activate this automatic balancing with model-options {"balance_objective" true}. Usually, for unbalanced data, this second model will give better evaluations than the unbalanced one. However, if the performance of this second model is still not good enough for our purpose we can further fine tune the contribution of each instance to the model as described before. Let’s see how.
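As a rough illustration of what weighting by inverse class frequency means, the following plain-Python sketch computes one constant weight per class. The function name `balance_weights` and the example labels are assumptions for this illustration only; the exact normalization BigML applies when `balance_objective` is enabled may differ.

```python
from collections import Counter


def balance_weights(labels):
    """Weight each class by total instances / instances of that class.

    (Hypothetical helper; one possible inverse-frequency scheme.)
    """
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / count for cls, count in counts.items()}


# Toy unbalanced objective field: 5 fraudulent rows out of 100.
labels = ["fraud"] * 5 + ["ok"] * 95
weights = balance_weights(labels)
print(weights)
```

Every instance of the minority class gets the same, larger weight (here 20 for "fraud" versus roughly 1.05 for "ok"), which is the uniform compensation described above.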

Scripting a cost function as a weight per instance

The idea here is that we want to improve our model’s performance, so besides assigning a higher weight to the instances of the minority class uniformly, we would like to be able to weight higher those instances that contribute to the model being correct when predicting. How can we do that?

Surely, the only way to assert a model’s correctness is evaluating it, so we need to evaluate our models again, but in this case, we don’t need the average measures, like accuracy, precision, etc. Instead we need to compare one by one the real value of the objective field against the value predicted by the model for each instance. Therefore, we will not create an evaluation, but a batch prediction.

create-batch

A batch prediction receives a model ID and a test dataset ID and runs all the instances of the test dataset through the model. The predictions, along with the confidence associated with them, can be stored together with the original test data in a new dataset. Thus, we’ll be able to compare the value in the original objective field column with the one in the predicted column. Instances whose values match should then receive more weight than instances that don’t.

At this stage, we’re ready to create a cost function that ensures:

  • instances of the minority class weigh in more than the rest
  • instances that are incorrectly predicted are penalized with a higher cost (so a lower weight in the model)

There’s room for imagination to create such a function. Your predictions can be right in two ways: when they correctly predict the positive class (TP = true positives) and when they correctly predict any other class (TN = true negatives). There are also two ways for the predictions to be wrong: instances that are predicted to be of the positive class and are not (FP = false positives), and instances of the positive class whose prediction fails (FN = false negatives). Each of these outcomes (TP, TN, FP, FN) has an associated cost-benefit.
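The four outcomes can be made concrete with a small plain-Python sketch that labels each row by comparing its real value with its predicted value. The function name `outcome` and the toy data are assumptions for illustration; in the workflow described here, the two columns would come from the batch prediction dataset.

```python
from collections import Counter


def outcome(actual, predicted, positive_class):
    """Label one instance as TP, FP, FN, or TN for the given positive class."""
    if predicted == positive_class:
        return "TP" if actual == positive_class else "FP"
    return "FN" if actual == positive_class else "TN"


# Toy real and predicted objective values, one pair per instance.
actual = ["fraud", "ok", "ok", "fraud", "ok"]
predicted = ["fraud", "ok", "fraud", "ok", "ok"]

outcomes = [outcome(a, p, "fraud") for a, p in zip(actual, predicted)]
tally = Counter(outcomes)
print(outcomes, dict(tally))
```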

Let’s assume your instance belongs to the class of interest and the model predicts it well. This is a TP and we should add weight to the instance. On the contrary, if it isn’t predicted correctly we should diminish its influence, which means for an FN the weight should be lower. The same happens with TN and FP. Following this approach, we could come up with a different formula for each of the TP, TN, FP, FN outcomes. To simplify, we set the weight as follows:

  • when the prediction is correct, its confidence is multiplied by the inverse frequency of the class (total number of instances in the dataset over the number of instances of the class).
  • when the prediction isn’t correct, the inverse of the confidence is multiplied by the frequency of the class.
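The two rules above can be written down as a short plain-Python function. This is a sketch under stated assumptions: the name `instance_weight` is hypothetical, "the class" is taken to be the class of the instance’s real objective value, and the confidence is assumed to be strictly positive. It is not the Flatline expression used in the script.

```python
def instance_weight(actual, predicted, confidence, class_count, total):
    """Weight for one instance, following the two rules in the list above.

    class_count: number of instances sharing this instance's real class
                 (assumption: the class meant is the real one).
    total:       number of instances in the dataset.
    """
    if confidence <= 0:
        raise ValueError("confidence must be positive")
    if actual == predicted:
        # Correct: confidence times the inverse class frequency.
        return confidence * total / class_count
    # Incorrect: inverse confidence times the class frequency.
    return (1.0 / confidence) * class_count / total


total = 100
class_counts = {"fraud": 5, "ok": 95}

w_correct = instance_weight("fraud", "fraud", 0.9, class_counts["fraud"], total)
w_wrong = instance_weight("fraud", "ok", 0.9, class_counts["fraud"], total)
print(w_correct, w_wrong)
```

With these toy numbers, a correctly predicted minority instance ends up with a weight of about 18, while a wrongly predicted one gets about 0.056, so confident correct predictions on rare classes dominate.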

If we create a dataset with a new column that has that weight, we can use the weight_field option when creating the model. Then, each instance is weighed differently during the model construction. Hopefully, this will improve our model. So let’s see if that’s indeed the case.

We start with the dataset obtained from our batch prediction, which contains the real objective value as well as the predicted value and confidence. We create a new dataset by adding a weight field, and that’s exactly what the following command does:

create-batch-ds

Using the new_fields attribute we define the name of our new column and its contents. The weight value should contain an expression that describes our weight function. To achieve this, we will use Flatline, BigML’s on-platform feature engineering language. The dataset we use is the batch prediction dataset, so it gets two additional columns: __prediction__ and __confidence__.

 flatline

We won’t discuss here the details of how to build this expression, but you can see that in each row we compare the value of the objective field (f {{objective-id}}) to the predicted value (f \"__prediction__\") and use the confidence of the prediction (f \"__confidence__\"), the total number of instances {{total}} and the instances in the objective field class {{class-inst}} to compute the weight.

Now we have a strategy to weight our instances, but there’s an important detail that we need to keep in mind. We can’t use the same test dataset that we’ll use to evaluate the performance of our models to compute the weights. Otherwise, we’d be leaking information to our model, which it can use to cheat rather than generalizing well to our problem. To prevent this while avoiding splitting out our data again, we use another technique: cross validation.

Using k-fold cross validation predictions to define weights

In case you aren’t familiar with the k-fold cross validation technique, it splits your training dataset into k parts. One of them is held out for testing and you build a model with the remaining k-1 parts. You do so with one different part at a time, so you end up with k models and k evaluations and all of your data is used for training or testing in some of the evaluations.

Applying the same idea here, you split your dataset into k parts. Hold out one of the parts as the test dataset and create a model with the rest. Then, use the model to create a batch prediction for the held-out part. The weights that we want to assign to each instance in the holdout can be computed from the result of this batch prediction. The process is repeated with a different holdout each time, so every instance is weighted and the models that create the predictions are built on data completely independent from any particular test set.
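The procedure can be sketched in plain Python. This is only an illustration of the idea, not the WhizzML script: the helper names are hypothetical, and the "model" is a toy stand-in that always predicts the majority class of its training rows.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and deal them into k folds of near-equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def out_of_fold_predictions(rows, k, train, predict, seed=0):
    """Predict every row with a model that never saw it during training."""
    predictions = [None] * len(rows)
    for holdout in k_fold_indices(len(rows), k, seed):
        held = set(holdout)
        training_rows = [row for i, row in enumerate(rows) if i not in held]
        model = train(training_rows)
        for i in holdout:
            predictions[i] = predict(model, rows[i])
    return predictions


# Toy stand-ins for the real model: always predict the majority class.
def train_majority(training_rows):
    labels = [label for _, label in training_rows]
    return max(set(labels), key=labels.count)


def predict_majority(model, row):
    return model


# Hypothetical data: 8 rows of class "a" and 4 rows of class "b".
rows = [(x, "a" if x % 3 else "b") for x in range(12)]
predictions = out_of_fold_predictions(rows, 4, train_majority, predict_majority)
```

Each row receives exactly one prediction, and that prediction comes from a model built on the other k-1 parts, which is what lets us compute a weight for every training instance.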

In BigML we already offer scripts to do k-fold cross validation, so we don’t need to code the entire cross validation algorithm all over again. We just need to generate another copy by tweaking the existing script.

new-script

The change involves the creation of batch predictions instead of evaluations at the end of the process, so we simply change the code from this

evaluations

to this

batch

where the changes are mostly related to the fields we want to see in our newly created datasets and their names.

This small change in the script provides the datasets that we need to apply our weight computation function to. So, let’s sum up our work.

  • We’ve split our original data into training and test datasets to compare the performance of different models.
  • We’ve used the training data to create one model with the raw data and another one with our instances automatically weighted to compensate for the class imbalance.
  • We have additionally divided the training data into k datasets and used the k-fold cross validation technique to generate predictions for all the instances therein. This process uses models built on data never used in the test procedure and also allows us to match the real value to the predicted result for each instance individually.
  • With this information, we added a new column to our training dataset that contains a weight that is applied to each instance when building the model. This weight is based on the values of the frequency of the objective field class that the instance belongs to and also on the evaluation of its predictions by using the k-fold cross validation.
  • The new weight column contains a different value per instance, which is used when the weighted model is built.
  • We finally used the test dataset to evaluate the three models: the default one, the automatically balanced one, and the one with a cost function guided weight field.

After testing with some unbalanced datasets, we achieved better performance using the weight field model than with either the raw or the automatically balanced ones. Even with our simple cost function, we’ve been able to positively guide our model and improve the predictions. Using WhizzML, we only needed to add a few lines of code to an already existing public script. Now it’s your time to give it a try and maybe customize the cost function to really make a difference in your objective functions’ gain vs. loss balance. You can access our script at https://gist.github.com/mmerce/cd87dc119bfbf6dcc4ef0c7d9be0bf1d and easily clone it into your account. Enjoy and let us know your results!

How to create a WhizzML Script – Part 3

In this third post about WhizzML basics, you’ll learn more about tools to create WhizzML scripts. We already covered how to manipulate WhizzML scripts from the Gallery. We also learned how to do the same via Scriptify and the Script Editor. For a quick reminder, go to the previous posts, How to create a WhizzML script – Part 1 and Part 2. In this tutorial, you’ll discover two new powerful scripting scenarios that will help you build complicated workflows: GitHub and BigMLer. Let’s dive in!

GitHub & WhizzML

Once you are in the scripts section, you can create a new script by using the editor, create a new script from an existing one, or (you guessed it) import one from GitHub.

git-whizz

It’s easy as pie! You just have to go to the WhizzML Script section and click on the ‘IMPORT SCRIPT FROM GITHUB’ button in the top right menu. Then you enter a WhizzML GitHub URL. You’ll see the script uploaded in your dashboard window, ready to be created when you click the ‘Create’ button. You may be wondering where you can find a WhizzML GitHub URL. No problem! There is a WhizzML example repository on GitHub with a lot of handy example scripts.

BigMLer

bigmler-whizzml

Using BigMLer (BigML’s command line tool), you can execute WhizzML scripts too. You just have to remember to use the subcommand execute. Let’s see some examples.

  1. In this first example, you will execute code written directly in the command line by using the --code option. The code has to be written between quotation marks. Below, we execute a really basic piece of code: adding two numbers. In this example, the output will be stored in the simple_exe directory, because we used the flag --output-dir. This directory will contain whizzml_result.txt, where you can easily read the output of your code.
    bigmler execute --code "(+ 1 2)" --output-dir simple_exe
  2. You can execute a script that you already have in your dashboard by using the script ID. For this, you have to use the flag --script:
    bigmler execute --script script/50a2bb64035d0706db000643
  3. If your script has inputs, you can pass them as a JSON file (my_inputs.json) by using the flag --inputs:
    bigmler execute --script script/50a2bb64035d0706db000643 \
                      --inputs my_inputs.json

    where the my_inputs.json file sets the script inputs: a with value 1 and b with value 2.

  4. In the following example of a BigMLer command, you will create your script but you won’t execute it, thanks to the --no-execute flag. In this script, you are declaring inputs and outputs and specifying them in the corresponding JSON files, with the flags --declare-inputs and --declare-outputs:
        bigmler execute --code "(define addition (+ a b))" \
                        --declare-inputs my_inputs_dec.json \
                        --declare-outputs my_outputs_dec.json \
                        --no-execute

    In this example, the my_inputs_dec.json file would declare the inputs a and b, and the my_outputs_dec.json file would declare the addition variable as an output, so its value would be returned in the execution results.
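As a rough guide to the JSON files mentioned in the examples above, the following Python sketch writes one possible version of each. The formats shown are assumptions (a list of name/value pairs for --inputs, and a list of objects with name and type for the declarations), as are the descriptions and default values, so check them against the BigMLer documentation before relying on them. The function name write_bigmler_json is hypothetical.

```python
import json
import os

# Assumed --inputs format: a list of [name, value] pairs.
INPUTS = [["a", 1], ["b", 2]]

# Assumed --declare-inputs format: one object per input variable.
INPUTS_DECLARATION = [
    {"name": "a", "type": "number", "default": 0,
     "description": "First number to add"},
    {"name": "b", "type": "number", "default": 0,
     "description": "Second number to add"},
]

# Assumed --declare-outputs format: one object per output variable.
OUTPUTS_DECLARATION = [{"name": "addition", "type": "number"}]


def write_bigmler_json(directory):
    """Write the three example JSON files and return their paths."""
    contents = {
        "my_inputs.json": INPUTS,
        "my_inputs_dec.json": INPUTS_DECLARATION,
        "my_outputs_dec.json": OUTPUTS_DECLARATION,
    }
    paths = []
    for filename, content in contents.items():
        path = os.path.join(directory, filename)
        with open(path, "w") as handle:
            json.dump(content, handle, indent=2)
        paths.append(path)
    return paths
```

Calling write_bigmler_json(".") creates the files in the current directory, next to where you would run the bigmler commands.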

So it isn’t rocket science. Now it’s your turn to try it! Of course, remember to install BigMLer and authenticate yourself before typing your BigMLer command lines.

Now you know how to create a WhizzML script in more ways. But we’re not quite finished yet! The next step will be discovering how to use the bindings to create scripts and libraries.  Stay tuned.

New Features for the BigML Predict App for Zapier

Thanks to the feedback provided by early adopters, our BigML app for Zapier has been acquiring new useful features, including improved support for additional ML algorithms and dynamic resource selection.

Support for new ML algorithms

The first version of our BigML Predict app only included support for models, ensembles, and logistic regressions. Now, you can also use clusters and anomaly detectors for your predictions and execute a WhizzML script, which can be great to automate more advanced use cases.

When you try to add an action from the BigML Predict app, you will now see a longer list of choices as shown below:

img-3

  • Create Prediction (legacy): if you haven’t used the BigML Predict app before, you can safely ignore the “Create Prediction (legacy)” action.
  • Execute WhizzML script: this action allows you to run a WhizzML script with a given set of arguments. Due to the way Zapier requires users to specify input fields, you will only be able to run WhizzML scripts that take “scalar” arguments.
  • Create Anomaly Score: computes the anomaly score associated with a data instance by using an anomaly detector.
  • Create Centroid: identifies the cluster that is closest to your input data instance.
  • Create Ensemble Prediction: uses an ensemble to make a prediction.
  • Create Prediction: uses a model or logistic regression to make a prediction.

When you include one of the actions listed above into your workflow, you will be given the chance to specify a few input fields:

  • the Resource ID: a simple value of the form resource-type/resource-id, e.g., ensemble/123456. The resource must exist in your BigML account; otherwise, the workflow execution will fail. You can either hard-code this value or use the “Find a resource” option to select the proper resource dynamically based on a number of criteria. This will be further detailed below.
  • Input Data: a list of values to be used as a prediction input. For each of them, you specify both the feature name and its actual value.
  • Additional input arguments: to specify how the prediction should be calculated. Allowed arguments will vary with the prediction algorithm you choose. For example, an ensemble prediction allows you to specify how to handle missing values, as well as what kind of combiner to use, etc.

img-4

Dynamically selecting ML resources

If you joined our beta program, chances are you’ve noticed that the biggest new feature in our BigML Predict app is the “Find a resource” search option. It simply lets you specify a number of search criteria to identify a resource to use for predictions.

img-1

This means you can, for example, specify a project name and a resource type to identify the most recent resource of that type belonging to the specified project. The result of this operation is a Resource ID that you can use in any subsequent step of your Zapier workflow to further manipulate that resource. The image below displays all the search criteria you can use.

img-2

At a bare minimum, to search for a resource, you should provide its type, e.g., anomaly detector, ensemble, etc. If you only specify a resource type, the “Find a resource” search will select your latest resource of that type among all of your resources. Alternatively, you can make your search more specific by also providing any of the following information:

  • Resource Name: the name of the resource you would like to select or a part of it.
  • Resource Tag: a tag associated with the required resource.
  • Project Name: the name of the project your resource should belong to. You can also specify only a part of the project name.
  • Project Tag: a tag associated with the project your resource should belong to.
  • Mode: either Production or Development mode. If you don’t specify anything, the “Find a resource” search will look into your production resources by default.

If your search criteria aren’t specific enough to identify just one resource, the most recent one will be used.
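The selection rules described above can be modeled with a short Python sketch. This is not the code behind the Zapier app, only an illustration of the described behavior: the Resource class, its fields, the find_resource function and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Resource:
    resource_id: str          # e.g. "ensemble/123456"
    name: str
    created: datetime
    tags: list = field(default_factory=list)
    project_name: str = ""
    project_tags: list = field(default_factory=list)
    development: bool = False

    @property
    def resource_type(self):
        return self.resource_id.split("/")[0]


def find_resource(resources, resource_type, name=None, tag=None,
                  project_name=None, project_tag=None, development=False):
    """Return the ID of the most recent resource matching every criterion."""
    matches = [
        r for r in resources
        if r.resource_type == resource_type
        and r.development == development          # production by default
        and (name is None or name in r.name)      # partial names allowed
        and (tag is None or tag in r.tags)
        and (project_name is None or project_name in r.project_name)
        and (project_tag is None or project_tag in r.project_tags)
    ]
    if not matches:
        return None
    # Several candidates: the most recent one wins.
    return max(matches, key=lambda r: r.created).resource_id


resources = [
    Resource("ensemble/1", "churn v1", datetime(2017, 1, 1), project_name="Sales"),
    Resource("ensemble/2", "churn v2", datetime(2017, 3, 1), project_name="Sales"),
    Resource("anomaly/3", "fraud", datetime(2017, 4, 1)),
    Resource("ensemble/4", "dev churn", datetime(2017, 5, 1), development=True),
]
latest_ensemble = find_resource(resources, "ensemble")
first_version = find_resource(resources, "ensemble", name="v1")
```

Here latest_ensemble resolves to the newest production ensemble, while adding a partial name narrows the search to a single resource.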

To effectively include a search step in one of your Zapier workflows, you should link the Resource ID field of the action that uses the resource (e.g., Create Prediction) to the output provided by the search step, as shown in the picture below.

img-5

Get access to the improved BigML Predict app

We hope that the new features we added to BigML Predict for Zapier will help you better use Zapier to solve your ML automation problems.

If you are interested in giving the new BigML Predict app a try, please get in touch with us at support@bigml.com.

A Stupidly Easy Speed Detector

My family and I recently moved into a new house in the center of our little college town. We love it, but our new location also puts us next to a busy residential street. All too often passersby would tear through well above the posted 25 mph limit. It brought out my latent grumpy side.

Being a data-oriented guy, I wanted to have some hard numbers for how many folks were speeding and I wasn’t going to spend hundreds of dollars on a radar system. So instead I threw together a web camera, some simple video processing, and anomaly detection to make a system for tracking vehicle speeds. The diagram below shows the process from a high level, but I’ll also dive into a few of the details.

workflow_v03

I’ve previously toyed with combining videos and BigML’s anomaly detection (an extended variety of isolation forests) as a way to do motion detection. By tiling a video and building an anomaly detector for each tile, I made a motion detector that could disregard common movement. For example, in this video the per-tile detectors don’t trigger for the oscillating fan, but they do for the nefarious panda interloper (on loan from my daughter).

To do this I extracted features for each tile, such as the average red, green, and blue pixel values (shout-out to OpenIMAJ for making this easy). This gave me tile-specific datasets where each row represents a video frame, like so:

Once the data is collected in row/column form, it’s easy to create an anomaly detector with the BigML language bindings. I do my hobby projects in Clojure, so the BigML Clojure bindings let me transform the data above into an anomaly detection function with only this snippet of code:

The tiny example above loads the data from the previous gist, builds an anomaly detector as a Clojure function, and uses that function to score a new point. Scores from BigML anomaly detectors are always between 0 and 1. The stranger a point, the larger the score. Generally scores greater than 0.6 are the interesting ones. The green highlighted tiles in the panda example represent scores above 0.65.

So I took this tiling+detectors approach and applied it to video of cars passing my house. My intuition was that while tracking cars can be tricky, learning the regular background should be easy. Then all I’d need is to track the clumps of anomalies which represented the cars.

Instead of tiling the entire video, I only tiled two rows. Each row captured a vehicle lane. I tracked the clumps of anomalies and timed how long it took them to sweep across the video. Those times let me estimate vehicle speeds.

By tracking clumps of anomalies the system is more robust to occasional misfires by individual tile detectors. Also, as expected, the detectors helped ignore common motion like the small tree swaying in the foreground.
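For readers who prefer code, here is a rough Python sketch of the two ideas: grouping adjacent anomalous tiles into clumps, and turning a sweep time into a speed. It is not the project's Clojure code. The function names, the sample scores and the 110-foot field of view are hypothetical, and the 0.65 threshold is simply borrowed from the panda example.

```python
def find_clumps(scores, threshold=0.65):
    """Group adjacent tiles whose anomaly score exceeds the threshold.

    Returns a list of (first_tile, last_tile) index pairs.
    """
    clumps, start = [], None
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i                     # a new clump begins here
        elif start is not None:
            clumps.append((start, i - 1))     # the clump ended on the previous tile
            start = None
    if start is not None:
        clumps.append((start, len(scores) - 1))
    return clumps


def speed_mph(distance_feet, seconds):
    """Convert a distance covered in a given time to miles per hour."""
    feet_per_second = distance_feet / seconds
    return feet_per_second * 3600 / 5280


# Hypothetical anomaly scores for one lane (one row of tiles) in one frame.
lane_scores = [0.3, 0.4, 0.7, 0.8, 0.72, 0.35, 0.3, 0.9, 0.2]
clumps = find_clumps(lane_scores)

# Hypothetical: the camera sees 110 feet of road and a clump crosses it in 3 s.
estimate = speed_mph(distance_feet=110, seconds=3.0)
```

Following a clump from frame to frame, rather than a single tile, is what keeps one misfiring detector from being counted as a vehicle.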

An approach like this is far from perfect. It can be confused by things like lane changes, bicycles, or tall trucks (which can register as two vehicles or occlude other cars).

truck

Nonetheless, I was pleasantly surprised how well it did given the simplicity. With occasional retraining of the detectors, it also handled shadows and shifting lighting conditions. In some cases it tracked vehicles even when I had a hard time finding them. There is, believe it or not, a car in this image:

shadow

So I had a passable vehicle-counter/speed-detector using a webcam. To culminate the project, I collected vehicle speeds over a typical Saturday afternoon. The results surprised me.

speed-bars

I expected speeders to be much more common than they actually were. In fact, the significant speeders (which I defined as 30+ mph) made up only about 3% of the total. So I’ve done my best to lose the grumpiness. Without the data, I’d just be one more victim of confirmation bias.

For the Clojure-friendly and curious, feel free to check out the project on GitHub.