BigMLer in da Cloud: Machine Learning made even easier
We have open-sourced BigMLer, a command-line tool that lets you create predictive models more easily than ever before.
BigMLer wraps BigML’s API Python bindings to offer a high-level command-line script to easily create and publish Datasets and Models, create Ensembles, make local Predictions from multiple models, and simplify many other machine learning tasks. BigMLer is open sourced under the Apache License, Version 2.0.
Let’s see a few examples. You can create a new predictive model with just:
bigmler --train data/customers.csv
If you check your dashboard at BigML, you will see a new Source, Dataset and Model. Just like magic!
You can also create a Model and generate Predictions for a test set in only one step:
bigmler --train data/customers.csv \
        --test data/new_customers.csv \
        --objective 'will churn in 3 months'
The example above will generate a prediction for each entry in the file indicated by the --test argument. A different objective field (the field that you want to predict) can be selected using the --objective argument. If you do not explicitly specify an objective field, BigMLer will default to the last column in your dataset. The predictions are computed locally. If you prefer to compute them remotely, add the --remote option and they will be computed in the cloud.
No programming involved: just issue a simple one-line command and you will know whether your new customers will churn soon or not. Even more magic!
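To see which field BigMLer would pick when no objective is given, here is a minimal Python sketch of the "last column" rule described above. It is an illustration only, not BigMLer's own code; the function name default_objective and the sample header are hypothetical.

```python
import csv
import io


def default_objective(csv_file):
    """Return the name of the last column in the CSV header row."""
    header = next(csv.reader(csv_file))
    return header[-1]


# Hypothetical in-memory sample standing in for a file like data/customers.csv.
sample = io.StringIO(
    "age,plan,will churn in 3 months,monthly spend\n"
    "34,basic,no,20.5\n"
)

# With this column order the default objective would be "monthly spend",
# which is why the example above names the objective field explicitly.
print(default_objective(sample))
```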
You can create Models using Remote Sources as well. You just need a valid URL that points to the data you want to model. BigML recognizes a growing list of protocols (http, https, s3, azure, odata, etc.) and formats (csv, arff, etc.). URLs that contain characters such as ?, & or $ should be quoted so that the shell does not interpret them. For example:
bigmler --train https://test:firstname.lastname@example.org/csv/iris.csv
bigmler --train 's3://bigml-test/csv/iris.csv?access-key=AKIAIF6IUYDYUQ7BALJQ&secret-key=XgrQV/hHBVymD75AhFOzveX4qz7DYrO6q8WsM6ny'
bigmler --train 'azure://csv/diabetes.csv?AccountName=bigmlpublic'
bigmler --train 'odata://api.datamarket.azure.com/www.bcn.cat/BCNOFFERING0005/v1/CARRegistration?$top=100'
Can you feel the power? You can create predictive models for huge amounts of data without using your local CPU, memory, disk or bandwidth. Welcome to the cloud!!!
Using existing Sources, Datasets, and Models
You don’t need to create a Model from scratch every time that you use BigMLer. For example, you can generate predictions for a test set using a previously generated model as follows:
bigmler --model model/50a1f43deabcb404d3000079 \
        --test data/new_customers.csv
You can also use several models by providing a file with one model/id per line:
bigmler --models models.ids \
        --test data/customers.csv
Or all the models that were tagged with a specific tag:
bigmler --model_tag customers2011 \
        --test data/new_customers.csv
When BigMLer uses multiple models to generate predictions, the predictions from classification trees are combined via “voting” (i.e., the predicted class will be the majority prediction among all the trees) and the predictions from regressions are combined via averaging.
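The two combination rules described above can be sketched in a few lines of Python. This is a minimal illustration, not BigMLer's actual implementation: the function names are hypothetical, and the tie-breaking behavior (the first class seen wins a tie) is an assumption of this sketch.

```python
from collections import Counter
from statistics import mean


def combine_classification(predictions):
    """Majority vote: return the class predicted by the most models.

    Ties go to the class that appears first in the list (an assumption
    of this sketch, not a documented BigMLer rule).
    """
    return Counter(predictions).most_common(1)[0][0]


def combine_regression(predictions):
    """Average the numeric predictions of all models."""
    return mean(predictions)


# Hypothetical predictions from three models for one test row.
print(combine_classification(["churn", "stay", "churn"]))
print(combine_regression([10.0, 12.0, 14.0]))
```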
You can also easily create ensembles (a compound of multiple models). For example, using bagging is as easy as:
bigmler --train data/customer.csv \
        --test data/new_customer.csv \
        --number_of_models 10 \
        --sample_rate 0.75 \
        --replacement \
        --tag my_ensemble
We recommend tagging resources when you create multiple models at the same time so that you can later retrieve them together to generate predictions locally using the multiple models feature from BigML’s API Python bindings.
To create a random decision forest you can use the --randomize option. The candidate fields are then randomized at each split, which helps you create multiple models that, used together, can predict better than the individual models do.
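For intuition, the sampling behind bagging and the per-split field randomization can be sketched as follows. This is a simplified Python illustration under stated assumptions, not how BigML builds its ensembles: the function names are hypothetical, and the number of candidate fields per split (k) is a free parameter here because the text does not say how many fields are drawn.

```python
import random


def bootstrap_sample(rows, sample_rate, replacement, rng):
    """Draw a training sample for one model of the ensemble."""
    size = int(len(rows) * sample_rate)
    if replacement:
        # With replacement: the same row may be drawn more than once.
        return [rng.choice(rows) for _ in range(size)]
    # Without replacement: every drawn row is distinct.
    return rng.sample(rows, size)


def candidate_fields(fields, k, rng):
    """Pick a random subset of k fields to consider at one split."""
    return rng.sample(fields, k)


rng = random.Random(42)        # fixed seed so the sketch is repeatable
rows = list(range(8))          # stand-in for the rows of a dataset
fields = ["age", "plan", "monthly spend", "region"]  # hypothetical fields

# One sample per model, mirroring --number_of_models 10 --sample_rate 0.75
samples = [bootstrap_sample(rows, 0.75, True, rng) for _ in range(10)]
print(samples[0])
print(candidate_fields(fields, 2, rng))
```

Each model sees a slightly different sample (and, with randomization, different candidate fields), which is what makes the combined prediction more robust than any single model.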
Making your Model Public
Creating a Model and making it public in BigML’s gallery is as easy as:
bigmler --train data/public_expenses.csv \
        --white_box
If you just want to share it as a black-box model you can use the --black_box option. If you want to add a price (i.e., what other users must pay to clone your model) you can use the --model_price argument. You can also change the number of credits per prediction (i.e., the credits that other users will consume to make a Prediction with your Model) using the --cpp argument. But remember to read our previous post about BigML’s marketplace so that you know the specifics of how pricing works.
Before making your model public, you probably want to add a name, a category, a description and some tags to your resources. This is easy, too. For example:
bigmler --train data/most_sold.csv \
        --name "Most sold products this month" \
        --category 7 \
        --description description.txt \
        --tag 'most sold'
What if your raw data isn’t in the format that BigML expects? We have good news there too: you can use a number of options to configure your sources, datasets and models. You can read the full documentation and see more examples at BigMLer’s GitHub repo, or you can use the --help option to get a listing of all the available options.
What’s the fuss about BigMLer?
BigMLer provides a higher-level interface to BigML’s API. BigMLer mixes local and distributed processing seamlessly and transparently. BigMLer allows you to:
- Create remote models accessible from everywhere with local data.
- Create local models with remote data.
- Create local predictions with remote models.
- Create local predictions from multiple remote models.
- Create multiple models in parallel without exhausting your local computational resources.
If you compare BigMLer against other Machine Learning services or packages you will soon see that in most of them you can either:
- Do everything remotely (e.g., Google Prediction API). This implies higher prices, higher latency, and black-box models; or
- Do everything locally (e.g., Weka). This implies installation and configuration issues, exhausted local resources, and results that you cannot access from everywhere.
BigMLer offers you the best of both worlds: the power of cloud-based applications combined with the freedom and low latency of local resources.
How to get BigMLer?
You can fork it on GitHub or you can directly install the latest stable release with pip as follows:
pip install bigmler
Python 2.6 and Python 2.7 are currently supported by BigMLer. BigMLer requires bigml 0.4.7 or higher.
BigMLer will look for your username and API key in the environment variables BIGML_USERNAME and BIGML_API_KEY, respectively, or you can pass them using the --username and --api_key options.
You can also instruct BigMLer to work in BigML’s Sandbox environment by using the --dev option.
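The credential lookup can be pictured with a short Python sketch. It is only an illustration: the function name resolve_credentials is hypothetical, and letting explicitly passed values take precedence over the environment variables is an assumption of this sketch rather than documented BigMLer behavior.

```python
import os


def resolve_credentials(username=None, api_key=None, environ=os.environ):
    """Return (username, api_key), falling back to environment variables.

    Explicit arguments win over the environment (an assumption of this
    sketch). A ValueError is raised if either value is still missing.
    """
    username = username or environ.get("BIGML_USERNAME")
    api_key = api_key or environ.get("BIGML_API_KEY")
    if not username or not api_key:
        raise ValueError(
            "Set BIGML_USERNAME and BIGML_API_KEY or pass both values explicitly"
        )
    return username, api_key


# Hypothetical environment, used here instead of the real os.environ.
fake_env = {"BIGML_USERNAME": "alice", "BIGML_API_KEY": "hypothetical-key"}
print(resolve_credentials(environ=fake_env))
```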
BigML is not only the easiest and fastest way to embrace data-driven decisions and get actionable predictive models from your data, it also comes with some unmatchable perks:
- Creating a BigML account is free. You don’t need to subscribe or even give your credit card info.
- You get a few free megabytes to train your first models and check if you like the service.
- You can use a development mode where everything under 1MB is free.
- Your predictive models are downloadable and exportable to many languages. You can run them in your own systems or applications to make local predictions.
- You can share your models with the rest of the world or even make money with them using BigML’s gallery.
Please feel free to contact us and let us know what else BigML should do to help you benefit from your data.