Building Information Modeling (BIM): Machine Learning for the Construction Industry

This guest post was authored by David Martínez, CEO at Ibim Building Twice S.L., and Pedro Núñez, R&D&I Manager at Ibim.

Building Information Modeling (BIM) is revolutionizing the construction industry. Unlike the data generated by computer-aided design (CAD), which represent flat shapes or volumes and 2D drawings consisting of lines, BIM data represent the reality of the built structure. This new way of digitizing the real world is superior in operational terms, and the structure of its data is ideal for analytical purposes and the application of Machine Learning techniques.

BigML enables BIM consultancies, Project Management Offices (PMO), construction companies, and developers to apply Machine Learning to BIM (even experimentally). Its user-friendly platform makes modeling possible without any in-depth knowledge of Machine Learning and enables previously unimaginable automated processes and knowledge.

BIM model example.

Building Information Modeling uses data organized in a similar way to a database to create digital representations of real-life structures. BIM includes the geometry of the building, its spatial relationships and geographic information, and also the quantities and properties of its components. This information can be used to generate drawings and schedules that express the data in different ways.

BIM model example.

The possibilities of applying Machine Learning techniques to BIM are countless. Classification algorithms, anomaly detection, and even time series analysis can be used with BIM. It is worth mentioning that BIM data are used throughout the lifespan of a building (i.e., during the design, construction, and maintenance phases) and can even include real-life sensor data. For example, classification algorithms can combine data from many buildings, such as the characteristics and locations of the flats, to predict how well they might sell, or even the likelihood of construction delays. Anomaly detection, on the other hand, is very useful for pinpointing modeling errors, and time series analysis can be applied to real-time data to make better maintenance predictions.

More specifically, Ibim Building Twice S.L. has conducted research into how the use of a room in a flat can be predicted based on its geometry and other BIM data. The findings are so remarkable that the company has decided to publish them as a contribution to the digitization of the construction industry. The different types of rooms in BIM are usually labeled entirely by hand by the expert modeler. The use of Machine Learning algorithms to automate this type of task could reduce the necessary time and outlay considerably. The experiment was based on data about residential buildings in BIM generated with Autodesk Revit®. The data about the rooms in the flats were extracted and re-processed using data schedules plus C# programming with the Revit API.
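To give a feel for the inputs, each extracted room might end up as a flat record like the one below, with the room's use as the label to predict. The field names here are purely illustrative, not Ibim's actual schema:

```python
# Hypothetical per-room record extracted from the BIM model; "use" is the
# objective field that the classifiers learn to predict.
room = {
    "area_m2": 12.4,            # room area
    "perimeter_m": 14.2,        # room perimeter
    "n_doors": 1,               # number of doors
    "n_windows": 1,             # number of windows
    "pct_of_flat_area": 0.16,   # share of the whole flat's area
    "use": "bedroom",           # label assigned by the expert modeler
}
```

A table of such records, one row per room, is exactly the kind of structured dataset the classification algorithms below can be trained on.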

Model of flat using Revit. Left: names of rooms suggested by the logistic regression algorithm. Right: final names assigned by additional programming.

The extracted data were used as source data in BigML, which we first explored with dynamic scatterplots:

Graphs of rooms according to area of rooms or housing unit and hierarchy / quadrature.

Later on, we created several structured data sets for training decision trees, logistic regressions, and deepnets, all of which are classification algorithms.

BigML makes it possible to measure the performance of each model easily. Although all three algorithms were used to solve the same problem (i.e., labeling rooms according to their function on the basis of their geometry and other data), the accuracy and suitability of the algorithms may vary considerably depending on the problem at hand, so it is advisable to evaluate them all in order to determine which one yields the best predictions.

In our experiment, the top models were about 90% accurate in predicting room use. Those were evaluated against data obtained from different architects and buildings, suggesting quite a promising technique for use in production. The findings of the study were presented at the EUBIM 2018 congress held in Valencia, on May 17-19, 2018. For more details, please watch the video of the presentation and check the corresponding slideshow and original article in English and Spanish that include full details of the experiment.

The 4th Valencian Summer School in Machine Learning is Open for Enrollment

We are excited about our upcoming Summer School in Machine Learning 2018, the fourth edition of this international event. Hundreds of decision makers, industry practitioners, developers, and curious minds will delve into key Machine Learning concepts and techniques they need to master to join the data revolution. All of this will take place on September 13-14 in a great location, La NAU Cultural Center, one of the most beautiful and historic buildings of the University of Valencia.

The VSSML18 aims to cover a wide spectrum of needs as BigML’s main focus is to make Machine Learning beautiful and simple for everyone. Regardless of your prior Machine Learning experience, with this two-day course you will be able to:

  • Learn the foundational ideas behind Machine Learning theory with Master Classes that emphasize putting them into practice in your business or project.
  • Choose your preferred option between two parallel sessions: Machine Learning for business users or for developers. With these options, the VSSML18 can serve a diverse audience while providing customized content. Check out the full schedule for more details.
  • Practice the concepts learned during the course on the BigML platform via hands-on workshops. We recommend that you bring your laptop to create your own Machine Learning projects and start applying Machine Learning best practices to find valuable insights in your data. Only a browser is required.
  • Understand how Machine Learning is currently being applied in several industries with real-world use cases. To provide a complete curriculum, in addition to the theoretical and hands-on parts, it’s important to find out how real companies are benefitting from Machine Learning. This year, we will see how Barrabés.Biz uses BigML to evaluate ICOs, how CleverData.io works on a predictive maintenance problem, and how Bankable Frontier Associates uses Machine Learning for social good, among other use cases.
  • Discuss your project ideas with the BigML Team members at the Genius Bar. We are happy to help you with your detailed questions about your business or projects. You can contact us ahead of time at vssml18@bigml.com to book your 30-minute slot with a designated BigML expert.
  • Enhance your business network. International networking is the intangible benefit at the VSSML18. Join the multinational audience representing 13 countries so far, including Spain, Portugal, Italy, Germany, Austria, Belgium, Netherlands, United Kingdom, Russian Federation, Turkey, India, United States, and Canada.
  • Stay fit during the event with our morning runs! Before the event starts we will go for a 30-minute morning run along the Turia Gardens, one of the largest urban parks in Spain. The meeting point on Thursday 13 and Friday 14 will be at the main entrance of the venue, La Nau Cultural Center, at 06:30 AM CEST. We are counting on you to join!

As preparations are being wrapped up, please check the VSSML18 page for more details on the hotels we recommend for your stay in Valencia, in case you come from outside the city. APPLY TODAY, and reserve one of our spots before we reach full capacity!

The Fusions Webinar Video is Here: Improve your Model Performance Through Algorithmic Diversity!

The new BigML release brings Fusions to our Machine Learning platform, the new modeling capability that combines multiple models to achieve better results. A Fusion combines different supervised models (models, ensembles, logistic regressions, and deepnets) and aggregates their predictions to balance out the individual weaknesses of single models.

As of yesterday, July 12, 2018, Fusions are available from the BigML Dashboard, API, and WhizzML, and they follow the same principle as ensembles, where the combination of multiple models often provides better performance than any of the individual components. All these details, along with the new and more complete text analysis options, are explained in the official launch webinar. You can watch it anytime on the BigML YouTube channel.

For further learning on Fusions and other new features, please visit our release page, where you will find:

  • The slides used during the webinar.
  • The detailed documentation to learn how to use Fusions with the BigML Dashboard and the BigML API.
  • The series of six blog posts that gradually explain Fusions to give you a detailed perspective of what’s behind this new capability. We start with an introductory post that explains the basic concepts, followed by several use cases to understand how to put Fusions to use, and then three more posts on how to use Fusions through the BigML Dashboard, API, WhizzML and Python Bindings. Finally, we complete this series with a technical view of how Fusions work behind the scenes.
  • An extra section with a blog post and documentation on the new text analysis enhancements released.

Thanks for watching the webinar, and for your support and nice feedback! For more queries or comments, please contact the BigML Team at support@bigml.com.

Text Analysis Enhancements: 22 Languages and More Pre-processing Options!

We’re happy to share new options to automatically analyze the text in your data. BigML has long supported text fields as inputs for your supervised and unsupervised models, pre-processing your text data in preparation for Machine Learning. Now these text options have been extended, and new ones have been added, to further streamline your text analysis and enhance your models’ performance.

  • BigML supports 15 new languages. The total number has increased from 7 to 22 languages! Now you can upload text fields in Arabic, Catalan, Chinese, Czech, Danish, Dutch, English, Farsi/Persian, Finnish, French, German, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, or Turkish. BigML will auto-detect the language or languages in your data (in case your dataset contains several languages in different fields). The detected language is key for the text analysis because it determines the tokenization, the stop word removal, and the stemming applied.
  • Extended stop word removal options: you can now opt to remove stop words for the detected language or for all languages. This option is very useful when you have several languages mixed in the same field; for example, social media posts or public reviews are often written in several languages. Another related enhancement lets you decide how aggressive the stop word removal should be: light, normal, or aggressive. Depending on your main goal, some stop words may be useful; e.g., the words “yes” and “no” may be interesting since they express affirmative and negative opinions. A lower degree of aggressiveness will keep some useful stop words in your model vocabulary.
  • One of the greatest new additions to the text options is n-grams! Although you could already choose bigrams before, we’ve extended the option so you can now include bigrams, trigrams, four-grams, and five-grams in your text analysis. Moreover, you can also exclude unigrams from your text and base the analysis only on larger n-grams (see the filter for single tokens below).
  • Lastly, a number of optional filters to exclude uninteresting words and symbols from your model vocabulary have been added since specific words can introduce noise into your models depending on your context.
    • Non-dictionary words (e.g., words like “thx” or “cdm”)
    • Non-language characters (e.g., if your language is Russian, all the non-Cyrillic characters will be excluded)
    • HTML keywords (e.g., href, no_margin)
    • Numeric digits (all numbers containing digits from 0 to 9)
    • Single tokens (i.e., unigrams, only n-grams of size 2 or more will be considered if selected)
    • Specific terms (you can input any string and it will be excluded from the text)
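As a rough sketch of the n-gram options above (plain Python, not BigML's internal implementation): build all n-grams of sizes 1 to 5 from a tokenized text, where filtering single tokens simply means keeping only sizes 2 and up.

```python
# Build all n-grams of sizes min_n..max_n from an already-tokenized text.
def ngrams(tokens, min_n=1, max_n=5):
    return [" ".join(tokens[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(tokens) - n + 1)]

tokens = ["not", "so", "great"]
ngrams(tokens)           # unigrams through trigrams (4- and 5-grams don't fit)
ngrams(tokens, min_n=2)  # the "single tokens" filter: only n-grams of size 2+
```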

new_text_options.png

You can set up these options by configuring two different resources: sources and/or topic models. By configuring the source and then creating a dataset, you propagate the text analysis configuration to all the models (supervised or unsupervised) that you create from that dataset. Hence, an ensemble or an anomaly detector trained with this dataset will use a vocabulary shaped by the options configured at the source level. Topic models are the only resources for which you can re-configure these text options. This is because the topic model results are greatly impacted by the text pre-processing so BigML provides a more straightforward way to iterate on such models so you don’t need to go back to the source step each time.

Let’s illustrate how these new options work using the Hillary Clinton e-mails dataset from Kaggle. The main goal is to discover the different topics in these e-mails without having to read them all. For this, we will create a topic model from this dataset. We assume you know how topic models work in BigML; otherwise, please watch this video.

We’re only using two fields from the full Kaggle dataset (“ExtractedSubject” and “ExtractedBodyText”) to create the topic model. First, we create the topic model with the BigML 1-click option which uses the default configuration for all the parameters.

1-click-topic-model

When the model is created, we can inspect the different topics by using the BigML visualization for topic models. You can see that we have some relevant topics like topic 36, which is about economic issues in Asia (mostly China). But most of the topics, even if they contain relevant terms, are also mixed with lots of frequent words, numbers, and acronyms (for example “fw”, “fyi”, “dprk”, “01”, “re”, “iii”, etc.) that don’t tell us much about the real content of the e-mails.

topic-model-1-click

Let’s try to improve those results by configuring the text options offered by BigML. We can observe in our first model that there were some stop words that we don’t really care about, such as “ok” or “yes”. Therefore, we set the stop word removal to “Aggressive” this time. We also had many terms and numbers that tell us nothing about the e-mail themes, such as “09”, “iii”, or “re”. To exclude those terms from our analysis, we’ll use the non-dictionary words and numeric digits filters. Finally, in order to get more context, we’ll also include bigrams, trigrams, four-grams, and five-grams in our topic model.

topic-model-configuration.png

So we create the new topic model and… voilà! In a couple of clicks, we have a much more insightful model with more meaningful topics that help us better interpret the content of the underlying e-mails.

You can see that most of the meaningless words have disappeared and the terms within the topics are much more thematically related. For example, now we have five topics that contain the word “president” and cover five different themes: European politics, the current US government, US elections, US politics, and Iran politics. In the model we built before, minority themes like Iran politics didn’t get a topic of their own, as they were mixed into other topics, while more frequent (but meaningless) words had topics of their own.

topic-model-filters.png

We may clean this model even further, and filter uninteresting words like “pm”, “am”, “fm”, etc. However, we feel satisfied enough with these topics and we prefer to spend the time creating a new model with a totally different approach.

Sometimes, the meaning of a single word can change if you look at the terms around it. For example, “great” may be a positive word, but “not so great” indicates a bad experience. We can make this kind of analysis by using BigML n-grams while also excluding unigrams from the topic model. The resulting model only includes topics that contain bigrams, trigrams, four-grams, and five-grams. All topics show well-delimited themes that may differ slightly from the topics we obtained before. For example, the topic about English politics was too broad before, as it was mixed with European politics; now it has two topics of its own.

topic-model-ngrams.png

Ok, topic models and the new text options on BigML are great, but what is the main goal of all this? We could use these topics for many purposes. For example, to analyze the most mentioned topics in Hillary Clinton e-mails by calculating per-topic distributions for each e-mail (very easy with BigML’s topic distribution feature). Moreover, you could use the per-topic distributions as inputs to build an Association and see which topics are more strongly correlated. In summary, when you have a topic model created and the per-topic distributions calculated, you can use them as inputs for any supervised or unsupervised models.

Thanks for reading! As usual, we look forward to hearing your feedback, comments, or questions. Feel free to send them to support@bigml.com. To learn more, please visit this page.

To Fuse or Not To Fuse Models?

The idea of model fusions is pretty simple: You combine the predictions of a bunch of separate classifiers into a single, uber-classifier prediction, in theory, better than the predictions of its individual constituents.
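The aggregation step can be sketched in a few lines of plain Python. This is an illustration of simple probability averaging, not BigML's actual combiner:

```python
# Toy sketch of fusing classifiers: average the per-class probabilities
# produced by each constituent model and pick the class with the highest mean.
def fuse_predictions(per_model_probs):
    """per_model_probs: one {class_label: probability} dict per model."""
    totals = {}
    for probs in per_model_probs:
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + p
    n = len(per_model_probs)
    averaged = {label: total / n for label, total in totals.items()}
    return max(averaged, key=averaged.get), averaged

label, averaged = fuse_predictions([
    {"churn": 0.7, "stay": 0.3},  # e.g., an ensemble's output
    {"churn": 0.4, "stay": 0.6},  # e.g., a logistic regression's output
    {"churn": 0.9, "stay": 0.1},  # e.g., a deepnet's output
])
# label is "churn": two of the three models lean that way, and strongly so.
```

The appeal is that a model that is badly wrong on a given input gets outvoted by the others, which is exactly the "balancing out individual weaknesses" the fusion idea promises.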

As my colleague Teresa Álvarez mentioned in a previous post, however, this doesn’t typically lead to big gains in performance. We’re typically talking 5-10% improvements even in the best case. In many cases, OptiML will find something as good or better than any combination you could try by hand.

So, then, why bother? Why waste your time fiddling with combinations of models when you could spend it on doing things that will almost certainly have a more measurable impact on your model’s performance, like feature engineering or better yet, acquiring more and better data?

Part of the answer here is that looking at a number like “R squared” or “F1-score” is often an overly reductive view of performance. In the real world, how a model performs can be a lot more complex than how many answers it gets wrong and how many it gets right. For example, you, the domain expert, probably want a model that behaves appropriately with regards to the input features and also makes predictions for the right reasons. When a model “performs well”, that should mean nothing less than that the consumers of its predictions are satisfied with its behavior, not just that it gets some number of correct answers.

If you’ve got a model that has good performance numbers, but it’s exhibiting some wacky behavior or priorities, using fusions can be a good way to get equivalent (or occasionally better) performance, but with the added bonus of behavior that perhaps appears a little saner to domain experts. Here are a few examples of such cases:

“Minding The Gap” with tree-based models

There is plenty of literature out there showing that, for many datasets, ensembles of trees end up performing as well as or better than any other algorithm. However, trees do have one unfortunate attribute that may become obvious if someone observes many of their predictions: the model takes the form of axis-aligned splits, so large regions of the input space are assigned the same output value (resulting in a PDP that looks something like this):

treepdp

Practically, this will mean that small (or even medium-sized) changes to the input values will often result in identical predictions, unless the change crosses a split threshold, at which point it will change dramatically. This sort of stasis / discontinuity can throw many people for a loop, especially if you have a human operator working with the model’s predictions in real-time (e.g., “Important Feature X went up by 20% and the model’s prediction is still the same!”).
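A one-split toy model (purely illustrative) makes the effect concrete:

```python
# A single axis-aligned split: every input on the same side of the
# threshold gets an identical prediction, however much the feature moves.
def stump_predict(x, threshold=100.0):
    return "low_risk" if x < threshold else "high_risk"

print(stump_predict(60))   # low_risk
print(stump_predict(72))   # low_risk -- a 20% jump in the feature, same output
print(stump_predict(101))  # high_risk -- barely crossing the threshold flips it
```

A real ensemble has many such thresholds, but within each resulting partition the prediction is just as flat.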

A solution to the problem is to fuse the ensemble with a deepnet that performs fairly well. This changes the above PDP to look more like this:

diffpdp

You’ll still see thresholds where the prediction jumps a bit, but there’s no longer complete stasis within those partitions. If the deepnet’s performance is close to the ensemble’s, you’ll get a model with more dynamic predictions without sacrificing accuracy.

The Importance of Importance

In the summary view of the models learned when you use OptiML, you’ll see a pull-down that will give you the top five most important fields for each model.

importance

For non-trivial datasets, you may often see that models with equivalent or nearly-equivalent performance have very different field importances. The field importances we report are exactly what they say on the tin: they tell you how much the model’s predictions will change when you change the value of that feature.

This is where you, the domain expert, can use your knowledge to improve your model, or at least save it from disaster. You might see a case where a high performing model is relying on just a few features from the dataset, and another high performing model is relying on a few different features. Fusing the models together will give you a model guaranteed to take all of those features into account.

This can be a win for two reasons, even if the performance of the fused model is no better than the separate models. First, people looking at the importance report will find that the model is taking into account more of the input data when making its prediction, which people generally find more satisfying than the model taking into account only a few of the available fields. Second, the fused model will be more robust than the constituent models in the sense that if one of the inputs becomes corrupt or unreliable, you still have a good chance of making the right prediction because of the presence of the other “backup” model.

(Mostly) Uncorrelated Feature Sets with Different Geometries

Okay, so that’s a mouthful, but what I’m talking about here is situations where you’ve got a set of features, where some are better modeled separately from the others.

Why would you do this? It’s possible that a subset of the features in your data is amenable to one type of modeling and others to different types of modeling. If this is the case, and if those different features are not terribly well-correlated with one another, a fusion of two models, each properly tuned, may produce better results than either one on its own.

A good example is where you have a block of text data with some associated numeric metadata (say, a plain text loan description and numeric information about the applicant). Models like logistic regression and deepnets are generally good at constructing models that are algebraic combinations of continuous numeric features, and so might be superior for modeling those features. Trees and ensembles thereof are good at picking out relevant features from a large number of possibilities, and so are often well-suited to dealing with word counts. It seems obvious, then, that carefully tuning separate models for each type of data might be beneficial.

Whether the combination outperforms either one by itself (or a joint model) depends again on the additional performance you can squeeze out by modeling separately and the relative usefulness of each set of features.

That Is The Question

Hopefully, I’ve convinced you that there are reasons to use fusions that go beyond just trying to optimize your model’s performance; they can also be a way to get a model to behave in a more satisfying and coherent way while not sacrificing accuracy. If your use case fits one of the patterns above, go ahead and give fusions a try!

Want to know more about Fusions?

If you have any questions or you would like to learn more about how Fusions work, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.

Automating Fusions with WhizzML and the Python Bindings

This blog post is the fifth in our series of posts about Fusions and focuses on how to automate processes that include them using WhizzML, BigML’s Domain Specific Language for Machine Learning workflow automation. To summarize, a Fusion is a group of models that predict together in order to offset each model’s individual weaknesses.

In this post, we are going to describe how to automate a process that builds a good predictor by employing Fusions programmatically. As we have noted in other posts, WhizzML allows you to execute complex tasks that are computed completely on the server side, with parallelization. This avoids connection issues and respects your account limits on the maximum number of resources you can create at the same time. We will also describe the same operations with our Python bindings as another option for client-side control.

As we have mentioned, this release of Fusions puts the focus on the power of many models working together. Starting from the beginning, suppose we have a group of trained models (trees, ensembles of trees, logistic regressions, deepnets, or even another Fusion), and we want to use all of them to create new predictions. Using multiple models mitigates the weaknesses of any single model. The first step is to create a Fusion resource, passing the models as a parameter. Below is the code for creating a Fusion in the simplest way: passing a list of models in the format [“<resource_type/resource_id>”, “<resource_type/resource_id>”, …] and no other parameters, that is, accepting the defaults.

;; WhizzML - create a fusion
(define my-fusion (create-fusion {"models" my-best-models}))

If you choose to use Python to code your workflows and run the process locally, instead of completely on the server, the equivalent code is below, where the models are also passed as the only parameter, in a list.

# Python - create a fusion
fusion = api.create_fusion(["model/5af06df94e17277501000010",
                            "logisticregression/5af06df84e17277502000019",
                            "deepnet/5af06df84e17277502000016",
                            "ensemble/5af06df74e1727750100000d"])

Just like all BigML resources, Fusions have parameter options that the user can add in the creation request to improve the final result. For instance, suppose that we want to assign different weights to each of the models that compose the Fusion because we know that one of the models is more accurate than the others. Another point to highlight is that resource creation in BigML is asynchronous, which means that most of the time the creation request returns before the resource is completed. To get the completed resource, you have two main options: poll iteratively until it is finished, or use the functions provided for that purpose. In WhizzML, the function create-and-wait-fusion pauses the workflow execution until the Fusion is completed.

Let’s see how to do it, specifying weights for the models and assigning the variable once the resource is completed. Looking at the code below, you can see that we still pass a list of models, but now each entry is a map with the model’s ID and its weight:

;; WhizzML - create a fusion with weights and wait for the finish
(define my-fusion
  (create-and-wait-fusion {"models"
                           [{"id" "model/5af06df94e17277501000010"
                             "weight" 1}
                            {"id" "deepnet/5af06df84e17277502000016"
                             "weight" 4}
                            {"id" "ensemble/5af06df74e1727750100000d"
                             "weight" 3}]}))

In the Python bindings, the asynchronicity is managed by the ok function, and the weights are added to each model’s object in the Fusion. Here is the Python bindings code equivalent to the WhizzML code above.

# Python - create a fusion with weights and wait for the finish
fusion = api.create_fusion([
    {"id": "model/5af06df94e17277501000010", "weight": 1},
    {"id": "deepnet/5af06df84e17277502000016", "weight": 4},
    {"id": "ensemble/5af06df74e1727750100000d", "weight": 3}])
api.ok(fusion)

To see the complete list of arguments for Fusion creation, visit the corresponding section in the API documentation.

Once the Fusion has been created, the best way to measure its performance, as with every type of supervised model, is to make an evaluation. To do so, you need to choose data different from the data used to create the Fusion, since you want to avoid overfitting. This data is often referred to as a “test dataset”. Let’s first see how to evaluate a Fusion by employing WhizzML:

;; WhizzML - Evaluate a fusion
(define my-evaluation
    (create-evaluation {"fusion" my-fusion "dataset" my-test-dataset}))

and now how it should be done with the Python bindings:

# Python - Evaluate a fusion
evaluation = api.create_evaluation(my_fusion, my_test_dataset)

In both cases, the code is extraordinarily simple. With the evaluation, you can determine whether the performance is acceptable for your use case, or whether you need to continue improving your Fusion by adding models or training new ones.

As with any supervised resource, once the model performs well enough, you can start using it to make predictions, which is the goal of the built Fusion model. Continuing along, let’s write the WhizzML code to make a single prediction, that is, to predict the result for just one “row” of new data.

;; WhizzML - Predict using a fusion
(define my-prediction
    (create-prediction {"fusion" my-fusion
                        "input-data" {"state" "kansas" "bmi" 32.5}}))

To do exactly the same with Python bindings, your code should be like the following. The first parameter is the Fusion resource ID and the second one is the new data to predict with.

# Python - Predict using a fusion
prediction = api.create_prediction(my_fusion, {
    "state": "kansas",
    "bmi": 32.5
})

Here we are showing the simplest way to make a prediction, but prediction creation accepts a long list of parameters to tailor the result to your needs.

When your goal is not only to predict a single row but a group of data, represented as a new dataset (that you previously uploaded to the BigML platform), you should create a batchprediction resource, which only requires two parameters: the Fusion ID and the ID of this new dataset.

;; WhizzML - Make a batch of predictions using a fusion
(define my-batchprediction
    (create-batchprediction {"fusion" my-fusion "dataset" my-dataset}))

It couldn’t get any easier. The equivalent code in Python is almost the same and very simple too. Here it is:

# Python - Make a batch of predictions using a fusion
batch_prediction = api.create_batch_prediction(my_fusion, my_dataset)

Want to know more about Fusions?

Stay tuned for the next blog post to learn how Fusions work behind the scenes. If you have any questions or you would like to learn more about how Fusions work, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.

Programming Fusions with the BigML API

As part of our Fusions release, we have already demonstrated a use case and walked through an example using the BigML Dashboard. Our fourth of six blog posts on Fusions will demonstrate how to utilize Fusions by directly calling the BigML REST API. As a reminder, Fusions can be used for both classification and regression supervised Machine Learning problems, and function by aggregating the results of multiple models (decision trees, ensembles, logistic regressions, and/or deepnets), often achieving better performance as a result.

Authentication

Using the BigML API requires that you first set up the correct environment variables. In your .bash_profile, you must set BIGML_USERNAME, BIGML_API_KEY, and BIGML_AUTH to the correct values. BIGML_USERNAME is simply your BigML username. BIGML_API_KEY can be found on the Dashboard by clicking on your username to pull up the account page and then clicking on ‘API Key’. BIGML_AUTH combines the two into a ready-made query-string fragment that gets appended to the request URLs below.
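As a sketch of how the three variables fit together (with placeholder credentials; the composed value matches the `?$BIGML_AUTH` query strings used in the curl calls below):

```python
import os

def bigml_auth(username, api_key):
    """Build the BIGML_AUTH query-string fragment from the other two
    variables, ready to be appended to a request URL after '?'."""
    return "username=%s;api_key=%s" % (username, api_key)

# Placeholder credentials for illustration only; use your own.
os.environ["BIGML_USERNAME"] = "alice"
os.environ["BIGML_API_KEY"] = "1a2b3c4d"
os.environ["BIGML_AUTH"] = bigml_auth(
    os.environ["BIGML_USERNAME"], os.environ["BIGML_API_KEY"]
)
print(os.environ["BIGML_AUTH"])  # username=alice;api_key=1a2b3c4d
```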

Upload your Data and Create a Dataset

For this tutorial we are using the same dataset of home sales from the Redfin search engine used in our previous BigML Dashboard tutorial, also available in the BigML Gallery. Preparing our data for Machine Learning requires two major steps: first creating a source, followed by creating a dataset. It is important to make sure that the objective field of the dataset is “LAST SALE PRICE” before creating any predictive models.

curl "https://bigml.io/source?$BIGML_AUTH" -F file=@Redfin.csv

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"source": "source/5b3fa219983efc5ae5000055"}'

Create Two Simple Models

Because Fusions are aggregates of component models, we first need to create these models. In this case, we will create both an ensemble model and a deepnet model using the default parameters for each. Fusions typically work best when the component models are both high-performing and diverse.

curl "https://bigml.io/ensemble?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/5b3fa2ec983efc5bde000037"}'

curl "https://bigml.io/deepnet?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/5b3fa2ec983efc5bde000037"}'

Create your Fusion from Existing Models

Creating the Fusion is as straightforward as providing the resource IDs for the models that you would like to include. Here we are selecting both the ensemble and the deepnet created in the previous step and weighting them equally. However, it is possible to include any number of models in a Fusion, including other Fusions, and to adjust the weight each model carries in the final result.

curl "https://bigml.io/fusion?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"models": [
  "ensemble/5b4003a1983efc5a8e00002b",
  "deepnet/5b4003b3983efc5bde00005f"]}'

Evaluate your Fusion

Fusions are designed for both classification and regression problems, and can be evaluated to check for performance metrics, as well as to investigate aspects of the model such as field importance. Fusion evaluations require specifying both the trained Fusion, as well as the dataset to be evaluated.

curl "https://bigml.io/evaluation?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"model": "fusion/5b40d4a4983efc5c13000003",
  "dataset": "dataset/5b3fa2ec983efc5bde000037"}'

Create and Retrieve a Fusion Prediction

Once created, Fusions function like any other class of predictive model in BigML with regard to predictions. In the example below, we provide values for SQFT, BEDS, and BATHS as input data and then retrieve the result, which should yield $395,470 in this case. Of course, it is also possible to perform evaluations or batch predictions with Fusions; further examples can be found in the BigML API Documentation.

curl "https://bigml.io/prediction?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"fusion": "fusion/5b40d4a4983efc5c13000003", 
  "input_data": {"SQFT": 3000, "BEDS": 3, "BATHS": 2}}'

curl "https://bigml.io/prediction/5b40f017983efc5ae50000c2?$BIGML_AUTH"

Want to know more about Fusions?

Stay tuned for the next blog post to learn how to automate Fusions with WhizzML and the BigML Python Bindings. If you have any questions or you would like to learn more about how Fusions work, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.

 

Fusions with the BigML Dashboard: Improving the Performance of your Models

The BigML Team is bringing Fusions to the BigML platform on July 12, 2018. As explained in our introductory post, Fusions are a supervised learning technique that can be used to solve classification and regression problems. Fusions combine multiple Machine Learning models (decision trees, ensembles, logistic regressions, and/or deepnets) and average their predictions. Fusions rely on the same principle as ensembles: the combination of several models often outperforms the individual models it is composed of.

In this post, we will take you through the four steps needed to train, analyze, evaluate, and predict with a Fusion using the BigML Dashboard. We will use data on homes for sale in Corvallis, Oregon from the Redfin search engine; you can find the full dataset in the BigML Gallery. We want to predict home prices using the home characteristics in the dataset (size, number of bedrooms, baths, etc.) as inputs. We filtered out the houses priced above $600,000 because there were eight outlier houses with very high prices that could introduce noise into our models.

1. Create a Fusion

To create a Fusion you need at least one existing model (decision tree, ensemble, logistic regression, or deepnet). In this case, we already trained an ensemble and a deepnet, which yielded R squared values of 0.77 and 0.78, respectively. The main goal is to improve on these single-model performances by creating a Fusion.

You can create a Fusion using multiple options, which are listed below so you can select the one that best fits your needs:

  • If you want to start by selecting one model, click the “Create Fusion” option from the 1-click menu.
  • If you prefer to select multiple models at the same time you can do it from the model list and then click the “Create Fusion” button.
  • If you want to select multiple models that belong to an OptiML, you can do it from the OptiML model list.

In this case, we are using the first option:

create-fusion-ensemble

Any of these options will redirect you to the New Fusion view (see image below). From this view you will be able to configure a set of options:

  • See the models selected, select more models, or remove them. All the models must have compatible objective fields. That’s why BigML uses the objective field name of the first selected model as a filter, so you can only select models with the same objective field name. However, you can remove this filter in case two models share the same objective field under different names.
  • Assign different weights to the selected models. At the prediction time, BigML will use these weights to make a weighted average of the model predictions.
  • Map your model fields with the Fusion fields in case two models have the same field but they have different names.

In this example, we are selecting our two models, the ensemble and deepnet, to create the Fusion. Since the deepnet seems to perform a bit better than the ensemble, we are giving it a weight of 2. This means that the deepnet will have two times more impact than the ensemble in the Fusion predictions.

fusion-configuration

To make sure your Fusion improves the single model performances, you need to select high-performing models that are as diverse as possible. If you use several identical models or models with sub-par performance, the Fusion will not be able to improve on the results of the component models.

2. Analyze your Results

When the Fusion is created, you will be able to inspect the results in a Partial Dependence Plot (PDP). The PDP allows you to view the impact of the input fields on the objective field predictions. You can select two different input fields for both axes and the predictions are represented in different colors in a heat map format.

You can see below how the square feet and the parking spots impact the predictions: higher values for both measures result in a much higher price.

pdp-fusions.png

BigML also calculates the field importances for your Fusion as we do for other supervised models (except for logistic regressions). The Fusion field importances are calculated by averaging the per-field importances of each single model (decision trees, ensembles, and deepnets) composing the Fusion. If your Fusion only contains logistic regressions, the field importances cannot be calculated. In our case, you can see below that by far the most important field for predicting house prices is the square feet (66%).
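The averaging rule described above can be sketched in a few lines of Python. The field names and importance values below are made up for illustration, and a field absent from one model is assumed to contribute zero for that model:

```python
def fusion_importances(per_model_importances):
    """Average per-field importances across the component models,
    treating a field that is missing from a model as importance 0."""
    fields = set().union(*per_model_importances)
    n = len(per_model_importances)
    return {
        f: sum(m.get(f, 0.0) for m in per_model_importances) / n
        for f in fields
    }

# Hypothetical importances from an ensemble and a deepnet.
ensemble = {"sqft": 0.70, "beds": 0.30}
deepnet = {"sqft": 0.60, "baths": 0.40}
imps = fusion_importances([ensemble, deepnet])
print({f: round(v, 2) for f, v in sorted(imps.items())})
# {'baths': 0.2, 'beds': 0.15, 'sqft': 0.65}
```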

fusion-importances

3. Evaluate your Fusion

After analyzing our Fusion, we need to evaluate it with a dataset that has not been used to train the models, just as with other supervised learning models. The performance measures for Fusions are the same as for any other classification or regression model.

In our case, we evaluate using the same dataset that we used to evaluate our ensemble and deepnet so the results are comparable: a holdout set containing 20% of the instances of our original dataset. In the end, we achieved an R squared of 0.81, which is higher than what we obtained for our single models (the deepnet was the best performer at 0.78). The difference in performance may seem small, and while results vary across use cases, it is not unusual to find only small gains from Fusions. In most problems, the biggest gains in performance come from feature engineering or adding more data rather than solely from model configuration. However, when every point of accuracy and performance counts, Fusions can be a quick way to squeeze out extra performance because they are so easy to execute on BigML.

fusion-evaluation

4. Make Predictions

Predictions work for Fusions exactly the same as for any other supervised method in BigML. You can make predictions for a new single instance or multiple instances in batch fashion.

Single predictions

Click on the Predict option from your Fusion view. A form containing all your input fields will be displayed, and you will be able to set the values for a new instance. Before clicking the Predict button, you can also enable the prediction explanation option to obtain the importance of each input field in your prediction.

fusions-predict

At the top of the view, you will get the prediction along with the expected error (for numeric objective fields) or the class probabilities (for categorical objective fields). Below the prediction, you can see the histogram with the field importances for this prediction. In this case, the most important field is the square feet (94%) because it is less than 2,986.

prediction-explanation

Batch predictions

If you want to make predictions for multiple instances at the same time, click on the Batch Prediction option and select the dataset containing the instances for which you want to know the objective field value.

fusions-batch-prediction.png

You can configure several parameters of your batch prediction like the possibility to include all class probabilities in the output dataset and file. When your batch prediction finishes, you will be able to download the CSV file and see the output dataset.

Want to know more about Fusions?

Stay tuned for the next blog post to learn how to use Fusions programmatically with the BigML API. If you have any questions or you would like to learn more about how Fusions work, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.

Small Differences Matter: Maximizing Model Performance with Fusions of Models

The results of many Machine Learning competitions, perhaps most famously the Netflix Prize, have demonstrated that a combination of models often yields better performance than any single component model. The techniques used to aggregate multiple models go by several names depending on the precise methodology, including boosting, bagging, and stacking. At their core, the efficacy of such methods can be attributed to their ability to partially mitigate the performance costs inherent in the bias-variance tradeoff. More colloquially, these methods are based on the same principle that guides the “wisdom of the crowd,” in which the collective opinion of a group tends to provide better estimates than those of a single expert. To let you take easy advantage of this phenomenon, we have added Fusions to the supervised Machine Learning options available at BigML.

This blog post, the second in a series of 6 posts exploring Fusions, illustrates how Fusions can be applied to improve Machine Learning performance. For that, we will use a dataset of wine quality.

The Dataset

The dataset is available for download at both Kaggle and the UCI Machine Learning repository. It consists of 1599 wine samples, each characterized by 12 variables describing chemical properties of the wine, including measurements such as pH, citric acid, and residual sugar. The objective field in this dataset is wine quality, which ranges in value from 0 (low quality) to 10 (top quality), with most values falling at an intermediate level. Our task is to treat this as a regression problem in which we predict the quality score as accurately as possible using only these chemical properties. Unfortunately, the true identities and price points of the wines in this dataset are not available.

Wine Data

Overview of the fields included in the wine quality dataset

Use Case #1 – Fusion of Model Types

One way that Fusions can improve overall predictive performance is by combining different types of models. To demonstrate the concept, we split our dataset into training (80%) and test (20%) sets and trained both a deepnet and a decision tree ensemble before combining the two into a Fusion and evaluating it on the test set.
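The 80/20 split itself can be sketched with nothing but the Python standard library (BigML performs this split for you on the platform; the seed here is arbitrary):

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle and split rows into a training set and a held-out
    test set, keeping every row in exactly one of the two."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]

# 1599 wine samples -> 1279 for training, 320 held out.
train, test = train_test_split(range(1599))
print(len(train), len(test))  # 1279 320
```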

Create Fusion

BigML Dashboard view for creating a new Fusion from existing models

Regression models in BigML can be evaluated using Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-Squared (R2), and all three are included in the comparison table below. While the improvement is not dramatic, the simple act of aggregating the two weaker models into a Fusion yields a predictive model with superior performance on both MSE and R2, and essentially equivalent performance in terms of MAE.

|          | MAE  | MSE  | R2   |
|----------|------|------|------|
| Ensemble | 0.46 | 0.44 | 0.35 |
| Deepnet  | 0.50 | 0.41 | 0.39 |
| Fusion   | 0.47 | 0.39 | 0.42 |
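For reference, the three metrics can be computed from held-out predictions as follows. This is a plain-Python sketch of the standard definitions, with made-up numbers:

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error: average squared error (penalizes outliers)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """R2: fraction of the variance in y_true explained by the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Illustrative quality scores and model predictions.
y_true, y_pred = [5, 6, 5, 7, 4], [5.5, 6, 5, 6, 4.5]
print(mae(y_true, y_pred), mse(y_true, y_pred), round(r_squared(y_true, y_pred), 3))
# 0.4 0.3 0.712
```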

Use Case #2 – Fusion with Different Training Data

Fusions can be created from models as long as their objective fields are identical; there is no requirement that the component models be trained on the same data, as they were in the previous example. In this scenario, let’s consider two models created from different sets of features: a simple univariate regression using solely the most important field, “alcohol” (according to our previous Fusion model), and another model trained on the remaining 11 variables.

Fusion Feature Importance

Field importance for the Fusion model previously created

In this case, we found that both the “Alcohol Only” and the “No Alcohol” models perform modestly, but as components of a Fusion, these relatively poor models have a performance on par with an optimized deepnet. While the result of this example may seem trivial given what we have already learned about this fairly straightforward dataset, it illustrates the flexibility in deploying Fusions in scenarios where existing models ingest different forms of training data for a common objective.

|              | MAE  | MSE  | R2   |
|--------------|------|------|------|
| Alcohol Only | 0.58 | 0.56 | 0.17 |
| No Alcohol   | 0.54 | 0.66 | 0.02 |
| Fusion       | 0.50 | 0.43 | 0.36 |

Use Case #3 – Fusions and OptiML

If Fusions can improve the performance of two simple models, the next logical step is to evaluate how much optimized models can be improved. Using OptiML, we generate a variety of top-performing models in a single click and can select any of the results to combine into a Fusion. In this case, we will use all 9 of the OptiML-generated models, including 6 different ensemble tree models and 2 different deepnets.

Create Fusion - OptiML

Creation of a Fusion using all models generated from OptiML

The Fusion created from the OptiML models yielded an MSE of 0.36 and an R squared value of 0.47 when evaluated against our held-out test set. This performance is not only greater than that of the other Fusions we created, but also better than any of the 9 component models generated by OptiML. It quickly becomes apparent that chaining OptiML and Fusions together in a Machine Learning workflow simplifies the complicated iterative process of model tuning and aggregation into a concise, straightforward pipeline.

Final Thoughts

Machine Learning performance largely depends on the quality of the training data and the ingenuity involved in feature engineering, rather than the sophistication of the models themselves. Still, in applications where narrow margins in predictive performance matter greatly, aggregating the results of multiple models can often provide a much desired improvement in performance metrics. In your next Machine Learning project, we encourage you to use the power of BigML Fusions to combine the various models you may have created along the way and see if it yields a more reliable and accurate result overall.

Want to know more about BigML Fusions?

Stay tuned for the next blog post to learn how to use Fusions with the BigML Dashboard.  If you have any questions or you would like to learn more about how Fusions work, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.

Better Model Performance Through Algorithmic Diversity

With the upcoming release on Thursday, July 12, BigML keeps pushing the envelope for more people to gain access to and make a bigger impact with Machine Learning. This release features a brand new resource providing a novel way to combine models: BigML Fusions.

In this post, we’ll do a quick introduction to Fusions before we move on to the remainder of our series of 6 blog posts (including this one) to give you a detailed perspective on what’s behind this new capability. Today’s post explains the basic concepts; an example use case comes next, followed by three more blog posts on how to use Fusions through the BigML Dashboard, the API, and, in an automated fashion, WhizzML and the Python bindings. Finally, we will complete the series with a technical view of how Fusions work behind the scenes.

Understanding BigML Fusions

Classification and regression problems can be solved using multiple Machine Learning methods on BigML, such as models, ensembles, logistic regressions, and deepnets. In previous blog posts, we’ve covered the strengths and weaknesses of these resources; e.g., logistic regression tends to perform best when the relationship between the input fields and the target variable is linear and smooth, whereas deepnets are capable of handling curvilinear decision boundaries.

BigML Fusions Board

A typical Machine Learning project involves multiple iterations. All else being equal, we advocate starting your project with the application of models (aka decision trees) and logistic regressions as they are easy to interpret and tend to train very fast even for large datasets. You may even want to rely on the 1-click modeling options for those to get to acceptable baseline models faster with intelligent default parameters.

Next up, you can try more complex algorithms such as ensembles or deepnets to improve model performance without modifying your dataset or adding new features that may be costly. Each of these algorithms can be applied in 1-click, manually configured, or automatic optimization (i.e., hyperparameter tuning) mode.

With the most recent release, BigML added the OptiML resource, which combines the automatic optimization of every classification and/or regression algorithm supported by the BigML platform into a single task using Bayesian optimization. Together, these alternatives serve users at all levels of sophistication with differing model performance and time constraints.

However, in situations that require you to squeeze every last ounce of model accuracy from the available data, you may find it useful to combine different types of Machine Learning models by averaging their predictions, a simple yet effective way to balance out the individual weaknesses of any single underlying model. In this regard, Fusions are based on the same “wisdom of the crowd” principle as ensembles, under which the combination of multiple models often leads to stronger performance than any of the individual members.
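For classification, the principle amounts to averaging per-class probabilities across the component models and picking the winning class. The following is an illustration of the idea with made-up models and numbers, not BigML's exact implementation:

```python
def fusion_classify(per_model_probs):
    """Average each class's probability across models, then return
    the winning class and the averaged distribution."""
    classes = per_model_probs[0].keys()
    n = len(per_model_probs)
    avg = {c: sum(m[c] for m in per_model_probs) / n for c in classes}
    return max(avg, key=avg.get), avg

# A confident model and a hesitant one disagree; the average decides.
tree = {"churn": 0.9, "stay": 0.1}
logistic = {"churn": 0.4, "stay": 0.6}
label, probs = fusion_classify([tree, logistic])
print(label)  # churn
```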

Of course, there is no guarantee that every Fusion you try will yield a performance improvement. For better results, each base model has to be as accurate as possible, but ultimately the incremental gain comes from the diversity of mathematical representations across heterogeneous algorithms. How much a Fusion enhances performance is use case specific, but given the ease of creating Fusions on BigML (either via point-and-click on the Dashboard or the API), one can find some easy pickings fairly quickly. Finally, even in those cases where model accuracy measures show little improvement, you can still benefit from Fusions because they tend to be more stable than single models.

Creating a Fusion by selecting models with the same objective field.

Model Weights

Without going into great detail in this post: when working with Fusions, you can assign different weights to the selected models before creating the Fusion. In that case, at prediction time, BigML computes a weighted average of all model predictions according to the specified weights, so a model with a higher weight has more influence on the final prediction. That also means that if you assign a weight of 0 to a particular model in your Fusion, the results from that model will not be taken into account.
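For a regression Fusion, the weighted average described above amounts to the following sketch (the predictions and weights are illustrative):

```python
def weighted_fusion_prediction(predictions, weights):
    """Weighted average of per-model predictions; a model with
    weight 0 contributes nothing to the result."""
    total = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total

# Deepnet weighted 2, ensemble weighted 1, a third model zeroed out.
preds = [310_000, 280_000, 999_999]
weights = [2, 1, 0]
print(weighted_fusion_prediction(preds, weights))  # 300000.0
```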

Want to know more about BigML Fusions?

If you have any questions or you would like to learn more about how Fusions work, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.
