
Data Transformations with the BigML Dashboard: Get your Machine Learning-Ready Data in a Few Clicks

Data preparation is a key task in any Machine Learning workflow, but it’s often one of the most challenging and time-consuming parts. BigML’s upcoming release brings new data transformation features that make it faster and easier than ever before to get your data ready for Machine Learning.

These features significantly expand the data preparation options that BigML already provides, such as missing values treatment, categorical values encoding, date-time fields expansion or NLP techniques for your text fields.

All the new data transformation features can be classified into two groups:

  • SQL queries: The capability of writing SQL queries to create new datasets opens up an infinite number of transformations to prepare your data for Machine Learning. Although the ability to freely write SQL statements will be an API-only feature for now, we are bringing some common transformations to the Dashboard for users who prefer to transform their data in a few clicks: aggregate instances, join and merge datasets. The idea is to add more options in the Dashboard on an ongoing basis; for example, the ability to order instances and remove duplicates. Please send us an e-mail if you have any particular request.
  • Feature engineering: new sliding windows feature, and significant improvements to the Flatline Editor, enabling more ways to easily create fields for your datasets.

Aggregating Instances

The aggregating instances option in BigML allows you to group the rows of a dataset by a given field.

For example, imagine you have customer data stored in a dataset where each purchase is a different row. If you want to use this dataset to train models that analyze customers’ purchase behaviors, you need a dataset where each row is a customer instead of a purchase. This is the case of the dataset in the image below, where we can aggregate the instances by the field “CustomerID” to get a row per unique customer. To add the rest of the fields to the new dataset, we also need an aggregation function for each of them, such as the total purchases per customer (“Count_customerID”), the total units purchased (“Sum_Quantity”), the first purchase date (“Min_Date”) or the average unit price paid per customer (“Avg_UnitPrice”).


You can easily do this on the BigML Dashboard by following these steps:

  • Click the “Aggregate instances” option from the dataset configuration menu:


  • Select the “CustomerID” as the grouping field:


  • Configure the aggregation operations for the fields you want to include in the final dataset. For example, in the image below we are including the count of rows per customer and the total amount of units purchased:


  • When you have all the operations defined, click on the “Aggregate instances” button:


This will create a new dataset containing a customer per row and the columns that you defined using the aggregation functions described above. From this dataset, you can also see the SQL query under the hood by clicking the option highlighted in the image below.
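
The aggregation described above can be sketched outside of BigML as a plain SQL `GROUP BY`. This is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in database, the purchase rows are made up, and the query is not necessarily the one BigML generates under the hood. Only the output field names are taken from the example above.

```python
import sqlite3

# Made-up purchase rows: (CustomerID, Quantity, UnitPrice, Date).
purchases = [
    ("C1", 2, 10.0, "2018-01-05"),
    ("C1", 1, 20.0, "2018-02-10"),
    ("C2", 5, 4.0, "2018-01-20"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases "
    "(CustomerID TEXT, Quantity INTEGER, UnitPrice REAL, Date TEXT)"
)
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", purchases)

# One row per customer; every other field needs an aggregation function.
aggregated = conn.execute(
    """
    SELECT CustomerID,
           COUNT(CustomerID) AS Count_customerID,
           SUM(Quantity)     AS Sum_Quantity,
           MIN(Date)         AS Min_Date,
           AVG(UnitPrice)    AS Avg_UnitPrice
    FROM purchases
    GROUP BY CustomerID
    ORDER BY CustomerID
    """
).fetchall()

for row in aggregated:
    print(row)
```

Each printed tuple corresponds to one customer, which is the row-per-customer shape the aggregated dataset has.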


Joining Datasets

BigML allows you to join several datasets to combine their fields and instances based on one or more related fields between them. This is very useful when your data is scattered in two or more datasets.

For example, imagine we want to predict employee performance and we have two different sources of data: a dataset containing employees’ data (employee name, salary, age, etc.) and another dataset containing departments data (department name, budget, etc.). If we want to include the department data as an additional predictor for our employees’ analysis, we can use a common field in both datasets (department_id) to add the department characteristics to the employee dataset (see image below).



You can easily do this on the BigML Dashboard by following these steps:

  • Click the “Join datasets” option from the dataset configuration menu:


  • Then select the type of join: left join if you want to get all the instances from the current (left) dataset and the matched instances from the selected (right) dataset; or inner join if you want to get only the instances that have matching values in both datasets. In this case, we are selecting the left join because we want all the employees regardless of whether they have a matching department. Next, select the dataset you want to make the join with:


  • Select one or more pairs of joining fields to match the instances of both datasets. In this example, we select the department_id to make the match:


  • Decide which fields of the selected dataset (the departments dataset in our case) you want to include in the final joined dataset:


  • Optionally, you can filter the joined dataset by selecting fields from the current or the selected dataset and setting up different filtering conditions. Then go ahead and click the “Join datasets” button.


This will create a dataset that will contain the matched instances and fields from both datasets. From this dataset, you can also see the SQL query under the hood by clicking the option highlighted in the image below.
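
To make the difference between the two join types concrete, here is a small sketch using Python's built-in `sqlite3` module as a stand-in. The employee and department rows are invented, and the queries are illustrative rather than the SQL that BigML actually produces; only the `department_id` joining field comes from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (name TEXT, salary REAL, department_id INTEGER)"
)
conn.execute(
    "CREATE TABLE departments "
    "(department_id INTEGER, department_name TEXT, budget REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ana", 52000.0, 1),
        ("Ben", 48000.0, 2),
        ("Cho", 61000.0, None),  # employee without a matching department
    ],
)
conn.executemany(
    "INSERT INTO departments VALUES (?, ?, ?)",
    [
        (1, "Research", 900000.0),
        (2, "Sales", 400000.0),
        (3, "Legal", 250000.0),  # department without employees
    ],
)

QUERY = """
    SELECT e.name, e.salary, d.department_name, d.budget
    FROM employees AS e
    {join_type} JOIN departments AS d
      ON e.department_id = d.department_id
    ORDER BY e.name
"""

# Left join: every employee is kept; unmatched department fields are NULL.
left_join = conn.execute(QUERY.format(join_type="LEFT")).fetchall()

# Inner join: only employees with a matching department are kept.
inner_join = conn.execute(QUERY.format(join_type="INNER")).fetchall()

print(left_join)
print(inner_join)
```

The left join keeps the employee without a department and fills the department fields with missing values, while the inner join drops that row.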


Merging Datasets

The merging datasets option in BigML allows you to include the instances of several datasets in one dataset.

This functionality can be very useful when you use multiple sources of data. For example, imagine we have employees data in two different datasets and we want to merge them into one dataset.


You can easily do this on the BigML Dashboard by following these steps:

  • Click the “Merge datasets” option from the dataset configuration menu:


  • Select the datasets you want to merge. The datasets should have the same fields so the instances of one dataset can be added after the instances of the other dataset. You can select up to 32 datasets to merge. You can sample each of the datasets selected for the merge by configuring the typical BigML sampling parameters like the percentage rate, replacement, out-of-bag, and seed parameters.


  • Click on the “Merge datasets” option:


This will create a dataset that will contain the instances from the merged datasets. From this dataset, you can also see the merging information by clicking the option highlighted in the image below.
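
Conceptually, a merge appends the rows of one dataset after the rows of another, optionally sampling each input first. The sketch below shows that idea in plain Python. The helper names `sample_rows` and `merge_datasets` are hypothetical, the data is invented, and the sampling logic is a simplification (no out-of-bag option) rather than BigML's implementation; the limit of 32 datasets is the one mentioned above.

```python
import random

MAX_DATASETS = 32  # limit mentioned in the post


def sample_rows(rows, rate=1.0, replacement=False, seed=None):
    """Hypothetical helper: sample a fraction of the rows.

    Passing the same seed gives the same sample again.
    """
    rng = random.Random(seed)
    size = round(len(rows) * rate)
    if replacement:
        return [rng.choice(rows) for _ in range(size)]
    return rng.sample(rows, size)


def merge_datasets(datasets, fields):
    """Hypothetical helper: concatenate datasets that share the same fields."""
    if len(datasets) > MAX_DATASETS:
        raise ValueError(f"cannot merge more than {MAX_DATASETS} datasets")
    merged = []
    for rows in datasets:
        for row in rows:
            if set(row) != set(fields):
                raise ValueError(f"row {row!r} does not match fields {fields!r}")
            merged.append(row)
    return merged


fields = ["name", "salary"]
employees_a = [
    {"name": "Ana", "salary": 52000},
    {"name": "Ben", "salary": 48000},
]
employees_b = [{"name": "Cho", "salary": 61000}]

merged = merge_datasets([employees_a, employees_b], fields)
print([row["name"] for row in merged])

# Take a reproducible 50% sample of the first dataset before merging.
half_of_a = sample_rows(employees_a, rate=0.5, seed=42)
merged_sampled = merge_datasets([half_of_a, employees_b], fields)
print(len(merged_sampled))
```

Checking the fields up front mirrors the requirement that the datasets have the same fields before their instances can be stacked.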


Feature Engineering

Feature engineering, i.e., the creation of new features that can be better predictors for your models, is one of the most important tasks in Machine Learning because it is usually the biggest source of model improvement. That’s why we also focused our efforts on bringing sliding windows to the BigML Dashboard and improving the Flatline Editor.

Sliding windows

Creating new features using sliding windows is one of the most common feature engineering techniques in Machine Learning. It is usually applied to frame time series data using previous data points as new input fields to predict the next time data points.

For example, imagine we have one year of sales data to predict sales. As domain experts, we know that past sales can be key predictors to predict today’s sales. Therefore, we can use our objective field “sales” to create additional input fields that contain past data. We can create an infinite number of fields: last day sales, the average of last week sales, the difference between last month and this month sales, etc. In the image below, we are creating a new predictor that calculates the average sales of the last two days (see the field in green “avgSales_L2D”).


This can easily be done on the BigML Dashboard by following these steps:

  • Click the “Add fields” option from the dataset configuration menu:


  • Select the mean out of the Sliding windows operations in the selector:


  • Select the field you want to apply the operation to, a window start of -2 and a window end of -1 (the window start and end define the first and last instances to be considered for the defined calculation; negative values are previous instances, positive values are next instances, and zero is the current instance). Then click the “Create dataset” button.


This will create a dataset with a new field that contains the average sales of the last two days and can be used as a new predictor.
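
The window start and end semantics can be illustrated with a few lines of Python. This is a sketch under assumptions: the function name `sliding_mean` is hypothetical, the sales figures are invented, and returning a missing value (`None`) when the window reaches past the edge of the data is a choice made for this example, not necessarily how BigML treats incomplete windows.

```python
def sliding_mean(values, start, end):
    """Mean over the instances from offset `start` to offset `end`.

    Offsets are relative to the current instance: negative values point to
    previous instances, zero is the current one, positive values are next ones.
    Assumption: windows that fall outside the data produce None.
    """
    result = []
    for i in range(len(values)):
        first, last = i + start, i + end
        if first < 0 or last >= len(values):
            result.append(None)
            continue
        window = values[first:last + 1]
        result.append(sum(window) / len(window))
    return result


sales = [10, 20, 30, 40]

# Average sales of the last two days, like the "avgSales_L2D" field.
avg_sales_l2d = sliding_mean(sales, start=-2, end=-1)
print(avg_sales_l2d)  # [None, None, 15.0, 25.0]
```

Because the window ends at -1, the current day's sales never leak into the new predictor, which matters when “sales” is also the objective field.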

Flatline Editor Improvements

The Flatline editor allows you to easily create new fields for your dataset by using BigML’s domain-specific language Flatline. You can access the editor by selecting the option “Add fields” from the dataset configuration menu, then select the Flatline formula operation and click on the editor icon (see image below).


You can see that the dataset preview now includes a table view where you can easily see a sample of your instances.


When you write a formula and you want to view its result, the preview only shows the fields involved in the formula. That way you can easily check whether your formula is being calculated correctly. For example, in the image below, you can see only two fields in the preview: the one used in the formula as input (the field “duration”) and the new field that results from the formula (if the duration of the movie is higher than 100 minutes it is classified as “long”, otherwise it is “short”). You can also change this view to show all the dataset fields again using the green switcher on top of the table preview.
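
For readers who think in code, the logic of that example formula, and of a preview restricted to the fields involved, looks roughly like the Python below. This is not Flatline syntax; the function name, the output field name `length_class`, and the movie rows are all made up for illustration.

```python
def classify_duration(duration, threshold=100):
    """Label a movie as "long" when it runs for more than `threshold` minutes."""
    if duration is None:
        return None  # assumption: a missing input gives a missing output
    return "long" if duration > threshold else "short"


movies = [
    {"title": "Movie A", "duration": 95, "genre": "drama"},
    {"title": "Movie B", "duration": 142, "genre": "action"},
    {"title": "Movie C", "duration": 100, "genre": "comedy"},
]

# Preview only the input field and the newly computed field.
preview = [
    {
        "duration": movie["duration"],
        "length_class": classify_duration(movie["duration"]),
    }
    for movie in movies
]

for row in preview:
    print(row)
```

Note that a movie of exactly 100 minutes is “short”, since the condition is strictly “higher than 100”.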


Want to know more about Data Transformations?

Stay tuned for the upcoming blog post to learn how to perform data transformations with SQL via the BigML API. If you have any questions or you would like to learn more about these new features, please join our free, live webinar on Thursday, October 25, 2018, at 10:00 AM PDT / 7:00 PM CEST. Register today as space is limited! Also, check out the release page for the series of blog posts, the BigML Dashboard and API documentation, and the webinar slideshow and full recording, which will be published following the webinar.

Introduction to Data Transformations

BigML’s upcoming release on Thursday, October 25, 2018, will present the latest addition to the platform: Data Transformations. In this post, we’ll give a quick introduction to Data Transformations before we move on to the remainder of our series of 6 blog posts (including this one), which will give you a detailed perspective of what’s behind the new capabilities. Today’s post explains the basic concepts and will be followed by an example use case. Then, there will be three more blog posts focused on how to use Data Transformations through the BigML Dashboard, API, and WhizzML for automation. Finally, we will complete this series of posts with a technical view of how Data Transformations work behind the scenes.

Understanding Data Transformations

Transforming your data is one of the most important, yet time-consuming and difficult tasks in any Machine Learning workflow. Of course, “data transformations” is a loaded phrase and entire books are authored on the topic. When mentioned in the Machine Learning context, what we mean is a collection of actions that can be performed compositionally on your input data to make it more responsive to various modeling tasks — if you will, these are the methods to optimally prepare or pre-process your data.


As a reminder, BigML already offers several automatic data preparation options (missing values treatment, categorical fields encoding, date-time fields expansion, NLP capabilities, and even a full domain-specific language for feature generation in Flatline) as well as useful dataset operations such as sampling, filtering, and the addition of new fields. Despite those, we’ve been looking to add more capabilities for full-fledged feature engineering within the platform.

Well, the time has come! This means the powerful set of supervised and unsupervised learning techniques we’ve built from scratch over the last 7 years all stand to benefit from data better prepared to make the most of them. Without further ado, let’s see what goodies made it to this release:

Aggregating Datasets

  • Aggregating instances: at times aggregating highly granular data at higher levels can be necessary. When that happens, you can group your instances by a given field and perform various operations on the other fields. For example, you may want to aggregate sales figures by product and perform further operations on the resulting dataset before applying Machine Learning techniques such as Time Series.
  • Joining datasets: if your data comes from different sources and is spread across multiple datasets, you need to join those datasets by defining a join field. For instance, imagine you have a dataset containing user profile information such as account creation date, age, sex, and country, and another dataset that contains the transactions that belong to those users, with critical fields like transaction date, payment type, amount and more. If you’d rather have all those fields in a single dataset, you can join those datasets based on a common field such as customer_id.
  • Merging datasets: if you have multiple datasets with the same fields, then you may want to concatenate them before you continue your workflow. Take for example a situation where daily files of sensor data need to be collated into a single monthly file before you can proceed. This would be a breeze with the new merge capability built into the BigML Dashboard.
  • Sliding windows: Creating new features using sliding windows is one of the most common feature engineering techniques in Machine Learning. It is usually applied to frame time series data using previous data points as new input fields to predict the next time window data points. For instance, in predicting hospital re-admissions, we may want to break the healthcare data into weekly windows and see how those weekly signals are correlated with the likelihood of re-admission in the following weeks after the patient is released from the hospital.
  • SQL support: This is big! BigML now supports all the operations from PostgreSQL, which means you have the full power of SQL at your disposal through the BigML REST API. You can choose between writing a free-form SQL query or using the JSON-like formulas that the BigML API supports. You can also easily see the associated SQL queries that created a given dataset and even apply them to other datasets; more on those in the subsequent blog posts.

NOTE: Keep this under wraps for now, but before you know it, the Dashboard will be supporting other capabilities such as ordering instances, removing duplicate instances, and more!

Want to know more about Data Transformations?

To learn more about BigML’s upcoming release, please join our free, live webinar on Thursday, October 25, 2018, at 10:00 AM PDT. Register today as space is limited! Stay tuned for our next 6 blog posts that will present step by step how to transform your data with the BigML platform.

Data Transformations: Machine Learning-Ready Data!

BigML’s new release is here! Join us on Thursday, October 25, 2018, at 10:00 AM PDT (Portland, Oregon. GMT -07:00) / 07:00 PM CEST (Valencia, Spain. GMT +02:00) for a FREE live webinar to discover the new Data Transformation options added to the BigML platform, which will yield better results in your projects by simplifying a key part of any Machine Learning workflow.

In previous releases, BigML has focused on presenting a wide range of choices on new algorithm implementations and automation to help you solve a wide array of Machine Learning problems. Now, with the Data Transformations release, we reach an important milestone in our roadmap by enhancing our offering in the area of data preparation as well. Typically, data do not come in a format ready to start working on a Machine Learning project right away. Data from many different sources come in different formats, and with plenty of information that does not add any value for the algorithm that will learn from it. Therefore, preparing your data for your Machine Learning project is a key part of the process to obtain the best predictive model.

Although BigML already offers several automatic data preparation options (missing values treatment, categorical fields encoding, date-time fields expansion, NLP capabilities, and even a full domain-specific language for feature generation), we knew we still had more tools to add for full-fledged feature engineering within the platform. That is why BigML is adding new capabilities that greatly expand the functionality for preparing your Machine Learning-ready dataset. The latest version of BigML lets you perform SQL-style queries over your datasets, brings significant improvements to the editor used to write feature generators in Flatline (our feature engineering DSL), and adds new ways of further improving feature engineering.

Up to now, the main ways of transforming datasets were sampling, filtering, and the addition of new fields. All of them work by scanning the input dataset and performing actions based on a finite number of rows at once. However, you cannot perform global operations like ordering, joins, or aggregations in this fashion. In this release, we introduce SQL-like queries that are able to perform such global transformations, among others. This set of operations is crucial for transforming the data you have into the data you actually need. With queries, you will be able to aggregate instances of your dataset, join datasets, as well as merge them. You can also easily execute your queries in a few clicks as you have the full power of SQL at your disposal through the BigML REST API. There’s more: the Dashboard will shortly support other capabilities like ordering instances, removing duplicates, and more!

The BigML Flatline Editor has been upgraded to easily help you create new fields and validate existing Flatline expressions in your Dashboard in an even more friendly editor. For new BigML users who are not familiar with the term, Flatline is BigML’s domain-specific language for data generation and filtering, which helps you transform your datasets and engineer new features in a wide variety of ways. Apart from the Flatline editor, we also offer some common predefined operations from the Dashboard that allow you to create new features with a few clicks instead of writing formulas. Finally, we are adding sliding windows, one of the most common feature engineering techniques used in Machine Learning. Sliding windows are frequently applied to frame time series data by using previous data points to predict the next data points, e.g., sales for product X in the last rolling 14 days.

Do you want to know more about Data Transformations?

To learn more about BigML’s upcoming release, please join our free, live webinar on Thursday, October 25, 2018, at 10:00 AM PDT / 07:00 PM CEST. Register today as space is limited! Stay tuned for our next 6 blog posts that will present step by step how to transform your data with the BigML platform.

Machine Learning School in Seville, Spain: First Edition!

Seville, the capital of Andalusia (the southern region of Spain), is a place known for its beauty, charming people, and immense cultural heritage. Now, the BigML Team intends to spread the word and promote the adoption of Machine Learning among their citizens, organizations, companies, and academic institutions so the region can also become a more attractive technology hub.

BigML, in collaboration with the EOI Business School, is launching the First Edition of our Machine Learning School in Seville, which will take place on March 7 and 8, 2019. The #MLSEV will be an introductory two-day course optimized to learn the basic Machine Learning concepts and techniques that are impacting all economic sectors. This training event is ideal for many professionals that wish to solve real-world problems by applying Machine Learning in a hands-on manner, e.g., analysts, business leaders, industry practitioners, and anyone looking to do more with fewer resources by leveraging the power of automated data-driven decision making.

Besides the basic concepts, the course will cover a selection of state of the art techniques with relevant business-oriented examples such as smart applications, real-world use cases in multiple industries, practical workshops, and much more.


EOI Andalucía, Leonardo da Vinci Street, 12. 41092. Cartuja Island, Seville, Spain. See map here.


2-day event: on March 7-8, 2019 from 8:30 AM to 6:30 PM CET.


Please complete this form to apply. After your application is processed, you will receive an invitation to purchase your ticket. We recommend that you register soon, space is limited and as per our previous editions in other locations, the event may sell out quickly.


Check out the full agenda and other details of the event here

Beyond Machine Learning

In addition to the core sessions of the course, we wish to get to know all attendees better since they will make up tomorrow’s creative forces. As such, we are organizing the following activities for you and will be sharing more details shortly:

  • Genius Bar. A useful appointment to help you with your questions regarding your business, projects, or any ideas related to Machine Learning. If you are coming to the Machine Learning School in Seville and would like to share your thoughts with the BigML Team, please book your 30-minute slot by contacting us.
  • Fun runs. We will go for a healthy and fun 30-minute run after the sessions. We will soon share the details on the meeting point and time. Stay tuned!
  • International networking. Meet the lecturers and attendees during the breaks. We expect hundreds of local business leaders and other experts coming from several regions of Spain as well as from different countries.

The BigML Team is excited to launch more editions of our Machine Learning Schools across the globe! The next one will take place one month from now, on November 4 and 5 in Doha, Qatar. To learn more about BigML’s previous Machine Learning schools, please read this blog post. Do not hesitate to contact us if you would like to co-organize a Machine Learning school with us in your city. We look forward to growing the Machine Learning Schools series!

Computer Vision Internship in Corvallis with BigML

We are pleased to introduce Dimitrios Trigkakis, a Computer Vision Intern who worked with the BigML Team for the summer of 2018. On the other side of the world from Efe Toros, BigML’s other summer intern, Dimitrios gained industry experience while applying his years of research knowledge, which he shares first-hand here.

BigML Computer Vision Internship Corvallis

Image sourced from Visit Corvallis, Oregon:

I had the opportunity of being introduced to BigML as I searched for summer internships, and I quickly realized this company would be a great fit. I joined the team as a Computer Vision Intern in Corvallis, Oregon, and this role proved to be a nice change of pace between my Master’s degree work last year, and my ongoing research for my Ph.D. at Oregon State University.


During the interview process, it became apparent that BigML has a team of intelligent and ambitious people who share a lot of the interest and motivation that originally led me to study data science and Machine Learning at the beginning of my academic career.

For the initial phases of my training, I discussed potential projects with my mentor, Dr. Charles Parker, and we developed a plan for practical and important contributions focused on Computer Vision problems, as well as strengthening the pre-existing foundations for image-based Machine Learning.

All of my projects were aimed towards developing methods for identification of the content of images. Typical Machine Learning algorithms can find patterns given the features of a dataset, but in computer vision, such features are not present and have to be constructed from lower level information (the image’s pixels). BigML can provide a platform for re-training of image-based models, with great benefits in all areas where images are involved. Some examples of tasks involving computer vision include:

  • Recognizing car plates
  • Reading subtitles in other languages
  • Identifying people in videos
  • Recognizing objects in scenes
  • Identifying medically relevant visual features in x-rays or other scans

There are many tasks where computer vision can provide excellent assistance in automating labor-intensive image identification tasks that are currently being assigned to large groups of people, costing a lot of time and money. My internship projects involved implementing or expanding the existing ingredients enabling large-scale image recognition for BigML. All of these projects are aimed towards image classification, with future potential for object detection, image segmentation, and other computer vision tasks. More specifically, my projects revolved around the following:

  • Expanding BigML’s infrastructure for employing several pre-trained convolutional neural network models, which form the basis for later fine-tuning on many computer-vision related datasets.
  • Developing a similar infrastructure for support of our models on web-pages, for an accurate and fast web inference that is user-friendly.
  • Training models on image datasets that do not revolve around fine-grained classification of object categories in photographic images (e.g. Imagenet dataset). Moving in a different direction, we trained a model for classification of images that occur in an artistic setting, where object classification is challenging.
  • We developed a deep learning model that is capable of identifying structure, clusters and visualizations for datasets without labels (unsupervised learning).

Overview of Projects

  • On the software engineering side of things, we wanted to implement pre-trained models for various architectures, each with different strengths and weaknesses. We now support five different neural network architectures, one of which (namely Mobilenet) is very lightweight and was designed in order to run on mobile phone hardware. All architectures achieve competitive performance on the Imagenet dataset, while their differences focus on the trade-off between accuracy and size/inference time.
  • As an extension of the work on the server side, I developed a similar infrastructure for classifying images online. The javascript re-implementation of the above work allows a user to submit their own network definition files, and proceed to classify images that they upload to the webpage.
  • To expand our repertoire of provided models, I trained a neural network for classification on the BAM (Behance Artistic Media) dataset, an artistic dataset with labels for three categories: content, artistic medium and emotion. The network learns image features that are relevant for correctly predicting not only the content of an artistic image but also the emotion and style that the artwork represents. The learned network features can be reused by only training the predictive part of the network, enabling re-training for a new dataset that may not contain real-life photographic material. We did notice a respectable performance gap between training the entire network and re-training only the last layers of the network (from image features to prediction). Both networks were pre-trained on Imagenet.

    BAM dataset: examples for content category ‘dog’


    The prediction task includes three categories (content, medium and emotion)

  • I developed a variational autoencoder architecture for unsupervised learning on simple feature datasets. Datasets like Iris or text-based datasets contain patterns that a neural network is able to extract, given the task of reproducing its input at its output. Using the feature vector that unlabelled images are assigned to when the network attempts to reconstruct them is a way to reduce the dimensionality of the input without losing a lot of information about the content of the input data. By using the t-SNE algorithm, we can further reduce the dimensionality of the input data to two dimensions for easy visualization. Finally, k-means clustering can identify membership in classes, grouping the input data together and giving hints about the regularities in the dataset, without requiring any labelled examples.

Inspection of unsupervised categories gives insight into the regularities in the input data
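
As an illustration of the last step in that pipeline, the following is a generic, minimal k-means (Lloyd's algorithm) in plain Python. It is not the code used during the internship: the 2-D points are invented stand-ins for points that have already been reduced to two dimensions, and using the first `k` points as initial centroids is a simplification chosen to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Group points into k clusters; returns (assignments, centroids)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep empty clusters in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return assignments, centroids


# Two visibly separated groups of unlabelled 2-D points.
points = [
    (0.0, 0.1), (0.2, 0.0), (5.0, 5.1),
    (5.2, 4.9), (0.1, 0.2), (4.9, 5.0),
]
assignments, centroids = kmeans(points, k=2)
print(assignments)
print(centroids)
```

No labels are supplied anywhere: the grouping comes only from the distances between points, which is what makes the inspection of the resulting clusters informative about regularities in the data.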

I have been very grateful for this opportunity to work with BigML, as it was a great culture fit, full of vibrant people who guided me through my first contact with the industry. Applying my knowledge to real-world problems was very satisfying, and I learned a lot about communication skills, software development, collaboration, and I have gained confidence in myself and in my future. All in all, BigML has provided a great experience, by people who work very hard to make approachable and intuitive Machine Learning a reality.

Interested in a BigML Internship?

More internship positions will be available at BigML in 2019. Keep an eye on the BigML Internship page and feel free to contact us with any questions or project ideas. We look forward to hearing from you!

Breaking Records at the 4th Valencian Summer School in Machine Learning

Starting a brand new week and still feeling the excitement of VSSML18’s success! 220 attendees arriving from 18 countries did not want to miss BigML’s annual event to learn the latest about Machine Learning. In this edition, we have hosted 178 attendees from 101 companies and 42 attendees representing 26 academic institutions.

The 4th edition of our Machine Learning school took place in one of Valencia’s most prestigious buildings: La NAU Cultural Center. Hundreds of Machine Learning practitioners, decision makers, and developers from Andorra, Austria, Belgium, Canada, Denmark, France, Germany, Iceland, India, Italy, Netherlands, Portugal, the Russian Federation, Spain, Switzerland, Turkey, United Kingdom, and the United States were in attendance. All of them had something in common: they want to adopt Machine Learning in their organizations to remain competitive in the market.

The VSSML18 offered 13 master classes to introduce the basic Machine Learning concepts and techniques, 5 practical workshops for hands-on practice of the lessons learned, and 5 case studies to understand how several companies are currently applying Machine Learning. All this content was divided into two parallel sessions, one for each attendee profile: business and developers.

In addition to the scheduled lectures, the attendees of this two-day event had the chance to discuss their Machine Learning projects and ideas with the BigML Team members at the Genius Bar, and expanded their network by meeting international professionals. They also met some of BigML’s partners who presented use cases and shared some of their projects at booths, such as Barrabés.Biz evaluating ICOs, working on a predictive maintenance problem, Bankable Frontier Associates using Machine Learning for social good, and Talento Corporativo spreading the Machine Learning word in the north of Spain. But not everything was hard work: the VSSML18 attendees could also enjoy two energetic morning runs to start the days and some drinks at the end of both days.

To conclude the fourth edition of the Machine Learning school held in Valencia, the BigML Team can only thank all attendees who participated in this unforgettable event, as well as the co-organizers VIT Emprende, València Activa, and Ajuntament de València for helping us make it happen. Thank you all for the great feedback! And please check out the #VSSML18 photo albums on Facebook and Google+ to see the event’s highlights.

For those who could not make it to the VSSML18, we hope to see you at the MLSD18 in Doha, Qatar, on November 4 and 5, where we will be holding the inaugural edition of our Machine Learning School series in the Middle East region!

Machine Learning School in Doha, Qatar: Launching the First Edition!

BigML and the Qatar Computing Research Institute (QCRI), part of Hamad Bin Khalifa University, are excited to announce the first edition of our Machine Learning School in Doha, Qatar! The MLSD18 will take place on November 4-5 at the Qatar National Convention Centre, and it will be the first Machine Learning course that BigML and QCRI are organizing together in the Middle East. This event will be one of the first activities of the new center established by QCRI, the Qatar Center for Artificial Intelligence (QCAI).

No prior Machine Learning knowledge is required to attend the MLSD18 sessions. Attendees from all backgrounds will be welcome at this two-day event, from business leaders, industry practitioners, and developers, to graduate students, as well as advanced undergraduates, seeking a quick, practical, and hands-on introduction to Machine Learning to solve real-world problems. The MLSD18 is an ideal event to learn the basic as well as more advanced Machine Learning concepts and techniques that you will need to master to take your business or project to the next level.

The course will present an introduction to what Machine Learning is, where we are, and where we are going. It will also cover the main techniques of classification, regression, time series, clustering, anomaly detection, association discovery, and topic models. You will also be able to put all the concepts learned into practice with interactive exercises and use cases. Additionally, more technical topics will be addressed in detail, such as data transformations, feature engineering, how to work programmatically using the API and bindings, how to automate Machine Learning workflows via WhizzML, and more!


Qatar National Convention Centre, Room 104: Al Luqta St, Ar Rayyan, Education City. Doha, Qatar. See map here.


2-day event: on November 4-5, 2018 from 08:00 AM to 07:30 PM AST.


Please complete this form to apply. After your application is processed, you will receive an invitation to purchase your ticket. We recommend that you register soon; space is limited and as per our previous editions in other locations, the event may sell out quickly.


You can check out the full agenda and other details of the event here.

Additional Activities

In order to feel the full Machine Learning experience, the BigML Team and QCRI have additional activities set for you:

  • The Genius Bar. We are looking forward to helping you solve your questions regarding your business, projects, or any ideas related to Machine Learning. Feel free to book your 30-minute private session by contacting us at
  • The morning runs. We will go for a healthy and fun 30-minute run before the event starts. The meeting point on Sunday 4 and Monday 5 will be announced shortly.
  • Get to know the lecturers and other attendees during the networking breaks. We expect hundreds of locals as well as Machine Learning practitioners and experts coming from other Middle East countries, Asia, and more!

The BigML Team is very happy to continue growing our Machine Learning Schools, and we are looking forward to celebrating many editions in Doha together with QCRI. Please read this blog post to learn more about BigML’s previous Machine Learning schools.

Machine Learning Internship Abroad with BigML

We are pleased to introduce Efe Toros, who joined the BigML Team this summer as a Data Science Intern. This post shares his experience as a BigML Intern and how he contributed to help us make Machine Learning beautifully simple for everyone. We’ll let Efe take it from here…

Machine Learning Internship Abroad in Valencia

I had the amazing opportunity to work in Valencia this summer as a Data Science Intern for BigML. I am going into my senior year at the University of California, Berkeley, studying Data Science, and I can say that my experience at BigML was exactly what I needed to motivate me to finish my last stretch of school. It was a great environment where I got to apply the skills I learned in the classroom in the real world.

Efe Toros BigML Intern

During my time at BigML, I got to experience both freedom and guidance while conducting my work. When I first arrived, my mentor and I laid out an informal roadmap of the tasks and goals for my internship. My main job was to create technical use cases for various industries that would show the benefits of using Machine Learning, specifically what BigML has to offer, in helping businesses solve problems, increase efficiency, or improve processes.

All my projects took the format of descriptive Jupyter notebooks that explained the workflow of processing data, constructing additional preparatory code, and most importantly using BigML’s API and Python bindings to create the core of the predictions. You can find all of the notebooks in this GitHub repository of BigML use cases and a summary of each use case below. My projects involved five different use cases (including demos on the BigML Dashboard and Jupyter notebooks):

Overview of Projects

Predicting home loan defaults and customer transaction value:

The first two projects were sourced from Kaggle competitions using datasets provided by Santander Bank.

  • For the first project, I organized and cleaned multiple datasets of bank data in order to create a predictive model that would identify whether an individual would default on their home loan, as shown in the BigML Dashboard tutorial video below.
  • The second project involved analyzing high-dimensional customer data to create a model that would predict a customer’s transaction value. Most of the work went into organizing and summarizing the high-dimensional data so that the trained model would perform better.

Building a movie recommender:

  • This project involved using two of BigML’s unsupervised learning algorithms, clustering and topic modeling, to create recommendation systems. The work is separated into two notebooks, one per algorithm, where each algorithm derives information from a dataset and represents each instance in a higher-dimensional space. This enables a better search for similarities. For instance, one of my recommender systems used BigML’s topic modeling algorithm to run a batch topic distribution on the training data. BigML’s topic models can make predictions for unseen data, but if they are applied to the data that the topic model was trained on, you can derive more information from the text fields. With this new batch topic distribution, every instance that was previously composed of just a movie title and description now had additional numerical fields that broke apart the text, almost like DNA (see image below). This allowed the movies to be placed in a high-dimensional space where I compared movies and found similarities with a distance function.
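The similarity search over topic distributions described above can be sketched in a few lines of plain Python. This is only an illustration: the post does not say which distance function was used, so cosine similarity is an assumption here, and the movie titles and topic probabilities are made up rather than taken from the notebooks.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no direction to compare against
    return dot / (norm_a * norm_b)


def most_similar(title, topic_vectors, top_n=2):
    """Return the top_n (title, similarity) pairs closest to the given movie."""
    target = topic_vectors[title]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in topic_vectors.items()
        if other != title
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical topic distributions: one probability per topic for each movie.
topic_vectors = {
    "Space Saga": [0.70, 0.20, 0.10],
    "Galaxy Quest II": [0.65, 0.25, 0.10],
    "Love in Paris": [0.05, 0.15, 0.80],
    "Robot Uprising": [0.60, 0.35, 0.05],
}

recommendations = most_similar("Space Saga", topic_vectors)
for other_title, score in recommendations:
    print(f"{other_title}: {score:.3f}")
```

In a real workflow, the vectors would be the numerical fields produced by the batch topic distribution rather than hand-written lists.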

Topic Modeling use case

Predicting flight delays:

  • The fourth project combined two datasets that were sourced from the Bureau of Public Transportation and American weather data. The main objective was to accurately identify airplanes that were likely to be delayed. The transportation data alone was not sufficient for a good model, as it had many repetitive and uninformative fields; therefore, the weather dataset was engineered so that it could be joined with the transportation data, leading to a more accurate model. Two models were created: one that labeled whether a flight would be delayed upon takeoff, and one that labeled delayed flights before takeoff.
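As a rough illustration of the join described above, the sketch below attaches weather readings to flight rows that share an airport and a date. The field names (`origin`, `airport`, `date`, `precip_mm`, `wind_kph`) and the sample rows are hypothetical, and the choice of join key is an assumption, since the post does not describe the actual schema.

```python
def join_flights_with_weather(flights, weather):
    """Left join: keep every flight, adding weather fields when a match exists."""
    # Index weather rows by the (airport, date) pair they describe.
    weather_by_key = {(row["airport"], row["date"]): row for row in weather}

    joined = []
    for flight in flights:
        match = weather_by_key.get((flight["origin"], flight["date"]), {})
        row = dict(flight)  # copy so the input rows are not modified
        row["precip_mm"] = match.get("precip_mm")  # None when no weather row matches
        row["wind_kph"] = match.get("wind_kph")
        joined.append(row)
    return joined


# Hypothetical sample rows.
flights = [
    {"flight": "AA10", "origin": "JFK", "date": "2018-01-04", "delayed": True},
    {"flight": "UA22", "origin": "SFO", "date": "2018-01-04", "delayed": False},
]
weather = [
    {"airport": "JFK", "date": "2018-01-04", "precip_mm": 31.0, "wind_kph": 60.0},
]

joined = join_flights_with_weather(flights, weather)
```

A left join keeps flights that have no matching weather row, so missing weather shows up as missing values instead of silently dropping instances.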

Predicting engine failure:

  • The final project focused on sensor data that NASA had generated to analyze engine failures. After creating a target field with the remaining useful life of an engine, two models were trained. The first model focused on predicting the exact number of cycles an engine had left before failing. This model performed fairly well but had room for improvement, since it had trouble predicting the cycles for engines that were far from failing. Therefore, the second model focused on predicting whether an engine would fail within the next 30 cycles, and this model performed very well since it did an excellent job of identifying engines that had started to show signs of malfunction. The idea for this project could be generalized to any sensor data, such as data from machinery used in production lines.
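The two target fields described above can be sketched as follows. This is a minimal illustration that assumes every engine in the data was run until failure, so its last recorded cycle marks the failure point; the field names and sample readings are hypothetical.

```python
def add_rul_targets(readings, horizon=30):
    """Add a remaining-useful-life field and a 'fails within horizon' label."""
    # Assumption: the last recorded cycle of each engine is its failure cycle.
    last_cycle = {}
    for reading in readings:
        engine = reading["engine_id"]
        last_cycle[engine] = max(last_cycle.get(engine, 0), reading["cycle"])

    labeled = []
    for reading in readings:
        rul = last_cycle[reading["engine_id"]] - reading["cycle"]
        row = dict(reading)  # copy so the input rows are not modified
        row["rul"] = rul  # regression target: cycles left before failure
        row["fails_within_horizon"] = rul <= horizon  # classification target
        labeled.append(row)
    return labeled


# Hypothetical readings; sensor fields are omitted for brevity.
readings = [
    {"engine_id": 1, "cycle": 10},
    {"engine_id": 1, "cycle": 20},
    {"engine_id": 1, "cycle": 50},
    {"engine_id": 2, "cycle": 5},
    {"engine_id": 2, "cycle": 100},
]

labeled = add_rul_targets(readings)
```

The numeric `rul` field corresponds to the first model's target, and the boolean field corresponds to the second model's 30-cycle question.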

Not only did BigML give me the opportunity to experience a new culture and meet great people, but I also developed a solid foundation in Machine Learning to pursue a data-driven career. I know these new skills will give me a valuable perspective for approaching problems that exist all around us.

BigML Intern Efe Toros Valencia Spain

Efe Toros, a Data Science Intern at BigML in San Sebastián, Spain during the summer of 2018.

Ready to apply to become the next BigML Intern?

BigML will have more internship opportunities in 2019, so we encourage you to check out the BigML Internship page and contact us with any inquiries at

Building Information Modeling (BIM): Machine Learning for the Construction Industry

This guest post was originally authored by David Martínez, CEO at Ibim Building Twice S.L. and Pedro Núñez, R&D&I Manager at Ibim.

Building Information Modeling (BIM) is revolutionizing the construction industry. Unlike the data generated by computer-aided design (CAD), which represent flat shapes or volumes and 2D drawings consisting of lines, BIM data represent the reality of the built structure. This new way of digitizing the real world is superior in operational terms, and the structure of its data is ideal for analytical purposes and the application of Machine Learning techniques.

BigML enables BIM consultancies, Project Management Offices (PMO), construction companies, and developers to apply Machine Learning to BIM (even experimentally). Its user-friendly platform makes modeling possible without any in-depth knowledge of Machine Learning and enables previously unimaginable automated processes and knowledge.

BIM model example.

Building Information Modeling uses data organized in a similar way to a database to create digital representations of real-life structures. BIM includes the geometry of the building, its spatial relationships and geographic information, and also the quantities and properties of its components. This information can be used to generate drawings and schedules that express the data in different ways.

BIM model example.

The possibilities of applying Machine Learning techniques to BIM are countless. Classification algorithms, anomaly detection, and even time series analysis can be used with BIM. It is worth mentioning that BIM data are used throughout the lifespan of a building (i.e., during the design, construction, and maintenance phases) and can even include real-life sensor data. For example, classification algorithms can combine data from many buildings, such as the characteristics and location of the flats, to predict how well they might sell, or even the likelihood of construction delays. Anomaly detection, in turn, is very useful to pinpoint modeling errors, and time series analysis can be applied to real-time data to make better maintenance predictions.

More specifically, Ibim Building Twice S.L. has conducted research into how the use of a room in a flat can be predicted based on its geometry and other BIM data. The findings are so remarkable that the company has decided to publish them as a contribution to the digitization of the construction industry. The different types of rooms in BIM are usually labeled entirely by hand by the expert modeler. The use of Machine Learning algorithms to automate this type of task could reduce the necessary time and outlay considerably. The experiment was based on data about residential buildings in BIM generated with Autodesk Revit®. The data about the rooms in the flats were extracted and re-processed using data schedules plus C# programming with the Revit API.

Model of flat using Revit. Left: names of rooms suggested by the logistic regression algorithm. Right: final names assigned by additional programming.

The extracted data were used as source data in BigML, which we first explored with dynamic scatterplots:

Graphs of rooms according to area of rooms or housing unit and hierarchy / quadrature.

Later on, we created several structured datasets for training decision trees, logistic regressions, and deepnets, all of which are classification algorithms.

BigML makes it possible to measure the performance of each model easily. Although all three algorithms were used to solve the same problem (i.e., labelling rooms according to their function on the basis of their geometry and other data), the accuracy and suitability of the algorithms may vary considerably depending on the problem at hand, so it is advisable to evaluate them all in order to determine which one yields the best predictions.
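The idea of evaluating every candidate on the same held-out data can be sketched generically. The example below is not the experiment's code: the two "models" are toy stand-in functions rather than trained decision trees, logistic regressions, or deepnets, and the room areas and labels are invented for illustration.

```python
def accuracy(predict, rows):
    """Fraction of (features, label) rows that the model labels correctly."""
    correct = sum(1 for features, label in rows if predict(features) == label)
    return correct / len(rows)


def rank_models(models, holdout):
    """Score every model on the same holdout rows, best first."""
    scores = {name: accuracy(predict, holdout) for name, predict in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy stand-ins for trained classifiers (hypothetical rules).
def area_thresholds(features):
    if features["area_m2"] < 6:
        return "bathroom"
    if features["area_m2"] < 18:
        return "bedroom"
    return "living room"


def always_bedroom(features):
    return "bedroom"


# Invented holdout rows: (features, true room label).
holdout = [
    ({"area_m2": 4.0}, "bathroom"),
    ({"area_m2": 12.0}, "bedroom"),
    ({"area_m2": 25.0}, "living room"),
    ({"area_m2": 9.0}, "bedroom"),
]

ranking = rank_models(
    {"area_thresholds": area_thresholds, "always_bedroom": always_bedroom},
    holdout,
)
```

Scoring all candidates against the same rows, ideally drawn from buildings that were not used for training, is what makes the accuracies comparable.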

In our experiment, the top models were about 90% accurate in predicting room use. Those were evaluated against data obtained from different architects and buildings, suggesting quite a promising technique for use in production. The findings of the study were presented at the EUBIM 2018 congress held in Valencia, on May 17-19, 2018. For more details, please watch the video of the presentation and check the corresponding slideshow and original article in English and Spanish that include full details of the experiment.

The 4th Valencian Summer School in Machine Learning is Open for Enrollment

We are excited about our upcoming Summer School in Machine Learning 2018, the fourth edition of this international event. Hundreds of decision makers, industry practitioners, developers, and curious minds will delve into the key Machine Learning concepts and techniques they need to master to join the data revolution. All of this will take place on September 13-14 in a great location, La NAU Cultural Center, one of the most beautiful and historic buildings of the University of Valencia.

The VSSML18 aims to cover a wide spectrum of needs as BigML’s main focus is to make Machine Learning beautiful and simple for everyone. Regardless of your prior Machine Learning experience, with this two-day course you will be able to:

  • Learn the foundational ideas behind Machine Learning theory with Master Classes that emphasize putting them into practice in your business or project.
  • Choose your preferred option between two parallel sessions: Machine Learning for business users or for developers. With these options, the VSSML18 can serve a diverse audience while providing customized content. Check out the full schedule for more details.
  • Practice the concepts learned during the course on the BigML platform via practical workshops. We recommend that you bring your laptop to create your own Machine Learning projects and start applying Machine Learning best practices to find valuable insights in your data. Only a browser is required.
  • Understand how Machine Learning is currently being applied in several industries with real-world use cases. To provide a complete curriculum, in addition to the theoretical and hands-on part, it’s important to find out how real companies are benefitting from Machine Learning. This year we see how Barrabés.Biz uses BigML to evaluate ICOs, or how works on a predictive maintenance problem, or how Bankable Frontier Associates uses Machine Learning for social good, among other use cases.
  • Discuss your project ideas with the BigML Team members at the Genius Bar. We are happy to help you with your detailed questions about your business or projects. You can contact us ahead of time at to book your 30-minute slot with a designated BigML expert.
  • Enhance your business network. International networking is the intangible benefit at the VSSML18. Join the multinational audience representing 13 countries so far, including Spain, Portugal, Italy, Germany, Austria, Belgium, Netherlands, United Kingdom, Russian Federation, Turkey, India, United States, and Canada.
  • Stay fit during the event with our morning runs! Before the event starts we will go for a 30-minute morning run along the Turia Gardens, one of the largest urban parks in Spain. The meeting point on Thursday 13 and Friday 14 will be at the main entrance of the venue, La Nau Cultural Center, at 06:30 AM CEST. We are counting on you to join!

As preparations are being wrapped up, please check the VSSML18 page for more details on the hotels we recommend for your stay in Valencia, in case you come from outside the city. APPLY TODAY, and reserve one of our spots before we reach full capacity!
