
Computer Vision Internship in Corvallis with BigML

We are pleased to introduce Dimitrios Trigkakis, a Computer Vision Intern who worked with the BigML Team for the summer of 2018. On the other side of the world from Efe Toros, BigML’s other summer intern, Dimitrios gained industry experience while applying his years of research knowledge, which he shares first-hand here.

BigML Computer Vision Internship Corvallis

Image sourced from Visit Corvallis, Oregon: https://youtu.be/2sC9np3f1kc

I had the opportunity of being introduced to BigML as I searched for summer internships, and I quickly realized this company would be a great fit. I joined the team as a Computer Vision Intern in Corvallis, Oregon, and this role proved to be a nice change of pace between my Master’s degree work last year, and my ongoing research for my Ph.D. at Oregon State University.

BigML Intern Dimitrios Trigkakis

During the interview process, it became apparent that BigML has a team of intelligent and ambitious people who share a lot of the interest and motivation that originally led me to study data science and Machine Learning at the beginning of my academic career.

For the initial phases of my training, I discussed potential projects with my mentor, Dr. Charles Parker, and we developed a plan for practical and important contributions focused on Computer Vision problems, as well as strengthening the pre-existing foundations for image-based Machine Learning.

All of my projects were aimed towards developing methods for identification of the content of images. Typical Machine Learning algorithms can find patterns given the features of a dataset, but in computer vision, such features are not present and have to be constructed from lower level information (the image’s pixels). BigML can provide a platform for re-training of image-based models, with great benefits in all areas where images are involved. Some examples of tasks involving computer vision include:

  • Recognizing car plates
  • Reading subtitles in other languages
  • Identifying people in videos
  • Recognizing objects in scenes
  • Identifying medically relevant visual features in x-rays or other scans

Computer vision can provide excellent assistance by automating labor-intensive image identification work that is currently assigned to large groups of people, costing a lot of time and money. My internship projects involved implementing or expanding the existing ingredients that enable large-scale image recognition for BigML. All of these projects are aimed at image classification, with future potential for object detection, image segmentation, and other computer vision tasks. More specifically, my projects revolved around the following:

  • Expanding BigML’s infrastructure for employing several pre-trained convolutional neural network models, which form the basis for later fine-tuning on many computer-vision related datasets.
  • Developing a similar infrastructure to support our models on web pages, enabling accurate, fast, and user-friendly inference in the browser.
  • Training models on image datasets that do not revolve around fine-grained classification of object categories in photographic images (e.g., the Imagenet dataset). Moving in a different direction, we trained a model for classification of images in an artistic setting, where object classification is challenging.
  • Developing a deep learning model capable of identifying structure, clusters, and visualizations for datasets without labels (unsupervised learning).

Overview of Projects

  • On the software engineering side of things, we wanted to implement pre-trained models for various architectures, each with different strengths and weaknesses. We now support five different neural network architectures, one of which (namely Mobilenet) is very lightweight and was designed in order to run on mobile phone hardware. All architectures achieve competitive performance on the Imagenet dataset, while their differences focus on the trade-off between accuracy and size/inference time.
  • As an extension of the work on the server side, I developed a similar infrastructure for classifying images online. The JavaScript re-implementation of the above work allows users to submit their own network definition files and then classify images that they upload to the web page.
  • To expand our repertoire of provided models, I trained a neural network for classification on BAM (the Behance Artistic Media dataset), an artistic dataset with labels for three categories: content, artistic medium, and emotion. The network learns image features that are relevant for correctly predicting not only the content of an artistic image but also the emotion and style that the artwork represents. The learned network features can be reused by training only the predictive part of the network, enabling re-training for a new dataset that may not contain real-life photographic material. We did notice a respectable performance gap between training the entire network and re-training only the last layers of the network (from image features to prediction). Both networks were pre-trained on Imagenet.
    bam_samples

    BAM dataset: examples for content category ‘dog’

    bird

    The prediction task includes three categories (content, medium and emotion)

  • I developed a variational autoencoder architecture for unsupervised learning on simple feature datasets. Datasets like Iris or text-based datasets contain patterns that a neural network is able to extract, given the task of reproducing its input at its output. The feature vector that unlabelled examples are mapped to when the network attempts to reconstruct them reduces the dimensionality of the input without losing much information about its content. By using the t-SNE algorithm, we can further reduce the dimensionality of the input data to two dimensions for easy visualization. Finally, k-means clustering can identify class membership, grouping the input data together and giving hints about the regularities in the dataset without requiring any labelled examples (a minimal sketch of this pipeline appears after the figure below).
Kmeans

Inspection of unsupervised categories gives insight into the regularities in the input data
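
For readers who want to see the shape of that pipeline, here is a minimal sketch using scikit-learn. It assumes you already have low-dimensional feature vectors (for example, the encoder outputs of the variational autoencoder); the random placeholder data and the cluster count are assumptions for illustration only.

# Python - sketch of the t-SNE + k-means visualization step (placeholder data)
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
latent_vectors = rng.normal(size=(500, 16))  # stand-in for the encoder's feature vectors

# Reduce the latent features to two dimensions for easy visualization.
embedding_2d = TSNE(n_components=2, random_state=0).fit_transform(latent_vectors)

# Group the unlabelled points; the number of clusters (5) is a tunable assumption.
cluster_ids = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(embedding_2d)

# Inspecting a few members of each cluster hints at the regularities in the data.
for c in range(5):
    members = np.where(cluster_ids == c)[0][:3]
    print(f"cluster {c}: sample indices {members.tolist()}")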

I am very grateful for this opportunity to work with BigML, as it was a great culture fit, full of vibrant people who guided me through my first contact with industry. Applying my knowledge to real-world problems was very satisfying; I learned a lot about communication, software development, and collaboration, and I gained confidence in myself and in my future. All in all, BigML provided a great experience, built by people who work very hard to make approachable and intuitive Machine Learning a reality.

Interested in a BigML Internship?

More internship positions will be available at BigML in 2019. Keep an eye on the BigML Internship page and feel free to contact us at internships@bigml.com with any questions or project ideas. We look forward to hearing from you!

Breaking Records at the 4th Valencian Summer School in Machine Learning

Starting a brand new week and still feeling the excitement of VSSML18’s success! 220 attendees arriving from 18 countries did not want to miss BigML’s annual event to learn the latest about Machine Learning. In this edition, we have hosted 178 attendees from 101 companies and 42 attendees representing 26 academic institutions.

The 4th edition of our Machine Learning school took place in one of Valencia’s most prestigious buildings: La NAU Cultural Center. Hundreds of Machine Learning practitioners, decision makers, and developers from Andorra, Austria, Belgium, Canada, Denmark, France, Germany, Iceland, India, Italy, Netherlands, Portugal, the Russian Federation, Spain, Switzerland, Turkey, United Kingdom, and the United States were in attendance. All of them had something in common: they want to adopt Machine Learning in their organizations to remain competitive in the market.


The VSSML18 offered 13 master classes to introduce the basic Machine Learning concepts and techniques, 5 practical workshops for hands-on practice of the lessons learned, and 5 case studies to understand how several companies are currently applying Machine Learning, with all this content divided into two parallel sessions for the two attendee profiles: business and developers.

In addition to the scheduled lectures, the attendees of this two-day event had the chance to discuss their Machine Learning projects and ideas with the BigML Team members at the Genius Bar, and to expand their networks by meeting international professionals. They also met some of BigML’s partners who presented use cases and shared some of their projects at booths, such as Barrabés.Biz evaluating ICOs, CleverData.io working on a predictive maintenance problem, Bankable Frontier Associates using Machine Learning for social good, and Talento Corporativo spreading the Machine Learning word in the north of Spain. But not everything was hard work: the VSSML18 attendees could also enjoy two energetic morning runs to start the days and some drinks at the end of both days.

To conclude the fourth edition of the Machine Learning school held in Valencia, the BigML Team can only thank all attendees who participated in this unforgettable event, as well as the co-organizers VIT Emprende, València Activa, and Ajuntament de València for helping us make it happen. Thank you all for the great feedback! And please check out the #VSSML18 photo albums on Facebook and Google+  to see the event’s highlights.

For those who could not make it to the VSSML18, we hope to see you at the MLSD18 in Doha, Qatar, on November 4 and 5, where we will be holding the inaugural edition of our Machine Learning School series in the Middle East region!

Machine Learning School in Doha, Qatar: Launching the First Edition!

BigML and the Qatar Computing Research Institute (QCRI), part of Hamad Bin Khalifa University, are excited to announce the first edition of our Machine Learning School in Doha, Qatar! The MLSD18 will take place on November 4-5 at the Qatar National Convention Centre, and it will be the first Machine Learning course that BigML and QCRI are organizing together in the Middle East. This event will be one of the first activities of the new center established by QCRI, the Qatar Center for Artificial Intelligence (QCAI).

No prior Machine Learning knowledge is required to attend the MLSD18 sessions. Attendees from all backgrounds will be welcome at this two-day event, from business leaders, industry practitioners, and developers, to graduate students, as well as advanced undergraduates, seeking a quick, practical, and hands-on introduction to Machine Learning to solve real-world problems. The MLSD18 is an ideal event to learn the basic as well as more advanced Machine Learning concepts and techniques that you will need to master to take your business or project to the next level.

The course will present an introduction to what Machine Learning is, where we are, and where we are going. It will also cover the main techniques of classification, regression, time series, clustering, anomaly detection, association discovery, and topic models. You will also be able to put into practice all the concepts learned with interactive exercises and use cases. Additionally, more technical topics will be addressed in detail, such as data transformations, feature engineering, how to work programmatically using the API and bindings, how to automate Machine Learning workflows via WhizzML, and more!

Where

Qatar National Convention Centre, Room 104: Al Luqta St, Ar Rayyan, Education City. Doha, Qatar. See map here.

When

2-day event: on November 4-5, 2018 from 08:00 AM to 07:30 PM AST.

Applications

Please complete this form to apply. After your application is processed, you will receive an invitation to purchase your ticket. We recommend that you register soon; space is limited and as per our previous editions in other locations, the event may sell out quickly.

Schedule

You can check out the full agenda and other details of the event here.

Additional Activities

To give you the full Machine Learning experience, the BigML Team and QCRI have additional activities set up for you:

  • The Genius Bar. We are looking forward to helping you solve your questions regarding your business, projects, or any ideas related to Machine Learning. Feel free to book your 30-minute private session by contacting us at mlsd18@bigml.com.
  • The morning runs. We will go for a healthy and fun 30-minute run before the event starts. The meeting point on Sunday 4 and Monday 5 will be announced shortly.
  • Get to know the lecturers and other attendees during the networking breaks. We expect hundreds of locals as well as Machine Learning practitioners and experts coming from other Middle East countries, Asia, and more!

The BigML Team is very happy to continue growing our Machine Learning Schools, and we are looking forward to celebrating many editions in Doha together with QCRI. Please read this blog post to know more about BigML’s previous Machine Learning schools.

Machine Learning Internship Abroad with BigML

We are pleased to introduce Efe Toros, who joined the BigML Team this summer as a Data Science Intern. This post shares his experience as a BigML Intern and how he contributed to help us make Machine Learning beautifully simple for everyone. We’ll let Efe take it from here…

Machine Learning Internship Abroad in Valencia

I had the amazing opportunity to work in Valencia this Summer as a Data Science Intern for BigML. I am going into my senior year at the University of California Berkeley studying Data Science, and I can say that my experience at BigML was exactly what I needed to motivate me to finish my last stretch of school. It was a great environment where I got to apply the skills I learned in the classroom into the real world.

Efe Toros BigML Intern

During my time at BigML, I got to experience both freedom and guidance while conducting my work. When I first arrived, my mentor and I laid out an informal roadmap of the tasks and goals for my internship. My main job was to create technical use cases for various industries that would show the benefits of using Machine Learning, specifically what BigML has to offer, in helping businesses solve problems, increase efficiency, or improve processes.

All my projects took the format of descriptive Jupyter notebooks that explained the workflow of processing data, constructing additional preparatory code, and most importantly using BigML’s API and Python bindings to create the core of the predictions. You can find all of the notebooks in this Github repository of BigML use cases and a summary of each use case below. My projects involved five different use cases (including demos on the BigML Dashboard and Jupyter notebooks): 

Overview of Projects

Predicting home loan defaults and customer transaction value:

The first two projects were sourced from Kaggle competitions using datasets provided by Santander Bank.

  • For the first project, I organized and cleaned multiple datasets of bank data in order to create a predictive model that would identify if an individual would default on their home loan, as shown in the BigML Dashboard tutorial video below. 
  • The second project involved analyzing high-dimensional customer data to create a model that would identify a customer’s transaction value. Most of the work involved organizing and summarizing the high-dimensional data to improve the performance of the trained model.

Building a movie recommender:

  • This project involved using two of BigML’s unsupervised learning algorithms, clustering and topic modeling, to create recommendation systems. The projects are separated into two notebooks where each algorithm derives information from a dataset, organizing each instance in a higher-dimensional space, which enables a better search for similarities. For instance, one of my recommender systems used BigML’s topic modeling algorithm in a batch topic distribution on the training data. BigML’s topic models can predict on unknown data, but if used on the data that the topic model was trained on, you can derive more information from the text fields. With this new batch topic distribution, every instance that was previously composed of just a title and description of a movie now had additional numerical fields that broke apart the text, almost like a DNA (see image below). This allowed the movies to be organized in a high-dimensional plane where I compared movies and found similarities with a distance function (a small similarity sketch follows the image below).

Topic Modeling use case
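
To make the distance idea concrete, here is a minimal, self-contained sketch of comparing movies once each has a per-topic distribution attached (for instance, from a batch topic distribution). The titles and topic vectors below are made up purely for illustration.

# Python - comparing movies by their topic "DNA" (illustrative, made-up vectors)
import math

def cosine_similarity(a, b):
    # Cosine similarity between two per-topic probability vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical per-topic distributions for three movies.
movies = {
    "Movie A": [0.70, 0.10, 0.10, 0.10],
    "Movie B": [0.65, 0.15, 0.10, 0.10],
    "Movie C": [0.05, 0.10, 0.05, 0.80],
}

# Recommend the title whose topic distribution is closest to Movie A's.
query = movies["Movie A"]
scores = {title: cosine_similarity(query, vector)
          for title, vector in movies.items() if title != "Movie A"}
print(max(scores, key=scores.get), scores)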

Predicting flight delays:

  • The fourth project combined two datasets: flight records sourced from the Bureau of Public Transportation and American weather data. The main objective was to accurately identify flights that were likely to be delayed. The transportation data alone was not sufficient for a great model, as it had many repetitive and uninformative fields; therefore, engineering the weather dataset enabled joining the two datasets (sketched below), leading to a more accurate model. Two models were created: one that labeled whether a flight was delayed upon takeoff, and one that labeled delayed flights before takeoff.
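
A rough sketch of that join is below; the file and column names are assumptions chosen for illustration, not the actual field names in the source datasets.

# Python - joining flight records with weather observations (illustrative column names)
import pandas as pd

flights = pd.read_csv("flights.csv")   # assumed columns: origin_airport, flight_date, departure_delay_minutes, ...
weather = pd.read_csv("weather.csv")   # assumed columns: airport, date, precipitation, wind_speed, ...

# Join each flight to the weather observed at its origin airport on that date.
merged = flights.merge(
    weather,
    left_on=["origin_airport", "flight_date"],
    right_on=["airport", "date"],
    how="left",
)

# Label a flight as delayed if it left more than 15 minutes late (the threshold is an assumption).
merged["delayed"] = (merged["departure_delay_minutes"] > 15).astype(int)
merged.to_csv("flights_with_weather.csv", index=False)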

Predicting engine failure:

  • The final project focused on sensor data that NASA had generated to analyze engine failures. After creating a target field for the remaining useful life of an engine (sketched below), two models were trained. The first model focused on predicting the exact number of cycles an engine had left before failing. This model performed fairly well but had room for improvement, since it had trouble predicting the cycles for engines that were far from failing. Therefore, the second model focused on predicting whether an engine would fail within the next 30 cycles, and it performed very well, doing an excellent job of identifying engines that had started to show signs of malfunction. The idea for this project could be generalized to any sensor data, such as machinery used in production lines.
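
Here is a minimal sketch of how the two target fields could be derived from run-to-failure data, assuming each row carries an engine identifier and a cycle number; the column names and the 30-cycle window are illustrative assumptions.

# Python - deriving remaining-useful-life targets from run-to-failure sensor data (illustrative)
import pandas as pd

df = pd.read_csv("engine_cycles.csv")  # assumed columns: engine_id, cycle, plus the sensor readings

# Remaining useful life = last observed cycle for that engine minus the current cycle.
df["max_cycle"] = df.groupby("engine_id")["cycle"].transform("max")
df["remaining_useful_life"] = df["max_cycle"] - df["cycle"]

# Regression target: the exact number of cycles left.
# Classification target: does the engine fail within the next 30 cycles?
df["fails_within_30_cycles"] = (df["remaining_useful_life"] <= 30).astype(int)

df.drop(columns=["max_cycle"]).to_csv("engine_cycles_with_targets.csv", index=False)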

Not only did BigML give me the opportunity to experience a new culture and meet great people, but I also developed a solid foundation in Machine Learning to pursue a data-driven career. I know these new skills will give me a valuable perspective to approach problems that exist all around us.

BigML Intern Efe Toros Valencia Spain

Efe Toros, a Data Science Intern at BigML in San Sebastián, Spain during the summer of 2018.

Ready to apply to become the next BigML Intern?

BigML will have more internship opportunities in 2019, so we encourage you to check out the BigML Internship page and contact us with any inquiries at internships@bigml.com.

Building Information Modeling (BIM): Machine Learning for the Construction Industry

This guest post is originally authored by David Martínez, CEO at Ibim Building Twice S.L., and Pedro Núñez, R&D&I Manager at Ibim.

Building Information Modeling (BIM) is revolutionizing the construction industry. Unlike the data generated by computer-aided design (CAD), which represent flat shapes or volumes and 2D drawings consisting of lines, BIM data represent the reality of the built structure. This new way of digitizing the real world is superior in operational terms, and the structure of its data is ideal for analytical purposes and the application of Machine Learning techniques.

BigML enables BIM consultancies, Project Management Offices (PMO), construction companies, and developers to apply Machine Learning to BIM (even experimentally). Its user-friendly platform makes modeling possible without any in-depth knowledge of Machine Learning and enables previously unimaginable automated processes and knowledge.

BIM model example.

Building Information Modeling uses data organized in a similar way to a database to create digital representations of real-life structures. BIM includes the geometry of the building, its spatial relationships and geographic information, and also the quantities and properties of its components. This information can be used to generate drawings and schedules that express the data in different ways.

BIM model example.

The possibilities of applying Machine Learning techniques to BIM are countless. Classification algorithms, anomaly detection, and even time series analysis can be used with BIM. It is worth mentioning that BIM data are used throughout the lifespan of a building (i.e., during the design, construction, and maintenance phases) and can even include real-life sensor data. For example, classification algorithms can combine data from many buildings, along with the characteristics and location of their flats, to predict how well they might sell, or even the likelihood of construction delays. Anomaly detection, on the other hand, is very useful for pinpointing modeling errors, and time series analysis can be applied to real-time data to make better maintenance predictions.

More specifically, Ibim Building Twice S.L. has conducted research into how the use of a room in a flat can be predicted based on its geometry and other BIM data. The findings are so remarkable that the company has decided to publish them as a contribution to the digitization of the construction industry. The different types of rooms in BIM are usually labeled entirely by hand by the expert modeler. The use of Machine Learning algorithms to automate this type of task could reduce the necessary time and outlay considerably. The experiment was based on data about residential buildings in BIM generated with Autodesk Revit®. The data about the rooms in the flats were extracted and re-processed using data schedules plus C# programming with the Revit API.

Model of flat using Revit. Left: names of rooms suggested by the logistic regression algorithm. Right: final names assigned by additional programming.

The extracted data were used as source data in BigML, which we first explored with dynamic scatterplots:

Graphs of rooms according to area of rooms or housing unit and hierarchy / quadrature.

Later on, we created several structured data sets for training decision trees, logistic regressions, and deepnets, all of which are classification algorithms.

BigML makes it possible to measure the performance of each model easily. Although all three algorithms were used to solve the same problem (i.e., labelling rooms according to their function on the basis of their geometry and other data), the accuracy and suitability of the algorithms may vary considerably depending on the problem in hand, so it is advisable to evaluate them all in order to determine which one yields the best predictions.
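
As an illustration of that advice, here is a hedged sketch with the BigML Python bindings that trains the three model types on a training dataset and evaluates each one against a held-out test dataset. The dataset IDs are placeholders, and the exact location of the accuracy in the evaluation result should be confirmed against the API documentation.

# Python - training and evaluating three classifiers on the same data (placeholder IDs)
from bigml.api import BigML

api = BigML()  # assumes BIGML_USERNAME and BIGML_API_KEY are set in the environment
train_dataset = "dataset/5b0000000000000000000001"  # placeholder
test_dataset = "dataset/5b0000000000000000000002"   # placeholder

candidates = [
    api.create_model(train_dataset),                # decision tree
    api.create_logistic_regression(train_dataset),  # logistic regression
    api.create_deepnet(train_dataset),              # deepnet
]

for candidate in candidates:
    api.ok(candidate)
    evaluation = api.create_evaluation(candidate, test_dataset)
    api.ok(evaluation)
    # The path to the accuracy below is an assumption; inspect the evaluation
    # resource (or compare evaluations in the Dashboard) to confirm it.
    result = evaluation["object"]["result"]["model"]
    print(candidate["resource"], result.get("accuracy"))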

In our experiment, the top models were about 90% accurate in predicting room use. Those were evaluated against data obtained from different architects and buildings, suggesting quite a promising technique for use in production. The findings of the study were presented at the EUBIM 2018 congress held in Valencia, on May 17-19, 2018. For more details, please watch the video of the presentation and check the corresponding slideshow and original article in English and Spanish that include full details of the experiment.

The 4th Valencian Summer School in Machine Learning is Open for Enrollment

We are excited about our upcoming Summer School in Machine Learning 2018, the fourth edition of this international event. Hundreds of decision makers, industry practitioners, developers, and curious minds will delve into key Machine Learning concepts and techniques they need to master to join the data revolution. All of this will take place on September 13-14 in a great location, La NAU Cultural Center, one of the most beautiful and historic buildings from the University of Valencia.

The VSSML18 aims to cover a wide spectrum of needs as BigML’s main focus is to make Machine Learning beautiful and simple for everyone. Regardless of your prior Machine Learning experience, with this two-day course you will be able to:

  • Learn the foundational ideas behind Machine Learning theory with Master Classes that emphasize putting them into practice in your business or project.
  • Choose your preferred option between two parallel sessions: Machine Learning for business users or for developers. With these options, the VSSML18 can serve a diverse audience while providing customized content. Check out the full schedule for more details.
  • Practice the concepts learned during the course with the BigML platform via practical workshops. We recommend that you bring your laptop to create your own Machine Learning projects and start applying Machine Learning best practices to find valuable insights in your data. Only a browser is required.
  • Understand how Machine Learning is currently being applied in several industries with real-world use cases. To provide a complete curriculum, in addition to the theoretical and hands-on parts, it’s important to find out how real companies are benefitting from Machine Learning. This year we will see how Barrabés.Biz uses BigML to evaluate ICOs, how CleverData.io works on a predictive maintenance problem, and how Bankable Frontier Associates uses Machine Learning for social good, among other use cases.
  • Discuss your project ideas with the BigML Team members at the Genius Bar. We are happy to help you with your detailed questions about your business or projects. You can contact us ahead of time at vssml18@bigml.com to book your 30-minute slot with a designated BigML expert.
  • Enhance your business network. International networking is the intangible benefit at the VSSML18. Join the multinational audience representing 13 countries so far, including Spain, Portugal, Italy, Germany, Austria, Belgium, Netherlands, United Kingdom, Russian Federation, Turkey, India, United States, and Canada.
  • Stay fit during the event with our morning runs! Before the event starts, we will go for a 30-minute morning run along the Turia Gardens, one of the largest urban parks in Spain. The meeting point on Thursday 13 and Friday 14 will be at the main entrance of the venue, La Nau Cultural Center, at 06:30 AM CEST. We are counting on you to join!

As preparations are being wrapped up, please check the VSSML18 page for more details on the hotels we recommend for your stay in Valencia, in case you come from outside the city. APPLY TODAY, and reserve one of our spots before we reach full capacity!

The Fusions Webinar Video is Here: Improve your Model Performance Through Algorithmic Diversity!

The new BigML release brings Fusions to our Machine Learning platform, the new modeling capability that combines multiple models to achieve better results. A Fusion combines different supervised models (models, ensembles, logistic regressions, and deepnets) and aggregates their predictions to balance out the individual weaknesses of single models.

As of yesterday, July 12, 2018, Fusions are available from the BigML Dashboard, API, and WhizzML, and they follow the same principle as ensembles, where the combination of multiple models often provides better performance than any of the individual components. All these details, along with the new and more complete text analysis options, are explained in the official launch webinar, which you can watch anytime on the BigML YouTube channel.

For further learning on Fusions and other new features, please visit our release page, where you will find:

  • The slides used during the webinar.
  • The detailed documentation to learn how to use Fusions with the BigML Dashboard and the BigML API.
  • The series of six blog posts that gradually explain Fusions to give you a detailed perspective of what’s behind this new capability. We start with an introductory post that explains the basic concepts, followed by several use cases to understand how to put Fusions to use, and then three more posts on how to use Fusions through the BigML Dashboard, API, WhizzML and Python Bindings. Finally, we complete this series with a technical view of how Fusions work behind the scenes.
  • An extra section with a blog post and documentation on the new text analysis enhancements released.

Thanks for watching the webinar, for your support, and for your nice feedback! For further queries or comments, please contact the BigML Team at support@bigml.com.

Text Analysis Enhancements: 22 Languages and More Pre-processing Options!

We’re happy to share new options to automatically analyze the text in your data. BigML has long supported text fields as inputs for your supervised and unsupervised models, pre-processing your text data in preparation for Machine Learning models. Now, these text options have been extended and new ones have been added to further streamline your text analysis and enhance your models’ performance.

  • BigML supports 15 more languages. The total number has increased from 7 to 22 languages! Now you can upload your text fields in Arabic, Catalan, Chinese, Czech, Danish, Dutch, English, Farsi/Persian, Finnish, French, German, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, or Turkish. BigML will auto-detect the language or languages (in case your dataset contains several languages in different fields) in your data. The detected language is key for the text analysis because it determines the tokenization, the stop word removal, and the stemming applied.
  • Extended stop word removal options: you can now opt to remove stop words for the detected language or for all languages. This option is very useful when you have many languages mixed in the same field; for example, social media posts or public reviews are often written in several languages. Another related enhancement lets you decide the degree of aggressiveness that you want for the stop word removal, i.e., light, normal, or aggressive. Depending on your main goal, some stop words may be useful for you, e.g., the words “yes” and “no” may be interesting since they express affirmative and negative opinions. A lower degree of aggressiveness will include some useful stop words in your model vocabulary.
  • One of the greatest new additions to the text options is n-grams! Although you could already choose bigrams before, we’ve extended it so you can now include bigrams, trigrams, four-grams, and five-grams in your text analysis. Moreover, you can also exclude unigrams from your text and make the analysis based only on higher size n-grams (see the filter for single tokens below).
  • Lastly, a number of optional filters to exclude uninteresting words and symbols from your model vocabulary have been added since specific words can introduce noise into your models depending on your context.
    • Non-dictionary words (e.g., words like “thx” or “cdm”)
    • Non-language characters (e.g., if your language is Russian, all the non-Cyrillic characters will be excluded)
    • HTML keywords (e.g., href, no_margin)
    • Numeric digits (all numbers containing digits from 0 to 9)
    • Single tokens (i.e., unigrams, only n-grams of size 2 or more will be considered if selected)
    • Specific terms (you can input any string and it will be excluded from the text)

new_text_options.png

You can set up these options by configuring two different resources: sources and/or topic models. By configuring the source and then creating a dataset, you propagate the text analysis configuration to all the models (supervised or unsupervised) that you create from that dataset. Hence, an ensemble or an anomaly detector trained with this dataset will use a vocabulary shaped by the options configured at the source level. Topic models are the only resources for which you can re-configure these text options. This is because the topic model results are greatly impacted by the text pre-processing so BigML provides a more straightforward way to iterate on such models so you don’t need to go back to the source step each time.
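
As a sketch of the source-level route with the BigML Python bindings: update the text field’s term_analysis on the source, then build the dataset (and models) from it so they inherit that configuration. The file name and field ID are placeholders, only well-established term_analysis keys are shown, and the newer options described above (stop word aggressiveness, n-grams, filters) live in the same object under the names given in the API documentation.

# Python - configuring text options at the source level (placeholders; check option names in the API docs)
from bigml.api import BigML

api = BigML()
source = api.create_source("emails.csv")  # placeholder file
api.ok(source)

# "000001" is a placeholder field ID for the text column; call api.get_source(source)
# to find the real one. The newer release options are set in this same term_analysis object.
source = api.update_source(source, {
    "fields": {
        "000001": {
            "term_analysis": {
                "enabled": True,
                "language": "en",
                "case_sensitive": False,
                "stem_words": True
            }
        }
    }
})
api.ok(source)

# The dataset, and every model built from it, inherits this text configuration.
dataset = api.create_dataset(source)
api.ok(dataset)
topic_model = api.create_topic_model(dataset)
api.ok(topic_model)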

Let’s illustrate how these new options work using the Hillary Clinton e-mails dataset from Kaggle. The main goal is to learn the different topics in these e-mails without having to read them all. For this, we will create a topic model using this dataset. We assume you know how topic models work in BigML; otherwise, please watch this video.

We’re only using two fields from the full Kaggle dataset (“ExtractedSubject” and “ExtractedBodyText”) to create the topic model. First, we create the topic model with the BigML 1-click option which uses the default configuration for all the parameters.

1-click-topic-model

When the model is created, we can inspect the different topics by using the BigML visualization for topic models. You can see that we have some relevant topics like topic 36, which is about economic issues in Asia (mostly China). But most of the topics, even if they contain relevant terms, are also mixed with lots of frequent words, numbers, and acronyms (for example “fw”, “fyi”, “dprk”, “01”, “re”, “iii”, etc.) that don’t tell us much about the real content of the e-mails.

topic-model-1-click

Let’s try to improve those results by configuring the text options offered by BigML. We can observe in our first model that there were some stop words that we don’t really care much about, such as "ok" or "yes". Therefore, we set the stop word removal to "Aggressive" this time. We also had many terms and numbers that are not telling us anything about the e-mail themes, such as "09", "iii" or "re". To exclude those terms from our analysis, we’ll use the non-dictionary words and numeric digits filters. Finally, in order to get some more context, we’ll also include bigrams, trigrams, four-grams, and five-grams in our topic model.

topic-model-configuration.png

So we create the new topic model and… voilà! In a couple of clicks, we have a much more insightful model with more meaningful topics that help us better interpret the content of the underlying e-mails.

You can see that most of the meaningless words have disappeared and the terms within the topics are much more thematically related. For example, now we have five topics that contain the word “president” and cover five different themes: European politics, the current US government, US elections, US politics, and Iranian politics. In the model we built before, minority themes like Iranian politics didn’t get a topic of their own, as they were mixed with other topics, while other more frequent (but meaningless) words had topics of their own.

topic-model-filters.png

We could clean this model even further and filter out uninteresting words like "pm", "am", "fm", etc. However, we feel satisfied enough with these topics, and we prefer to spend the time creating a new model with a totally different approach.

Sometimes, the meaning of a single word can change if you look at the terms around it. For example, "great" may be a positive word, but "not so great" indicates a bad experience. We can make this kind of analysis by using BigML n-grams while excluding unigrams from the topic model. The resulting model only includes topics that contain bigrams, trigrams, four-grams, and five-grams. All topics show well-delimited themes that may be slightly different from the topics we obtained before. For example, the topic about English politics was too broad before, as it was mixed with European politics; now it has two topics of its own.

topic-model-ngrams.png

Ok, topic models and the new text options on BigML are great, but what is the main goal of all this? We could use these topics for many purposes. For example, to analyze the most mentioned topics in Hillary Clinton e-mails by calculating per-topic distributions for each e-mail (very easy with BigML’s topic distribution feature). Moreover, you could use the per-topic distributions as inputs to build an Association and see which topics are more strongly correlated. In summary, when you have a topic model created and the per-topic distributions calculated, you can use them as inputs for any supervised or unsupervised models.
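
A hedged sketch of that last step with the Python bindings might look like the following: score the e-mails with the topic model, materialize the per-topic distributions as a new dataset, and train another model on top of it. The IDs are placeholders and the batch arguments shown ("all_fields", "output_dataset") are assumptions to double-check against the API documentation.

# Python - per-topic distributions as inputs for another model (placeholder IDs, arguments to verify)
from bigml.api import BigML

api = BigML()
topic_model = "topicmodel/5b0000000000000000000003"   # placeholder
emails_dataset = "dataset/5b0000000000000000000004"   # placeholder

# Attach a per-topic distribution to every e-mail and build a dataset from the output.
batch = api.create_batch_topic_distribution(topic_model, emails_dataset, {
    "all_fields": True,      # keep the original fields alongside the topic columns
    "output_dataset": True   # ask BigML to create a dataset with the results
})
api.ok(batch)

# The resulting dataset can feed any supervised or unsupervised model,
# for example an association discovery over the topic columns.
output_dataset = batch["object"]["output_dataset_resource"]
association = api.create_association(output_dataset)
api.ok(association)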

Thanks for reading! As usual, we look forward to hearing your feedback, comments, or questions. Feel free to send them to support@bigml.com. To learn more, please visit this page.

To Fuse or Not To Fuse Models?

The idea of model fusions is pretty simple: you combine the predictions of a bunch of separate classifiers into a single, uber-classifier prediction that is, in theory, better than the predictions of its individual constituents.

As my colleague Teresa Álverez mentioned in a previous post, however, this doesn’t typically lead to big gains in performance. We’re typically talking 5-10% improvements even in the best case. In many cases, OptiML will find something as good or better than any combination you could try by hand.

So, then, why bother? Why waste your time fiddling with combinations of models when you could spend it on doing things that will almost certainly have a more measurable impact on your model’s performance, like feature engineering or better yet, acquiring more and better data?

Part of the answer here is that looking at a number like “R squared” or “F1-score” is often an overly reductive view of performance. In the real world, how a model performs can be a lot more complex than how many answers it gets wrong and how many it gets right. For example, you, the domain expert, probably want a model that behaves appropriately with regards to the input features and also makes predictions for the right reasons. When a model “performs well”, that should mean nothing less than that the consumers of its predictions are satisfied with its behavior, not just that it gets some number of correct answers.

If you’ve got a model that has good performance numbers, but it’s exhibiting some wacky behavior or priorities, using fusions can be a good way to get equivalent (or occasionally better) performance, but with the added bonus of behavior that perhaps appears a little saner to domain experts. Here are a few examples of such cases:

“Minding The Gap” with tree-based models

There is plenty of literature out there showing that, for many datasets, ensembles of trees will end up performing as well as or better than any other algorithm. However, trees do have one unfortunate attribute that may become obvious if someone observes many of their predictions: the model takes the form of axis-aligned splits, so large regions of the input space will be assigned the same output value (resulting in a PDP that looks something like this):

treepdp

Practically, this will mean that small (or even medium-sized) changes to the input values will often result in identical predictions, unless the change crosses a split threshold, at which point it will change dramatically. This sort of stasis / discontinuity can throw many people for a loop, especially if you have a human operator working with the model’s predictions in real-time (e.g., “Important Feature X went up by 20% and the model’s prediction is still the same!”).

A solution to the problem is to fuse the ensemble with a deepnet that performs fairly well. This changes the above PDP to look more like this:

diffpdp

You’ll still see thresholds where the prediction jumps a bit, but there’s no longer complete stasis within those partitions. If the deepnet’s performance is close to the ensemble, you’ll get a model with more dynamic predictions without sacrificing accuracy.
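
With the Python bindings, fusing the two is a one-liner plus a wait; the IDs below are placeholders for your own ensemble and deepnet.

# Python - fuse an ensemble with a deepnet to smooth out stepwise tree predictions (placeholder IDs)
from bigml.api import BigML

api = BigML()
fusion = api.create_fusion([
    "ensemble/5b0000000000000000000005",
    "deepnet/5b0000000000000000000006"
])
api.ok(fusion)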

The Importance of Importance

In the summary view of the models learned when you use OptiML, you’ll see a pull-down that will give you the top five most important fields for each model.

importance

For non-trivial datasets, you may often see that models with equivalent or nearly-equivalent performance have very different field importances. The field importances we report are exactly what they say on the tin: they tell you how much the model’s predictions will change when you change the value of that feature.

This is where you, the domain expert, can use your knowledge to improve your model, or at least save it from disaster. You might see a case where a high performing model is relying on just a few features from the dataset, and another high performing model is relying on a few different features. Fusing the models together will give you a model guaranteed to take all of those features into account.

This can be a win for two reasons, even if the performance of the fused model is no better than the separate models. First, people looking at the importance report will find that the model is taking into account more of the input data when making its prediction, which people generally find more satisfying than the model taking into account only a few of the available fields. Second, the fused model will be more robust than the constituent models in the sense that if one of the inputs becomes corrupt or unreliable, you still have a good chance of making the right prediction because of the presence of the other “backup” model.

(Mostly) Uncorrelated Feature Sets with Different Geometries

Okay, so that’s a mouthful, but what I’m talking about here is situations where you’ve got a set of features, where some are better modeled separately from the others.

Why would you do this? It’s possible that a subset of the features in your data is amenable to one type of modeling and others to different types of modeling. If this is the case, and if those different features are not terribly well-correlated with one another, a fusion of two models, each properly tuned, may produce better results than either one on its own.

A good example is where you have a block of text data with some associated numeric metadata (say, a plain text loan description and numeric information about the applicant). Models like logistic regression and deepnets are generally good at constructing models that are algebraic combinations of continuous numeric features, and so might be superior for modeling those features. Trees and ensembles thereof are good at picking out relevant features from a large number of possibilities, and so are often well-suited to dealing with word counts. It seems obvious, then, that carefully tuning separate models for each type of data might be beneficial.

Whether the combination outperforms either one by itself (or a joint model) depends again on the additional performance you can squeeze out by modeling separately and the relative usefulness of each set of features.
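
A sketch of that setup with the Python bindings: train a tree ensemble on the text field only and a logistic regression on the numeric metadata only by restricting each model’s input_fields, then fuse the pair. The dataset and field IDs are placeholders; input_fields expects the IDs of the predictor fields in your dataset.

# Python - one model per feature subset, combined in a fusion (placeholder IDs)
from bigml.api import BigML

api = BigML()
dataset = "dataset/5b0000000000000000000007"  # placeholder

# Ensemble on the text field only (e.g., the plain-text loan description).
ensemble = api.create_ensemble(dataset, {"input_fields": ["000001"]})

# Logistic regression on the numeric metadata only (e.g., income and loan amount).
logistic = api.create_logistic_regression(dataset, {"input_fields": ["000002", "000003"]})

api.ok(ensemble)
api.ok(logistic)

# Each model is tuned to the geometry of its own features; the fusion combines their votes.
fusion = api.create_fusion([ensemble["resource"], logistic["resource"]])
api.ok(fusion)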

That Is The Question

Hopefully, I’ve convinced you that there are reasons to use fusions that go beyond just trying to optimize your model’s performance; they can also be a way to get a model to behave in a more satisfying and coherent way while not sacrificing accuracy. If your use case fits one of the patterns above, go ahead and give fusions a try!

Want to know more about Fusions?

If you have any questions or you would like to learn more about how Fusions work, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.

Automating Fusions with WhizzML and the Python Bindings


This blog post is the fifth in our series of posts about Fusions and focuses on how to automate processes that include them using WhizzML, BigML’s Domain Specific Language for Machine Learning workflow automation. To summarize, a Fusion is a group of resources that predict together in order to offset each resource’s individual weaknesses.

In this post, we are going to describe how to automate a process that creates a good predictor by employing Fusions programmatically. As we have mentioned in other posts related to WhizzML, WhizzML allows you to execute complex tasks that are computed completely on the server side, with parallelization. This eliminates connection issues and respects your account limits on the maximum number of resources you can create at the same time. We will also describe the same operations with our Python bindings as another option, for client-side control.

As we have mentioned, this release of Fusions puts the focus on the power of many models working together. Starting from the beginning, suppose we have a group of trained models (trees, ensembles of trees, logistic regressions, deepnets, or even other fusions) and we want to use all of them to create new predictions. Using multiple models helps offset the weaknesses of any single model. The first step is to create a Fusion resource, passing the models as a parameter. Below is the code for creating a Fusion in the simplest way: passing a list of models in the format ["<resource_type/resource_id>", "<resource_type/resource_id>", …] and no other parameters, that is, accepting the defaults.

;; WhizzML - create a fusion
(define my-fusion (create-fusion {"models" my-best-models}))

If you choose to use Python to code your workflows and run the process locally, instead of running it completely on the server, the equivalent code is below, where the models are again passed as the only parameter, in a list.

# Python - create a fusion
fusion = api.create_fusion(["model/5af06df94e17277501000010",
                            "logisticregression/5af06df84e17277502000019",
                            "deepnet/5af06df84e17277502000016",
                            "ensemble/5af06df74e1727750100000d"]})

Just like all BigML resources, Fusions have optional parameters that the user can add to the creation request to improve the final result. For instance, suppose that we want to assign different weights to each of the models that compose the Fusion because we know that one of the models is more accurate than the others. Another point to highlight is that the creation of the resource in BigML is asynchronous, which means that most of the time the creation request doesn’t return the resource in its completed state. To get it completed, you have two main options: poll the resource repeatedly until it is finished, or use the functions created for that purpose. In WhizzML, the create-and-wait-fusion function pauses the workflow execution until the Fusion is completed.

Let’s see how to do it, specifying weights for the models and assigning the variable once the resource is completed. Looking at the code below, you can see that the list of models is still there, but now each entry is a map with the model ID and its weight:

;; WhizzML - create a fusion with weights and wait for the finish
(define my-fusion
  (create-and-wait-fusion {"models"
                           [{"id" "model/5af06df94e17277501000010"
                             "weight" 1}
                            {"id" "deepnet/5af06df84e17277502000016"
                             "weight" 4}
                            {"id" "ensemble/5af06df74e1727750100000d"
                             "weight" 3}]}))

In the Python bindings, the asynchronous behavior is managed by the ok function, and the weights are added to each model’s entry in the Fusion. Here is the Python bindings code that is equivalent to the WhizzML code above.

# Python - create a fusion with weights and wait for the finish
fusion = api.create_fusion([
    {"id": "model/5af06df94e17277501000010", "weight": 1},
    {"id": "deepnet/5af06df84e17277502000016", "weight": 4},
    {"id": "ensemble/5af06df74e1727750100000d", "weight": 3}])
api.ok(fusion)

To see the complete list of arguments for Fusion creation, visit the corresponding section in the API documentation.

Once the Fusion has been created, the best way to measure its performance, as with every type of supervised model, is to make an evaluation. To do so, you need to choose data different from the data used to create the Fusion, since you want to avoid overfitting problems. This data is often referred to as a “test dataset”. Let’s see first how to evaluate a Fusion with WhizzML:

;; WhizzML - Evaluate a fusion
(define my-evaluation
    (create-evaluation {"fusion" my-fusion "dataset" my-test-dataset}))

and now how it should be done with the Python bindings:

# Python - Evaluate a fusion
evaluation = api.create_evaluation(my_fusion, my_test_dataset)

In both cases, the code is extraordinarily simple. With the evaluation, you can determine whether the performance is acceptable for your use case, or whether you need to continue improving your Fusion by adding models or training new ones.

As with any supervised resource, once the model performs well enough, you can start using it to make predictions, which is the goal of the Fusion model we built. Following the flow of the post, let’s write the WhizzML code to make single predictions, that is, to predict the result for a single “row” of new data.

;; WhizzML - Predict using a fusion
(define my-prediction
    (create-prediction {"fusion" my-fusion
                        "input-data" {"state" "kansas" "bmi" 32.5}}))

To do exactly the same with Python bindings, your code should be like the following. The first parameter is the Fusion resource ID and the second one is the new data to predict with.

# Python - Predict using a fusion
prediction = api.create_prediction(my_fusion, {
    "state": "kansas",
    "bmi": 32.5
})

Here we are showing the simplest way to make a prediction, but prediction creation accepts a long list of parameters so you can tailor the result to your needs.

When your goal is not only to predict a single row but a group of data, represented as a new dataset (that you previously uploaded to the BigML platform), you should create a batchprediction resource, which only requires two parameters: the Fusion ID and the ID of this new dataset.

;; WhizzML - Make a batch of predictions using a fusion
(define my-batchprediction
    (create-batchprediction {"fusion" my-fusion "dataset" my-dataset}))

It couldn’t get any easier. The equivalent code in Python is almost the same and very simple too. Here it is:

# Python - Make a batch of predictions using a fusion
batch_prediction = api.create_batch_prediction(my_fusion, my_dataset)

Want to know more about Fusions?

Stay tuned for the next blog post to learn how Fusions work behind the scenes. If you have any questions or you would like to learn more about how Fusions work, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.
