We are pleased to introduce Efe Toros, who joined the BigML Team this summer as a Data Science Intern. This post shares his experience as a BigML Intern and how he contributed to helping us make Machine Learning beautifully simple for everyone. We’ll let Efe take it from here…
I had the amazing opportunity to work in Valencia this summer as a Data Science Intern for BigML. I am going into my senior year at the University of California, Berkeley, studying Data Science, and I can say that my experience at BigML was exactly the motivation I needed for my last stretch of school. It was a great environment where I got to apply the skills I learned in the classroom to the real world.
During my time at BigML, I experienced both freedom and guidance in my work. When I first arrived, my mentor and I laid out an informal roadmap of the tasks and goals for my internship. My main job was to create technical use cases for various industries that would show the benefits of using Machine Learning, specifically what BigML has to offer, in helping businesses solve problems, increase efficiency, or improve processes.
All my projects took the form of descriptive Jupyter notebooks that explained the workflow of processing the data, writing supporting code, and, most importantly, using BigML’s API and Python bindings to create the core of the predictions. You can find all of the notebooks in this GitHub repository of BigML use cases and a summary of each use case below. My projects involved five different use cases (including demos on the BigML Dashboard and Jupyter notebooks):
- Predicting if a customer would default on a home loan.
- Predicting a customer’s transaction value from personal data.
- Building a movie recommendation system using clustering and topic modeling.
- Predicting if a flight would be delayed.
- Predicting engine failure from sensor data.
Overview of Projects
Predicting home loan defaults and customer transaction value:
- For the first project, I organized and cleaned multiple datasets of bank data in order to create a predictive model that would identify if an individual would default on their home loan, as shown in the BigML Dashboard tutorial video below.
- The second project involved analyzing high-dimensional customer data to create a model that would predict a customer’s transaction value. Most of the work went into organizing and summarizing the high-dimensional fields so the trained model would perform better.
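One common first step for taming high-dimensional tabular data is pruning near-constant fields that carry little predictive signal. The sketch below is illustrative only (the field names and values are made up, and this is not the notebook's exact code):

```python
# Illustrative sketch: drop near-constant numeric fields from
# high-dimensional tabular data before training, since fields that
# barely vary contribute little to a predictive model.

def low_variance_fields(rows, threshold=1e-6):
    """Return names of numeric fields whose variance falls below threshold."""
    dropped = []
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        if variance < threshold:
            dropped.append(name)
    return dropped

# Hypothetical customer rows: var_0 is constant, var_1 varies.
rows = [
    {"var_0": 1.0, "var_1": 5.2},
    {"var_0": 1.0, "var_1": 3.9},
    {"var_0": 1.0, "var_1": 7.4},
]
print(low_variance_fields(rows))  # -> ['var_0']
```

Fields flagged this way can simply be excluded from the dataset before model training.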
Building a movie recommender:
- This project involved using two of BigML’s unsupervised learning algorithms, clustering and topic modeling, to create recommendation systems. The project is split into two notebooks, one per algorithm; each derives information from a dataset by placing every instance in a higher-dimensional space, which enables a better search for similarities. For instance, one of my recommender systems ran BigML’s topic modeling algorithm in a batch topic distribution over the training data. BigML’s topic models can make predictions on unseen data, but when applied back to the data the topic model was trained on, they let you derive more information from the text fields. With this batch topic distribution, every instance that was originally just a movie title and description gained additional numerical fields that broke the text apart, almost like a DNA fingerprint (see image below). This allowed the movies to be organized in a high-dimensional space where I compared movies and found similarities with a distance function.
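The final similarity step can be sketched in a few lines. Here is a minimal, hypothetical version: each movie is represented by a vector of topic probabilities (the titles and numbers are invented, and the distance function is plain Euclidean distance rather than anything BigML-specific):

```python
import math

# Hypothetical topic distributions: each movie becomes a vector of
# topic probabilities after a batch topic distribution. Titles and
# values here are made up for illustration.
movies = {
    "Space Saga":   [0.70, 0.10, 0.20],
    "Star Quest":   [0.65, 0.15, 0.20],
    "Romance Town": [0.05, 0.85, 0.10],
}

def euclidean(a, b):
    """Straight-line distance between two topic vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recommend(title, catalog):
    """Return the catalog movie whose topic vector is closest to `title`'s."""
    target = catalog[title]
    others = {t: v for t, v in catalog.items() if t != title}
    return min(others, key=lambda t: euclidean(target, others[t]))

print(recommend("Space Saga", movies))  # -> Star Quest
```

Movies with similar topic mixes end up close together, so the nearest neighbor in topic space doubles as a recommendation.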
Predicting flight delays:
- The fourth project combined two datasets: flight records from the Bureau of Transportation Statistics and U.S. weather data. The main objective was to accurately identify flights that were likely to be delayed. The transportation data alone was not sufficient for a great model, as it had many repetitive and uninformative fields; engineering the weather dataset so the two could be joined led to a more accurate model. Two models were created: one that predicted delays using information available at takeoff, and one that predicted delays before takeoff.
Predicting engine failure:
- The final project focused on sensor data that NASA had generated to analyze engine failures. After creating a target field for the remaining useful life of an engine, two models were trained. The first model predicted the exact number of cycles an engine had left before failing. This model performed fairly well but had room for improvement, since it had trouble predicting the cycle count for engines that were far from failing. Therefore, the second model focused on predicting whether an engine would fail within the next 30 cycles, and it performed very well, doing an excellent job of identifying engines that had started to show signs of malfunction. The idea behind this project generalizes to any sensor data, such as machinery used in production lines.
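Constructing the two targets from run-to-failure data is a small but key step. A sketch under invented data (engine IDs, cycle counts, and the helper name are all hypothetical) might look like this:

```python
# Illustrative target construction from run-to-failure sensor data:
# remaining useful life (RUL) as a regression target, and a
# "fails within N cycles" flag as a classification target.
cycles = {
    "engine_1": [1, 2, 3, 4, 5],  # engine_1 ran 5 cycles before failing
    "engine_2": [1, 2, 3],        # engine_2 ran 3 cycles before failing
}

def add_targets(cycles, window=30):
    """Label each (engine, cycle) row with its RUL and a near-failure flag."""
    rows = []
    for engine, seq in cycles.items():
        max_cycle = max(seq)  # last observed cycle = failure point
        for c in seq:
            rul = max_cycle - c
            rows.append({
                "engine": engine,
                "cycle": c,
                "rul": rul,                   # regression target
                "fails_soon": rul <= window,  # classification target
            })
    return rows

rows = add_targets(cycles, window=2)
print(rows[0])  # engine_1 at cycle 1: rul 4, fails_soon False
```

The regression model trains on `rul`, while the classifier trains on `fails_soon`; thresholding the label this way is what made the second model's task easier.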
Not only did BigML give me the opportunity to experience a new culture and meet great people, but I also developed a solid foundation in Machine Learning to pursue a data-driven career. I know these new skills will give me a valuable perspective for approaching problems that exist all around us.