Within 24 hours of turning in my IBM badge, laptop and signed exit papers, I found myself on a plane to Buenos Aires, followed by Melbourne and Sydney, for conferences and client meetings representing a company I was not even officially working for yet. I am now one of the most recent additions to the BigML team, and my motivation to give up my IBM career and make a fresh start with a startup did not arise as a sudden urge. Rather, it was a gradual process of observing the sea change in the marketplace.
Having interfaced with many analytics organizations during my tenure at IBM, I am convinced that we have entered a new era, where the democratization of machine learning is allowing organizations large and small to add repeatable statistical rigor to all kinds of processes that up to now have been predominantly influenced by human bias, e.g., candidate identification and the interview process (HR), predicting vacation rental prices, athlete profiling by scouts, sizing and pricing complex services projects, and optimizing crop yields and farming operations. No doubt all of those business profiles, including the guy predicting vacation rental prices, will one day utilize machine learning, and without having to reinvent themselves as hackers.
Speed, Deployability, and Costs
Like most digital technologies, Machine Learning is in the process of becoming automated and commoditized with BigML, Amazon, Microsoft, and Google leading the charge.
The business drivers? There are many:
- Transforming a manual set of processes into a single fluid one by leveraging easy-to-use services
- Lowering the complexity and cost of building and deploying predictive models
- Increasing business performance by applying machine learning in daily operations to speed up the time-to-market of more and more data-driven decisions.
With tools like BigML, in a fraction of the time that it takes to install and configure a statistical software package like R, SAS or SPSS, you can create an online account, load your source data, train, test, and boom! You have built a predictive model that helps you score all your present and future data. Best of all, the same process template can be applied to any of your functional areas (marketing, sales, risk, compliance, maintenance, etc.), accelerating the scale of data-driven actions across the whole company.
Oh, and did I mention exportable models that can be shared with anybody in your organization, can run remotely on IoT devices or other high-value assets (e.g., cell towers, manufacturing equipment, infrastructure pipes), and support the most popular runtime environments such as Python, C++, Java, Node.js and more? That’s right, you can create models, then export them to anybody or practically anything and have them run locally on a single machine or send them to a million machines at no additional cost. Yes, democratization of advanced analytics is literally here.
All this means the traditional enterprise software vendor approach (SAS, SAP, IBM etc.) of selling large bundles of software that include components/modules that sit unused is slowly seeing the end of days. Users want innovation, not a re-stitching together of 20+ year old platforms, buttressed with spend-heavy traditional marketing, to keep up with the budding tools that have adopted cloud-based, distributed, machine-learning-driven approaches from birth. On top of that, over the last decade companies have become much more comfortable working with innovative startup providers that offer a SaaS model built on IaaS and PaaS substrates and that do not shy away from passing the savings from low overhead costs on to their customers. Since the playbook of best practices needed to operate a cloud-born company is common knowledge these days, we will likely witness the product selection bias tipping increasingly away from the incumbents.
Machine Learning made “beautiful”?
When I first heard BigML’s motto, my thought was “Who would associate Machine Learning with beauty?” However, after seeing audience reactions ranging from “Wow!” and “Very cool” to, from time to time, “This is like IBM Watson and Tableau put together”, I have come to appreciate the effort the BigML team has put into enabling a highly streamlined and understandable machine learning workflow that is capable of demystifying mathematically complex Machine Learning concepts even for the uninitiated. Beautifully simple indeed. It has finally occurred to me that living in an era ushered in by Mr. Steve Jobs has spoiled us all with much higher expectations of any and all products that we come to touch. Naturally, there is no reason why the same should not apply to Machine Learning software.
While such meaningful progress is being made towards making machine learning more usable and understandable for a broader set of technical and business users, today’s typical practitioners are mainly Developers, Data Scientists, and “Data Wranglers”, the unsung heroes of any analytics project. Which brings us to an inconvenient reality: there are just not enough of these well-balanced teams of “Practitioners” in the market to meet the exploding demand. So what are companies to do at a time when Machine Learning is supposed to take center stage in their irreversible digital evolution?
If you are tasked with building a Data Science team, I recommend getting started with assigning tasks to Business Analysts and Data Wranglers. They should clearly prioritize and formulate the business problem, extract, transform and get ready the relevant data sources for predictive modeling. Most practitioners agree that up to 80% of the time and effort is spent with those stages of the process. In parallel, you can aim to find a person with “practical” experience in the area of machine learning. WARNING – if during your hiring process you come across a Data Scientist whose career highlight is an exotic algorithm he has been working on for the last 2 years, say THANK YOU and RUN the other way. That’s more of a Machine Learning Researcher profile than a practitioner. Frankly many companies don’t need that level of specialization in a field that has been around for over half a century with many proven techniques and approaches already productized and available as RESTful API end points – just add data.
As far as BigML is concerned, we are Data Scientist agnostic. We recognize that a “practical” and well-aligned Data Scientist can empower a broader team with relevant Machine Learning knowledge, allowing them to more efficiently explore problems that matter to the business. Equally important is a scalable, programmable, and easy-to-use MLaaS (Machine Learning as a Service) tool that lets capable Developers and in-house SMEs build, test and deploy predictive use cases that can learn and get better and better over time. The results speak for themselves as far as the business impact is concerned, regardless of whether one can derive the mathematical formulae that gave rise to the Random Decision Forest algorithm. Bootstrapping is the call of the day, which is only fitting in an assume-nothing, test-everything, let-the-data-be-your-guide Lean Startup world.
Admittedly, some of these steps are easier written here than actually delivered, so as expected, companies will need to do their homework and identify the differences between these MLaaS providers with care. Well, I already did my due diligence, and now I am sprinting with BigML.
We are happy to announce that BigML and The Polytechnic University of Valencia (UPV) have signed an agreement creating a new University-Business chair in order to support new Machine Learning research and to promote the use of Machine Learning technologies at UPV. The official signature ceremony of the agreement took place on the 20th of October at The Polytechnic University of Valencia Rector’s office with representatives from both parties.
As part of the agreement BigML will collaborate with UPV on organizing lectures, teaching activities, idea competitions, conferences, courses, seminars, pre-doctoral and post-doctoral grants, internships and more. These activities are expected to aid the students in maximizing the knowledge transfer from BigML while creating exciting career opportunities for the graduates where they can apply their newly acquired technical skills.
On the research front, the business chair plans to multiply Machine Learning research projects coming out of Valencia, Spain (and Europe at large through UPV connections). Promoting hi-tech entrepreneurship in advanced analytics and software disciplines in the region has been one of the primary motivations behind our recent decision to locate our European headquarters in Valencia so we are very glad that UPV has agreed to join forces in this worthy endeavor.
We are happy to announce a special full day event bringing together some of the best minds in machine learning and business management to discuss The Future Impact of Machine Learning in depth. The event will take place at Las Naves in Valencia on October 20, 2015. Registration is free and by invitation only. To apply and reserve your place for this unique event, be sure to fill out the event form ASAP as available spots will not last long.
Where will the Machine Learning revolution lead the world? Will Machine Learning create more jobs than it destroys? These remain some of the most controversial topics for technology pundits and world leaders alike to discuss, whether on Bloomberg TV or at venues like the World Economic Forum’s annual meeting at Davos. They are fair questions in that we are observing greater momentum in Artificial Intelligence based technologies and products launched into the new connected world economy: think Siri, Nest, Cortana, Google Now, the self-driving car and military-grade drones. In fact, Google’s complete reorganization into an Alphabet soup of project-oriented entities may have been a bit confusing to some, but it was undertaken to a large extent to prepare for this new AI future. There are now even serious plans to create ships that can cross the sea without humans on board. So much for the Captain Phillips sequel!
This is a very complex subject and we are collectively barely scratching the surface. As a result, we need many more informed debates about the economic impact of new technologies such as Machine Learning. We are hopeful that the Las Naves gathering will help move this crucial debate further with participation from:
- World-renowned Machine Learning Professor Dr. Thomas G. Dietterich
- Enrique Dans, Professor of Information Systems at IE Business School in Madrid, Spain
Ramon López de Mántaras will be covering the historical perspective of AI. Dr. Dietterich will brief the audience on recent advances and breakthroughs with real-world stories, as well as the long-term risks and opportunities he sees. Finally, Enrique Dans will highlight the expected business impact of AI, making this a multidimensional discussion of the topic. Once again, registration for this event is free and by invitation only. To apply and reserve your spot, be sure to fill out this event form. If you would like to promote the event among your circle, we also encourage you to download the poster in English, Spanish or Valencian.
This is part of an ongoing statistics-related blog post series in anticipation of BigML’s upcoming statistical tests resource. The previous post was about fraud detection with Benford’s Law. In this post, we will explore the topic of correlation, and how it can help you in designing and applying machine learning models.
Consider the following…
Put in plain terms, correlation is a measure of how strongly one variable depends on another. Consider a hypothetical dataset containing information about professionals in the software industry. We might expect a strong relationship between age and salary, since senior project managers will tend to be paid better than young pup engineers. On the other hand, there is probably a very weak, if any, relationship between shoe size and salary. Correlations can be positive or negative. Our age and salary example is a case of positive correlation. Individuals with a higher age would also tend to have a higher salary. An example of negative correlation might be age compared to outstanding student loan debt: typically older people will have more of their student loans paid off.
Correlation can be an important tool for feature engineering in building machine learning models. Predictors which are uncorrelated with the objective variable are probably good candidates to trim from the model (shoe size is not a useful predictor for salary). In addition, if two predictors are strongly correlated to each other, then we only need to use one of them (in predicting salary, there is no need to use both age in years, and age in months). Taking these steps means that the resulting model will be simpler, and simpler models are easier to interpret.
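The two pruning rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is made up, the function name `prune_features` and both thresholds (`min_target_r`, `max_pair_r`) are hypothetical choices rather than BigML defaults, and the correlation measure used is Pearson's coefficient, which is introduced in the next paragraph.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def prune_features(features, target, min_target_r=0.3, max_pair_r=0.95):
    """Return the names of the predictors worth keeping.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    kept = []
    for name, values in features.items():
        # Rule 1: drop predictors that are (nearly) uncorrelated with the objective.
        if abs(pearson_r(values, target)) < min_target_r:
            continue
        # Rule 2: drop predictors that duplicate one we have already kept.
        if any(abs(pearson_r(values, features[k])) > max_pair_r for k in kept):
            continue
        kept.append(name)
    return kept

# Made-up data for six software professionals.
age_years = [25, 32, 41, 48, 55, 60]
features = {
    "age_years": age_years,
    "age_months": [a * 12 for a in age_years],  # redundant with age_years
    "shoe_size": [42, 39, 44, 40, 43, 41],      # unrelated to salary
}
salary = [50, 62, 80, 85, 99, 104]  # in thousands

kept = prune_features(features, salary)
print(kept)
```

With this toy data only `age_years` survives: shoe size is too weakly correlated with salary, and age in months is a perfect duplicate of age in years.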
There are many measures of correlation, but by far the most widely used one is Pearson’s Product-Moment coefficient, or Pearson’s r. Given a collection of paired (x,y) values, Pearson’s coefficient produces a value between -1 and +1 to quantify the strength of the linear dependence between the variables x and y. A value of +1 means that all the (x,y) points lie exactly on a line with positive slope, and conversely, a value of -1 means that all of the points lie exactly on a line with negative slope. A Pearson’s coefficient of 0 means that there is no linear relationship between the two variables, although a non-linear relationship may still exist. To see this visually, we can look at plots of our hypothetical data, and the Pearson’s coefficient computed from them.
We observe that the sign of the correlation coefficient matches the slope of the line of best fit. Note, however, that the magnitude of the coefficient does not depend on the slope of the line, but only on how well the points fit the line. Also notice from our Age vs. Debt plot that the relationship between the two variables is exponential rather than linear. Despite seeing a clear relationship between the two variables, a good straight-line fit cannot be made, and so the resulting correlation coefficient is smaller than one might expect. To elaborate on how non-linear data can confound Pearson’s r, we turn our attention to Anscombe’s Quartet. This is a set of four cleverly constructed datasets which have the same Pearson’s correlation coefficient, as well as other summary statistics, yet are significantly dissimilar when seen on a graph.
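For reference, the standard textbook definition of Pearson’s r for n paired observations can be written as follows, where the bars denote the sample means of x and y (the notation is ours, not taken from the original post):

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```

The numerator is positive when x and y tend to sit on the same side of their means and negative when they tend to sit on opposite sides, which is why the sign of r follows the direction of the relationship.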
In cases such as this, Spearman’s rank-correlation coefficient, or Spearman’s rho, may be a good alternative measure. Spearman’s rho quantifies how monotonic the relationship between the two variables is, i.e. “Does an increase in x usually result in an increase in y?” (technically it is equivalent to computing Pearson’s r for a rank-transformed version of the data). We can see that while the four Anscombe datasets have equivalent Pearson’s r, Spearman’s rho does a good job discriminating them.
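The rank-transform definition above can be illustrated with a short sketch. It does not use Anscombe’s data; instead it assumes a made-up, perfectly monotonic but exponential relationship, for which Spearman’s rho should be 1 while Pearson’s r falls well short of it. The helper names are our own.

```python
from math import exp, sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's r computed on the rank-transformed data."""
    return pearson_r(ranks(xs), ranks(ys))

# Made-up monotonic, strongly non-linear data: y grows exponentially with x.
x = list(range(1, 11))
y = [exp(v) for v in x]

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Because every increase in x comes with an increase in y, the ranks line up exactly and rho is 1, whereas the straight-line fit behind Pearson’s r is poor.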
Correlation coefficients are a useful tool for exploring relationships within your data. Now that you have been introduced to the topic of correlations, we invite you to explore it further with BigML’s new correlations resource, and start exploring your data!
“If you want to hear my honest opinion, what you’re about to do looks like a desperate move to me.” This was the reaction of a good friend and mentor of mine when he heard that I was about to accept BigML’s offer to join them as V.P. of Business Development. Mind you, he has spent most of his career working for large international companies at the C-level. To be fair to him, his first conclusion was based on a five-minute phone introduction to BigML and my future role, while I was on the TGV train to Paris on my way to PAPIs.io and he was trying to catch a plane in London himself. He continued: “Mobile payments are finally about to hit the mass market and there’s a huge buzz about all the FinTech topics you’ve been working on for years and instead of cashing out you want to join a small startup in what was the name again? Predictive analytics?”
I should have gone differently about explaining my motivation in exchanging a comfortable life and a well-paid FinTech job. While I am proud to have headed a great team at one of the most innovative Swiss banks that afforded me the opportunity to frequently exchange experiences with other FinTech professionals as a speaker and as a board director of Mobey Forum, I’m leaving it behind for three things:
- The opportunity to actively participate in this unique moment in time, when predictive analytics technologies are starting to change the world as we know it
- A company that is perfectly positioned to ride this big wave by offering Machine Learning as a Service (MLaaS) to enable not just one but a multitude of new business use cases
- A team that is very passionate about making machine learning simple, beautiful and easily applicable into predictive apps and services for all interested parties out there including developers, students, analysts and business experts.
Let me further explain.
Predictive Analytics in Business Context
Coincidentally, my predictive analytics story began around the time BigML was founded in 2011. At that time, I had initiated the card-linked offers program for my bank in Switzerland. It was in essence pretty similar to what some startups in the U.S. had been offering, but with a twist. Instead of using only simple descriptive data analytics focusing on a selection of customers that had shown a certain behavior in the last X months, we focused our energy on developing a service based on predictive models that could tell us what the customer is likely to do next. This predictive capability was based on a huge amount of data our customers had given us permission to mine. This quickly became our unique selling proposition in the market and turned me and my colleagues into believers.
So what is so exciting about making predictions about unknown events? It boils down to the ability to learn from large datasets rather than following preset business rules, which is a huge improvement in the business world. The majority of the existing business rules within our companies are a part of the heritage of the pre-digital world that have been transferred to the digital age without questioning their present purpose. Meanwhile, we have access to technologies in the area of modeling, machine learning and data mining that enable us to build automated, self-improving processes to better adapt our companies to the present environment. In this context, being able to predict unknown outcomes will make the difference between acting and reacting, between smart and (to state it politely) not-so-smart and between winners and losers of tomorrow’s competitive landscape. Those who don’t count cards will ultimately leave the poker table penniless.
And how about those who specialize in counting cards? Kevin Kelly from Wired magazine wrote: “In fact, the business plans of the next 10,000 startups are easy to forecast: Take X and add AI.” Take the FinTech space as an example of how many opportunities lie in front of new players who will benefit from the inability of banks to unlock the full potential of digitization, then imagine what kind of potential lies across different industries when it comes to artificial intelligence and predictive apps. Some claim smart apps will have an even bigger impact on our businesses and private lives than the introduction of mobile computing had. As in the case of my good friend from the beginning of this post, being a senior manager and running a multi-million euro business operation doesn’t automatically put you in a position to understand the potential of this technology and the forces at work, which are bound to redefine how we run our day-to-day businesses.
Granted, the idea of machine learning in the cloud is no longer brand new. Last week Computer World UK published the list of “7 cloud tools to harness artificial intelligence for your business”. And the participant list is impressive: Google, Microsoft, Amazon, Alibaba (the announcement was made last week – “huānyíng” guys), IBM and two other startups. One of them is BigML. But back in 2011, only two of those seven companies were at the start line: Google and BigML.
BigML’s early start in introducing large-scale machine learning to the masses has turned into a healthy mix of subscription customers as well as corporate engagements, including hybrid installations and professional services in a variety of industry segments and predictive use cases. Increasing demand for BigML’s expertise in this field has also resulted in high-profile strategic joint ventures such as the partnership with Telefónica Open Future_ to build an automated platform for assessing early stage technology startup investment opportunities. There has never been a better time to be in machine learning, and it will likely remain a central area of business innovation for the next 5 years as per Gartner’s newly published hype cycle curve. We have got work to do!
What struck me right away when I met the team in Barcelona during PAPIs.io last year was the fact that many of the people working for the company have followed our energetic CEO into a new venture for the second or third time. This is indisputable proof of trust and leadership qualities. Like in the lyrics of that Bob Marley song: “You can fool some people sometimes, but you can’t fool all the people all the time.” Take my first conversation with JAO, BigML’s CTO. He wasted no time coming straight to the point: “Don’t consider starting here by lining up a bunch of integration projects as a success. I want our developers to continue further developing and improving our predictive platform rather than running a bunch of short-term snowflake projects leaving it to stagnate.” Bang! This was music to my ears. A company run by product guys that aim first and foremost to make a great product. So no more writing about it for me, let’s start!
Not a day goes by without an important figure from the technology world acknowledging the immense potential of large-scale machine learning applications in virtually every industry. The award-winning researcher and scientist Professor Pedro Domingos of University of Washington recently mentioned “Algorithms increasingly run our lives. They find books, movies, jobs, and dates for us, manage our investments, and discover new drugs.” He then went on to add “The race is on to invent the ultimate learning algorithm: one capable of discovering any knowledge from data, and doing anything we want, before we even ask.”
Machine learning’s impending impact on jobs that so far were considered exclusive to highly skilled humans will likely remain a controversial yet unavoidable topic for the foreseeable future. Take for example Early Stage Startup Investing. Silicon Valley venture capitalists have had a big hand in funding startups that are strategically leveraging machine learning to radically disrupt old world industries like hospitality and transportation services. This has helped create the much celebrated “Unicorns” that are defining the pace of innovation in our evolving digital consumer economy, where it makes more business sense to optimize the utilization of existing assets than to own and operate them outright.
As brilliant as this strategy has been, the utilization of machine learning to drive decision making in early stage investing itself has remained mysteriously absent. A controversial recent article by Arlo Gilbert succinctly articulates this “dirty secret” of Silicon Valley, which stubbornly remains very human and relationship driven. As Gilbert puts it, “A computer doesn’t worry about its next fund.”
Given the backdrop of a typical exponential technology, it is not too far-fetched to think that this kind of freedom from existing human biases, combined with increasingly powerful predictive models crunching increasingly sophisticated new sources of structured and unstructured data, may force VCs’ hands, sooner than they realize, to take a stance on whether they should embrace automatic startup investing. This also happens to be exactly what Kirk Kardashian of Fortune has investigated in his recent article titled “Could algorithms help create a better venture capitalist?”
On the heels of these developments, Telefónica’s global entrepreneurship and innovation network Telefónica Open Future_ (TOF_) and BigML are proudly partnering to build the World’s First Automated Early Stage Investing Platform to help usher in a new way of technology investing. To make this a reality, the alliance will take advantage of the PAPIs conference event series, as it offers the best platform to introduce new predictive applications to the world:
- In all future PAPIs CONNECT events as well as the annual PAPIs conference, there will be a new startup battle section devoted to the best startups actively working on state-of-the-art predictive apps.
- Both the invited startups and the eventual winner will be automatically selected by TOF_ and BigML’s new algorithm. The winning startup will be offered an investment by one of the TOF_ funds, thereby making PAPIs the world’s first event where an algorithm automatically selects the contestants and the winner.
Machine learning innovators gathering in Sydney for this year’s International Conference on Predictive APIs and Apps are uniquely positioned to ride this new wave of change. The event is taking place in Sydney on 6 & 7 August 2015 and it is bringing together leaders from Amazon Machine Learning, BigML, Google Prediction API and Microsoft Azure ML on the same stage. As such, PAPIs event participants and followers who have a machine learning driven startup idea or a relevant ongoing project can now gain additional exposure and perhaps also secure a seed round investment from an established technology player with the foresight to go against the grain in true maverick fashion. As the original sponsor of PAPIs, BigML is looking forward to meeting our future contestants.
So far we are very pleased with the welcoming messages and the genuine interest you have expressed towards BigML’s new European headquarters in Valencia. We would like to capitalize on this and return the favor by announcing that we are organizing the 1st Valencian Summer School in Machine Learning on September 15 and 16 in collaboration with the Universitat Politècnica de València, the Universitat de València, and Las Naves. Besides adding to the stock of quality machine learning specialists in Europe, we think that this event will be a great opportunity for us to learn about your industry context, goals and strategic outlook so we can better shape BigML’s scalable, consumable, and highly programmable cloud-based machine learning platform to your fast-evolving needs.
As Kevin Kelly of Wired magazine recently wrote, “The business plans of the next 10,000 startups are easy to forecast: take X and add AI.” We agree with his take wholeheartedly. These startups will be the future Microsofts and Googles and they will rely on Machine Intelligence at an unprecedented rate in doing so. We believe that the rate and magnitude of the innovations on this front will make yesteryears’ information retrieval and decision support systems look pedestrian in comparison.
In that vein, our FREE, invitation-only two-day summer school is perfect for advanced undergraduates as well as graduate students and industry practitioners seeking a quick, practical, and hands-on introduction to Machine Learning. It will also serve as a good introduction to the kind of work that students can expect if they enroll in advanced master’s programs at both the Universitat Politècnica de València and the Universitat de València. The primary goal is not only to introduce basic machine learning concepts, techniques, and tools, but also to share real-world experiences, let the students practice with real datasets, and help them build their first predictive applications. As such, lectures are going to be very intensive and will take place at Las Naves from 7:30am to 9:30pm on both September 15th and September 16th.
If this sounds right up your alley, please register today to be considered for one of 42 available spots at this challenging yet fun event. Applications will be evaluated based on a combination of interests, skills, and motivation. Make sure that you also follow our blog and Twitter account for further announcements on the event specifics such as participating lecturers. We wish you a perfect Valencian summer, where you get to sharpen your machine learning skills!
The first version of BigML’s add-on for Google Sheets has been released! The BigML add-on provides an easy way to fill the blanks in your spreadsheets using the predictions of models and clusters in BigML. As we explained in a previous post, you can now fill in the columns in your spreadsheet by using existing BigML decision tree models to generate predictions based on the sheet data. Thus, using the add-on you can, for instance, score your sales prospects based on the historic sales records in your Google Sheets. Similarly, you can group your customer data into segments according to the clusters they belong to.
Get the add-on running
The add-on is available at the Chrome Web Store or directly from your Google Sheet by using the Get add-ons item in the Add-ons menu. The BigML add-on appears under the Business Tools category. Just click the +Free button to install it, and a new BigML submenu item will appear under the Add-ons menu.
By accepting the requested permissions, you allow the add-on to access the data in your Google Sheet and your models and clusters at BigML. The add-on will read the Sheet data and download the BigML models or clusters to Google servers, where all the predictions and cluster labelings will be done. Then the add-on will appear as a sidebar in your Google Sheet. In order to authenticate and use the models and clusters in BigML, you will need to provide your credentials, which will be stored and used from then on (you can check your credentials at BigML at any time). Then you will be ready to start using the add-on.
Using the add-on step by step
Just click on BigML > Start under the Add-ons menu to see the sidebar that will show your models or clusters in BigML.
In BigML you can build your resources in a development environment (no cost involved) or in production. You can also organize your resources in projects. The search form in the add-on works in both environments, allowing you to filter your resources by name or project.
To fill the blanks in any column of your Google Sheet you will need a model that can predict the field in this column from the contents of the rest of the columns in the same row. Select from the list the one that best fits your data and click on it. A detailed description of the model will appear in the sidebar, and the model will be ready to use in your Google Sheet.
Finally, select the range of rows that you would like to complete and click the Predict button to see the predictions appear. Note that the columns in your selection are expected to match the fields in the model (listed in the model description on your sidebar).
The blank cells will be filled with the predicted values (which in this case must belong to a list of categories), and a confidence rate that ranges from 0 (no confidence) to 1 (total confidence) will be placed in the last available column. If the column to fill has numerical values in it, the associated model will have a numerical objective field and the predicted values will also be numerical. In that case, a new column will be added to show the associated error for the prediction.
Using clusters works in much the same way. In this case, select the range of rows that contains the instances of data you want to segment and two new columns will be added to your Sheet. The first one will contain the label of the segment that the row belongs to (or centroid name). The second one shows the distance of the data in that row to the centroid, or central point, of that segment. You can check this video to see how the add-on works in basic use cases.
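To make the two cluster columns more concrete, here is a minimal sketch of what "label plus distance to centroid" means. It is not the add-on's implementation: it assumes purely numeric fields and plain Euclidean distance, whereas BigML's own distance computation may scale fields or handle categorical values differently, and the centroid names and coordinates are invented for the example.

```python
from math import dist  # available in Python 3.8+

def nearest_centroid(row, centroids):
    """Return (centroid_name, distance) for the centroid closest to `row`.

    `centroids` maps a centroid name to its coordinates. Plain Euclidean
    distance is an assumption made for this illustration.
    """
    name = min(centroids, key=lambda n: dist(row, centroids[n]))
    return name, dist(row, centroids[name])

# Hypothetical centroids for a two-field customer dataset.
centroids = {
    "Cluster 0": (1.0, 1.0),
    "Cluster 1": (8.0, 9.0),
}

label, distance = nearest_centroid((2.0, 1.0), centroids)
print(label, distance)
```

Each selected row would get the equivalent of `label` in the first new column and `distance` in the second.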
First time BigML-GAS users
To use BigML for the first time, you’ll need to sign up on our website. As a result, you will land in a development environment with some data sources available to help you take your first steps on the platform. Still, to start working with your add-on you will need to either:
- create a model from one of the available sources or your own historical data
- clone an existing model from our model gallery.
In order to create your first models you can upload any local or remote CSV file. When uploading from a public Google Sheet, use its export to CSV feature. The corresponding URL can be pasted in the remote URL form and the data in your Google Sheet will be uploaded to BigML.
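For readers who prefer to build the remote URL by hand, the sketch below assembles a CSV export link for a public Google Sheet. The URL pattern shown is the commonly used Google Sheets export form and is an assumption here, not something taken from this post; `YOUR_SHEET_ID` is a placeholder and `sheet_csv_export_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def sheet_csv_export_url(sheet_id, gid=0):
    """Build a CSV export URL for a public Google Sheet.

    sheet_id: the identifier found in the Sheet's own URL (placeholder here).
    gid: the numeric id of the tab to export; 0 is usually the first tab.
    The URL pattern is an assumption based on the commonly used export form.
    """
    query = urlencode({"format": "csv", "gid": gid})
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?{query}"


print(sheet_csv_export_url("YOUR_SHEET_ID"))
```

The resulting string is what you would paste into the remote URL form.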
The data will be transformed into a source, where each column will be described as a field and its type will be inferred from the uploaded contents. Then it only takes one click to generate a dataset, where all statistical information per field is stored. Select the field that you want to fill with predictions (or objective field) and another click will give you a model for your data. This model will be immediately available in your Google Sheets through the BigML add-on.
You can also search the Gallery of models for a model that fits your data. Remember that BigML will use the first row in your selection or range of data as the header row, and the names of the columns there should match the ones in the model you use to fill the blanks. Once you find a model that suits your needs in the Gallery, you can get it by clicking the label at the top-right corner,
and it will be cloned in your account, ready to use from your Google Sheets through the add-on.
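Since the header row has to line up with the model's fields, a quick check before predicting can save some confusion. The following is a minimal sketch, assuming you have copied the field names from the model description by hand; `missing_model_fields` and the sample field names are hypothetical and not part of the add-on.

```python
def missing_model_fields(header_row, model_fields, objective_field):
    """Return the model's input fields that have no matching column header.

    header_row: the first row of the selected range.
    model_fields: field names listed in the model description.
    objective_field: the field the model predicts (not required as an input).
    """
    headers = {str(h).strip() for h in header_row}
    return [
        field
        for field in model_fields
        if field != objective_field and field not in headers
    ]


# Hypothetical example: the Sheet lacks a "petal width" column.
header_row = ["sepal length", "sepal width", "petal length", "species"]
model_fields = ["sepal length", "sepal width", "petal length", "petal width", "species"]
print(missing_model_fields(header_row, model_fields, "species"))
```

An empty list means every input field named in the model has a matching column in the selection.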
This is all you need to enrich the information in your Google Sheets using BigML’s add-on. Let BigML bring Machine Learning to your Google Sheets!
So far we are very pleased with the welcoming messages and the genuine interest you have expressed in BigML’s new European headquarters in Valencia. We would like to capitalize on this and return the favor by announcing that we will be sponsoring three Machine Learning events in the next two weeks. As we revealed yesterday, Professor Geoff Webb of Monash University, who invented the groundbreaking Association Discovery technology ‘Magnum Opus’, is joining BigML as Technical Advisor. As part of this, he will be traveling to Spain to give two lectures on Scaling Log-linear Analysis to Datasets with Thousands of Variables. The first one will take place at Universitat Politècnica de València on July 8, 2015 at 4 PM. The second lecture on the same topic will be held at the Artificial Intelligence Research Institute (IIIA-CSIC) in Bellaterra, Barcelona on July 14, 2015 at 12 PM.
If you are dealing with very wide datasets, this lecture is for you. Many predictive analytics and data mining use cases qualify these days, thanks to a growing number and volume of public and private data sources as well as advances in feature engineering. The lecture should help you reconsider your previous assumptions and grow more comfortable scaling your solutions without throwing away useful insights simply because of computational constraints. If you are interested, simply stop by, or drop us a note if you would like more details. Please be sure to also follow us on our blog and on Twitter for further updates.
Lastly, we’d like to remind you of our upcoming inaugural Machine Learning Valencia Meetup at Las Naves on July 9, 2015, where we will showcase BigML and get to meet and exchange information with the engineering and machine learning community in the city. We are looking forward to shaking hands and making new connections along the way.
Fresh off the news of the opening of our new European headquarters, we are excited to announce that BigML has completed the acquisition of the groundbreaking Association Discovery software Magnum Opus. First released fifteen years ago, and progressively refined since, Magnum Opus has delivered reliable and actionable insights for retailers, financial institutions and numerous scientific applications, and embodies the state of the art in the field of association discovery. Consequently, this acquisition is a significant step forward in BigML’s vision to build the world’s premier cloud-based Machine Learning platform, featuring carefully curated, highly effective algorithms and data mining techniques that have already proven their mettle on complex real-world predictive analytics problems.
As part of the acquisition, world-renowned expert on Association Discovery and this year’s ACM SIGKDD Sydney Conference program co-chair Geoff Webb has joined BigML as Technical Advisor. Dr. Webb is a Professor of Information Technology Research in the Faculty of Information Technology at Monash University in Melbourne, where he heads the Centre for Data Science. He was editor in chief of the premier data mining journal, Data Mining and Knowledge Discovery, for ten years. He is co-editor of the Springer Encyclopedia of Machine Learning, a member of the advisory board of the Statistical Analysis and Data Mining journal, a member of the editorial board of the Machine Learning journal, and was a foundation member of the editorial board of ACM Transactions on Knowledge Discovery from Data. Dr. Webb is an IEEE Fellow and has received the 2013 IEEE ICDM Service Award and a 2014 Australian Research Council Discovery Outstanding Researcher Award.
Association discovery is one of the most studied tasks in the field of data mining. Stated simply, association mining identifies items that are associated with one another in data. Historically, far more attention has been paid to how to discover associations than to what associations should be discovered. Having observed the shortcomings of the dominant frequent pattern paradigm, Dr. Webb developed the alternative top-k associations approach. Magnum Opus employs the unique k-most-interesting association discovery technique, which allows users to specify what makes an association interesting and how many associations they would like. The available criteria for measuring interest include lift, leverage, strength (also known as confidence), support and coverage. This approach effectively reveals the statistically sound, new and unanticipated core associations in the data, whereas most other association discovery tools find so many spurious associations that it is next to impossible to find useful associations amongst the dross. Association mining complements other statistical data mining techniques in a number of ways as it:
- Avoids the problems due to model selection. Most data mining techniques produce a single global model of the data. A problem with such a strategy is that there will often be many such models, all of which describe the available data equally well. Association mining can find all local models rather than a single global model. This empowers the user to select between alternative models on grounds that are difficult to quantify and therefore hard for a typical statistical system to take into account.
- Scales very effectively to high-dimensional data. The standard statistical approach to categorical association analysis (i.e. log-linear analysis) has complexity that is exponential with respect to the number of variables. In contrast, association mining techniques can typically handle many thousands of variables.
- Concentrates on discovering relationships between values rather than variables. This is a non-trivial distinction. If someone is told that there is an association between gender and some medical condition, they are likely to immediately wish to know which gender is positively associated with the condition and which is not. Association mining goes directly to this question of interest. Further, association between values, rather than variables, can be more powerful (discover weaker relationships) when variables have more than two values.
- Strictly controls the risk of making false discoveries. A serious issue inherent in any attempt to identify associations with classical methods is an extreme risk of false discoveries. These are apparent associations that are in fact only artifacts of the specific sample of data that has been collected. Magnum Opus is the only commercial association discovery software to provide strict statistical control over the risk of making any such errors.
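To make the interest measures named above more concrete (lift, leverage, strength, support and coverage), here is a minimal sketch that computes them for a single rule of the form antecedent → consequent over a toy list of transactions. It uses the standard textbook definitions, which is an assumption; it is not Magnum Opus code and does not include the statistical control over false discoveries. The function name and the sample data are made up for illustration.

```python
def interest_measures(transactions, antecedent, consequent):
    """Compute common interest measures for the rule antecedent -> consequent.

    Standard textbook definitions (an assumption, not Magnum Opus internals):
      coverage = P(antecedent)
      support  = P(antecedent and consequent)
      strength = support / coverage          (also known as confidence)
      lift     = strength / P(consequent)
      leverage = support - P(antecedent) * P(consequent)
    """
    n = len(transactions)
    antecedent, consequent = set(antecedent), set(consequent)
    n_a = sum(1 for t in transactions if antecedent <= set(t))
    n_c = sum(1 for t in transactions if consequent <= set(t))
    n_ac = sum(1 for t in transactions if (antecedent | consequent) <= set(t))

    coverage = n_a / n
    support = n_ac / n
    p_consequent = n_c / n
    strength = support / coverage if coverage else 0.0
    lift = strength / p_consequent if p_consequent else 0.0
    leverage = support - coverage * p_consequent
    return {
        "coverage": coverage,
        "support": support,
        "strength": strength,
        "lift": lift,
        "leverage": leverage,
    }


# Toy data: 8 baskets; bread appears in 4, butter in 4, both together in 3.
transactions = (
    [["bread", "butter"]] * 3 + [["bread"]] + [["butter"]] + [["milk"]] * 3
)
measures = interest_measures(transactions, ["bread"], ["butter"])
print(measures)
```

In this toy data a lift above 1 and a positive leverage both indicate that bread and butter appear together more often than they would if they were independent.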
The BigML product team has already started charting the path to a seamless integration of Magnum Opus capabilities into our platform in 2015. This means that, effective immediately, we will NOT be offering new Magnum Opus licenses or downloads. Existing Magnum Opus licensees will be supported as usual. Additional blog posts, a lecture series by Dr. Webb and more information on the integration timeline will be provided in the coming weeks, so please stay tuned.