In the first part of this series, we saw how to insert machine-learned model predictions into a webpage on-the-fly using a content script browser extension which grabs the relevant pieces of the webpage and feeds them into a BigML actionable model. At the conclusion of that post, I noted that actionable models were very straightforward to use, but their static nature would lead to lots of effort in maintaining the browser extension if the underlying model were constantly updated with new data. In this post, we’ll see how to use BigML’s powerful API to both keep your models up to date, and write a browser extension that stays current with your latest models.
Staying up to date
In keeping with our previous post, we will be working on a model to predict the repayment status of loans on the microfinance site Kiva. In the last post, we used a model that was trained nearly one year ago on a snapshot of the Kiva database. Hundreds of new loans become available every day, which means this model is starting to look a bit long in the tooth. We’ll start off by learning a fresh model from the latest Kiva database snapshot, and in doing so, we’ll get to use some of the new features we’ve added to BigML over the last year, such as objective weighting and text analysis. However, the Kiva database snapshot is several gigabytes of JSON-encoded data, much of which won’t be relevant to our model, so we really don’t want to relearn from a new snapshot several months down the line when our hot new model becomes stale. Fortunately, we can avoid this situation using multi-datasets. Periodically, say once a month, we’ll grab the newest loan data from Kiva and build a BigML dataset. With multi-datasets, we can take our monthly dataset, our original snapshot dataset, and all the previous monthly datasets, concatenate them together and learn a model from the whole thing. With the BigML API, this all happens behind the scenes, and all we need to do is supply a list of the individual datasets. We can do all of this with a little Python script. Here is the main loop:
Every dataset created by this script will have “kiva-data” as its name, so we can use the BigML API to list the available datasets and filter by that name. If none show up, then we know we need to create a base dataset from a Kiva snapshot; otherwise, we’ll use the Kiva API to create an incremental update dataset. In either case, we then proceed to create a new model using multi-datasets. All we need to do is pass a list of our desired dataset resource IDs. We assign the name “kiva-model” to the model so that it can be easily found by our browser extension. We employ a few other minor tricks in our script, such as avoiding throttling in the Kiva API. You can check out the whole thing at this handy git repo.
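To make the flow above concrete, here is a minimal sketch of that update logic. It is not the script from the repo: `InMemoryClient` is a hypothetical stand-in for a BigML client, and `load_snapshot` and `load_recent_loans` are made-up callables for the two Kiva data sources. Only the names “kiva-data” and “kiva-model” come from the post.

```python
class InMemoryClient:
    """Hypothetical stand-in for a BigML client (illustration only)."""

    def __init__(self):
        self.datasets = []  # (dataset_id, name, rows)
        self.models = []    # (model_id, name, dataset_ids)

    def list_dataset_ids(self, name):
        # Mimics "list the available datasets and filter by name".
        return [d[0] for d in self.datasets if d[1] == name]

    def create_dataset(self, rows, name):
        dataset_id = "dataset/%d" % (len(self.datasets) + 1)
        self.datasets.append((dataset_id, name, list(rows)))
        return dataset_id

    def create_model(self, dataset_ids, name):
        # A multi-dataset model is created from a list of dataset IDs.
        model_id = "model/%d" % (len(self.models) + 1)
        self.models.append((model_id, name, list(dataset_ids)))
        return model_id


def refresh_model(client, load_snapshot, load_recent_loans):
    """Create a base or incremental dataset, then a model over all datasets."""
    existing = client.list_dataset_ids("kiva-data")
    # No datasets yet: start from the full snapshot. Otherwise: fetch updates.
    rows = load_recent_loans() if existing else load_snapshot()
    new_dataset = client.create_dataset(rows, "kiva-data")
    return client.create_model(existing + [new_dataset], "kiva-model")
```

Run once a month, each call adds one more dataset to the list the model is learned from, so the expensive snapshot import only ever happens on the first run.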
API-powered browser extension
Our first browser extension was a content script that fired whenever we navigated to a Kiva loan page. It would grab the loan information, feed it to the model, and then use jQuery to insert a status indicator into the webpage’s DOM tree. Our new extension won’t be very different with regards to loan data scraping and DOM manipulation; the main difference will be how the model predictions are generated. Whereas our first extension featured an actionable model, i.e. a series of nested if-else statements, our new extension will perform predictions with a live BigML model, using the REST-ful API. To interface with the model, our extension now needs to know about our BigML credentials. In order to store the BigML credentials, we create a simple configuration page for the extension, consisting of two input boxes and a Save button.
Whenever we save a new set of credentials, we want to look up the most recent Kiva loans model found in that BigML account. However, this isn’t the only context in which we want to fetch models. For instance, it would be a good idea to get the latest model every time the web browser is started up. We’ll implement our model fetching procedure inside an event page, so it can be accessed from any part of the extension.
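The extension itself is written in JavaScript, but the lookup logic is easy to sketch in Python. The snippet below picks the newest model named “kiva-model” from a model listing. The response shape is an assumption on my part: a dictionary with an `objects` list whose entries carry `name`, `created` (an ISO-style timestamp) and `resource` (the model ID).

```python
def latest_model_id(listing, name="kiva-model"):
    """Return the resource ID of the newest model with the given name.

    Assumes `listing` looks like {"objects": [{"name": ..., "created": ...,
    "resource": ...}, ...]}; that shape is an assumption for illustration.
    """
    candidates = [m for m in listing.get("objects", []) if m.get("name") == name]
    if not candidates:
        return None
    # ISO-style timestamps in the same format sort correctly as strings.
    newest = max(candidates, key=lambda m: m["created"])
    return newest["resource"]
```

Returning `None` when nothing matches lets the caller decide what to do when no model has been trained yet, for example by showing a hint on the configuration page.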
At the bottom of the event page, we see that it listens for the browser startup event, and for any messages with the greeting “fetchmodel”, to fire the model fetching procedure. We also see that it listens for the onInstalled event to open up the configuration page for the first time. With this extra infrastructure in place, we are ready to make our modifications to the content script. Here is the script in its entirety:
Compared to the previous version of the extension, the bulk of the differences lie in the predictStatus function. Rather than evaluating a hard-coded nested if-then-else structure, the loan data is POSTed to the BigML prediction URL with AJAX, and most of the DOM manipulation has been refactored to occur in the AJAX callback function. One perk we get from using a live model is that we get a confidence value along with our prediction of the loan status. We’ve used that to add a nice little meter alongside our status indicator icon.
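For readers who prefer Python to JavaScript, the request and the confidence meter can be sketched as follows. This is not the extension’s code. The endpoint, the `username`/`api_key` query string, and the `model`, `input_data`, `output` and `confidence` field names are my assumptions about the prediction API, and the text meter is simply a stand-in for the graphical one.

```python
import json

# Assumed prediction endpoint; check the API documentation before relying on it.
BIGML_PREDICTION_URL = "https://bigml.io/prediction"


def build_prediction_request(username, api_key, model_id, loan):
    """Build the URL, headers and JSON body for a prediction POST."""
    url = "%s?username=%s;api_key=%s" % (BIGML_PREDICTION_URL, username, api_key)
    headers = {"Content-Type": "application/json"}
    body = json.dumps({"model": model_id, "input_data": loan})
    return url, headers, body


def summarize_prediction(response, meter_width=10):
    """Turn a prediction response into (status, confidence, text meter)."""
    status = response["output"]
    confidence = float(response["confidence"])
    filled = round(confidence * meter_width)
    meter = "#" * filled + "-" * (meter_width - filled)
    return status, confidence, meter
```

Keeping the request building separate from the response handling mirrors the callback structure described above: the POST is fired, and everything that touches the page happens once the response arrives.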
Better browsing through big data
You can grab the source code for this Chrome browser extension here, where it is also available as a Greasemonkey user script for Firefox. We hope that this post, and its prequel, have been able to show the relative ease with which BigML models can be incorporated into content scripts and browser extensions. Beyond this kiva.org example, there are a myriad of potential applications on the internet where machine learned models can be used to provide a richer and more informed web browsing experience. It’s now up to you to use what you’ve learned here to go out and realize those applications!
I recently came across a 1995 Newsweek article titled “Why the Web Won’t be Nirvana” in which author and astronomer Cliff Stoll posited the following:
“How about electronic publishing? Try reading a book on disc. At best, it’s an unpleasant chore: the myopic glow of a clunky computer replaces the friendly pages of a book. And you can’t tote that laptop to the beach. Yet Nicholas Negroponte, director of the MIT Media Lab, predicts that we’ll soon buy books and newspapers straight over the Internet. Uh, sure.”
Well, it turns out that Mr. Stoll was slightly off the mark (not to mention his bearish predictions on e-commerce and virtual communities). Electronic books have been a revelation, with the Kindle format being far and away the most popular. In fact, over 30% of books are now purchased and read electronically. And Kindle readers are diligent about providing reviews and ratings for the books that they consume (and of course the helpful prod at the end of each book doesn’t hurt).
So that got us thinking: are there hidden factors in a Kindle book’s data that are impacting its rating? Luckily, import.io makes it easy to grab data for analysis, and we did exactly that: pulling down over 58,000 Kindle reviews which we could quickly import into BigML for more detailed analysis.
My premise going into the analysis was that author and words in the book’s description, along with length of book, would have the greatest impact on the number of stars that a book receives in its rating. Let’s see what I found out after putting this premise (and the data) to the machine learning test via BigML…
We uploaded over 58,000 Kindle reviews, capturing URL, title, author, price, save (whether or not the book was saved), pages, text description, size, publisher, language, text-to-speech enabled (y/n), x-ray enabled (y/n), lending enabled (y/n), number of reviews and stars (the rating).
This data source includes text, numeric and categorical fields. To optimize the text processing for authors, I selected “Full Terms only” as I don’t think that first names would have any bearing on the results.
I then created a dataset with all of the fields, and from this view, I can see that several of the key fields have some missing values:
Since I am most interested in seeing how the book descriptions impact the model, I decided to filter my dataset so that only the instances that contain descriptions would be included. BigML makes it easy to do this by simply selecting “filter dataset”:
and from there I can choose which fields to filter, and how I’d like them to be filtered. In this case I selected “if value isn’t missing” so that the filtered dataset will only include instances where those fields have complete values:
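If you wanted to reproduce the “if value isn’t missing” filter outside the dashboard, a rough equivalent in plain Python might look like this. The row format (a list of dictionaries) and the treatment of empty strings as missing are my assumptions, not a description of how BigML implements the filter.

```python
def has_value(value):
    """Treat None and blank strings as missing (an assumption)."""
    return value is not None and str(value).strip() != ""


def filter_complete(rows, required_fields):
    """Keep only rows where every required field has a value."""
    return [
        row for row in rows
        if all(has_value(row.get(field)) for field in required_fields)
    ]
```

Calling `filter_complete(rows, ["description"])` would keep only the books that have a description, which is the same idea as the filtered dataset described here.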
And just like that, I now have a new dataset with roughly 50,000 rows of data called “Amazon Kindle – all descriptions” (you can clone this dataset here). I then take a quick look at the tag cloud for description, which is always interesting:
In the above image we see generic book-oriented terms like “book,” “author,” “story” and the like coming up most frequently – but we also see terms like “American,” “secret,” and “relationship” which may end up influencing ratings.
Building my model:
My typical approach is to build a model and see if there are any interesting patterns or findings. If there are, I’ll then go back and do a training/test split on my dataset so I can evaluate the strength of said model. For my model, I tried various iterations of the data (this is where BigML’s subscriptions are really handy!). I’ll spare you the gory details of my iterations, but for the final model I used the following fields: price, pages, description, lending, and number of reviews. You can clone the model into your own dashboard here.
What we immediately see in looking at the tree is a big split at the top, based on description, with the key word being “god”. By hovering over the nodes immediately following the root node, we see that any book whose description contains “god” has a predicted rating of 4.46 stars:
while those without “god” in the description have a rating of 4.27:
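The intuition behind that root split can be sketched by comparing the average rating of books whose description mentions a term against those that don’t. This is only an illustration of the idea, not how the tree is actually learned, and the `description` and `stars` keys are assumed field names.

```python
import re
from statistics import mean


def split_means(rows, term):
    """Average stars for rows with and without `term` in the description.

    Returns (mean_with, mean_without); a side with no rows yields None.
    """
    with_term, without_term = [], []
    for row in rows:
        words = set(re.findall(r"[a-z]+", row["description"].lower()))
        (with_term if term in words else without_term).append(row["stars"])
    return (
        mean(with_term) if with_term else None,
        mean(without_term) if without_term else None,
    )
```

On the real dataset, the two numbers for “god” would correspond to the predictions at the two nodes just below the root.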
Going back to the whole tree, I selected the “Frequent Interesting Patterns” option in order to quickly see which patterns in the model are most relevant, and in the picture below we see six branches with confident, frequent predictions:
The highest predicted value is on the far right (zoomed below), where we predict 4.57 stars for a book that contains “god” in the description (but not “novel” or “mystery”), costs more than $3.45 and has lending enabled.
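Read as a rule, that branch is just a conjunction of conditions. Here is a small sketch of it as a Python predicate. The thresholds and terms come from the branch described above, while the dictionary keys (`description`, `price`, `lending`) are assumed names for the corresponding fields.

```python
import re


def matches_top_branch(book):
    """True if a book satisfies the conditions of the 4.57-star branch."""
    words = set(re.findall(r"[a-z]+", book["description"].lower()))
    return bool(
        "god" in words
        and not ({"novel", "mystery"} & words)  # neither term may appear
        and book["price"] > 3.45
        and book["lending"]
    )
```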
Conversely, the lowest predicted rating is for a book whose description does not contain “god,” “practical,” “novel” or several other terms, and which is over 377 pages, cannot be loaned, and costs between $8.42 and $11.53:
Looking through the rest of the tree, you can find other interesting splits on terms like “inspired” and “practical” as well as the number of pages that a book contains.
Okay, so let’s evaluate
Evaluating your model is an important step as data most certainly can lie (or mislead), so it is critical to test your model to see how strong it truly is. BigML makes this easy: with a single step I can create a training/test split (80/20), which will enable me to build a model with the same parameters from the training set, and then evaluate that against the 20% hold-out set. (You can read more about BigML’s approach to evaluations here.)
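For anyone curious what an 80/20 split amounts to, here is a minimal sketch in plain Python. It is not BigML’s implementation; the shuffling approach and the fixed seed are my own choices to make the example reproducible.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]
```

The important property is that the test rows are never seen during training, so the evaluation measures how the model handles new books rather than how well it memorized old ones.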
The results are as follows:
You can see that we have some lift over mean-based or random-based decisioning, albeit somewhat moderate. Just for kicks, I decided to see how a 100-model Ensemble would perform, and as you’ll see below we have improvement across the board:
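To give a feel for what “lift over mean-based decisioning” means, the sketch below computes two common regression measures, mean absolute error and R-squared, for a model’s predictions and for a baseline that always predicts the average. The numbers in the usage note are toy values I made up, not results from the Kindle model.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def r_squared(actual, predicted):
    """1 minus the ratio of residual error to variance around the mean."""
    avg = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - avg) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def mean_baseline(actual):
    """Predictions from a 'model' that always answers with the average."""
    avg = mean(actual)
    return [avg] * len(actual)
```

With toy ratings `[4.0, 5.0, 3.0, 4.0]` and model predictions `[4.0, 4.5, 3.5, 4.0]`, the model’s error is 0.25 stars against 0.5 for the baseline, and its R-squared is 0.75 against 0 for the baseline. A model with no lift would score about the same as the baseline on both.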
This was largely a fun exercise, but it demonstrates that machine learning and decision trees can be informative beyond the predictive models that they create. By simply mousing through the decision tree I was able to uncover a variety of insights on what datapoints lend themselves to positive or less positive Kindle book reviews. From a practical standpoint, a publisher could build a similar model and factor it into its decision-making before green-lighting a book.
Of course my other takeaway is that if I want to write a highly-rated Kindle title on Amazon, it’s going to have to have something to do with God and inspiration.
This is a guest post by Andy Thurai (@i). Andy has held multiple technology and business development leadership roles with large enterprise companies including Intel, IBM, BMC, CSC, Netegrity and Nortel for the past 20+ years. He’s a strategic advisor to BigML.
Now that I got your attention about Germany’s unfair advantage in the World Cup, I want to talk about how they used analytics to their advantage to win the World Cup—in a legal way.
I know the first thing that comes to everyone’s mind talking about unfair advantage is either performance-enhancing drugs (baseball & cycling) or SpyCam (football, NFL kind). Being a Patriots fan, it hurts to even write about SpyCam, but there are ways a similar edge can be gained without recording the opposing coaches’ signals or play calling.
It looks like Germany did a similar thing, legally, and had a virtual 12th man on the field all the time. For those who don’t follow football (the soccer kind) closely, it is played with 11 players on the field.
So much has been spoken about Big Data, Analytics and Machine Learning from the technology standpoint. But the World Cup provided us all with an outstanding use case on the application of those technologies.
SAP (a German company) collaborated with the German Football Association to create Match Insights analytics software, dubbed the ultimate solution for football/soccer. This co-innovation project allows teams not only to analyze their own players, but also to learn about their opponents in order to create a game plan. The goal of the co-innovation project between SAP and the coaches of the German national football team was to build an innovative solution that enhanced on-field performance leading up to and winning the World Cup.
Considering that in 10 minutes, 10 players who touch the ball can produce as high as 7 million data points, imagine how many data points will be created in one game. When you analyze multiple games played by a specific opponent, your tape study and napkin cheat sheets won’t be that effective anymore. By being able to effectively harness insights from this massive collection of data points, the German team had a leg up heading into the World Cup.
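As a back-of-the-envelope answer to that question, we can scale the quoted figure up to a full match. This assumes the upper-bound rate of 7 million data points per 10 minutes holds constant over 90 minutes of regulation play, which is my simplification and not a figure from SAP.

```python
POINTS_PER_WINDOW = 7_000_000  # quoted upper bound for a 10-minute window
WINDOW_MINUTES = 10
MATCH_MINUTES = 90             # regulation time only (assumption)


def data_points_per_match(points=POINTS_PER_WINDOW,
                          window=WINDOW_MINUTES,
                          match=MATCH_MINUTES):
    """Scale a per-window count linearly to a full match."""
    return points * match // window


print(data_points_per_match())  # 63000000
```

That is on the order of 63 million data points for a single match, before you even start stacking up multiple games per opponent.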
The proof was in the pudding when Germany used this program to thump Brazil 7-1 in the World Cup and then went on to triumph in the finals against Argentina. In their 7-1 win against Brazil, during a 3 minute stretch in the first half, 3 goals were scored while Brazil owned the ball for 52% of the time during that period. If you watched the game, Germany’s teamwork was apparent; but what was not apparent is that the software provided information on the possession time, when to pass, whom to hold the ball against vs. whom to pass against, etc. The Defensive Shadow analysis portion of the software shows teams exactly how to beat the opponent’s defensive setup based on a specific opponent alignment and movement.
In a recent article, Sophie Curtis of the Telegraph explains a tactic used by the Germans, which was to reduce their average possession time from 3.4 seconds in the 2010 World Cup to 1.1 seconds in 2014. This not only confused the defenders, but also left them uncertain whom to defend and quickly tired them out as they chased their opponents around.
If you watched the match closely, the passes and ball advancement all seemed to be executed with clinical precision. I thought they were just well coached or figured it out during the game, but apparently they had outside help which decided the win – before they ever set foot on the field. While the same advantage can be gained by watching tapes (or videos), the software’s uncanny prediction ability goes far beyond traditional mechanisms by converting all game data and player skills to actionable directives that can be implemented on the field. I don’t think you can make a more powerful statement than thumping a popular, soccer crazy host nation which fielded a decent football team. I doubt even the legendary Pele would have done any better against Big Data analytics.
The Match Insights tool is exclusive to the German team right now, but according to SAP they have plans to sell it more broadly in the future. In the meantime, it is wise to choose Germany if you are betting with your friends.
Big Data is generally a massive collection of data points. Most companies that I deal with think just by collecting data, securing and storing it for later analysis, they are doing “Big Data.” However, they are missing the key element of creating real-time actionable intelligence so they can make decisions on the fly about their processes. Most companies either don’t realize this, or are not set up to do this. This is where companies like BigML can add a lot of value. BigML is a machine learning company (best of breed in my mind) which helps you do exactly that.
For example, in a recent blog post, Ravi Kankanala of Xtream IT Labs talks about how they were able to predict opening weekend box office revenues for a new movie using BigML’s predictive modeling. The point that caught my attention was that they made 200,000 predictions with 90%+ accuracy. This means their clients can look at these results and decide the best time to release their movie (day of the week, day of the month and the month) rather than treating it as a guessing game. This insight also helps studios segment and concentrate their marketing campaigns to improve box office results.
In a blog post published 3 days before the Super Bowl (on Jan 31, 2014), Andrew Shikiar used BigML to predict a Seattle win with an uncanny 76% accuracy. In the same article he also predicted Seattle covering the spread with 72% accuracy. But if you were watching the pundits on ESPN, they were split 50-50 even an hour before the game, with every one of them yapping about how the best offense ever was going against a solid Seattle defense, so the result was unpredictable.
While Moneyball, and Billy Beane, may have introduced the public to this concept of manipulating a sport based on statistical analysis, that approach was based on how overall statistical numbers can be applied to a team’s roster composition. But these precision statistics by BigML can help teams adjust every pitch (as in baseball), every throw (as in football), and every pass (as in soccer).
And the goodness doesn’t stop there. The BigML team is busy developing technology that underpins a growing number of predictive initiatives, including:
- Churn rate analysis: predicting the signs that customers are thinking of discontinuing subscription services, and the best way to target them with exclusive campaigns, promotions, or incentives to retain them before they think of switching.
- Predicting fraudster behavior amongst apparently normal customers.
- Predicting future life events based on changing shopping patterns, or suggesting different shopping patterns based on life event/style changes.
The key takeaway here is that you can do more than just collect data and store it: with the right strategy and software you can get meaningful insights. If you gain an edge over your competitors, you can win your business World Cup too.
It sure pays to have an edge!
This is a guest post by Ravi Teja Kankanala, Senior Technical Lead at Xtream IT Labs (XIT Labs).
At XIT Labs, the innovation division of Xtream IT People Inc, we use machine learning and big data to more accurately predict whether any movie is likely to be successful at the box office upon its release. We have developed a tool which we call Trailer Momentum Score that makes use of BigML.
From Raw Data to Opening Weekend Prediction
We sought to construct a model which would provide a prediction for the opening weekend box office takings of any given movie by using a raw dataset. Before using BigML, we tried other approaches by applying algorithms to the raw data, which certainly made improvements, but we found that our predictions tended to be off too often. After being introduced to BigML, we found that not only was the prediction process greatly simplified, but the prediction error was also greatly reduced. While using BigML, we’ve been able to request fixes and improvements in the service from the support team. Francisco Martin, the CEO of BigML, made sure the issues we raised were immediately fixed so we were able to successfully move into production.
Data and Results
The data-gathering process began in September 2013, pulling from across various social media platforms and data feeds for the movie industry. Unfortunately, we cannot elaborate on the types of structured and unstructured data we collected and processed because of confidentiality agreements with our clients. The salient point is that we have used BigML machine-learned models to make more than 200,000 predictions so far, which generally had 90%+ accuracy. These results helped the client to decide optimal timing and segmentation for marketing campaigns in order to improve their box office results.
How Has BigML Made a Difference?
BigML is very simple and intuitive. From the very first moment that we started using it, we were able to understand how we could leverage the tool. BigML has a very simple user interface which makes it easy to construct models, which in turn made our data scientists’ lives much simpler. We are also currently using BigML to perform sentiment analysis for various retailers and food chains across the world. The end results using BigML speak for themselves.
In the few months since we announced a collaboration and technical partnership with Tableau, we’ve heard from many users about how they’re benefiting from the ability to incorporate and visualize predictive models built using BigML within their Tableau environments. We’ve also built some videos that show you exactly how easy it is to use this killer functionality.
On July 30 Tableau is hosting a webinar where an analyst from interactive firm AKQA will discuss how he’s used both BigML and Tableau to gain predictive insight for clients’ media and marketing campaigns.
Learn more and register here today!
The next generation of business analysts will be required to not just understand advanced analytics, but to utilize them as part of their job function. While this traditionally has been impossible due to the complexity and cost of advanced statistical packages, BigML now makes it very viable. And we’re very excited by the fact that many universities are bringing BigML into their classrooms as part of their curricula — thereby helping educate tomorrow’s business leaders on the power and benefit of predictive modeling.
Two such universities that are using BigML are the University of Georgia and the University of Maryland University College (UMUC).
Creating Analytics Bulldogs at University of Georgia
Dr. Hugh Watson is a Professor of MIS at University of Georgia’s Terry College of Business, and has a deep professional and academic history of working with statistics and analytics. Dr. Watson has taught various classes on Information Systems for over thirty years, so he has an excellent perspective on how IT tools have evolved, as have the expectations of employers for matriculating students.
Starting with this year’s Spring Semester, Dr. Watson brought BigML into his undergraduate class on Business Intelligence. The typical incoming student has familiarity with information systems and data models, but likely hasn’t utilized machine learning and/or built advanced predictive models. Over the course of the semester, Dr. Watson used BigML along with related BI tools such as MicroStrategy and Tableau. For the BigML portion of the class, the UGA students walked through several use cases based on models and datasets published in BigML’s gallery, with the aim of learning how to build models and gleaning predictive insights from data. By the end of the class, Watson’s students were able to complete a project that required the students to build their own models and run predictions in BigML.
Draft quote from Hugh Watson: “A deep understanding of data systems and analysis is becoming an imperative in today’s job market, which is why MIS is a key focus area for our program at the Terry College of Business. BigML helped my students understand how historical data can be utilized to predict future outcomes–all in a fun and easy-to-understand workflow.”
BigML helping power Master’s Degree in Data Analytics at UMUC
Dr. Steve Knode is a professor at UMUC’s Graduate School who specializes in instruction in emerging technologies, decision support systems and artificial intelligence. UMUC offers a Master’s Degree in Data Analytics, which has seen a huge spike in interest over the past few years as more and more professionals in the workforce are seeing the benefit of bolstering their analytics skill sets so as to enhance their productivity and marketability.
Dr. Knode is using BigML as part of UMUC’s Decision Management Systems class with a focus on combining decision making with the use of analytics tools. BigML’s ease of use makes it a great fit for the course as students can spend time implementing decision models rather than trying to learn a new software package. For Dr. Knode’s class, students uploaded data that was generated from decision requirements models and quickly were able to perform a variety of predictive analyses. In future semesters Dr. Knode plans to leverage BigML’s newest features, including our cluster analysis functionality.
Draft quote from Steve Knode: “Students in UMUC’s Master’s Degree in Data Analytics are typically full-time professionals who want to learn and leverage the latest tools and methodologies for making data-driven decisions and analyses. BigML has been a hit with our students as it clearly shows them predictive relationships in data, and its ease-of-use is a huge benefit to students and instructors alike. I suspect that many of my students will utilize BigML in their jobs in the immediate future.”
Bring BigML into your classroom today!
Aside from these business-oriented programs, BigML’s API and wealth of underlying functionality make it ideal for the next generation of developers and computer scientists.
In addition to BigML’s ease of use for undergraduate and graduate students alike, BigML has the added benefit of being a hosted solution, meaning that students can simply open a web browser or command line shell rather than having to install software on disparate systems. BigML works with professors to grant their students complimentary subscriptions, and is also available to help out with appropriate courseware, lesson plans and syllabi.
If you’re interested in bringing BigML into your classroom, please contact us today!
My name is Dean Hudson and I’m the President of EngageHi², a healthcare IT solutions provider and service delivery partner to BigML.
The phrase “Transforming Healthcare” looms as large as “Big Data” or “Business Intelligence” in the healthcare industry. For years the industry has been throwing around jargon and buzzwords to drive awareness, marketing and sales. We have all heard about the three V’s of Big Data; oops, wait, now there are four! When we talk about transforming healthcare by providing insight into data, more is involved than just volume, veracity, velocity, and variety. Let’s define what I feel the missing V really is: Value!
The true value of your data, the insight delivered from that data, and the evidence-based decisions derived from that data is what will transform healthcare! Everyone seems to be focusing on the glitz and buzzwords but not the true building blocks. Those building blocks are the data quality, governance, provenance, and data model – along with other key enablers such as the EKW (enterprise knowledge warehouse). Once we have successfully ironed out the building blocks we can then move to a critical step that achieves true Value from data: the predictive model.
Critics in the marketplace have been less than enthusiastic as to the value of predictive modeling in healthcare. Some feel that it has minimal impact on improving quality of care, reducing costs, and improving population health. In my opinion, however, it will offer the greatest value and impact for improving outcomes and driving patient engagement. And if you look at what has occurred with IBM’s Watson and related innovations, I think it’s fair to say that the industry is finally beginning to gravitate towards the predictive and prescriptive as powered by machine learning. Let’s explore why…
Machine learning and predictive modeling will change how we build insight and relationships in healthcare. Predictive analytics is the use of artificial intelligence modeling, statistics, and pattern detection to sift through mounds of data, identify behavior patterns, and use these patterns to gain new insights. The barrier to adopting predictive technologies in healthcare has been cost and complexity – two things that BigML’s machine learning platform uniquely addresses. As someone who has been providing advanced IT solutions to the healthcare industry for 20+ years, I can state unequivocally that tools like BigML stand to revolutionize the way that health networks, hospitals and clinics run their businesses, and – most importantly – provide superior patient care.
EngageHi² is proud to partner with BigML to provide the real V (value!) to healthcare organizations of all shapes and sizes. We’re actively working on implementations that will leverage BigML’s intuitive workflow and rich visualizations for solutions ranging from patient diagnoses to readmission predictions to quality of care analysis. Healthcare will transform and outcomes will improve, but the industry requires solutions that accurately deliver the value, and not just buzzwords!
My name is Rahul Desai and I’m the CEO and co-founder of Trendify, a meta-startup that uses machine learning and big data to more reliably determine whether any given startup will succeed or not. I’d like to recount the Trendify story, and elaborate on where BigML fits in.
From News Analytics to Startup Prediction
At first, I wanted to create a news analytics platform that could predict stock activity through social sentiment analysis. Not long after, I gave up on it because of my non-tech background, a seemingly insurmountable obstacle. After getting a job at a local startup, Encore Alert, I sought mentorship from the CEO there. He knew what I was trying to do and sent me some interesting leads (iSentium social sentiment, and a neural network for stock-picking), at which point I realized people were doing stock prediction but no one was doing startup prediction. We focus on startup prediction because it’s a fairly open market and one where risk management tools have the potential for incredible impact. By helping both investors and entrepreneurs, we can bring some incredible technologies to life that otherwise might not see the light of day.
In February, I began gathering open data from various sources on the internet, building a set of 10,000 data points regarding 130 companies: founder, company, and funding data. Although I’d love to elaborate more on the types of data I’ve collected, we feel that we owe it to our clients and potential clients to protect their privacy and the confidentiality of their data. However, in broad strokes, I can mention that we do an extremely thorough data collection that matches or even surpasses Bloomberg in its scope, drawing from news, social media, and business databases.
A Predictive Model of Startup Success
After collecting this data, I ended up building a model on my own; this is where I used BigML. They have an easy-to-use, gorgeous interface that’s also incredibly powerful. We created an ensemble using training data regarding 65 companies. After training and testing, this model was able to predict the eventual success of Dell, Beats, and Box, as well as the failure of Fisker, with only the data that would be available during the first few years of operation. That initial model was built just to prove that we can accomplish our mission.
In the near future, we intend to create an ensemble around our new dataset: 1,000,000+ data points spanning thousands of companies. At that scale, we can show statistical significance. The output of this model will be directly actionable for our clients, indicating success/failure with confidence levels, and offering a print-out of the most contentious factors that led to a particular decision. After our beta launch later this year, we’re going to integrate real-time analytics so that investors and entrepreneurs can monitor milestones’ effects on their companies. With easily usable platforms like BigML, companies like Trendify can be viable. Team Trendify is very thankful to BigML for helping us prove that it’s actually possible to do what we set out to, and we look forward to continuing our relationship.
The other day, I was showing off the Kiva model from BigML’s gallery to my wife. I got the comment that while it’s super easy to do the predictions in the BigML dashboard, it would be even better if the model results could appear directly in the Kiva loan page, without needing to flip between browser tabs. This got me thinking: this sounds like a job for a browser extension. I had never created a browser extension before, and it struck me as an intriguing project. Turns out, injecting predictions into the Kiva webpages was a piece of cake, thanks to BigML’s feature-ful API and export capabilities. In this blog post, I’ll walk you through two versions of the BigML-powered browser extension: the first using actionable models, and the second using the BigML API to grab predictions from a live model.
Kiva is a micro-financing website which connects lenders to individuals and groups in developing countries who are seeking loans to improve their quality of life. Historical records about the loans created throughout the site’s history can be accessed via Kiva’s RESTful API. The vast majority of Kiva loans are successfully repaid, but using data gathered through the Kiva API, we can train a BigML model to identify the circumstances which are more likely to result in a defaulted loan. This model was discussed at length in a previous blog post. Loans which are currently in the fundraising stage can also be queried by the Kiva API, and the resulting response can be fed to the BigML model to give a prediction of whether the final status of the loan will be defaulted. Our goal is to create a browser extension which runs while viewing the page for an individual loan or the list of loans, and which will insert some sort of indicator for the predicted outcome of the loan.
Laying the Groundwork
We’ll be creating a Chrome browser extension here, but the same code could easily be ported to a Greasemonkey script for Firefox. The Chrome developer guide can tell you all about building extensions, but hopefully I can communicate everything you need to know in this tutorial. Also, the source code for this project can be found on GitHub. Feel free to grab a copy and follow along.
Every Chrome extension starts with a JSON manifest file, which gives the browser some info about what kind of extension we have, and which files it will need access to. In our case, the manifest is pretty short and sweet. Here are its contents in their entirety:
The first four items are just metadata about the extension. The next item is the more interesting bit. Chrome extensions are split into multiple categories depending on their behavior. We want an extension that modifies the content of a webpage, so we need to create a content script. Within the definition of the content script, we give a match pattern to specify the URLs at which we want the script to run. This pattern matches both the pages for viewing individual loans and the pages for browsing a list of loans. Next we state the scripts and stylesheets which comprise the extension. Note that I’ve bundled jQuery with the extension, as I lean on it for DOM manipulation and AJAX calls. The last item in the manifest specifies that the extension will need to access some image files located in the given directory.
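To make that description concrete, here is a rough sketch of a manifest with the same shape, written as a JavaScript object and serialized to the JSON text that would live in manifest.json. It is not the manifest from this project: the extension name, file names, image directory, and URL pattern are all placeholders, and the layout assumes Manifest V2 (newer Chrome versions use Manifest V3, where `web_accessible_resources` has a different shape).

```javascript
// Hypothetical manifest for a content-script extension (Manifest V2 layout).
// Every name, path, and pattern below is an assumption for illustration.
const manifest = {
  // Four metadata items describing the extension.
  name: "Kiva loan predictions (example)",
  version: "0.1",
  manifest_version: 2,
  description: "Shows a predicted status next to each Kiva loan.",

  // The content script: where it runs and which files it is made of.
  content_scripts: [
    {
      // A match pattern (not a regular expression) covering loan pages.
      matches: ["*://www.kiva.org/lend*"],
      js: ["jquery.min.js", "kiva-predict.js"],
      css: ["kiva-predict.css"]
    }
  ],

  // Image files that the injected markup is allowed to load.
  web_accessible_resources: ["images/*"]
};

// Serialize to the JSON text that would be saved as manifest.json.
const manifestJson = JSON.stringify(manifest, null, 2);
console.log(manifestJson);
```

Listing jQuery before the extension’s own script matters, because Chrome injects the files of a content script in the order they appear in the `js` array.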
With the manifest squared away, we can move on to writing the script.
One of the strengths of the classification and regression tree models created by BigML is that they can be readily represented by a series of nested if statements. This is precisely what we get when we export a model as an actionable model.
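As a rough illustration of what such nested if statements look like, here is a tiny hand-written stand-in. It is not an actual BigML export: the function name `examplePredictStatus`, the field names, the thresholds, and the `"paid"` label are all invented for this sketch, and a real exported model would be far deeper.

```javascript
// Illustrative stand-in for an exported decision tree.
// Field names, thresholds, and labels are made up, not taken from a real model.
function examplePredictStatus(loan) {
  if (loan.loanAmount > 1000) {
    if (loan.repaymentTermMonths > 14) {
      return "defaulted";
    }
    return "paid";
  }
  if (loan.sector === "Retail") {
    if (loan.borrowerCount === 1) {
      return "defaulted";
    }
    return "paid";
  }
  return "paid";
}

// Each call walks exactly one path from the root of the tree to a leaf.
console.log(examplePredictStatus({ loanAmount: 1500, repaymentTermMonths: 20 }));
```

Because the whole model is plain conditionals, it needs no library and no network access at prediction time, which is what makes it easy to drop into a content script.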
The actionable model accepts as parameters some data about a loan and returns the predicted status as a string. Our job now is to find the loan data that the model needs to do its thing. Kiva loans all have a unique ID number, so we’ll create a function which looks up a particular loan ID with the Kiva API, and uses the returned information to make a prediction, and create a status indicator which we will insert into the loan page’s DOM.
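One way to structure that lookup is sketched below. Treat the details as assumptions: the endpoint format, the `loans` array in the response, the CSS class names, and every function name here are illustrative rather than copied from the extension’s source. The JSON fetcher and the model are passed in as arguments so that the logic can be exercised without a browser.

```javascript
// Assumed endpoint format for looking up a single loan by ID.
function kivaLoanUrl(loanId) {
  return "https://api.kivaws.org/v1/loans/" + encodeURIComponent(loanId) + ".json";
}

// Build the markup for a status indicator; class names are placeholders.
function indicatorHtml(status) {
  const cls = status === "defaulted"
    ? "bigml-indicator bigml-risk"
    : "bigml-indicator bigml-ok";
  return '<span class="' + cls + '" title="Predicted status: ' + status + '"></span>';
}

// Assumes the API response wraps the loan record in a `loans` array.
function indicatorForResponse(response, predict) {
  const loan = response.loans[0];
  return indicatorHtml(predict(loan));
}

// fetchJson: any function that takes a URL and returns a promise of parsed JSON.
// predict: the model function, taking a loan record and returning a status string.
async function buildIndicator(loanId, fetchJson, predict) {
  const response = await fetchJson(kivaLoanUrl(loanId));
  return indicatorForResponse(response, predict);
}
```

In a content script, `fetchJson` could be something like `(url) => fetch(url).then((r) => r.json())`, and the returned markup could then be inserted into the page with jQuery.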
Finally, we need to differentiate the script’s behavior between pages for viewing an individual loan and pages for browsing a list of loans. The URLs for individual loan pages end with the loan ID, which we can extract with a regular expression. In the list view, each item in the list contains a hyperlink to the individual loan page; we can grab that link with jQuery and again get the loan ID from the destination URL. From that point, it’s just a simple matter of calling predictStatus and inserting our indicator next to the “Lend” button.
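The ID extraction might look something like the following sketch. The `/lend/<id>` URL shape is an assumption about how Kiva’s loan pages are addressed, and the helper names are hypothetical.

```javascript
// Pull the numeric loan ID off the end of a loan page URL.
// Assumes URLs of the form .../lend/<id>, optionally followed by a slash,
// a query string, or a fragment. Returns null when no ID is present.
function extractLoanId(url) {
  const match = /\/lend\/(\d+)\/?(?:[?#].*)?$/.exec(url);
  return match ? match[1] : null;
}

// For the list view: given the href of every link in a list item,
// keep only the IDs of links that point at individual loan pages.
function loanIdsFromLinks(hrefs) {
  return hrefs.map((href) => extractLoanId(href)).filter((id) => id !== null);
}
```

On an individual loan page the function would be called with `window.location.href`; in the list view, the hrefs would first be collected from each list item’s links.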
Running the extension
To run the extension, you must first install it through Chrome’s extension configuration screen. The easiest way to get there is to navigate to chrome://extensions/. Once there, ensure that Developer mode is selected, then click “Load unpacked extension”, select the directory which contains your manifest.json, and you’re good to go. If all goes according to plan, you will now see a green or red indicator icon beside every “Lend” button on kiva.org.
If you decide to do any tinkering with the extension’s source code, you will need to reload the extension to see the effect of your changes.
Coming up: Using the BigML API
Using actionable models is arguably the easiest way to include BigML models in a browser extension, but having the model baked into the script can be inconvenient if the model is frequently changing. New loans are constantly being posted on Kiva, so new data is always available through the Kiva API. With BigML’s multi-dataset capabilities, we can continuously refine our model with a growing body of training data. Keeping our browser extension up to date with our model building efforts would involve pasting in a new version of predictStatus every time we create a new model. In the next installment of this tutorial, I’ll show how we can use BigML’s RESTful API to ensure that our extension is always using the freshest models. Stay tuned!
Dropbox has become the de facto mechanism to store and transfer large files. In fact, it’s amazing how many customers send us Dropbox links for the purpose of analyzing their data. Now in BigML you can do all of this in an integrated fashion by simply granting BigML permission to access your Dropbox files (this permission can be revoked at any time).
Once you’ve granted BigML access, you can browse your Dropbox account for files that you’d like to bring into BigML for machine learning analyses. BigML only enables the download of file types that it can process, so don’t worry about accidentally asking BigML to ingest your pictures, movies and presentations!
See how easy it is to tap into this new functionality through the video below:
We hope that you enjoy this new capability!