Bring Machine Learning to PostgreSQL with BigML

Lately, we’ve been using PostgreSQL quite a lot at BigML, and so have some of our customers. We love the features coming in the next release (which is in beta as I write this), particularly the one that allows creating what they call generated columns. In this post, I’ll explain what these generated columns are and how you can use BigML’s Machine Learning models to fill in any numeric or categorical field in your table.

What is a generated column?

A generated column is a special column that is defined as the result of a computation that involves any of the values of the regular columns in the row. Let’s see an example.

Say you have a table of contacts in your database where you store their first and last names plus their email. You might also want to keep the full name for output purposes, but of course you don’t want that to be a column filled in independently. Here’s where a generated column comes in handy:

CREATE OR REPLACE FUNCTION my_concat(text, VARIADIC "any")
RETURNS TEXT AS 'text_concat_ws' LANGUAGE internal IMMUTABLE;

CREATE TABLE contacts (
    first_name TEXT,
    last_name TEXT,
    full_name TEXT GENERATED ALWAYS AS (
        my_concat(' ',first_name,last_name)) STORED,
    email TEXT);

The full_name column is defined as GENERATED ALWAYS, so you will not be able to insert values into it directly. Instead, the column is automatically filled with the concatenation of first_name and last_name, separated by a blank. (The my_concat wrapper is needed because generation expressions must use immutable functions, and the built-in concat_ws is not declared immutable.)

testdb=# INSERT INTO contacts (first_name, last_name, email)
    VALUES ('John', 'Doe', 'jdoe@gmail.com');
INSERT 0 1
testdb=# SELECT * FROM contacts;
 first_name | last_name | full_name |     email      
------------+-----------+-----------+----------------
 John       | Doe       | John Doe  | jdoe@gmail.com
(1 row)

OK, that’s not bad at all: it both ensures consistency and eases maintenance. However, the information in the table has not increased. The generated column tells us nothing that we didn’t already know. What if we could use Machine Learning to add genuinely new information to our table?

Machine Learning insights

For those of you who are not familiar with Machine Learning: it’s a branch of Artificial Intelligence that has proven very useful to enterprises. The basic idea behind Machine Learning is having computers label things for us when given only a collection of previously labeled examples. The computer uses algorithms to learn from these examples and is then able to predict the label for new incoming cases.

For instance, imagine that you run a telecom. Some of your customers will churn, but which ones? Wouldn’t it be nice to be told who is likely to churn? You could then offer a discount or another incentive to convince them to stay.

That’s one of the many things Machine Learning can do for you. Based on the examples of customers that churned, the computer can learn which patterns lead to churn and predict whether a user calling your customer service line matches any of those patterns and is therefore at risk of churning (for more details, check this post by Chris Mohritz).
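To make the idea concrete, here is a toy sketch of “learning from labeled examples”: a one-nearest-neighbour classifier over two usage figures. The data is invented, and this only illustrates the principle; it is not the algorithm BigML uses.

```python
# A toy "learn from labeled examples" classifier: predict the label
# of the historical example closest to the new case.

def predict(examples, case):
    """Return the label of the labeled example closest to `case`."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, label = min(examples, key=lambda ex: distance(ex[0], case))
    return label

# (total day minutes, total intl calls) -> did the customer churn?
history = [
    ((45.0, 3), "True"),    # heavy daytime usage, few intl calls: churned
    ((50.0, 2), "True"),
    ((55.0, 12), "False"),  # moderate usage, many intl calls: stayed
    ((60.0, 10), "False"),
]

print(predict(history, (47.0, 4)))  # prints: True
```

A real model generalizes far better than this, but the workflow is the same: feed in labeled history, get labels for new cases.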

Back to our example: how could we add that label to our call center table?

Powering tables with AI

In order to predict the likelihood that a customer will churn, Machine Learning algorithms build models. You can learn more about the different types of models and their uses in our videos. As this post is not focused on how to build a model, let’s use an existing model for the telecom churn problem from our model gallery. You can easily clone it into your BigML account for free.

Churn telecom model

Feeding what we know about the user into the model (the total day minutes, voice mail plan, total day charge, total intl minutes, total intl calls, etc.), it will tell us what we don’t know: whether the user is likely to churn. Could we add that information as one more column in our table?

The good news is that PostgreSQL offers extensions that let you define functions in several general-purpose languages. One of them is plpythonu, which you can use to embed Python code in your PostgreSQL functions. Also, BigML offers bindings for several languages (Python included) that know how to use Machine Learning models to create predictions for your input data. Let’s put all of that together in five steps:

  1. Register or log in to BigML so you can use its models.
  2. Clone the model available in BigML’s gallery to your account.
  3. Install the Python bindings.
  4. Create a function to generate the prediction.
  5. Create a table to store your input data and the generated column.

Step 1 is easily done using the signup form at bigml.com, which asks for your email and some basic information. Then you can follow the link to the model and click the buy link to copy it (cloning models from the gallery is free). From that moment on, you’ll be able to use the model to make predictions. At this point, the model is stored in your private environment on BigML’s servers.

The next step is installing the bindings, which provide the classes that can download the model to your local computer and use the information therein to predict the churn output for each set of inputs. The details of how to install them can be found in the bindings documentation, but basically it comes down to using pip:

pip install bigml

Now it’s time to define the PostgreSQL function that will predict whether the customer is going to churn:

CREATE OR REPLACE FUNCTION predict_churn(total_day_minutes REAL,
                                         voice_mail_plan TEXT,
                                         total_day_charge REAL,
                                         total_intl_minutes REAL,
                                         total_intl_calls REAL)
          RETURNS text
          AS $$
            from bigml.model import Model
            from bigml.api import BigML

            # ------ user's data -------------- #
            model_id = "model/52bc7fd03c1920e4a3000016" # model ID
            username = "my_username" # replace with your username
            api_key = "*****" # replace with your API key
            # ---------------------------------- #
            local_model = Model(model_id,
                                api = BigML(username, api_key,
                                            storage='./storage'))
            return local_model.predict( \
                {"total day minutes": total_day_minutes,
                 "voice mail plan": voice_mail_plan,
                 "total day charge": total_day_charge,
                 "total intl minutes": total_intl_minutes,
                 "total intl calls": total_intl_calls})
          $$ LANGUAGE plpythonu immutable;

The Python code uses the ID of the model, which can be retrieved from BigML’s dashboard, and your credentials (username and API key). With those, the Model class downloads the model information to your computer the first time the function is called. The model is stored in a ./storage folder, and from then on this local copy is used to make predictions. To use the function, we just need to create a table with a generated column, as before:
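The download-once-then-cache behavior can be sketched in plain Python as follows. This is an illustrative stand-in for what `Model(..., storage='./storage')` does internally, not BigML’s actual code; the `fetch` argument is a hypothetical placeholder for the API download.

```python
import json
import os

def load_model(model_id, storage, fetch):
    """Return the model JSON, downloading it only on the first call.

    `fetch` stands in for the BigML API download; later calls read
    the copy cached under `storage` instead of hitting the API.
    """
    os.makedirs(storage, exist_ok=True)
    path = os.path.join(storage, model_id.replace("/", "_") + ".json")
    if not os.path.exists(path):
        with open(path, "w") as f:
            json.dump(fetch(model_id), f)
    with open(path) as f:
        return json.load(f)
```

In other words, only the very first prediction pays the cost of a network round trip; every later one reads the local copy.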

CREATE TABLE churn (
    total_day_minutes REAL,
    voice_mail_plan TEXT,
    total_day_charge REAL,
    total_intl_minutes REAL,
    total_intl_calls REAL,
    churn_prediction TEXT GENERATED ALWAYS AS
        (predict_churn(total_day_minutes,
                       voice_mail_plan,
                       total_day_charge, 
                       total_intl_minutes,
                       total_intl_calls)) STORED);

And voilà! Next time a customer calls your call center, insert their information into the table:

INSERT INTO churn (total_day_minutes,
                   voice_mail_plan,
                   total_day_charge,
                   total_intl_minutes,
                   total_intl_calls)
VALUES (45,'yes',55,120,3);

INSERT INTO churn (total_day_minutes,
                   voice_mail_plan,
                   total_day_charge,
                   total_intl_minutes,
                   total_intl_calls)
VALUES (55,'no',50,100,12);

The model will immediately add the churn prediction:

SELECT churn_prediction FROM churn;
 churn_prediction 
------------------
 False
 True
(2 rows)
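Conceptually, every INSERT on the churn table does the following, sketched here in plain Python. The `stub_predict` function is a hypothetical threshold rule standing in for the BigML model; in the real table the value comes from local_model.predict.

```python
# Sketch of generated-column semantics: the caller never supplies
# churn_prediction; it is computed from the other columns on insert.

def make_insert(predict):
    """Build an insert function that fills in the generated column."""
    table = []
    def insert(row):
        row = dict(row)
        row["churn_prediction"] = predict(row)  # the "generated" part
        table.append(row)
        return table
    return insert

def stub_predict(row):
    # hypothetical stand-in rule, not the real model
    return "True" if row["total_day_minutes"] > 50 else "False"

insert = make_insert(stub_predict)
insert({"total_day_minutes": 45.0, "voice_mail_plan": "yes"})
rows = insert({"total_day_minutes": 55.0, "voice_mail_plan": "no"})
```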

Can’t wait to use it!

For those of you who want to try this right away (remember that generated columns only arrive in the next PostgreSQL release), there’s an alternative: using triggers. A trigger is a function that is called when a row is inserted, updated, or deleted. Triggers can be attached to tables so that tasks are performed before or after one of these designated events takes place.

To mimic our example, we can create a regular table with plain columns:

CREATE TABLE plain_churn (
    total_day_minutes REAL,
    voice_mail_plan TEXT,
    total_day_charge REAL,
    total_intl_minutes REAL,
    total_intl_calls REAL,
    churn_prediction TEXT);

and then add a trigger on insert and update, so that the content of churn_prediction is computed as the prediction based on the rest of the columns:

CREATE OR REPLACE FUNCTION predict_churn_trg()
          RETURNS TRIGGER
          AS $$
            from bigml.model import Model
            from bigml.api import BigML
            # ------ user's data -------------- #
            model_id = "model/52bc7fd03c1920e4a3000016" # model ID
            username = "my_username" # replace with your username
            api_key = "*****" # replace with your API key
            # ---------------------------------- #
            local_model = Model(model_id,
                                api = BigML(username, api_key,
                                            storage='./storage'))
            new_values = TD["new"] # values to be stored
            new_values["churn_prediction"] = local_model.predict( \
                {"total day minutes": new_values["total_day_minutes"],
                 "voice mail plan": new_values["voice_mail_plan"],
                 "total day charge": new_values["total_day_charge"],
                 "total intl minutes": new_values["total_intl_minutes"],
                 "total intl calls": new_values["total_intl_calls"]})
            return "MODIFY"
          $$ LANGUAGE plpythonu;
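Notice the only bookkeeping the trigger body does: PostgreSQL column names use underscores, while the model’s field names use spaces. That translation could be factored out into a small helper (a hypothetical convenience, not part of the bindings):

```python
def row_to_input_data(row):
    """Map a TD["new"]-style row (underscored column names) to the
    model's input data (field names with spaces)."""
    return {name.replace("_", " "): value for name, value in row.items()}

row_to_input_data({"total_day_minutes": 55.0, "voice_mail_plan": "no"})
# -> {"total day minutes": 55.0, "voice mail plan": "no"}
```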

By attaching the trigger to the table, the churn_prediction column will also be generated automatically whenever the rest of the values change:

CREATE TRIGGER churn_trg
BEFORE INSERT or UPDATE ON plain_churn
FOR EACH ROW
EXECUTE PROCEDURE predict_churn_trg();

So we are ready to go!

INSERT INTO plain_churn (total_day_minutes,
                         voice_mail_plan,
                         total_day_charge,
                         total_intl_minutes,
                         total_intl_calls)
VALUES (55,'no',50,100,12);
SELECT churn_prediction FROM plain_churn;
 churn_prediction 
------------------
 True
(1 row)

Cool, right? Let us know how your experience goes, and meanwhile, happy predicting!
