
A REST API for Principal Component Analysis

As part of our PCA release, we have released a series of blog posts, including a use case and a demonstration of the BigML Dashboard. In this installment, we shift our focus to implementing Principal Component Analysis with the BigML REST API. PCA is a powerful data transformation technique and unsupervised Machine Learning method that can be used for data visualization and dimensionality reduction.

pca-workflow

Authentication

The first step in any BigML workflow using the API is setting up authentication. In order to proceed, you must first set the BIGML_USERNAME and BIGML_API_KEY environment variables, available in the user account page. Once authentication is successfully set up, you can begin executing the rest of this workflow.

export BIGML_USERNAME=my_name
export BIGML_API_KEY=13245
export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY;"

Create a Source

Data sources can be uploaded to BigML in many different ways, so this step should be appropriately adapted to your data with the help of the API documentation. Here we will create our data source using a local file downloaded from Kaggle.

curl "https://bigml.io/source?$BIGML_AUTH" -F file=@mobile.csv

This particular dataset has a target variable called “price_range”. Using the API we can update the field type easily.

curl "https://bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
  -X PUT \
  -d '{"fields": {"price_range": {"optype": "categorical"}}}' \
  -H 'content-type: application/json'

Create Datasets

In BigML, sources need to be processed into datasets.

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"source": "source/4f603fe203ce89bb2d000000"}'

Because we will want to be able to evaluate our model trained using PCA-derived features, we need to split the dataset into a training and test set. Here we will allocate 80% for training and 20% for testing, indicated by the “sample_rate” parameter.

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_dataset": "dataset/59c153eab95b3905a3000054",
  "sample_rate": 0.8,
  "seed": "myseed"}'

By setting the “out_of_bag” parameter to true, we select all the rows that were not chosen when creating the training set, in order to obtain an independent test set.

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_dataset": "dataset/59c153eab95b3905a3000054",
  "sample_rate": 0.8,
  "out_of_bag": true,
  "seed": "myseed"}'

Create a PCA

Our datasets are now prepared for PCA. The Principal Components obtained from PCA are linear combinations of the original variables. If the data is going to be used for supervised learning at a later point, it is critical not to include the target variable in the PCA; otherwise, information about the target would end up encoded in the covariate fields. As such, we create a PCA using all fields except “price_range”, either by excluding it with the “excluded_fields” parameter or, as below, by passing an explicit “input_fields” list.

curl "https://bigml.io/pca?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/59c153eab95b3905a3000054",
  "input_fields": [list everything but price_range]}'

Create Batch Projections

Next up, utilize the newly created PCA resource to perform a PCA batch projection on both the train and test sets. Ensure that all PCs are added as fields to both newly created datasets.

curl "https://bigml.io/batchprojection?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"pca": "pca/5423625af0a5ea3eea000028",
  "dataset": "dataset/59c153eab95b3905a3000054",
  "all_fields": true,
  "output_dataset": true}'

Train a Classifier

After that, using the training set to train a logistic regression model that predicts the “price_range” class is very straightforward.

curl "https://bigml.io/logisticregression?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/5af59f9cc7736e6b33005697",
  "objective_field":"price_range"}'

Model Evaluation

Once ready, evaluate the model using the test set. BigML will provide multiple classification metrics, some of which may be more relevant than others for your use case.

curl "https:/bigml.io/evaluation?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/5af5a69cb95b39787700036f",
  "logisticregression": "logisticregression/5af5af5db95b3978820001e0"}'

Want to know more about PCA?

Our final blog posts for this release will include additional tutorials on how to automate PCAs with WhizzML and the BigML Python bindings. For further questions and reading, please remember to visit the release page and join the release webinar on Thursday, December 20, 2018. Attendance is FREE of charge, but space is limited so register soon!

Principal Component Analysis with the BigML Dashboard: Easy as 1-2-3!

The BigML Team is bringing Principal Component Analysis (PCA) to the BigML platform on December 20, 2018. As explained in our introductory post, PCA is an unsupervised learning technique that can be used in different scenarios such as feature transformation, dimensionality reduction, and exploratory data analysis. In a nutshell, PCA transforms a dataset defined by possibly correlated variables into a set of uncorrelated variables, called principal components.

In this post, we will take you through the five necessary steps to upload your data, create a dataset, create a PCA, analyze the results and finally make projections using the BigML Dashboard.

pca-workflow

We will use the training data of the Fashion MNIST dataset from Kaggle, which contains 60,000 of Zalando’s fashion article images representing 10 different classes of products. Our main goal is to use PCA to reduce the dimensionality of this dataset, which has 784 fields containing pixel data, in order to build a supervised model that predicts the right product category for each image.

fashion-mnist-dataset.png

1. Upload your Data

Start by uploading your data to your BigML account. BigML offers several ways to do so: you can drag and drop a local file, connect BigML to your cloud repository (e.g., S3 buckets) or copy and paste a URL. In this case, we download the dataset from Kaggle and just drag and drop the file.

BigML automatically identifies the field types. We have 784 input fields containing pixel data, correctly set as numeric by BigML. The objective field “label” is also identified as numeric because it contains digit values from 0 to 9. However, we need to convert this field into categorical, since each digit represents a product category rather than a continuous numeric value. We can easily do this by clicking on the “Configure source” option shown in the image below.

configure-source.png

2. Create a Dataset

From your source view, use the 1-click dataset menu option to create a dataset, a structured version of your data ready to be used by a Machine Learning algorithm.

In the dataset view, you will be able to see a summary of your field values, some basic statistics, and the field histograms to analyze your data distributions. You can see that our dataset has a total of 60,000 instances, where each of the 10 classes in the objective field has 6,000 instances.

fashion-mnist-dataset2.png

3. Create a PCA

Before creating the PCA we need to split our dataset into two subsets: 80% for training and 20% for testing. This is because our main goal in building a PCA is to reduce our data dimensionality in order to build a supervised model that predicts the product categories afterwards. If we used the full dataset to build the PCA and then split the resulting dataset into train and test subsets to build the supervised model, we would be introducing data leakage, i.e., the training set would contain information from the test set. However, this split wouldn’t be necessary if we wanted to use PCA for other purposes such as data exploration.

split-dataset.png

Next, we take the 80% training set to create the PCA. You can use the 1-click PCA menu option, which will create the model using the default parameter values, or you can adjust the parameters using the PCA configuration option. Another important thing to consider at this point is that we need to exclude our objective field from the PCA creation to avoid another possible data leakage scenario. Otherwise, we will be mixing information about the objective field into the principal components that we will use as predictors for our supervised model.

create-pca.png

BigML provides the following parameters to configure your PCA:

  • Standardize: allows you to automatically scale numeric fields to a 0-1 range. Standardizing implies assigning equal importance to all the fields regardless of whether they are on the same scale. If fields do not have the same scale and you create a PCA with non-standardized fields, it is often the case that each principal component is dominated by a single field. Thus, BigML enables this parameter by default.
  • Default numeric value: PCA can include missing numeric values as valid values. However, there can be situations in which you don’t want to include them in your model. For those cases, you can easily replace missing numeric values with the field’s mean, median, maximum, minimum, or with zero.
  • Sampling: sometimes you don’t need all the data contained in your dataset to generate your PCA. If you have a very large dataset, sampling may very well be a good way to get faster results. (These options can also be set programmatically; a sketch follows the screenshot below.)

configure-pca.png
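
If you prefer to set these options through the API, they map to PCA creation arguments. The snippet below is only a sketch with the Python bindings: the argument names (“standardize”, “default_numeric_value”, “sample_rate”) and the dataset id are assumptions to be checked against the PCA section of the API documentation.

# Hedged sketch: configuring a PCA through the BigML Python bindings.
# The argument names mirror the Dashboard options above and are assumptions;
# check the PCA API documentation for the exact spelling. The id is a placeholder.
from bigml.api import BigML

api = BigML()
training_dataset = "dataset/000000000000000000000000"  # placeholder id

pca = api.create_pca(training_dataset, {
    "standardize": True,              # scale numeric fields (enabled by default)
    "default_numeric_value": "mean",  # replace missing numeric values with the mean
    "sample_rate": 0.5,               # build the PCA on a 50% sample for speed
})
api.ok(pca)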

4. Analyze your PCA Results

When your PCA is created, you will be able to visualize the results in the PCA view, which is composed of two main parts: the principal component list and the scree plot.

  • The principal component list allows you to see the components created by the PCA (up to 200). Each principal component is a linear combination of the original variables and is orthogonal to all other components, and the components are ordered according to the variance they explain. The variance of each component indicates how much of the total variability in the data is explained by that component. In this list view, you can also see the original field weights associated with each component, which indicate each field’s influence on that component.

principal-components-list.png

  • The scree plot helps you to graphically see the amount of variance explained by a given subset of components. It can be used to select the subset of components to create a new dataset either by setting a threshold for the cumulative variance or by limiting the total number of components using the slider shown in the image below. Unfortunately, there is not an objective way to decide the optimal number of components for a given cumulative variance. This depends on the data and the problem you are looking to solve so be sure to apply your best judgment given your knowledge of the context.

scree-plot.png

5. Create Projections

PCA models can be used to project the same or new data points onto a new set of axes defined by the principal components. In this case, we want to make projections for our two subsets (the 80% for training and the 20% for testing) so we can replace the original fields with the components calculated by our PCA to create and evaluate a supervised model.

Create a Dataset from the PCA view

If you want to get the components for the same dataset that you used to create the PCA, you can use the “Create dataset” button that BigML provides in the PCA view. This option is a shortcut that creates a batch projection behind the scenes. For our 80% subset, we are using this faster option. We can see in the scree plot that selecting around 300 components (out of the 784 total) using the slider shown in the image below gives us more than 80% of the cumulative variance, which seems large enough to create a new dataset without losing much information from the original data.

create-dataset

After the dataset is created, we can find it listed on our Dashboard. The new dataset will include the original fields used to create the PCA and the new principal components, limited according to the threshold we set.

training-dataset-with-components.png

Create a Batch Projection

If you want to use a different dataset than the one used to create the PCA, then you need to take the long path and click on the “Batch projection” option. We are using this option for our 20% subset. The step-by-step process is explained below.

1. Click on the option “Batch projection” from the PCA view.

batch-projection.png

2. Select the dataset you want to use and configure the batch projection if needed. In this case, we are selecting the 20% test subset and limiting the number of components returned to 300 by using the slider shown in the image below (as we did with the training set before).

batch-projection-limits.png

We can also choose whether to remove the original fields. In this case, we are keeping them, since we want to use the same 80% and 20% subsets to build and evaluate two different supervised models: one with the original fields and another one with the components.

3. When the batch projection is created, you can find the new dataset containing the components in your dataset list view.

test-dataset-with-components.png

Final Results

Using our reduced dataset with the 300 components, we create a logistic regression to predict the product categories. We also create another logistic regression that uses the original 784 fields that contained the pixel data so we can compare both models’ performances.

logistic-regression-with-components.png

When we evaluate them, we can observe that the performance of the 300-component model (f-measure=0.8449) is almost exactly the same as that of the model that used all of the original fields (f-measure=0.8484), despite the fact that we only used ~40% of the original fields. This simple step allows us to reduce model complexity considerably, in turn decreasing training and prediction times.

results-comparison.png

Want to know more about PCA?

If you would like to learn more about Principal Component Analysis and see it in action on the BigML Dashboard, please reserve your spot for our upcoming release webinar on Thursday, December 20, 2018. Attendance is FREE of charge, but space is limited so register soon!

Applying Dimensionality Reduction with PCA to Cancer Data

Principal Component Analysis (PCA) is a powerful and well-established data transformation method that can be used for data visualization, dimensionality reduction, and possibly improved performance with supervised learning tasks. In this use case blog post, we examine a dataset consisting of measurements of benign and malignant tumors, which are computed from digital images of a fine needle aspirate of breast mass tissue. Specifically, these 30 variables describe characteristics of the cell nuclei present in the images, such as texture, symmetry, and radius.

Data Exploration

The first step in applying PCA to this dataset is to see if we can more easily visualize the separation between the malignant and benign classes in two dimensions. To do this, we first divide our dataset into train and test sets and perform the PCA using only the training data. Although this step can be considered feature engineering, it is important to conduct the train-test split prior to performing PCA: the transformation takes into account variability in the whole dataset, so fitting it on everything would let information from the test set leak into the training data.
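
To make that ordering concrete, here is a hedged sketch with the BigML Python bindings; the dataset id is a placeholder, and the objective field name “diagnosis” as well as the use of excluded_fields for PCAs are assumptions.

# Hedged sketch: split first, then fit the PCA on the training split only,
# so no information from the test rows leaks into the transformation.
# The dataset id is a placeholder; "diagnosis" is the assumed objective field.
from bigml.api import BigML

api = BigML()
full_dataset = "dataset/000000000000000000000000"  # placeholder id

train = api.create_dataset(full_dataset, {"sample_rate": 0.8, "seed": "wdbc"})
test = api.create_dataset(full_dataset, {"sample_rate": 0.8, "seed": "wdbc",
                                         "out_of_bag": True})
api.ok(train)
api.ok(test)

pca = api.create_pca(train, {"excluded_fields": ["diagnosis"]})
api.ok(pca)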

pca_scree.png

The resulting PCA resource lists the top Principal Components (PC), sorted by their Percent Variation Explained. In this example, we can see that PC1 accounts for 45.12% of the total variance in the dataset, and the top 7 PCs alone account for approximately 90% of the Percent Variation Explained. We can further explore which of the original fields are contributing the most to the various PCs by inspecting the bar chart provided on the left. The three greatest contributors to PC1 turn out to be “concave points mean”, “concavity mean”, and “concave points worst”.  Based on this information, we can begin to conclude that features related to concavity are highly variable and possibly discriminative.

components.png

A PCA transformation yields new variables which are linear combinations of the original fields. The major advantages of this transformation are that the new fields are not correlated with one another and that each successive Principal Component seeks to capture the maximum remaining variance in the dataset under the constraint of being orthogonal to the previous components. By maximizing variance and decorrelating the features, PCA-created variables can often perform better in supervised learning, especially with model types that have higher levels of bias. However, this comes at the expense of overall interpretability. Although we can always inspect the contributions of our original variables to the PCA fields, a linear combination of 30 variables will always be less straightforward than simply inspecting the original variable.
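
In standard PCA notation (not specific to BigML’s implementation), each component is a weighted sum of the original fields, with the weights chosen to maximize variance subject to orthogonality, and the variance shares behind a scree plot follow directly from the component variances:

z_j = \mathbf{w}_j^{\top}\mathbf{x}, \qquad \mathbf{w}_j = \underset{\lVert\mathbf{w}\rVert=1,\ \mathbf{w}\perp\mathbf{w}_1,\dots,\mathbf{w}_{j-1}}{\arg\max}\ \operatorname{Var}(\mathbf{w}^{\top}\mathbf{x})

\mathrm{PVE}_j = \frac{\lambda_j}{\sum_{i=1}^{p}\lambda_i}, \qquad \text{cumulative } \mathrm{PVE}(k) = \sum_{j=1}^{k}\mathrm{PVE}_j

where \lambda_j = \operatorname{Var}(z_j) is the variance of the j-th component and p is the number of original fields (30 in this dataset).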

Data Visualization

scatterplot.png

After plotting the dataset according to only the top 2 Principal Components (PC1 and PC2) and coloring each data point by diagnosis (benign or malignant), we can already see a considerable separation between the classes. What makes this result impressive is that we used no knowledge of our target variable when creating or selecting these Principal Component fields. We simply created fields that explained the most variance in the dataset, and they also turned out to have enormous discriminatory power.

Predictive Modeling

Finally, we can evaluate how well our Principal Component fields work as inputs to a logistic regression classifier. For our evaluation, we trained logistic regression models with identical hyperparameters (L2 regularization, c=1.0, bias term included) using 4 different sets of variables (a sketch of the corresponding API call follows the list):

  • All 30 original variables
  • All Principal Components
  • Top 7 PCs (90% PVE)
  • Top 2 PCs only
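
Here is a hedged sketch of how one of these models could be created through the API with matching settings; the argument names (“regularization”, “c”, “bias”), the dataset id, the objective field name, and the projected field names (“PC1”, “PC2”) are assumptions to be checked against the API documentation.

# Hedged sketch: a logistic regression trained on the top 2 principal
# components only. The other three variants change just the input_fields list.
# Ids, field names, and argument spellings are assumptions.
from bigml.api import BigML

api = BigML()
projected_train = "dataset/000000000000000000000000"  # placeholder: training set with PCs appended

top_2_pcs = api.create_logistic_regression(projected_train, {
    "objective_field": "diagnosis",   # assumed objective field name
    "regularization": "l2",           # L2 regularization, as in the text
    "c": 1.0,
    "bias": True,
    "input_fields": ["PC1", "PC2"],   # assumed names of the projected fields
})
api.ok(top_2_pcs)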

The results are visualized in the Receiver Operating Characteristic (ROC) curve below, with a malignant diagnosis serving as the positive class and the curves sorted by Area Under the Curve (AUC). In general, all of the models in this example performed very well, with rather subtle differences in performance, even though the number of input variables varied widely. Most notably, a model using only two variables (PC1 and PC2) performed with an AUC of >0.97, very close to the top performing model with AUC >0.99.

roc_auc.png

In the spirit of Occam’s Razor, it is often advantageous to utilize a simpler model whenever possible. PCA-based dimensionality reduction is one method that enables models to be built with far fewer features while maintaining most of the relevant informational content. As such, we invite you to explore the new PCA feature with your own datasets, both for exploratory visualization tasks and as a preprocessing step.

Want to know more about PCA?

If you would like to learn more about Principal Component Analysis and see it in action on the BigML platform, please reserve your spot for our upcoming release webinar on Thursday, December 20, 2018. Attendance is FREE of charge, but space is limited so register soon!

Introduction to Principal Component Analysis: Dimensionality Reduction Made Easy

BigML’s upcoming release on Thursday, December 20, 2018, will be presenting our latest resource on the platform: Principal Component Analysis (PCA). In this post, we’ll do a quick introduction to PCA before we move on to the remainder of our series of 6 blog posts (including this one) to give you a detailed perspective of what’s behind the new capabilities. Today’s post explains the basic concepts, which will be followed by an example use case. Then, there will be three more blog posts focused on how to use PCA through the BigML Dashboard, API, and WhizzML for automation. Finally, we will complete this series of posts with a technical view of how PCAs work behind the scenes.

Understanding Principal Component Analysis

Many datasets in fields as varied as bioinformatics, quantitative finance, portfolio analysis, or signal processing can contain an extremely large number of variables that may be highly correlated, resulting in sub-optimal Machine Learning performance. Principal Component Analysis (PCA) is one technique that can be used to transform such a dataset in order to obtain uncorrelated features, or as a first step in dimensionality reduction.

PCA Introduction

Because PCA transforms the variables in a dataset without accounting for a target variable, it can be considered an unsupervised Machine Learning method suitable for exploratory data analysis of complex datasets. When used for dimensionality reduction, it also helps reduce supervised model overfitting, since fewer relationships between variables remain to be considered after the process. To support this, the principal components yielded by a PCA transformation are ordered by the amount of variance each explains in the original dataset, so the practitioner can decide how many of the new component features to eliminate from a dataset while preserving most of the original information contained in it.

Even though they are all grouped under the same umbrella term (PCA), under the hood, BigML’s implementation incorporates multiple factor analysis techniques, rather than only the standard PCA implementation. Specifically,

  • Principal Component Analysis (PCA): BigML utilizes this option if the input dataset contains only numeric data.
  • Multiple Correspondence Analysis (MCA): this option is applied if the input dataset contains only categorical data.
  • Factorial Analysis of Mixed Data (FAMD): this option is applied if the input dataset contains both numeric and categorical fields.

In the case of items and text fields, data is processed using a bag-of-words approach allowing PCA to be applied. Because of this nuanced approach, BigML can handle categorical, text, and items fields in addition to numerical data in an automatic fashion that does not require manual intervention by the end user.
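
As a purely illustrative example of the bag-of-words idea (plain Python, not BigML’s actual tokenizer), a text field becomes a set of numeric term-count columns, which is the kind of representation these factor analysis techniques can then operate on:

# Illustrative only: turning a text field into numeric bag-of-words columns.
from collections import Counter

docs = ["unclean or degraded floors",
        "unclean walls or ceilings"]

vocabulary = sorted({term for doc in docs for term in doc.split()})
rows = [[Counter(doc.split()).get(term, 0) for term in vocabulary]
        for doc in docs]

print(vocabulary)  # one numeric column per term
print(rows)        # term counts per instance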

Principal Component Analysis (PCA)

Want to know more about PCA?

If you would like to learn more about Principal Component Analysis and see it in action on the BigML Dashboard, please reserve your spot for our upcoming release webinar on Thursday, December 20, 2018. Attendance is FREE of charge, but space is limited so register soon!

Principal Component Analysis (PCA): Dimensionality Reduction!

The new BigML release is here! Join us on Thursday, December 20, 2018, at 10:00 AM PST (Portland, Oregon. GMT -08:00) / 07:00 PM CET (Valencia, Spain. GMT +01:00) for a FREE live webinar to discover the latest addition to the BigML platform. We will be showcasing Principal Component Analysis (PCA), a key unsupervised Machine Learning technique used to transform a given dataset in order to yield uncorrelated features and reduce dimensionality. PCA is most commonly applied in fields with high dimensional data including bioinformatics, quantitative finance, and signal processing, among others.

 

Principal Component Analysis (PCA), available on the BigML Dashboard, API and WhizzML for automation as of December 20, 2018, is a statistical technique that transforms a dataset defined by possibly correlated variables (whose noise negatively affects the performance of your model) into a set of uncorrelated variables, called principal components. This technique is used as the first step in dimensionality reduction, especially for those datasets with a large number of variables, which helps improve the performance of supervised models due to noise reduction. As such, PCA can be used in any industry vertical as a preprocessing technique in the data preparation phase of your Machine Learning projects.

 

BigML PCA is distinct from other implementations of the PCA algorithm in that our Machine Learning platform lets you transform many different data types in an automatic fashion that does not require manual configuration. That is, BigML’s unique approach can handle numeric and non-numeric data types, including text, categorical, and items fields, as well as combinations of different data types. To do so, BigML PCA incorporates multiple factor analysis techniques, specifically Multiple Correspondence Analysis (MCA) if the input contains only categorical data, and Factorial Analysis of Mixed Data (FAMD) if the input contains both numeric and categorical fields.

 

When we work with high dimensional datasets, we often have the challenge of extracting the discriminative information in the data while removing those fields that only add noise and make it difficult for the algorithm to achieve the expected performance. PCA is ideal for these situations. While a PCA transformation maintains the dimensions of the original dataset, it is typically applied with the goal of dimensionality reduction. Reducing the dimensions of the feature space is one method to help reduce supervised model overfitting, as there are fewer relationships between variables to consider. The principal components yielded by a PCA transformation are ordered by the amount of variance each explains in the original dataset. Plots of the cumulative variance explained, also known as scree plots, are one way to interpret appropriate thresholds for how many of the new features can be eliminated from a dataset while preserving most of the original information.

Want to know more about PCA?

Please join our free, live webinar on Thursday, December 20, 2018, at 10:00 AM PT.  Register today as space is limited! Stay tuned for our next 6 blog posts that will gradually present PCA and how to benefit from it using the BigML platform.

Note: In response to user inquiries, we are including links here to the datasets featured in the two images above showing the BigML Dashboard. For the first, we filtered a subset of AirBNB data fields available on Inside AirBNB, and for the second, the Arrhythmia diagnosis dataset is available on the BigML Gallery. We hope you enjoy exploring the data on your own!

Preparing Data for Machine Learning with BigML

At BigML we’re well aware that data preparation and feature engineering are key steps for the success of any Machine Learning project. A myriad of splendid tools can be used for the data massaging needed before modeling. However, in order to simplify the iterative process that leads from the available original data to a ML-ready dataset, our platform has recently added more data transformation capabilities. By using SQL statements, you can now aggregate, remove duplicates, join and merge your existing fields to create new features. Combining these new abilities with Flatline, the existing transformation language, and the platform’s out-of-the-box automation and scalability will help greatly to solve any real Machine Learning problem.

Data Transformations with BigML

The data: San Francisco Restaurants

Some time ago we wrote a post describing the kind of transformations needed to go from a bunch of CSV files, containing information about the inspections of some restaurants and food businesses in San Francisco, to a dataset ready for Machine Learning. The data was published by San Francisco’s Department of Public Health and was structured in four different files:

  • businesses.csv: a list of restaurants or businesses in the city.
  • inspections.csv: inspections of some of the previous businesses.
  • violations.csv: law violations detected in some of the previous inspections.
  • ScoreLegend.csv: a legend to describe score ranges.

The post described how to build a dataset that could be used for Machine Learning using MySQL. Let’s now compare how you could do the same using BigML’s newly added transformations.

Uploading the data

As explained in the post, the first thing that you need to do to use this data in MySQL is to define the structure of the tables where you will upload it, so you need to pay attention to the contents of each column and assign the correct type after a detailed inspection of each CSV file. This means writing commands like this one for every CSV.

create table business_imp (business_id int, name varchar(1000), address varchar(1000), city varchar(1000), state varchar(100), postal_code varchar(100), latitude varchar(100), longitude varchar(100), phone_number varchar(100));

and some more to upload the data to the tables:

load data local infile '~/SF_Restaurants/businesses.csv' into table business_imp fields terminated by ',' enclosed by '"' lines terminated by '\n' ignore 1 lines (business_id,name,address,city,state,postal_code,latitude,longitude,phone_number);

and creating indexes to be able to do queries efficiently:

create index inx_inspections_businessid on inspections_imp (business_id);

The equivalent in BigML would be just dragging and dropping the CSVs into your Dashboard:

upload

And as a result, BigML infers for you the types associated with every column detected in each file. In addition, the assigned types are focused on the way the information will be treated by the Machine Learning algorithms. Thus, in the inspections table we see that the Score will be treated as a number, the type as a category, and the date is automatically separated into its year, month, and day components, which are the ones meaningful in ML processes.

types

We just need to verify the inferred types in case we want some data to be interpreted differently. For instance, the violations file contains a description text that includes information about the date the violation was corrected.

$ head -3 violations.csv
"business_id","date","description"
10,"20121114","Unclean or degraded floors walls or ceilings [ date violation corrected: ]"
10,"20120403","Unclean or degraded floors walls or ceilings [ date violation corrected: 9/20/2012 ]"

Depending on how you want to analyze this information, you can decide to leave it as it is, and contents will be parsed to produce a bag of words analysis, or set the text analysis properties differently and work with the full contents of the field.

term_analysis

As you can see, so far BigML has taken care of most of the work: defining the fields in every file, their names, the types of information they contain, and parsing datetimes and text. The only remaining work is taking care of the description field, which in this case combines information about two meaningful features: the actual description and the date when the violation was corrected.

Now that the data dictionary has been checked, we can just create one dataset per source by using the 1-click dataset action.

1-c-dataset

Transforming the description data

The same transformations described in the above-mentioned post can now be applied from the BigML Dashboard. The first one is removing the [date violation corrected: …] substring from the violation’s description field. In fact, we can go further and use that string to create a new feature: the days it took for the violation to be corrected.

add_field

This kind of transformation was already available in BigML thanks to Flatline, our domain-specific transformation language.

editor

Using a regular expression, we can create a clean_description field that removes the date violation part:


(replace (f "description")
         "\\[ date violation corrected:.*?\\]"
         "")

Previewing the results of any transformation we define is easier than ever thanks to our improved Flatline Editor.

clean_description

By doing so, we discover that the new clean_description field is assigned a categorical type because its contents are not free text but a limited range of categories.

clean_desc

The second field is computed using the datetime capabilities of Flatline. The expression to compute the days it took to correct the violation is:


(/ (- (epoch (replace (f "description")
                      ".*\\[ date violation corrected: (.*?) \\]"
                      "$1") "MM/dd/YYYY")
      (epoch (f "date") "YYYYMMdd"))
   (* 1000 24 60 60))

day_to_correction

where we parse the correction date in the original description field and subtract the date the violation was registered. The difference is stored in the new days_to_correction feature, to be used in the learning process.

Getting the ML-ready format

We’ve been working on a particular field of the violations table so far, but if we are to use that table to solve any Machine Learning problem that predicts some property of these businesses, we need to join all the available information into a single dataset. That’s where BigML’s new capabilities come in handy, as we now offer joins, aggregations, merging, and duplicate removal operations.

new_trans

In this case, we need to join the businesses table with the rest, and we realize that inspections and violations both include the business_id field as a key, so a regular join is possible. The join will keep all businesses, and every business can have zero or multiple related rows in the other two tables. Let’s join businesses and inspections:

join

Now, to have a real ML-ready dataset, we still need to meet a requirement. Our dataset needs to have a single row for every item we want to analyze. In this case, it means that we need to have a single row per business. However, joining the tables has created multiple rows, one per inspection. We’ll need to apply some aggregation: counting inspections, averaging scores, etc.

aggreg
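
The join and aggregation above can also be expressed through the API as a SQL transformation with the Python bindings. The sketch below is hedged: the dataset ids are placeholders, the column names (business_id, name, Score) come from the CSV headers shown earlier, and num_inspections is a name chosen for illustration, while avg_score matches the field filtered later on.

# Hedged sketch: join businesses with inspections on business_id and
# aggregate one row per business with the SQL transformation.
# Dataset ids are placeholders; avg_score matches the field filtered later on.
from bigml.api import BigML

api = BigML()
businesses_dataset = "dataset/000000000000000000000001"   # placeholder ids
inspections_dataset = "dataset/000000000000000000000002"

business_inspections = api.create_dataset(
    [businesses_dataset, inspections_dataset],
    {"origin_dataset_names": {businesses_dataset: "A",
                              inspections_dataset: "B"},
     "sql_query": "select A.business_id, A.name, "
                  "count(B.business_id) as num_inspections, "
                  "avg(B.Score) as avg_score "
                  "from A left join B on A.business_id = B.business_id "
                  "group by A.business_id, A.name"})
api.ok(business_inspections)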

The same should be done for the violations table, where again each business can have multiple violations. For instance, we can aggregate the days that it took to correct the violations and the types of violation per business.

viol_aggr

And now, use a right join to add this information to every business record.

viol_join

Finally, the ScoreLegend table just provides a list of categories that can be used to discretize the scores into sensible ranges. We can easily add that to the existing table with a simple select * from A, B expression plus a filter to select the rows whose average score is between the Minimum_Score and Maximum_Score of each legend. In this case, we’ll use the more full-fledged API capabilities through the Python bindings.

# applying the sql query to the business + inspections + violations
# dataset and the ScoreLegend
from bigml.api import BigML
api = BigML()
legend_dataset = api.create_dataset( \
    [business_ml_ready,
     score_legend],
    {"origin_dataset_names": {
        business_ml_ready: "A",
        score_legend: "B"},
     "sql_query": "select * from A, B"})
api.ok(legend_dataset)

# filtering the rows where the score matches the corresponding legend
ml_dataset = api.create_dataset(\
    legend_dataset,
    {"lisp_filter": "(<= (f \"Minimum_Score\")" \
                    " (f \"avg_score\")" \
                    " (f \"Maximum_Score\"))"})

With these transformations, the final dataset is eventually Machine Learning ready and can be used to cluster restaurants into similar groups, find the anomalous ones, or classify them according to their average score ranges. We can also generate new features, like the distance to the city center or the rate of violations per inspection, which can help to better describe the patterns in the data. Here’s the Flatline expression needed to compute the distance of the restaurants to the center of San Francisco using the Haversine formula:


(let (pi 3.141592
      lon_sf (/ (* -122.431297 pi) 180)
      lat_sf (/ (* 37.773972 pi) 180)
      lon (/ (* (f "longitude") pi) 180)
      lat (/ (* (f "latitude") pi) 180)
      dlon (- lon lon_sf)
      dlat (- lat lat_sf)
      a (+ (pow (sin (/ dlat 2.0)) 2) (* (cos lat_sf) (cos lat) (pow (sin (/ dlon 2.0)) 2)))
      c (* 2 (/ (sqrt a) (sqrt (- 1 a)))))
 (* 6373 c))

For instance, modeling the rating in terms of the name, address, postal code, or distance to the city center could tell us where to look for the best restaurants.

distance

Trying a logistic regression, we learn that to find a good restaurant, it’s best to move a bit away from the center of San Francisco.

logistic

Having data transformations in the platform has many advantages. Feature engineering becomes an integrated, trivial-to-use capability, and scalability, automation, and reproducibility are guaranteed, as for any other resource (and are one click away thanks to Scriptify). So don’t be shy and give it a try!

Enterprise Machine Learning More Accessible Than Ever with BigML Lite

Yesterday, millions of shoppers flocked to online sales for Cyber Monday. While this single day of extra savings is exciting, we believe in providing excellent value to our customers year-round. Today, we are delighted to introduce BigML Lite, a new Private Deployment option that makes enterprise Machine Learning more accessible than ever.

BigML Lite for enterprise

At this point, it’s well established that businesses in all industries have the challenge and opportunity to utilize tremendous amounts of data. What isn’t as well accepted yet is that obtaining a company-wide Machine Learning platform is key to enabling analysts, developers, and engineers to build robust predictive applications in a timely manner.

A terrific blog post on “Why businesses fail at machine learning” by Cassie Kozyrkov uses a cooking metaphor to explain how companies often make the mistake of trying to build an oven (a Machine Learning platform) instead of baking bread (deriving insights and making predictions from data). Continuing with this metaphor, the majority of data-driven companies are in the business of “making bread”, so there’s no reason to spend resources creating an oven from scratch. BigML, on the other hand, is the “oven maker” of this metaphor. The BigML Team has spent the last 7+ years meticulously building a comprehensive Machine Learning platform that provides instant access to the most effective ML algorithms and high-level workflows.

To help companies focus on what matters most, automating the decision-making process, BigML offers Private Deployments for customers to start building production-ready predictive apps from day one, without having to worry about low-level infrastructure management. Now, BigML provides two options to meet the needs of both small and large scale deployments: BigML Lite and BigML Enterprise.

  • BigML Lite offers a fast-track route for implementing your first use cases. Ideal to get immediate value for startups, small to mid-size enterprises or in a single department of a large enterprise ready to benefit from Machine Learning.
  • BigML Enterprise offers full-scale access for unlimited users and organizations. Ideal for larger enterprises ready for company-wide Machine Learning adoption.

All BigML Private Deployments (Lite or Enterprise) include the following:

  • Unlimited tasks.
  • Regular updates and upgrades of new features and algorithms.
  • Priority access to customized assistance.
  • Ability to run in your preferred cloud provider, ISP, or on-premises.
  • Fully managed or self-managed Virtual Private Cloud (VPC) deployments.

With BigML Lite, your company can obtain the full power of BigML’s platform on a single server at a significantly reduced price. After successfully bringing your initial predictive use cases to production, you can easily upgrade to bigger deployments, auto-scaling to accommodate more users and more data. Along with our Private Deployments, we are happy to guide companies with their projects,  giving personalized support to help your business successfully apply Machine Learning.

Please see our pricing page for more details and contact us at info@bigml.com for any inquiries.

Ready, set, deploy!

K-means–: Finding Anomalies while Clustering

On November 4th and 5th, BigML joined the Qatar Computing Research Institute (QCRI), part of Hamad Bin Khalifa University, to bring a Machine Learning School to Doha, Qatar! We are very excited to have this opportunity to collaborate with QCRI.

During the conference, Dr. Sanjay Chawla discussed his algorithm for clustering with anomalies, k-means–. We thought it would be a fun exercise to implement a variation of it using our domain-specific language for automating Machine Learning workflows, WhizzML.

Applying BigML to ML research

The Algorithm

The usual process for the k-means– algorithm is as follows. It starts with some dataset, some number of clusters k, and some number of expected outliers l. It randomly picks k centroids, and assigns every point of the dataset to one of these centroids based on which one is closest. So far, it’s just like vanilla k-means. In vanilla k-means, you would now find the mean of each cluster and set that as the new centroid. In k-means–, however, you first find the l points which are farthest from their assigned centroids and filter them from the dataset. The new centroids are found using the remaining points. By removing these points as we go, we’ll find centroids that aren’t influenced by the outliers, and thus different (and hopefully better) centroids.
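
For reference, here is a minimal NumPy sketch of that core iteration (the vanilla k-means– loop described above, not BigML’s adaptation, which is covered next):

# Minimal NumPy sketch of the k-means– iteration described above
# (illustrative only; BigML's adaptation below uses the cluster resource).
import numpy as np

def k_means_minus_minus(X, k, l, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k points chosen at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    outliers = np.array([], dtype=int)
    for _ in range(n_iter):
        # Distance of every point to every centroid and nearest assignment.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        nearest = dists[np.arange(len(X)), assignments]
        # The l points farthest from their centroids are treated as outliers
        # and excluded from the centroid update.
        outliers = np.argsort(nearest)[-l:]
        keep = np.ones(len(X), dtype=bool)
        keep[outliers] = False
        for j in range(k):
            members = keep & (assignments == j)
            if members.any():
                centroids[j] = X[members].mean(axis=0)
    return centroids, outliers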

We already have an implementation of k-means in BigML, the cluster resource. But this is not vanilla k-means. Instead of finding the new centroids by averaging all of the points in the cluster, BigML’s implementation works faster by sampling the points and using a gradient descent approach. BigML also picks better initial conditions than vanilla k-means. Instead of losing these benefits, we’ll adapt Chawla’s k-means– to use a full BigML clustering resource inside the core iteration.

This WhizzML script is the meat of our implementation.  

(define (get-anomalies ds-id filtered-ds k l)
  (let (cluster-id (create-and-wait-cluster {"k" k 
                                             "dataset" filtered-ds})
        batchcentroid-id (create-and-wait-batchcentroid 
                            {"cluster" cluster-id 
                             "dataset" ds-id 
                             "all_fields" true 
                             "distance" true 
                             "output_dataset" true})
        batchcentroid (fetch batchcentroid-id)
        centroid-ds (batchcentroid "output_dataset_resource")
        sample-id (create-and-wait-sample centroid-ds)
        field-id (((fetch centroid-ds) "objective_field") "id") 
        anomalies (fetch sample-id {"row_order_by" (str "-" field-id) 
                                    "mode" "linear"
                                    "rows" l
                                    "index" true}))
    (delete* [batchcentroid-id sample-id])
    {"cluster-id" cluster-id 
     "centroid-ds" centroid-ds
     "instances" ((anomalies "sample") "rows")}))

Let’s examine it line by line. Instead of removing l outliers at each step of the algorithm, let’s run an entire k-means sequence before removing our anomalies.

cluster-id (create-and-wait-cluster {"k" k "dataset" filtered-ds})

It is then very easy to create a batch centroid with an output dataset that has the distance to the centroid appended.

batchcentroid-id (create-and-wait-batchcentroid {"cluster" cluster-id 
                                                 "dataset" ds-id 
                                                 "all_fields" true 
                                                 "distance" true 
                                                 "output_dataset" 
                                                   true})

To get specific points, we need to use a BigML sample resource to get the most distant points.

sample-id (create-and-wait-sample centroid-ds)

We can now find the distance associated with the lth instance, and subsequently filter out all points greater than that distance from our original dataset.

anomalies (fetch sample-id {"row_order_by" (str "-" field-id) 
                            "mode" "linear"
                            "rows" l
                            "index" true}))

We repeat this process until the centroids stabilize, as determined by passing a threshold for the Jaccard coefficient between the sets of outliers in subsequent iterations of the algorithm, or until we reach some maximum number of iterations set by the user.
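
For reference, the stopping criterion mentioned above can be computed with a tiny generic helper (plain Python, not the WhizzML used by the script):

# Jaccard coefficient between the outlier sets of two consecutive iterations;
# the loop stops once it rises above the chosen threshold.
def jaccard(outliers_a, outliers_b):
    set_a, set_b = set(outliers_a), set(outliers_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

# e.g. stop when jaccard(previous_outliers, current_outliers) >= 0.9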

 You can find the complete code on GitHub or in the BigML gallery.

The Script In Action

So what happens when we run this script? Let’s try it with the red wine quality dataset. Here is the result when using a k of 13 (chosen using a BigML g-means cluster) and an l of 10.

Screen Shot 2018-11-05 at 10.25.20 AM

We can export a cluster summary report and compare it to a vanilla BigML cluster with the same k. As you might expect from removing the outlying points, the average of the centroid standard deviations is smaller for the k-means– results: 0.00128 versus 0.00152.

What about the points we removed as outliers? Do we know if they were truly anomalous? When we run the wine dataset through a BigML anomaly detector, we can get the top ten anomalies according to an isolation forest. When compared to the ten outliers found by the script, we see that there are six instances in common. This is decent agreement, suggesting that we have removed true outliers.

We hope you enjoyed this demonstration of how BigML can be used alongside research to easily custom-make ML algorithms. If you couldn’t join us in Doha this weekend, we hope to see you at our first Machine Learning School in Seville, Spain, and at other upcoming events!

Where Robotics and Machine Learning Meet

Robotic Process Automation (RPA) and Machine Learning are cutting-edge technologies that are showing an astonishing pace of growth in both capabilities and real-world applications. Having realized the powerful synergy between the data generated by Software Robots and the insights that Machine Learning algorithms can provide for any business, Jidoka and BigML, the leading RPA and Machine Learning companies respectively, have joined forces in a partnership to provide highly integrated solutions to their collective partners and clients.

There are plenty of areas where businesses and developers can benefit from this strategic alliance between RPA and Machine Learning. For instance, a company’s customer care department and their email processing requirements. On one hand, BigML creates a Machine Learning model that predicts the receiver (department or employee) of a given email. On the other hand, Jidoka’s robots will automatically carry out all the rule-based tasks that humans historically completed such as checking if there are new e-mails to be processed, forwarding them to the correct recipients according to BigML’s predictions, and registering the task to address the request.

“Merging RPA and Machine Learning capabilities, we are able to provide enhanced automation solutions to companies that want to lead their digital transformation journey”, said Victor Ayllón, Jidoka’s CEO. “Alliances with companies such as BigML allow us to expand significantly the typology and complexity of automated processes, taking RPA one step further. Intelligent Automation is not a matter of “if” but a matter of “when”.

BigML’s CEO, Francisco Martín commented, “The imperative for more automation continues unabated in all types of business processes, and it’s only natural that RPA efforts seek to embed more and more Machine Learning models which optimize those processes, and perform tasks that until now only highly-trained humans were capable of achieving. Such automation is liberating personnel so that they can focus on more strategic tasks. BigML’s partnership with Jidoka will result in much more adaptive systems that will help our customers introduce new avenues of productivity.”

With the shared goals of making businesses more agile, productive, and customer-oriented, Jidoka and BigML jointly offer enhanced capabilities for business process automation. Reducing costs and errors, improving response times, increasing human performance, enabling predictions, and facilitating data-driven decision-making are just some of the benefits businesses will be rewarded with as they adopt Machine Learning driven Robotics.

Partner with a Machine Learning Platform: Bringing Together Top Tech Companies

The concept of “better together” doesn’t just apply to people or beloved food pairings; it also works quite well for software technologies. With this in mind, we are excited to announce the  BigML Preferred Partner Program (PPP) which brings together our comprehensive Machine Learning platform with innovative companies around the world that offer complementary technology and services. Through effective collaboration, we can provide numerous industry-specific solutions to benefit the customers of both companies.

Our Preferred Partner Program comprises three different levels of commitment and incentives:

  • Referral Partners: focus on generating qualified leads by leveraging their business network.
  • Sales Partners: drive product demonstrations and advance qualified leads through the contract stage.
  • Sales & Delivery Partners: facilitate the sales process and see closed deals through the solution deployment stage.

BigML provides a matching amount of training according to the tasks covered at each partnership level, including personalized training sessions, sales and IT team training, and BigML Engineer Certifications. In addition to learning one-on-one from BigML’s Machine Learning experts, partner perks include:

Over the past 7 years, BigML has systematically built a consumable, programmable, and scalable Machine Learning platform that is being used by tens of thousands of users around the world to develop end-to-end Machine Learning applications. BigML provides the core components for any Machine Learning workflow, giving users immediate access to analyze their data, build models, and make predictions that are interpretable, programmable, traceable and secure.

With these powerful building blocks on BigML that are both robust and easy to use, it is more accessible than ever for data-driven companies to build full-fledged predictive applications across industries. We strongly believe that not only end customers can benefit from our platform, but also tech companies that wish to help in the adoption of Machine Learning to make it accessible and effective for all businesses. Displayed below is a small sampling of services that can be provided atop the BigML platform:

Partner Services atop of BigML Platform

Find more information here and reach out to us at partners@bigml.com. In addition to the three main levels, BigML is also looking for companies interested in Original Equipment Manufacturer (OEM) and technology partnerships, so if that is of interest to your company, please let us know. We look forward to partnering with other innovative companies!
