How to Perform Clustering in a Single Command Line

Posted by

It’s been a while since we last wrote about the latest changes in our command line tool, BigMLer. In the meantime, two unsupervised learning approaches have been added to the BigML toolset: Clusters and Anomaly Detectors. Clusters are useful to group together instances that are similar to each other and dissimilar to those in other groups, according to their features. Anomaly Detectors, on the contrary, try to reveal which instances are dissimilar from the global pattern. Clusters and anomaly detectors can be used in market segmentation or fraud detection respectively. Unlike trees and ensembles, they don’t need your training data to contain a field that you must previously label. Rather, they work from scratch which is why they’re called unsupervised models.

In this post, we’ll see how easily you can build a cluster from your data, and a forthcoming post will do the same for anomaly detectors. Using the command line tool BigMLer, these machine learning techniques will easily pop out of their shells, and you will be able to use them either from the cloud or from your own computer, no latencies involved.


Clustering your data

There are many scenarios where you can get new insights from clustering your data. Customer segmentation might be the most popular one. Being able to identify which instances share similar characteristics has always been a coveted feature, whether it be for your marketing campaigns or to identify groups with loan default risk or for health diagnosis. Clusters do this job by defining a distance between your data instances based on the values of their features. The distance must ensure that similar instances are closer than different ones. In BigML, all kinds of features contribute to this distance: numeric features, obviously, but also categorical and text features are taken into account. Using this distance, the instances in your dataset that are closer together and more distant to the rest of points are grouped into a cluster. The central points of these clusters, called centroids, give us the average features of the group. The user can optionally label each cluster with a descriptive name based on the features of the centroid. These labels can be used as a cluster-based class name, which can be thought of as assigning a label to each instance of the cluster dataset. Later, you could build a new global dataset with a new column that would be the label of the cluster this instance is assigned to, and in this sense, the label would be a kind of category. Then, when new data appears, you can tell which cluster it belongs to by checking which centroid it is closest to.

So, how can BigMLer help cluster your data? Just type:

bigmler cluster –train my_dataset

and the process begins: first your data is uploaded to BigML‘s servers, and BigML infers the fields that your source contains, their types, the missing values these fields might have, and builds a source object. This source is then summarized into a dataset, where all statistical information is computed. After that, the dataset is used to look for the clusters in your data and a cluster object is created. The command’s output shows the steps of the process and the IDs for the created resources. Those resources are also stored in files in the output directory.

The default clustering algorithm used is G-means, but what does that mean? We talked about that extensively in a prior post, but basically this algorithm groups your data around a small number of instances (the number is usually denoted by k) by computing the distance of the rest of the points in the dataset to those instances. Each point in the dataset is considered to be grouped around the closest one of the k-instances selected in the beginning. This process results in k clusters and their centroids (the central point of the each group) are used as the next starting set of instances for a new iteration of the same algorithm. This carries on until there is no more improvement in the separation among chosen clusters. An advanced algorithm, G-means, eliminates the need for the user to divine what the value for k must be. It compares the results found using different values for k and choses the one that achieves the best Gaussian shaped clusters. That’s the algorithm used in BigMLer if no k value is specified, but you can always set the k of your choice in the command if you please:

bigmler cluster --train my_dataset --k 3

In this case, the created cluster will have exactly 3 centroids, while in the first case, the final cluster count will be determined automatically. This process of picking instances as seed is done randomly, therefore you must specify a seed using the modifier --cluster-seed in your command if you want to ensure deterministic results. To run the clustering from your created dataset with a seed, use the dataset id as starting point:

bigmler cluster --dataset dataset/53b1f71437203f5ac30004a2 \
--cluster-seed my_seed --test my_new_instances.csv

This command will create a new cluster from the data enclosed in the existing dataset that will be reproducible in a deterministic way. It will also generate predictions for the new data enclosed in the my_new_instances file, finding which centroid each instance is closer to. But then, how do you know which instances are grouped together?

Profiling your clustered data

To know more about the characteristics of the obtained set of clusters, you can create a dataset for each of them with --cluster-datasets. Refer to your recently created cluster using its id

bigmler cluster --cluster cluster/53b1f71437203f5ac30004f0 \

and you’ll obtain a new dataset per centroid. Each of them will contain the instances of your original dataset associated to the centroid. If you are not interested in all of the groups, you can chose which one to generate by using the name of the associated centroid.

bigmler cluster --cluster cluster/53b1f71437203f5ac30004f0 \
--cluster-datasets "Cluster 1,Cluster 2"

will only generate the datasets associated to the clusters labeled as Cluster 1 and Cluster 2. The generated datasets will show the statistical information associated with the instances that fall into a certain cluster-defined class.

Can you learn more from your cluster? Well, yes! You can also see which features are most helpful to separate the clustered datasets, so that you can infer the attributes that “define” your clusters. In order to do that, you should create a model for each clustered dataset that you are interested in using --cluster-models:

bigmler cluster --cluster cluster/53b1f71437203f5ac30004f0 \
--cluster-models "Cluster 1, Cluster 2"

How does BigMLer build a tree model on a particular cluster, let’s say Cluster 1? By adding a new field to every instance in your dataset that will contain whether or not this particular instance is grouped there. This brings an additional advantage: each model has a related field importance histogram that shows the importance a certain field has in the classification. Knowing which fields are important to classify new data as related to a centroid can give you an insight into what features define the cluster.

And these are basically the magic sentences that you must know to identify the groups of instances your data contains, the profiling of each group and the features that make them alike and different from others. Similarly, in a forthcoming post, we’ll talk about another set of commands that will help you pinpoint the outliers in your data by using Anomaly Detectors. Stay tuned!


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s