Filtering, Filtering, and Filtering

BigML is rolling out some new API features, and many of them can be summed up in a word: filtering. When you’ve got data, you can slice it up in a lot of interesting ways, and filtering is one way to do that.

Filtering When You Create Datasets

Many sources of data have interesting subsets. For example, suppose you have stock market performance data and want to learn a model to pick stocks. But if you’re only interested in one type of stock (say, stocks in the energy sector, or stocks with market capitalization above a certain value), it doesn’t make sense to train a model using other types of stock. In some cases it might even worsen performance.

Until now, BigML’s API let you create a dataset by specifying a list of input_fields and a list of excluded_fields from the data source. Now you can also specify a json_filter as part of the request. When you specify a filter, only instances (input rows) satisfying it will be included in your dataset. Filters are essentially logical statements built from prefix operators, field identifiers, and constants.

Continuing the example above, imagine that you want to keep only those stocks with a market capitalization greater than $200 million. You can do that using a JSON expression like this:

[">", ["field", "market_cap"], 200000000]

Using curl you can include a json_filter in your request to create a dataset as follows:

curl "$BIGML_AUTH" \
     -X POST \
     -H "content-type: application/json" \
     -d '{"source": "source/50c1871c035d07305d00000c", "json_filter": [">", ["field", "market_cap"], 200000000]}'

The request above will generate a dataset that only includes instances where the field “market_cap” is greater than $200 million.
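To make the prefix notation concrete, here is a minimal local sketch in Python of how such a filter could be evaluated against rows. It is only an illustration, not BigML’s implementation: the operator table is our own small subset, and the `evaluate` and `apply_filter` helpers and the sample rows are hypothetical.

```python
import operator

# Assumption: a small operator subset for illustration only.
COMPARISONS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "=": operator.eq,
}


def evaluate(expr, row):
    """Evaluate a prefix-notation filter expression against one row (a dict)."""
    # Constants (numbers, strings) evaluate to themselves.
    if not isinstance(expr, list):
        return expr
    op, *args = expr
    if op == "field":
        # ["field", name] looks up the named column in the row.
        return row[args[0]]
    if op == "and":
        return all(evaluate(a, row) for a in args)
    if op == "or":
        return any(evaluate(a, row) for a in args)
    if op == "not":
        return not evaluate(args[0], row)
    left, right = (evaluate(a, row) for a in args)
    return COMPARISONS[op](left, right)


def apply_filter(expr, rows):
    """Keep only the rows for which the filter expression is true."""
    return [row for row in rows if evaluate(expr, row)]


# Hypothetical sample data.
stocks = [
    {"ticker": "AAA", "market_cap": 150_000_000, "sector": "Energy"},
    {"ticker": "BBB", "market_cap": 950_000_000, "sector": "Technology"},
    {"ticker": "CCC", "market_cap": 300_000_000, "sector": "Energy"},
]
json_filter = [">", ["field", "market_cap"], 200000000]
kept = apply_filter(json_filter, stocks)
print([row["ticker"] for row in kept])  # ['BBB', 'CCC']
```

The operator always comes first, followed by its arguments, and each argument can itself be a nested expression.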

If you are a Lisp aficionado (like most of us at BigML), you can also provide the filter as an s-expression, which is a little less tedious syntactically. The only difference is that you send the expression as a lisp_filter instead of a json_filter. Imagine that you want to keep only the stocks with a market cap greater than $200 million that also belong to the Technology sector. You can do that using a Lisp expression like this:

(and (> (field "market_cap") 200000000)
     (= (field "sector") "Technology"))

Using curl you can include a lisp_filter in your request to create a dataset as follows:

curl "$BIGML_AUTH" \
     -X POST \
     -H "content-type: application/json" \
     -d '{"source": "source/50c1871c035d07305d00000c", "lisp_filter": "(and (> (field \"market_cap\") 200000000) (= (field \"sector\") \"Technology\"))"}'

The filtering expressions can get pretty sophisticated, and can involve boolean operators, arithmetic operators, and tests for missing values. You can check out the dataset section of the developer documentation under “Filtering Rows” to learn more.
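Escaping the quotes inside a lisp_filter by hand is error-prone, so you may prefer to generate the request body. The sketch below, in Python, renders a nested-list filter as an s-expression and lets `json.dumps` take care of the escaping. The `to_sexp` helper is hypothetical and ours, not part of any BigML library.

```python
import json


def to_sexp(expr):
    """Render a nested-list filter as a Lisp s-expression string."""
    if isinstance(expr, list):
        head, *rest = expr
        # The head is the operator name and is written bare, without quotes.
        return "(" + " ".join([head] + [to_sexp(e) for e in rest]) + ")"
    if isinstance(expr, str):
        # Field names and string constants keep their double quotes.
        return json.dumps(expr)
    return str(expr)


json_filter = [
    "and",
    [">", ["field", "market_cap"], 200000000],
    ["=", ["field", "sector"], "Technology"],
]
lisp_filter = to_sexp(json_filter)
print(lisp_filter)
# (and (> (field "market_cap") 200000000) (= (field "sector") "Technology"))

# json.dumps escapes the inner double quotes for us.
body = json.dumps({
    "source": "source/50c1871c035d07305d00000c",
    "lisp_filter": lisp_filter,
})
print(body)
```

The printed body is the same payload passed to `-d` in the curl example above.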

Finally, you can also use bigmler and filters stored in a file to create new models. For example:

bigmler --source source/50c1871c035d07305d00000c --lisp_filter technology_1b.lisp --objective high

Filtering Dataset Fields

Often, big data isn’t just about having a lot of instances (rows or examples); it’s also about having a lot of fields (columns or features). Text processing and computer vision datasets in particular are notorious for having lots of fields. This means that when you GET one of the datasets you’ve created, the fields block in that dataset may have thousands or even tens of thousands of entries.

BigML’s API now allows you to limit or eliminate that potentially large block of information. For example, to get the information for only the first 20 fields in a dataset, simply add ‘limit=20’ as a query parameter to your request:

curl "$BIGML_AUTH;limit=20"

Similarly, to get only the information for the fields whose names start with “market_cap_”, add “prefix=market_cap_” as an additional parameter to your request. If you want the prefix match to be case-insensitive, use “iprefix” instead.

Field filtering is pretty flexible. You can also specify a comma-separated list of field names to retrieve, or an offset and size, or a boolean flag indicating that you don’t want any fields at all. Again, the dataset section of the developer documentation under “Filtering and Paginating Fields from a Dataset” has the full story.
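If it helps to picture what these parameters select, here is a rough local emulation in Python. It is a sketch under assumptions: the `select_fields` helper and the sample fields block are hypothetical, and we assume the prefix is matched before the limit is applied when both are given. The server’s actual behavior is described in the documentation.

```python
def select_fields(fields, prefix=None, iprefix=None, limit=None):
    """Emulate locally which entries of a fields block a query would keep."""
    selected = {}
    for field_id, info in fields.items():
        # Stop as soon as we have collected `limit` fields.
        if limit is not None and len(selected) >= limit:
            break
        name = info["name"]
        # Case-sensitive prefix match.
        if prefix is not None and not name.startswith(prefix):
            continue
        # Case-insensitive prefix match.
        if iprefix is not None and not name.lower().startswith(iprefix.lower()):
            continue
        selected[field_id] = info
    return selected


# Hypothetical fields block, keyed by field id.
fields = {
    "000000": {"name": "market_cap_2011"},
    "000001": {"name": "Market_Cap_2012"},
    "000002": {"name": "sector"},
    "000003": {"name": "market_cap_2013"},
}

print(list(select_fields(fields, prefix="market_cap_")))   # ['000000', '000003']
print(list(select_fields(fields, iprefix="market_cap_")))  # ['000000', '000001', '000003']
print(list(select_fields(fields, limit=2)))                # ['000000', '000001']
```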

Filtering Models

People are big fans of our user interface. They particularly like that they can use the controls in the model view to show and hide different parts of the model, either by the target prediction or by the “support”, that is, the number of instances passing through a given node in the tree-like BigML model.

We’ve now enabled that same functionality in our API. To retrieve a model where only the paths that predict ‘buy’ are shown, you can add a ‘value’ parameter to your request:

curl "$BIGML_AUTH;value=buy"

You can also filter the nodes in the model so that only nodes containing, say, at least 10% of the dataset are shown:

curl "$BIGML_AUTH;value=buy;support=0.1"

This obviously works slightly differently for regression models: in that case, ‘value’ is a range like [10, 20] rather than a class name. You can also combine ‘value’ and ‘support’ for more elaborate filters, as in the example above. You can find more in the models section of the developer documentation under “Filtering a Model”.
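For intuition, here is a small Python sketch of how a classification tree could be pruned by ‘value’ and ‘support’ on the client side. Everything about it is an assumption made for illustration: the node layout (`output`, `count`, `children`), the `prune` helper, and the rule that a node survives if it has enough support and either predicts the requested value or leads to a surviving node that does. It is not a description of how BigML implements the filter.

```python
def prune(node, total, value=None, support=0.0):
    """Return a filtered copy of the subtree, or None if nothing survives."""
    # Drop nodes that hold too small a fraction of the dataset.
    if node["count"] / total < support:
        return None
    children = [prune(c, total, value, support) for c in node.get("children", [])]
    children = [c for c in children if c is not None]
    # Drop nodes that neither predict the value nor lead to one that does.
    if value is not None and node["output"] != value and not children:
        return None
    return {"output": node["output"], "count": node["count"], "children": children}


# Hypothetical tree over 100 instances.
tree = {
    "output": "buy", "count": 100, "children": [
        {"output": "buy", "count": 60, "children": [
            {"output": "buy", "count": 55, "children": []},
            {"output": "sell", "count": 5, "children": []},
        ]},
        {"output": "sell", "count": 40, "children": [
            {"output": "sell", "count": 32, "children": []},
            {"output": "buy", "count": 8, "children": []},
        ]},
    ],
}

pruned = prune(tree, total=tree["count"], value="buy", support=0.1)
# Only the root -> 60-instance node -> 55-instance node path remains.
print([child["count"] for child in pruned["children"]])  # [60]
```

With `support=0.0`, the 8-instance ‘buy’ leaf and its ‘sell’ parent would also survive, because that path still ends in a ‘buy’ prediction.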

Filtering When You Create Models

Okay, we’re not really talking about filtering in this last case, but about sampling. If you have a dataset with millions of instances, sometimes it isn’t necessary to use them all. It might be possible to learn a near-perfect predictor with only a few thousand examples, and to use more would be time-consuming, expensive, and not very useful. Said differently, it pays to match the amount of training data with the complexity of your modeling problem.

We’re now giving you the tools you need to sample your input at model creation time. Specifically, if you specify ‘sample_rate’ as part of your request when creating your model, you can tell BigML to use only a fraction of your dataset. For example:

curl "" \
     -X POST \
     -H "content-type: application/json" \
     -d '{"dataset": "dataset/50c4ef483b56355a3700010d", "sample_rate": 0.1}'

will use only a tenth of your data during model creation. There are a number of related parameters, all of which are explained in the models section of the developer documentation under “Sampling your Dataset”.
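If you want a feel for what a 0.1 sample rate does to your data before sending it anywhere, you can try the idea locally. The Python sketch below draws a fixed-size random subset without replacement. The `sample_rows` helper and its `seed` argument are hypothetical, and BigML’s own sampling may work differently (for instance, by deciding row by row).

```python
import random


def sample_rows(rows, sample_rate, seed=None):
    """Return a random subset holding about sample_rate of the rows."""
    # A seeded generator makes the sample reproducible.
    rng = random.Random(seed)
    size = round(len(rows) * sample_rate)
    # Sampling without replacement: no row is picked twice.
    return rng.sample(rows, size)


rows = list(range(1000))
subset = sample_rows(rows, 0.1, seed=42)
print(len(subset))  # 100
```

Training on `subset` and comparing against a model built from all of `rows` is a quick way to check whether the extra data is actually buying you anything.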

We expect to bring a few of these features to BigML’s interface soon. In the meantime, if you’re a BigML API user, get out there and give these new features a test drive!
