
Partner with a Machine Learning Platform: Bringing Together Top Tech Companies

The concept of “better together” doesn’t just apply to people or beloved food pairings; it also works quite well for software technologies. With this in mind, we are excited to announce the BigML Preferred Partner Program (PPP), which brings together our comprehensive Machine Learning platform with innovative companies around the world that offer complementary technology and services. Through effective collaboration, we can provide numerous industry-specific solutions that benefit the customers of both companies.

BigML Preferred Partner Program

Our Preferred Partner Program is comprised of three different levels of commitment and incentives:

  • Referral Partners: focus on generating qualified leads by leveraging their business network.
  • Sales Partners: drive product demonstrations and advance qualified leads through the contract stage.
  • Sales & Delivery Partners: facilitate the sales process and see closed deals through the solution deployment stage.

BigML provides a matching amount of training according to the tasks covered at each partnership level, including personalized training sessions, sales and IT team training, and BigML Engineer Certifications. In addition to learning one-on-one from BigML’s Machine Learning experts, partners also receive a range of additional perks.

Over the past 7 years, BigML has systematically built a consumable, programmable, and scalable Machine Learning platform that is being used by tens of thousands of users around the world to develop end-to-end Machine Learning applications. BigML provides the core components for any Machine Learning workflow, giving users immediate access to analyze their data, build models, and make predictions that are interpretable, programmable, traceable and secure.

With these robust and easy-to-use building blocks, it is more accessible than ever for data-driven companies to build full-fledged predictive applications across industries. We strongly believe that not only end customers can benefit from our platform, but also tech companies that wish to drive the adoption of Machine Learning and make it accessible and effective for all businesses. Displayed below is a small sampling of services that can be provided atop the BigML platform:

Partner Services atop the BigML Platform

Find more information here and reach out to us at partners@bigml.com. In addition to the three main levels, BigML is also looking for companies interested in Original Equipment Manufacturer (OEM) and technology partnerships, so if that is of interest to your company, please let us know. We look forward to partnering with other innovative companies!

Qatar is Ready for Machine Learning Adoption

141 attendees from 13 countries enjoyed the first edition of our Machine Learning School in Doha, co-organized with the Qatar Computing Research Institute (QCRI), part of Hamad Bin Khalifa University. Machine Learning practitioners, mostly from Qatar but also from Algeria, Egypt, Finland, Mexico, Montenegro, the Republic of Korea, South Africa, Spain, Tunisia, Turkey, the United Kingdom, and the United States, traveled to the capital of Qatar for two days packed with Machine Learning master classes, practical workshops, one-on-one consultations at the Genius Corner, and plenty of chances to network with data-driven professionals.

The BigML and QCRI teams are very excited and impressed with such a great response, which was felt at the event in our interactions with the audience and the media on November 4-5 at the Marriott Marquis City Center Doha Hotel. The MLSD18 was booked to capacity in less than a week. A wide variety of professionals joined the two-day intensive course, 90 attendees represented 12 top universities from the Middle East region, and the remaining 51 attendees came from 29 organizations, including some of the most influential companies in the region.

The lessons were delivered by experienced BigML team members as well as QCRI Scientists. BigML’s CIO Poul Petersen covered the important topic of Machine Learning-Ready Data, together with Dr. Saravanan Thirumuruganathan, Scientist at QCRI. The curriculum also included supervised and unsupervised learning techniques, taught by Dr. Greg Antell, BigML’s Machine Learning Architect, and Dr. Sanjay Chawla, Research Director with the Qatar Computing Research Institute, respectively. The automation section was presented by BigML’s CTO, Dr. José Antonio Ortega (jao). Furthermore, Dr. Mourad Ouzzani, Principal Scientist with QCRI, presented the latest research projects QCRI has been working on. Please check out these photo albums on Google+ and Facebook to view the event’s highlights.

We could not wish for a better start to our Machine Learning Schools series in the Middle East, and we look forward to future editions in the region. Stay tuned for future announcements! For the Machine Learning enthusiasts who could not join the MLSD18, we encourage you to register for the first Machine Learning School in Seville, on March 7-8, 2019.

The Data Transformations Webinar Video is Here: Machine Learning-Ready Data!

The latest BigML release brings new Data Transformation capabilities to our Machine Learning platform, crucially alleviating the data engineering bottleneck.

In addition to our previous offerings in the data preparation area, we are now releasing a new batch of transformations that maximize the benefits of the BigML platform’s supervised and unsupervised learning techniques by pre-processing input data more effectively. You can now aggregate the instances of your datasets, join and merge datasets, and apply many other operations, as presented in the official launch webinar video. You can watch it anytime on the BigML YouTube channel.

To learn more about Data Transformations, please visit the release page, where you can find:

  • The slides used during the webinar.
  • The detailed documentation to learn how to use Data Transformations with the BigML Dashboard and the BigML API.
  • The series of blog posts that gradually explain Data Transformations. We start with an introductory post that explains the basic concepts, followed by a use case to understand how to put Data Transformations to use, and then three more posts on how to use Data Transformations through the BigML Dashboard, API, WhizzML, and Python Bindings.

Thanks for watching the webinar! For any queries or comments, please contact the BigML Team at support@bigml.com.

Automating Data Transformations with WhizzML and the BigML Python Bindings

This is the fifth post of our series of six about BigML’s new release: Data Transformations. This time we focus on the data preparation step that precedes any Machine Learning project: creating a new, improved dataset from an existing one.

The CRISP-DM process diagram

The data preparation phase is key to achieving good performance with your predictive models. There is a wide variety of operations to perform, since data usually do not arrive ready to use or do not contain upfront the fields we need to create our Machine Learning models. Aware of this, BigML introduced Flatline in 2014, a domain-specific language designed for data transformations. Over the years, Flatline has grown and increased the number of operations it can perform. In this release, we improved its sliding window operations and added the ability to use a subset of SQL instructions, enabling a new range of transformations such as joins, aggregations, or adding rows to an existing dataset.

In this blog post, we will learn step-by-step how to automate these data transformations programmatically using WhizzML, BigML’s Domain Specific Language for Machine Learning automation, and the official Python Bindings.

Adding Rows: Merge Datasets

When you want to add data to a dataset that already exists in the platform, you can use the following code. This is typical when data are collected in periods or when the same kind of data comes from different sources.

;; creates a dataset merging two existing datasets
(define merged-dataset
  (create-dataset {"origin_datasets"
                   ["dataset/5bca3fb3421aa94735000003"
                    "dataset/5bcbd2b5421aa9560d000000"]})

The equivalent code in Python is:

from bigml.api import BigML

api = BigML()  # credentials are read from the environment

# merge all the rows of two datasets
api.create_dataset(
    ["dataset/5bca3fb3421aa94735000003",
     "dataset/5bcbd2b5421aa9560d000000"]
)

As we saw in previous posts, the BigML API is mostly asynchronous, which means that the call returns the ID of the new dataset before its creation is complete; the analysis of the fields and their summaries continues after the code snippet has executed. You can use the directive “create-and-wait-dataset” to make sure the datasets have been merged before moving on:

;; creates a dataset from two existing datasets and
;; once it's completed its ID is saved in merged-dataset variable
(define merged-dataset
  (create-and-wait-dataset {"origin_datasets"
                            ["dataset/5bca3fb3421aa94735000003",
                             "dataset/5bcbd2b5421aa9560d000000"]})

The equivalent code in Python is:

# merge all the rows of two datasets and store the ID of the
# new dataset in merged_dataset variable
merged_dataset = api.create_dataset(
    ["dataset/5bca3fb3421aa94735000003",
     "dataset/5bcbd2b5421aa9560d000000"]
)
api.ok(merged_dataset)

When you merge datasets, you can set several parameters, which you can check in the Multidatasets section of the API documentation. With that, we can now configure a merged dataset in WhizzML, setting the sample rates with the same pattern of <property_name> and <property_value> pairs used in the first example.

;; creates a dataset from two existing datasets
;; setting the percentage of sample in each one
;; once it's completed its ID is saved in merged-dataset variable
(define merged-dataset
  (create-and-wait-dataset {"origin_datasets"
                            ["dataset/5bca3fb3421aa94735000003"
                             "dataset/5bcbd2b5421aa9560d000000"]
                            "sample_rates"
                             {"dataset/5bca3fb3421aa94735000003" 0.6
                              "dataset/5bcbd2b5421aa9560d000000" 0.8}})

The equivalent code in Python is:

# Creates a merged dataset specifying the rate of each 
# one of the original datasets
merged_dataset = api.create_dataset(
    ["dataset/5bca3fb3421aa94735000003", "dataset/5bcbd2b5421aa9560d000000"],
    {
        "sample_rates": {
            "dataset/5bca3fb3421aa94735000003": 0.6,
            "dataset/5bcbd2b5421aa9560d000000": 0.8
        }
    }
)
api.ok(merged_dataset)

Denormalizing Data: Join Datasets

Data is commonly stored in relational databases, following the normal forms paradigm to avoid redundancies. Nevertheless, for Machine Learning workflows, data need to be denormalized.

BigML now allows you to perform this process in the cloud as part of your workflow, codified in WhizzML or with the Python Bindings. For this transformation, we can use Structured Query Language (SQL) expressions. See below how it works, assuming we have two different datasets in BigML that we want to put together, both sharing an `employee_id` field whose field ID is 000000:

;; creates a joined dataset composed of two datasets
(define joined-dataset
  (create-dataset {"origin_datasets"
                   ["dataset/5bca3fb3421aa94735000003"
                    "dataset/5bcbd2b5421aa9560d000000"]
                   "origin_dataset_names"
                   {"dataset/5bca3fb3421aa94735000003" "A"
                    "dataset/5bcbd2b5421aa9560d000000" "B"}
                   "sql_query"
                   "select A.* from A left join B on A.`000000` = B.`000000`"}))

The equivalent code in Python is:

# creates a joined dataset composed of two datasets
api.create_dataset(
    ["dataset/5bca3fb3421aa94735000003", "dataset/5bcbd2b5421aa9560d000000"],
    {
        "origin_dataset_names": {
            "dataset/5bca3fb3421aa94735000003": "A",
            "dataset/5bcbd2b5421aa9560d000000": "B"
        },
        "sql_query":
            "SELECT A.* FROM A LEFT JOIN B ON A.`000000` = B.`000000`"
    }
)

Aggregating Instances

Using SQL opens up a huge range of operations on your data, such as selections, value transformations, and row groupings, among others. For instance, in some situations we need to collect statistics from the data by creating groups around the values of a specific field. This transformation is commonly known as aggregation, and the SQL keyword for it is ‘GROUP BY’. See below how to use it in WhizzML, assuming we are managing a dataset with some company data where field 000001 is the department and field 000005 is the employee ID.

;; creates a new dataset aggregating the instances
;; of the original one by the field 000001
(define aggregated-dataset
  (create-dataset {"origin_datasets"
                   ["dataset/5bcbd2b5421aa9560d000000"]
                   "origin_dataset_names"
                   {"dataset/5bcbd2b5421aa9560d000000" "DS"}
                   "sql_query"
                   "SELECT `000001`, count(`000005`) FROM DS GROUP BY `000001`"}))

The equivalent code in Python is:

# creates a new dataset aggregating the instances
# of the original one by the field 000001
api.create_dataset(
    ["dataset/5bcbd2b5421aa9560d000000"],
    {
         "origin_dataset_names": {"dataset/5bcbd2b5421aa9560d000000": "DS"},
         "sql_query":
             "SELECT `000001`, count(`000005`) FROM DS GROUP BY `000001`"
    }
)

It is possible to use field names in the queries, but field IDs are preferred to avoid ambiguities. It is also possible to define aliases for the new fields with the standard SQL keyword AS after an operation, as shown in the sketch below. Note that with SQL you can also perform operations far more complex than the ones demonstrated in this post.
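
As a quick illustration, here is a minimal Python sketch of aliasing with AS, reusing the api object from the first snippet and the hypothetical dataset and field IDs of the aggregation example above:

# a minimal sketch: alias the aggregated count column with AS
# (dataset and field IDs reused from the aggregation example above)
aliased_dataset = api.create_dataset(
    ["dataset/5bcbd2b5421aa9560d000000"],
    {
        "origin_dataset_names": {"dataset/5bcbd2b5421aa9560d000000": "DS"},
        "sql_query":
            "SELECT `000001`, count(`000005`) AS employee_count "
            "FROM DS GROUP BY `000001`"
    }
)
api.ok(aliased_dataset)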

Want to know more about Data Transformations?

If you have any questions or you would like to learn more about how Data Transformations work, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow, as well as the full webinar recording.

Efficient Data Transformations Using the BigML API

As part of our release for Data Transformations, we have outlined both a use case and how to execute the newly available features in the BigML Dashboard. This installment demonstrates how to perform data transformations by calling the BigML REST API. In any Machine Learning workflow, data transformations tend to be among the most time-consuming, yet entirely essential, tasks. BigML transformations enable this process to be integrated more seamlessly with the modeling process, in keeping with BigML’s mission to make Machine Learning simple and accessible. Access to BigML.io requires authentication, so now is a good time to follow these steps if you have not done so already.

BigML data transformations workflow diagram

Inspection of the dataset

If you are not familiar with Black Friday, it refers to the day after Thanksgiving in the United States, which is traditionally one of the busiest days of the year for retailers and kicks off the profitable holiday shopping season. For this tutorial, we are making use of a dataset that can be found and downloaded on Kaggle, consisting of more than 500,000 observations of shopping behavior. We sampled from this original source to create a BigML dataset of 280,889 instances. The 12 fields in the dataset include:

  • User ID: 5,891 distinct customers
  • Product ID: 3,521 distinct products
  • User demographics: gender, age, occupation, and marital status
  • Geographic information: city category, stay in current city
  • Product information: product category and purchase amount

Screenshot of the imported dataset in the BigML Dashboard

Question: What factors are most indicative of a high spending customer?

Being able to reliably identify individuals with high spending is a frequent and critical task for data-savvy marketers. To answer this question, there is a series of data transformation steps that we will need to perform before training and evaluating a supervised Machine Learning model.

  1. Group the dataset according to User ID
  2. Calculate total amount of spending per individual
  3. Join the User ID-aggregated dataset with the original dataset to include relevant demographic information
  4. Determine a threshold for “high spending” that seems reasonable
  5. Build a predictive model that discriminates between high spending and low spending customers
  6. Inspect the feature importance of the model

Grouping the dataset by User ID can be accomplished with the aggregate function in the Dashboard or the equivalent SQL query in the API. This will result in a new dataset named “User ID Grouped” that consists of 5,891 instances and four total fields: “User ID”, “count_Purchase”, “sum_Purchase”, and “avg_Purchase”.

curl "https://bigml.io/dataset?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"origin_datasets": ["dataset/5bcb47d4b95b397eec000066"],
    "origin_dataset_names": {"dataset/5bcb47d4b95b397eec000066":"A"},
    "name":"User ID Grouped",
    "sql_query": "SELECT A.User_ID, count(A.Purchase) AS count_Purchase, 
    sum(A.Purchase) AS sum_Purchase, avg(A.Purchase) AS avg_Purchase 
    FROM A GROUP BY A.User_ID"}'

Because the newly created dataset does not include the interesting demographic information, we perform a join with the original data in order to add this information to each of the new instances. The fields that we are most interested in here are those that define and describe the individual shoppers, not the products. To do this, we will execute three consecutive commands (shown below). The first keeps the fields that we are interested in for this question. The second removes duplicates. And finally, the third joins this dataset with our dataset about cumulative purchase behavior (“User ID Grouped”).

curl "https://bigml.io/dataset?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"origin_datasets": ["dataset/5bcb47d4b95b397eec000066"],
    "origin_dataset_names": {"dataset/5bcb47d4b95b397eec000066":"A"},
    "name":"User ID Demographics",
    "sql_query": "SELECT A.User_ID, A.Gender, A.Age, 
    A.Occupation, A.City_Category, A.Stay_In_Current_City_Years,
    A.Marital_Status FROM A"}'

curl "https://bigml.io/dataset?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"origin_datasets": ["dataset/5bcde92db95b397edd000036"],
         "origin_dataset_names": {"dataset/5bcde92db95b397edd000036":"A"},
         "name":"User ID Demographics [No Duplicates]",
         "sql_query": "SELECT DISTINCT * FROM A"}'

curl "https://bigml.io/dataset?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"origin_datasets": [
            "dataset/5bcdeed0b95b397ee200007e",
            "dataset/5bcdff1cb95b397eec00009a"],
         "origin_dataset_names": {
            "dataset/5bcdeed0b95b397ee200007e": "A",
            "dataset/5bcdff1cb95b397eec00009a": "B"},
         "name":"DATA JOINED",
         "sql_query":
         "SELECT A.*, B.* 
          FROM A LEFT JOIN B ON A.User_ID = B.User_ID"
        }'

By visually inspecting the “sum_purchase” field of our new dataset, we can see that it approximately follows a power law distribution. Rather than build a model that predicts the exact amount of spending, we are more interested in binning customers into groups of “low”, “medium”, and “high” total spending.

curl "https://bigml.io/dataset?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"origin_dataset": "dataset/5bce154ab95b397eec0000a4",
          "new_fields": [{
          "field": "(cond (< (f \"sum_purchase\") 100000) \"LOW\" 
          (cond (< (f \"sum_purchase\") 500000) \"MEDIUM\" \"HIGH\"))",
          "name":"Discrete_Spending_Sum"}]}'

At this point, our data is in a Machine Learning-ready format. We can build a classifier that predicts whether a user falls in the category “LOW”, “MEDIUM”, or “HIGH” based on “Occupation”, “Age”, “Gender”, “Marital_Status”, “City_Category”, and “Stay_In_Current_City_Years”. For more information on how to train and evaluate models, you can consult the API documentation or previous API tutorials. We will jump ahead to view the feature importance of an ensemble classifier trained on this data.
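
For reference, here is only a minimal sketch of that step, using the BigML Python bindings rather than curl; the dataset ID comes from the previous step, while the rest is an assumption about a typical ensemble request:

# a minimal sketch with the Python bindings: train an ensemble on the
# transformed dataset and wait for its field importances to be computed
from bigml.api import BigML

api = BigML()  # credentials are read from the environment
ensemble = api.create_ensemble(
    "dataset/5bce154ab95b397eec0000a4",
    {"objective_field": "Discrete_Spending_Sum"})
api.ok(ensemble)
# the finished resource reports aggregated per-field importances
# (key name taken from the API docs; verify against your response)
print(ensemble["object"].get("importance"))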

Global feature importance for predicting spending behavior

Analyzing the results

It appears that the most important fields for determining high spending are occupation (34.51%), followed by age (19.87%), and “stay in current city” (17.92%). Marital status and gender appear to contribute far less when determining who is a large spender. Considering that this dataset codes for 21 distinct occupations, there is ample opportunity to find additional correlations between spending habits and specific occupations in order to improve or personalize marketing efforts.

With SQL-style queries, there is a virtually limitless range of transformations and interrogations you can perform on your dataset. You can reference our API documentation to find more examples and templates for making these queries programmatically with the BigML API.

Want to know more about Data Transformations?

If you have any questions or you would like to learn more about how Data Transformations work, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow, as well as the full webinar recording.

An Intuitive Guide to Financial Analysis with Data Transformations

With regard to the analysis of financial markets, there exist two major schools of thought: fundamental analysis and technical analysis.

  • Fundamental analysis focuses on understanding the intrinsic value of a company based on information such as quarterly financial statements, cash flow, and other information about an industry in general. The goal is to discover and acquire assets that are currently undervalued, often with a long-term approach to investing.
  • Technical analysis is based on the assumption that all of the relevant information about a company is already baked into its share price. Rather than financial statements, this analysis is focused on analyzing trends in the share price in order to forecast the future value, and can often be conducted on extremely short time scales.

Financial analysis made simple with BigML

While both strategies clearly rely on data, the type of data that is most useful and relevant is drastically different. In this blog post, we will explore how sliding window transformations and dataset joins are fundamental data transformation operations for technical and fundamental financial analysis, respectively.

Loading and Filtering Data

In this use case, we will demonstrate how the exploration of market data is enabled by quick and easy data transformations using BigML. We will be using two primary datasets that contain stock market data from 2016. This data, originally obtained from Kaggle, was pre-processed so as to be more relevant for the new BigML transformation options being highlighted. You can access both of these updated datasets in the BigML Gallery.

To scope down the problem, we are including only the NASDAQ Index and “FAANG” stocks. These stocks correspond to high market cap tech sector companies including Facebook (FB), Amazon (AMZN), Apple (AAPL), Netflix (NFLX), and Alphabet (GOOGL).

Stock Market Data: the FAANG stocks were selected for further investigation

Time Series Feature Engineering

Time series data may not appear to be feature-rich at first glance. For example, for each stock in this dataset (e.g., AAPL), a single vector of values exists, representing the closing price of the asset for each day. Without further analysis or additional data, this is unlikely to be useful for any sort of forecasting task. Of course, there is a reason why technical analysts spend a lot of time evaluating data visually: they are looking to identify past, ongoing, and emerging trends within the time series. Fortunately, these visual trends can also be systematically created and categorized through feature engineering, a task that rather simple data transformations can accomplish en masse.

Data transformations using a sliding window

One of the most valuable classes of data transformations for time series data is that of sliding windows. BigML now supports a large number of sliding window transformations, including measurements of central tendency (mean and median), computation of sums, products, or differences within a window, and returning maximum or minimum values. The parameters needed for a sliding window computation consist of only the desired operation and the start and end of the window relative to the reference instance. Typically, only historical data would make intuitive sense for a time series forecasting problem; however, sliding windows can also be computed using future values if desired. For the sake of example, we will highlight two useful and often-employed types of transformations, with a code sketch following the list below:

BigML configurations to calculate trends using a sliding window across 3 days.

  • Mean of instances: this transformation calculates the average value in a window. This type of transformation can be very useful for smoothing out noisy data in order to focus on more sustained trends.

Smoothing of the pricing data using sliding windows

  • Difference from last: this transformation essentially determines a local, recent trend in the direction of the data. In financial applications, this is frequently referred to as the “momentum” and it represents the velocity of price changes.

Price momentum using sliding windows
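
As referenced above, here is a minimal sketch of both transformations through the API with the Python bindings; the dataset ID and the field name "Close" are hypothetical, and the avg-window and diff-window operator names are assumptions taken from the Flatline reference:

# a minimal sketch, assuming Flatline's avg-window and diff-window
# sliding-window operators; dataset ID and "Close" field are hypothetical
from bigml.api import BigML

api = BigML()
windowed = api.create_dataset(
    "dataset/000000000000000000000000",
    {"new_fields": [
        # 3-day mean of the closing price (two previous days plus today)
        {"field": "(avg-window \"Close\" -2 0)", "name": "close_mean_3d"},
        # momentum: difference between today's and yesterday's close
        {"field": "(diff-window \"Close\" -1 0)", "name": "close_momentum"}]})
api.ok(windowed)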

Joining Data Sources

As introduced previously, fundamental analysis is not nearly as concerned with trends in time series as it is with understanding the intrinsic value of a company through additional sources of information. Accordingly, the most powerful data transformation for fundamental analysis is the ability to quickly and accurately join together information sourced from disparate data sources. Because fundamental financial data is often issued on a quarterly basis, we will first add an additional field to our dataset that assigns each instance to a quarter based on the month of the year (a sketch of the equivalent formula follows the figure below).

The flatline editor is used to add quarterly features to the dataset.
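
As mentioned above, here is a minimal sketch of that formula applied through the API, reusing the api object from the previous sketch; the expanded field name "Date.month" is an assumption about how the date-time field is named after BigML's automatic expansion:

# a minimal sketch: derive the quarter from the month component of a
# date-time field; "Date.month" and the dataset ID are hypothetical
quarters = api.create_dataset(
    "dataset/000000000000000000000000",
    {"new_fields": [{
        "field": ("(cond (<= (f \"Date.month\") 3) \"Q1\" "
                  "(<= (f \"Date.month\") 6) \"Q2\" "
                  "(<= (f \"Date.month\") 9) \"Q3\" \"Q4\")"),
        "name": "QUARTER"}]})
api.ok(quarters)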

We will then add a new dataset consisting of fundamental data from Amazon to our Dashboard. The instances in this dataset refer to quarters rather than individual dates, and include additional fields useful for fundamental analysis such as operating income, revenue, and total equity.

Amazon fundamental data for 2016 financial quarters.

What would be most useful is to join this fundamental data with the information we already have regarding the stock price values. This type of operation is known as a join and can now be conducted in the BigML Dashboard without needing to know or utilize SQL, Pandas, Tableau, or other data manipulation tools. The fields “period” and “QUARTER” refer to the same information, and allow the associated fundamental data to be added to each instance of the daily closing price of Amazon.

Joining of time series stock data with fundamental data.

The resulting dataset allows for more intricate visualizations and association discovery than is possible with only time series or fundamental data. Using the scatterplot option in the Dashboard, we can view the range of closing prices for this stock in relation to fundamental analysis variables, such as “Final Revenue”, as shown below. In the case of Amazon in 2016, there appears to be a strong correlation between these variables, although it is far from telling the entire story with regard to price forecasting.

This blog post shows the power of transforming and combining data in order to gain valuable insights from seemingly simple datasets. In particular, sliding windows and dataset joins are frequently used to perform financial analysis of the technical and fundamental variety, respectively. We encourage you to dig deeper into this dataset and uncover your own unique and informative insights.

Want to know more about Data Transformations?

If you have any questions or you would like to learn more about how Data Transformations work, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow, as well as the full webinar recording.

Data Transformations with the BigML Dashboard: Get your Machine Learning-Ready Data in a Few Clicks

Data preparation is a key task in any Machine Learning workflow, but it’s often one of the most challenging and time-consuming parts. BigML’s upcoming release brings new data transformation features that make it faster and easier than ever before to get your data ready for Machine Learning.

These features significantly expand the data preparation options that BigML already provides, such as missing values treatment, categorical values encoding, date-time fields expansion, and NLP techniques for your text fields.

All the new data transformation features can be classified into two groups:

  • SQL queries: The capability of writing SQL queries to create new datasets opens up an infinite number of transformations to prepare your data for Machine Learning. Although the ability to freely write SQL statements will be an API-only feature for now, we are bringing some common transformations to the Dashboard for users who prefer to transform their data in a few clicks: aggregating instances, and joining and merging datasets. The idea is to add more options to the Dashboard on an ongoing basis; for example, the ability to order instances and remove duplicates. Please send an e-mail to roadmap@bigml.com if you have any particular request.
  • Feature engineering: new sliding windows feature, and significant improvements to the Flatline Editor, enabling more ways to easily create fields for your datasets.

Aggregating Instances

The aggregating instances option in BigML allows you to group the rows of a dataset by a given field.

For example, imagine you have customer data stored in a dataset where each purchase is a different row. If you want to use this dataset to train models that analyze customers’ purchase behaviors, you need a dataset where each row is a customer instead of a purchase. This is the case for the dataset in the image below, where we can aggregate the instances by the field “customerID” to get a row per unique customer. You can also see that we needed aggregation functions for the rest of the fields in order to add them to the new dataset, such as the total purchases per customer (“Count_customerID”), the total units purchased (“Sum_Quantity”), the first purchase date (“Min_Date”), or the average price per unit spent per customer (“Avg_UnitPrice”).

aggregation-example

You can easily do this on the BigML Dashboard by following these steps:

  • Click the “Aggregate instances” option from the dataset configuration menu:

aggregate-instances

  • Select the “CustomerID” as the grouping field:

select-grouping-field

  • Configure the aggregation operations for the fields you want to include in the final dataset. For example, in the image below we are including the count of rows per customer and the total amount of units purchased:

define-aggregation-operations

  • When you have all the operations defined, click on the “Aggregate instances” button:

aggregate-instances-cta

This will create a new dataset containing one customer per row and the columns that you defined using the aggregation functions described above. From this dataset, you can also see the SQL query under the hood by clicking the option highlighted in the image below.

aggregated-dataset.png
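
For readers curious about that query, here is a minimal sketch of the equivalent API call using the Python bindings; the dataset ID and field names are hypothetical stand-ins for the example above:

# a minimal sketch of the aggregation configured above
# (dataset ID and field names are hypothetical)
from bigml.api import BigML

api = BigML()  # credentials are read from the environment
aggregated = api.create_dataset(
    ["dataset/000000000000000000000000"],
    {"origin_dataset_names": {"dataset/000000000000000000000000": "A"},
     "sql_query": "SELECT `customerID`, count(*) AS Count_customerID, "
                  "sum(`Quantity`) AS Sum_Quantity, min(`Date`) AS Min_Date, "
                  "avg(`UnitPrice`) AS Avg_UnitPrice "
                  "FROM A GROUP BY `customerID`"})
api.ok(aggregated)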

Joining Datasets

BigML allows you to join several datasets to combine their fields and instances based on one or more related fields between them. This is very useful when your data is scattered across two or more datasets.

For example, imagine we want to predict employee performance and we have two different sources of data: a dataset containing employee data (employee name, salary, age, etc.) and another dataset containing department data (department name, budget, etc.). If we want to include the department data as an additional predictor for our employee analysis, we can use a common field in both datasets (department_id) to add the department characteristics to the employee dataset (see image below).

join-example.png

You can easily do this on the BigML Dashboard by following these steps:

  • Click the “Join datasets” option from the dataset configuration menu:

join-datasets-option

  • Then select the type of join: left join if you want to get all the instances from the current (left) dataset and the matched instances from the selected (right) dataset; or inner join if you want to get only the instances that have matching values in both datasets. In this case, we are selecting the left join because we want all the employees, regardless of whether they have a matching department. Next, select the dataset you want to join with:

type-of-join.png

  • Select one or more pairs of joining fields to match the instances of both datasets. In this example, we select the department_id to make the match:

join-fields.png

  • Decide which fields of the selected dataset (the departments dataset in our case) you want to include in the final joined dataset:

final-joined-dataset-fields.png

  • Optionally, you can filter the joined dataset by selecting fields from the current or the selected dataset and setting up different filtering conditions. Then go ahead and click the “Join datasets” button.

filter-joined-dataset.png

This will create a dataset that will contain the matched instances and fields from both datasets. From this dataset, you can also see the SQL query under the hood by clicking the option highlighted in the image below.

join-sql
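
Under the hood, the left join above corresponds to a query like this minimal sketch, reusing the api object from the previous sketch (dataset IDs and the department_id field name are hypothetical):

# a minimal sketch of the employee/department left join configured above
# (dataset IDs and the join field name are hypothetical)
joined = api.create_dataset(
    ["dataset/000000000000000000000000",   # employees (left)
     "dataset/111111111111111111111111"],  # departments (right)
    {"origin_dataset_names": {
        "dataset/000000000000000000000000": "A",
        "dataset/111111111111111111111111": "B"},
     "sql_query": "SELECT A.*, B.* FROM A "
                  "LEFT JOIN B ON A.`department_id` = B.`department_id`"})
api.ok(joined)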

Merging Datasets

The merging datasets option in BigML allows you to include the instances of several datasets in one dataset.

This functionality can be very useful when you use multiple sources of data. For example, imagine we have employee data in two different datasets and we want to merge them into one dataset.

merge-example.png

You can easily do this on the BigML Dashboard by following these steps:

  • Click the “Merge datasets” option from the dataset configuration menu:

merge-datasets-option.png

  • Select the datasets you want to merge. The datasets should have the same fields so the instances of one dataset can be appended after the instances of the other. You can select up to 32 datasets to merge, and you can sample each of the selected datasets by configuring the typical BigML sampling options: percentage rate, replacement, out-of-bag, and seed.

select-merge-dataset

  • Click on the “Merge datasets” option:

merge-datasets.png

This will create a dataset that will contain the instances from the merged datasets. From this dataset, you can also see the merging information by clicking the option highlighted in the image below.

merge-information.png

Feature Engineering

Feature engineering, i.e., the creation of new features that can be better predictors for your models, is one of the most important tasks in Machine Learning because it is usually the biggest source of model improvement. That’s why we also focused our efforts on bringing sliding windows to the BigML Dashboard and improving the Flatline Editor.

Sliding windows

Creating new features using sliding windows is one of the most common feature engineering techniques in Machine Learning. It is usually applied to frame time series data, using previous data points as new input fields to predict the next data points.

For example, imagine we have one year of daily sales data and want to predict sales. As domain experts, we know that past sales can be key predictors of today’s sales. Therefore, we can use our objective field “sales” to create additional input fields that contain past data. We can create any number of such fields: the last day’s sales, the average of last week’s sales, the difference between last month’s and this month’s sales, etc. In the image below, we are creating a new predictor that calculates the average sales of the last two days (see the field in green, “avgSales_L2D”).

sliding-windows

This can easily be done on the BigML Dashboard by following these steps:

  • Click the “Add fields” option from the dataset configuration menu:

sliding-window-option

  • Select the mean out of the Sliding windows operations in the selector:

sliding-window-operation

  • Select the field you want to apply the operation to, with a window start of -2 and a window end of -1 (the window start and end define the first and last instances considered for the calculation: negative values are previous instances, positive values are next instances, and zero is the current instance). Then click the “Create dataset” button.

sliding-window-start-end

This will create a dataset with a new field that contains the average sales of the last two days, which can be used as a new predictor.
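
The same field can also be created programmatically; a minimal sketch, reusing the api object from the earlier sketch and assuming Flatline's avg-window operator and a hypothetical dataset ID:

# a minimal sketch, assuming Flatline's avg-window operator:
# average of "sales" over the two previous instances
windowed = api.create_dataset(
    "dataset/000000000000000000000000",
    {"new_fields": [{"field": "(avg-window \"sales\" -2 -1)",
                     "name": "avgSales_L2D"}]})
api.ok(windowed)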

Flatline Editor Improvements

The Flatline editor allows you to easily create new fields for your dataset by using BigML’s domain-specific language Flatline. You can access the editor by selecting the option “Add fields” from the dataset configuration menu, then select the Flatline formula operation and click on the editor icon (see image below).

flatline-editor-access.png

You can see that the dataset preview now includes a table view where you can easily see a sample of your instances.

table-preview-flatline-editor.png

When you write a formula and want to view its result, the preview shows only the fields involved in the formula, so you can easily check whether your formula is being calculated correctly. For example, in the image below you can see only two fields in the preview: the input field used in the formula (“duration”) and the new field resulting from it (if the duration of the movie is longer than 100 minutes, it is classified as “long”; otherwise it is “short”). You can switch this view back to showing all the dataset fields using the green switcher on top of the table preview.

formula-preview-flatline
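
For reference, here is a minimal sketch of that movie-length formula applied via the API, reusing the api object from the earlier sketch (the dataset ID is hypothetical; the field name "duration" comes from the example above):

# a minimal sketch of the formula shown in the preview above
movies = api.create_dataset(
    "dataset/000000000000000000000000",
    {"new_fields": [{
        "field": "(if (> (f \"duration\") 100) \"long\" \"short\")",
        "name": "movie_length"}]})
api.ok(movies)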

Want to know more about Data Transformations?

Stay tuned for the next blog post to learn how to perform data transformations with SQL via the BigML API. If you have any questions or you would like to learn more about how Data Transformations work, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow, as well as the full webinar recording.

Introduction to Data Transformations

BigML’s upcoming release on Thursday, October 25, 2018, will be presenting our latest resource to the platform: Data Transformations. In this post, we’ll do a quick introduction to Data Transformations before we move on to the remainder of our series of 6 blog posts (including this one) to give you a detailed perspective of what’s behind the new capabilities. Today’s post explains the basic concepts that will be followed by an example use case. Then, there will be three more blog posts focused on how to use Data Transformations through the BigML Dashboard, API, and WhizzML for automation. Finally, we will complete this series of posts with a technical view of how Data Transformations work behind the scenes.

Understanding Data Transformations

Transforming your data is one of the most important, yet time-consuming and difficult tasks in any Machine Learning workflow. Of course, “data transformations” is a loaded phrase and entire books are authored on the topic. When mentioned in the Machine Learning context, what we mean is a collection of actions that can be performed compositionally on your input data to make it more responsive to various modeling tasks — if you will, these are the methods to optimally prepare or pre-process your data.

As a reminder, BigML already offers several automatic data preparation options (missing values treatment, categorical fields encoding, date-time fields expansion, NLP capabilities, and even a full domain-specific language for feature generation in Flatline) as well as useful dataset operations such as sampling, filtering, and the addition of new fields. Despite those, we’ve been looking to add more capabilities for full-fledged feature engineering within the platform.

Well, the time has come! This means the powerful set of supervised and unsupervised learning techniques we’ve built from scratch over the last 7 years all stand to benefit from better prepared data. Without further ado, let’s see what goodies made it to this release:

Aggregating Datasets

  • Aggregating instances: at times aggregating highly granular data at higher levels can be necessary. When that happens, you can group your instances by a given field and perform various operations on the other fields. For example, you may want to aggregate sales figures by product and perform further operations on the resulting dataset before applying Machine Learning techniques such as Time Series.
  • Joining datasets: if your data comes from different sources in multiple datasets, you need to join said datasets by defining a join field. For instance, imagine you have a dataset containing user profile information such as account creation date, age, sex, and country, and another dataset that contains transactions that belong to those users with critical fields like transaction date, payment type, amount, and more. If you’d rather have all those fields in a single dataset, you can join those datasets based on a common field such as customer_id.
  • Merging datasets: if you have multiple datasets with the same fields, you may want to concatenate them before you continue your workflow. Take for example a situation where daily files of sensor data need to be collated into a single monthly file before you can proceed. This would be a breeze with the new merge capability built into the BigML Dashboard.
  • Sliding windows: Creating new features using sliding windows is one of the most common feature engineering techniques in Machine Learning. It is usually applied to frame time series data using previous data points as new input fields to predict the next time window data points. For instance, in predicting hospital re-admissions, we may want to break the healthcare data into weekly windows and see how those weekly signals are correlated with the likelihood of re-admission in the following weeks after the patient is released from the hospital.
  • SQL support: This is big! BigML now supports all the operations from PostgreSQL, which means you have the full power of SQL at your disposal through the BigML REST API. You can choose between writing a free-form SQL query or use the JSON-like formulas that the BigML API supports. You can also easily see the associated SQL queries that created a given dataset and even apply them to other datasets — more on those in the subsequent blog posts.

NOTE: Keep this under wraps for now, but before you know it, the Dashboard will be supporting other capabilities such as ordering instances, removing duplicate instances, and more!

Want to know more about Data Transformations?

If you have any questions or you would like to learn more about how Data Transformations work, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow, as well as the full webinar recording.

Data Transformations: Machine Learning-Ready Data!

BigML’s new release is here! Join us on Thursday, October 25, 2018, at 10:00 AM PDT (Portland, Oregon. GMT -07:00) / 07:00 PM CEST (Valencia, Spain. GMT +02:00) for a FREE live webinar to discover the new Data Transformation options added to the BigML platform, which will yield better results in your projects by simplifying a key part of any Machine Learning workflow.

In previous releases, BigML has focused on presenting a wide range of choices on new algorithm implementations and automation to help you solve a wide array of Machine Learning problems. Now, with the Data Transformations release, we reach an important milestone in our roadmap by enhancing our offering in the area of data preparation as well. Typically, data do not come in a format ready to start working on a Machine Learning project right away. Data from many different sources come in different formats, and with plenty of information that does not add any value for the algorithm that will learn from it. Therefore, preparing your data for your Machine Learning project is a key part of the process to obtain the best predictive model.

Although BigML already offers several automatic data preparation options (missing values treatment, categorical fields encoding, date-time fields expansion, NLP capabilities, and even a full domain-specific language for feature generation), we knew we still had more tools to add for full-fledged feature engineering within the platform. That is why BigML is adding new capabilities that greatly expand the functionality for preparing your Machine Learning-ready dataset. The latest version of BigML lets you perform SQL-style queries over your datasets, brings significant improvements to the editor used to write feature generators in Flatline (our feature engineering DSL), and adds new ways of further improving feature engineering.

Up to now, the main ways of transforming datasets were sampling, filtering, and the addition of new fields. All of them work by scanning the input dataset and performing actions based on a finite number of rows at once. However, you cannot perform global operations like ordering, joins, or aggregations in this fashion. In this release, we introduce SQL-like queries that are able to perform such global transformations, among others. This set of operations is crucial for transforming the data you have into the data you actually need. With queries, you will be able to aggregate the instances of your dataset, join datasets, and merge them. You can also easily execute your queries in a few clicks, as you have the full power of SQL at your disposal through the BigML REST API. There’s more: the Dashboard will shortly support other capabilities like ordering instances and removing duplicates!

The BigML Flatline Editor has been upgraded to help you easily create new fields and validate existing Flatline expressions in your Dashboard, in an even friendlier editor. For new BigML users who are not familiar with the term, Flatline is BigML’s domain-specific language for data generation and filtering, which helps you transform your datasets and engineer new features in a wide variety of ways. Apart from the Flatline editor, we also offer some common predefined operations from the Dashboard that allow you to create new features in a few clicks instead of writing formulas. Finally, we are adding sliding windows, one of the most common feature engineering techniques used in Machine Learning. Sliding windows are frequently applied to frame time series data by using previous data points to predict the next data points, e.g., sales for product X over the last rolling 14 days.

Do you want to know more about Data Transformations?

If you have any questions or you would like to learn more about how Data Transformations work, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow, as well as the full webinar recording.

Machine Learning School in Seville, Spain: First Edition!

Seville, the capital of Andalusia (the southern region of Spain), is a place known for its beauty, charming people, and immense cultural heritage. Now, the BigML Team intends to spread the word and promote the adoption of Machine Learning among its citizens, organizations, companies, and academic institutions so the region can also become a more attractive technology hub.

BigML, in collaboration with the EOI Business School, is launching the First Edition of our Machine Learning School in Seville, which will take place on March 7 and 8, 2019. The #MLSEV will be an introductory two-day course optimized for learning the basic Machine Learning concepts and techniques that are impacting all economic sectors. This training event is ideal for professionals who wish to solve real-world problems by applying Machine Learning in a hands-on manner, e.g., analysts, business leaders, industry practitioners, and anyone looking to do more with fewer resources by leveraging the power of automated data-driven decision making.

Besides the basic concepts, the course will cover a selection of state-of-the-art techniques with relevant business-oriented examples, such as smart applications, real-world use cases in multiple industries, practical workshops, and much more.

Where

EOI Andalucía, Leonardo da Vinci Street, 12. 41092. Cartuja Island, Seville, Spain. See map here.

When

2-day event: on March 7-8, 2019 from 8:30 AM to 6:30 PM CET.

Applications

Please complete this form to apply. After your application is processed, you will receive an invitation to purchase your ticket. We recommend that you register soon; space is limited, and as with our previous editions in other locations, the event may sell out quickly.

Schedule

Check out the full agenda and other details of the event here.

Beyond Machine Learning

In addition to the core sessions of the course, we wish to get to know all attendees better since they will make up tomorrow’s creative forces. As such, we are organizing the following activities for you and will be sharing more details shortly:

  • Genius Bar. A useful appointment to help you with your questions regarding your business, projects, or any ideas related to Machine Learning. If you are coming to the Machine Learning School in Seville and would like to share your thoughts with the BigML Team, please book your 30-minute slot by contacting us at mlsev@bigml.com.  
  • Fun runs. We will go for a healthy and fun 30-minute run after the sessions. We will soon share the details on the meeting point and time. Stay tuned!
  • International networking. Meet the lecturers and attendees during the breaks. We expect hundreds of local business leaders and other experts coming from several regions of Spain as well as from different countries.

The BigML Team is excited to bring more editions of our Machine Learning Schools across the globe! The next one will take place one month from now, on November 4 and 5 in Doha, Qatar. To learn more about BigML’s previous Machine Learning Schools, please read this blog post. Do not hesitate to contact us at education@bigml.com if you would like to co-organize a Machine Learning School with us in your city. We look forward to growing the Machine Learning Schools series!
