Data Preparation for Machine Learning using MySQL

Posted by

Most Machine Learning algorithms require data to be into a single text file in tabular format, with each row representing a full instance of the input dataset and each column one of its features. For example, imagine data in normal form separated in a table for users, another for movies, and another for ratings. You can get it in machine-learning-ready format in this way (i.e., joining by userid and movieid and removing ids and names):

"userid","name","age","gender"
1,John Smith,54,male
2,Carol Brew,29,female

"movieid","title","year","genre","director"
1,The shinning,1980,psychological horror,Stanley Kubrick
2,Terminator,1984,science fiction,James Cameron

"userid","movieid", "visited","rented","purchased","rating"
1,1,Yes,No,Yes,4
1,2,Yes,Yes,No,5
2,1,Yes,No,No,0
2,2,Yes,Yes,Yes,5

Machine Learning Ready:
"age","gender","year","genre","director","visited","rented","purchased","rating"
54,male,1980,psychological horror,Stanley Kubrick,Yes,No,Yes,4
54,male,1984,science fiction,James Cameron,Yes,Yes,No,5
29,female,1980,psychological horror,Stanley Kubrick,Yes,No,No,0
29,female,1984,science fiction,James Cameron,Yes,Yes,Yes,5

Denormalizing (or “normalizing” data for Machine Learning) is a more or less complex task depending on where the data is stored and where it is obtained from. Often the data you own or have access to is not available in a single file—may be distributed across different sources like multiple CSV files, spreadsheets or plain text files, or normalized in database tables. So you need a tool to collect, intersect, filter, transform when necessary, and finally export to a single flat, text CSV file.

If your data is small and the changes are simple such as adding a derived field or making a few substitutions you can use a spreadsheet, make the necessary changes, and then export it to a CSV file. But when the changes are more complex; e.g., joining several sources, filtering a subset of the data, or managing a large amount of rows, you might need a more powerful tool like an RDBMS. MySQL is a great one—it’s free. If the data size that you are managing is in the terabytes, then (and only then) you should consider Hadoop.

Machine Learning transformations

Business inspections in San Francisco

Let’s take a look at an actual example. The San Francisco’s Department of Public Health recently published a dataset about restaurants in San Francisco, inspections conducted,  violations observed, and a score calculated by a health inspector based on the violations observed.

You can download the data directly from the San Francisco open data website. Recently some statistics using this data were reported in this post—they may be difficult to stomach. Imagine, however, that you want to use the data to predict what violations certain kind of restaurants commit—or, if you’re a restaurant owner, to predict whether you are going to be inspected. As the data comes “normalized” in four separated files:

  • businesses.csv: a list of restaurants or businesses in the city.
  • inspections.csv: inspections in some of previous businesses.
  • violations.csv: detected law violations in some of previous inspections.
  • ScoreLegend.csv: a legend to describe score ranges.

You will first need to prepare it to be used as input to a Machine Learning service such as BigML.

Analyzing the data

Let’s first have a quick look at the main entities in the data of each file and its relationships. The four files are in CSV format with the following fields:

$ head -3 businesses.csv
"business_id","name","address","city","state","postal_code","latitude","longitude","phone_number"
10,"TIRAMISU KITCHEN","033 BELDEN PL","San Francisco","CA","94104","37.791116","-122.403816",""
12,"KIKKA","250 EMBARCADERO 7/F","San Francisco","CA","94105","37.788613","-122.393894",""

$ head -3 inspections.csv
"business_id","Score","date","type"
10,"98","20121114","routine"
10,"98","20120403","routine"

$ head -3 violations.csv
"business_id","date","description"
10,"20121114","Unclean or degraded floors walls or ceilings [ date violation corrected: ]"
10,"20120403","Unclean or degraded floors walls or ceilings [ date violation corrected: 9/20/2012 ]"

$ head -3 ScoreLegend.csv
"Minimum_Score","Maximum_Score","Description"
0,70,"Poor"
71,85,"Needs Improvement"

inspections

There are three main entities: businesses, inspections and violations. The relationships between entities are: a 0..N relationship between businesses and inspections and an 0..N relationship between inspections and a violations. There’s also a file with a description for each range in the score

To build a machine-learning-ready CSV file containing instances about businesses, their inspections and their respective violations, we’ll follow three basic steps: 1) importing data into MySQL, 2) transforming data using MySQL, and 3) joining and exporting data to a CSV file.

Importing  data into MySQL

First, we’ll need to create a new SQL table with the corresponding fields to import the data for each entity above. Instead of defining a type for each field that we import (dates, strings, integers, floats, etc), we simplify the process by using varchar fields. In this way, we just need to be concerned with the number of fields and their  length for each entity. We also created a new table to import the legends for each score range.

create table business_imp (business_id int, name varchar(1000), address varchar(1000), city varchar(1000), state varchar(100), postal_code varchar(100), latitude varchar(100), longitude varchar(100), phone_number varchar(100));
create table inspections_imp (business_id int, score varchar(10), idate varchar(8), itype varchar(100));
create table violations_imp (business_id int, vdate varchar(8), description varchar(1000));
create table scorelegends_imp (Minimum_Score int, Maximum_Score int, Description varchar(100));

So now we are ready to import the raw data into each of the new tables. We use the load data infile command to define the format of the source file, the separator, whether a header is present, and the table in which it will be loaded.

load data local infile '~/SF_Restaurants/businesses.csv' into table business_imp fields terminated by ',' enclosed by '"' lines terminated by '\n' ignore 1 lines (business_id,name,address,city,state,postal_code,latitude,longitude,phone_number);
load data local infile '~/SF_Restaurants/inspections.csv' into table inspections_imp fields terminated by ',' enclosed by '"' lines terminated by '\n' ignore 1 lines (business_id,score,idate,itype);
load data local infile '~/SF_Restaurants/violations.csv' into table violations_imp fields terminated by ',' enclosed by '"' lines terminated by '\n' ignore 1 lines (business_id,vdate,description);
load data local infile '~/SF_Restaurants/ScoreLegend.csv' into table scorelegends_imp fields terminated by ',' enclosed by '"' lines terminated by '\n' ignore 1 lines (Minimum_Score,Maximum_Score,Description);

If the dataset is  big (i.e., several thousands rows or more), then it’s important to create indexes as follows:

create index inx_inspections_businessid on inspections_imp (business_id);
create index inx_violations_businessidvdate on violations_imp (business_id, vdate);

Transforming  data using MySQL

More often than not, raw data needs to be transformed. For example, numeric codes need to be converted into descriptive labels, different fields need to be joined, some fields might need different format. Also very often missing values, bad formatted data, or wrongly inputted data need to be fixed, and some other times you might need to create and fill new derived fields.

In our example, we’re going to remove the “[ date violation corrected: …]” substring from the violation’s description field:

update violations_imp set description = substr(description,1,instr(description,' [ date violation corrected:')-1) where instr(description,' [ date violation corrected:') > 0;

We are also going to fix some missing data:

update business_imp set city = 'San Francisco' where city = 'San Francicso';
update violations_imp set vdate = '' where vdate = 'N/A';

Finally, we are going to add a new derived field “inspection” and fill it with Yes/No values:

alter table business_imp add column inspection varchar(3) after phone_number;
update business_imp set inspection = 'Yes' where business_id in (select business_id from inspections_imp);
update business_imp set inspection = 'No' where business_id not in (select business_id from inspections_imp);

MySQL has plenty of functions to deal with rows and field transformations. We do love to work at the command line but certainly the versatility of MySQL brings many other advantages.

Joining and Exporting MySQL tables to a CSV file

Once the data has been sanitized and reformed according to our needs, we are ready to generate a CSV file. Before creating it with the denormailized data, we need to make sure that we join the different tables in the right way. Looking at the restaurant data, we can see that some businesses have inspections and others not. For those with inspections, not all of them have violations. Therefore, the query that we should use is a “left join” to collect all businesses, inspections and violations.  Additional transformations, like reformat or concat fields, can be done in this step too.  We also need to make sure that with export the data with a descriptive header.  The next query will make the export trick:

select "Business name", "Address", "City", "State", "Postal code", "Latitude", "Longitude", "Phone number", "Inspection", "Inspection score", "Score type", "Inspection date", "Inspection type", "Violation description"
union all
select a.name, a.address, a.city, a.state, a.postal_code, a.latitude, a.longitude, a.phone_number, a.inspection, b.score, d.description as scoredesc, concat(substr(b.idate,1,4),'-',substr(b.idate,5,2),'-',substr(b.idate,7,2)), b.itype, c.description
into outfile 'sf_restaurants.csv' fields terminated by ',' optionally enclosed by '"' escaped by '\\' lines terminated by '\n'
from business_imp a left join inspections_imp b on (a.business_id = b.business_id) left join violations_imp c on (b.business_id=c.business_id and b.idate = c.vdate) left join scorelegends_imp d on (cast(b.score as unsigned) between d.Minimum_Score and d.Maximum_Score);

A file named sf_restaurants.csv will be generated with a row per instance in this format:

"Business name","Address","City","State","Postal code","Latitude","Longitude","Phone number","Inspection","Inspection score","Score type","Inspection date","Inspection type","Violation description"
"DINO'S PIZZA DELI","2101 FILLMORE ST ","San Francisco","CA","94115","37.788932","-122.433895","","Yes","84","Needs Improvement","2011-05-16","routine","Improper storage of equipment utensils or linens"
"DINO'S PIZZA DELI","2101 FILLMORE ST ","San Francisco","CA","94115","37.788932","-122.433895","","Yes","84","Needs Improvement","2011-05-16","routine","Inadequately cleaned or sanitized food contact surfaces"
"CHEZ MAMAN","1453 18TH ST ","San Francisco","CA","94107","37.762513","-122.397169","+14155378680","Yes","81","Needs Improvement","2012-05-24","routine","Improper thawing methods"
"CHEZ MAMAN","1453 18TH ST ","San Francisco","CA","94107","37.762513","-122.397169","+14155378680","Yes","81","Needs Improvement","2012-05-24","routine","Inadequate food safety knowledge or lack of certified food safety manager"
"CHEZ MAMAN","1453 18TH ST ","San Francisco","CA","94107","37.762513","-122.397169","+14155378680","Yes","81","Needs Improvement","2012-05-24","routine","Inadequately cleaned or sanitized food contact surfaces"
"CHEZ MAMAN","1453 18TH ST ","San Francisco","CA","94107","37.762513","-122.397169","+14155378680","Yes","81","Needs Improvement","2012-05-24","routine","High risk food holding temperature"
"EL MAJAHUAL RESTAURANT","1142 VALENCIA ST","San Francisco","CA","94110","37.754687","-122.420945","+14155827514","Yes","82","Needs Improvement","2012-12-11","routine","Moderate risk vermin infestation"
"EL MAJAHUAL RESTAURANT","1142 VALENCIA ST","San Francisco","CA","94110","37.754687","-122.420945","+14155827514","Yes","82","Needs Improvement","2012-12-11","routine","Improper cooling methods"
"EL MAJAHUAL RESTAURANT","1142 VALENCIA ST","San Francisco","CA","94110","37.754687","-122.420945","+14155827514","Yes","82","Needs Improvement","2012-12-11","routine","High risk food holding temperature"
"J.B.'S PLACE","1435 17TH ST ","San Francisco","CA","94107","37.765003","-122.398084","","Yes","83","Needs Improvement","2011-09-27","routine","Unclean unmaintained or improperly constructed toilet facilities"
"J.B.'S PLACE","1435 17TH ST ","San Francisco","CA","94107","37.765003","-122.398084","","Yes","83","Needs Improvement","2011-09-27","routine","High risk vermin infestation"
"J.B.'S PLACE","1435 17TH ST ","San Francisco","CA","94107","37.765003","-122.398084","","Yes","83","Needs Improvement","2011-09-27","routine","Moderate risk food holding temperature"
...

You can download the raw CSV or clone the dataset in BigML here.

Final steps

Once the data has been exported, you might want to move the file from the MySQL default export folder (usually in the database folder), replace end-of-line characters (\N) for empty strings, and compress the file if it’s too large.

sudo mv /var/mysql/data/sf_restaurants/sf_restaurants.csv /tmp/
sed "s/\\\N//g" /tmp/sf_restaurants.csv > /tmp/sf_restaurants_bigml.csv
bzip2 /tmp/sf_restaurants_bigml.csv

Finally, if you want to get a first quick predictive model you upload it to BigML with BigMLer as follows:

bigmler --train /tmp/sf_restaurants_bigml.csv.bz2 --objective Inspection

Et voilà!, a model like this can be yours.

One comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s