List of Public Data Sources Fit for Machine Learning

Below is a collection of links to free and open datasets that can be used to build predictive models. We hope our readers will make the best use of them by gaining insight into how the world and our governments work, for the sake of the greater good. If you have an academic or research project, please keep in mind that BigML offers special discounts and free access for those. In fact, you will automatically get a free PRO subscription as long as you sign up with your “.edu” email address.

Data Journals

Data-artikelen | Sargasso
Data journalism and data visualization from the Datablog | News | The Guardian

Data Marketplaces and Data Hubs

Knoema – Home
Public Data Sets : Amazon Web Services
Data Publica | Les données pour votre business
Archive-It – Web Archiving Services for Libraries and Archives
Google Public Data Explorer
Welcome – the Data Hub
Data Sets | AggData
Find & Purchase Data Subscriptions | Windows Azure Marketplace
Factual | Home

Data Search Engines

Zanran Numerical Data Search
Quandl – Intelligent Search for Numerical Data

International Bodies & Agencies

IMF Data and Statistics
Data | The World Bank
Data and maps — European Environment Agency (EEA)
Eurostat Home

Local Government

Inicio Misiones
Open Government Data Wien (OGD)
Open data – City of Brussels
Open Data – Brisbane City Council
Open data – Salford City Council
Sunderland City Council : Local Public Data
Welcome to the London Datastore | London DataStore
Leeds City Council – Open Data
Home – DataGM – Data Greater Manchester
Open Data | Derby City Council
Council data – Brighton & Hove City Council
Open Data – Birmingham City Council
Aberdeen City Council Open Data
Open Data – City of Waterloo
Open Data catalogue | City of Vancouver
Open Data Home – Open Data – Home | City of Toronto
City of Prince George – Open Data Catalogue
Open Data Ottawa | City of Ottawa
Open Data Catalogue – City of Red Deer
Open Data | City of Niagara Falls, Canada
Open Data Catalogue | City of Nanaimo – Residents – Publications and Open Data Catalogue
City of Medicine Hat Open Data Catalogue
Kamloops open data
Open Data Catalogue Kelowna
City of Hamilton – Open Data
City of Fredericton – Open Data Home
City of Edmonton Open Data Catalogue
City of Somerville, MA
Data.Seattle.Gov | Seattle’s Data Site
City of Scottsdale
Welcome – Santa Cruz Open Data
Data | San Francisco
Open Raleigh – The Official City of Raleigh Portal
Datasets | Portland OR
OpenDataPhilly – Connecting People With Data
NYC Open Data
Greater New Orleans Community Data Center
City of Madison | Open Data
City and County of Honolulu
US/Data Catalog District of Columbia
Denver Open Data Catalog
The Cook County Government Open Data Website
City of Chicago | Data Portal
Open Government | City of Boston
OpenBaltimore / City of Baltimore’s Open Data Catalog
Open Austin
OpenDataAsheville – Connecting People With Data
GovHK: About Data.One
Singapore

Machine Learning Challenges

Competitions – Kaggle
Data – Repository – Causality Workbench
TunedIT – Data mining & machine learning data sets, algorithms, challenges

Machine Learning Datasets

TunedIT – Data mining & machine learning data sets, algorithms, challenges
mldata :: Welcome
UCI Machine Learning Repository: Data Sets

Miscellaneous Data Sources

IHME | Institute for Health Metrics and Evaluation
Gapminder: Unveiling the beauty of statistics for a fact based world view.
Doing Research in New York City Public Schools and Requesting Data – NYC Data – New York City Department of Education
RITA | BTS
Oregon Climate Data
Quantnet :: Start
Data Tools – Locators
My Data | Measured Me
Webscope from Yahoo! Labs Research Data
Online Data – Robert Shiller
Obtaining Data From the NSSDC
Cancer Program Data Sets
The Cancer Imaging Archive (TCIA)
Million Song Dataset | scaling MIR research
Google Ngram Viewer
Data | GeoDa Center
Home – GEO DataSets – NCBI
The Financial Data Finder A – G
Frequent Itemset Mining Dataset Repository
Europeana Professional – Linked Open Data
Inforum – EconData
Summary of Data Sets by Application Area
Data Sets | Pew Research Center’s Internet & American Life Project
Cosm – Explore
Advanced NFL Stats: Play-by-Play Data

National Governments and States

Portal de Obligaciones de Transparencia
Junta de Andalucía – Datos abiertos
Reutilización de la Información del Sector Público | Reutilización de la Información de los Servicios Públicos
Portal de Datos Abiertos de JCCM
Ayuntamiento de Zaragoza. Datos de Zaragoza Reutilización
Dades obertes Lleida – Ajuntament de Lleida
Dades Obertes. Generalitat de Catalunya
Dades Obertes CAIB
Reutilización de la Información del Sector Público en Gijón
Open Data Euskadi ataria, Eusko Jaurlaritzaren datu publikoen irekitzea
Data for Hawaii
Florida Has A Right To Know
Commonwealth Data Point
Open Data
Connecticut Transparency Website Open Data
NYS Data Center DataShare
State of Alabama
Open Government for the State of Tennessee | Government | State Facts and History
OpenDoor – Kentucky
Open Illinois
SOM – Michigan Data Store
Louisiana Transparency and Accountability Portal
State of Missouri Data Portal
DATAshare
Minnesota open data // your portal for Minnesota data transparency
Open Data Texas
Welcome to Oklahoma’s Official Web Site
KanView: Kansas Transparency Taxpayer Act – Kansas Revenues and Expenditures Search
OPEN SD :: South Dakota Government Information
North Dakota GIS (Geographic Information Systems)
State Government Data New Mexico The Official State Web Portal
Arizona OpenBooks | – Arizona Transparency Finances in Detail
Utah Data
Data Transparency for the State of California
Oregon Data | Opening Oregon’s Data
Data.Washington | Washington State’s Data Site
Home
Portal de Datos Públicos – Inicio | Portal del Estado Uruguayo
Bem vindo – Portal Brasileiro de Dados Abertos
Directorio de Empresas, Marcas registradas, Normas legales y Teléfonos en Perú
The Portal to Ireland’s Official Statistics
The Belgian open data initiative
het open dataportaal van de Nederlandse overheid
PortalU – German Environmental Information Portal
Statistical database | Portalul datelor guvernamentale deschise al Republicii Moldova
Offene Daten Österreich
Vitajte
I dati aperti della PA
Δημοσια, Ανοικτά Δεδομένα
Open Kenya | Transparent Africa
SAUDI | National e-Government Portal
Home – New Zealand government data online
Open Data Canada
OpenAid – Start
Åpne offentlige data i Norge – Difi
Portada
Open Data Colombia
home

Open Companies Data Sources

Yelp’s Academic Dataset | Yelp
Data Export – Prosper
Lending Club Statistics – Lending Club

U.S. Agencies Data Sources

Federal Agency Participation
FRB: Data Download Program (DDP)

Various Lists of Data Sources

Programming Challenges: What are some good “toy problems” in data science? – Quora
Data: Where can I find large datasets open to the public? – Quora
Data Analysis: What’s your favorite free data source? – Quora
What are some publicly available market data feeds? – Quora
Is there a reliable free source for per country LinkedIn statistics? – Quora
@pskomoroch #dataset – Delicious
Free, Public Data Sets | Hacker News
List of European Open Data Catalogues
Open Data
Datasets Archive
Some Datasets Available on the Web » Data Wrangling Blog

Research Quality Datasets by Hilary Mason

Lending Club Loan Data
SMS Spam Collection
Flickr personal taxonomies
Yahoo Data for Researchers
ICWSM Spinnr Challenge 2011 dataset
Quantum Chaotic Thoughts: Facebook100 Data Set
Public Data Sets on Amazon Web Services (AWS)
The ClueWeb09 Dataset
Census Bureau Home Page
Data | The World Bank
What is Twitter, a Social Network or a News Media? – WWW’10
dotbot
help – arXiv Bulk Data Access – Amazon S3
YouTube Dataset
Face Recognition Homepage – Databases
Pajek datasets
UCI Network Data Repository
Datasets for “The Elements of Statistical Learning”
Enron Email Dataset
MovieLens Data Sets | GroupLens Research
Translation Task – EMNLP 2011 Sixth Workshop on Statistical Machine Translation
Project Gutenberg
About WordNet – WordNet – About WordNet
Aligned Hansards of the 36th Parliament of Canada
CRCNS – Collaborative Research in Computational Neuroscience – Data sharing
USENET corpus
UCI Machine Learning Repository
Gene Expression Omnibus (GEO) Main page
Social Science Data
IMDB dataset
Stanford Large Network Dataset Collection
Google Books n-gram dataset
Million Song Dataset | scaling MIR research
Belly Button Biodiversity 2.0
Sharing PyPi/Maven dependency data « RTFB
Click Dataset | Center for Complex Networks and Systems Research
The Electric Rice Cooker — One year of deleted weibos archive
Registered meteorites that has impacted on Earth visualized – AnalyticBridge
GeoJSON files for real-time Virginia transportation data.
NYPD Crash Data Band-Aid
11 Billion Clues in 800 Million Documents: A Web Research Corpus Annotated with Freebase Concepts | Research Blog
Big data set – 3.5 billion web pages – made available for all of us – Big Data News
Data.Seattle.Gov | Seattle’s Data Site
New Crawl Data Available! | CommonCrawl
Detailed data on pass rates, race, and gender for 2013
Data Download