Data for Database Research
Sometimes, it can be hard to find just the right time series for testing a new algorithm.
My own stuff
I have my own data repository. Quite modest!
Hunt and Kill Terrorists
- Terrorism in Western Europe: Events Data (TWEED) (you might need GNU PSPP software)
Geology
- Earthquake Catalogs & Data (Seismic waveforms)
Astronomy
- Sloan Digital Sky Survey (access is through online SQL queries)
- The Sunspot Cycle (sun spots time series)
Motion Capture
- CMU Graphics Lab Motion Capture Database (includes time series)
Biomedical
- EEG Data
- RR series (from ECG)
- Physiobank (various biomedical including ECGs)
- AtGenExpress (gene expression)
e-Commerce
Financial
- FXHistory : historical exchange rates
- Yahoo Finance
- Penn World Table
Meteorological
- The EREC Weather Station (lots of weather data and more)
Climate and environmental data
- Earth Trends : data tables.
- lifeunderyourfeet: soil data collected from wireless sensors
- Envirofacts Data Warehouse
- European Climate Assessment & Dataset
- National Space Science & Technology Center
- Climatic Research Unit, University of East Anglia
Motion
Blog
OCR
Collaborative Filtering
Text and XML
- Wikipedia dumps (over 900 GB)
- XMLData Repository
- Linguistic Data Consortium
- GOV2 corpus
- Yahoo’s research data sets
- The Google n-gram data set
- Enron Email DataSet
- US Patents
- Reuters Corpora
- Project Gutenberg
- Oxford Text Archive
- Piers Plowman Electronic Archive
- DBLP XML dump
Web Monitoring
- Google Blog Search returns results as XML for any time frame you specify (in days).
- A large collection of feeds, as well as software to collect more
- Feedster
- Blogdigger
- Pubsub
Web Graphs
Sounds
Voice
Web 2.0
(Where people share data sets freely.)
- MB*Base (giant, generic, data warehouse: very good)
- Swivel (giant, generic, data warehouse)
- Many Eyes
Time Series (Various)
- United Nations data (many year-by-year time series)
- Time Series Data Library (wide range of time series, most of them small)
- Eamonn Keogh’s collection of Time Series and his more recent shape data
Various
- UCI Machine Learning Repository
- Frequent Itemset Mining Dataset Repository (includes a traffic accident data set having 340,184 records with 50 dimensions, plus the WebDocs data set having over one million transactions)
- Several links to good data sets for research on del.icio.us
- StatLib-Datasets Archive (good stuff!)
- UCI KDD (includes EEG)
Not really data, but useful nonetheless:
Synthetic data generator
- TPC DBGEN (data warehousing)
General documentation on Time Series theory
General Time Series software
- TDDTool is great time series plotting software
Unfortunately, Le Chronologue is no longer operative: it used dir.com, which stopped crawling in Jan 2007
Comment by David — 26/4/2007 @ 7:36