Title: Collections
TODO: Organize these somehow, add one-line blurbs
Organize by usage? (classification, recommendation etc.)
## Collections of Collections
- [ML Data](http://mldata.org/about/)
A repository supported by Pascal 2.
- [DBPedia](http://wiki.dbpedia.org/Downloads30)
- [UCI Machine Learning Repo](http://archive.ics.uci.edu/ml/)
- [http://mloss.org/community/blog/2008/sep/19/data-sources/](http://mloss.org/community/blog/2008/sep/19/data-sources/)
- [Linked Library Data](http://ckan.net/group/lld)
via CKAN
- [InfoChimps](http://infochimps.com/)
Free and purchasable datasets
- [LinkedIn discussion of many data sets](http://www.linkedin.com/groupItem?view=&srchtype=discussedNews&gid=3638279&item=35736572&type=member&trk=EML_anet_ac_pst_ttle)
## Categorization Data
- [20Newsgroups](http://people.csail.mit.edu/jrennie/20Newsgroups/)
- [RCV1 data set](http://jmlr.csail.mit.edu/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm)
- [10 years of CLEF Data](http://direct.dei.unipd.it/)
- [http://ece.ut.ac.ir/DBRG/Hamshahri/](http://ece.ut.ac.ir/DBRG/Hamshahri/)
(Approximately 160k categorized docs)
There is a newer beta version here: [http://ece.ut.ac.ir/DBRG/Hamshahri/ham2/](http://ece.ut.ac.ir/DBRG/Hamshahri/ham2/)
(Approximately 320k categorized docs)
- [Lending Club loan data](https://www.lendingclub.com/info/download-data.action)
## Recommendation Data
- [Book usage and recommendation data from the University of Huddersfield](http://library.hud.ac.uk/data/usagedata/)
- [Last.fm](http://denoiserthebetter.posterous.com/music-recommendation-datasets)
\- Non-commercial use only
- [Amazon Product Review Data via Jindal and Liu](http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html)
\- Scroll down the page to find the data set
- [GroupLens/MovieLens Movie Review Dataset](http://www.grouplens.org/node/73)
## Multilingual Data
- [http://urd.let.rug.nl/tiedeman/OPUS/OpenSubtitles.php](http://urd.let.rug.nl/tiedeman/OPUS/OpenSubtitles.php)
\- 308,000 subtitle files covering about 18,900 movies in 59 languages
(July 2006 numbers). This is a curated collection of subtitles from an
aggregation site, [http://www.openSubTitles.org](http://www.openSubTitles.org).
The original site, OpenSubtitles.org, has since grown to 1.6m subtitle files.
- [Statistical Machine Translation](http://www.statmt.org/)
\- devoted to all things related to language translation. Includes multilingual
corpora of European and Canadian legal tomes.
## Geospatial
- [Natural Earth Data](http://www.naturalearthdata.com/)
- [OpenStreetMap](http://wiki.openstreetmap.org/wiki/Main_Page)
And other crowd-sourced mapping data sites.
## Airline
- [Open Flights](http://openflights.org/)
\- Crowd-sourced database of airlines, flights, airports, times, etc.
- [Airline on-time information - 1987-2008](http://stat-computing.org/dataexpo/2009/)
\- 120m CSV records, 12 GB uncompressed
## General Resources
- [theinfo](http://theinfo.org/)
- [WordNet](http://wordnet.princeton.edu/obtain)
- [Common Crawl](http://www.commoncrawl.org/)
\- freely available web crawl on EC2
## Miscellaneous
- [http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html](http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html)
- [4 Universities Data Set](http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/)
- [Large crawl of Twitter](http://an.kaist.ac.kr/traces/WWW2010.html)
- [UniProt](http://beta.uniprot.org/)
- [http://www.icwsm.org/2009/data/](http://www.icwsm.org/2009/data/)
- [http://data.gov](http://data.gov)
- [http://www.ckan.net/](http://www.ckan.net/)
- [http://www.guardian.co.uk/news/datablog/2010/jan/07/government-data-world](http://www.guardian.co.uk/news/datablog/2010/jan/07/government-data-world)
- [http://data.gov.uk/](http://data.gov.uk/)
- [51,000 tagged US Congressional bills](http://www.ark.cs.cmu.edu/bills/)