Datasets for testing

During our time in the search business we have collected some datasets and word lists that we think can be useful for testing and for research into search and information retrieval in general.



By Searchdaimon

  • Wikipediadoc

    This dataset consists of 67 537 Wikipedia articles converted to Word format. It was made by parsing an XML database dump of Wikipedia and converting it to individual HTML files. Each HTML file was then opened in Microsoft Word 2002 (Office XP) and saved by Word as .doc.

    Creative Commons Attribution 3.0
  • English and Norwegian word stemmers with word lists for reverse lookup

    A stemmer is a heuristic transformation that aims to reduce a word to its stem, base or root form. For example, walked, walks and walking are all derived from the word walk.

    This dataset consists of Perl scripts that can stem English and Norwegian words, e.g. stem(walked) -> walk, and two word lists for reverse stemming lookup, e.g. lookup(walk) -> walked, walks, walking. The word lists were created by stemming the most common words found on 64 million web pages.


    29979 English words
    66479 Norwegian words

    Creative Commons Attribution 3.0
  • Lists of adult words

    Lists of words and two-word phrases often seen on pornographic sites.

    Creative Commons Attribution-ShareAlike 3.0
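
The stem and reverse lookup idea described for the stemmer word lists above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_stem` is a hypothetical suffix-stripping function, not the algorithm used by the Perl scripts, and `build_reverse_index` and `lookup` are names made up for this example.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stemmer: strips a few English suffixes.

    Assumption for illustration only; the Perl scripts in the dataset
    may use a different and more complete set of rules.
    """
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_reverse_index(words, stem=toy_stem):
    """Map each stem to the set of surface forms that reduce to it."""
    index = defaultdict(set)
    for word in words:
        index[stem(word)].add(word)
    return index


def lookup(index, stem_form):
    """Return the known surface forms for a stem, sorted alphabetically."""
    return sorted(index.get(stem_form, ()))


index = build_reverse_index(["walked", "walks", "walking", "talked"])
print(lookup(index, "walk"))
```

A reverse index like this is what lets a search engine expand a stemmed query term back into the forms that actually occur in documents, for example when highlighting hits.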


Recommended third party resources

  • EDRM Enron Email Data Set v2

    One of the best and most used data sets in information retrieval research. This data set contains Enron e-mail messages and attachments from about 150 users, mostly senior management of Enron, organized into folders. This data was originally made public, and posted to the web, by the US Federal Energy Regulatory Commission during its Enron investigation. The data set was created by EDRM.

    Contains an XML description, EML files with attachments, native attachments, text email bodies and text email attachments.

    Creative Commons Attribution 3.0 United States License
  • Geocities - 641 GB of fun

    Have you been missing the blink tag lately? Or maybe you want to test your HTML parser on some real data. The Geocities torrent is one of the largest collections of real hand-made documents, created by millions of users. The data set was created by The Archive Team.

    Unknown license
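
For the EML files in the Enron data set listed above, a minimal sketch of how a test harness might pull out the subject, sender, plain-text body and attachment names, using only Python's standard `email` package. The function name `summarize_eml` and the shape of the returned dictionary are assumptions made for this example, and nothing here relies on the specific layout of the EDRM distribution.

```python
from email import policy
from email.parser import BytesParser


def summarize_eml(raw_bytes):
    """Parse the raw bytes of one EML file and return a small summary dict.

    Hypothetical helper for illustration; adapt the fields to your needs.
    """
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)

    # Prefer the text/plain body part; it may be missing entirely.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""

    # Collect file names of attachments (None when a part has no name).
    attachments = [part.get_filename() for part in msg.iter_attachments()]

    return {
        "subject": str(msg.get("subject", "")),
        "from": str(msg.get("from", "")),
        "body": body,
        "attachments": attachments,
    }
```

To use it on a file, read the file in binary mode and pass the bytes in, for example `summarize_eml(open(path, "rb").read())`, where `path` points at any `.eml` file.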

