NOTE: currently offline - Russian database of advertising slogans. Surprisingly boring. - Word lists in general - the largest(?) word list, in particular


BB post on textbook prices has some good sources on free online textbook resources. - used by Darius Kazemi (locally scraped) in dialogue (as noted in a comment). - Internet Sacred Text Archive (link to Nostradamus section)


API documentation for the National Nutrient Database for Standard Reference (NDB). - with images (after processing multiple pages)


USDA Complete PLANTS checklist - The Complete PLANTS Checklist is nearly 7 MB and includes Symbol, Synonym Symbol, Scientific Name with Authors, National Common Name, and Family. Fields in this text file are delimited by commas and enclosed in double quotes. You can import this file into many databases or spreadsheets. For example, first save the .txt file, then open in Microsoft Excel by specifying “Text Files” in the file type scroll box, and import by specifying “Comma” as the delimiter. Or use the file directly in Excel: copy it from the screen, paste it into a new worksheet, and select “Text to Columns...” from the Data menu. The complete PLANTS checklist may have more records than Excel allows, so you might need to split it into two worksheets. (direct link) (text + images after search)
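If Excel's row limit gets in the way, the file can also be read with a few lines of Python. This is only a sketch of the comma-delimited, double-quoted layout described above: the header names are taken from that description, and the sample rows are made-up placeholders, not real PLANTS records.

```python
import csv
import io

# Hypothetical sample in the layout described above (comma-delimited,
# every field enclosed in double quotes). The rows are placeholders.
SAMPLE = '''"Symbol","Synonym Symbol","Scientific Name with Authors","National Common Name","Family"
"XXXX","","Genus species (Auth.) Other, Jr.","example plant","Examplaceae"
"XXXX","YYYY","Genus synonymus L.","","Examplaceae"
'''


def read_checklist(handle):
    """Yield one dict per record, keyed by the names in the header row."""
    reader = csv.DictReader(handle, delimiter=",", quotechar='"')
    for row in reader:
        yield row


def common_names(rows):
    """Return the non-empty common names, in file order."""
    return [r["National Common Name"] for r in rows if r["National Common Name"]]


records = list(read_checklist(io.StringIO(SAMPLE)))
print(len(records), "records;", common_names(records))
```

For the real download, replace the `io.StringIO(SAMPLE)` handle with `open(path, newline="")`, where `path` is wherever the .txt file was saved. The quoting matters because author strings can contain commas, as in the first placeholder row.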


NASA technical documents - at Internet Archive


Numbers - airport data - misc. texts containing mostly numbers on Gutenberg

In the past I had some luck including A List of Factorial Math Constants in corpora (?!??) used for Markov generation.
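For reference, mixing numbers into a Markov corpus needs nothing special, since the chain only sees tokens. The sketch below is a generic word-level chain, not the code actually used for those bots; the function names and the tiny corpus are illustrative assumptions.

```python
import random
from collections import defaultdict


def build_chain(tokens, order=1):
    """Map each run of `order` tokens to the list of tokens that followed it."""
    chain = defaultdict(list)
    for i in range(len(tokens) - order):
        state = tuple(tokens[i:i + order])
        chain[state].append(tokens[i + order])
    return dict(chain)


def generate(chain, length=12, seed=None):
    """Walk the chain from a random state until `length` tokens or a dead end."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if not followers:
            break  # this state never had a successor in the corpus
        out.append(rng.choice(followers))
        state = tuple(out[-len(state):])
    return out


# Illustrative corpus: a scrap of prose followed by the first few factorials.
prose = "the cat sat on the mat"
numbers = "1 2 6 24 120 720"
tokens = (prose + " " + numbers).split()

chain = build_chain(tokens)
print(" ".join(generate(chain, length=10)))
```

With a corpus this small the output mostly replays the input; the interesting collisions only show up once the prose and the number lists share tokens.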



Project Gutenberg

Note that Project Gutenberg DOES NOT LIKE SCRAPING. - iterator (Python)



Downloadable ebooks

These are, for the most part, not strictly legal, but widely available. So, what to say? - of dubious legality.




Stanford Politeness Corpus - (zip)
Brown Corpus - Wikipedia:Brown_Corpus


See the Internet Archive’s Language Commons for more.


Switchboard Dialog Act Corpus - notes @


Enron Spam Corpus



I have parsed a portion of Enron 1 (from here) into chunks for Poetical Bot.


Things like corpora, but not actually corpora

Character Relations - 2700 annotated character relations from 109 literary texts
Harvard Sentences - MetaFilter, also see Wikipedia:Harvard_sentences




TODO: tools, scripts, etc. for extracting/cleaning up e-texts
cleanup tools - some of my scripts. ongoing.
Using SPARQL to scrape DBpedia
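As a rough idea of what a SPARQL query against DBpedia looks like from Python, here is a minimal sketch using only the standard library. It assumes the public endpoint at https://dbpedia.org/sparql and its `format` request parameter; the query itself (English labels of things typed `dbo:Book`) and the helper names are illustrative choices, not taken from the linked article.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://dbpedia.org/sparql"  # assumed public DBpedia endpoint

# Illustrative query: English labels for ten resources typed dbo:Book.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?book ?title WHERE {
  ?book a dbo:Book ;
        rdfs:label ?title .
  FILTER (lang(?title) = "en")
}
LIMIT 10
"""


def build_url(query, endpoint=ENDPOINT):
    """Encode the query as a GET request asking for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urllib.parse.urlencode(params)


def parse_bindings(payload):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    rows = payload["results"]["bindings"]
    return [{name: cell["value"] for name, cell in row.items()} for row in rows]


def fetch(query):
    """Run the query over the network and return the flattened rows."""
    with urllib.request.urlopen(build_url(query), timeout=30) as resp:
        return parse_bindings(json.load(resp))
```

Calling `fetch(QUERY)` needs network access and returns whatever the endpoint holds at the time, so results will vary. Keeping `LIMIT` small and caching responses locally is kinder to the server than repeated large queries.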



