http://www.textart.ru/database/slogan/list-advertising-slogans.html - a Russian database of advertising slogans. Surprisingly boring.
http://www.usemod.com/cgi-bin/mb.pl?WordList - Word Lists in general
http://www.dcs.shef.ac.uk/research/ilash/Moby/ - possibly the largest word list in particular
A BB post on textbook prices has some good sources for free online textbook resources.
http://www.sacred-texts.com/nos/index.htm - Internet Sacred Text Archive (link to Nostradamus section)
API documentation for the National Nutrient Database for Standard Reference (NDB).
http://mutants.maizegdb.org/doku.php?id=info:index_of_phenotypes - with images (after processing multiple pages)
USDA Complete PLANTS checklist - The Complete PLANTS Checklist is nearly 7 MB and includes Symbol, Synonym Symbol, Scientific Name with Authors, National Common Name, and Family. Fields in this text file are delimited by commas and enclosed in double quotes. You can import this file into many databases or spreadsheets. For example, first save the .txt file, then open in Microsoft Excel by specifying “Text Files” in the file type scroll box, and import by specifying “Comma” as the delimiter. Or use the file directly in Excel: copy it from the screen, paste it into a new worksheet, and select “Text to Columns...” from the Data menu. The complete PLANTS checklist may have more records than Excel allows, so you might need to split it into two worksheets. (direct link)
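If Excel's row limit is a problem, the checklist's stated format (comma-delimited, fields enclosed in double quotes, with the columns named above) parses directly with Python's `csv` module. The record below is illustrative, not copied from the actual file:

```python
import csv
import io

# Sample input mimicking the checklist's described format:
# comma-delimited, every field enclosed in double quotes.
sample = '''"Symbol","Synonym Symbol","Scientific Name with Authors","National Common Name","Family"
"ABBA","","Abies balsamea (L.) Mill.","balsam fir","Pinaceae"
'''

reader = csv.reader(io.StringIO(sample))
header = next(reader)                     # first row is the column names
rows = [dict(zip(header, row)) for row in reader]
print(rows[0]["National Common Name"])    # → balsam fir
```

For the real file, replace `io.StringIO(sample)` with `open("plantlst.txt", newline="")` (filename is a guess; use whatever the direct link saves as).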
http://plants.usda.gov/gallery.html (text + images after search)
NASA technical documents - at Internet Archive
http://ourairports.com/data/ - airport data
http://www.gutenberg.org/wiki/Mathematics_%28Bookshelf%29#Constants_and_Numerical_Sequences - miscellaneous texts consisting mostly of numbers, on Gutenberg (WHICH DOES NOT LIKE SCRAPING).
http://www.crummy.com/2013/11/30/0 - an iterator (Python)
These are, for the most part, not strictly legal, but widely available. So, what to say?
http://project.cyberpunk.ru/lib/ - of dubious legality.
See the Internet Archive’s Language Commons for more.
I have parsed a portion of Enron 1 (from here) into chunks for Poetical Bot.