Sources

http://textfiles.com/

 

A Piece of Text - a random snippet from our own Amazing Text

NOTE: currently offline

http://www.wikipedia.org/wiki
http://www.textart.ru/database/slogan/list-advertising-slogans.html - Russian database of advertising slogans. Surprisingly boring.
http://www.usemod.com/cgi-bin/mb.pl?WordList - Word Lists in general
http://www.dcs.shef.ac.uk/research/ilash/Moby/ - the largest(?) word list, in particular

 

BB post on textbook prices has some good pointers to free online textbook resources.

 

http://dreambank.net/ - used by Darius Kazemi (locally scraped) in dialogue (as noted in a comment).

 

https://www.marxists.org

 

http://www.sacred-texts.com/nos/index.htm - Internet Sacred Text Archive (link to Nostradamus section)

 

http://davycrockettsalmanack.blogspot.com/2013/08/forgotten-spicy-detective-stories-dan.html

 

https://www.npmjs.com/package/gutencorpus

 

API documentation for the National Nutrient Database for Standard Reference (NDB).

 

http://mutants.maizegdb.org/doku.php?id=info:index_of_phenotypes - with images (after processing multiple pages)

 

USDA Complete PLANTS checklist - The Complete PLANTS Checklist is nearly 7 MB and includes Symbol, Synonym Symbol, Scientific Name with Authors, National Common Name, and Family. Fields in this text file are delimited by commas and enclosed in double quotes. You can import this file into many databases or spreadsheets. For example, first save the .txt file, then open in Microsoft Excel by specifying “Text Files” in the file type scroll box, and import by specifying “Comma” as the delimiter. Or use the file directly in Excel: copy it from the screen, paste it into a new worksheet, and select “Text to Columns...” from the Data menu. The complete PLANTS checklist may have more records than Excel allows, so you might need to split it into two worksheets. (direct link)
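For use outside a spreadsheet, a comma-delimited, double-quoted file like the one described above can be read with Python's standard csv module. This is only a minimal sketch: it assumes the file starts with a header row naming the five fields, and the sample record is made up for illustration (it is not real PLANTS data).

```python
import csv
import io


def read_checklist(handle):
    """Yield one dict per record from a comma-delimited, double-quoted file.

    Assumes the first row is a header row (an assumption, not confirmed here).
    """
    reader = csv.DictReader(handle, delimiter=",", quotechar='"')
    for row in reader:
        yield row


# Made-up sample in the described layout; field names follow the description.
SAMPLE = (
    '"Symbol","Synonym Symbol","Scientific Name with Authors",'
    '"National Common Name","Family"\n'
    '"ABCD","","Examplea plantus L.","example plant, common","Exampleaceae"\n'
)

rows = list(read_checklist(io.StringIO(SAMPLE)))
for row in rows:
    # Commas inside quoted fields survive intact.
    print(row["Symbol"], "|", row["National Common Name"])
```

To read the real download, pass an open file instead of the StringIO object, e.g. `open(path, newline="", encoding="utf-8")`, where the path and the encoding are again assumptions to check against the actual file.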

 

http://plants.usda.gov/gallery.html (text + images after search)

 

NASA technical documents - at Internet Archive

 

Numbers

http://ourairports.com/data/ - airport data
http://www.gutenberg.org/wiki/Mathematics_%28Bookshelf%29#Constants_and_Numerical_Sequences - misc. texts containing mostly numbers on Gutenberg

I had some luck in the past including A List of Factorial Math Constants in corpuses (?!??) used for Markov generation.
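As a rough illustration of what mixing a numbers text into a Markov corpus looks like, here is a minimal word-level sketch. The function names and the tiny sample texts are hypothetical, and this is not the generator actually used for the experiments mentioned above.

```python
import random
from collections import defaultdict


def build_chain(tokens, order=1):
    """Map each run of `order` tokens to the list of tokens seen right after it."""
    chain = defaultdict(list)
    for i in range(len(tokens) - order):
        key = tuple(tokens[i:i + order])
        chain[key].append(tokens[i + order])
    return chain


def generate(chain, length=20, seed=None):
    """Walk the chain from a random starting state, up to `length` tokens."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))
    out = list(state)
    for _ in range(length - len(state)):
        followers = chain.get(state)
        if not followers:
            break  # dead end: this state was never followed by anything
        nxt = rng.choice(followers)
        out.append(nxt)
        state = state[1:] + (nxt,)
    return " ".join(out)


# Toy corpus: a prose fragment followed by a run of numbers.
prose = "the cat sat on the mat".split()
numbers = "1 2 6 24 120 720".split()
chain = build_chain(prose + numbers)

print(generate(chain, length=12, seed=1))
```

Because the two texts share a single transition table, the generator can wander from prose into a run of digits wherever the corpora touch or share a token.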

 

 

Project Gutenberg

WHICH DOES NOT LIKE SCRAPING
http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project#Downloading_Via_BitTorrent
http://www.gutenberg.org/
http://www.crummy.com/2013/11/30/0 - iterator (Python)

 

 

Downloadable ebooks

are, for the most part, not strictly legal, but widely available. So, what to say?

 

http://project.cyberpunk.ru/lib/ - of dubious legality.

 

 

Corpora

Stanford Politeness Corpus - (zip)
Brown Corpus - Wikipedia:Brown_Corpus

 

http://www.nltk.org/nltk_data/

 

See the Internet Archive’s Language Commons for more.

 

Switchboard Dialog Act Corpus - notes @ https://catalog.ldc.upenn.edu/LDC97S62

 

Enron Spam Corpus

Wikipedia:Enron_Corpus
http://www.aueb.gr/users/ion/data/enron-spam/
https://github.com/shenzhun/creating-enron-spam-corpus-from-raw-data
https://www.technologyreview.com/s/515801/the-immortal-life-of-the-enron-e-mails/
https://www.cs.cmu.edu/~enron/

 

I have parsed a portion of Enron 1 (from here) into chunks for Poetical Bot.
See https://github.com/MichaelPaulukonis/NaPoGenMo2016/tree/master/corpus/spam

 

Things like corpora, but not actually corpora

Character Relations - 2700 annotated character relations from 109 literary texts
Harvard Sentences - MetaFilter, also see Wikipedia:Harvard_sentences

 

 

Tools

TODO: tools, scripts, etc. for extracting/cleaning up e-texts
cleanup tools - some of my scripts. ongoing.
Using SPARQL to scrape DBPedia
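For the SPARQL item above, a minimal sketch of how a text-gathering query against DBpedia might be put together. It assumes the public endpoint at https://dbpedia.org/sparql and the standard SPARQL JSON results layout; the query itself (English labels and abstracts of books) and the function names are illustrative assumptions, not taken from the linked write-up. The code only builds the request URL and parses a hand-made response, so it runs without network access.

```python
from urllib.parse import urlencode

# Assumed public DBpedia SPARQL endpoint.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query: English labels and abstracts for ten books.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label ?abstract WHERE {
  ?book a dbo:Book ;
        rdfs:label ?label ;
        dbo:abstract ?abstract .
  FILTER (lang(?label) = "en" && lang(?abstract) = "en")
}
LIMIT 10
"""


def build_request_url(query, endpoint=DBPEDIA_ENDPOINT):
    """Return a GET URL asking the endpoint for JSON-formatted results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


def extract_texts(results):
    """Pull (label, abstract) string pairs out of a SPARQL JSON results dict."""
    return [
        (b["label"]["value"], b["abstract"]["value"])
        for b in results["results"]["bindings"]
    ]


# Hand-made response in the SPARQL JSON results layout (not real data).
fake_response = {
    "head": {"vars": ["label", "abstract"]},
    "results": {"bindings": [
        {"label": {"type": "literal", "value": "Example Book"},
         "abstract": {"type": "literal", "value": "An example abstract."}},
    ]},
}

print(build_request_url(QUERY)[:60] + "...")
print(extract_texts(fake_response))
```

Fetching the URL (for example with `urllib.request.urlopen`) and decoding the body with `json.loads` would supply the real dict for `extract_texts`. Be gentle with request rates, as with any shared endpoint.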

 

 

See Also

Processing.Text
TextMunger
WritingMachines
Programming.Wordnik - API for getting words.

 

 

Tags

Text Source


 
