



I’m (finally) building a (C#) application to do a variety of processing on a variety of inputs.


Oh, yeah, there are plenty of those about, but I want one of my own, with the ability to set up a set of inputs, apply my own twiddling to the output, and some other things.


Instead of taking only from a defined source of text -- i.e., file or text-area, I want to be able to pull from a number of online resources.


What I really want is an algorithm that approximates the bizarre thought patterns that lead to an XRML page.


What I want is a hermetic encoder. (Not THIS, THIS.) cf, Jerry Cornelius’ Hermetic Garage.


And to build a naive, simplistic, even moderately-interesting output engine, I must have a better understanding of my own opaque, oxygen-starved processes.


This means not only processing algorithms but also sourcing algorithms, i.e., where does text come from, and how does it interrelate? The landscape, textscape, textriver ideas, cut ups, concrete poetry, misunderstood maths, art, lit, pop culture. How do I replicate the blenderize in my headscape?


I should stop worrying about “perfection” -- work towards a first approximation -- something that provides me with a jumping-off point for real editing, curation, modification, etc. Which is the major goal, isn’t it?



The source is hosted online at https://code.google.com/p/text-munger/.



no more line ears

I’m sick of linearity, word-follows-word, line-follows-line, left-to-right end-of-line DING zzzzzzzzzip back to the start and next-line-again until we get to the marginalized page and turn to next page which follows the previous page.


Where’s the up? Where’s the down? Where’s the round-the-world, “I don’t think we’re in flatland anymore, Toto!” Now, while XRML _looks_ like flatland, it wants to be the technicolor escape from the black-and-white of linear-land when you look at the history of edits. Which doesn’t exist in any visually accessible form. Each page/the-whole is part of a text river that changes -- it’s not a linear textile, it’s a stacked set of frames that the camera can pan through and focus on whichever plane makes the most sense, while the foreground and background panes add context. Thanks, Walt Disney.


THIS is the end-goal of Text Munger. To be a planar text-editor. That’s a long way down the yellow-brick-road. But I’ve got the ruby¹ slippers, and I’ve taken the first steps....



Hermetic Encoder

Notes on a|my process


Willard: They told me that you had gone totally insane, and that your methods were unsound.


Kurtz: Are my methods unsound?


Willard: I don’t see any method at all, sir.




How much of this is “original” ? See also, Word Salad.Appropriations Committee
This project will be processing (and (re)mixing) existing texts, perhaps mine, mostly others’.
When I worked “manually”, I usually had texts in front of me, or the hyper-jumped scramble of memory... tv, radio, advertising, newspapers, receipts, collages, etc.



Crazy Thoughts

Some NLP rules? Paragraphs, words, sentences, footnotes?
What about word-breaking?
U&LC analysis?
word-substitution? line-references to each other? shifts?
Should this stuff be automated?
Uh, it’s pie-in-the-sky. Of course it can be automated; is it worth it?


Break apart from a monolithic program, to having all transformers be standalone apps that take command-line parameters ?
This could allow for some interesting chaining in other ways..... but would also mean a lot of weird calling inside the app?
The Unix model.....
But it would allow programmatic-scripting of.... something.



Store the n-gram model externally, although not necessarily in a database. This would allow for (re)generation from a large corpus without lengthy re-processing, continuing at a given point (i.e., starting from an arbitrary seed within an extant output), etc.
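A rough sketch of what "external" storage could look like, in Python rather than the project's C#. The function names, the JSON file format, and the unit-separator trick for keys are all assumptions made for illustration, not anything Text Munger actually does.

```python
import json
import random
from collections import defaultdict

def build_model(tokens, n=2):
    """Map each n-token key to the list of tokens seen right after it."""
    model = defaultdict(list)
    for i in range(len(tokens) - n):
        model[tuple(tokens[i:i + n])].append(tokens[i + n])
    return dict(model)

def save_model(model, path, sep="\x1f"):
    # JSON object keys must be strings, so each key tuple is joined with a
    # separator that is assumed never to occur inside a token.
    with open(path, "w", encoding="utf-8") as f:
        json.dump({sep.join(key): nxt for key, nxt in model.items()}, f)

def load_model(path, sep="\x1f"):
    with open(path, encoding="utf-8") as f:
        return {tuple(key.split(sep)): nxt for key, nxt in json.load(f).items()}

def continue_from(model, seed, length, rng=random):
    """Generate up to `length` more tokens, starting from an arbitrary seed key."""
    out = list(seed)
    n = len(seed)
    for _ in range(length):
        followers = model.get(tuple(out[-n:]))
        if not followers:
            break  # dead end: this key was never seen in the corpus
        out.append(rng.choice(followers))
    return out
```

With the model on disk, the expensive pass over the corpus happens once; later runs only call `load_model` and `continue_from`, and the seed can be the tail of an existing output.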



Core code

The core is no longer a Markov engine, although that was what I once thought it would be.


So, core code is two parts
a) selection of texts

  • local library
  • online sources

b) application of processing rules

  • this includes formatting, etc.




Markov processor

Markov text generation is naive.


Naive in the sense that I thought the extant corpus of XRML was large enough to generate interesting output when used as the whole source.


Naive in that I thought with some tweaking I could get it to provide an interesting output.


And finally naive in that it has nothing to do with language at all - it’s a happy accident of statistics that it produces output that looks like language. There is no “understanding” of language at all (ignore that no application “understands” language).


And therein lies the [w|r]ub. If I want an algorithm to encode meaning there has to be some meaning behind it. A statistically-significant-series [of words] doesn’t mean much. And it only produces other series. To get something planar, with levels of reference, there has to be something beyond word-sequence. There has to be some analysis of the words, and working with that to related words/concepts. Which is the Big Bugbear of NLP, isn’t it? We’re getting close to AI territory. I don’t want to go there, but my goal is more complicated than a linear Markov series.


n-gram models are often criticized because they lack any explicit representation of long range dependency. (In fact, it was Chomsky’s critique of Markov models in the late 1950s that caused their virtual disappearance from natural language processing, along with statistical methods in general, until well into the 1980s.) This is because the only explicit dependency range is (n-1) tokens for an n-gram model, and since natural languages incorporate many cases of unbounded dependencies (such as wh-movement), this means that an n-gram model cannot in principle distinguish unbounded dependencies from noise (since long range correlations drop exponentially with distance for any Markov model). For this reason, n-gram models have not made much impact on linguistic theory, where part of the explicit goal is to model such dependencies.


Another criticism that has been made is that Markov models of language, including n-gram models, do not explicitly capture the performance/competence distinction discussed by Chomsky. This is because n-gram models are not designed to model linguistic knowledge as such, and make no claims to being (even potentially) complete models of linguistic knowledge; instead, they are used in practical applications.


TODO: the above notes will probably be moved to WordSalad.ChainsOfLove




I’m using the last project, as it was the easiest to download, and followed a pattern I was familiar with. Simple dictionary.
I’ve found a few bugs in it [off-by-one boundary for random selection that resulted in never selecting the LAST element], and am extending it to use different parsing rules to break apart on words or chars, or to treat whitespace as significant (since it trims it to non-existence in the original).
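One common way that kind of off-by-one creeps in, sketched in Python: the random call already treats its upper bound as exclusive, and the code subtracts one anyway. This is an assumed reconstruction of the bug's general shape, not the code from the downloaded project.

```python
import random

def pick_buggy(items, rng):
    # randrange's upper bound is already exclusive, so the extra -1
    # means the last element can never be chosen.
    return items[rng.randrange(len(items) - 1)]

def pick_fixed(items, rng):
    return items[rng.randrange(len(items))]

rng = random.Random(0)
items = ["a", "b", "c"]
seen_buggy = {pick_buggy(items, rng) for _ in range(1000)}
seen_fixed = {pick_fixed(items, rng) for _ in range(1000)}
```

After a thousand draws `seen_buggy` still has no "c" in it, while `seen_fixed` covers all three items.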


need to look at other implementations, as one of them uses a node-structure that superficially confuses me. It could just be terminology.


However, the big thing for me is parameterizing/semi-automating the source input, editing the output, and programmatically pro-grammatically editing the output -- ie, first word is capitalized, sentences, paragraphs, etc. [if the target is not XRML].


Which is all non-Markov stuff. That heavy lifting is over with. whatever.


Markov tokenizers

Breaking apart the sources (texts) into tokens is, for me at least, not a simple issue. Lots of tokenizers discard punctuation and whitespace. My own quirks mean that, unless I’m trying to disparage Python, I’m interested in whitespace and punctuation as semantic elements. However, I’m not positive whether they should be considered as a block -- ie, {",-;....@} or broken into pieces { {"}, {,}, {-}, {;}, {.}, {.}, {.}, {.}, {@} }.
Plus, sometimes I like processing on the character-level, instead of the word level. Sentence-level would seem to be a non-starter for generating a “new” text as repeating sentences are not common. In most texts. (Even Gertrude Stein. Right? Hrm....)
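Both punctuation options fit in one small tokenizer. The following is a minimal Python sketch (the project itself is C#); the function names and the `split_punct` flag are hypothetical. Nothing is discarded, so joining the tokens always gives back the source.

```python
import re

def tokenize_words(text, split_punct=False):
    """Split into words, whitespace runs and punctuation, discarding nothing."""
    if split_punct:
        pattern = r"\w+|\s+|[^\w\s]"   # every punctuation mark is its own token
    else:
        pattern = r"\w+|\s+|[^\w\s]+"  # a run of punctuation stays one block
    return re.findall(pattern, text)

def tokenize_chars(text):
    """Character-level tokens, whitespace included."""
    return list(text)
```

Note that `\w` counts digits and the underscore as word characters, which may or may not be what is wanted for receipts and similar sources.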


So, my Markov transformer takes a tokenizer as a parameter, along with some other rules.
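A sketch of the tokenizer-as-parameter idea in Python (not the project's C# code; `markov`, `key_length` and `max_tokens` are names invented here). Any callable that turns text into a list of tokens will do, and because the tokenizers keep their whitespace the output is simply concatenated.

```python
import random
import re
from collections import defaultdict

def words_and_spaces(text):
    """Word-level tokenizer that keeps whitespace runs as tokens."""
    return re.findall(r"\S+|\s+", text)

def markov(text, tokenizer, key_length=2, max_tokens=50, rng=None):
    rng = rng or random.Random()
    tokens = tokenizer(text)
    if len(tokens) <= key_length:
        return text  # too short to build any key
    followers = defaultdict(list)
    for i in range(len(tokens) - key_length):
        key = tuple(tokens[i:i + key_length])
        followers[key].append(tokens[i + key_length])
    out = list(tokens[:key_length])  # start from the opening of the source
    while len(out) < max_tokens:
        options = followers.get(tuple(out[-key_length:]))
        if not options:
            break  # reached a key that only occurs at the very end
        out.append(rng.choice(options))
    return "".join(out)
```

Calling `markov(source, list)` works on characters, and `markov(source, words_and_spaces)` on words, with no change to the generator itself.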


The source algorithm was modified, as it stored tokens in a string, using a space " " as token-delimiter. I’ve replaced the space with a non-printing control-character, and eliminated other uses of the space in the output.


I also want to revisit the storage model, as I’ve hand-built my own Markov-generators in the past using real data-structures. String concatenation seems slow, but it might be using a string-builder for all I remember at the moment...


I need to continue to look into how I’m building my tokenizers.


NOTE: Word-based rule-application relies upon breaking apart the source-text into words. This is currently discrete code, and puts things back together with spaces, with less-than-perfect results. The same tokenizers and combiners [?!?!] should be used.



matching brackets and other punctuation consideration

see Interference:2012/05/21/punctuation-art/
Better Punctuation Prediction with Dynamic Conditional Random Fields


random notes

Timeline → store a copy of the extant text with the rule that has been applied to it.
First step has an empty rule
This will allow for stepping through the process, and redefining it, deciding to go in another direction
this would also require that all transform rules have a common interface -- which means the Markov engine needs more tweaking to fit.
should be able to serialize all of this, so could be restored, re-processed?
Hrm. Once rule + text is serialized, should be trivial for the historical sequence.
This will not be a small file, though.
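The timeline notes above can be sketched as follows, in Python rather than C#. The `Timeline` class, the rule registry and the JSON layout are assumptions for illustration. Each step stores the rule name next to the text it produced, so stepping back and branching is just truncating the list.

```python
import json

# Hypothetical rule registry: every rule shares one interface, text in, text out.
RULES = {
    "": lambda s: s,              # the empty rule used by the first step
    "reverse": lambda s: s[::-1],
    "upper": str.upper,
}

class Timeline:
    def __init__(self, source):
        self.steps = [{"rule": "", "text": source}]

    @property
    def current(self):
        return self.steps[-1]["text"]

    def apply(self, rule_name):
        text = RULES[rule_name](self.current)
        self.steps.append({"rule": rule_name, "text": text})
        return text

    def rewind(self, index):
        """Drop everything after step `index` so the process can branch."""
        del self.steps[index + 1:]

    def to_json(self):
        return json.dumps(self.steps)

    @classmethod
    def from_json(cls, data):
        timeline = cls("")
        timeline.steps = json.loads(data)
        return timeline
```

Storing the full text at every step is what makes the file large; storing only rule names plus any random seeds would be smaller, at the cost of re-running the rules on restore.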




process at what level?
n chars, n words, n sentences, n paragraphs, pages, blocks, something else?
The current crop of Transformers/Rules are considered “All”, “Sentence” or “Word” -level granularity. Practically speaking, only “All” and “Word” -level rule application exists, so no real sentence-chunk rules exist. I’ve posited some other levels, but none have been implemented.
A rethink of the granularity implementation is required -- a given rule probably hits a RANGE of granularity -- ie, Pig-Latin works on the word-level only, but Reverse and Random-Caps can work on any level (they’re almost pointless on a char-level, though). Markov has no application on char-level, and can only work on word-level if a character-level tokenizer is used. And even then results are almost random with words with little to no repetition.


See Also: tokenizers, above.


GUI and editor

Select from a list of sources (online, or local cache, files, etc)
arrange “timeline” of transformations


I’ve begun work on the editor.
As one of the many rabbit-holes I’ve chased (am chasing) down on this project, I ended up having to create a winforms custom-control


The GUI is operable, if funny-looking.



the editor is a downstream project. Generation of text is the top priority, as that includes analyzing my own processes. the editor is a tool to assist. Which is nice. But not required. But devoutly to be wished.


visual editor of text at any given stage
Think about a matrix, instead of a stream of text -- edits should be in a grid, so that blocks can be picked up, moved, shifted, sliced, etc.


Grid? http://msdn.microsoft.com/en-us/library/system.windows.controls.grid.aspx




editors that allow vertical (block) selection -- and of course, Emacs.
sadly, #Develop doesn’t (yet?) support block-highlighting
call developer to implement
Dissecting a C# Application: Inside SharpDevelop - have a look at Chapter 11, “Writing the editor control” (I have the ebook inside of dropbox)



potential transform rules


letter-position shifter, with first and last letters intact

ruby quiz where this is referred to as a “Text Munger”.
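A minimal Python sketch of that letter shifter (the project is C#; the function names here are made up). Only the interior letters of each word are shuffled, and anything that is not a letter stays where it was.

```python
import random
import re

def munge_word(word, rng):
    if len(word) < 4:
        return word  # nothing to shuffle between the first and last letters
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def munge(text, rng=None):
    rng = rng or random.Random()
    # [^\W\d_]+ matches runs of letters only; spaces and punctuation stay put.
    return re.sub(r"[^\W\d_]+", lambda m: munge_word(m.group(), rng), text)
```

A shuffle can land back on the original order, so short words will often come out unchanged.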

various “mungers”
re-spacing - take existing word breaks (spaces, punctuation) and move them about.

“this is a text” := “thi sisat ext”
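One way to read the re-spacing rule, sketched in Python with hypothetical names: strip the spaces, then put breaks back at other positions. This version only moves spaces; treating punctuation as a movable break is left out.

```python
import random

def respace(text, breaks):
    """Remove existing spaces, then insert one before each index in `breaks`."""
    letters = text.replace(" ", "")
    pieces, start = [], 0
    for stop in sorted(breaks):
        pieces.append(letters[start:stop])
        start = stop
    pieces.append(letters[start:])
    return " ".join(pieces)

def respace_random(text, rng=None):
    """Keep the same number of spaces but move them to random positions."""
    rng = rng or random.Random()
    letters = text.replace(" ", "")
    count = min(text.count(" "), max(len(letters) - 1, 0))
    breaks = rng.sample(range(1, len(letters)), count) if count else []
    return respace(text, breaks)

print(respace("this is a text", [3, 8]))  # thi sisat ext
```

The example above has one fewer break than its source, so the number of breaks probably ought to be allowed to drift as well; `respace_random` keeps it fixed only for simplicity.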

random-loss of letters
translation of source-text? uh. I dunno. pseudo-translation, maybe, like the now-defunct Snoop Dogg translator
internet slang converter

The Snoop Dogg “Shizzolator” is long gone, but some samples live on:

Dialectalizer: http://rinkworks.com/dialect/works.shtml
12-year-old AOLer, whose author says “don’t copy this”, so, don’t


translation into another language (selected at random?) [French, German, Spanish, Italian, Latin]
random re-order
splice -- ie, split in n pieces and re-arrange.

Effects will vary with granularity

Automatic Translation of English Text to Phonetics by Means of Letter-to-Sound Rules - 1976 paper

Bayesian replacement - I had thought about this vaguely, but it looks as though Adam Parrish thought about this concretely (python source code).
purposefully bad implementation of a bowdlerizer - an instant clbuttic!!!


TODO: Translator sub-interface of the ITransformation (which are really rule-based pseudo-translators)
Translators retain the Source, Munged methods, but add Translate and Reverse
on the assumption that the translation is somewhat bi-jective
In the case of pig-latin, that may not be strictly true, which could be interesting.
[ie. the two English words “wall” and “all” translate to the single pig-latin “allway”, which means that it could have two possible reverse transformations, and only contextual analysis could tell which one, and that’s a huge scope-creep.]
Nevertheless, most of the pseudo-translators have rules that can be easily reversed.
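The pig-latin ambiguity is easy to show if `Reverse` is allowed to return every candidate instead of one answer. A toy Python sketch follows, assuming lowercase ASCII words and one common variant of the pig-latin rules; the class is hypothetical and does not reproduce the ITransformation interface.

```python
VOWELS = "aeiou"

class PigLatin:
    """Toy translator for non-empty lowercase ASCII words."""

    def translate(self, word):
        if word[0] in VOWELS:
            return word + "way"
        for i, ch in enumerate(word):
            if ch in VOWELS:
                # move the leading consonant cluster to the end
                return word[i:] + word[:i] + "ay"
        return word + "ay"  # no vowels at all

    def reverse(self, word):
        """Return every source word that could have produced `word`."""
        candidates = set()
        if word.endswith("way") and len(word) > 3:
            candidates.add(word[:-3])
        if word.endswith("ay") and len(word) > 2:
            stem = word[:-2]
            candidates.add(stem)
            for k in range(1, len(stem)):
                candidates.add(stem[-k:] + stem[:-k])
        # keep only the candidates that really translate back to `word`
        return {c for c in candidates if self.translate(c) == word}
```

Without a dictionary the candidate set also contains non-words ("lwal", "llwa"), so a word list would be the first filter to try before any contextual analysis.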



archive.org’s Compact Rhyming Dictionary looks to be a dodgy OCR
archive.org’s Walkers rhyming Dictionary also has dodgy OCR

TODO: find a better file



Density of text -- if input is all char-heavy, allow density filter to intrude by adding in chunks of periods?
but how probable are they?
If source text has NO PERIODS then adding 2000 pages of periods won’t help, as those periods will only fire on a probability rule for the last letter before they start
need some rule for injecting these at OTHER points, beyond the existing Markov rules
since I’m not strictly following a markovian generation -- I’m doing custom building. new rules.


set via percentage 0..99
100 is no added punctuation -- all words run together with no spaces (retain source punctuation?)
0 would be all punct, no source, so not available... unless we want to generate a blank slate? hrm....
Not sure how the intermediate would be. Not numerical -- at the upper bound (somewhere in the 1..10 range) is an XRML page that has but one word on it.
Fill with default punct mark -- the period
small chance of other characters intruding at random (mostly punct, some alpha -- weighted list? I prefer “x”, for some reason)
chance of words splitting with a few chars in-between the splits
chance increases as density... decreases?
but number of spaces can’t be too large -- we don’t want them on separate pages (i.e., 40+ lines between each part)


What is the density a measure of -- source-text density, or punctuation?
It should be source-text. Raw punct/blank slate [ie, all periods] should be 0 density.
So, need to edit the above to indicate that.
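Taking density as source-text density, as decided just above, a first approximation might look like this Python sketch. The function name, the per-word padding formula and the `page_size` default of 1840 (borrowed from the 0..1840 figure elsewhere in these notes) are assumptions.

```python
def apply_density(words, density, fill=".", page_size=1840):
    """density = percentage of the output that is source text (0..100)."""
    if not 0 <= density <= 100:
        raise ValueError("density must be in 0..100")
    if density == 0:
        return fill * page_size  # blank slate: all punctuation, no source
    out = []
    for word in words:
        # pad each word so that source chars are about `density` percent
        padding = round(len(word) * (100 - density) / density)
        out.append(word + fill * padding)
    return "".join(out)
```

At 100 the words run together with nothing added; at 25 a four-letter word is followed by twelve periods.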


Slightly randomized density around the indicated amount is implemented, although the algorithm could use some tweaking.
And, again, once it’s in place, I see that it still is “not enough.”
I think the density should “wander” -- that is, not remain fixed, but go up and down, sometimes discretely, sometimes discontinuously. Weighted randomness based on history?
I’m getting a lot of algorithmic workouts, here.
Which is half the point in doing a project. But I need to remember that the product [oh! how crude!] is the goal, not the process, right? The text, not the application. But as a programmer, it is also a goal....


Elaborate over time, including syllable-breakage
Eventually, would like some splits to be vertical, not just linear.
That is waaaaay down the road.


Granularity needs to be less static than randomization around a fixed point. We need random-walk points, with randomization around those.


i.e., 5,5,4,7,3,5,15,17,14,15,25,24,12,13,11,12


algorithm: random walker with weighted jumps, some discontinuity, but not 0..1840 generally. Some weighted, random amount for the number of points around that walk-stop.
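A sketch of that walker in Python, with invented parameter names. A triangular distribution stands in for "weighted jumps" (small steps likelier than large ones), and each walk-stop emits a random number of jittered points.

```python
import random

def wandering_values(stops, low=0, high=100, max_jump=15, jitter=2,
                     max_points=6, rng=None):
    rng = rng or random.Random()
    center = rng.randint(low, high)
    values = []
    for _ in range(stops):
        # a random number of points scattered around the current walk-stop
        for _ in range(rng.randint(1, max_points)):
            point = center + rng.randint(-jitter, jitter)
            values.append(min(high, max(low, point)))
        # weighted jump: the triangular peak at 0 favors small steps
        step = round(rng.triangular(-max_jump, max_jump, 0))
        center = min(high, max(low, center + step))
    return values
```

The occasional real discontinuity is not modeled here; one option would be a small chance per stop of re-drawing the center from the whole range.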


TODO: some sort of d--n terminology
TODO: clean up the above mess




disemconsonant (cf, disemvowel, but, well, it should be obvious)
white-space to punctuation
homophonic replacement: http://www.peak.org/~jeremy/dictionaryclassic/chapters/homophones.php
density - to a second approximation. have a sliding scale of 0..100% := 0..1840 puncts, but no randomness yet.
technically, Markov belongs in here, as it is just one of several rules.
leet-speak replacement
replace letters/vowels with punctuation or other mark -- “x” or “-”
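Several of the rules in this list are one-liners. A Python sketch follows; the function names and the particular leet table are arbitrary choices for illustration.

```python
VOWELS = set("aeiouAEIOU")

def disemvowel(text):
    return "".join(ch for ch in text if ch not in VOWELS)

def disemconsonant(text):
    # drop letters that are not vowels; keep digits, spaces and punctuation
    return "".join(ch for ch in text if not ch.isalpha() or ch in VOWELS)

# one arbitrary leet table out of many possible ones
LEET = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"})

def leet(text):
    return text.lower().translate(LEET)

def mask_vowels(text, mark="x"):
    return "".join(mark if ch in VOWELS else ch for ch in text)
```

For example, `disemconsonant("Text Munger!")` leaves only `e ue!`.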


grid-based transformations

rotate 90|180|270 degrees (is anything else practicable?)
shift n chars -- ie, end of line flows into start of next line, end of block flows into start of block
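Both grid transformations fall out of treating the text as a padded matrix of characters. A Python sketch with hypothetical helper names:

```python
def to_grid(text):
    """Pad every line to the same width and split it into cells."""
    lines = text.split("\n")
    width = max(len(line) for line in lines)
    return [list(line.ljust(width)) for line in lines]

def from_grid(grid):
    return "\n".join("".join(row) for row in grid)

def rotate90(grid):
    """Rotate clockwise: the first column, read bottom-up, becomes the first row."""
    return [list(row) for row in zip(*grid[::-1])]

def shift(grid, n):
    """Move every character n cells forward, wrapping line ends and the block end."""
    flat = [ch for row in grid for ch in row]
    if not flat:
        return grid
    width = len(grid[0])
    n %= len(flat)
    if n:
        flat = flat[-n:] + flat[:-n]
    return [flat[i:i + width] for i in range(0, len(flat), width)]
```

Rotating by 180 or 270 degrees is `rotate90` applied two or three times. Anything other than a multiple of 90 would need resampling characters onto the grid, which is probably why nothing else is practicable.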



Some think that interactivity is a highlight of contemporary e-poetry/e-text generation.


I’m not sure where I stand on that, but more interactivity would be nice.
See jGnoetry for an example.


not sure how the method would be used, but the current implementation would not support it, at this point.
Web rendering is great for allowing text to be an “object” with styling, links, etc.
Text controls in C#, not so much.


So, what about embedding a web-renderer inside of a Winform?




Dissenting Voice

co-worker Jon Langdon suggested just using a RichTextBox control. Click point can be found, word-beginning and end determined (if not selected), and background-color modified to show that it has been selected.
Probably much easier to integrate than re-tooling everything for HTML output



Getting Source material

I’m building source inputs -- could be called TextGetters, for want of a better name
Working on WebGetters, to grab from Xrays Mona Lisa, Gutenberg, and eventually Textfiles.com
See: Text Shopping Word Salad/Generators Internet Meme Text Lorem Ipsum Word Salad/Spam for more ideas


I found that processing 60 pages of XRML provides a strikingly boring output. Due to a marked lack of self-similarity, the rendered output doesn’t vary all that much. Dropping the key-length amount is one option. Adding in alternate sources is another.
However, processing 60 XRML pages and an entire novel now skews towards the novel, and I want output to “look” like XRML.
So, need to figure out some methods of modifying non-XRML source to be more similar.


Invisible Literature

see Word Salad.Invisible Literature


Scraping Project Gutenberg

I found that looking @ http://www.gutenberg.org/browse/recent/last1 was an interesting source
And from the generic links on that page

I could build a direct link to the plaintext by appending “.txt.utf8”


Now, there’s still some boilerplate that, for my purposes, would be good to eliminate


Auto Converting Project Gutenberg Text to TEI offers some code (in python) that was used to remove boilerplate and do some reformatting
referenced in the above link and code is easier to read (due to formatting)
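A rough Python sketch of boilerplate removal, unrelated to the TEI conversion code mentioned above. It assumes the file carries marker lines of the form `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF ...`. Older etexts word their headers differently and would pass through untouched.

```python
import re

# Marker wording is an assumption; "THE" and "THIS" variants are both accepted.
START = re.compile(r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.M)
END = re.compile(r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.M)

def strip_boilerplate(text):
    """Return only the text between the START and END marker lines."""
    start = START.search(text)
    end = END.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end and end.start() >= begin else len(text)
    return text[begin:finish].strip()
```

The discarded header is itself a candidate source, given the interest in the headings of the Gutenberg texts, so a real version might return both parts.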



Scraping wikipedia


extract markup from Wikipedia
Parse to natural language





Scraping pmwiki (this site)

Specifically, I would like to process [Xrays Mona Lisa]


So, the following may be required:




http://www.pmwiki.org/wiki/Cookbook/TextExtract - was pointed to as a suggestion, but it turns out it can’t really strip the markup. Will have to continue to look into this.
Possibly, render to HTML and THEN strip all tags?
Uh, since I’m writing an external application, it would probably be retrieving the HTML, anyway. so there.


UPDATE: my first iteration is just scraping the HTML, using XPath to get what I need (for both page-list links, and content)
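The XPath approach, sketched in Python with the standard library's limited XPath subset. The sample page, the `wikitext` div id and the link paths are made up for illustration, and this parser only accepts well-formed markup, so real pages would need a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a rendered wiki page.
PAGE = """<html><body>
<div id="sidebar"><a href="/Ignore">skip me</a></div>
<div id="wikitext">
  <p>First <b>bold</b> paragraph.</p>
  <ul><li><a href="/XraysMonaLisa/PageOne">one</a></li>
      <li><a href="/XraysMonaLisa/PageTwo">two</a></li></ul>
</div>
</body></html>"""

def content_div(html, div_id="wikitext"):
    root = ET.fromstring(html)
    return root.find(f".//div[@id='{div_id}']")

def page_links(html):
    """hrefs of every link inside the content div (the page list)."""
    return [a.get("href") for a in content_div(html).findall(".//a")]

def page_text(html):
    # itertext() walks all nested text nodes, which drops the tags
    return " ".join("".join(content_div(html).itertext()).split())
```

The same two queries cover both needs: one pulls the page-list links, the other flattens a page's content to plain text.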


Looks like there are some C# solutions:



TODO: At some point, bundle up some source-texts and provide them as a download on the google-code page.
Some texts that might be fun to provide as defaults:
1811 Dictionary of the Vulgar Tongue by Francis Grose
Lectures on Landscape by John Ruskin
Canterbury Tales
King James Bible
Futurist Manifesto - but it’s in Italian, need a translation?

and particularly Technical Manifesto of Futurist Painting for the x-ray quote

Art, by Clive Bell -- why this one?
A History of Art for Beginners and Students by Clara Erskine Clement Waters
Manual of Egyptian Archaeology and Guide to the Study of Antiquities in Egypt - may be more suited to a different project
Books on Arthurian romances (sorted by popularity)

Books on Classical literature (sorted by popularity)
Oz books by L. Frank Baum
books on Philosophy
Tractatus Logico-Philosophicus by Ludwig Wittgenstein - unfortunately, only in PDF or TEX. Which makes sense for all of the logical equations.
some sort of encyclopedia ???
western bookshelf - which is surprisingly sparse, compared to actually searching for Westerns


Webapps: How do I download all English-language books from Project Gutenberg?


lyrics - scrape links from this page, for an online sourcer?

for all online sourcers, I suggest caching content locally
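Local caching can sit in front of any sourcer. In this Python sketch the class name, the hashed file naming and the injected `fetch` callable are all assumptions. The fetcher is passed in so that the cache does not care where the text comes from.

```python
import hashlib
from pathlib import Path

class CachedSource:
    def __init__(self, fetch, cache_dir):
        self.fetch = fetch  # any callable: url -> text
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, url):
        # hash the URL so any address maps to a safe file name
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / (name + ".txt")

    def get(self, url):
        path = self._path(url)
        if path.exists():
            return path.read_text(encoding="utf-8")
        text = self.fetch(url)
        path.write_text(text, encoding="utf-8")
        return text
```

There is no expiry here, which suits static texts like Gutenberg files; a frequently edited wiki page would want a refresh option.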





On further thought, these are all pretty vanilla linear texts [not necessarily narrative]. But I’m interested in some other forms, or changing into other forms, of the invisible literature -- the headings of the Gutenberg texts [question -- what variations are there? which etext has the longest?], receipts, recipes, code, what else should be thrown into the mix?


Beyond Cyberpunk Manifestos (TOC)
Random POMO stuff (like the Panic Encyclopedia)


Cryptome.org - I have a note from 2007 saying to use this as source material. Sheesh, I need to catch up on my TODO lists....


Programming/Tech articles, such as Douglas Crockford on JavaScript


emoticon lists/explanations?
list of file-formats
white-pages listings (businesses, not personal)
bank statements
list of numbers
receipts [one of the early inspirations for the hyper-dense cash-register-tape manual-typewriter typing I did in 92-95]


THE CONTENTS OF MY SPAM FOLDER. How could I have ignored this trove for so long?


What hermetic algorithm do I need to encode the narratives to those formats?




Other than the jumble above, and a roadmap on the google-code-wiki, TM has no documentation.
Other high-powered projects I’ve looked at have scattershot documentation, or are only available in PDF form.
And then, some projects have TONS of documentation:


http://www.cutnmix.com/esoteric/finnegans_wake.html - this is a great promo page for the tool, that basically “rebrands” the idea of character n-grams.


The “Shannonizer” directly links its pedigree to Claude Shannon’s information theory, and boasts a suite of “editors” that, as far as I can tell, are pre-seeded Markov banks. But the coder “sells” it so well....


The Gibberizer has a ton of documentation. I haven’t actually run the thing yet, so I can’t say what the doc:performance ratio is.




similar projects to investigate / Prior Art

SCIgen - An Automatic CS Paper Generator



I forgot about this -- and the code is available!


Dada Engine

Dada Engine web interface


Waffle Generator




rmutt http://www.schneertz.com/rmutt/


Infinite Monkeys

InfiniteMonkeys - is an open source random poetry generator written in FreeBASIC. It is largely considered the Industry Standard in SPAM generation. No variables, no loops.


See also : Wikipedia:Snowclone for some interesting script ideas


GTR Language Workbench

wait. is that what I’m trying to build, here?!??!
It’s a huge program, because it’s built around Eclipse. yipes.


rwetext - reading and writing electronic text

Notes from a course; code in Python.



review of JN @ GnoetryDaily




Web-Based (javascript) Computer Poetry Generation Programs

See software by EddeAddad




the Gibberizer is on the JRE
Has some good documentation, compared to other projects.



various tools

I’ve just gotten in-touch with some of the gnoetry-daily guys.



Other broad avenues of research


poetry generator applet (java)


what??? http://www.nictoglobe.com/new/notities/text.list.html


How To Build a Naive Bayes Classifier - with some discussion of “stop words”, a link to one catalog, and ideas for pulling out of Gutenberg texts.




pseudo-translation (in Perl) - like the Shizzolator.
random characters
Chris Pound’s language machines
Non-Linear Fully Two-Dimensional Writing System Design - some interesting ideas, well articulated. doesn’t like the grid. the small example shown is still more linear than I prefer. update: on reading a follow-up, seems like this is more of designing a constructed-language writing system. You know, writing, as concept. not writing writing. But, some interesting articulations on non-linearity in there....


Interactive Poetry Generation Systems - an illustrated overview - several projects covered above. As the title indicates, a nicely illustrated overview. Good looking GUI ideas....


auto-imported generated wiki pages



Hrm. That could be.... interesting....


Pushing output of the app back into the wiki (my website, and XRML home)


sounds more like a standalone app that the output is pushed into, though....




See Also

Word Salad.Chains of Love
Word Salad.Automatic for the People - in particular, the notes on Philip Parker
Word Salad.Text Shopping - perhaps some sources (which is vaguely the idea behind the creation of this page in the first place)
Xrays Mona Lisa.Archaeological Notes - some small thoughts on origins. Needs more earthquake.
XraysMonaLisa.ElectronicWriting - more notes on influences.
- not a lot of there there, but a start.
Word Salad.Appropriations Committee - Originality, what is that?
Word Salad.Electro Text
Word Salad.Electro Poetics
Interference Pattern posts tagged textmunger



¹ silver in the book


