re: poofing and tarking

craig said:
you poofing friggleschnitz!
mommy, mommy, craig is starting a flamewar! he called me a poofing friggleschnitz, mommy! i demand that he be banned! ;+)

***

david said:
One and the same, yes.
great! :+) and you did a little bit of work on... what was it?... plucker, right? ;+) if you are prepared to tackle the i.m.d.b., 15,000 e-texts should be a piece of cake...
...much like bringing 5 friends into a video store and trying to agree on one movie for everyone to watch. Not going to happen ;)
i wish things were that inconsequential. but the project gutenberg e-texts make up the most important e-library, historically. although even now it's starting to be dwarfed by other efforts, it would be a proper tribute to michael if it were to be well-maintained...
That being said, I'd be interested in seeing a list of the tools people know of, are working on, or have worked with in the past that can take a 7-bit ascii PG text and convert it into other formats.
"convert" is a rather loose and unspecific word, wouldn't you say? :+) nonetheless, i'll cut to the chase... the main problem with the e-texts is their formatting is _so_ inconsistent. so before you can do anything useful with them, you must write routines that can resolve their inconsistency. the inconsistency is very maddening, because it's so pointless. although some is understandable, considering how many hands created the e-texts, the sadder truth is that much of it could have been prevented; however, mr. newby and company simply fail to grasp the negative consequences of the inconsistency, and thus never made it their priority to minimize it. the good news is you _can_ write routines that will fix the problem. it is _not_ impossible, just thorny; the biggest expenditure of time is a quality-control check to make sure that you knew every inconsistency. their variety will amaze and astound. subsequent conversion to any format is straightforward once you have done the job of resolving the inconsistency. you don't even have to do that job, if you don't want to, you can just go to david moynihan at blackmask and get his files, as he has edited out almost all the inconsistency, which is what then allowed him to make a half-dozen versions of most e-texts in the entire library. if you're looking for explicit info, ron burkey did a converter called "gutenmark", and his website at http://www.sandroid.org/gutenmark does a good job of documenting the inconsistency he faced on the way, before he gave up the effort, saying:
the more perfect my automated conversions became, the farther (in my own mind) I seemed to be from having a perfect conversion.
i think that's a nice way of saying that the more he learned about the e-texts, the more he found out how bad they are, from the standpoint of consistency...

there is also some basic information at: palmdigitalmedia.com/dropbook/converting

but i'd guess that at this point in time, moynihan will have the most expertise about the problems you would be facing. much of it might be inside his noggin, but i do know he has a _lot_ of macros that undoubtedly embed gobs of wisdom. and, more to the point, david has shown, incontrovertibly, that mass conversions to a plethora of formats are fully possible. recently, david even _offered_ his files to project gutenberg, but -- as far as i know -- his gift was spurned, for some bizarre reason i'll never be able to grasp.

oh yeah, i've written some routines that squash out most of the inconsistency, and there's a way you could pry 'em out of me -- namely, if you got support for my z.m.l. (zen markup language) built into plucker. it's a simple rule-set; you could probably have it up-and-running in a couple days... backchannel me if you're interested. :+)

once you've vanquished the inconsistency, there are other concerns, which might or might not be a problem for you, including:

1. errors in the e-texts, lots of them.
2. styling lost or converted to all-caps.
3. information about images discarded.
4. image filenames are often not unique.
5. accents lost in many foreign e-texts.
6. a confusing redundancy of some books.
7. attacks levied if you reveal problems.

oh yeah, also make sure that you are always working with the freshest e-texts available, as i'm not sure if they make an announcement whenever they make corrections to an e-text; they just quietly substitute in the new file...

***

i would welcome you here, but i am on my way out the door _very_ soon... :+) there are a handful of tarking naugshlocks here _so_ unworthy of my help they made me decide to decline to do any work for project gutenberg, in spite of its great historical importance and my highest regard for the genius of michael hart. i'm sure others, like you, will cover my absence, while i will be happy grazing greener pastures...

at any rate, have a nice day... ;+)

-bowerbird
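For reference, a minimal sketch (in python) of the kind of normalization routine bowerbird describes above. The specific fixes it applies (line-ending cleanup, header/footer stripping, paragraph rewrapping) are illustrative assumptions, not his actual rules:

    import re

    def normalize_pg_text(raw):
        """One possible normalization pass over a plain-ascii e-text (sketch)."""
        # unify line endings first.
        text = raw.replace("\r\n", "\n").replace("\r", "\n")
        # strip the legal header/footer if the usual markers appear.
        # (the marker wording varies between e-texts, which is itself
        # one of the inconsistencies complained about above.)
        start = re.search(r"\*\*\* ?START OF.*", text, re.IGNORECASE)
        if start:
            text = text[start.end():]
        end = re.search(r"\*\*\* ?END OF.*", text, re.IGNORECASE)
        if end:
            text = text[:end.start()]
        # re-join hard-wrapped lines so each paragraph is one line,
        # keeping blank lines as the paragraph breaks.
        paras = [" ".join(p.split()) for p in re.split(r"\n\s*\n", text)]
        return "\n\n".join(p for p in paras if p)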

and you did a little bit of work on... what was it?... plucker, right? ;+)
Quite a bit more than "a little", but yes, that's me.
if you are prepared to tackle the i.m.d.b., 15,000 e-texts should be a piece of cake...
Yep, once the structures are laid out for the classifications of works covered by Gutenberg. Much of this has already been done by hundreds of contributors over the years.
although even now it's starting to be dwarfed by other efforts, it would be a proper tribute to michael if it were to be well-maintained...
What other efforts are you alluding to? Why not help the people who insist on reinventing a fleet of new wheels to collaborate with existing projects that have similar or identical goals?
"convert" is a rather loose and unspecific word, wouldn't you say? :+)
Yes, and specifically chosen for that reason. Gutenberg etexts are nonspecific, and "converting" them means taking a slightly different approach depending on what I'm converting: poems, plays, books, and so on, for each work. You can't use a single rigid approach for all works.
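As a rough illustration of that type-sensitive dispatch (the verse heuristic and its threshold here are invented for the example, not Plucker's actual logic):

    import re

    def looks_like_verse(text):
        """crude heuristic: verse tends toward short, deliberately broken lines."""
        lines = [ln for ln in text.splitlines() if ln.strip()]
        if not lines:
            return False
        return sum(len(ln) for ln in lines) / len(lines) < 40  # invented threshold

    def convert(text):
        if looks_like_verse(text):
            # line breaks are part of the work, so keep every one of them.
            return "<br/>\n".join(text.splitlines())
        # prose: the paragraph, not the hard-wrapped line, is the unit.
        paras = re.split(r"\n\s*\n", text)
        return "\n\n".join("<p>" + " ".join(p.split()) + "</p>"
                           for p in paras if p.strip())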
the main problem with the e-texts is their formatting is _so_ inconsistent. so before you can do anything useful with them, you must write routines that can resolve their inconsistency.
And this is exactly what the Distributed Proofreaders project proposes to solve, and they've been pretty successful thus far, IIRC.
the inconsistency is very maddening, because it's so pointless. although some of it is understandable, considering how many hands created the e-texts, the sadder truth is that much of it could have been prevented; however, mr. newby and company simply fail to grasp the negative consequences of the inconsistency, and thus never made it their priority to minimize it.
I've had a lot of luck stepping out of the box and analyzing the text based on the "style" of the text, versus the actual content itself. I was approached by someone who is doing a paper and his PhD thesis on exactly this kind of approach. Basically (with my expertise and help), he's taking the bulk of Gutenberg, importing every word from every work into a database, and then running his own algorithms across the entire collection to pull out the styles of known authors.

For example, with his approach, you can determine that a work claiming to be by "A. Einstein" is by the same author as one claiming to be by "Albert Einstein" (S. Clemens -> Mark Twain -> Samuel Clemens, etc.). From there, you can then begin correcting the inaccuracies in the titling, authoring, and inflection of the work itself, including basic things like sentence structure, spelling, and so on.

I've extended the schema quite a bit to allow some other interesting queries to be run ("Show me all works larger than 100 pages, written by male authors between the years 1951 and 1957"). With that done, it is a (relatively) simple matter to convert the 7-bit ascii text to something more manageable, such as structured XML plus an associated DTD to turn that into something else.
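To make that concrete, here is a sketch of the sort of schema and query described, using sqlite; every table and column name is invented for illustration and not taken from the actual thesis work:

    import sqlite3

    conn = sqlite3.connect("gutenberg.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS authors (
            id     INTEGER PRIMARY KEY,
            name   TEXT,   -- canonical form, e.g. 'Samuel Clemens'
            gender TEXT    -- part of the extended metadata
        );
        CREATE TABLE IF NOT EXISTS aliases (
            alias     TEXT,  -- e.g. 'Mark Twain', 'S. Clemens'
            author_id INTEGER REFERENCES authors(id)
        );
        CREATE TABLE IF NOT EXISTS works (
            id        INTEGER PRIMARY KEY,
            title     TEXT,
            author_id INTEGER REFERENCES authors(id),
            pages     INTEGER,
            year      INTEGER
        );
    """)

    # the kind of query the extended schema allows:
    rows = conn.execute("""
        SELECT w.title, a.name
          FROM works w JOIN authors a ON a.id = w.author_id
         WHERE w.pages > 100
           AND a.gender = 'male'
           AND w.year BETWEEN 1951 AND 1957
    """).fetchall()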
the good news is you _can_ write routines that will fix the problem. it is _not_ impossible, just thorny; the biggest expenditure of time is a quality-control check to make sure that you've caught every inconsistency. their variety will amaze and astound.
And I assume you've done this? And your routines are made public somewhere, so others can improve and correct them? I don't recall seeing a URL to download your code or routines. Can you reply back with that, so we can take a look?
you don't even have to do that job if you don't want to; you can just go to david moynihan at blackmask and get his files, as he has edited out almost all the inconsistency, which is what then allowed him to make a half-dozen versions of most e-texts in the entire library.
And where is his code? Where are his "routines"? I don't see them on his site at all. I'll send him an email later this week to see if he wants to contribute those back. All of the talk about how "easy" this is, is completely irrelevant if nobody actually contributes that knowledge back so others can improve on and benefit from it. If you're not willing to do this, then our conversation stops here. There is no point in continuing the discussion if you intend to retain "control" of this kind of logic within your own circle of projects.
if you're looking for explicit info, ron burkey did a converter called "gutenmark", and his website at http://www.sandroid.org/gutenmark does a good job of documenting the inconsistency he faced on the way, before he gave up the effort, saying:
I've talked to Ron before via email, and described some of my needs for improvements to his tool. He's no longer maintaining it, so it is up to me (if I choose) to update his code and improve it further.
recently, david even _offered_ his files to project gutenberg, but -- as far as i know -- his gift was spurned, for some bizarre reason i'll never be able to grasp.
What was that "bizarre reason"? Is he still on this list? Did anyone else obtain his code? Does it exist out there for download?
oh yeah, i've written some routines that squash out most of the inconsistency, and there's a way you could pry 'em out of me -- namely, if you got support for my z.m.l. (zen markup language) built into plucker. it's a simple rule-set; you could probably have it up-and-running in a couple days... backchannel me if you're interested. :+)
Not interested. Our code is freely available. If you want someone to support "your" format, then you'll probably have to take that first step by justifying and documenting it. The only page I could find describing the format is here: http://czt.sourceforge.net/zml/ and I assume that's not your project or code. If it is anything other than HTML, it would require significant re-engineering of the core parser components used in Plucker, and a lot of testing to make sure it didn't break anything in the existing parser in the process. In other words, not a couple of days of effort, as you suggest.
once you've vanquished the inconsistency, there are other concerns, which might or might not be a problem for you, including:

1. errors in the e-texts, lots of them.
What kind of errors? Incorrect hyphens? Broken paragraphs? Missing end quotes? (this is common)
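For what it's worth, the missing-end-quote case is easy to screen for mechanically. A sketch, with the caveat that speech quoted across several paragraphs legitimately leaves quotes open, so this flags suspects rather than proving errors:

    import re

    def paragraphs_with_odd_quotes(text):
        """return indexes of paragraphs whose double quotes don't pair up."""
        suspects = []
        for i, para in enumerate(re.split(r"\n\s*\n", text)):
            if para.count('"') % 2 == 1:
                suspects.append(i)
        # multi-paragraph quotations re-open the quote at each paragraph,
        # so these need a human eye, not an automatic fix.
        return suspects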
2. styling lost or converted to all-caps.
Impossible to regain unless you have the original work in hand to see whether actual CAPS were used or not. Maybe the "errors" were intentional. Many authors use poetic license to express their thoughts, and sometimes those things break the rules of grammar and spelling.
3. information about images discarded.
Same, see above.
4. image filenames are often not unique.
How do you mean? You mean 1.jpg 1.jpg 1.jpg appearing in three places, but intended to represent 3 _different_ images? Where do you see this inconsistency? Give me an example of a Gutenberg work that shows this. I'd like to verify it for myself.
5. accents lost in many foreign e-texts.
Seems to be a problem with the auditor/editor's charset, or with support for those charsets in their editor. I agree that the original nature and charset of the document should be retained. How do you express a Cyrillic text in 7-bit ascii? You can't.
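A tiny illustration of that loss, assuming the flattening worked by decomposing accented characters and then dropping everything outside 7-bit ascii:

    import unicodedata

    def flatten_to_ascii(s):
        # decompose accented characters, then drop anything non-ascii.
        return (unicodedata.normalize("NFD", s)
                .encode("ascii", "ignore")
                .decode("ascii"))

    print(flatten_to_ascii("déjà vu"))  # -> 'deja vu' (the accents silently vanish)
    print(flatten_to_ascii("Пушкин"))   # -> ''        (cyrillic has no ascii fallback)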
6. a confusing redundancy of some books.
Such as?
7. attacks levied if you reveal problems.
Are you revealing the "problems" in a condescending way, or in a constructive way? The way you approach the "Hey, this is broke" process is very telling as to how you will be received and responded to.
oh yeah, also make sure that you are always working with the freshest e-texts available, as i'm not sure if they make an announcement whenever they make corrections to an e-text; they just quietly substitute in the new file...
...which is exactly why you should have your own mirror of Gutenberg, or a subset of it, as you work on the pieces.
there are a handful of tarking naugshlocks here _so_ unworthy of my help they made me decide to decline to do any work for project gutenberg, in spite of its great historical importance and my highest regard for the genius of michael hart. i'm sure others, like you, will cover my absence, while i will be happy grazing greener pastures...
If you are "moving on", then it behooves you to try to contribute what you've learned (in terms of knowledge, code, or "routines") back to those who will continue to contribute and learn. We're only here to help the next generation learn and improve. If we're not leaving anything here by which others can remember us and grow themselves; if we're not teaching others as we learn ourselves, then what is the point? David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com