re: Re: [gutvol-d] re: poofing and tarking

david said:
Quite a bit more than "a little", but yes, that's me.
that's what the winkey-smile was all about... :+)
What other efforts are you alluding to? Why not help those people who insist on reinventing a fleet of new wheels, to collaborate with existing projects that have similar/same goals?
well, mainly because amazon isn't looking for any "help". and my guess is that google will be equally isolationist. moreover, the publishers they are both coddling do not share the "let's share" mentality that drove michael hart.

stanford has talked about a huge digitization effort, but i'm not sure how far along they are, or if they've even started. but it is safe to say they've got enough money that they won't be on the lookout for any volunteers to assist them.

and i'm not sure what's up with the million-book project, or even who is in charge of that effort, but their objective seems slightly different than the one guiding things here; they seem to make scan-books, not cleaned-up e-texts. (they _are_ doing o.c.r., but they're not proofing the stuff.)

and the specific area that's of _most_ interest to me is the one comprised of the d.i.y. authors using cyberspace to connect directly to their audience, sidestepping the clutches of the middlemen who were necessary before. this is _new_ content, so i think it'll eventually eclipse the public-domain that is the thrust of project gutenberg...
And this is exactly what the Distributed Proofreaders project proposes to solve, and they've been pretty successful thus far, IIRC.
um... distributed proofreaders _might_ solve the problem of inconsistent formatting _sometime_ down the line, if their policies settle into place. but if you've looked at their output throughout their history, you'll know they have not done it yet. but, honestly, that shouldn't even be their job in the first place. project gutenberg should have had solid formatting rules down long before distributed proofreaders even came into being...
I've had a lot of luck stepping out of the box, and analyzing the text based on the "style" of the text, versus the actual content itself. I was approached by someone who is doing a paper and his PhD thesis on this exact kind of approach. Basically (with my expertise and help) he's taking the bulk of Gutenberg, importing every word from every work into a database, and then running his own algorithms across the entire collection, to pull out the styles by known authors.
well, that sounds interesting. :+) but "authorial style" is fairly irrelevant to the matter of formatting the text in the way that a typographer does it, or an e-book requires, which is actually the task at hand...
And I assume you've done this?
yep.
And your routines are made public somewhere, so others can improve and correct them to continue to be better?
nope. if i turned 'em loose now, the tarking naugshlocks here could just pick them up for their own nefarious purposes, and i have _no_ intention of letting my hard work be used that way. i came here to the project gutenberg listserves in the first place because i intended to share, but that intention has been squashed. besides, all the negative feedback i've gotten here has convinced me that people out in the world think that what i've done is impossible. so i figure there must be a couple bucks in it, and why give them up? i don't have a day-job, and my girlfriend deserves some nice things...
And where is his code? Where are his "routines"?
i don't believe that moynihan has ever made his macros available. he _did_ offer them to project gutenberg, but i guess they declined. i can't express to you just how stupid i think _that_ decision was...
ALL of the talk of how "easy" this is, is completely irrelevant, if nobody wants to actually contribute that knowledge back so others can improve and benefit from it.
if somebody thinks something (which would be very valuable to them) is impossible, and you know that it isn't, don't you think that you should _tell_them_? i do. that's why i'm here. i'm not gonna do it _for_ them, because they've abused me so -- are you in the habit of helping people who mistreat you? -- but, since i do believe so passionately in electronic-books, i feel i have an obligation to _try_ and make them wake up. that's why i've stayed here for so long, nearly a year, and taken all the abuse that they have dished out to me. because i believe in e-books, and i admire michael hart. (michael, by the way, has been very supportive of me.) but i'm about to give up, because they just won't listen. nonetheless, i feel _good_ about the fact that i _tried_...
If you're not willing to do this, then our conversation stops here.
ok. no problem. i believe in sharing, and told you what you could share with me, if you want me to share my work back with you, but if that's not acceptable to you, then i'm cool with that. i'll go my own way...
What was that "bizarre reason"?
you'll have to ask the people in charge.
If you want someone to support "your" format, then you'll probably have to take that first step by justifying and documenting it.
i've done that. i've posted it several times on this listserve. and i will send it to you backchannel. 11 dirt-simple rules.
The only page I could find describing the format was here: http://czt.sourceforge.net/zml/ And I assume that's not your project or code.
you're right, that's not it.
What kind of errors? Incorrect hyphens? Broken paragraphs? Missing end quotes? (this is common)
all of those, yes. and many more. every kind imaginable. and some that you never would have been able to imagine.
Impossible to regain, unless you have the original work in-hand, to see if there were actual CAPS used, or not.
right.
Maybe the "errors" were intentional. Many authors use poetic license to express their thoughts, and sometimes those things break the rules of grammar and spelling.
that's not it. the all-caps convention dates back to the days of keypunch machines, when computers had no lower-case characters. (there is a rumor, which i started, that michael hart actually entered "alice in wonderland" on a keypunch machine. don't know if it's true.)
How do you mean? You mean 1.jpg 1.jpg 1.jpg appearing in three places, but intended to represent 3 _different_ images?
no, i mean 1.jpg being used in 3 different _e-texts_. which means that you can't dump those e-texts into the same folder without experiencing a filename crash. which means that you need to rework all the filenames in the library if you want 'em to be unique, which is something that you do really want, if you value your sanity...
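A minimal sketch of the renaming idea, assuming each e-text lives in its own folder and that the folder name can serve as a unique prefix. The folder names and the helpers `find_collisions` and `unique_name` are hypothetical, made up for illustration; they are not anyone's published routines.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_collisions(paths):
    """Group paths by bare filename; keep only names used more than once."""
    by_name = defaultdict(list)
    for p in paths:
        by_name[PurePosixPath(p).name].append(p)
    return {name: ps for name, ps in by_name.items() if len(ps) > 1}


def unique_name(path):
    """Prefix the filename with its parent folder (assumed to be the e-text id)."""
    p = PurePosixPath(path)
    return f"{p.parent.name}-{p.name}"


# Hypothetical layout: three e-texts that each ship an image called 1.jpg.
paths = ["etext-a/1.jpg", "etext-b/1.jpg", "etext-c/1.jpg", "etext-a/cover.jpg"]
collisions = find_collisions(paths)
renamed = [unique_name(p) for p in paths]
```

Any reference to an image inside the e-text itself would have to be rewritten to the new name as well, which this sketch does not attempt.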
How do you express a Cyrillic text in 7-bit ascii? You can't.
right, that's the problem, though 8-bit e-texts have become common. many mostly-english texts, though, do have foreign words in them where an 8-bit diacritic was chopped down into a 7-bit character. some of these can be automatically replaced. but then you are running the risk of turning what _was_ a non-diacritic into one. (for example, burkey cites the change of "role" to "role" with a hat on the "o". but you know there are plenty of plain "roles" out there.)
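One hedged way to picture the risk: restore a diacritic automatically only when the plain spelling is never a legitimate word on its own, and flag everything else for a human to check. The word lists below are invented for illustration, and `restore_diacritics` is a hypothetical helper, not a description of any existing tool.

```python
import re

# Hypothetical lookup table: 7-bit spelling -> accented spelling.
RESTORE = {"cafe": "café", "naive": "naïve", "role": "rôle"}

# Plain spellings that are also valid as-is; these are never auto-replaced.
AMBIGUOUS = {"role"}


def restore_diacritics(text):
    """Return (new_text, flagged_words). Only lowercase words are handled."""
    flagged = []

    def fix(match):
        word = match.group(0)
        if word in AMBIGUOUS:
            flagged.append(word)  # leave for manual review
            return word
        return RESTORE.get(word, word)

    return re.sub(r"[A-Za-z]+", fix, text), flagged


fixed, flagged = restore_diacritics("a naive role in the cafe")
```

Here "role" is left untouched and reported, since there is no way to tell from the word alone which spelling the printed book used.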
Such as?
many large works have been split up into smaller sections, but then also "collected" into one e-text as well. there are also "collections" of certain authors, and so on. you'll want to cull out this redundancy...
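A rough sketch of how such redundancy might be detected, assuming paragraphs are separated by blank lines and that a split-out section reappears nearly verbatim inside the collected e-text. The function names and the paragraph-hash approach are assumptions chosen for illustration, not a method described in this thread.

```python
import hashlib


def paragraph_hashes(text):
    """Hash each paragraph after collapsing whitespace and lowercasing."""
    hashes = set()
    for para in text.split("\n\n"):
        norm = " ".join(para.split()).lower()
        if norm:
            hashes.add(hashlib.sha1(norm.encode("utf-8")).hexdigest())
    return hashes


def containment(part, whole):
    """Fraction of the part's paragraphs that also appear in the whole."""
    part_h = paragraph_hashes(part)
    if not part_h:
        return 0.0
    return len(part_h & paragraph_hashes(whole)) / len(part_h)


part = "Chapter one.\n\nIt was  a dark night."
whole = "Preface.\n\nchapter one.\n\nIt was a dark night.\n\nChapter two."
score = containment(part, whole)
```

A score near 1.0 suggests the smaller file is fully contained in the larger one. Differences in proofing between the two copies would lower the score, so a real cull would need a fuzzier comparison.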
Are you revealing the "problems" in a condescending way? Or in a constructive way?
i know of no more constructive way to reveal a problem than to diagram the code that will fix it and volunteer to write the app. i've done that, and had shit heaped at me. you be the judge.
The way you approach the "Hey, this is broke" process is very telling as to how you will be received and responded to for same.
and the way i am "received and responded to" is very telling as to whether i will _continue_ to offer my code that will help fix the problems...
which is exactly why you should have your own mirror of Gutenberg, or a subset of it as you work on the pieces.
i was just letting you know that piece of information. otherwise, you might think that you could get the d.v.d. of the e-texts, and simply work on the e-texts from that. odds are some of those files have already been replaced...
If you are "moving on", then it behooves you to try to contribute what you've learned (in terms of knowledge, code, or "routines") back to those who will continue to contribute and learn.
i've left a ton of messages, detailing the problems and laying out exquisite details about the fixes i suggested. feel free to mine that, if you can plow through the flak. (the vast bulk of my posts are over on the u.n.c. archives; i'm not sure if those have been brought to this list, or not. you should also be aware that the .html conversion program over on the u.n.c. machine was faulty, and many of the threads are cut off in midstream. the full thread is in the .html source, so you would have to recover the missing messages from there. that's a good example of how a conversion program can mess up.)
We're only here to help the next generation learn and improve. If we're not leaving anything here by which others can remember us and grow themselves; if we're not teaching others as we learn ourselves, then what is the point?
oh, don't get me wrong, david... :+)

as i outlined above, there are plenty of arenas in the e-book world, project gutenberg is just one of them. all the others need help too. so i don't intend to stop speaking, or to stop working on e-books. i've been doing continuous work on e-books for over 25 years now.

to the contrary, i intend to _continue_to_speak_, and _loudly_, and to finally _go_to_work_ and get some of my things finished, so e-book authors out there in the world can start _using_ them.

the difference is, rather than speak here _quietly_ and _privately_ with the project gutenberg folks here _behind_the_scenes_ on their own listserves, trying to get them to pay attention to the problems in their library, i will instead speak _publicly_, using my new blog, making noise about the many problems that are being ignored here, so at least the rest of the world learns -- and grows -- from them.

so instead of working to make the "people in charge" here smarter, which has essentially meant banging my head against a brick wall, i can instead spend some time _productively_ by making programs that people can use to spread e-books out into the world at large.

my time, thoughts, and work-product are too valuable and important to continue wasting them here on people who do not appreciate them. yours probably are too, but maybe you're more "diplomatic", and maybe they won't ignore you or badger you when you say something that they desperately _need_ to hear, even if they don't _want_ to.

oh yeah, and eventually i'll even come back and clean up the e-texts, when i've automated my various procedures to the fullest extent and implemented them in code, and the people-in-charge here have made a mess of the library in the process of trying to make x.m.l. work. because michael hart deserves better than that...

-bowerbird