March 2005 - gutvol-d - lists.pglaf.org

re: [gutvol-d] lest the message be missed
by Bowerbird＠aol.com 08 Mar '05

08 Mar '05

codepoints for macroman encoding: > 142 é Bénédictine > 142 é naïveté > 144 ê crêche > 149 ï naïve > 149 ï naïveté. > 150 ñ cañon > 170 ™ Edition™ > 174 Æ Æneid > 190 æ antennæ > 231 Á Ántonia -bowerbird

2 1

re: [gutvol-d] lest the message be missed
by Bowerbird＠aol.com 08 Mar '05

08 Mar '05

josh said: > That's the most laughable thing I've read in a long time. evidently josh missed the message. and why does that not surprise me? nonetheless, i invite skepticism. when i do the entire "my antonia" -- sometime later in the week -- i'll log my time and document all of the changes i make on the file, and wipe all that skepticism away. *** david said: > If you're "smart enough", you could just retype the book from memory. i'm not that smart. are you? > Completely reversible, > provided that you're already familiar with the work > (which you have to be, else you wouldn't know that > Antonia needs an accent), > is a pretty lousy standard. yeah, that _would_ be "a pretty lousy standard". which is obviously why it's not the one i'm using. for those who are smart enough to think about it a bit, "completely reversible" in this type of situation means that once you change it, you can change it back any time. it doesn't mean you have to magically know what to change. (if you know that, _every_ change is completely reversible.) -bowerbird p.s. note to the list subscribers: i usually don't respond to david, since his points are too often paper bags that cannot hold water -- he's always clever enough to find a fault, but seemingly never clever enough to realize why it doesn't apply, or to find its obvious solution -- just like this post, but since this _was_ in regard to something that i was putting "on the record", i am compelled to respond to it. having done it once, though, i probably won't bother doing it again...

2 1

Database down?
by Pauline 07 Mar '05

07 Mar '05

Hiya All, Did I miss an outage notice? The PG server appears to be having hassles: I keep seeing "Could not connect to database server." when I try to access etexts. Thanks, P -- Help digitise public domain books: Distributed Proofreaders: http://www.pgdp.net "Preserving history one page at a time." Set free dead-tree books: http://bookcrossing.com/referral/servalan

10 17

Shakespeare's Birthday April 23rd
by Marcello Perathoner 07 Mar '05

07 Mar '05

I got a request from a proof-reader to celebrate Shakespeares birthday with a banner on the site. (The original request being to celebrate St. Georges Day, but I don't think that one qualifies.) Any ideas? -- Marcello Perathoner webmaster(a)gutenberg.org

1 0

re: march forth
by Bowerbird＠aol.com 06 Mar '05

06 Mar '05

jon said: > What type of global corrections were these? the type that is made easy by my tool. that's all i'll say for now. > One area is how to handle hyphenation, > and whether there was a short dash in the compound word > in the first place before the typesetter hyphenated the word. as i said, i ignored the issue of hyphenation for the time being. my tool will give a number of ways to deal with hyphenation, but the routines haven't been brought into the current version. but i can give a general overview. end-line hyphenation is removed. the hyphen in compound words is retained. to tell the difference, when there is ambiguity, you look at the rest of the text, to see if the word was handled consistently there. if it was, you match that. if not, you have more work to do. that's where it gets interesting. to go any further is to give too much information for here and now. > Hopefully you used the 600 dpi bitonal which should OCR the best. i did. > Antialiasing actually causes problems > (notwithstanding the much lower resolution.) right. i first thought the periods misrecognized as commas were the effect of anti-aliasing, but i used the 600-dpi scans. so it must be something else causing that problem. > One thing you could do is to look at the 600 dpi pages at 100% size > for which the punctuation was not correctly discerned. You probably > will see some errant pixels that fooled the OCR into thinking > it was some other punctuation mark than it is. i didn't care that much, really. the post-o.c.r. software can solve the problem well enough. i mentioned it for the record, for the sake of full disclosure, and to see if anybody knew why. > punctuation is a toughie for OCR to exactly get right, even if the recognition is admittedly somewhat difficult, i expect abbyy to correct "mr," and "mrs,", for instance. but even if abbyy doesn't, that's easy for me to program. > Resolving this usually requires a human being to go over, > especially for Works from the 18th and 19th century > where compound words with dashes were much more common if you want to retain those arcane spellings, it's difficult. if you wanna update them, the computer does it very easily. "to-day" and "to-morrow" become "today" and "tomorrow". instantly. > Sometimes one has to see what the author did > elsewhere in the text. is there some reason you think the computer can't do that? > In a few cases a guess is necessary based on understanding > what the author did in similar cases in the text. oh, i see. it takes "understanding". one of those rare precious human-being things. well then, i guess there's no way to program it. > Some of this can be automated. In other cases > it requires a human being to make a final decision. > I followed the UNL Cather Edition here. it's always easier to let other people make the decision, isn't it? ;+) > Whether or not it is an "unnecessary" distraction, > it is better to preserve the original text in the master etext version. well see, jon, that's where i differ with you. and other people do too. but like i said, as long as it's just one global change away, no big deal. i see lots of other cases, as well, where you diverge from the paper. a good many of the quotation-marks are set apart from their words. you're making editorial decisions whether you acknowledge it or not. > My thinking is that if someone wants to produce > a derivative "modern reader" edition of "My Antonia", > they are welcome to do so and add it to the collection > because the original faithful rendition is *already* there. whose "collection" are we talking about here jon? yours? do you have any intention of adding more "my antonia" editions? specifically a "derivative modern reader"? if so, i will submit mine. but surely you don't mean michael hart's project gutenberg collection? because, according to you anyway, he doesn't have a "faithful" rendition in his library, not even one, not *already* anyway. just a mangled one. another difference between your collection and michael's is you have 1 book in your collection and he has 10-15 thousand in his collection, depending on who is in charge of defining how the official counting is tabulated these days, it appears. whether you like it or not, that's a comment on the philosophies. > indicating this was more of a typesetter's convention > rather than something Cather specified. well that's a convenient dodge, isn't it? and of course you have no real _evidence_ that this is the case, do you? so you _really_ should enter each case as it _appears_, shouldn't you? at least if you want to stick to your philosophy? > In addition, the UNL Cather Edition closed off all the apostrophe s > (no spaces), but kept the space for many of " n't" words. > So here again I followed the UNL Cather Edition. and that's the difficulty with following an authority, ain't it? there are often so many, it's hard to know which one to follow! i know i can't keep up even with the editions of this one book! so how would a person possibly keep up on tens of thousands! and before you know it, you're having arguments about _that_! and not reading the book, or digitizing it, or playing at the park. and i don't know about you, jon, i don't think you're being consistent. you said you were reproducing what is right there in black-and-white on the page itself, even made high-resolution scans to prove it to us, and now you're making judges that are easy to spot. and to justify it, you're quoting some other figure of "authority". that's inconsistent. but heck, i have to be honest here. even if you _were_ consistent, and kept all of those quirks from the paper-book that _i_ consider to be distracting, the first thing i'm gonna do is global-change 'em. so all that hard work you did was for no good purpose to me. > Cather wanted the line length to be fairly short, > so this puts extra pressure on typesetters > who will either have to extend character spacing > for a particular line or scrunch it up more than usual, > depending upon the situation with the rest of the typesetting > on the page, and whether certain words can be hyphenated or not. oh!, hold it!, wait!, did i just hear you say what you just said? i think i did! yes, i'm quite sure i did! "cather wanted the line length to be fairly short". wow. you mean author-intent can go to _the_length_of_lines_? do you realize how significant that is to your philosophy, jon? it means you will need to respect willa's wishes on the matter. none of the long lines you might get in a web-browser! no sir! willa wanted short lines! (is that why the book looks so narrow?) > You mean accented characters? if they aren't in the lower-128 of the ascii range ("true ascii"), yes. > Accented characters are *always* important > to preserve under all situations. according to you, maybe. according to me, it depends. in this case, i say no. that's my prerogative as an editor. (and i _do_ consider myself an editor, not just a copyist.) > There's no need anymore, in these days of > Unicode and the like to stick with 7-bit ASCII. until unicode works flawlessly on every machine used by all the people i know, for texts like this that have only the occasional character outside the lower-127, where the meaning isn't changed, i'll stick to plain ascii. > I sense that you don't want to > properly deal with accented characters first of all, jon, i define what "properly" means for me, you don't. you can define it for yourself. but i won't let you define it for me. > I sense that you don't want to > properly deal with accented characters > since this poses extra problems with OCRing and proofing, nope. it's just that i see them as _unnecessary_ to this book. if a reader thinks it _is_ necessary, make the global-change. > something you are trying to avoid in your zeal to get everything > to automagically work. To me, that's going too far in simplifying. i'm not "simplifying". i'm consciously making a choice to use something that will work on the broad range of machines out there, as opposed to something that -- in far too many cases -- fails badly. it's a pragmatic decision based on real-life knowledge of the actual infrastructure of machines that exist out here in our real world. it's the same pragmatic decision that michael made when he crafted the philosophy guiding the building of this library of 10,000+ e-texts, in sharp contrast to your philosophy, which has built a 1-book library. > Preserving accented characters are important. in some cases, i'd agree with you. in others, not. in this case, not. > punctuation changes can sometimes subtly affect the meaning. you know, as a writer, i'd really like to think that's possible. as a person who uses a lot of commas, i _want_ to believe it. but i'll be darned if i can think of that many good examples. if you can, i would _love_ to hear them. and if you can show me _any_ in "my antonia", any at all, i'd give you extra bonus points. as it is, though, i just have to resign myself to the position that o.c.r punctuation errors are a distraction, but make no difference. i'll still root them out, due to my sense of professionalism, but i sure wish it felt _fun_, instead of feeling like _doing_chores_. and to the extent that i can automate the chores, i'll be _happy_. > They are hopefully caught by human proofers/readers > when grammar checkers don't (I do use Word to > help find both spelling and punctuation errors -- > when they find something, I then manually > check it in the page scans and the master XML.) oh, so you _do_ use an assist from your tools at times. that's good. > They are "sometimes" easy to spot. > Other times the automatic routines will not catch errors maybe the automatic routines you are using are just inferior. use my tool. if it doesn't spot something it should, let me know. > Usually true, but there are some rare exceptions where > an abbreviation can be mistaken for an end of a sentence. not if your routines are as smart as mine are. > Then there's the ellipsis issue i'm three-dozen layers deep on some of these issues, and you want to talk about level 2. i'm not interested. use my tool. if it doesn't give you the results you want, let me know. > This is also true, but as found in "My Antonia", > there are exceptions to pure nesting, such as > when a quotation spills over into several paragraphs > where the intermediate paragraphs are not terminated > by an end quotation mark (whether single or double.) is it really your considered opinion that i don't know this? that i haven't factored it into my thinking _and_ my tools? maybe you're grandstanding to the lurkers, but my goodness, jon, do you really think that _they_ are that stupid too? > Also, apostrophes are sometimes confused with single right quote marks. ditto. > With a smart enough grammar and parser, > the above might be properly parsed and the blah blah blah. use my tool. if it doesn't figure out your stuff, let me know. > But still, real-world texts tend to throw a lot of curve balls > that are sometimes hard to correctly machine process. i know how to hit 87 different pitches, from both sides of the plate, and you're telling me to "watch out for the curve balls". i laugh at you. > OCR is quite fast. It's making and cleaning up the scans > which is the human and CPU intensive part. wait! i thought you said _proofreading_ and _mark-up_ were the steps that take up the most time. didn't you? or do i have you confused with someone else? > Well, not all of the pages have been doubly proofed. > The team is not finished, and I plan to post a plea > somewhere for more eyeballs to go over it. have you heard about distributed proofreaders? might be able to find some people there... (ok, now you see what it feels like.) > I would like to receive error reports as well for this text, i'll tell you the same thing i told michael about project gutenberg: set up a system for the checking, reporting, correction, and logging of errors, a system that is transparent to the general public, and i will be more than happy to report errors to you, and help you out. otherwise, you waste my time, as i figure someone else can do it. which, by the way, is what everyone else is thinking. which is why errors in the texts are not being reported at nearly the frequency that they should be being reported. but i've got another message sitting here waiting to be sent where i discuss that topic in more detail, so i'll stop here now. > since Brewster wants highly proofed texts > for some experiments he plans to run similar to yours. i'll have to ask him about his tests. > But if I have to use the version you donate to PG, so be it. :^) probably, yep. if michael wants it. they say he'll take just about anything... > I did find one error in my text based on the list you gave. Thanks. you're welcome. but that's not the one i was talking about. :+) > I assume you discovered > the several different paragraph breaks in the PG edition? nope. i didn't even evoke the routines to examine paragraph-breaks. i considered doing so, once you said that there were differences, but decided it was just too inconsequential to even bother with it. it's another one of those things i would very much like to see a case where it made a difference, because i'd love to believe it _could_, but in the absence of a case (or even an _imaginary_ possibility, which i confess i can't come up with, not off the top of my head), i am forced to relegate it to the "too trivial to think about" pile. as above, i'll make the corrections, but i ain't gonna sweat 'em... -bowerbird

3 2

have a nice day
by Bowerbird＠aol.com 06 Mar '05

06 Mar '05

i believe there's no interest in these threads, other than from jon and myself, so i will reply to him backchannel and be done with it. if anyone else does want a copy, let me know, and i'll send it to you as well. thank you. the proof, as always, is in the pudding. other than that, have a nice day... :+) -bowerbird

1 0

Please test www-dev.gutenberg.org
by Marcello Perathoner 05 Mar '05

05 Mar '05

We are ready to migrate the web site to the new fast file server. Also some slight changes were made to the online catalog to make it better cacheable: The dynamic authrec pages have been dropped in favour of the static browse-by-author pages. Browse-by-author now includes all information from the authrec pages. Redirects are in place. The search has been optimized to redirect simple searches (searches for author only, title only) to the appropriate browse-by-author and browse-by-title pages. A preview is online at: www-dev.gutenberg.org Please test and report any oddities. -- Marcello Perathoner webmaster(a)gutenberg.org

5 9

re: Anatomy of a Book
by Bowerbird＠aol.com 05 Mar '05

05 Mar '05

andrew said: > http://www.bibliophilegroup.com/biblio/other/school/anatomy.html what a marvelous page! funny and informative at the same time! a winner! -bowerbird

1 0

re: [gutvol-d] new thread for noring
by Bowerbird＠aol.com 05 Mar '05

05 Mar '05

david said: > Those are called "half-titles," btw. oh cool, thanks. i figured they had a name, just didn't know it. -bowerbird

2 1

re: [gutvol-d] new thread for noring
by Bowerbird＠aol.com 05 Mar '05

05 Mar '05

jon said: > I hope you get an error rate that is one per ten pages i'll do my best. :+) > even if you do, I still believe a DP-like process > is necessary to catch errors that OCR can't handle human readers will _always_ be necessary. (and easy enough to find. if no one wants to read a book, there's little call to digitize it.) thus a system of "continuous proofreading" will be quite good enough if we can make the computer-guided processing accurate enough. > and for someone to properly assemble the pages, > structure the document, etc., after the OCRing/proofing > is complete. that's part of what i include in "post-o.c.r. processing". > I don't quite put the same level of > faith in OCR as you seem to. except that once you see the evidence i lay out, you will realize "faith" has nothing to do with it. as i've been saying all along, for professionally typeset books, the structure is _in_ the presentation. so o.c.r. gives you all the information you need, if you know how to look for it, and do so diligently. > Btw, I believe as you do that an error reporting system > is a good idea so readers may submit errors they find > in the texts they use -- sort of an ongoing > post-DP proofing process. post-d.p.? i see it _replacing_ d.p. for most books. and good thing, too. once the coming avalanche of scanned-books engulfs us, it'll be the only way most books have a chance to surface. that will take the pressure off distributed proofreaders, and they'll be able to focus on the books that _need_ them. > Obviously, it is necessary to make available > the page scans of the source document to aid in this process. > How can an error be properly verified and corrected > when the source work is not available? i've always said i think that page-scans should be publicly available. particularly if your mission is _transcribing_an_existing_edition_. (although, to remind people again, copyism is _not_ the mission that michael hart chose to embed within his project gutenberg.) but even in the case of project gutenberg's "amalgamated" e-texts, i believe that a page-image graphic-version should be made available. this would allow people to view it on a dvd-player, just as an example. > Scanning took quite a while (much more than four hours) that doesn't surprise me. nonetheless, i'll limit myself to 4 hours. that's quite enough time to devote to it. and to prove the point too. > I deemed it important for processing purposes that > the name of the image contain semantic information > of what it represents, and that > naming be consistent for file sorting purposes. as one improvement, i would suggest _not_ using "001.png", etc. instead, preface each one with a string that will make it _unique_, such as "ma2005feb001.png". it's easy to tell that to the o.c.r. app -- you just type it in one time -- and it's an unmistakable stamp. and of course, if you're going to do hundreds or thousands of books, you want to cook up a naming convention that conveys information. on big multimedia projects, it is not at all uncommon to have one _full-time_ employee dedicated _solely_ to maintaining filenames. because if things go wrong, it can waste a whole lot of man-hours. oh yeah, one more suggestion. your front-matter filenames were prefaced with an "r". my typical recommendation is that they be prefaced with an "f", and that the regular pages be named with a "p", so the front-matter files will sort _on_top_of_ the regular pages. i want to be able to depend on the operating-system filename sort to give me pages in the exact order they appear in the book itself. so i use a "q" on back-matter files, so they will drop to the bottom. for illustration plates, i use a name that sorts _them_ correctly; for instance, if an illustration page is between pages 168 and 169, name it "p168a.png". (and don't forget the blank verso side either!, which you will name "p168b.png".) > The publisher simply chose to start at page 3. Was this common? it's not uncommon. oftentimes there is a "title-page", consisting of nothing more than the name of the book, which is considered "page 1", with its blank verso being "page 2", so chapter 1 starts on "page 3". sometimes chapter 1 starts on page 7. or page 11. publishers are weird. > Maybe there was an intent to insert a page there, > which after typesetting it was decided not to.) sometimes that happens too, yep. an "unnecessary" page gets dropped when the typesetter realizes they didn't plan the signatures correctly. or when the preface runs two pages longer than was originally intended. or any number of other snafus spring up. shit happens. > It was my intent to reproduce each page for direct reading purposes -- > that is, if somebody wanted to read the book as it was printed, > then they could. yeah, and sometimes people want to do exactly that. which is why the page-images should be made available. for many illustrated books, the text alone is not enough. you want to be able to see the pages as they were printed. my viewer-program will work with either, text or images. it'll even work in "hybrid" mode, so you can display the text in one of the 2-up pages, and the page-image on the other side. (and of course that is the mode which is used for proofreading.) that's why things like _blank_pages_ are so important to include. because if you toss them out, you screw up the left/right sequence. a convention of paper-books is that odd pages always go on the right. screw that up and you make yourself look silly. anyway, that's all for now. -bowerbird

3 2