February 2005 - gutvol-d - lists.pglaf.org

www.gutenberg.org
by Marcello Perathoner 25 Dec '05

25 Dec '05

As of my request ibiblio has changed our apache virtual host ServerName from gutenberg.net to www.gutenberg.org. What? Nothing changes in web site operation except one little particular (if you hit a non-existing url you get redirected to www.gutenberg.org instead of gutenberg.net) Why? The .org top-level domain is meant for non-profit organizations while the .net domain is meant for network infrastructure such as providers, backbones etc. We own both gutenberg.net and gutenberg.org, but using the latter one is just more standard-compliant. Although both urls www.gutenberg.net and www.gutenberg.org give exactly the same results, you should start using www.gutenberg.org in all publications, papers etc. -- Marcello Perathoner webmaster(a)gutenberg.org

3 2

Enlightened self-interst
by Ron Aitchison 04 Apr '05

04 Apr '05

Having discovered Jane Austen regrettably late in life I have down-loaded a couple of novels and since I find the raw text format unpleasant to read I have reformatted for my own use. It seems to me since I have the ability to produce PDFs and OpenOffice formats and even - heaven forfend - MS doc format should they be wanted, it would be churlish not to make such an offer. If you can point me at a standard for PDF, page width, font size etc, etc., and let me know what formats you do want I would be happy to undertake the small additional work for the two novels I have currently downloaded. I cannot supply DocBook at this time but hope to have that available shortly. Regards -- Ron Aitchison

3 2

Re: Arabic eTexts
by Juhana Sadeharju 19 Mar '05

19 Mar '05

Are those Arabic OCR software open source and free? Having no Arabic OCR software has not prevented us from digitizing Arabic texts earlier. If only buying a $$$$ software gets you motivated to digitize arabic texts, then it is fine by me. However, I feel the arabic texts should be digitized first as image files. Specially if the text is written by hand. This apparoach will be cheaper and faster as well. Please don't make the mistage of not archiving and making available the images if you choose the OCR approach. I'm pleased to archive any arabic digitizations as image files for now and for future use. Only image files can preserve the text as close to original as possible. Juhana -- http://music.columbia.edu/mailman/listinfo/linux-graphics-dev for developers of open source graphics software

4 5

David Wyllie Email address
by Alex Wilson 04 Mar '05

04 Mar '05

About a month ago Greg Newby offered to get me in touch with David Wyllie--who provided the English translation of Kafka's Metamorphosis for PG--and I haven't heard from him since. I'm thinking Greg's emails or mine are ending up in a junk mail folder, so I'm wondering if anyone here knows how I can get in touch with Mr. Wyllie. Thanks. Alex. http://www.telltaleweekly.org - Funding a Free Audiobook Library

5 9

Enlightened Self Interest
by Ron Aitchison 02 Mar '05

02 Mar '05

Understand the issue of editing. My proposal would be to supply an editable file in OpenOffice or MS doc format (BTW if you are not using the Open Source OpenOffice suite I recommend you check it out - the features are great, at least as feature rich as MS word, plus - one button PDF creation, output as doc, text or native XML format and a great price = $0! http://www.openoffice.org ). I propose to take nothing away you will have edit control over the file. This also opens up another question over what base document formats you have standardized for editability and portability e.g. OASIS etc.. Maybe that is a topic another list. Finally I note you have PDF formats available for some other books. Andrew Sly wrote: >One possible problem is that PDF files are not easily editable. > >All of our older texts are being gradually worked through, >corrected, supplied with a new PG header (which puts all the >legal "small print" at the end of the file instead of the >beginning) and REPosted into the currant directory structure. >When this process is done it will make some of the back-end >organization much easier to deal with. > >However, if during this process, we come across a non-editable file >(PDF, Lit, whatever), we cannot update it, and it's generally moved >into an "old" directory, where it is still availible if someone >goes looking for it, but otherwise is not shown in the catalog. > >Andrew > >> Having discovered Jane Austen regrettably late in life I have >> down-loaded a couple of novels and since I find the raw text format >> unpleasant to read I have reformatted for my own use. >> It seems to me since I have the ability to produce PDFs and OpenOffice >> formats and even - heaven forfend - MS doc format should they be wanted, >> it would be churlish not to make such an offer. >> If you can point me at a standard for PDF, page width, font size etc, >> etc., and let me know what formats you do want I would be happy to >> undertake the small additional work for the two novels I have currently >> downloaded. >> I cannot supply DocBook at this time but hope to have that available >> shortly. >> Regards >> >> > > -- Ron Aitchison

10 13

Re: [gutvol-d] Interesting message on TeBC from NetWorker (about fixing errors, Frankenstein, etc.)
by Bowerbird＠aol.com 28 Feb '05

28 Feb '05

jon said networker said: > Project Gutenberg e-texts satisfy none of these wishes well, i guess networker will have to start his own project, eh? give him my best wishes! :+) > Conclusions > Of course, the goal of this exercise was not to establish the > provenance of the Project Gutenberg e-text of _Frankenstein_, maybe not. but having done so, it is _refreshing_ to know that -- when that's factored in -- only "a handful" of errors surface. so once again, in spite of some very big noises, it ends up that this fails to stand as a good example of an error-ridden e-text. > nor to discover if there are any errors in the PG e-text, but > to determine if there was an automated method of reducing errors > in newly scanned e-books for which a Project Gutenberg e-text > already exists. I'm afraid the jury is still out on this question. as for this "conclusion", the jury may still be out in _his_ mind, but in mine, the answer is very clear, and i've said it before here: if you do the scanning properly, manipulate those scans correctly, use abbyy in the best way, and subject its results to the right tools, you will reduce the errors in your text to a relatively small number. (the number we've been kickin' around is 1 error for every 10 pages, and at that point, proofreading by the public becomes very viable.) if you then have the rare luxury of evaluating your output against an existing version of the book -- like a project gutenberg e-text -- with the right tool (which networker obviously does not yet have), the comparison between the two, alongside the page-images, should make the process of coming to an error-free version simply a breeze. since this is _exactly_ what will need to be done _increasingly_, as the page-images from the internet archive and (we hope) google -- plus the work done by individual people scanning everywhere -- emerge into cyberspace, that's where my tool-development efforts are now being focused. i suggest networker start reading my blog; it should start being updated on a daily basis starting next week... -bowerbird

2 1

ok, this is my last post for the time being, really
by Bowerbird＠aol.com 28 Feb '05

28 Feb '05

jon said: > Bowerbird, I'll be happy to put up > a ZML regularized text version of My Antonia. if you prepare your browser-based versions carefully, copying text from the browser-window will give z.m.l., so there's no real reason to create a separate version... (of course, if you don't do the correct preparation...) for me, "round-tripping" is one of the biggest priorities. and by that, i mean that when a z.m.l. file is presented, the end-user should be able to copy text out _as_ z.m.l., so that -- with just a few global search-and-replaces -- when reloaded into a z.m.l.-viewer, it will look the same. (even if its text-styling is stripped away, as can happen when it's saved as a plain .txt file, it should be restored. automatically.) i have attained this in the z.m.l. viewer-program already. (this was pretty simple, as i control all the operations.) i've also attained it in my .pdf version, where i'm able to work around the limitations of acrobat's copy operation -- if you've ever copied text from a .pdf, you realize how awful it mangles the formatting -- by controlling what my viewer-program writes to the .pdf in the first place. (to answer the first question of a knowledgeable person, i write a dummy-line as a separator between paragraphs, so a global replace restores the blank line between them.) when i get around to making my zml-to-html converter, i will try to make sure that the .html that's created will copy out of the browser-window correctly too. however, browsers do some funky crap in their copy operations, so it might not be possible to preserve _everything_, at least until the browser-programmers tighten up their act there. when i do that work, i'll share any tips people need to know in order to prepare an .html version to copy out good .zml... but in many cases, even now, a copy out of a browser-window can produce text that is .zml, or can be easily converted to it. for instance, jon, your website that gives your listserve rules creates a nice .zml file. consistent formatting yields good .zml. *** jon said: > Yes, if this is the case, it is mysterious > since IA will gladly host them > once the etext version is out the door. sometimes i wonder if the internet archive is quite as accommodating as you always seem to make them out to be. i don't know otherwise, but it seemed if they were, then d.p. would've put their scans up long ago. (unlike their site, page-scans wouldn't need quick response-time.) pourlean said: > An accessible archive of posted > projects & images is in the works. i look forward to the day it comes online! (if there is any particular stumbling block, do please let me know, as maybe i can help.) *** jon said: > It's on me. In private email send me your address and > I'll burn and mail you a disk of the 600 dpi and 120 dpi scans hey, thanks for the gift, jon, i appreciate it! but i can't use 30 megs worth of scans in my little project; it's just a demo. so i wrote a quick program to grab a few dozen from the site. and now that i've done that, i can grab 'em all, if i ever need; since i was just looking for a way to do it in one fell swoop, i shoulda just done that straightaway, instead of bugging you. but maybe someone else will now be able to make good use of the zipped package of the scans that you added to your site... -bowerbird p.s. now if you'll all excuse me, i really need to go back to work... :+)

1 0

Interesting message on TeBC from NetWorker (about fixing errors, Frankenstein, etc.)
by Jon Noring 28 Feb '05

28 Feb '05

[I'm forwarding the following message by "NetWorker" posted to The eBook Community. Any followup to the specific points NetWorker raises are maybe better posted over there, especially if you'd like NetWorker to see your comments ("ebook-community" at YahooGroups). NetWorker is very thorough... Jon] NetWorker wrote [a few days prior]: > Project Gutenberg e-texts may yet have a role to play in the > production of high-quality e-books. Rising to the bait, I have placed > a hold on my local library's one(!) copy of Frankenstein, which I > will scan when it arrives. I will try to create a highly structured > e-text (certainly not as fine as what Jon did with "My Antonia"). I > will then try to find a way to preprocess both the OCR'ed text and > the Project Gutenberg e-text in such a way that the two files can be > meaningfully "diffed." Hopefully, I can come up with a method that > will allow existing PG e-texts to be an automated "proofread" of next > generation Public Domain e-books. Boy, _that_ was an interesting experience! The project goals: As an e-book consumer, I want an e-book that contains _lots_ of metadata; the more the better. I want the metadata to be patterned, so that I can use automated tools to manage a collection; sorting by author, genre, publication date, publisher, contributors such as editors and illustrators, etc. I also want the actual text to be marked-up in such a way that 1) I can view the text with all the presentational richness traditionally associated with a paper book, if I choose to do so, 2) I can convert unambiguously from one markup language to another, and 3) so that I can do a structural analysis of the book using automated tools. I also want a mechanism to know if apparent errors in the text are due to transcription errors or the author's intent -- this can be accomplished by including source information in the metadata, or providing access to page scans; both would be preferable. Project Gutenberg e-texts satisfy none of these wishes; to create an e-book which _does_ satisfy them pretty much requires starting from scratch. Scanning technologies are quite advanced these days, but OCR is still not 100% accurate, and automated spell checking can only go so far. Clearly the most time-consuming -- and most error-prone -- part of producing a reasonably accurate e-book is proof-reading by a human being. My goal was to discover if Project Gutenberg e-texts, which are presumably fairly accurate as to the _words_, if nothing else, could be used as yet another automated preprocessing step to reduce typographical errors to a minimum before the actual proof-reading begins. The process: To test my theory, I decided to use the novel _Frankenstein_, by Mary Wollstonecraft Shelly. _Frankenstein_ is clearly in the public domain, is known to have at least two versions, and has been the subject of a fair amount of discussion on this list in the recent past. I obtained a copy of Frankenstein from the public library; it was published in the "Barnes & Noble Classics" series in 2000. I was fairly pleased with the edition, as it was printed in a rather old-seeming type-face which gave the appearance that it was in fact a photo reproduction of a much older text; it seemed likely that it had not gone through much in the way of re-editing to modern conventions. I scanned and OCR'ed the book using ABBYY FineReader. I then did a spell-check of the book from within FineReader so I could compared "misspelled" words to the actual scanned image. I then saved the text as an HTML file. In the past I have written a couple of programs to help in the creation of e-books. TidyeBook is based on the HTML Tidy code base. It fixes some of the inaccurate HTML produced by ABBYY, strips headers and footers but leaves page numbers intact, if invisible, and merges broken paragraphs when it can do so without question. html2txt, based on an earlier C++ version of HTML Tidy, takes an HTML document and reduces it to simple text similar to that used by Project Gutenberg. Next I ran "frankenstein.html" through TidyeBook to clean up the HTML. I then hand-edited the HTML to fix paragraph breaks not fixed in the automated process, or which should not have been broken. I also fixed those instances where hyphenated words spanned a page break (very easy to do given the output of TidyeBook). I then generated an Impoverished Text Format version of the HTML text using html2txt. My strategy was to use the Gnu "diff" program to detect differences between the simplified version of my work product, and the Project Gutenberg version. Because "diff" is line-oriented I needed to normalize the two texts so there was a greater likelihood that lines would be correctly matched. I did this by writing yet another program (this could probably have been done more efficiently by a Perl or AWK script, but I am not very familiar with scripting languages, but am a highly proficient C/C++ programmer; it was easiest for me to use the tools at my disposal). The new program would reduce each file to lines of no more than 60 characters (the shorter the line the easier for a human to find the difference detected). Additionally, the program would start a new line whenever it encountered what is conventionally accepted as sentence-ending punctuation (!.?) or two newline characters in a row, which would signal the beginning of a new paragraph. All whitespace was reduced to a space character, including multiple whitespace characters. I used the new program to normalize the text produced by html2txt and that of frank14.txt from Project Gutenberg. I then compared the two resultant files using gnu diff and Microsoft's WinDiff. The results: I was quite surprised to find literally thousands of differences between the two texts. Most of the differences were changes in punctuation and capitalization. Many em-dashes were converted to semicolons or omitted altogether, and many semicolons were converted to commas. Some words capitalized in my scan (eg. Paradise) were converted to lower case (paradise). Some phrases were "fixed" ("our uncle Thomas's book" became "our Uncle Thomas' book"; "an European" became "a European"). Some words were Americanized ("tranquillise" became "tranquillize") yet other words are not ("favourite" remained "favourite"). In an attempt to discover the source of these differences, I visited a number of not-so-local libraries, and checked out a number of different printings of _Frankenstein_. Two of the most interesting are Leonard Wolf's _The Annotated Frankenstein_, Clarkson N. Potter, 1977, which claims that "In order to ensure the authenticity of the text, we arranged with the Library of Congress in Washington, D.C., to microfilm a copy of the first edition. That text has been reproduced in this volume by the photo-offset process," and the Penguin Classics edition which includes an appendix identifying the differences between the 1818 and 1831 editions (while significant, they are neither as pervasive nor as substantive as has been earlier suggested). Neither of these editions contained the punctuation or spelling changes of the Project Gutenberg edition. One of the books I checked out, rather serendipitously as it turns out, was the Bantam Classic edition, which was first published in 1981. Of all the editions I consulted, only the Bantam edition contains virtually all of the changes I noted in the Project Gutenberg e-text. The PG edition is apparently based on Mary Shelly's revised 1831 edition (although it has lost both the "Author's Introduction to the Standard Novels Edition (1831)" by Mary Shell, and the "Preface" to the 1818 edition by Percy Shelly). I thus believe that the PG edition is based on the Bantam Classic edition of 1981. <sidebar> Interestingly, copyright law provides protection to changed versions of public domain texts if those changes are of a nature that they are more than mechanical and provide some modicum of creativity. Clearly, the punctuation changes are not merely mechanical, and in some cases actually change subtly the nuances of the text. Ironically, of all the textual bases that Project Gutenberg could have used for its e-text of _Frankenstein_, it choose the one which is apparently still protected by copyright! </sidebar> I modified my text normalization program to discard all punctuation except hyphens and underscores (and of course excluding the sentence-ending punctuation mentioned earlier). This reduced the noise to signal ratio enough that the differences started to become meaningful, although it still resulted in at least 500 differences. It allowed me to discover a handful of OCR errors that had been missed by the earlier automated methods, and I have so far also found a handful of errors in the PG text ("But must finish." should be "But I must finish.", "every sight ... seem still to" should be "every sight ... seems still to", "destroy radiant innocence" should be "destroy such radiant innocence", etc.) Conclusions: Of course, the goal of this exercise was not to establish the provenance of the Project Gutenberg e-text of _Frankenstein_, nor to discover if there are any errors in the PG e-text, but to determine if there was an automated method of reducing errors in newly scanned e-books for which a Project Gutenberg e-text already exists. I'm afraid the jury is still out on this question. If the texts are a different as the PG edition of _Frankenstein_ and virtually all other editions, the process of sorting through the chaff to find the grain of wheat may not be worthwhile; I believe that the OCR errors discovered so far were blatant enough that they would have been easily discovered in the first proof-reading, and I believe that human proof-reading will always be required no matter how good our automated tools become. Some time ago I produced an HTML e-book version of Mark Twain's _Pudd'n'head Wilson_. I believe I will put that e-book through that same process. I will then attempt a new scan of some other, perhaps more obscure, PD work that already has a PG version. Having at least three data points, I will report again later. p.s. -- If someone from Project Gutenberg wants my diff file to update the PG e-text, I will be happy to e-mail it to you; it is approximately 95k in size.

1 0

re: [gutvol-d] one more thing, for jon noring
by Bowerbird＠aol.com 28 Feb '05

28 Feb '05

robert said: > English-language novels tend to go quickly. > If they qualify for the "easy" queue, > they spend little time queuing, and > often complete proofreading within a few days. when i said "fly through d.p. in a few hours" i meant literally, not figuratively. (well, i guess the "fly" part was figurative, :+) but the "few hours" part was quite literal, especially since first-time producers get to go to the head of the queue. go ahead, time it.) -bowerbird p.s. a text that is already this clean would most definitely be put into the "easy" queue. moreover, re-doing existing e-texts might be a very good test of the formatting rounds that are now being contemplated for the d.p. future. (of course, those are a complete waste of time, in my humble opinion, but they _are_ the plan...) p.p.s. robert, didn't you make a d.p. forum post that also listed a bunch of plain-text formats recently? or was that dazb? or someone else?

1 0

re: [gutvol-d] one more thing, for jon noring
by Bowerbird＠aol.com 28 Feb '05

28 Feb '05

jon said: > But there was a deadline to finish the first beta of the > cleaned-up text, so there was no time to have this done at DP. this text would fly through d.p. in a matter of hours... > and found over 200 differences but my comparison shows that most of those are minor, to the point of total insignificance to the average reader. when the focus is narrowed to meaningful differences, the number is less than 20. it is good to correct them -- and the less-significant ones too -- very good, but this is hardly a good example of an error-ridden e-text. > Unfortunately, since the 120-dpi scans are > antialiased greyscale (while the 600 dpi are bitonal), > the size difference is surprisingly not that different. > I updated the My Antonio index page to include > downloading all the 120-dpi scans in a ZIP file, > which is still over 30 megs in size: the .pngs on the website would seem to be much smaller. roughly 400 of those, at about 20k each, would be 8 megs. is my arithmetic wrong? or am i missing something? -bowerbird

3 2