Re: [gutvol-d] Re: barriers to XML posting

----- Original Message ----- From: Steve Thomas <stephen.thomas@adelaide.edu.au>
A question (possibly better put over on the DP list):
Is it possible to OCR a scan directly to XML? Or is the output from OCR always going to be text?
That is a very DP-related question, but I'll answer here as best I understand the future plans (and let others correct me where needed).

The plan at DP is to move from the current 2-round proofing model to a (probably) 4-round proofing/markup model. The content provider will take the scans and OCR them normally. That part doesn't change. Then there are 2 rounds of proofing that concentrate on typos, spelling, etc., very similar to the 2 rounds we have now. Then there are 2 MORE rounds of markup. Here is where all the markup such as poetry, italics/bold, footnotes, chapter headings, thoughtbreaks, etc. is done. When the final result gets out of the 4 rounds, it is (in theory) nicely marked-up XML. The post-processor does his/her normal magic, combining all the pages, running validators on it, etc.

As far as the OCR process goes, we currently run some pre-processors on the text to fix common scannos, etc. I'd be surprised if those pre-processors didn't improve/change as the XML world emerges at DP.

Josh
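As a rough illustration of the kind of pre-processing mentioned above, here is a minimal Python sketch of a scanno fixer (a scanno being an OCR misreading). The word table, the function name `fix_scannos`, and the whole approach are assumptions made for illustration; the thread does not describe how DP's actual pre-processors work.

```python
import re

# Hypothetical scanno table: OCR misreadings mapped to the intended word.
# These entries are illustrative only, not DP's actual list.
SCANNOS = {
    "tbe": "the",
    "tlie": "the",
    "aud": "and",
    "wbich": "which",
}

# Match any listed scanno as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SCANNOS)) + r")\b",
    re.IGNORECASE,
)


def fix_scannos(text: str) -> str:
    """Replace known scannos, keeping a leading capital where there was one."""

    def repl(match: re.Match) -> str:
        word = match.group(0)
        fixed = SCANNOS[word.lower()]
        return fixed.capitalize() if word[0].isupper() else fixed

    return _PATTERN.sub(repl, text)


if __name__ == "__main__":
    print(fix_scannos("Tbe cat aud tbe dog, wbich slept."))
```

Only strings that are not real words are listed here. A misreading that is itself a valid word cannot be fixed safely by blind substitution, which is one reason human proofing rounds still follow the pre-processing.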

At DP, most of us use ABBYY Finereader (versions ranging from 5.0 to 7.0) to do the OCR work. It does not currently have an option to save the result as XML, though I suppose they might well implement something like that eventually. Also, for proofreading purposes, it is much easier to work with material that does not yet have all the XML tags, etc.

We have always planned to have formatting rounds, and, in fact, they are currently in active development and I hope they will be in place by the end of the year. I expect that the nature of the formatting rounds will change with time. My hope, however, is that in most cases even the people working in the formatting rounds will not have to see all the verbosity that goes with XML and that can be represented unambiguously in a more reader-friendly way. Paragraph markers are an example that springs easily to mind. They can be added automatically later, and it would be a serious waste of volunteer time to have to type those in in place of the blank line that we currently use. That case is trivially obvious, but most others may not be.

One of the things that we will have to work out through experience is exactly what kinds of markup happen at which stage of the process. When one really gets into the details, there are a staggering number of them. A formatting/markup issue that we've struggled with recently is teaching people how to know when to include a period inside italics and when not to, and then getting them to do it correctly. Yes, I know this doesn't matter for HTML, and won't matter for XML, but it does matter for plain text versions that use underscores to mark italics. I mention it only as an example of the tiddly, little, significant details that must be worked out in a day-to-day production environment.

But back to my point. I expect that we will end up with a combination of automatic tools and manual intervention. Exactly what will happen where in the process remains to be determined.
We'll try something, which inevitably won't be the right thing, and we will proceed with incremental changes until we end up with a system that works reasonably well. Ideally the output of that system will be an XML master file that can then be used to generate versions in whatever form any individual user requests.

And answering another request from another message: I have LOTS of scanned title pages. Where would you like them?

JulietS
DP Site Admin

----- Original Message -----
From: "Joshua Hutchinson" <joshua@hutchinson.net>
To: "Project Gutenberg Volunteer Discussion" <gutvol-d@lists.pglaf.org>
Sent: Friday, October 22, 2004 8:54 AM
Subject: Re: [gutvol-d] Re: barriers to XML posting

[snip]

_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d
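Juliet's point that paragraph markers can be added automatically from blank lines, and that plain text marks italics with underscores, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `<p>` and `<i>` element names and the function name `text_to_xml` are hypothetical, since the thread does not name a schema or any DP tool.

```python
import re
from xml.sax.saxutils import escape


def text_to_xml(text: str) -> str:
    """Turn blank-line-separated plain text into <p> elements.

    Assumptions (not from the thread): paragraphs become <p>,
    and _underscored_ runs become <i>.
    """
    paragraphs = []
    # One or more blank lines separate paragraphs.
    for block in re.split(r"\n\s*\n", text.strip()):
        if not block:
            continue
        # Join wrapped lines, then escape &, < and > for XML.
        body = escape(" ".join(block.split()))
        # Convert _italic_ runs; whatever sits inside the underscores,
        # including a trailing period, ends up inside the element.
        body = re.sub(r"_([^_]+)_", r"<i>\1</i>", body)
        paragraphs.append(f"<p>{body}</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = "He read _Emma._\n\nShe did not,\nat least not yet."
    print(text_to_xml(sample))
```

The conversion keeps the period wherever the proofer put it relative to the underscores, so the placement decision made in the plain text carries through to the markup unchanged.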

Juliet Sutherland wrote:
And answering another request from another message: I have LOTS of scanned title pages. Where would you like them?
Anywhere I can get them. Or, if you prefer, zip and mail them to me. I can put them up at PG.

--
Marcello Perathoner
webmaster@gutenberg.org

And answering another request from another message: I have LOTS of scanned title pages. Where would you like them?
If there is diskspace + bandwidth to host and serve them, I think it would be useful to post every title page somewhere. Or, at least to start with the top 100 or 1000 or some reasonable subset.

For example, I recently did a massive review of the "catalog" data (posted to GUTCAT, alas with minimal response). With so many inconsistencies between various sources, I would like to be able to reference the original.

--
Scott
Practical Software Innovation (tm), http://ProductArchitect.com/

On Fri, Oct 22, 2004 at 12:00:34PM -0400, Scott Lawton wrote:
And answering another request from another message: I have LOTS of scanned title pages. Where would you like them?
If there is diskspace + bandwidth to host and serve them, I think it would be useful to post every title page somewhere. Or, at least to start with the top 100 or 1000 or some reasonable subset.
For example, I recently did a massive review of the "catalog" data (posted to GUTCAT, alas with minimal response). With so many inconsistencies between various sources, I would like to be able to reference the original.
I do have all of the title pages & verso pages submitted electronically. This is thousands and thousands of images.

Our new copyright system makes it relatively easy to just find one online (though I have not made this feature available - but it's easy). I think this will be a good method for the future. The older system is not as easy, but I still have the images.

Just email me if you need images for a particular item. If you'd rather, I could package up the older clearances (pre-August '04 or so) and get them to you. It's probably < 2GB total.

N.B., this stuff is not suitable for public redistribution with our eBooks. Many scans are not very high quality. Some are, and it would be fine with me to make them publicly available somewhere. I don't have much opinion about including these with the eBooks themselves - that's something for the producer to decide. Most title & verso pages are pretty boring, though, so probably are not worth including as part of an eBook.

--
Greg

Greg typed:
I do have all of the title pages & verso pages submitted electronically. This is thousands and thousands of images.
Just email me if you need images for a particular item. If you'd rather, I could package up the older clearances (pre-August '04 or so) and get them to you. It's probably < 2GB total.
Thanks much for the offer. I've moved beyond cataloging onto the main part of my project. If I ever revisit the entire catalog, it would be great to have a DVD of all the title+verso pages (if they don't get posted somewhere in the meantime). Meanwhile, I may indeed make some one-off requests.
N.B., this stuff is not suitable for public redistribution with our eBooks. Many scans are not very high quality. Some are, and it would be fine with me to make them publicly available somewhere. I don't have much opinion about including these with the eBooks themselves - that's something for the producer to decide. Most title & verso pages are pretty boring, though, so probably are not worth including as part of an eBook.
I completely agree that there's no reason to include these in the default .zip file for the HTML or any other edition/format. I just think it's important to make them available somewhere people can find them, e.g. for librarians, scholars or other catalogers.

--
Scott
Practical Software Innovation (tm), http://ProductArchitect.com/
participants (5)
- Greg Newby
- Joshua Hutchinson
- Juliet Sutherland
- Marcello Perathoner
- Scott Lawton