a review of some digitization tools -- 007

i hope you all had a nice thanksgiving weekend... :+)

now let's ease back into our series on digitization tools.

***

our overall objective here is designing a workflow that can be used in a _collaborative_system_ of digitization, which does not preclude digitization by individuals too.

offline programs are fine -- indeed _preferred_ -- with the proviso that they have online equivalents, and these online versions can work in batch mode, since the emphasis is on the workflow of a _system_.

it is worth restating, however, that neither p.g. nor internet archive is open for improvements, meaning we're unfettered by any pesky legacy considerations. we are free to start anew...

(which is a good thing, since both of those systems now have workflows which are significantly flawed.)

***

you will recall our three overarching requirements for the process of digitizing an existing paper-book:

1. obtain the text (via scanners, and o.c.r. software)
2. clean the text (the first set of tasks for our tools)
3. turn the text into e-books (the second set of tasks)

the first requirement is largely taken care of, thanks to the big scanning efforts of google and internet archive.

under the second of those requirements, we covered:

2a. do a spellcheck
2b. fix spacey punctuation
2c. restore styling, e.g., italics

all of these tasks can be managed with a text-editor, whether offline or online. nothing here is "difficult"...

so we're working on the tasks for the third requirement:

3a. tagging the structural aspects of the text
3b. converting the text into .html (intermediate and final)
3c. converting intermediate .html into e-book output-files

our biggest task here is 3a, tagging the structural aspects...

and now, to just tell you where i am going with all of this. i'll inform you that i already programmed such a system... indeed, i've previously showed you the various parts of it.

one part was the "editing/proofing" system, a demo that i did using the "sitka" book that rfrank was also digitizing, using his system. this part takes care of requirement #2. yes, i can put that back up online if anyone wants to see it.

the other part was the online demo for creating e-books from a z.m.l. text-file, the one jana "helped" with... ;+) this web-app performed the tasks under requirement #3. and yes, i can put that back up, if anyone wants to see it...

***

again, the "human" need here occurs largely on task 3a, the tagging task. once that's done, scripts can take over.

in order to know what "tags" we will be needing to apply, we are examining some books that have been digitized... just for fun, we're using books that jim adcock digitized.

so far, we've found that quite a few of the books we want to digitize require only a barest minimal number of tags. most books consist largely of paragraphs within chapters. (this really shouldn't be a major surprise to anyone here.)

of the books from jim that we've analyzed so far, we find he's used tags to label the following structural features:

a. headers -- [h1]-[h6]
b. paragraphs and breaks -- [p] and [br]
c. styling -- [i] and [b]
d. blocks -- [pre] and [blockquote]
e. horizontal rules -- [hr]
f. images -- [img]
g. links -- [a name] and [a href] mostly for the t.o.c.

we're gonna look at a few more books from jim, where we will find need for tags for a few more structural elements, but what we will learn is that this job is pretty darn simple.

after that, we're going to see how to generate our output, in the form of e-book files (such as .pdf, .epub, and .mobi), from a simple plain-ascii input-text z.m.l. "master" source.

as i said, i already did a demo of an _online_ app to do this, so the new twist for this series is that i am now releasing an _offline_ program to do the same thing, with arbitrary text. (the earlier online demo was for a given text -- "the jungle", by upton sinclair, the influential classic from the early 1900s.)

yep, i am turning up the heat on all you lobsters in my pot...

-bowerbird
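
The kind of tag survey described in the post above (looking at an already-digitized book and listing which tags it uses) can be sketched in a few lines. This is only an illustration, not the tool used for the analysis in this thread: the `TagTally` class, the `tally_tags` function, and the sample markup are all made up for the example, and it relies only on Python's standard `html.parser` module.

```python
from collections import Counter
from html.parser import HTMLParser


class TagTally(HTMLParser):
    """Count every opening tag seen in an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook.
        self.counts[tag] += 1


def tally_tags(html_text):
    """Return a Counter mapping tag name -> number of occurrences."""
    parser = TagTally()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Invented sample, standing in for the .html file of a digitized book.
sample = """
<h1>a sample book</h1>
<h2>chapter 1</h2>
<p>first paragraph, with <i>italics</i>.</p>
<p>second paragraph.</p>
<hr>
<h2>chapter 2</h2>
<p>third paragraph.</p>
"""

for tag, count in tally_tags(sample).most_common():
    print(tag, count)
```

Run over a real book's .html file, a tally like this gives a quick picture of how small (or large) the set of structural tags actually is for that book.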

we're gonna look at a few more books from jim, where we will find need for tags for a few more structural elements, but what we will learn is that this job is pretty darn simple.
Just to state the obvious, this "analysis" by BB is BS, since I choose books that don't have really hard parts re formatting issues, because I don't work on books I don't know how to do, or that I know how to do but that look like a lot of work for the results. So, for example, I don't do books that require "typesetting" mathematical formulas. I don't do books that require lots of font changes. And I don't do books that have indexes. For that matter, I don't do page numbers, because they can easily be added back in by anyone who really cares, they do not fit in well with my work flow (which is very different from that which BB imagines), and because the systems I've seen so far for page numbers do not work on one or more platforms. So, I don't do these things -- but other people making PG books do.

So, what BB imagines as a markup set *does* seem to work pretty well for large parts of many late 1800s, early 1900s novels -- but that is not everything PG book transcribers want to do. Every book I've worked on has some parts which don't fit these ideas of simple markup sets. (What BB is contemplating IS a "Markup Set"; it's just that it's a contextually implied markup set -- a big ouch IMHO for anyone who has tried such -- it's much easier just to tell a computer what you want it to do in the first place.)

In reply to Jim Adcock's message of 01.12.2011, 01:44.

I would not call BB's idea or concepts BS. They have their merit, as you yourself say.

I agree doing references is a big pain. The problem is that there is no sure way to get it right. Some can be done semi-automatically. Yet most will have to be corrected.

1) General page references will be hard, and will only work if you have marks from the original. Yet for an e-book they have to be adjusted so that they refer to the part of the text in question, because an e-book has, so to speak, dynamic pages.
2) Bibliographic references can be done if you tag the bibliography. It is possible to have this done automatically.
3) Indexes can be done if you have original page marks. A simple script can do this, but there will be some false positives; that is, we still have to mark the exact word or paragraph.

You mention that you do not do mathematical formulas; well, others do.

BB is trying to develop a minimal mark-up set. It is designed for consistency in a work flow. Naturally, there will always be books that need special consideration and an extra feature set.

regards
Keith.
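
Point 3 above (indexes from original page marks) might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the `[pg N]` page-mark format, the `index_targets` function, and the sample text are all hypothetical. For each index entry it lists the paragraphs on the cited page that mention the term; more than one hit, or none, is exactly the case where a person still has to mark the right spot.

```python
import re

# Hypothetical page-mark format: a block of its own reading "[pg 12]".
PAGE_MARK = re.compile(r"\[pg (\d+)\]")


def index_targets(text, entries):
    """For each (term, page) entry, list the paragraph numbers on that page
    whose text contains the term (case-insensitive substring match)."""
    page = None
    on_page = {}  # page number -> list of (paragraph number, paragraph text)
    number = 0
    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        mark = PAGE_MARK.fullmatch(block)
        if mark:
            page = int(mark.group(1))
            continue
        number += 1
        on_page.setdefault(page, []).append((number, block))

    result = {}
    for term, pg in entries:
        hits = [n for n, para in on_page.get(pg, []) if term.lower() in para.lower()]
        result[(term, pg)] = hits
    return result


# Invented sample text and index entries.
text = """[pg 1]

The whale surfaced near the ship.

The captain watched the whale dive.

[pg 2]

A storm came up from the south."""

entries = [("whale", 1), ("storm", 2), ("harpoon", 2)]

for entry, hits in index_targets(text, entries).items():
    status = "ok" if len(hits) == 1 else "needs review"
    print(entry, hits, status)
```

A plain substring match is deliberately crude here; it is the source of the false positives mentioned above, and a real script would need something smarter or a manual pass.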
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

BB is trying to develop a minimal mark-up set.
Last time I checked, BB was still trying to sell his idea that txt70 contains "all we need" to do automated markup. He sells this idea by pruning away everything which doesn't fit his idea. If BB or others are now acknowledging that this txt70 idea is a bad idea, then I would agree that a standard set of markup for PG would be a "good idea."

The next area of disagreement would presumably be how one should mark up documents. The rest of the world, after many decades of debate, has converged on the idea that explicit opening and closing tags are the way to go, "convenience" in markup being discarded in exchange for techniques that are robust and which can be well-automated. Which then brings us to some flavor of an XML markup scheme.

What remains then is reaching agreement on what the "useful" set of tags should be. I suggest one could start by looking at the DP set of tags.
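
One way to see the robustness argument for explicit opening and closing tags: a standard parser can mechanically reject a document in which a tag was left unclosed. The sketch below uses Python's standard `xml.etree.ElementTree`; the `is_well_formed` helper and the tag names `chapter` and `title` are invented for the example and are not taken from the DP tag set or any other scheme discussed here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


# Every opening tag has a matching closing tag.
good = "<chapter><title>Chapter 1</title><p>Some <i>styled</i> text.</p></chapter>"

# The <i> element is never closed, so the parser reports an error.
bad = "<chapter><title>Chapter 1</title><p>Some <i>styled text.</p></chapter>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

This only checks well-formedness, not whether the tags are the right ones for the text; agreeing on the tag set is the separate question raised above.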
participants (3)
- Bowerbird@aol.com
- Jim Adcock
- Keith J. Schultz