
Please picture this scenario: I'm a volunteer who has scanned a public-domain book and wants to make it available through the PG distribution mechanism (free of charge, available until the Internet collapses under the weight of spam and next-generation pornography, yadda, yadda, yadda).

Today, if I can convert this book to plain text (according to some stated formatting conventions), I may submit the book. If I'm ambitious, I can create an HTML version, which presents the same information but allows "real" formatting rather than _italic_ and *bold*.

In the background, however, there is this Whole New World(tm) of semantic tagging, which presumably will allow the book to make snacks and provide entertainment during the reading process. But for me, as a volunteer who spends a considerable amount of time working on books but enjoys actually finishing one and seeing it posted, I can't get my arms around the benefits.

Except for recognizing the acronyms, I am agnostic to XML/ZML/TEI/ABC/EIEIO.

Could someone please explain the benefit of semantic tagging, and why it won't horribly lengthen the amount of time required to produce an eBook?

Thank you.

John Hagerson wrote:
...
Could someone please explain the benefit of semantic tagging and why it won't horribly lengthen the amount of time required to produce an eBook?
Well, I'll try. First, let me say that for many works, for the purpose of *reading* the work, it doesn't matter. (I'll probably be flamed for that, but never mind.) Your simple, basic novel, in which there are a great many paragraphs of text divided into chapters with obvious headings like "CHAPTER II", doesn't really need much more than the very basic, simple HTML P tag.

However, not all works are so simple. Yesterday I had cause to look at Immanuel Kant's /The Science of Right/, in which the author chose to use a great many divisions, subdivisions, sections, etc. -- all with their own headers. Since I converted this from plain text to HTML, I needed to determine from the plain text which were headings, subheadings, sub-sub-headings, etc. -- and unfortunately, this required some guesswork on my part. So one benefit of more detailed tagging would be that, for such a work, it would be obvious and explicit which were headings and which sub-headings. In other words, the structure intended by Kant is recorded in the tagging.

Another example: look at any play. You have speech, names of speakers, stage directions, headings, and divisions into Act and Scene. All of these are made explicit by the tagging. Without tagging, there may well be confusion at some point as to what is speech and what is stage direction, for example.

In a plain text file, we do make some effort to distinguish different elements of a work: quotations are indented, headings in UPPER CASE and centered, etc. But any kind of complexity in the work tends quickly to make that unworkable.

Regards,
Steve

--
Stephen Thomas, Senior Systems Analyst, Adelaide University Library
ADELAIDE UNIVERSITY SA 5005 AUSTRALIA
Tel: +61 8 8303 5190  Fax: +61 8 8303 4369
Email: stephen.thomas@adelaide.edu.au
URL: http://staff.library.adelaide.edu.au/~sthomas/
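[To make the play example concrete, here is roughly what explicit tagging looks like. The element names below are TEI-style (sp, speaker, stage, l), but the specific vocabulary matters less than the fact that speech, speaker, and stage direction can no longer be confused with one another:]

```xml
<div type="scene">
  <head>ACT I. SCENE II.</head>
  <sp>
    <speaker>Hamlet</speaker>
    <stage>(aside)</stage>
    <l>A little more than kin, and less than kind.</l>
  </sp>
</div>
```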

Steve makes a good answer in another post, but I wanted to add my personal holy grail that hopefully a TEI-Lite master format will help bring about: a single master document.

Right now, I create an ASCII version and then an HTML version. If I make the ASCII version first, it almost never fails that I find at least one more mistake when I then do the HTML version. I fix it there, but I have to remember it and go back to the ASCII version and make the fix there too. And god forbid the fix requires another rewrap.

A master document format that is auto-converted to the others (at an acceptable level) would be wonderful and, imo, worth a little extra up-front effort to prepare it. If someone could get a working bit of code in place, I'd be happy to start testing it like crazy and work on old texts to get them converted to that format.

Josh
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d

I'll take your questions in reverse order.
why [semantic tagging] won't horribly lengthen the amount of time required to produce an eBook?
I think a two-part answer is important here.

1. The great news is that basic semantic tagging is roughly the same effort as HTML. And, if PG had acceptable MASTER-to-text conversion, the overall effort would be REDUCED compared to creating BOTH text and HTML by hand. Today, creating an eText involves throwing information away, e.g. converting what is clearly multiple levels of heading into ALL CAPS -- which loses any distinction between the levels. The key to creating a MASTER is to preserve this information. Sometimes this will require a tiny bit more time (to use the correct tag or add the appropriate attribute), but often it will take less time than manually converting to ALL CAPS or whatever.

And, as I've argued elsewhere, there's no need to wait for widespread agreement on any particular set of XML tags. If used consistently, it's much, much easier to convert from one XML representation to another than to convert from text to HTML. In fact, it's also fine to skip XML and just use consistent HTML with appropriate div/span tags and/or attributes on regular HTML tags. What's important is to stop throwing useful information away and instead to capture it in a way that can be processed automatically.

Takeaway point: reliable MASTER-to-text conversion would increase the number of eTexts produced per unit of volunteer time invested. (And, as DP folks have argued, additional automation would streamline other stages too.)

2. There's a second level of semantic tagging that *does* require more effort: adding information that's useful but isn't represented in print. For example, perhaps we want to label every quotation with the name of the speaker. That's easy in a play, since the name is printed. It's quite a lot of work in prose, since the name may or may not occur adjacent to the quote; even when it does, it could be before or after, and it may be represented several ways (e.g. "Arthur", "The King", "His Majesty").
I'm actually a fan of rich semantic markup, but, to be honest, the benefits of this second level are much smaller and the effort much greater. In the foreseeable future, this is likely only to be done when the volunteer has a specific end use in mind.
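[Going back to the first point, a MASTER-to-text conversion can be sketched in a few lines. The tag names below (div1, div2, head) are purely illustrative assumptions, not any agreed PG vocabulary; the point is that the master keeps every heading level explicit, and the styling (ALL CAPS or otherwise) becomes a decision of the converter rather than a loss baked into the text:]

```python
import xml.etree.ElementTree as ET

# A hypothetical master fragment: heading levels are explicit in the
# tags, so nothing is lost the way it is when every heading collapses
# into ALL CAPS in a hand-made plain-text version.
MASTER = """<book>
  <div1><head>The Science of Right</head>
    <div2><head>Introduction</head>
      <p>First paragraph of the introduction.</p>
    </div2>
  </div1>
</book>"""

def to_plain_text(elem, depth=0, lines=None):
    """Walk the tree, rendering top-level heads in ALL CAPS and
    lower-level heads as-is: the converter chooses the styling,
    while the master keeps the structure."""
    if lines is None:
        lines = []
    for child in elem:
        if child.tag == "head":
            text = (child.text or "").strip()
            lines.append(text.upper() if depth == 1 else text)
            lines.append("")
        elif child.tag == "p":
            lines.append((child.text or "").strip())
            lines.append("")
        else:  # a division: recurse one level deeper
            to_plain_text(child, depth + 1, lines)
    return lines

print("\n".join(to_plain_text(ET.fromstring(MASTER))))
```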
Could someone please explain the benefit of semantic tagging
Others have addressed this, but I want to summarize and add a few points.

1. A single MASTER copy from which all other versions can be generated automatically: plain text and HTML of course, but also PDF and the various eBook formats. Just as important, more than one rendition of any particular format can be created, e.g. a set of HTML files split by chapter or even page, or PDF formatted for a particular screen size, paper size, or printing layout (e.g. as a booklet).

2. Capture information that's beyond what is generally printed, but is useful to certain audiences and/or in certain contexts -- e.g. (from an earlier thread) the MASTER can capture a mistake AND the correction, or other variations. See "Re[2]: [gutvol-d] Indexing Editors, etc." from Oct. 4, 2004 for details.

3. Automated processes that "add value" in some way, e.g. using a different computer voice for different characters, or creating an index by character.

--
Scott
Practical Software Innovation (tm), http://ProductArchitect.com/
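[The mistake-and-correction idea can be sketched concretely. The markup below borrows TEI's choice/sic/corr convention purely as an illustration -- the master records both the printed reading and the editorial correction, and each output rendition picks one:]

```python
import xml.etree.ElementTree as ET

# Hypothetical master paragraph: both the printed mistake and the
# correction are captured, so either rendition can be generated
# on demand instead of being a one-time editorial choice.
MASTER = ('<p>He took the train to <choice><sic>Pittsburg</sic>'
          '<corr>Pittsburgh</corr></choice> that night.</p>')

def render(xml_source, prefer="corr"):
    """Flatten the paragraph, choosing either the original reading
    ("sic") or the editorial correction ("corr")."""
    root = ET.fromstring(xml_source)
    parts = [root.text or ""]
    for choice in root:
        picked = choice.find(prefer)
        parts.append((picked.text or "") if picked is not None else "")
        parts.append(choice.tail or "")
    return "".join(parts)

print(render(MASTER))          # corrected rendition
print(render(MASTER, "sic"))   # as-printed rendition
```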
participants (4)
- John Hagerson
- Joshua Hutchinson
- Scott Lawton
- Steve Thomas