
as i watch all these p.g. e-texts float across my screen, i just can't help but have some thoughts recur to me... in the old days, when -- for some very good reasons -- a p.g. e-text was considered to be an _amalgamation_ of different versions of a book (even when it really was not, a fiction advised by p.g. legal counsel at that early time), that gave a good reason to remove end-line hyphenation and reflow text (without hyphenation) to p.g. margination. after all, hyphenation mostly causes problems in e-books. in the current era, however, where most p.g. e-texts are pegged to a specific version of a book (and where, for the most part, the scans are now retained to cement this direct correspondence), it no longer makes sense to discard the line-breaks, or even the end-line hyphenation, to be frank. yes, end-line hyphenation should be _marked_ in some way, so it can be automatically eliminated, but the _default_action_ should be to retain it. it would defeat the purpose of saving the line-breaks if you didn't also retain end-line hyphenation, because the goal here would be to duplicate the print version. (don't bother arguing that there would never be such a desire; maybe you'd never have any need for it, but _someone_ might. i can think of half-a-dozen such reasons -- want to hear them?) if you want to see what the future of electronic-books looks like, see the "digital reprints" that jose menendez has been producing.
http://www.ibiblio.org/ebooks/Mabie/
http://www.ibiblio.org/ebooks/Cather/
http://www.ibiblio.org/ebooks/Einstein/
the deep links to the actual .pdf "digital reprints" are these:
http://www.ibiblio.org/ebooks/Mabie/Books_Culture.pdf
http://www.ibiblio.org/ebooks/Cather/Antonia/Antonia.pdf
http://www.ibiblio.org/ebooks/Einstein/Einstein_Relativity.pdf
aside from the unfortunate fact that jose is using the .pdf format (a format which makes it far too difficult to repurpose the content), these "digital reprints" carve out an awesome model for e-books. they replicate the original paper-book to a high degree of fidelity, and do so using a small percentage of the disk-space of the scans. yet because it is an e-book, it gives all the benefits that they give. (at least it _would_, if it wasn't a .pdf. but that part can be fixed.)

and the secret of these "digital reprints" is extremely simple, folks; all that jose has done is merely to retain the original line-breaks...

so, once again, i recommend and request that you start retaining this valuable information, instead of intentionally tossing it away. (it is very ironic, to me, that distributed proofreaders _retains_ the line-breaks during their proofing -- because it makes that process so much easier -- but then they discard the line-breaks! hey, there might be some end-users out there who need 'em too!)

honestly, folks, when i look at your p.g. e-texts, what i see is that they're gonna be thrown on the trashpile one day -- maybe soon. in a world that is awash in scans, and where o.c.r. is a commodity, it'll be trivial to convert those scans to text. so if someone needs to have the ability to duplicate the print version -- i.e., they _need_ to have the line-break information you are routinely discarding -- they'll simply o.c.r. the scans again. they will be required to do that, because your e-texts simply won't do the job that they want done...

that's not to say that your p.g. e-texts will be _completely_ worthless. as an independent digitization, they'll go a long way toward helping to move any new o.c.r. effort up to an absurdly high level of accuracy. but since the absurd level of accuracy can be applied to either e-text, and since the new effort will have retained the line-break information, that will be the one that's retained. the p.g. e-text will be thrown away.

and it would break my heart to see all your hard work just thrown away.

on the other hand, if y'all started retaining that line-break information, then it'd be _your_ version which would be kept (because of its primacy), and the new o.c.r. effort would just be seen as a tool to increase accuracy.

if project gutenberg wants to remain as the premier library in cyberspace, you're going to have to fix this glitch, and do it quickly. mark my words...

-bowerbird

p.s. as some of you aren't good at reading between the lines, i'll tell you that i intend to mount such a massive o.c.r. effort, so the question about which version, p.g. or not, receives the higher accuracy is a very real one. i don't want to challenge the p.g. library, _unless_ you've made it deficient. i'm trying to help you by giving you this advice before it becomes crucial...

Bowerbird's lengthy essay is just one more example of how publishers, editors, etc., put their own needs ahead of those of the readers.

While there might be some value in keeping references to arcane modes of pagination and margination for those who actually have reasons for opening books other than simply to read their contents, a certain respect for the reader, ostensibly for whom all is being done by the publishers and editors, should clearly indicate that there is no longer any need for a slavish mentality to conserve the paper pages by introducing end of line hyphenation, or to create some appearance that there were actually the same number of characters on every line, when it is obvious to anyone who cares to look that there are not.

And, as Mr. Bowerbird points out, end of line hyphenation can be a serious pain in the neck, depending on what programs you use to read, search, edit, etc.

So, while I obviously agree that there are two camps in his model of a world of eBooks, I disagree as to which is primary. The reader is primary.

Any effort to preserve items of interest only to publishers, editors, etc., should be invisible to the naked eye, with the option to bring them into view when desired. The defaults should not force the millions of readers to go through a process to eliminate them for the sake of the few who will prefer to have them visible.

Thanks!!! Give the world eBooks in 2006!!!

Michael S. Hart
Founder
Project Gutenberg
Blog at http://hart.pglaf.org

Although I agree with Michael that there is no need to preserve things such as linebreaks in most texts -- if you really need to go to that level of detail, there is always the original or the scans to fall back upon -- I want to make a case for preserving page numbers, at least as recognisable anchors in the text, and only for those books regularly referenced by other books. This excludes most fiction, but is particularly important for scientific works, which have constructed a kind of paper web with cross references mainly based on page numbers. In the long term, such references of course should give way to proper references to the actual paragraph or sentence being referenced, but as a practical ad-interim solution, staying with page numbers will increase the number of texts we can digitize with our limited means.

This leads me to one place where further work could be done on the PG collection: turning it from a collection of static texts into an enriched web of knowledge. I've seen a lot of websites grabbing all of PG, and republishing it in a slightly modified form. I would, however, like to see the collection incorporated in a kind of wiki-like system, where people can add annotations, add tagging and create live cross references -- without tampering with the static source texts -- whether for their own use, for dissemination in a smaller group, or publicly.

I've added a large number of texts related to the Philippines to PG, and many of these texts interact. Some criticize each other, others provide opposing views, and so forth. It would be great to build a system that makes that easy to follow for everybody, such that people can immediately see, when reading a text, where it has been cited or referenced in other works. It would be great also to provide study introductions or synopses, to give users a grasp of the material, and enable them to find what they really need within reasonable time. Search engines are a great tool, but only to a certain extent.

Jeroen.

There are places such as wikisource.org, where you could add the texts and start providing links such as you mention here immediately. Andrew On Sat, 24 Jun 2006, Jeroen Hellingman (Mailing List Account) wrote:
This leads me to one place where further work could be done on the PG collection: turning it from a collection of static texts into an enriched web of knowledge. I've seen a lot of websites grabbing all of PG, and republishing it in a slightly modified form. I would, however, like to see the collection incorporated in a kind of wiki-like system, where people can add annotations, add tagging and create live cross references -- without tampering with the static source texts -- whether for their own use, for dissemination in a smaller group, or publicly.
I've added a large number of texts related to the Philippines to PG, and many of these texts interact. Some criticize each other, others provide opposing views, and so forth. It would be great to build a system that makes that easy to follow for everybody, such that people can immediately see, when reading a text, where it has been cited or referenced in other works. It would be great also to provide study introductions or synopses, to give users a grasp of the material, and enable them to find what they really need within reasonable time. Search engines are a great tool, but only to a certain extent.
Jeroen.
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d

[cc: Jose Menendez] Jeroen Hellingman wrote:
Although I agree with Michael that there is no need to preserve things as linebreaks in most texts -- if you really need to go to that level of detail, there is always the original or the scans to fall back upon -- I want to make a case for preserving page numbers, if not at least as recognisable anchors in text, and only for those books being referenced to regularly by other books.
First off, I agree with Bowerbird in the sense that it is a good thing to preserve both the line breaks and page breaks in the master marked-up texts converted from a source book. I assume with the DP work flow that this would not be that difficult of a thing to do, so why not do it if it could be done (mostly) automatically?

For the OpenReader Publication Format, which is in an advanced stage of development, we're now putting together an OpenReader namespace set of elements to do various tasks. These elements may be used for all XML content documents which OpenReader now supports (an XHTML subset) and plans to support in the future (such as a subset of TEI). The namespaced elements include (attributes not described here):

<or:hlink> ... </or:hlink> (simple hypertext linking)
<or:object/> (embedding images, video and audio)
<or:page/> (page break in a paper source)
<or:lb/> (line break in a paper source)
<or:marker/> (a generic marker)

(both or:hlink and or:object will be defined using XLink.)

Jose Menendez has given us permission to use his copy of "My Antonia" (which is more accurate than the one I've been working on, which hasn't yet been completely proofed) in a demo of the OpenReader format. I've "diffed" it against my version and checked all differences found by consulting the original page scans. It has been restored to the original 1918 edition (including textual errors -- the errors are specially marked, however, including what the text should be based on both the Univ. of Nebraska online edition and Jose's edition), and I have added precise line breaks and page breaks.

For line breaks, I've placed the line breaks at the precise place of hyphenation. If the broken word does not have a natural hyphen, I use a &shy; (a soft hyphen) to indicate that -- if the broken word does have a natural hyphen at the break, the hard hyphen character "-" is used.
Here's an example paragraph (the 63rd paragraph in the text) which includes a page break, soft and hard hyphens:

****************************************************************************
<p id="p0063">The little girl was pretty, but Án-tonia —<or:lb/>
<or:page id="page026"/>they accented the name thus, strongly, when<or:lb/>
they spoke to her — was still prettier. I re&shy;<or:lb/>membered what the conductor had said about<or:lb/>
her eyes. They were big and warm and full<or:lb/>
of light, like the sun shining on brown pools<or:lb/>
in the wood. Her skin was brown, too, and<or:lb/>
in her cheeks she had a glow of rich, dark<or:lb/>
color. Her brown hair was curly and wild-<or:lb/>looking. The little sister, whom they called<or:lb/>
Yulka (Julka), was fair, and seemed mild and<or:lb/>
obedient. While I stood awkwardly confront&shy;<or:lb/>ing the two girls, Krajiek came up from the<or:lb/>
barn to see what was going on. With him was<or:lb/>
another Shimerda son. Even from a distance<or:lb/>
one could see that there was something strange<or:lb/>
about this boy. As he approached us, he began<or:lb/>
to make uncouth noises, and held up his hands<or:lb/>
to show us his fingers, which were webbed to<or:lb/>
the first knuckle, like a duck’s foot. When he<or:lb/>
saw me draw back, he began to crow delight&shy;<or:lb/>edly, “Hoo, hoo-hoo, hoo-hoo!” like a rooster.<or:lb/>
His mother scowled and said sternly, “Ma&shy;<or:lb/>rek!” then spoke rapidly to Krajiek in Bo&shy;<or:lb/>hemian.</p>
*****************************************************************************

If the above is rendered in plain text preserving the line breaks (ignore the page break), we have:

(since this is an ASCII text email, I've converted the A-acute in "Antonia" to an unaccented A, em-dashes to "--", and curly quotes/apostrophes to the straight varieties.)
*****************************************************************************
The little girl was pretty, but An-tonia --
they accented the name thus, strongly, when
they spoke to her -- was still prettier. I re-
membered what the conductor had said about
her eyes. They were big and warm and full
of light, like the sun shining on brown pools
in the wood. Her skin was brown, too, and
in her cheeks she had a glow of rich, dark
color. Her brown hair was curly and wild-
looking. The little sister, whom they called
Yulka (Julka), was fair, and seemed mild and
obedient. While I stood awkwardly confront-
ing the two girls, Krajiek came up from the
barn to see what was going on. With him was
another Shimerda son. Even from a distance
one could see that there was something strange
about this boy. As he approached us, he began
to make uncouth noises, and held up his hands
to show us his fingers, which were webbed to
the first knuckle, like a duck's foot. When he
saw me draw back, he began to crow delight-
edly, "Hoo, hoo-hoo, hoo-hoo!" like a rooster.
His mother scowled and said sternly, "Ma-
rek!" then spoke rapidly to Krajiek in Bo-
hemian.
*****************************************************************************

Of course, comments welcome on the above!

Jon Noring
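For anyone who wants to try the two renderings discussed above, here is a minimal Python sketch. It is not part of OpenReader. It assumes that the soft hyphen is stored either as the character U+00AD or as the entity `&shy;`, that line breaks are written exactly as `<or:lb/>`, and that a regular expression is good enough for a short sample (a real tool would use an XML parser). The function names `to_print_lines` and `to_reflowed` are made up for this illustration.

```python
import re

SOFT_HYPHEN = "\u00ad"
LINE_BREAK = re.compile(r"<or:lb\s*/>")
ANY_TAG = re.compile(r"<[^>]+>")


def _decode(markup: str) -> str:
    # Treat the entity and the raw character the same way.
    return markup.replace("&shy;", SOFT_HYPHEN)


def to_print_lines(markup: str) -> list[str]:
    """Rebuild the printed lines: one list entry per <or:lb/>-delimited line."""
    lines = []
    for chunk in LINE_BREAK.split(_decode(markup)):
        chunk = ANY_TAG.sub("", chunk)        # drop <p>, <or:page/>, etc.
        chunk = " ".join(chunk.split())       # collapse runs of whitespace
        if chunk.endswith(SOFT_HYPHEN):
            chunk = chunk[:-1] + "-"          # show the end-of-line hyphen
        chunk = chunk.replace(SOFT_HYPHEN, "")
        if chunk:
            lines.append(chunk)
    return lines


def to_reflowed(markup: str) -> str:
    """Rebuild the reading text: no line breaks, no end-of-line hyphens."""
    text = ANY_TAG.sub("", _decode(markup))   # removes <or:lb/> as well
    text = text.replace(SOFT_HYPHEN, "")      # rejoins "re" + "membered"
    return " ".join(text.split())
```

The reflowed form relies on the convention in the sample: whitespace follows `<or:lb/>` only when the break falls between words, so removing the tag and the soft hyphen rejoins broken words, while a hard hyphen such as the one in "wild-looking" is kept.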

Jon Noring <jon@noring.name> writes:
**************************************************************************** <p id="p0063">The little girl was pretty, but Án-tonia —<or:lb/> <or:page id="page026"/>they accented the name thus, strongly, when<or:lb/> they spoke to her — was still prettier. I re<or:lb/>membered what the conductor had said about<or:lb/> her
No result, if you grep for "remember". Consider encoding it as follows: <reg orig="remembered">remembered<or:lb/></reg>

--
http://www.gnu.franken.de/ke/
Key fingerprint = F138 B28F B7ED E0AC 1AB4 AA7F C90A 35C3 E9D0 5D1C
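One way to get a hit for "remember" without changing the markup is to search a normalized copy and map the hit back to the source. The Python sketch below is only an illustration of that idea, under the assumption that tags, `&shy;` entities and raw U+00AD characters are the only things that interrupt a word; `normalize_with_offsets` and `find_in_markup` are hypothetical helper names, not functions of any existing tool.

```python
import re

# Anything a word search should look through: tags and soft hyphens.
SKIP = re.compile(r"<[^>]+>|&shy;|\u00ad")


def normalize_with_offsets(markup: str) -> tuple[str, list[int]]:
    """Return (text, offsets); offsets[i] is the index in markup of text[i]."""
    chars: list[str] = []
    offsets: list[int] = []
    i = 0
    while i < len(markup):
        skipped = SKIP.match(markup, i)
        if skipped:
            i = skipped.end()
            continue
        chars.append(markup[i])
        offsets.append(i)
        i += 1
    return "".join(chars), offsets


def find_in_markup(markup: str, needle: str):
    """Return the (start, end) span in markup covering the first hit, or None."""
    if not needle:
        return None
    text, offsets = normalize_with_offsets(markup)
    pos = text.find(needle)
    if pos < 0:
        return None
    return offsets[pos], offsets[pos + len(needle) - 1] + 1
```

For the fragment `I re&shy;<or:lb/>membered what`, a plain substring search for "remember" fails, while `find_in_markup` returns a span that covers `re&shy;<or:lb/>member` in the source.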

Jon Noring grudgingly admits:
<or:page/> (page break in a paper source) <or:lb/> (line break in a paper source) <or:marker/> (a generic marker)
Why not use <tei:pb> , <tei:lb> and <tei:milestone> ? Insisting on making your own when there are perfectly good elements in TEI is just plain ... sub-optimal.
he began to crow delight<or:lb/>edly,
Sorry to rain on your parade but your (at best) half-baked proposal has the following shortcomings:

1. Non-standard use of &shy;

The soft-hyphen is a "non-printable" character that may be replaced with a "printable" hyphen by processors before output. Your use is to record the place where an existent hyphen has been stripped. You got it backwards. You confuse the very different stages of text feature recording and text output.

2. Throws off grep

An xml-grep could find "delight<tei:lb/>edly" if searching for "delighted", but it surely won't find "delight&shy;<tei:lb/>edly".

3. Redundant text feature documentation

All you are doing here is repeatedly "documenting" that the character used to hyphenate words in this text is the hyphen. You don't have to repeat that statement through all of your text. A single statement to that effect in the TEI header will suffice.

4. Incompatibility with LOTE

Remember that in LOTE you have to deal with cases like the German "ck" and "fff" which got hyphenated this way:

dachdecker dachdek-ker
Schiffahrt Schiff-fahrt

Also remember French and Italian elisions that don't happen at line breaks.

5. Dependence on one edition

All those hard-coded &shy;'s will marry your electronic text to one edition. You have no provision to encode different editions of the very same text like hardcover and paperback (which may very well have different line endings).

Conclusion

My advice is: forget entirely about line breaks. They are random artefacts introduced by the person operating the typesetting machine and indirectly by the person who chose paper size and font. They have no raison d'être once you separate the ebook from the scans, ie. after it left DP. (That this suggestion was by "You Know Who" should have tipped you off immediately.)

But if you belong to that fastidious class of people who can't throw away even the most useless random artefact, I suggest doing it this standard way:

<html:p> ... he began to crow de<tei:lb ed="paperback" />light<tei:lb ed="hardcover" />edly, ... </html:p>

A standard XHTML browser (OpenReader ?) will simply throw away the unknown tags and render the normalized text. A special processor may be used to reconstruct the paper layout of the text.

--
Marcello Perathoner
webmaster@gutenberg.org
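The "special processor" mentioned above could be quite small. The following Python sketch is one possible reading of it, not an existing tool: it assumes line breaks are written exactly as `<tei:lb ed="..." />`, that a break with a letter on both sides means the word was split and needs a printed hyphen, and that no language-specific respelling (such as the German "ck" case) is required. `layout_for` is a name invented for this example.

```python
import re

LB = re.compile(r'<tei:lb\s+ed="([^"]+)"\s*/>')
ANY_TAG = re.compile(r"<[^>]+>")


def layout_for(markup: str, edition: str) -> list[str]:
    """Rebuild the printed lines of one edition; other editions' breaks are ignored."""
    # First pass: keep only the line breaks that belong to the requested edition.
    kept = LB.sub(lambda m: m.group(0) if m.group(1) == edition else "", markup)

    def break_here(m: re.Match) -> str:
        s = m.string
        before = s[m.start() - 1] if m.start() > 0 else ""
        after = s[m.end()] if m.end() < len(s) else ""
        # A break inside a word gets a visible hyphen; a break at a space
        # or after a hard hyphen does not.
        return "-\n" if before.isalpha() and after.isalpha() else "\n"

    text = ANY_TAG.sub("", LB.sub(break_here, kept))
    return [line.strip() for line in text.split("\n")]
```

With the example above, asking for "paperback" gives a line ending in "de-" followed by "lightedly,", asking for "hardcover" gives "delight-" followed by "edly,", and an edition that is not marked up gives the unbroken text.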

Marcello wrote:
Jon Noring grudgingly admits:
<or:page/> (page break in a paper source) <or:lb/> (line break in a paper source) <or:marker/> (a generic marker)
Why not use <tei:pb> , <tei:lb> and <tei:milestone> ? Insisting on making your own when there are perfectly good elements in TEI is just plain ... sub-optimal.
Actually, a very good idea. We've not fixed the "custom" elements yet. I'll have to look at the TEI-defined semantics of the use of the TEI equivalents, but *if* reasonably close to what we need, will likely embrace them. It will add to the list of namespace declarations, but that downside is pretty minor. Thanks.
he began to crow delight<or:lb/>edly,
Sorry to rain on your parade but your (at best) half-baked proposal has following shortcomings:
No, I'm submitting the idea for feedback, and your feedback is valuable.
1. Non-standard use of &shy;
The soft-hyphen is a "non-printable" character that may be replaced with a "printable" hyphen by processors before output.
Your use is to record the place where an existent hyphen has been stripped.
Yes.
You got it backwards. You confuse the very different stages of text feature recording and text output.
Actually, I've been debating whether or not to include the &shy; as it is used.
2. Throws off grep
An xml-grep could find "delight<tei:lb/>edly" if searching for "delighted", but it surely won't find "delight&shy;<tei:lb/>edly".
Well, with existing toolbases, this might be. I believe, however, that Unicode itself implies that text processors should ignore &shy; (U+00AD). One reference is: http://www.unicode.org/unicode/reports/tr14/#SoftHyphen In addition HTML discusses the use of the soft hyphen: http://www.w3.org/TR/html401/struct/text.html#hyphenation In summary, user agents, such as those doing word searching, should ignore the soft hyphen character. That some don't is a real-world issue that unfortunately has to be pragmatically considered.
3. Redundant text feature documentation
All you are doing here is repeatedly "documenting" that the character used to hyphenate words in this text is the hyphen. You don't have to repeat that statement through all of your text. A single statement to that effect in the TEI header will suffice.
Two points (based on what I interpret you are saying): 1) We are not focusing on TEI documents, thus many XML documents will not have a TEI header. 2) The Unicode annex statement on the use of the soft hyphen (see above link) takes into account other characters used for word breaking purposes. It does not imply a "hard hyphen", but some character used for linebreaking depending upon the text's language and country code (required for all OpenReader Content Documents)
4. Incompatibility with LOTE
Remember that in LOTE you have to deal with cases like the German "ck" and "fff" which got hyphenated this way:
dachdecker dachdek-ker
Schiffahrt Schiff-fahrt
Also remember French and Italian elisions that don't happen at line breaks.
Good points. I'll have to check the Unicode annex document (URL above) to see what it talks about regarding this.
5. Dependence on one edition
All those hard-coded &shy;'s will marry your electronic text to one edition. You have no provision to encode different editions of the very same text like hardcover and paperback (which may very well have different line endings).
Yes, this is an issue. I do plan to allow addition of an attribute to both the page break and line break pointing (via Binder identifier) to the source work. So the markup may contain multiple source works. Things get messy if in two works the same word is broken, but in different places. But I think my system will work for this. Example of identifier attribute (still using OR namespace): <or:page bid="book2" .../> <or:lb bid="book2"/> In the Binder document, in the "descriptions" section (now being amended), we might have: <markdesc id="book2">Second Edition Issued in 1922</markdesc>
My advice is: forget entirely about line breaks. They are random artefacts introduced by the person operating the typesetting machine and indirectly by the person who chose paper size and font. They have no raison d'être once you separate the ebook from the scans, ie. after it left DP. (That this suggestion was by "You Know Who" should have tipped you off immediately.)
Disagreed. There may be a need, for example, to continue proofing work in the future. Knowing where line breaks occurred makes it easier with DP and similar processes. It also better correlates to the "bounding box information" from OCR which is being preserved. And *someone* may want to know this for formatting purposes. It is information about the source which by and large is easy for user-agents to ignore. Regarding you-know-who, I think you know that I often have profound disagreements with him, but when I agree with him, I agree. I don't let personal issues get in the way of acknowledging when I think he is right. Those who believe in objectivity evaluate what a person says.
But if you belong to that fastidious class of people who can't throw away even the most useless random artefact, I suggest doing it this standard way:
<html:p> ... he began to crow de<tei:lb ed="paperback" />light<tei:lb ed="hardcover" />edly, ... </html:p>
A standard XHTML browser (OpenReader ?) will simply throw away the unknown tags and render the normalized text. A special processor may be used to reconstruct the paper layout of the text.
Well, the real issue is dealing with the "fff", etc. issue of LOTE. I'll have to reread the Unicode annex. In OpenReader we reference that spec, and recommend user agents follow its guidelines. But it might not cover the particular LOTE "exceptions" you brought up. Thanks for your frank feedback. Definitely needed. Jon Noring

Marcello Perathoner <marcello@perathoner.de> writes:
My advice is: forget entirely about line breaks. They are random artefacts introduced by the person operating the typesetting machine and indirectly by the person who chose paper size and font. They have no raison d'être once you separate the ebook from the scans, ie. after it left DP. (That this suggestion was by "You Know Who" should have tipped you off immediately.)
I agree.

Before encoding a text you have to decide if you are encoding the expression of the text or the manifestation of the text.[1]

Marking up an expression is about the structure and text of the text. This is what the author has created and has handed over to a publisher.

Marking up a manifestation is all about layout and presentation. This is the realm of the publisher and this is where you get into fonts, line breaks etc.

You can easily mark up a text as either one or the other, but it's not practical to try to do both in the same markup.

There are a few examples of texts and manuscripts which would be worth having an expression level markup and a second manifestation markup, but these will be rare. I seriously doubt that any manifestation of Willa Cather's work would fall into this category :)

Dead tree books fix a manifestation into a permanent arrangement. Electronic manifestations, which use systems like CSS to mold the manifestation to the moment and to the device on the fly, are liquid: if you try to hold them in your hand they just escape through your fingers.

The world of print books puts the publisher and the manifestation at the center. The manifestation is more important than the author, who takes a back seat to the glorious manifestation that was made of the expression of her work.

But when copying and distribution is for all practical purposes free and the manifestation has been reduced to an algorithm which an electronic reader interprets, the manifestation itself takes a back seat to the expression.

The Age of the manifestation and the publisher is drawing to an end and we are slowly seeing the emergence of the Age of the expression and the author.

PG is well named. Gutenberg's press was the first instance of fixing a manifestation so that millions of identical copies could be made. Before Gutenberg, each copy of a text was a different manifestation.

Being able to make error free copies was a revolution, but came at the expense of easily being able to mold manifestations for different uses and environments.

But you can make an exact copy of an electronic text without it depending on any one manifestation of it. This is just as significant as Gutenberg's press.

Is it useful to include some information from some manifestations in an expression level markup? Damn yes -- page breaks are the anchor and hyperlink in the world of paper. Countless millions of references to page numbers have been made over the last two centuries. Preserving page breaks is an essential part of preserving all those references which use them.

So if you want to create a markup of a text which preserves a specific manifestation that's fine, there are whole sections of TEI devoted to allowing you to pick the tiniest bit of navel lint and preserving it for eternity.

But for most purposes page scans of the original manifestation will provide enough of this information for most questions about a text, as well as provide the source material for the lint pickers to encode away to their heart's content for specific manifestations.

But electronic books will mostly be in the business of preserving the expression of a work which can then be converted into other markup languages like XML or OR for dynamically generating flexible, ephemeral manifestations on the fly.

b/

Footnotes:
[1] I am using work, expression and manifestation as defined in the FRBR (Functional Requirements for Bibliographic Records).

work :: the concept representing an intellectual or creative creation.
expression :: includes the specific sequence of words, images and structure of work.
manifestation :: includes the specific layout, typography, pagination etc of a specific expression.

--
Brad Collins <brad@chenla.org>, Banqwao, Thailand
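As a rough illustration of the work / expression / manifestation split, and of keeping page breaks as anchors without tying the text to one layout, here is a small Python sketch. It is only a toy model built on assumptions: the class and field names are invented here, FRBR itself defines no data format, and page starts are stored as character offsets into the expression text purely for simplicity.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Manifestation:
    """One printed edition: layout facts only, no copy of the text."""
    label: str
    # Ascending character offsets into the expression text where pages begin.
    page_starts: list[int] = field(default_factory=list)
    first_page: int = 1

    def page_of(self, offset: int) -> int:
        """Printed page number that contains the given text offset."""
        return self.first_page + bisect_right(self.page_starts, offset) - 1


@dataclass
class Expression:
    """The specific sequence of words; shared by every manifestation of it."""
    text: str
    manifestations: list[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    """The abstract creation, realised through one or more expressions."""
    title: str
    expressions: list[Expression] = field(default_factory=list)
```

In this model the text is stored once, and a citation such as "p. 26 of the 1918 edition" is resolved by looking up that edition's `page_starts`, so several editions with different pagination can point into the same expression.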

One could make the argument that the paragraph and perhaps the chapter are useful tags. Poetry and sidenotes and footnotes seem fairly established in PG without additional tagging. Also will we be scanning 20 editions of Dickens, all with different line breaks and page numbers?

nwolcott2@post.harvard.edu

----- Original Message -----
From: "Brad Collins" <brad@chenla.org>
To: "Project Gutenberg Volunteer Discussion" <gutvol-d@pglaf.org>
Sent: Saturday, June 24, 2006 10:06 PM
Subject: Re: [gutvol-d] the end of the line

[...]

Norm Wolcott wrote:
One could make the argument that the paragraph and perhaps the chapter are useful tags. Poetry and sidenotes and footnotes seem fairly established in PG without additional tagging. Also will we be scanning 20 editions of Dickens, all with different line breaks and page numbers?
Well, since I sort of initiated this sub-thread, let me note that the addition of an optional "page break" element in OpenReader is instigated mostly by the needs of modern educational books, where there may be mixed use with co-existing paper and ebook versions. And, yes, this feature has been asked for by a user agent vendor working with the educational community. Of course, this feature may be used to preserve page breaks for other purposes and sources, such as PG/DP. Do note that there exist lots of scholarly references which point to particular pages in particular paper manifestations of a work, so having page break info may eventually prove useful to interlink all the old stuff (provided, of course, that the focus is on preserving "manifestation" information in the master digital documents.) I don't see as much use for the line break empty tag, but we plan to include it so it's there for those who wish to use it. In the demo OpenReader Publication of "My Antonia", the line break element will be included. I'm still going over Marcello's suggestions, plus rereading the Unicode annex about line breaking (which *does* cover, in a general way, the unusual ways line breaks are done in LOTE, such as older German and Dutch.) The other part of this sub-thread, the discussion of FRBR, is also interesting. I discovered the FRBR a few years ago, and find it very useful to understand how to categorize textual works. I like to refer to the system it describes as "WEMI", which rhymes with "hemi" (for you auto buffs out there): Work -- Expression -- Manifestation -- Item http://www.ifla.org/VII/s13/frbr/frbr.pdf (WEMI is the mnemonic I use to remember the system!) Regarding "expression" versus "manifestation" in the digitization of public domain materials, such as done by DP and PG, I've made my thoughts known the last couple years, so I'll refrain from getting into that again at this time! Jon Noring
participants (9)
- Andrew Sly
- Bowerbird@aol.com
- Brad Collins
- Jeroen Hellingman (Mailing List Account)
- Jon Noring
- Karl Eichwalder
- Marcello Perathoner
- Michael Hart
- Norm Wolcott