Enlightened Self Interest

Understood on the issue of editing. My proposal would be to supply an editable file in OpenOffice or MS doc format (BTW, if you are not using the open source OpenOffice suite I recommend you check it out: the features are great, at least as rich as MS Word's, plus one-button PDF creation, output as doc, text or native XML format, and a great price = $0! http://www.openoffice.org ). I propose to take nothing away; you will have edit control over the file. This also opens up another question: what base document formats have you standardized on for editability and portability, e.g. OASIS? Maybe that is a topic for another list. Finally, I note you have PDF formats available for some other books.
Andrew Sly wrote:
One possible problem is that PDF files are not easily editable.
All of our older texts are being gradually worked through, corrected, supplied with a new PG header (which puts all the legal "small print" at the end of the file instead of the beginning) and reposted into the current directory structure. When this process is done it will make some of the back-end organization much easier to deal with.
However, if during this process we come across a non-editable file (PDF, Lit, whatever), we cannot update it, and it's generally moved into an "old" directory, where it is still available if someone goes looking for it, but otherwise is not shown in the catalog.
Andrew
Having discovered Jane Austen regrettably late in life, I have downloaded a couple of novels and, since I find the raw text format unpleasant to read, I have reformatted them for my own use. It seems to me that, since I have the ability to produce PDFs and OpenOffice formats and even - heaven forfend - MS doc format should they be wanted, it would be churlish not to make such an offer. If you can point me at a standard for PDF (page width, font size, etc.) and let me know what formats you do want, I would be happy to undertake the small additional work for the two novels I have currently downloaded. I cannot supply DocBook at this time but hope to have that available shortly. Regards
-- Ron Aitchison

I think the long-term view, at least from the Distributed Proofreaders supply chain, is to provide a TEI-Lite document for each text, and from it programmatically create HTML, plain text, PDF, etc. on the fly. I'm not sure when this will happen, but I expect that some of the precursor activities at DP will take place this year. I don't know if DP will try to replace all previous versions of texts with TEI-Lite documents, but my guess is that once a system is in place, there will be volunteers who will go back and rework the texts, just as we have volunteers today providing revised editions of earlier texts with HTML and text versions that follow the current formatting guidelines. As always, volunteer in the ways you see fit, but I suspect many here (at least us DPers) would argue that working on new texts hitherto unavailable to PG is probably a better use of your time than providing multiple reformatted versions of existing works.
Bruce
http://www.pgdp.net/vision/ For Charlz' vision
http://www.tei-c.org/Lite/ For information on TEI-Lite
http://www.pgdp.net For volunteering at Distributed Proofreaders

As always, volunteer in the ways you see fit, but I suspect many here (at least us DPers) would argue that working on new texts hitherto unavailable to PG is probably a better use of your time than providing multiple reformatted versions of existing works.
I would agree; it seems to me that converting into a format that cannot be programmatically converted into other formats (including other "master" formats like DP-TEI, whenever that gets specified), is rather a waste of one's time. Anything that isn't a value-add (like converting straight text to Word or PDF without adding, say, bookmark information) also strikes me as not too useful. I could blast all of PG into Weasel format without a lot of trouble, for example, but I don't see a benefit as anybody who could make use of it could easily do the conversion as well. (pie-in-the-sky: being able to on-the-fly convert TEI to format of user's choice on download would be nearly Grail-like.)
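
To make the "one master, many outputs" idea concrete, here is a minimal Python sketch. It assumes a heavily simplified, hypothetical TEI-like master containing only `div`, `head` and `p` elements; real TEI-Lite (and whatever DP-TEI turns out to be) is much richer, and the function names are illustrative only.

```python
import textwrap
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical, heavily simplified TEI-like master document.
MASTER = """<text><body>
<div><head>Chapter 1</head>
<p>A short sample paragraph.</p></div>
</body></text>"""


def blocks(root):
    """Yield (tag, whitespace-normalized text) for each head or p element."""
    for el in root.iter():
        if el.tag in ("head", "p"):
            yield el.tag, " ".join("".join(el.itertext()).split())


def to_html(root):
    """Render headings as h2 and paragraphs as p."""
    wrap = {"head": "h2", "p": "p"}
    return "\n".join(f"<{wrap[t]}>{escape(s)}</{wrap[t]}>" for t, s in blocks(root))


def to_text(root, width=72):
    """Render headings in capitals and wrap paragraphs to a fixed width."""
    return "\n\n".join(s.upper() if t == "head" else textwrap.fill(s, width)
                       for t, s in blocks(root))


root = ET.fromstring(MASTER)
print(to_html(root))
print(to_text(root))
```

Both outputs come from the same parsed tree, which is the property being asked for: correct the master once and every derived format picks up the fix.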

On Fri, Feb 25, 2005 at 05:20:08PM -0800, Jon Niehof wrote:
As always, volunteer in the ways you see fit, but I suspect many here (at least us DPers) would argue that working on new texts hitherto unavailable to PG is probably a better use of your time than providing multiple reformatted versions of existing works.
I would agree; it seems to me that converting into a format that cannot be programmatically converted into other formats (including other "master" formats like DP-TEI, whenever that gets specified), is rather a waste of one's time.
Anything that isn't a value-add (like converting straight text to Word or PDF without adding, say, bookmark information) also strikes me as not too useful. I could blast all of PG into Weasel format without a lot of trouble, for example, but I don't see a benefit as anybody who could make use of it could easily do the conversion as well.
Well put. What we call "blind format conversions" -- conversions from one format to another, based on your own preferences, without any value-added input such as, say, illustrations from an eligible edition -- are not things that we really want to post, without some special reason. We have done it in the past, and it hasn't worked well. Sites like Blackmask http://blackmask.com do a better job of managing such content than we do, and in fact David Moynihan of Blackmask has offered us all of his converted files if we want them. We discussed it a few years ago, and decided against.
(pie-in-the-sky: being able to on-the-fly convert TEI to format of user's choice on download would be nearly Grail-like.)
You don't need TEI just for conversion. Today, HTML is the Universal Format for converting _from_. It may not be so always, and HTML has limits; it ain't great on mathematical texts, for instance, but given HTML, you can very easily get to any of the common reader formats in one step. jim
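
As a rough illustration of converting _from_ HTML in one step, the following Python sketch turns simple HTML into plain text using only the standard library. The class name, the set of block-level tags and the output conventions are assumptions made for the example, not anything PG actually uses.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, starting a new paragraph at each block-level tag."""

    BLOCK_TAGS = {"p", "h1", "h2", "h3", "div", "br"}  # assumed minimal set

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.chunks.append("\n\n")

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        paragraphs = "".join(self.chunks).split("\n\n")
        cleaned = [" ".join(p.split()) for p in paragraphs]
        return "\n\n".join(p for p in cleaned if p)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


print(html_to_text("<h1>Persuasion</h1><p>Chapter <i>one</i>\n text.</p>"))
```

Inline markup such as `<i>` is simply dropped here; a real converter would need a policy for emphasis, footnotes and tables, which is where the limits mentioned above start to show.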

Whoa there. Clearly I walked into a minefield and feel in imminent danger of having various limbs blasted from my poor undeserving corpus. Let me state my point of view, why I made the offer, and why I think trees and forests may be getting a little confused. Now I'm new to this stuff and many of you good folks have labored for years, so if I lay a few mines of my own - so be it.

1. The primary reason for my offer was simply that, since I found the simple text version unpleasant to read, I thought there may be others who do too, and that having a choice of formats available may make the output - the books - more approachable and hence reach a wider audience, with all the good things that must flow from that. Seems to me this is what PG is about - outreach.

2. I fully understand the issue of editable text and rampant variations - a maintenance nightmare. Untenable.

So let me address the issue of maintenance and, incidentally, why I do not think that my offer need cause the end of the world as we know it. There are two parts to this argument:

1. The basic format that I have converted to is OpenOffice's XML format, from which multiple conversions - PDF and MS doc if you want - are derived, all essentially driven from a set of DTDs. My brief reading of TEI is that it too uses an XML base. So we have a trivial level of commonality as a starting point. By looking at the conversion processes we could have an off-the-shelf WYSIWYG editor at $0 cost, with output convertible to TEI by driving it through appropriate XSLTs and all that good stuff. OpenOffice has a pilot development with DocBook to do something similar. It is not making much progress but with the right effort it could.

2. The second point relates to the difficulty, or possibility of success, of conversion. I used 4 styles in the book: Header 1, paragraph, page header and page footer (the last two could be easily removed but are tactically useful because of page numbering). For a simple text book I see no reason to use any more, and the cost of replacing header/footer with an alternate implementation is trivial in the extreme. Hard pagination is perhaps a bit more difficult to handle and I'm not sure I should have done it, but in the absence of any instructions/suggestions to the contrary I did. So a set of simple rules in the period before an idealized solution is available would significantly reduce difficulties.

Now whether TEI is better than DocBook or a converged OASIS standard is not for me to say. But it does seem to me there is a way forward in the short term by making the right intercepts - a combination of technology and rules - without building up a redundant and unmanageable nightmare. Or am I wrong?

Finally, does anyone want my pathetic conversions of Northanger Abbey and Persuasion!! :-) Or is it thanks but no thanks!

-- Ron Aitchison http://www.zytrax.com ZyTrax mailto:r.aitchison@zytrax.com 70 rue Notre Dame West Montreal Quebec H2Y 1S6 Tel:(514) 285.9088
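
The four-style argument can be sketched as a small mapping. The Python below is illustrative only: it assumes the document has already been reduced to hypothetical (style name, text) pairs, and it maps them onto TEI-like `div`, `head` and `p` elements. A real conversion would more likely be an XSLT over the OpenOffice XML, as suggested above.

```python
import xml.etree.ElementTree as ET

# Style names taken from the message; the target element names are assumptions.
STYLE_TO_ELEMENT = {"Header 1": "head", "paragraph": "p"}
DROPPED_STYLES = {"page header", "page footer"}  # pagination furniture only


def styled_runs_to_tei(runs):
    """Turn (style, text) pairs into a TEI-like body, one div per heading."""
    body = ET.Element("body")
    div = None
    for style, text in runs:
        if style in DROPPED_STYLES:
            continue
        tag = STYLE_TO_ELEMENT.get(style)
        if tag is None:
            raise ValueError(f"unmapped style: {style!r}")
        if tag == "head" or div is None:
            div = ET.SubElement(body, "div")
        ET.SubElement(div, tag).text = text
    return ET.tostring(body, encoding="unicode")


runs = [
    ("page header", "Persuasion"),
    ("Header 1", "Chapter 1"),
    ("paragraph", "First."),
    ("page footer", "1"),
    ("Header 1", "Chapter 2"),
    ("paragraph", "Second."),
]
print(styled_runs_to_tei(runs))
```

Raising on an unmapped style is a deliberate choice in this sketch: with only a handful of permitted styles, anything else is treated as a rule violation rather than silently passed through.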

Ron wrote:
1. The basic format that I have converted to is OpenOffice's XML format, from which multiple conversions - PDF and MS doc if you want - are derived, all essentially driven from a set of DTDs. My brief reading of TEI is that it too uses an XML base. So we have a trivial level of commonality as a starting point. By looking at the conversion processes we could have an off-the-shelf WYSIWYG editor at $0 cost, with output convertible to TEI by driving it through appropriate XSLTs and all that good stuff. OpenOffice has a pilot development with DocBook to do something similar. It is not making much progress but with the right effort it could.
For maximum archivability, repurposeability and accessibility, it is important for the XML markup vocabulary used in the master document to be wholly structural and semantic. Except where absolutely necessary (and maybe best solved using SVG and MathML), presentational markup should be avoided. TEI is primarily structural/semantic, but there are some presentational components. The base DP-TEI (I envision three levels of DP-TEI), when it comes into being, should not specify any presentational markup components. I am not familiar with OpenOffice's XML vocabulary, but I would guess that it, too, is a mix of structural/semantic tags with presentation tags (I also guess that it is much more presentationally-oriented than TEI, and doesn't have the structural/semantic richness of TEI.) If OpenOffice's XML vocabulary is to be used, it should be subsetted (at least at the base level) to not allow presentational markup. I do not recommend DocBook as the primary markup vocabulary for general books, but certainly it is intriguing to consider it as a second "blessed" vocabulary for particular types of documents it is designed for (primarily technical documents.) Just my $0.02 worth. Jon Noring
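
One way to picture "subsetting" a vocabulary so that presentational markup is not allowed is a whitelist check. The Python sketch below assumes a hypothetical base-level set of structural element names (all of which exist in TEI, but the selection is made up for illustration) and reports anything outside it.

```python
import xml.etree.ElementTree as ET

# Hypothetical base-level whitelist of structural/semantic elements.
STRUCTURAL = {"text", "body", "div", "head", "p", "note", "emph"}


def disallowed_elements(xml_source, allowed=STRUCTURAL):
    """Return the sorted names of elements that are not in the whitelist."""
    root = ET.fromstring(xml_source)
    return sorted({el.tag for el in root.iter() if el.tag not in allowed})


sample = ("<text><body><p>a <emph>b</emph> <font>c</font></p>"
          "<center>d</center></body></text>")
print(disallowed_elements(sample))
```

A schema (DTD, RELAX NG) would normally do this job; the point of the sketch is only that a base level without presentational tags is mechanically checkable.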

I didn't notice this discussion was heading to my favourite subject... TEI. I guess enlightened is on my mental spam filter...
Jon Noring wrote:
For maximum archivability, repurposeability and accessibility, it is important for the XML markup vocabulary used in the master document to be wholly structural and semantic. Except where absolutely necessary (and maybe best solved using SVG and MathML), presentational markup should be avoided.
Since we are reproducing printed works, it is often not possible to reconstruct the intended semantics of the user. This is especially true of books before the mid 19th century, when typographic conventions were not as well established. For many older books the best we can do is capture the typography in some "reduced" way. The good thing about TEI is that it actually supports that.
TEI is primarily structural/semantic, but there are some presentational components. The base DP-TEI (I envision three levels of DP-TEI), when it comes into being, should not specify any presentational markup components.
I am not familiar with OpenOffice's XML vocabulary, but I would guess that it, too, is a mix of structural/semantic tags with presentation tags (I also guess that it is much more presentationally-oriented than TEI, and doesn't have the structural/semantic richness of TEI.) If OpenOffice's XML vocabulary is to be used, it should be subsetted (at least at the base level) to not allow presentational markup.
OpenOffice XML has a lot of features geared towards an office application and the nasty details of presentation. It is quite presentational, and I wouldn't recommend it as a long-term archive format. However, it is much better structured than Microsoft .DOC format, and considerably more compact (using zip as it does).
I do not recommend DocBook as the primary markup vocabulary for general books, but certainly it is intriguing to consider it as a second "blessed" vocabulary for particular types of documents it is designed for (primarily technical documents.)
Reminds me of that old saying about standards, good to have so many to choose from... DocBook is fine for technical manuals written from scratch, not for capturing a nineteenth century novel, or sixteenth century history. Jeroen.
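
Because the OpenOffice format is a zip package wrapped around XML, its text is reachable with very ordinary tools. The Python sketch below assumes the main document lives in a package member named `content.xml` and uses a made-up stand-in package rather than a real OpenOffice file, so the element names inside it are purely illustrative.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def document_text(package_bytes, member="content.xml"):
    """Pull all text out of one XML member of a zip package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read(member))
    # Join text runs with spaces, then normalize whitespace.
    return " ".join(" ".join(root.itertext()).split())


def make_demo_package():
    """Build a tiny stand-in package in memory (not a real OpenOffice file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("content.xml", "<doc><p>Hello</p><p>world</p></doc>")
    return buffer.getvalue()


print(document_text(make_demo_package()))
```

Joining every text run with a space is crude (it would split a word that is only partly italicized), which is a small reminder of how much presentational detail sits between the raw XML and clean text.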

On Fri, Feb 25, 2005 at 10:04:01PM -0500, Ron Aitchison wrote:
Whoa there. Clearly I walked into a minefield and feel in imminent danger of having various limbs blasted from my poor undeserving corpus.
Minefield, yes. We really should put a sign up at the gates. :-) But nobody wants to blast you, I promise. It's an old, old subject, and we've tried various things at various times over the last 5 years or so -- some tries even pre-date that. I don't think there's one we don't regret. So it's not like we're dismissing your idea out of hand; it's one of those things that we've all thought of, and we'd all like to do, and we never quite forget it, and it pops up now and again even among old hands, but it's a net negative. And there's a lot of people here who have a lot of experience of the subject. There was probably a time when even I thought that posting individual blind format conversions was a good idea, but it must have been long ago.
Finally does anyone want my pathetic conversions of Northanger Abbey and Persuasion !! -:) Or is it thanks but no thanks!
Your conversions may well be lovely; their quality isn't at all an issue here. It's just not something that we do, except under some compelling special circumstances. jim

Ok, I leave the computer for one night and you all go nuts with the posts! :) hehe

Anyway, as one of the people working on PGTEI, I figure this discussion could use an update on where things stand. Currently, my efforts have concentrated on two fronts.

1 - Converting those texts that come through me from DP into PGTEI master format. I then use the online PGTEI -> HTML conversion routine to convert them to HTML for posting to PG. Most of them are not converted to TEXT simply because someone else at DP did the text version before I got to them. In other words, I've been mostly concentrating on the PGTEI format itself and the HTML output that results from it. Here is a recent link to a posted book, from off the top of my head. There are many more; I just don't have the list here on this computer. (Last count there were 20+ documents that I've put in PGTEI format sitting on my computer, most of which have been posted to the PG archives in HTML and/or TEXT format.)

http://www.gutenberg.org/dirs/1/4/9/8/14986/14986-h/14986-h.htm
Experimental Researches in Electricity, Volume 1

This is a pretty straightforward text, but it has an automatically produced Table of Contents and generated footnotes, so it gives some idea of where we are at. One of the things I plan on fixing in the future is the lack of links from the footnote text BACK to the footnote anchor in the main text.

2 - Updating/expanding the PGTEI documentation. I've got more notes than I know what to do with and many, many pages of additional documentation written in rough draft.

***

The eventual end I am hoping for is a standard encoding that makes conversion to other formats easy and quick. For instance, one of my next projects will be to take one of the VERY nasty math texts that DP has produced in TeX format and convert it to PGTEI. TEI uses TeX encoding for the math equations themselves, but the rest of the formatting is a little more intuitive AND, because of the validation routines we have available, much easier to develop and fix. But since I haven't tried the TeX on a massive scale yet within a PGTEI document, I don't know what bugs and gotchas I'm going to find.

If there are any questions (or if anyone wants to see some of the PGTEI documents I've created, rough drafts of the documentation I've been working on, etc.), please let me know.

Josh
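
For the missing footnote back-links mentioned above, here is a minimal Python sketch of one way generated footnotes could link in both directions. It is not the PGTEI conversion routine; the input is a hypothetical paragraph with inline `note` elements, and the `ref`/`fn` id scheme is invented for the example.

```python
import xml.etree.ElementTree as ET
from html import escape


def render_with_footnotes(p_xml):
    """Render one paragraph, moving inline notes to numbered footnotes."""
    p = ET.fromstring(p_xml)
    body = [escape(p.text or "")]
    notes = []
    for child in p:
        if child.tag == "note":
            n = len(notes) + 1
            # Anchor in the main text, linking forward to the footnote.
            body.append(f'<a id="ref{n}" href="#fn{n}">[{n}]</a>')
            note_text = escape("".join(child.itertext()))
            # Footnote text, linking back to the anchor in the main text.
            notes.append(f'<p id="fn{n}">[{n}] {note_text} '
                         f'<a href="#ref{n}">back</a></p>')
        else:
            body.append(escape("".join(child.itertext())))
        body.append(escape(child.tail or ""))
    return "<p>" + "".join(body) + "</p>\n" + "\n".join(notes)


print(render_with_footnotes(
    "<p>Current flows<note>See series 1.</note> in the wire.</p>"))
```

Giving the in-text anchor its own id is what makes the return link possible; the footnote then only needs a second `href` pointing at it.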

Joshua Hutchinson wrote:
If there are any questions (or if anyone wants to see some of the PGTEI documents I've created, rough drafts of the documentation I've been working on, etc.), please let me know.
I'd like to see the draft documentation. -- Marcello Perathoner webmaster@gutenberg.org

Joshua Hutchinson wrote:
1 - Converting those texts that come through me from DP into PGTEI master format. I then use the online PGTEI -> HTML conversion routine to convert them to HTML for posting to PG. Most of them are not converted to TEXT simply because someone else at DP did the text version before I got to them. In other words, I've been mostly concentrating on the PGTEI format itself and the HTML output that results from it.
I've been producing all my ebooks as TEI (since 1997), but since Gutenberg can't deal with it, I've hardly ever been able to post them. Please don't convert any text I've submitted before asking me. All my HTML comes from a single stylesheet. Jeroen.

On Friday 25 February 2005 07:11 pm, Bruce Albrecht wrote:
I don't know if DP will try to replace all previous versions of texts with TEI-Lite documents, but my guess is that once a system is in place, there will be volunteers that will go back and rework the texts, just as we have volunteers today providing revised editions of earlier texts with HTML and text versions that follow the current formatting guidelines.
As always, volunteer in the ways you see fit, but I suspect many here (at least us DPers) would argue that working on new texts hitherto unavailable to PG is probably a better use of your time than providing multiple reformatted versions of existing works.
I'm one of the volunteers who is going back and providing reworked versions of existing older PG texts, and my approximate criteria for selection are: older than (roughly) number 7000; only in text version at PG; text version has many "hard" errors (tbe, arc, arid, etc., as opposed to "soft" problems such as formatting); illustrations not present; and, most importantly, I have a physical copy of the book from which to make the corrections. This clearly falls under the "value-added" category of thinking. While I share your position that simple reformatting is mostly a waste of time, going back and rehabilitating existing works is not, and I hope that people interested in working on that aspect are not discouraged. I think of it much like carpentry; there are some people who are more of a framing temperament, those who are interested in finish work, and those who like to do restoration or renovation work. All of those skills/mindsets are necessary to complete a strong and attractive project.
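
The selection criteria above amount to a simple filter. The Python sketch below is only an illustration: the record fields, the `txt` format label and the threshold for "many" hard errors are all assumptions, since the message gives approximate criteria rather than exact rules.

```python
from dataclasses import dataclass


@dataclass
class Etext:
    number: int                # PG etext number
    formats: frozenset         # formats currently posted, e.g. {"txt"}
    hard_errors: int           # count of suspected scan errors ("tbe", ...)
    has_illustrations: bool    # whether the posted edition has illustrations
    have_physical_copy: bool   # whether the volunteer owns the printed book


def is_rework_candidate(t, max_number=7000, min_hard_errors=10):
    """Apply the five criteria; both thresholds are illustrative guesses."""
    return (t.number < max_number
            and t.formats == frozenset({"txt"})
            and t.hard_errors >= min_hard_errors
            and not t.has_illustrations
            and t.have_physical_copy)


old = Etext(161, frozenset({"txt"}), 25, False, True)
new = Etext(14986, frozenset({"txt", "html"}), 0, True, False)
print(is_rework_candidate(old), is_rework_candidate(new))
```

The physical-copy test is the one that separates this kind of rework from a blind format conversion, since it is what allows the corrections to be checked against a source.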

D Garcia writes:
I'm one of the volunteers who is going back and providing reworked versions of existing older PG texts, and my approximate criteria for selection are: older than (roughly) number 7000; only in text version at PG; text version has many "hard" errors (tbe, arc, arid, etc., as opposed to "soft" problems such as formatting); illustrations not present; and, most importantly, I have a physical copy of the book from which to make the corrections.
This clearly falls under the "value-added" category of thinking. While I share your position that simple reformatting is mostly a waste of time, going back and rehabilitating existing works is not, and I hope that people interested in working on that aspect are not discouraged.
I agree that your type of updates is needed for the older PG titles, and don't consider it a waste of time. However, it was my impression that Ron was offering to provide uncorrected reformatted editions of the titles in question.

On Fri, 25 Feb 2005 17:24:49 -0500, Ron Aitchison <ron@zytrax.com> writes:
Finally I note you have PDF formats available for some other books.
Primarily from TeX, which makes it easy to generate, and primarily for mathematical and scientific documents that pretty much have to be done in TeX.
Jim Tinsley <jtinsley@pobox.com> writes:
Today, HTML is the Universal Format for converting _from_. It may not be so always, and HTML has limits; it ain't great on mathematical texts,
More importantly, HTML can't really do footnotes, and I doubt anything is doing decent transformations on what we kludge sidenotes into.
participants (10)
- Bruce Albrecht
- D Garcia
- David Starner
- Jeroen Hellingman (Mailing List Account)
- Jim Tinsley
- Jon Niehof
- Jon Noring
- Joshua Hutchinson
- Marcello Perathoner
- Ron Aitchison