
greg said:
What tournament? You're the one who imagined it
gee... somehow i got the impression that -- under the rubric of "crowdsourcing" -- you were _soliciting_ "improved" versions of p.g. books, versions that were "fixed" and "tweaked" such that they gave people "better" experiences, either because they had had their typos and errors "corrected", or because they were "more friendly" with some hardware/software, or a combination of both those factors. i _also_ thought that you proposed to "host" these "new" versions, _and_ that the "best" one -- out of many that were "submitted", thanks to dedicated efforts by volunteers -- might get "folded back into" the p.g. library (nudging the present files to the "old" folder) subject to careful "evaluation" and "approval" by the whitewashers who would "analyze" 'em closely, _and_ that the workflows and tools that created these "superior" versions could well be considered for "adoption" by p.g. but i dunno... i guess i got all of that wrong... i just can't tell where i got all those silly ideas. oh, yeah, right, i "imagined" it. hey, no wonder people always tell me i have a good imagination. because heck no, we don't "compete" here -- we get along and be agreeable with each other. ***
you are competing by yourself
oh please, greg. i already _won_ this horse-race. i won when you and marcello and p.g. officially adopted light-markup -- which was _my_ horse! your tei/xml horse went down on the backstretch. plus, with the charismatic marcello now doing his famous ram-it-down-their-throats schtick, i'm winning again, because you're _botching_ it. even with a better horse, your jockey can't win. even better horse _and_ he cheats, he can't win. you're more right than you seem to know, greg: yes, quite indeed, i _am_ competing by myself... *** but... ya know... we're all one big happy family, pulling together, living in peace and harmony... *** the doublespeak and lack of integrity here is starting to become seriously troubling... not to mention the _noise_, now overwhelming... signal, anyone? is anybody picking up any signal? -bowerbird

BB>gee... somehow i got the impression that -- under the rubric of "crowdsourcing" -- you were _soliciting_ "improved" versions of p.g. books, versions that were "fixed" and "tweaked" such that they gave people "better" experiences, either because they had had their typos and errors "corrected", or because they were "more friendly" with some hardware/software, or a combination of both those factors. i _also_ thought that you proposed to "host" these "new" versions, _and_ that the "best" one -- out of many that were "submitted", thanks to dedicated efforts by volunteers -- might get "folded back into" the p.g. library....
What I heard Greg say was indeed that he was going to allow these various different "improved" versions to be posted targeting various platforms, and that indeed he said that WW'ers would have the option in the future of folding back *parts* of the new effort into the existing PG source files when they felt doing so would make a contribution. I don't think he ever said anything about totally replacing current versions with totally new versions in totally new source languages, and I certainly didn't hear him suggest that the current source file formats of the current books were going to be totally replaced with new source file formats.

And then what I heard next was Marcello saying "H*ck No, I'm Not Going To Allow That!" And then, as always, Marcello used the opportunity to push his own agenda (in typical SU fashion.) And I still don't see where people who want to post these "improved" versions targeting specific platforms can post their efforts where PG customers can find them, so, I guess, in practice, Marcello holds the keys to the fortress.

TEI is too geeky, and creates a source file which is dead from the minute it is written, because the only person who cares to maintain it is the original author. And the HTML it generates stinks to high heaven. It makes Microsoft Word output look like the immaculate conception. And TEI "defeats" the third-party distributors, which one can see by downloading the "TEI" books from one of these 3rd-party distributors. The end result, viewed by that indirect customer of PG, is ugly and ill-formatted and loses a lot of what was there in the original. So TEI, in practice, *does not* do a superior job of preserving what is there for future generations. It does an inferior job of preserving what is there, at the cost of a much higher workload for the PP.

RST is too simple, relies too much on tweaky escape codes, and generates "Python User Manual" output page formats, but it does generate reasonable "living" HTML for output, which other people can then take as "living" HTML and maintain and build on. There are some people whose minds work this RST way, and who don't mind memorizing little tweaky "printer escape codes", and RST is going to work for them, especially for things like Junior Readers. And there are other people whose minds don't work that way and really, really don't want to have to memorize tweaky "printer escape codes", and further don't want to have to give up the superior tools that are out there to write and display [X]HTML code, and the quick-turnaround development cycle that allows, rather than having to stop and spend a minute running it all through a python dev chain every time they change a comma. Not to mention the pain in the hindee to set up that python chain and try to maintain it. Or the need to keep sending your code to the "SU in the Sky" to see what has happened every time you've changed a comma.

On 2/4/2012 1:10 PM, Jim Adcock wrote:
TEI is too geeky, and creates a source file which is dead from the minute it is written, because the only person who cares to maintain it is the original author. And the HTML it generates stinks to high heaven.
No, the HTML /that Mr. Perathoner's tool generates/ stinks to high heaven. Don't blame the markup language (which is technically about the best markup language ever designed for semantic markup of books) for the sins of the converter. That's like blaming Tom Clancy for the fact that Penguin, USA is clueless when it comes to creating his e-books.

Are there any sites where I can pull down books in TEI format? Do any publishers sell their books in TEI format? Are there TEI libraries? Reading the TEI materials, these seem to be the kinds of purposes for which TEI was invented. Who has found it so useful that they have adopted it as their standard medium of storage? Textbooks seem to be squarely in the target market. What textbooks are distributed and used in TEI format? I expect that PG texts in TEI format would be useful to roughly the same audiences. It would be nice not to be a pioneer for something so (you have to admit) complex.

We know that RST is widely used among Python programmers. We should have a good shot at that market, I imagine.

On 02/05/2012 12:11 AM, don kretz wrote:
Are there any sites where I can pull down books in TEI format?
African American Women Writers of the 19th Century African Languages Lexicon Project (ALLEX) Alex Catalogue of Electronic Texts American Memory from the Library of Congress American Verse Project Ancient Inscriptions of the Northern Black Sea Aphrodisias in Late Antiquity (2004) Archimedes Palimpsest Project ATLAS : ATLA Series Autour des Res Gestae Divi Augusti Berardier.org: Édition électronique de Bérardier de Bataut, Essai sur le récit (1776) Berlin Intellectuals 1800-1830 Boccaccio's Decameron Bodleian Library: Toyota City Imaging Project Brevier Legislative Reports British National Corpus British Women Romantic Poets, 1789-1832 Brown University Scholarly Technology Group Brown University Women Writers Project Bibliotheques virtuelles humanistes Bulgarian National Corpus Cambridge University Press Carl-Maria-von-Weber-Gesamtausgabe Cartulaire blanc of the Abbey of Saint-Denis CELT Project: The Corpus of Electronic Texts The Charrette Project Chinese Buddhist Electronic Text Association Chronicon The Chymistry of Isaac Newton The Charles Brockden Brown Electronic Archive and Scholarly Edition Colonial Despatches: The colonial despatches of Vancouver Island and British Columbia 1846-1871 Computational Linguistics for Metadata Building (CLiMB) Corpus Encoding Standard (CES) Corpus Toneelkritiek Interbellum Croatian Language Corpus (Hrvatski jezini korpus) Croatiae auctores Latini Cursus Project Dafydd Ap Gwilym Edition DALF - Digital Archive of Letters by Flemish Authors and Composers from the 19th & 20th century Dante's Lemmatized Works Deutsches Textarchiv (The German Text Archive) A Digital Comparative Edition and Translation of the Shorter Chinese Saṃyukta Āgama Digitale Bibliotheek voor de Nederlandse Letteren (dbnl) Digitales Wörterbuch der deutschen Sprache (The Digital Dictionary of the German Language) Digital Litteratur Digital Quaker Collection Dingler-Online: Dingler’s Polytechnisches Journal Documenting the American South The Doegen Records Web Project Early American Fiction Early Americas Digital Archive Early Canada Online Early Irish Glossaries Database Early Modern French Women Writers Early 19th Century Russian Readership & Culture The Electronic Cædmon's Hymn Emblem Project Utrecht The English-Norwegian Parallel Corpus English Poetry Full-Text Database The EpiDoc Collaborative Ernst Barlach: Bibliographical Listing of Secondary Literature Europeana Regia Ex Bibliotheca Gondomariensi Expérimentation de normes d balisage en langues partenaires Extracts from the Diary of Robert Graves The FIDA Corpus of Slovene Language For Better for Verse: Interactive scansion tutorial Fortunoff Video Archive for Holocaust Testimonies FreeDict Freiburger Anthologie - Lyrik und Lied The Future Fire Henrik Ibsen's Writings History of the Accademia di San Luca, c. 
1590-1635 HyperGrammar Icelandic Online Dictionary and Readings Icon Programming for Humanists, Second Edition Indiana Authors and Their Books Indiana Magazine of History Indiana University Board of Trustees Minutes Indiana University Bloomington Faculty Council Minutes Inscriptions of Roman Cyrenaica Inscriptions of Aphrodisias (2007) Inscriptions of Roman Tripolitania Integrating Digital Papyrology In Transition: Selected Poems by the Baroness Elsa von Freytag-Loringhoven Internet Library of Early Journals Japanese Text Initiative John Foxe's Book of Martyrs Variorum Edition Online JOS Corpora of Slovene Der Junge Goethe in Seiner Zeit The Kapellmeisterbuch of the Abbey of Einsiedeln Kolb-Proust Archive for Research The Legacy Tobacco Documents Library Mark Twain Project Online Le mariage sous l'Ancien Régime Medieval Nordic Text Archive - Menota The Medieval Review Miguel de Cervantes Digital Library Model Editions Partnership: Historical Editions in the Digital Age Morris On-Line Edition MULTEXT-East Multilingual Text Tools and Corpora (MULTEXT) eMunch: Edvard Munch's Written Materials MyManuskrip (Malaysian Manuscripts): a Digital Library National Corpus of Polish New Left Review Newton Manuscript Project New Zealand Electronic Text Centre Newcastle Electronic Corpus of Tyneside English Norsk Ordbok 2014 OCIMCO: Oxford & Cambridge Islamic Manuscripts Catalogue Online Old French Corpus - Base de Français Médiéval The Open Siddur Project The Orlando Project: An Integrated History of Women's Writing in the British Isles The Oslo Multilingual Corpus Oxford Text Archive Partial Transcription of John Lydgate's "Fall of Princes" Perseus Project Petrus Plaoul Editio Critica Comentarii in Libris Sententiarum Piers Plowman Electronic Archive Polish Language of the XX Century Sixties ProQuest Repertorium of Old Bulgarian Literature and Letters Resianica Dictionary Rôles et pouvoirs des femmes au XVIe siècle dans la France de l'ouest Saint Patrick's Confessio HyperStack Sandrart.net SARIT Scholarly Digital Editions of Slovenian Literature The Scholarly Electronic Text and Image Service Slovene Biographical Lexicon The Algernon Charles Swinburne Project Szeged Corpus: a natural language processed Hungarian corpus Thesaurus Musicarum Italicarum (TMI) The Versioning Machine The Writings of James Fenimore Cooper The Thomas MacGreevy Archive United Farm Workers of America (California) Collective Bargaining Agreements Collection University of Michigan Humanities Text Initiative (HTI) University of Virginia Electronic Text Center University of Virginia Library Using XML to generate research tools for Wittgenstein scholars by collaborative groupwork Victorian Women Writers' Project Vincent van Gogh – The Letters Voltaire Foundation Voices of the Holocaust Women's Travel Writing, 1830-1930 The World of Dante Wright American Fiction 1851-1875 The Yellow Nineties Online http://www.tei-c.org/Activities/Projects/ -- Marcello Perathoner webmaster@gutenberg.org

Yup, been through some of that. Good project descriptions. Lots of Best Practices. I haven't found one yet that will let me have a TEI document. (A few will let me buy a CD that *may* have one on it.) On Sat, Feb 4, 2012 at 3:17 PM, Marcello Perathoner <marcello@perathoner.de> wrote:
On 02/05/2012 12:11 AM, don kretz wrote:
Are there any sites where I can pull down books in TEI format?
African American Women Writers of the 19th Century ... [long list snipped] ... The Yellow Nineties Online
http://www.tei-c.org/Activities/Projects/
-- Marcello Perathoner webmaster@gutenberg.org

"don" == don kretz <dakretz@gmail.com> writes:
don> Yup, been through some of that. Good project descriptions. Lots of Best Practices. I haven't found one yet that will let me have a TEI document. (A few will let me buy a CD that *may* have one on it.)

Did you try the Oxford text archive, http://ota.ahds.ac.uk/ ? What they call XML is TEI, 2506 of them. I have seen many more from other sources.

Carlo

I did start to browse through it. I didn't find anything that purported to be a catalog, or an offer of a url to download a text. If you know of one, please pass it along. On Sat, Feb 4, 2012 at 3:50 PM, Carlo Traverso <traverso@posso.dm.unipi.it> wrote:
"don" == don kretz <dakretz@gmail.com> writes:
don> Yup, been through some of that. Good project descriptions. Lots of Best Practices. I haven't found one yet that will let me have a TEI document. (A few will let me buy a CD that *may* have one on it.)
Did you try the Oxford text archive, http://ota.ahds.ac.uk/ ? What they call XML is TEI, 2506 of them. I have seen many more from other sources.
Carlo

Your url is better than the one I had, and I found some documents. Here's a link to a representative case. http://ota.ahds.ac.uk/text/5506.xml It's what I would expect an XML document to look like. Does anyone think this is a markup we're likely to get DP proofers to adopt? There's a lot of markup for not much text - low signal/noise ratio. A lot of generic tags: <item>...</item> type stuff that's not encouraging. Is this a real candidate? On Sat, Feb 4, 2012 at 3:54 PM, don kretz <dakretz@gmail.com> wrote:
I did start to browse through it. I didn't find anything that purported to be a catalog, or an offer of a url to download a text. If you know of one, please pass it along.
On Sat, Feb 4, 2012 at 3:50 PM, Carlo Traverso <traverso@posso.dm.unipi.it> wrote:
> "don" == don kretz <dakretz@gmail.com> writes:
don> Yup, been through some of that. Good project descriptions. Lots of Best Practices. I haven't found one yet that will let me have a TEI document. (A few will let me buy a CD that *may* have one on it.)
Did you try the Oxford text archive, http://ota.ahds.ac.uk/ ? What they call XML is TEI, 2506 of them. I have seen many more from other sources.
Carlo

On 2/4/2012 5:00 PM, don kretz wrote:
Your url is better than the one I had, and I found some documents.
Here's a link to a representative case.
http://ota.ahds.ac.uk/text/5506.xml
It's what I would expect an XML document to look like.
Does anyone think this is a markup we're likely to get DP proofers to adopt?
I don't think we're likely to get DP proofers to adopt anything beyond simple HTML. DP post-processors are more likely. But I don't think you're going to get any better response for ReST either. Acceptance by DP is probably not a good criterion for selecting a master markup language.
There's a lot of markup for not much text - low signal/noise ratio.
If you were to delete the recorded soft hyphens (<lb rend="hidden" type="hyphenInWord"/>) and surrounding white-space, I think you would find the signal-to-noise ratio a lot higher. Organizational constraints ("use this for that, don't use that for this, don't worry about recording soft hyphens or other artifacts of the print medium chosen") could reduce the noise a lot more.
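For instance, here is a minimal sketch in Python of that particular cleanup, assuming lxml and the standard TEI P5 namespace; the file names are just examples, and depending on how the transcription records the break, a trailing hyphen may also need to be stripped from the preceding text:

# Strip recorded soft hyphens (<lb rend="hidden" type="hyphenInWord"/>)
# and rejoin the broken words.  A sketch only; assumes TEI P5 namespace.
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"

def strip_soft_hyphens(path_in, path_out):
    tree = etree.parse(path_in)
    for lb in list(tree.iter("{%s}lb" % TEI_NS)):
        if lb.get("type") != "hyphenInWord":
            continue
        parent = lb.getparent()
        tail = (lb.tail or "").lstrip()          # text after the break
        prev = lb.getprevious()
        if prev is not None:                     # glue onto the previous sibling's tail
            prev.tail = (prev.tail or "").rstrip() + tail
        else:                                    # or onto the parent's text
            parent.text = (parent.text or "").rstrip() + tail
        parent.remove(lb)
    tree.write(path_out, encoding="utf-8", xml_declaration=True)

strip_soft_hyphens("5506.xml", "5506-nohyphens.xml")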
A lot of generic tags: <item>...</item> type stuff that's not encouraging.
<item> is not a generic tag. It indicates an item in a list, corresponding to <li> in HTML.
Is this a real candidate?
I would think so. It does the job, has been well vetted by experts, and is fairly well accepted as a standard. It would certainly be a better choice than RYOML (Roll Your Own Markup Language).

I have less confidence in standards bodies peopled by experts than you do, I'm afraid. I've had to implement real-world applications despite them. (In some cases the real-world implementations that work - meaning real people find them easy enough to use that they can get real work done - end up forming the second wave of standards. Some of that is going on in healthcare right now.) But if someone could take some of the existing html files, and convert them into some form (like TEI or RST) that generates demonstrably acceptable epub as well as not-too-degraded html (I think it's reasonably possible that the html would be better too), and could present that to DP (possibly with an expression of appreciation and regard for the work they do), we might have a way forward. What's to be considered is how automated the conversion can be. Lee, is that similar to what you were expecting to do, or could do?

Lee, I don't understand how you might have a standard for PG that wasn't acceptable to DP.

On 2/4/2012 5:40 PM, don kretz wrote:
Lee, is that similar to what you were expecting to do, or could do?
Go to http://www.passkeysoft.com/~lee/tei/antonia.xml. Tell me what you think. Don't do a "view source" until you have formed an opinion.

It wasn't hard to guess what I would see. In Chrome on my machine, it looks mostly like minimally formatted plain text, except at the top. We know browsers suppress displaying tags they don't recognize, so it's what I would expect, I guess. I would say it wouldn't be a display format we'd want to distribute, or a markup effort worth the result shown this way. But you know that. What should we learn? That TEI produces recognizable text in a browser? On Sat, Feb 4, 2012 at 5:03 PM, Lee Passey <lee@novomail.net> wrote:
On 2/4/2012 5:40 PM, don kretz wrote:
Lee, is that similar to what you were expecting to do, or could do?
Go to http://www.passkeysoft.com/~lee/tei/antonia.xml. Tell me what you think. Don't do a "view source" until you have formed an opinion.

On Sat, February 4, 2012 6:54 pm, don kretz wrote:
It wasn't hard to guess what I would see.
In chrome on my machine, it looks mostly like minimally formatted plain text, except at the top.
In Firefox I see a document formatted exactly the way God intended. The title page is unimportant, but when we get to the actual text, it is rendered in a moderately sized, sans-serif font; there are no right margins (if I want margins I can resize my browser); paragraphs are indented, but there are no blank lines between them; emphasized text is italicized; footnotes appear at chapter end, and are in reduced-size text. Apparently you are not happy with this obviously perfect presentation. So, you should tell me how /you/ would have preferred to see this presented and I will accommodate your request. Be warned, however, that if you do I will be required to report you to the inquisition.
We know browsers suppress displaying tags they don't recognize, so it's what I would expect, I guess.
No, none of the tags are ignored. They are displaying exactly as I (and God) intended. Try changing the default presentation on your browser and see how it affects your view (it shouldn't change it at all).
I would say it wouldn't be a display format we'd want to distribute, or a markup effort worth the result shown this way.
I would say that it's exactly the display format you would want to distribute. How would you change it? (not a rhetorical question; I really need answers).
But you know that. What should we learn? That TEI produces recognizable text in a browser?
Yes, sort of. But more importantly, that TEI can mark up practically everything that is needed for an e-text, and that conversion from TEI to HTML is simple and straight-forward.
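To make that concrete, here is a deliberately tiny sketch in Python, assuming lxml, the TEI P5 namespace, and only a handful of common elements (the file name is the example above; a real converter needs a much longer table, has to look at rend attributes, and has to handle TEI's more permissive content models). The shape of the job is just a tag-renaming walk:

# Toy TEI -> HTML converter: walk the TEI body, rename elements per a
# small table, and pass anything unknown through as a classed <div>.
# A sketch only, not PG's tool; element coverage is illustrative.
from lxml import etree

TEI = "{http://www.tei-c.org/ns/1.0}"

TAG_MAP = {                  # TEI local name -> (HTML tag, class)
    "p":    ("p",  None),
    "head": ("h2", None),
    "hi":   ("em", None),    # a real converter would look at @rend here
    "list": ("ul", None),
    "item": ("li", None),
    "div":  ("div", None),
    "lg":   ("div", "stanza"),
    "l":    ("div", "line"),
}

def convert(src, dst):
    # copy text content, then recurse over children, renaming as we go
    dst.text, dst.tail = src.text, src.tail
    for child in src:
        if not isinstance(child.tag, str):       # skip comments and PIs
            continue
        local = etree.QName(child).localname
        tag, cls = TAG_MAP.get(local, ("div", "tei-" + local))
        out = etree.SubElement(dst, tag)
        if cls:
            out.set("class", cls)
        convert(child, out)

def tei_to_html(path):
    body = etree.parse(path).find(".//%sbody" % TEI)
    html = etree.Element("html")
    convert(body, etree.SubElement(html, "body"))
    return etree.tostring(html, pretty_print=True, encoding="unicode")

print(tei_to_html("antonia.xml"))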

On 02/06/2012 06:43 PM, Lee Passey wrote:
Yes, sort of. But more importantly, that TEI can mark up practically everything that is needed for an e-text, and that conversion from TEI to HTML is simple and straight-forward.
No, it is not, since: <p>Blah <table> rows </table> blah.</p> is perfectly valid in TEI but perfectly invalid in HTML. Basically, <p> in TEI can contain more things than in HTML. -- Marcello Perathoner webmaster@gutenberg.org

I don't think that one might adopt TEI for DP formatting (sometimes too verbose), and DP markup cannot be directly transformed into TEI (the DP markup is ambiguous, it has too many "This will draw the attention of the PPer"). But making DP markup a bit more precise (e.g. adopting different markup for center, flush right and flush left instead of /*...*/), DP markup might be easily transformed into elementary TEI. There are plans to have guiprep generate TEI output (instead of/in addition to HTML).

And TEI is simpler than HTML, since it is an intermediate step that just marks the structure and leaves the display to a next step. So you don't have to care about the bugs of IE 6 or Firefox 4 beta. TEI is not meant for direct viewing in browsers, although some styles of TEI can be visualized with some standard XSLT. For example, some OTA TEI files display as unformatted text since there is no display engine. Just look at the source, either in a browser or in a text editor.

Of course the difficulty is only moved to the next step. And it is what Marcello's tools can do (as epubmaker does for RST). And PG-TEI is crippled in the same way as PG-RST is crippled (RST in itself is extensible and can produce much better results, but these extensions are not supported by epubmaker).

So people really wanting to do a service to PG should propose a different toolchain for PG TEI submissions, better configurable and giving better end results.

By the way, where are the sources for the TEI toolchain? I couldn't find a mention of them on PG (searching the site for TEI returns nothing).

Carlo
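P.S. A rough sketch in Python of what I mean by "easily transformed into elementary TEI". The /*c ... c*/ and /*r ... r*/ delimiters are purely hypothetical stand-ins for the more precise markers suggested above (real DP texts only have the ambiguous /* ... */), and headings, poetry and the rest would need markers of their own:

# Hypothetical "more precise DP markup" -> elementary TEI.
# Not a working DP converter; the block markers are invented for the example.
import re
from xml.sax.saxutils import escape

BLOCK_RENDS = {"c": "center", "r": "right", "l": "left"}

def dp_to_tei(text):
    out = []
    for block in re.split(r"\n\s*\n", text.strip()):
        m = re.match(r"/\*(\w)\n(.*)\n\1\*/$", block, re.S)
        if m and m.group(1) in BLOCK_RENDS:
            rend = BLOCK_RENDS[m.group(1)]
            for line in m.group(2).splitlines():
                out.append('<p rend="%s">%s</p>' % (rend, escape(line)))
        else:
            # an ordinary paragraph: rejoin the wrapped lines
            out.append("<p>%s</p>" % escape(" ".join(block.splitlines())))
    return "\n".join(out)

sample = "It was a dark and stormy night;\nthe rain fell in torrents.\n\n/*c\nCHAPTER I\nc*/"
print(dp_to_tei(sample))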

I agree with Carlo: TEI indicates mostly the plain _structure_ of the text, without any formatting. You will have to add that in a later stage (and interpret the structure from the formatting when digitizing an existing book). Also, I do not use PG-TEI. I typically add rend-attributes to my TEI files, to help the conversion to HTML and ePub, and as a result get things like:

http://www.gutenberg.org/files/38748/38748-h/38748-h.htm
http://www.gutenberg.org/files/38571/38571-h/38571-h.htm

(Just a few of hundreds of books I have produced using TEI. Don't look at the ePubs, they are not mine; I can provide them if needed from my tool-set, but the PG process doesn't allow them in.)

That HTML is reasonably clean, and works on most browsers. By the way, my tooling allows me to adjust the formatting by use of CSS to almost anything, including making the book look like the original to a high degree.

It is not TEI that is to blame, it is the misunderstanding of what it is and how to use it that causes trouble...

Jeroen.

On 2012-02-05 08:33, Carlo Traverso wrote:
I don't think that one might adopt TEI for DP formatting (sometimes too verbose), and DP markup cannot be directly transformed into TEI (the DP markup is ambiguous, it has too many "This will draw the attention of the PPer"). But making DP markup a bit more precise (e.g. adopting different markup for center, flush right and flush left instead of /*...*/), DP markup might be easily transformed into elementary TEI. There are plans to have guiprep generate TEI output (instead of/in addition to HTML).
And TEI is simpler than HTML, since it is an intermediate step, that just marks the structure, and leaves the display to a next step. So you don't have to care to the bugs of IE 6 or firefox 4 beta. TEI is not meant for direct viewing in browsers, although some styles of TEI can be visualized with some standard XSLT. For example, some OTA TEI files display as unformatted text since there is no display engine. Just look at the source, either in a browser or in a text editor.
Of course the difficulty is only moved to the next step. And it is what Marcello's tools can do (as epubmaker does for RST). And PG-TEI is crippled in the same way as PG-RST is crippled (RST in itself is extensible and can produce much better results, but these extensions are not supported by epubmaker).
So people really wanting to do a service to PG should propose a different toolchain for PG TEI submissions, better configurable and giving better end results.
By the way, where are the sources for the TEI toolchain? I couldn't find a mention of them on PG (searching the site for TEI returns nothing).
Carlo

Hi Don, as you might well know, any markup that is of any use will insert quite an amount of markup/information. The files are not intended to be hand-written or hand-edited; instead, you use an editor, or the file is auto-generated. This ensures the consistency of the data. All that extra markup, as you say, has its purpose for the tools that operate on those files. regards Keith. On 05.02.2012 at 01:00, don kretz wrote:
Your url is better than the one I had, and I found some documents.
Here's a link to a representative case.
http://ota.ahds.ac.uk/text/5506.xml
It's what I would expect an XML document to look like.
Does anyone think this is a markup we're likely to get DP proofers to adopt? There's a lot of markup for not much text - low signal/noise ratio. A lot of generic tags: <item>...</item> type stuff that's not encouraging.
Is this a real candidate?

"don" == don kretz <dakretz@gmail.com> writes:
don> I did start to browse through it. I didn't find anything that purported to be a catalog, or an offer of a url to download a text. If you know of one, please pass it along.
don> On Sat, Feb 4, 2012 at 3:50 PM, Carlo Traverso <traverso@posso.dm.unipi.it> wrote:
>> Did you try the Oxford text archive, http://ota.ahds.ac.uk/ ? What they call XML is TEI, 2506 of them. I have seen many more from other sources.
>> Carlo
From the front page, the first link is
"The Archive: browse the catalogue and get access to the OTA resources" ( http://ota.ahds.ac.uk/catalogue/index.html ). The first tab is "TEI texts", and has a search box together with a paged list. Choosing one, you get a page with a detailed description, including "Distributed by the University of Oxford under a Creative Commons Attribution-ShareAlike 3.0 Unported License" and "Download: XML; HTML; ePub; plain text".

Carlo>Did you try the Oxford text archive, http://ota.ahds.ac.uk/ ? What they call XML is TEI, 2506 of them. I have seen many more from other sources. I tried exercising TEI as a "master format" by downloading one of their texts "at random" in one of their choices of output file format, namely EPUB. The result was complete garbage, with some elements needlessly fixed-size at about 10X the size of my 20" computer screen. The body text was in an extraordinarily ugly font, and everything ran together with little or no apparent attempt to do *any* kind of sensible formatting. And your point about using TEI as a "master format" was what ???

Hah, hah, hah! SORRY! Jim. I have seen what you saw once. I experimented with iBooks Author, exported to the ibooks format, just changed the ending to .epub and loaded it into calibre. Result: JUNK displayed. EPUB is not EPUB!! Sounds funny, but it is not. Better check what their epub is generated for! You actually can't blame this on either TEI or EPUB, but on the tools or how you used them. I can not say which! regards Keith. On 05.02.2012 at 02:22, Jim Adcock wrote:
Carlo>Did you try the Oxford text archive, http://ota.ahds.ac.uk/ ? What they call XML is TEI, 2506 of them. I have seen many more from other sources.
I tried exercising TEI as a "master format" by downloading one of their texts "at random" in one of their choices of output file format, namely EPUB. The result was complete garbage, with some elements needlessly fixed-size at about 10X the size of my 20" computer screen. The body text was in an extraordinarily ugly font, and everything ran together with little or no apparent attempt to do *any* kind of sensible formatting.
And your point about using TEI as a "master format" was what ???

Keith>You actually can't blame this on either TEI or EPUB, but on the tools or how you used them. I can not say which!

Not sure what you are saying. I think you are saying that TEI *could* be used to generate attractive output on small devices, but that isn't being done anywhere today that anyone can show me.

Most EPUB I see *does* generate attractive formatted display. Granted, some EPUB comes from paper houses which "don't get" EPUB, and their stuff can be pretty ugly, because a small electronic device *is not* a sheet of paper and paper design rules don't necessarily translate directly to electronics (part of the reason PDF really doesn't work on small devices -- small devices are not made of paper.)

A proposed "master" format should be able to successfully demonstrate the entire food chain, from raw OCR to rendered text on an end customer's tablet, such that they say "Now *this* is a book!"

Hi Jim, I think I will start with your remark on PDFs. PDF is very able to display a document on small displays. Whether it is legible is a different matter. It is not the fault of PDF. It will be legible if you have enough resolution and the display is not too small. But the reading experience is unsatisfactory. Ever had a thumb book? Nice little buggers. Yet, you would not reformat a paperback to a thumb book! No matter what format you use, you will run into this problem of scale -- as a matter of fact, in both directions.

EPUB is supposed to scale and produce similar results on all devices that conform to the "Standard". The reality is far away in this matter because the "Standard" is far too vague (or loose). Sure, you can tweak it to look "perfect"; take that to another device, or a smaller/larger one, and you get garbage. You can design it to look DECENT on several devices. But that "HTML" does not look good in mobi. Even kindlegen creates two versions because the devices are "incompatible".

I do not see anything stopping TEI, RST or HTML from going from OCR to book. I do see that EPUB will not do! regards Keith. On 06.02.2012 at 06:02, Jim Adcock wrote:
Keith>You actually can't blame this on either TEI or EPUB, but on the tools or how you used them. I can not say which!
Not sure what you are saying. I think you are saying that TEI *could* be used to generate attractive output on small devices, but that isn't being done anywhere today that anyone can show me.
Most EPUB I see *does* generate attractive formatted display. Granted some EPUB comes from paper houses which "don't get" EPUB and their stuff can be pretty ugly because a small electronic device *is not* a sheet of paper and paper design rules don't necessarily translate directly to electronics (part of the reason PDF really doesn't work on small devices -- small devices are not made of paper.)
A proposed "master" format should be able to successfully demonstrate the entire food chain, from raw OCR to rendered text on an end customers' tablet such that they say "Now *this* is a book!"

On Sat, Feb 4, 2012 at 1:11 PM, don kretz <dakretz@gmail.com> wrote:
Are there any sites where I can pull down books in TEI format?
Yes but ... the point of using TEI for a few complicated books would be that the TEI could then be fed through a converter to produce usable texts in azw, epub, txt, and other formats. RST as well would be processed to produce end-user results. That's the point of a master format ... like using XML to store data, which can then be repurposed in various forms. What's so hard to understand about that? -- Karen Lofstrom

On 2/4/2012 4:41 PM, Karen Lofstrom wrote:
On Sat, Feb 4, 2012 at 1:11 PM, don kretz<dakretz@gmail.com> wrote:
Are there any sites where I can pull down books in TEI format?
Yes but ... the point of using TEI for a few complicated books would be that the TEI could then be fed through a converter to produce usable texts in azw, epub, txt, and other formats. RST as well would be processed to produce end-user results.
That's the point of a master format ... like using XML to store data, which can then be repurposed in various forms.
What's so hard to understand about that?
If you're trying to select a format to use, you want to pick one that has some weight behind it. You want one that has been vetted by experts; you want one that is mature enough that there is some assurance that it has the coverage you need; you want one that has software tools already developed to manipulate and convert it. And if you're trying to develop some of those tools on your own, you want to see how /other/ people have used the format. All of these are good reasons to examine other projects. Mr. Kretz is a developer and tool builder, he's not just a consumer.

There have been a number of assertions along the lines of "and then you could just ..." that I would like to verify by finding cases where someone did. What is passed off as an "implementation detail" is often much more difficult to implement than it is to say. On Sat, Feb 4, 2012 at 3:42 PM, Lee Passey <lee@novomail.net> wrote:
On 2/4/2012 4:41 PM, Karen Lofstrom wrote:
On Sat, Feb 4, 2012 at 1:11 PM, don kretz<dakretz@gmail.com> wrote:
Are there any sites where I can pull down books in TEI format?
Yes but ... the point of using TEI for a few complicated books would be that the TEI could then be fed through a converter to produce usable texts in azw, epub, txt, and other formats. RST as well would be processed to produce end-user results.
That's the point of a master format ... like using XML to store data, which can then be repurposed in various forms.
What's so hard to understand about that?
If you're trying to select a format to use, you want to pick one that has some weight behind it. You want one that has been vetted by experts; you want one that is mature enough that there is some assurance that it has the coverage you need; you want one that has software tools already developed to manipulate and convert it.
And if you're trying to develop some of those tools on your own, you want to see how /other/ people have used the format.
All of these are good reasons to examine other projects.
Mr. Kretz is a developer and tool builder, he's not just a consumer.

Karen> What's so hard to understand about that? What's so hard to understand is why PG never actually does what you are proposing. It is one thing to convert TEI into living HTML. It is another thing to convert TEI to something which vaguely renders through an HTML machine, but which any HTML author will tell you is in no way or shape real HTML code.

On 4 February 2012 23:11, don kretz <dakretz@gmail.com> wrote:
Are there any sites where I can pull down books in TEI format?
Do any publishers sell their books in TEI format?
Are there TEI libraries?
Perseus: http://www.perseus.tufts.edu/hopper/
Reading the TEI materials, these seem to be the kinds of purposes for which TEI was invented. Who has found it so useful that they have adopted it as their standard medium of storage?
TEI is pretty widely used in corpora, and dictionaries: anywhere where the semantic content is valued.
Textbooks seem to be squarely in the target market. What textbooks are distributed and used in TEI format?
I expect that PG texts in TEI format would be useful to roughly the same audiences.
If the texts were being semantically annotated, they would be useful for corpus-type uses. As is, most of them retain the page numbers, so they're at least a better starting point than plain text. (The HTML editions are _worse_ than plain text for those uses). -- <Sefam> Are any of the mentors around? <jimregan> yes, they're the ones trolling you

Don>Are there any sites where I can pull down books in TEI format? Don>Do any publishers sell their books in TEI format? Don>Are there TEI libraries? Answering the implied question, the *rest* of the non-PG non-academic e-book publishing world has settled on EPUB as being *the* master format.

On Sat, Feb 4, 2012 at 5:11 PM, Jim Adcock <jimad@msn.com> wrote:
Answering the implied question, the *rest* of the non-PG non-academic e-book publishing world has settled on EPUB as being *the* master format.
On one hand, that's not true. There are virtually no roleplaying game publishers who publish in EPUB. Virtually all of them publish in PDF. I suspect there are other niche markets where that's true. On the other, so? What audience exactly do you think I'm doing the History of Bibliographies of Bibliographies for? Or Selections from Early Middle English: 1130-1250? -- Kie ekzistas vivo, ekzistas espero.

"David" == David Starner <prosfilaes@gmail.com> writes:
David> On Sat, Feb 4, 2012 at 5:11 PM, Jim Adcock <jimad@msn.com> wrote:
>> Answering the implied question, the *rest* of the non-PG non-academic e-book publishing world has settled on EPUB as being *the* master format.
David> On one hand, that's not true. There are virtually no roleplaying game publishers who publish in EPUB. Virtually all of them publish in PDF. I suspect there are other niche markets where that's true.

What has this to do with master formats? Commercial publishers don't publish their master files, you can at most see their distribution files. Mostly, DRM-ed.

Carlo

On Sun, Feb 5, 2012 at 12:25 AM, Carlo Traverso <traverso@posso.dm.unipi.it> wrote:
"David" == David Starner <prosfilaes@gmail.com> writes:
David> On Sat, Feb 4, 2012 at 5:11 PM, Jim Adcock <jimad@msn.com> wrote:
>> Answering the implied question, the *rest* of the non-PG non-academic e-book publishing world has settled on EPUB as being *the* master format.
David> On one hand, that's not true. There are virtually no roleplaying game publishers who publish in EPUB. Virtually all of them publish in PDF. I suspect there are other niche markets where that's true.
What has this to do with master formats? Commercial publishers don't publish their master files, you can at most see their distribution files. Mostly, DRM-ed.
Why is this a reply to me? If you want to argue that those publishers who publish in EPUB don't use anything like it for the master format, go ahead, but I think it's obvious that if you don't publish in EPUB or Mobi, those aren't your internal formats. -- Kie ekzistas vivo, ekzistas espero.

David> What has this to do with master formats? Commercial publishers don't
publish their master files, you can at most see their distribution files. Mostly, DRM-ed.
If your master format *is* your distribution format then at least two good things happen:

1) What people see on their computer the moment before they submit it to PG is the same identical thing they or their friends see when they download it from PG. There is no tool-chain sausagemaker to screw things up.

2) If the distribution format is identical to the master format then each distribution into the world is another copy of the master format that is being preserved for posterity, and which immediately becomes useful for any derivative efforts.

In comparison, if PG picks an obscure format then the "master file" only lives as long as PG has a group of volunteers who are willing to keep the tool chain alive for all eternity. And are willing to volunteer to write to that format. As soon as they get sick of it then the master format becomes dead.

Amazon has a huge business publishing things submitted in EPUB format, which is the source format, but which they then *sometimes not always* distribute in their own DRM wrapper -- the choice of DRM or no DRM is up to the author. B&N and Apple also do these things, and the other smaller distributors. Agreed, once something is DRM'ed it no longer matters what it used to be -- because now it is 100% vendor specific.

Amazon is now releasing the KF8 file format, which much more closely follows EPUB standards, presumably because the major publishing houses want to distribute to Amazon in EPUB format. Granted, not everyone writes directly in EPUB. Some who are paper-oriented author first in Dreamweaver, which in turn has a less-than-perfect EPUB output option, which authors typically clean up in Sigil, before publishing, or redistributing, in EPUB format. Some even author in Word Docs. So, EPUB can be, and is being used as, a primary, secondary, and tertiary format. The format is about "as simple as you can get" while still containing the things you really need to be "a book." http://en.wikipedia.org/wiki/EPUB
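For anyone who hasn't cracked one open, here is a rough sketch in Python of what that minimum actually is -- just the standard EPUB 2 container pieces wrapped around a stub chapter. The title, identifier, and file names are only examples, and a real book would add CSS, a cover, and a fuller table of contents (and should be run through epubcheck):

# Build a minimal EPUB 2 container: a zip whose first entry is the
# uncompressed "mimetype", plus META-INF/container.xml, an OPF package
# file, an NCX table of contents, and one XHTML chapter.
import zipfile

CONTAINER = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

OPF = """<?xml version="1.0"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="uid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Example Title</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="uid">example-id-0001</dc:identifier>
  </metadata>
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="ch1"/>
  </spine>
</package>"""

NCX = """<?xml version="1.0"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head><meta name="dtb:uid" content="example-id-0001"/></head>
  <docTitle><text>Example Title</text></docTitle>
  <navMap>
    <navPoint id="n1" playOrder="1">
      <navLabel><text>Chapter I</text></navLabel>
      <content src="chapter1.xhtml"/>
    </navPoint>
  </navMap>
</ncx>"""

CHAPTER = """<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Chapter I</title></head>
  <body><h1>Chapter I</h1><p>Text goes here.</p></body>
</html>"""

with zipfile.ZipFile("minimal.epub", "w") as epub:
    # the mimetype entry must come first and must not be compressed
    epub.writestr("mimetype", "application/epub+zip", zipfile.ZIP_STORED)
    epub.writestr("META-INF/container.xml", CONTAINER, zipfile.ZIP_DEFLATED)
    epub.writestr("OEBPS/content.opf", OPF, zipfile.ZIP_DEFLATED)
    epub.writestr("OEBPS/toc.ncx", NCX, zipfile.ZIP_DEFLATED)
    epub.writestr("OEBPS/chapter1.xhtml", CHAPTER, zipfile.ZIP_DEFLATED)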

On Sun, Feb 5, 2012 at 8:44 PM, Jim Adcock <jimad@msn.com> wrote:
David> What has this to do with master formats? Commercial publishers don't
publish their master files, you can at most see their distribution files. Mostly, DRM-ed.
You've misattributed that; that was Carlo.
If your master format *is* your distribution format then at least two good things happen:
1) What people see on their computer the moment before they submit it to PG is the same identical thing they or their friends see when they download it from PG. There is no tool-chain sausagemaker to screw things up.
No, not at all. We've all run into web pages that must have been looked at before uploading, but didn't work in our browser at our settings. That's exactly the issue we're running into here; EPUB takes HTML, but just feeding it EPUB doesn't work. Which doesn't at all address the issue that we don't all want the same master format. There's a demand for HTML for desktops, EPUB and MOBI at the least.
2) If the distribution format is identical to the master format then each distribution into the world is another copy of the master format that is being preserved for posterity, and which immediately becomes useful for any derivative efforts.
And then one day the distribution format is no longer the hot new thing and you have a pile of documents you still have to convert to the formats people want. And the problem, for that derivative effort or any really, is that distribution formats suck. They match the output format, not what we want to input. If you're doing any serious derivative work, you probably want the TEI-Lite, not the HTML, especially if you need page numbers or sidenotes.
In comparison, if PG picks an obscure format then the "master file" only lives as long as PG has a group of volunteers who are willing to keep the tool chain alive for all eternity. And are willing to volunteer to write to that format. As soon as they get sick of it then the master format becomes dead.
And what? You think that HTML is magically going to be around forever? That no one will ever have to convert the EPUB to anything? As long as there are people interested in keeping the archive alive, we can target a master format to whatever formats are popular; once there's no one interested in keeping the archive alive, our files will die.
The format is about "as simple as you can get" while still containing the things you really need to be "a book."
Which is interesting mostly if what you're trying to copy is novels. If you've got a book that has a feature that you don't really need to be a book, then you'll have problems. -- Kie ekzistas vivo, ekzistas espero.

Hi David, On 06.02.2012 at 08:40, David Starner wrote:
On Sun, Feb 5, 2012 at 8:44 PM, Jim Adcock <jimad@msn.com> wrote:
Which doesn't at all address the issue that we don't all want the same master format. There's a demand for HTML for desktops, EPUB and MOBI at the least.

We "all" do not need to use the same master format. All that is needed is a converter from a particular format to the master format. The problem is that the formatting in a particular format must be constrained to a set of features which is acceptable for producing the different output formats from the master.
You see, HTML is nice. Most ereader device formats use HTML! Yet, they cannot reproduce all HTML. Therefore, any formatting must be mappable to a common set. regards Keith.

Which doesn't at all address the issue that we don't all want the same master format. There's a demand for HTML for desktops, EPUB and MOBI at the least.
And there are at least two issues with this:

1) People love the format they know and love, and have chosen it because the other formats are more difficult and ugly, more difficult to use, and in practice generate inferior results on the platform that they use and love and which they author for.

2) Automatic conversion of one format to another is always a "dumbing down" process, known at PG as "blind formatting," where PG has always said we will not accept files that are blind formatted because they contribute nothing. Yet that is exactly what the proponents of the new super-uber-formats are proposing: they are taking existing PG books which are not blind formatted, and are now "blind down-format" converting them to the new super-uber-format. And in the process they throw away the efforts that the volunteers put into NOT blind formatting them -- because PG told the volunteers that it would not accept blind formatted files -- but now those who propose to crown themselves the Super-Uber-Users propose to do exactly that to all the volunteers' 40,000 books, throwing away all the effort that went into trying to make (hopefully) intelligent design decisions about how to represent the important formatting decisions of the paper book in an electronic edition.

When all that is really necessary is to "tweak" a half dozen lines in the CSS when targeting the small devices. PG already even has a placeholder for such tweaks. It's called pgepub.css, and it's currently almost empty, and what is in there isn't particularly "correct."

Hi Jim, If your master format is your distribution format things happen, too!

1) you are bound to that format
a) support for other formats is difficult.
b) you need to support many formats that are not compatible
2) if a new format comes out or changes ……?
3) forcing everybody to this format

A master format that is not bound to the distribution gives you:

1) You have a well-defined structure
2) you can convert from there
3) control restructuring of the format automagically
4) give users guidance of what would be acceptable input from their formats
5) have rules for best practices so that the broadest base can be reached

On 06.02.2012 at 05:44, Jim Adcock wrote:
David> What has this to do with master formats? Commercial publishers don't
publish their master files, you can at most see their distribution files. Mostly, DRM-ed.
If your master format *is* your distribution format then at least two good things happen:
1) What people see on their computer the moment before they submit it to PG is the same identical thing they or their friends see when they download it from PG. There is no tool-chain sausagemaker to screw things up.
2) If the distribution format is identical to the master format then each distribution into the world is another copy of the master format that is being preserved for posterity, and which immediately becomes useful for any derivative efforts.
In comparison, if PG picks an obscure format then the "master file" only lives as long as PG has a group of volunteers who are willing to keep the tool chain alive for all eternity. And are willing to volunteer to write to that format. As soon as they get sick of it then the master format becomes dead.
Amazon has a huge business publishing things submitted in EPUB format, which is the source format, but which they then *sometimes not always* distribute in their own DRM wrapper -- the choice of DRM or no DRM is up to the author. B&N and Apple also do these things, and the other smaller distributors. Agreed once something is DRM'ed it no longer matters what is used to be -- because now it is 100% vendor specific.
Amazon is now releasing the KF8 file format, which much more closely follows EPUB standards, presumably because the major publishing houses want to distribute to Amazon in EPUB format. Granted, not everyone writes directly in EPUB. Some who are paper-oriented author first in Dreamweaver, which in turn has a less-than-perfect EPUB output option, which authors typically clean up in Sigil, before publishing, or redistributing, in EPUB format. Some even author in Word Docs. So, EPUB can be, and is being used as, a primary, secondary, and tertiary format. The format is about "as simple as you can get" while still containing the things you really need to be "a book."

The problem here is we are not creating BOOKS -- WE ARE CONVERTING P-BOOKS TO EBOOKS. This process is completely different. It is actually just a mapping function as seen from the computational side. Meaning, we couldn't care less if we match actual book making. The test is that the end result is a usable etext/ebook. regards Keith.

Keith>A master format that is not bound to the distribution gives you (examples)....

You still have the problem that what you are sending out into the world is HTML, EPUB, and MOBI, formats that other people love to party on, and will. Because these formats are universally preferred by the great mass of real-world developers over your choice of super-uber-language, those partied-on versions are the ones that *will* be distributed, not your super-uber-language version, and PG will have stuck itself into a ghetto. You can then apply a fix for a scanno to *your* version, but that won't be the version that real people out there will read; they will read the partied-on version, which will be more attractive and more fun to read, and which still won't get your scanno fix applied -- in part because, once enough scannos have been fixed, real-world people stop caring about whether the last two or three scannos get fixed, because they never ever even see them. They just want an attractive book which is fun to read.

Hi Jim,

Really, now! The users who have these nice reading devices with EPUB and MOBI are going to tear the EPUB and MOBI files apart, fix errors, put them back together again, and submit them to PG!! Now who is living in the real world?

In case you are wondering, I am a natural born American. I also have an American passport. So U can take your sly remarks somewhere else, Mr. Jim "Hoover" Adcock.

regards Keith.

On 06.02.2012 at 19:30, Jim Adcock wrote:
Keith>A master format that is not bound to the distribution gives you (examples)....
You still have the problem that what you are sending out into the world is HTML, EPUB, and MOBI, formats that other people love to party on, and will. Because these formats are universally preferred by the great mass of real-world developers over your choice of super-uber-language, those partied-on versions are the ones that *will* be distributed, not your super-uber-language version, and PG will have stuck itself into a ghetto. You can then apply a fix for a scanno to *your* version, but that won't be the version that real people out there will read; they will read the partied-on version, which will be more attractive and more fun to read, and which still won't get your scanno fix applied -- in part because, once enough scannos have been fixed, real-world people stop caring about whether the last two or three scannos get fixed, because they never ever even see them. They just want an attractive book which is fun to read.

Carlo>What has this to do with master formats?

If you pick a master format that everyone uses then you can find tools to work on that format, and you can find volunteers to work on that format. If you pick a master format that no one uses, then even the simplest format, say "txt70" becomes a pain in the behind, because there is no tool support for that format, and volunteers are "forced" to do things they don't want to do, which is a sure way to burn through your volunteers.

Hi Jim,

Others have mentioned recently that it is not so much the master format, but what is inside. I will be discussing this real soon: "A new approach". Interesting things are starting here, and it seems there are those here willing to start changes toward a better system, without restricting things too much for the user. You might not see it, but I do.

regards Keith.

On 06.02.2012 at 05:12, Jim Adcock wrote:
Carlo>What has this to do with master formats?
If you pick a master format that everyone uses then you can find tools to work on that format, and you can find volunteers to work on that format. If you pick a master format that no one uses, then even the simplest format, say "txt70" becomes a pain in the behind, because there is no tool support for that format, and volunteers are "forced" to do things they don't want to do, which is a sure way to burn through your volunteers.

Excuse me, EVERYBODY! There is a BIG DIFFERENCE between a MASTER format and a PUBLISHING format! NO WONDER nothing sensible comes out of these threads. regards Keith.

On 05.02.2012 at 09:05, David Starner wrote:
On Sat, Feb 4, 2012 at 5:11 PM, Jim Adcock <jimad@msn.com> wrote:
Answering the implied question, the *rest* of the non-PG non-academic e-book publishing world has settled on EPUB as being *the* master format.
On one hand, that's not true. There are virtually no roleplaying game publishers who publish in EPUB. Virtually all of them publish in PDF. I suspect there are other niche markets where that's true.
On the other, so? What audience exactly do you think I'm doing the History of Bibliographies of Bibliographies for? Or Selections from Early Middle English: 1130-1250?

On the other, so? What audience exactly do you think I'm doing the History of Bibliographies of Bibliographies for? Or Selections from Early Middle English: 1130-1250?
But the PG-TEI crowd isn't talking about writing items of academic interest into TEI. They are talking about taking books of general public interest, which have already been written by caring volunteers in HTML, throwing away the HTML, and forcing everyone to work with PG-TEI, whether they like it or not. When all that really needs to happen is for a half-dozen lines of CSS to be "tweaked" -- which probably could be done automatically if Marcello's tool, for example, was on top of its game. If you want to work on an old English text in TEI or something for a doctoral thesis, and then donate that to PG, then do so. Just don't make the rest of us follow suit.

Jim> TEI is too geeky, and creates a source file which is dead from the
minute it is written, because the only person who cares to maintain it is the original author. And the HTML it generates stinks to high heaven.
Lee>Don't blame the markup language

I take back my objection to TEI on the basis of it making crappy HTML. Geek on TEI all you want, as long as the result is livable HTML that generates reasonably "book-like" results. A book is what goes into the system, and a book should be what comes back out. Just don't try to force the rest of the world to geek to that extent with you.

On Sat, Feb 04, 2012 at 12:10:24PM -0800, Jim Adcock wrote:
BBgee... somehow i got the impression that -- under the rubric of "crowdsourcing" -- you were _soliciting_ "improved" versions of p.g. books, versions that were "fixed" and "tweaked" such that they gave people "better" experiences, either because they had had their typos and errors "corrected", or because they were "more friendly" with some hardware/software, or a combination of both those factors. i _also_ thought that you proposed to "host" these "new" versions, _and_ that the "best" one -- out of many that were "submitted", thanks to dedicated efforts by volunteers -- might get "folded back into" the p.g. library....
I don't have anything against discussion of markup, transformation mechanisms, master formats, DP, etc. But my emphasis all along in these discussion threads has been on tools to enable aggregation and redistribution of variants on our eBooks. This is because we get frequent requests to add such files into the PG collection, and have no good, standard, scalable way to do so. Plus, complaints about shortcomings of existing mobile formats. (Contrariwise, end users -- readers -- never complain to help@ concerning HTML, master formats, the PG (and DP) processing workflow, and other topics we've been hashing out. That's not to say such topics are not important, or are unrelated.)
What I heard Greg say was indeed he was going allow these various different "improved" versions to be posted targeting various platforms, and that indeed he said that WW'ers would have the option in the future of folding back *parts* of the new effort into the existing PG source files when they felt doing so would make a contribution. I don't think he ever said anything about totally replacing current versions with totally new versions in totally new source languages, and I certainly didn't hear him suggest that the current source file formats of the current books were going to be totally replaced with new source file formats.
That's a nice rephrasing of what I've been thinking of. This whole approach is, essentially, a layer over the existing set of files of the eBooks in our collection. We already have those files, and a way to receive and update them. We don't have a good way of handling variations on those files (notably, derived formats), including crowdsourcing backports.
... And I still don't see where people who want to post these "improved" versions targeting specific platforms can post their efforts where PG customers can find them, so, I guess, in practice, Marcello holds the keys to the fortress.
We've only been discussing this for a week or two. I encourage anyone wanting to do their own experiments to go ahead. Nothing that Marcello or anyone else did is deployed to www.gutenberg.org yet. -- Greg

I would like to propose a new approach. As Greg said, the core problem is how to maintain several formats of ebooks, generated in several ways, when we get errata reports. The master format is useful for this since a toolchain can then regenerate all the derived versions. My proposal is to have different toolchains for generation and for updates.

I assume that all the formats contain a text, and this is substantially the txt70; and that all the formats contain the same text, possibly subdivided in different parts, organized differently (e.g. the footnotes). The encoding is not important: it does not matter if the HTML contains &lt; while the txt contains <, or if the HTML contains &mdash; while the text contains a dash. So I assume the txt to be UTF-8 (the dash example shows that this might be a bit simplified, but we may consider that "--" encodes a dash in non-UTF-8 PG txt files, exactly as &mdash; encodes a dash in HTML).

I also assume that most errata, the important ones, report a correction to the text, mostly fixing a typo. I suppose that errata for markup are relatively unusual. The problem is that, if the correction is accepted, we want to retrofit the correction to all the formats. Currently, this is done either by manually editing the master, or by manually editing the txt and HTML files, and then recomputing the derived formats. If we can retrofit the changes automatically from txt to the derived formats, then we can allow all the formats that we want.

This is the same type of problem that is dealt with for concurrent modifications in version control software like svn, except that here the problem should be handled at the character level, not at the line level. The only difference is that we need to operate with wdiff (or better dwdiff, or maybe even at a finer level) instead of diff (and patch). Problems might appear only if changes are done at the interface of words and markup. If you change "arid" to "and" and in HTML you have <i>arid</i>, it is clear that it has to become <i>and</i> (we have changed "ri" to "n" far from markup). Problems arise if you change "go" to "go!": should you change <i>go</i> to <i>go</i>! or to <i>go!</i>?

I am willing to investigate the issue on the basis of the errata reports that we receive, and possibly design software that can correct automatically all the formats (HTML, epub, mobi, either automatically derived or hand-crafted) from patches for txt. I was formerly on the errata@pglaf.org list; could I be added to the errata2010 mailing list for the purpose of studying the problem? Possibly with access to the archives, or at least part of them. Thanks. Carlo
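As a very rough first sketch of the retrofit idea Carlo describes (no such PG tool exists yet): word-diff the old and corrected txt, then replay each changed phrase inside the HTML, touching only text outside of tags. File names are placeholders, and the hard cases Carlo raises, such as corrections that land exactly on a markup boundary or span a line break, are deliberately not handled:

    import difflib
    import re

    def word_changes(old_text, new_text):
        """Yield (old_phrase, new_phrase) for each changed run of words."""
        old_words, new_words = old_text.split(), new_text.split()
        sm = difflib.SequenceMatcher(None, old_words, new_words)
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag != "equal":
                yield " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2])

    def patch_html(html, old_phrase, new_phrase):
        """Apply one correction to HTML text nodes only, never inside tags."""
        if not old_phrase:                     # pure insertions need context; skipped here
            return html
        parts = re.split(r"(<[^>]+>)", html)   # tags become separate chunks
        return "".join(p if p.startswith("<") else p.replace(old_phrase, new_phrase)
                       for p in parts)

    def retrofit(old_txt_path, fixed_txt_path, html_path, out_path):
        old_txt = open(old_txt_path, encoding="utf-8").read()
        new_txt = open(fixed_txt_path, encoding="utf-8").read()
        html = open(html_path, encoding="utf-8").read()
        for old_phrase, new_phrase in word_changes(old_txt, new_txt):
            html = patch_html(html, old_phrase, new_phrase)
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(html)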

Carlo>I also assume that most errata, the important ones, report a correction to the text, mostly fixing a typo. I suppose that errata for markup are relatively unusual.

This is where we disagree. What I see *overwhelmingly* is that the errors in the files of PG are *overwhelmingly* massively errors of formatting. I would be hard-put to find a half dozen scannos in a given PG file. I can often find 100s if not 1000s of formattos in the same PG file. It's just that somehow the people at PG have become blind and tone-deaf to issues of formatting. Again, things tend to "work" in HTML. It's just the other formats that fall down so badly.

"don" == don kretz <dakretz@gmail.com> writes:
don> Why would text files not be derived from the master like
don> other formats? I have expected that derivation can only be
don> automated from greater information density to lower.
don> I think however that testing ideas early against real data as
don> you suggest is important.
"Jim" == Jim Adcock <jimad@msn.com> writes:
Carlo> I also assume that most errata, the important ones, report a
Carlo> correction to the text, mostly fixing a typo. I suppose that
Carlo> errata for markup are relatively unusual,

Jim> This is where we disagree. What I see *overwhelmingly* is
Jim> that the errors in the files of PG are *overwhelmingly*
Jim> massively errors of formatting. I would be hard-put to find
Jim> a half dozen scannos in a given PG file. I can often find
Jim> 100s if not 1000s of formattos in the same PG file.
Jim> It's just that somehow the people at PG have become blind and
Jim> tone-deaf to issues of formatting. Again, things tend to
Jim> "work" in HTML. It's just the other formats that fall down
Jim> so badly.

I see that I have expressed myself incorrectly, since I have been misunderstood (not by Greg, I believe). I try again.

First, what is a master format? It is not a format for distribution; it is a format from which all other formats are derived. Hence it implies a toolchain to derive these formats, and it should be defined in a way that it will be able to derive future formats. Master formats are important since a modification (fix) to the master can be reflected in all the distributed formats. Moreover, when epub4 and Zoox formats (based on HTML6) are released, it will be easy to provide good epub4 and Zoox for all the books with a good and rich master file, taking advantage of the cool new features of HTML6 and epub4, just by adding new formats to the toolchain.

But PG has some 40000 books that don't have a good master format, and fixing a typo in a book having the standard hand-crafted 4 formats (HTML, txt-UTF-8, txt-8 and txt-7 (ASCII)) requires fixing 4 files and regenerating the other ones. And the problem will become worse if we allow hand-crafted epub and kindle files.

My proposal is a way to simplify the maintenance of the legacy formats for the requests sent to errata-MMX@pglaf.org. Errata like this one (many are much less clearly stated):

===========================
Title: Astounding Stories of Super-Science, October, 1930
Author: Various
Release Date: September 1, 2009 [EBook #29882]
Language: English

Page 7: "then low whirring noise" should be "then a low whirring noise".
"You can take of your gas mask" should be "You can take off your gas mask".
Page 103: Should "The fumes might attract prowlers" be "The flames might attract prowlers"? The image is very unclear.
Page 118: "subterranean action shock the electron" should be "subterranean action shook the electron".
Page 123: "with what the knew already" should be "with what she knew already".
Page 139: "To the left Is the better path" should be "To the left is the better path" - wrong case.
=========================

I have access to errata now, and I will be in a position to tell how many formattos are submitted as errata, and to suggest modifications to the errata procedures to allow an automatic correction of all the formats just by correcting the UTF-8 txt. Of course this does not address the "formattos" (nor, for example, the splitting of a paragraph in 2), but, if (as I suspect) errata receives mainly typos, this might be a substantial reduction of the workload for the errata team. And it might allow PG to accept e.g. handcrafted epub, and replacement of some "bad" autogenerated epubs with your "good" fixed epubs.

Carlo

PS: several posts have come while I was composing this one. I especially agree with David's last post.
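And a quick sketch of turning errata reports of the "X should be Y" form shown above into (wrong, corrected) pairs that could feed such a retrofit tool. This is illustration only; as Carlo notes, many reports are far less clearly stated, and question-form reports (like the Page 103 one) are simply skipped here:

    import re

    # Matches: "wrong phrase" should be "corrected phrase"
    ERRATUM = re.compile(r'"([^"]+)"\s+should\s+be\s+"([^"]+)"')

    def parse_errata(report_text):
        """Yield (wrong, corrected) pairs from one errata report."""
        for line in report_text.splitlines():
            m = ERRATUM.search(line)
            if m:
                yield m.group(1), m.group(2)

    # Example with two of the lines quoted above:
    report = '''Page 7: "then low whirring noise" should be "then a low whirring noise".
    Page 118: "subterranean action shock the electron" should be "subterranean action shook the electron".'''

    for wrong, corrected in parse_errata(report):
        print(wrong, "->", corrected)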

Carlo>But PG has some 40000 books that don't have a good master format, and fixing a typo in a book having the standard hand-crafted 4 formats (HTML, txt-UTF-8, txt-8 and txt-7 (ASCII)) requires fixing 4 files and regenerating the other ones. And the problem will become worse if we allow hand-crafted epub and kindle files.

And the only way to get those 40,000 books to some hypothetical super-uber-format is to blind-format down-convert them, which in the process converts those files to "blind formatting", something which PG has always said it would not accept, because blind formatted files contribute nothing.

On 2/5/2012 5:12 PM, Greg Newby wrote:
Contrariwise, end users -- readers -- never complain to help@ concerning HTML, master formats, the PG (and DP) processing workflow, and other topics we've been hashing out.
Actually, interest in volunteering to work on HTML, CSS, and master formats is exactly why I joined this list. Although I've been happy to have PG files to download for my Kindle 3, the majority of those I have encountered have been, IMO, so unnecessarily ungainly looking that I routinely grab the HTML versions instead, adjust the coding, and redo the CSS before reconverting to MOBI. Perhaps these were older files no longer representative of the style PG seeks to promulgate. Regardless, I'm hopeful that the new approaches being explored on this list will soon result in improved typography throughout PG releases. I've been lurking for about two months now, getting a feel for this list. Is there another [less overwhelming] PG or DP list that focuses more on matters related to style/presentation/format? Regards, Mark

You may find some help at the PGDP forums, which are (however slightly) moderated... Jeroen. On 2012-02-05 15:08, Mark Swofford wrote:
I've been lurking for about two months now, getting a feel for this list. Is there another [less overwhelming] PG or DP list that focuses more on matters related to style/presentation/format?
Regards, Mark

Greg>We've only been discussing this for a week or two. I encourage anyone wanting to do their own experiments to go ahead. Nothing that Marcello or anyone else did is deployed to www.gutenberg.org yet.

Sorry, I think some of us must be confused by Marcello deploying "his" version of the files at:

http://www.gutenberg.org/ebooks/1661

for example, which is a privilege he has given himself that none of the rest of us have. What I think we had imagined was that Marcello was going to come up with a way for *any* of us to post our examples, say to the "More Files" category at that location, rather than Marcello giving himself "special privileges" to post his stuff directly into the main directory, while keeping the rest of us out.

On Sun, Feb 05, 2012 at 08:27:30PM -0800, Jim Adcock wrote:
Greg>We've only been discussing this for a week or two. I encourage anyone wanting to do their own experiments to go ahead. Nothing that Marcello or anyone else did is deployed to www.gutenberg.org yet.
Sorry, I think some of us must be confused by Marcello deploying "his" version of the files at:
http://www.gutenberg.org/ebooks/1661
for example, which is a privilege he has given himself that none of the rest of us have. What I think we had imagined was that Marcello was going to come up with a way for *any* of us to post our examples, say to the "More Files" category at that location, rather than Marcello giving himself "special privileges" to post his stuff directly into the main directory, while keeping the rest of us out.
I agree this seems like cheating. However, Marcello does constantly tweak the auto-conversion process, and there's no reason to stop now. (Especially when there is a flurry of ideas, suggestions and criticism.) The experiment with Mercurial seems to me to be just an experiment, with a couple of eBooks. -- Greg

On 02/06/2012 06:36 AM, Greg Newby wrote:
On Sun, Feb 05, 2012 at 08:27:30PM -0800, Jim Adcock wrote:
Greg>We've only been discussing this for a week or two. I encourage anyone wanting to do their own experiments to go ahead. Nothing that Marcello or anyone else did is deployed to www.gutenberg.org yet.
Sorry, I think some of us must be confused by Marcello deploying "his" version of the files at:
http://www.gutenberg.org/ebooks/1661
for example, which is a privilege he has given himself that none of the rest of us have. What I think we had imagined was that Marcello was going to come up with a way for *any* of us to post our examples, say to the "More Files" category at that location, rather than Marcello giving himself "special privileges" to post his stuff directly into the main directory, while keeping the rest of us out.
I agree this seems like cheating.
Since when is being the first person to get his ass in gear cheating?

I always said I wanted to implement a master format. I never said I was going to implement vanity editions. In fact I don't believe PG has the resources to handle them.

If other people want to have vanity editions then *they* should come up with ways to implement them, e.g.:

- devise a new filesystem layout that accommodates vanity editions,
- talk the WWers into adopting it,
- fix the WWers' tools to work with it,
- create a workforce to check and catalog all vanity submissions,
- start posting books,
- explain to me how the database should pick up those files / metadata.

I'm not wasting my time crying and complaining and trying to stop other people from moving.

In fact I've already implemented all database functions for the posting of vanity epubs since DP expressed the wish to do so. If they didn't start posting yet, it's because they couldn't convince the WWers.

-- Marcello Perathoner webmaster@gutenberg.org

On Mon, Feb 06, 2012 at 10:57:47AM +0100, Marcello Perathoner wrote:
On 02/06/2012 06:36 AM, Greg Newby wrote:
On Sun, Feb 05, 2012 at 08:27:30PM -0800, Jim Adcock wrote:
Greg>We've only been discussing this for a week or two. I encourage anyone wanting to do their own experiments to go ahead. Nothing that Marcello or anyone else did is deployed to www.gutenberg.org yet.
Sorry, I think some of us must be confused by Marcello deploying "his" version of the files at:
http://www.gutenberg.org/ebooks/1661
for example, which is a privilege he has given himself that none of the rest of us have. What I think we had imagined was that Marcello was going to come up with a way for *any* of us to post our examples, say to the "More Files" category at that location, rather than Marcello giving himself "special privileges" to post his stuff directly into the main directory, while keeping the rest of us out.
I agree this seems like cheating.
Since when is being the first person to get his ass in gear cheating?
I'm ready to withdraw my comment, but cannot quite understand what you did. Can you explain? Earlier, you wrote:
The hg repository is live with 2 books.
http://www.gutenberg.org/ebooks/1342
http://www.gutenberg.org/ebooks/1661
For now, only the EPUB, Kindle and PDF versions were automatically generated from RST. Please download those versions to any real ereader devices and comment on the experience.
The repository containing the RST sources can be viewed (not edited) here:
My question is whether you are now generating some of the live contents at www.gutenberg.org/ebooks/1661 and 1342 from those bitbucket sources, or from the sources in www.gutenberg.org/files/1/6/6/1661/ and 1342 (i.e., "as usual"). I didn't see whether the log file is exposed at www.gutenberg.org, but they're accessible via rsync. Here is the full set of generated files, including the converter log: http://snowy.arsc.alaska.edu/mirrors/gutenberg/cache/generated/1661/
From the log, it is clear that several sources are from the files originally posted. So, is this just an improved converter?
Thanks for clarification. My view is that it would be cheating for you to bypass the original master files in favor of your own new masters, for the live site at www.gutenberg.org. Spinoffs and demos elsewhere would certainly not be cheating. And, updating the conversion at www.gutenberg.org to use the existing original master files is very much welcome. -- Greg
I always said I wanted to implement a master format. I never said I was going to implement vanity editions. In fact I don't believe PG has the resources to handle them.
If other people want to have vanity editions then *they* should come up with ways to implement them, eg.:
- devise a new filesystem layout that accommodates vanity editions,
- talk the WWers into adopting it,
- fix the WWers' tools to work with it,
- create a workforce to check and catalog all vanity submissions,
- start posting books,
- explain to me how the database should pick up those files / metadata.
I'm not wasting my time crying and complaining and trying to stop other people from moving.
In fact I've already implemented all database functions for the posting of vanity epubs since DP expressed the wish to do so. If they didn't start posting yet, it's because they couldn't convince the WWers.
-- Marcello Perathoner webmaster@gutenberg.org

On 02/06/2012 07:20 PM, Greg Newby wrote:
My view is that it would be cheating for you to bypass the original master files in favor of your own new masters, for the live site at www.gutenberg.org. Spinoffs and demos elsewhere would certainly not be cheating. And, updating the conversion at www.gutenberg.org to use the existing original master files is very much welcome.
I'm generating from my own RST masters instead of HTML for a selected few of the top 10 books. I converted those masters from the existing HTML files with a little script. There were no existing masters for those files.

I didn't delete anybody else's files nor did I remove anybody else's files from the catalog. The only files that changed are the generated ones, living in cache/epub; the ones that were previously generated from old crappy HTML are now generated from RST. (And they look a lot better than before if you ask me.)

I don't regard that as cheating, because I'm not competing against anybody, but acting in the PG spirit of "everybody can contribute everything".

As I already said, the web site is long since ready to accommodate hand-crafted epubs etc. alongside the generated ones. If there are none yet posted it is not my fault. Instead of bothering me, the people who advocate hand-crafted vanity epubs should have convinced the WWers to start posting them.

-- Marcello Perathoner webmaster@gutenberg.org
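Marcello's "little script" is not shown on the list. Purely as a guess at how such a bulk HTML-to-RST conversion could be scripted (an assumption on my part, not his actual tool, with placeholder file names), one could shell out to pandoc and then clean up the result by hand:

    import subprocess
    from pathlib import Path

    def html_to_rst(html_path, rst_path):
        """Convert one HTML file to RST with pandoc; hand clean-up is still needed."""
        subprocess.run(
            ["pandoc", "-f", "html", "-t", "rst", "-o", str(rst_path), str(html_path)],
            check=True,
        )

    for book in ("1342", "1661"):                  # the two experimental books
        html_to_rst(Path(f"{book}-h.htm"),         # placeholder input name
                    Path(f"{book}.rst"))           # placeholder output name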

I seem to notice a pattern ... Is it generally accepted practice to blame the whitewashers whenever a system design and implementation fails to be adopted?

On Mon, Feb 6, 2012 at 11:50 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
On 02/06/2012 07:20 PM, Greg Newby wrote:
My view is that it would be cheating for you to bypass the
original master files in favor of your own new masters, for the live site at www.gutenberg.org. Spinoffs and demos elsewhere would certainly not be cheating. And, updating the conversion at www.gutenberg.org to use the existing original master files is very much welcome.
I'm generating from my own RST masters instead of HTML for a selected few of the top 10 books. I converted those masters from the existing HTML files with a little script. There were no existing masters for those files.
I didn't delete anybody else's files nor did I remove anybody else's files from the catalog. The only files that changed are the generated ones, living in cache/epub; the ones that were previously generated from old crappy HTML are now generated from RST. (And they look a lot better than before if you ask me.)
I don't regard that as cheating, because I'm not competing against anybody, but acting in the PG spirit of "everybody can contribute everything".
As I already said, the web site is long since ready to accommodate hand-crafted epubs etc. alongside the generated ones. If there are none yet posted it is not my fault. Instead of bothering me, the people who advocate hand-crafted vanity epubs should have convinced the WWers to start posting them.
-- Marcello Perathoner webmaster@gutenberg.org

Marcello>In fact I've already implemented all database functions for the posting of vanity epubs since DP expressed the wish to do so. If they didn't start posting yet, it's because they couldn't convince the WWers.

Just to state the obvious, it is WAY past time for Marcello and Greg to work this issue out between themselves, as they are working at cross purposes, doing so publicly, and in the process scaring the hell out of us volunteers who donate our time and efforts to PG and DP.
participants (13)

- Bowerbird@aol.com
- David Starner
- don kretz
- Greg Newby
- Jeroen Hellingman
- Jim Adcock
- Jimmy O'Regan
- Karen Lofstrom
- Keith J. Schultz
- Lee Passey
- Marcello Perathoner
- Mark Swofford
- traverso@posso.dm.unipi.it