pride. and prejudice. incremental.

ok, greg, new versions of "pride and prejudice", incremental improvements for the tournament.
http://zenmagiclove.com/prapr/prapr.mobi http://zenmagiclove.com/prapr/prapr.html http://zenmagiclove.com/prapr/prapr.zml
http://zenmagiclove.com/prapr/prapr.ncx http://zenmagiclove.com/prapr/prapr.opf http://zenmagiclove.com/prapr/prapr.jpg
*** i learned some interesting things doing this... i began by stripping some cruft from my .css. as i did, more and more, i started to wonder how much of it i could strip out completely... it turned out that i could strip out quite a lot, and still get out essentially the _same_ .mobi. until eventually i dispensed with the stylesheet. what's in the .html file above is now just this:
p {} blockquote {background-color:yellow;} i {color:blue;}
and that's only to prove that kindlegen doesn't disregard the stylesheet, like some people say. because you will see those colors _do_ work -- when you display the .mobi on a color machine. but it's also true that kindlegen _does_ ignore a lot of the stuff you might put in a stylesheet. most especially, it seems, margins and padding.

it's the exact equivalent of the hard lesson that designers had to learn as they went from print -- where they could control almost everything -- to the web, where you "control" almost nothing. they had to force themselves to abandon their desire for full control, and embrace flexibility. and that's why i wanted to run an experiment where i relinquished all that i possibly could, where i said, "ok, then show it how _you_ like." and know what? it didn't do a bad job at all. i was fine with it, aside from a couple details... one such example: it didn't center the headers. so i moved those exceptions into in-line styles, and boom, a simple-yet-quite-acceptable mobi.

a list of the .html tags -- the complete set that is used in this book -- is appended to this post. but the bottom line is simple: let go. go zen. there are thousands and thousands of books in the p.g. library that can be treated this way.

-bowerbird

[mbp:pagebreak] == pagebreaks
[div id="chunk0"] == sections
[h2 style="text-align:center;"] (h1, h2, h4) == headers
[p] == paragraphs
[p style="text-align:left;text-indent:0;"] == table-of-contents
[p style="text-align:center;text-indent:0;"] == scenebreaks
[i] == italics
[br] == breaks
[hr] == horizontal rules
[blockquote style="text-indent:1.25em;"] == blockquotes
[a id="chapter_1"] == internal-idnames
[a href="#chapter_1"] == internal-links
[a href="http://zenmagiclove.com/prapr/"] == external-links
[p style="text-align:right;"] == chapter-links
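The tag inventory appended above can be produced mechanically. A minimal sketch, using only the Python standard library, of counting every element that appears in an HTML file (the sample input here is illustrative, not the actual book text):

```python
# Sketch: build a tag inventory like the one appended above, by
# counting every element that appears in a chunk of HTML.
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        # Count each opening tag; attributes are ignored here.
        self.tags[tag] += 1

tc = TagCounter()
tc.feed('<h2>Chapter 1</h2><p>It is a truth <i>universally</i> acknowledged...</p>')
print(tc.tags.most_common())
# [('h2', 1), ('p', 1), ('i', 1)]
```

Run over a whole book, this gives exactly the "complete set of tags used" list, plus counts, which makes it easy to verify that no stray markup survived the stripping experiment.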

Lots of facets to this lesson - minimal markup is best. It's the same reason TEI looks nearly reasonable with no markup. A lot of the refactoring from old DP html to portable html is going to be just taking things out and remarking a few things syntactically (like chapter headings).

Rather than inline style, you could also have just changed it to: <chapter_heading>Chapter 1</chapter_heading> and added to your stylesheet: chapter_heading { display: block; plus the other css stuff to match <h2>, or anything more to your taste } and you would have exactly the same appearance, with greater granularity of control over all the chapter headings. And if you change the css for whatever reason, you run no risk of unintentionally messing up all the other stuff someone has called "<h2>".

I'm going through deformatting a project into wordpress now (not Encyclopedia Britannica - more about this project later). I didn't choose it because it was good or bad in any way - it's a very random project. It's not starting out auspiciously. In the title page there's a poem. The poem resides in a table, all its own; otherwise the table is serving no purpose. I have simplified it down to: <poem> <attribution>poet</attribution> </poem> (which, yes, is 100% equivalent to <div class="poem"> and <div class="attribution">, but it's simpler, and displays just as well; and both can be mapped properly to any other markup that understands poems and attributions.) And with simple css - nothing inline, no <br/>s, nothing - it looks much better than it does in the table - with less markup than the table has, in fact. Since it's simpler and less ambiguous, the proofer, formatter, and PPer all can have a greater sense of control over the process.

Both these cases illustrate the general principle for refactoring DP html: identify things syntactically and remove ambiguity, and there's not nearly as much to do (and the result usually looks better).

On Tue, Feb 7, 2012 at 1:13 PM, <Bowerbird@aol.com> wrote:
ok, greg, new versions of "pride and prejudice", incremental improvements for the tournament.

BTW, if that's z-m-l, then you could have it integrated into wordpress as a fully functional input format, with html one-button preview and everything, with a couple hours' work. If that's the extent of what's required for this project, that much could be functional in half an hour.

Hi Don, Everybody, On 08.02.2012 at 00:28, don kretz wrote:
Lots of facets to this lesson - minimal markup is best.
It's the same reason TEI looks nearly reasonable with no markup.
A lot of the refactoring from old DP html to portable html is going to be just taking things out and remarking a few things syntactically (like chapter headings).
Rather than inline style, you could also have just changed it to:
<chapter_heading>Chapter 1</chapter_heading>
and added to your stylesheet: chapter_heading { display: block; plus the other css stuff to match <h2>, or anything more to your taste }
and you would have exactly the same appearance, with greater granularity of control over all the chapter headings. And if you change the css for whatever reason, you run no risk of unintentionally messing up all the other stuff someone has called "<h2>". I am glad you brought this up. It shows the need for a separate markup language. That is, do not use the generic features of a given language directly, but use extension as much as possible.
I'm going through deformatting a project into wordpress now (not Encyclopedia Britannica - more about this project later). I didn't choose it because it was good or bad in any way - it's a very random project.
It's not starting out auspiciously. In the title page there's a poem. The poem resides in a table, all its own, otherwise the table is serving no purpose. I have simplified it down to:
<poem>
<attribution>poet</attribution> </poem>
(which, yes, is 100% equivalent to <div class="poem"> and <div class="attribution">, but it's simpler, and displays just as well; and both can be mapped properly to any other markup that understands poems and attributions.)
and with simple css - nothing inline, no <br/>s, nothing - it looks much better than it does in the table - with less markup than the table has, in fact. Since it's simpler and less ambiguous, the proofer, formatter, and PPer all can have a greater sense of control over the process.
Both these cases illustrate the general principle for refactoring DP html: Identify things syntactically and remove ambiguity, and there's not nearly as much to do (and the result usually looks better).
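The mapping described above -- semantic tags on one side, equivalent classed HTML on the other -- can be done mechanically. A minimal sketch, where the tag vocabulary (`chapter_heading`, `poem`, `attribution`) is taken from this thread but the mapping table itself is an illustration, not an agreed DP/PG standard:

```python
# Sketch: translate semantic tags like <chapter_heading> into their
# classed-HTML equivalents, so the text carries structure and a
# stylesheet carries presentation. The TAG_MAP entries are examples.
import re

TAG_MAP = {
    "chapter_heading": ("h2", "chapter_heading"),
    "poem":            ("div", "poem"),
    "attribution":     ("div", "attribution"),
}

def to_html(text):
    for custom, (tag, cls) in TAG_MAP.items():
        text = re.sub(rf"<{custom}>", f'<{tag} class="{cls}">', text)
        text = re.sub(rf"</{custom}>", f"</{tag}>", text)
    return text

print(to_html("<chapter_heading>Chapter 1</chapter_heading>"))
# <h2 class="chapter_heading">Chapter 1</h2>
```

Because the mapping lives in one table, restyling every chapter heading (or retargeting the whole vocabulary to another markup system) is a one-line change rather than an edit to every file.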
On Tue, Feb 7, 2012 at 1:13 PM, <Bowerbird@aol.com> wrote: ok, greg, new versions of "pride and prejudice", incremental improvements for the tournament.
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

On Tue, February 7, 2012 4:28 pm, don kretz wrote:
Lots of facets to this lesson - minimal markup is best.
Yes.
It's the same reason TEI looks nearly reasonable with no markup.
Not necessarily, but at this point complete accuracy on this point is not very relevant.
A lot of the refactoring from old DP html to portable html is going to be just taking things out and remarking a few things syntactically (like chapter headings).
And removing the internal style sheet and adding a link to a generic .css file.
Rather than inline style, you could also have just changed it to:
<chapter_heading>Chapter 1</chapter_heading>
Or <h3 class="chapter">Chapter 1</h3>. But you obviously know that.
and added to your stylesheet: chapter_heading { display: block; plus the other css stuff to match <h2>, or anything more to your taste }
/* Center first three headers */
h1, h2, h3 { text-align: center; font-weight: bold }

/* Put a little extra before chapter breaks */
h3.chapter { margin-top: 2em; margin-bottom: 1em; page-break-before: always; font-size: larger }
and you would have exactly the same appearance, with greater granularity of control over all the chapter headings. And if you change the css for whatever reason, you run no risk of unintentionally messing up all the other stuff someone has called "<h2>".
Now we're getting to the nub of the issue. The rule I derive from what you have said is: For every document the document structure and the document presentation need to be segregated. This segregation should be accomplished in such a way that the document presentation can be changed without significantly changing the document structure. [little snip]
In the title page there's a poem. The poem resides in a table, all its own, otherwise the table is serving no purpose.
I think that was my rule #4: <table> may only be used for tabular data, never for display/presentation.
I have simplified it down to:
<poem> <attribution>poet</attribution> </poem>
(which, yes, is 100% equivalent to <div class="poem"> and <div class="attribution">, but it's simpler, and displays just as well;
One man's opinion, with which I happen not to agree...
and both can be mapped properly to any other markup that understands poems and attributions.)
Or you could have simplified it down to: <poetry> <by>poet</by> <verse>....</verse> </poetry>

Which brings me to the point I have been trying to emphasize over the past several weeks. If we're planning to be able to swap out presentation, we have to use the same elements (tags) with the same attributes for the same purpose across all documents. <div class="poem"> or <poem> or <poetry> are all acceptable to me, but /there can be only one/!

The less generic an element is (<poem> as opposed to <div class="poem">), the easier it is to programmatically validate markup. For example, if <poem> is permitted by the DTD, XMLLint or Jing will object if it encounters <poetry>; but it would /not/ object if it encounters <div class="poetry">. This does not mean we couldn't develop a tool to validate that all <div>s must be classified, or to check that the classification values are in the set of allowed values for the element; but an existing DTD or RelaxNG validator is not going to catch those problems.

But none of these tools can check to be sure that valid tags are applied correctly. The problem with a presentational focus is that a contributor will use a valid tag incorrectly. For example, pretend that the <poem> tag, by default, is assigned a display type of "pre" and a monospaced font. Now pretend we have a book that has a quoted telegram as part of the text. A contributor may be tempted to mark the telegram as <poem> because it has the desired default presentation, even though the semantics are incorrect. (In HTML, the correct tag would be <tt>.)

I believe that volunteers are completely capable of marking <poem> and <tt> correctly, but only if there is a simple document that says, "<poem> may only be used for poetry. Don't use it for things that aren't poetry just because it looks right. Other options may be <tt>, ..., etc. If you aren't sure, post a message to the PP group."
To summarize: Presentational markup (aka Cascading Style Sheets) and structural markup should be segregated into separate files. Structural markup should be strictly constrained to established patterns so different style sheets can be applied to the structural document according to a user's preference. Markup constraints must be clearly and unambiguously documented in a format that a volunteer can read and understand.
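The "classified <div>" check that a DTD or RelaxNG validator cannot do is small enough to sketch. Here is a minimal illustration using the Python standard library; the allowed tag and class vocabularies are examples invented for this sketch, not an agreed list:

```python
# Sketch of a check a schema validator won't do: every tag must come
# from an agreed set, and every <div> must carry a class from an
# agreed vocabulary. Vocabularies here are illustrative only.
from html.parser import HTMLParser

ALLOWED_TAGS = {"html", "body", "p", "div", "h2", "i", "blockquote"}
ALLOWED_DIV_CLASSES = {"poem", "attribution", "stanza"}

class MarkupChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            self.errors.append(f"unknown tag <{tag}>")
        elif tag == "div":
            cls = dict(attrs).get("class")
            if cls not in ALLOWED_DIV_CLASSES:
                self.errors.append(f"<div> with class {cls!r} not allowed")

checker = MarkupChecker()
checker.feed('<div class="poetry"><p>text</p></div>')
print(checker.errors)
# ["<div> with class 'poetry' not allowed"]
```

As the summary says, no tool of this kind can verify that a valid tag is used with the right *meaning*; that part still needs clear written guidelines for volunteers.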

Lee>If we're planning to be able to swap out presentation, we have to use the same elements (tags) with the same attributes for the same purpose across all documents. <div class="poem"> or <poem> or <poetry> are all acceptable to me, but /there can be only one/!

Are *you* planning to re-tag the existing 40,000 texts? If not, I suggest a better approach would be to come up with "suggestions" for tags which submitters can use moving forward, to be more consistent and to make PG's life simpler. And if you are considering doing that, consider that the large majority of formatters submitting to PG have *already* agreed to a set of tag names which they already know and love, so there is no reason to make them learn a new set of tag names. Co-opt these existing experts by agreeing to use *their* choice of tag names rather than insisting that they use your newly invented ones.

On Wed, February 8, 2012 12:05 pm, James Adcock wrote:
Are *you* planning to re-tag the existing 40,000 texts?
Yup. If you go back and look, that was Mr. Hutchinson's original proposal that I bought in to:

1. Agree to a master format from which all other "user" formats can be reliably derived,
2. Develop tool chains which can reliably derive all other "user" formats,
3. Rework PG's existing corpus to comply with the agreed-upon master format.

Right now, new work is not on my radar. Creating a new user interface for Distributed Proofreaders to use to create new works in the agreed-upon master format is a worthy goal, but it is not /my/ goal. Most of the value of Project Gutenberg /right now/ is probably not in the 40,000 existing works as a whole; it is probably within the first 5000. Finding ways to significantly improve those works is what I want to do. In some cases, it may involve actually replacing the PG texts with new versions created from scratch. I would hope that the improvement process could be made simple enough that volunteers from DP would join me, but an improvement process would be significantly different from what DP does now, so improvements to the DP workflow are of only tangential interest to me.
If not, I suggest a better approach would be to come up with "suggestions" for tags which submitters can use moving forward to be more consistent and to make PG's life simpler moving forward.
I certainly believe that as we move towards a consensus on a master format, existing practices should be carefully considered for adoption. If anyone is suggesting starting over, there should be a compelling reason to do so (that compelling reason may exist, I just haven't seen it yet). But mere "suggestions" are inadequate. As a programmer, I can't deal with suggestions, I can only deal with rules. I don't care what the rules are, I just need to know.

And yup. I thought that was what was requested explicitly. On Wed, Feb 8, 2012 at 11:33 AM, Lee Passey <lee@novomail.net> wrote:
On Wed, February 8, 2012 12:05 pm, James Adcock wrote:
Are *you* planning to re-tag the existing 40,000 texts?
Yup.

Are *you* planning to re-tag the existing 40,000 texts?
And yup. I thought that was what was requested explicitly.
Wow, you're doing 40,000 texts. That is quite courageous. I've thought about doing this myself, and I came to the conclusion I would get burned out after reworking about 100 books -- while receiving extreme negative feedback from the PG/DP community for each of those 100 efforts.

On Wed, Feb 8, 2012 at 9:33 AM, Lee Passey <lee@novomail.net> wrote:
I would hope that the improvement process could be made simple enough that volunteers from DP would join me, but an improvement process would be significantly different from what DP does now, so improvements to the DP workflow are of only tangential interest to me.
Sign me up. I'm doing a lot at DP, also planning to work on Koine Greek and Hawaiian digitization projects, but I think this effort is long overdue. -- Karen Lofstrom

But mere "suggestions" are inadequate. As a programmer, I can't deal with suggestions, I can only deal with rules. I don't care what the rules are, I just need to know.
As a programmer, presumably you recognize that any text which doesn't follow your rules, when processed by your tools, will result in something being broken. You also realize that when a volunteer submission, perhaps already prepared at another location, walks in the front door and doesn't magically follow your set of rules -- because, guess what, even if hypothetically PG/DP volunteers buy into your rules, no one else will -- then PG will have no way to accept that "freebie", and that text will be lost. You also realize that any of the 40,000 existing texts that are not converted to your rules will not be able to be processed by your tools, and will become "lost to the world" or "dead sea scrolls" or whatever you want to call them. Again, I suggest you think of your rules as being suggestions, in which case they have a greater chance of being accepted positively.

I do think that there is room for someone who understands just how difficult the task is, and how much is actually involved, to make a contribution. Again, start by taking a hard look at the existing texts, new and old, the tools being used currently to generate those files, the commonalities used in currently coding those texts, and the large, knowledgeable, and vibrant volunteer community at DP.

As a simple example: I have spent some months trying to convince people at PG and DP to change ONE line of code by 26% in order to remove 90% of the problems I see with the code generated by PG which is consumed by about 30% of its customers. But to no avail, in part because there are factions at DP and PG who do not want PG customers to have a positive reading experience.

Hi James, Can you help me? Where do I find the documentation to the agreed tags? regards Keith. On 08.02.2012 at 20:05, James Adcock wrote:
Lee>If we're planning to be able to swap out presentation, we have to use the same elements (tags) with the same attributes for the same purpose across all documents. <div class="poem"> or <poem> or <poetry> are all acceptable to me, but /there can be only one/!
Are *you* planning to re-tag the existing 40,000 texts? If not, I suggest a better approach would be to come up with "suggestions" for tags which submitters can use moving forward, to be more consistent and to make PG's life simpler. And if you are considering doing that, consider that the large majority of formatters submitting to PG have *already* agreed to a set of tag names which they already know and love, so there is no reason to make them learn a new set of tag names. Co-opt these existing experts by agreeing to use *their* choice of tag names rather than insisting that they use your newly invented ones.

Where do I find the documentation to the agreed tags?
Besides actually looking for existing patterns in the actual HTML code being submitted to PG, take a look at:

http://sourceforge.net/projects/guiguts/
http://www.pgdp.net/c/faq/document.php (especially the alphabetic index just below, http://www.pgdp.net/c/faq/document.php#f_errors)
http://www.pgdp.net/c/tools/pool.php?pool_id=PP
http://www.pgdp.net/c/faq/formatting_summary.pdf

Some things which seem to be more-or-less agreed upon, as examples: toc, start, tb, i, b, pagenum, sc, g, hr, blockquote, poem, stanza, chapter, section, author, footnote, sidenote, title, sup, sub, illus, dropcap, allcap.

Below find one example of metadata definitions as implemented by people who do this for a living: http://idpf.org/epub/30/spec/epub30-publications.html

On 9 February 2012 15:16, Jim Adcock <jimad@msn.com> wrote:
Below find one example of metadata definitions as implemented by people who do this for a living:
That's a specification for a container format for metadata. If you check, you'll see that the PG catalogue already uses several of the same ontologies. Same metadata, different format. -- <Sefam> Are any of the mentors around? <jimregan> yes, they're the ones trolling you

On Thu, Feb 09, 2012 at 04:08:33PM +0000, Jimmy O'Regan wrote:
On 9 February 2012 15:16, Jim Adcock <jimad@msn.com> wrote:
Below find one example of metadata definitions as implemented by people who do this for a living:
That's a specification for a container format for metadata. If you check, you'll see that the PG catalogue already uses several of the same ontologies. Same metadata, different format.
Mostly the same format, too. PG uses Dublin Core, with some additional namespaces. It's a big download, but look at the catalog in XML/RDF format in the "offline catalogs" page of www.gutenberg.org Also, you can find an RDF file in the generated files, with just the metadata for that book. They're not as easy to find at www.gutenberg.org, but they're part of the rsync-able contents ("mirroring how-to"). For example, ftp://snowy.arsc.alaska.edu/mirrors/gutenberg/cache/generated/1661/ or http://snowy.arsc.alaska.edu/mirrors/gutenberg/cache/generated/1661/ Specifically, http://snowy.arsc.alaska.edu/mirrors/gutenberg/cache/generated/1661/pg1661.r... -- Greg
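Since the per-book RDF files carry Dublin Core elements in XML, extracting fields from them is straightforward. A minimal sketch: the miniature record below is invented for illustration (it is not the actual pg1661 file), but the dc: namespace URI is the standard Dublin Core elements namespace:

```python
# Sketch: pull Dublin Core fields out of an RDF/XML record of the
# general shape PG publishes. The record below is a made-up miniature.
import xml.etree.ElementTree as ET

rdf = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description>
    <dc:title>The Adventures of Sherlock Holmes</dc:title>
    <dc:creator>Doyle, Arthur Conan</dc:creator>
  </rdf:Description>
</rdf:RDF>"""

ns = {"dc": "http://purl.org/dc/elements/1.1/"}
root = ET.fromstring(rdf)
title = root.find(".//dc:title", ns).text
creator = root.find(".//dc:creator", ns).text
print(title, "/", creator)
# The Adventures of Sherlock Holmes / Doyle, Arthur Conan
```

The real PG catalog records carry more namespaces and structure than this sketch shows, but the same namespace-aware lookup applies.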

Greg, How easy is it for the public / DP / us to help collect and improve that data, crowd-source- or otherwise? On Thu, Feb 9, 2012 at 12:22 PM, Greg Newby <gbnewby@pglaf.org> wrote:
On Thu, Feb 09, 2012 at 04:08:33PM +0000, Jimmy O'Regan wrote:
On 9 February 2012 15:16, Jim Adcock <jimad@msn.com> wrote:
Below find one example of metadata definitions as implemented by people who do this for a living:
That's a specification for a container format for metadata. If you check, you'll see that the PG catalogue already uses several of the same ontologies. Same metadata, different format.
Mostly the same format, too. PG uses Dublin Core, with some additional namespaces.
It's a big download, but look at the catalog in XML/RDF format in the "offline catalogs" page of www.gutenberg.org
Also, you can find an RDF file in the generated files, with just the metadata for that book. They're not as easy to find at www.gutenberg.org, but they're part of the rsync-able contents ("mirroring how-to"). For example,
ftp://snowy.arsc.alaska.edu/mirrors/gutenberg/cache/generated/1661/ or http://snowy.arsc.alaska.edu/mirrors/gutenberg/cache/generated/1661/
Specifically,
http://snowy.arsc.alaska.edu/mirrors/gutenberg/cache/generated/1661/pg1661.r...
-- Greg

Greg, Is it possible to imagine poaching off the metadata other people have? For one example, could TIA be mined to acquire a list of books whose images are harvestable? And could some of them be determined clearable to build a list of projects that could be begun with lower-than-usual overhead?

On Thu, February 9, 2012 1:42 pm, don kretz wrote:
Is it possible to imagine poaching off the metadata other people have?
For one example, could TIA be mined to acquire a list of books whose images are harvestable? And could some of them be determined clearable to build a list of projects that could be begun with lower-than-usual overhead?
For WEM metadata, Open Library is probably the best open-source repository. Interestingly, Open Library contains some references to Project Gutenberg works, but only those which have been independently posted to IA. The OL people have consistently expressed complete indifference to the PG corpus. There is perhaps an opportunity here to find a way to do an automated upload of references to the PG corpus to OL, and at the same time recording the OLID for each work back into the PG metadata.

On Thu, Feb 09, 2012 at 12:42:12PM -0800, don kretz wrote:
Greg,
Is it possible to imagine poaching off the metadata other people have?
For one example, could TIA be mined to acquire a list of books whose images are harvestable? And could some of them be determined clearable to build a list of projects that could be begun with lower-than-usual overhead?
I don't think I understand this suggestion. We were talking about metadata in the books that PG completes/posts. It seems you are talking about projects that are not yet complete. I do think that might be a good idea for identifying candidates for scanning, especially if we could somehow filter out items we already have (*that* part can be rather difficult, due to variations in titles). That seems a somewhat different focus than most of what we've been talking about, though, so perhaps I'm just misunderstanding what you are suggesting. Alternatively, some of my earlier messages on this might have clarified what we have. -- Greg

On Thu, Feb 09, 2012 at 12:38:09PM -0800, don kretz wrote:
Greg,
How easy is it for the public / DP / us to help collect and improve that data, crowd-source- or otherwise?
We only get a few corrections per week, so don't need extra labor at this point for that purpose. Email corrections catalog2012@pglaf.org, and Andrew Sly will take care of it. Adding subject headings could use some crowdsourcing, but only with somewhat trained crowds: it involves looking up a record in LoC and getting the subject heading. It's probably possible to get some progress automatically, too. Ditto for author birth/death dates. The .RST example below shows you the types of metadata that can get added. Some comes with the book (author, title; PG stuff like eBook number), some gets added automatically (filenames, sizes, MIME types), some gets added by hand (author death/birth, LCSH). The non-automatic stuff is tweaked by hand as needed (such as, when a title is not correctly harvested automatically, or author name variations need to be conflated). Andrew takes very good care of this, without much additional help. Large-scale adding of better author or subject info would take an added level of effort, which he could explain better than I. -- Greg
On Thu, Feb 9, 2012 at 12:22 PM, Greg Newby <gbnewby@pglaf.org> wrote:
On Thu, Feb 09, 2012 at 04:08:33PM +0000, Jimmy O'Regan wrote:
On 9 February 2012 15:16, Jim Adcock <jimad@msn.com> wrote:
Below find one example of metadata definitions as implemented by people who do this for a living:
That's a specification for a container format for metadata. If you check, you'll see that the PG catalogue already uses several of the same ontologies. Same metadata, different format.
Mostly the same format, too. PG uses Dublin Core, with some additional namespaces.
It's a big download, but look at the catalog in XML/RDF format in the "offline catalogs" page of www.gutenberg.org
Also, you can find an RDF file in the generated files, with just the metadata for that book. They're not as easy to find at www.gutenberg.org, but they're part of the rsync-able contents ("mirroring how-to"). For example,
ftp://snowy.arsc.alaska.edu/mirrors/gutenberg/cache/generated/1661/ or http://snowy.arsc.alaska.edu/mirrors/gutenberg/cache/generated/1661/
Specifically,
http://snowy.arsc.alaska.edu/mirrors/gutenberg/cache/generated/1661/pg1661.r...
-- Greg

Greg> PG uses Dublin Core, with some additional namespaces.

We are confusing catalog storage with book storage. When PG sends, say, an HTML book out into the world, the catalog information does not travel with the book. In the case, as an example only, of an EPUB book (properly coded), the book information *does* travel out into the world with the book. I suggest again: if PG is in the business of preserving books then it should send the whole book out into the world, not keep some parts of the book only in the PG catalog.

On Thu, Feb 09, 2012 at 02:24:13PM -0800, James Adcock wrote:
Greg> PG uses Dublin Core, with some additional namespaces.
We are confusing catalog storage with book storage. When PG sends, say, an HTML book out into the world, the catalog information does not travel with the book. In the case, as an example only, of an EPUB book (properly coded), the book information *does* travel out into the world with the book. I suggest again: if PG is in the business of preserving books then it should send the whole book out into the world, not keep some parts of the book only in the PG catalog.
We have partial metadata in the text/html. Are you talking about adding DC metadata in what is visible? Or, encoded but not necessarily visible? Either seems pretty achievable. One slight snag is that some of the metadata is only added after the book's files are pushed to the server (notably, LCSH). Note, however, that the RST *does* include all the metadata (as I pointed out in a section of the message you cut). Getting that same data into derived formats would be quite easy. -- Greg
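For the "encoded but not necessarily visible" option, one conventional approach is Dublin Core fields embedded as <meta> elements in the HTML head, using the "DC." name prefix. A minimal sketch, with illustrative field values:

```python
# Sketch: render a small Dublin Core record as invisible <meta>
# tags suitable for an HTML <head>. The record values are examples.
record = {
    "title": "Emma",
    "creator": "Austen, Jane",
    "publisher": "Project Gutenberg",
}

meta_tags = "\n".join(
    f'<meta name="DC.{field}" content="{value}" />'
    for field, value in record.items()
)
print(meta_tags)
```

This way the catalog data travels inside the book file without changing anything the reader sees, and tools that understand the DC.* convention can recover it.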

On Thu, February 9, 2012 3:49 pm, Greg Newby wrote:
Note, however, that the RST *does* include all the metadata (as I pointed out in a section of the message you cut). Getting that same data into derived formats would be quite easy.
I didn't see any reference to ReStructured Text in your earlier message. Did you really mean RDF?

On Thu, Feb 09, 2012 at 04:06:16PM -0700, Lee Passey wrote:
On Thu, February 9, 2012 3:49 pm, Greg Newby wrote:
Note, however, that the RST *does* include all the metadata (as I pointed out in a section of the message you cut). Getting that same data into derived formats would be quite easy.
I didn't see any reference to ReStructured Text in your earlier message. Did you really mean RDF?
I did. And the RDF includes XML with DC. Lotta acronyms... :)

Greg>We have partial metadata in the text/html. Are you talking about adding DC metadata in what is visible? Or, encoded but not necessarily visible? Either seems pretty achievable.

Well, consider the "metadata" I can find in one particular PG HTML file. Different tools grab and process this "metadata" in different ways -- not surprising, since the "metadata" is presented in self-conflicting ways:

<title>Emma, by Jane Austen</title>
The Project Gutenberg EBook of Emma, by Jane Austen
Title: Emma
Author: Jane Austen
Release Date: January 21, 2010
Ebook #158
PROJECT GUTENBERG EBOOK EMMA
Produced by An Anonymous Volunteer, and David Widger
<h1>EMMA</h1>
<h2>By Jane Austen</h2>
<h1>VOLUME I</h1>
End of the Project Gutenberg EBook of Emma, by Jane Austen
*** END OF THIS PROJECT GUTENBERG EBOOK EMMA ***
***** This file should be named 158-h.htm or 158-h.zip *****

Note for example that the title information is presented in five different conflicting ways. I.e., PG has given this book five different "titles." This HTML "metadata" is somehow supposed to be mapped (for example) onto Dublin Core: contributor, creator, date, description, format, identifier, language, publisher, rights, source, subject, title. And now compare to the PG RDF catalog (which does this somewhat better): http://www.gutenberg.org/ebooks/158.rdf

One simple approach (for example) might be simply to append the Dublin Core info to the end of the file, but one might also hope that PG might decide on a clean manner of naming the "title" and "author" of their works.
For example, one commercial tool I use finds a title on this book of "The Project Gutenberg EBook of Emma, by Jane Austen", and an author of [none]. Whereas I would have hoped for a title of "Emma", an author of "Austen, Jane", and perhaps a publisher of "Project Gutenberg". For example, if I search for this book locally in my collection, I would hope to find it sorted by title under "E", not "P" or "T" (which would not be useful given how many books I have from PG), and I would hope that sorted on author it would show up under "A", not "J".
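The normalization being asked for here is simple to sketch. The parsing rule below is a guess that covers only the simple "The Project Gutenberg EBook of Title, by First Last" banner pattern; real PG banners vary, and a robust version would need more cases:

```python
# Sketch: turn a PG banner line into sortable title and author
# fields ("Emma" / "Austen, Jane"). Handles only the simplest
# "Title, by First Last" pattern; purely illustrative.
import re

def split_banner(line):
    line = re.sub(r"^The Project Gutenberg EBook of\s+", "", line)
    m = re.match(r"(?P<title>.+?), by (?P<author>.+)$", line)
    first, _, last = m.group("author").rpartition(" ")
    # Invert "First Last" to "Last, First" for author sorting.
    return m.group("title"), f"{last}, {first}".strip(", ")

print(split_banner("The Project Gutenberg EBook of Emma, by Jane Austen"))
# ('Emma', 'Austen, Jane')
```

With fields normalized this way, the book sorts under "E" by title and "A" by author, as hoped for above.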

Hi Jim, On 09.02.2012 at 15:45, Jim Adcock wrote:
Where do I find the documentation to the agreed tags?
Besides actually looking for existing patterns in the actual HTML code being submitted to PG, take a look at:

Interesting documentation!
http://sourceforge.net/projects/guiguts/ This project belongs to dp!
http://www.pgdp.net/c/faq/document.php This project belongs to dp!
especially alphabetic index just below: http://www.pgdp.net/c/faq/document.php#f_errors
This project belongs to dp!
http://www.pgdp.net/c/tools/pool.php?pool_id=PP This project belongs to dp!
http://www.pgdp.net/c/faq/formatting_summary.pdf This project belongs to dp!
I thought this was PG: a PG list, meant to improve PG. regards, Keith

Where do I find the documentation for the agreed tags?
"I thought this was PG: a PG list, meant to improve PG."

When PG accepts DP submissions, DP submitters become part of the PG community. You don't want to accept this notion because you don't want to have to actually do the work -- namely, the work of convincing the entire community of submitters to change direction at your whim -- and I am not making any claims about whether those whims are good or not. If there is to be any agreement about "change" at PG, that agreement in practice needs to include ALL the people who actually submit to PG, NOT just the independent submitters. DP also recognizes the need for change. It's just that they are well aware in practice of what it actually takes to make a book, and thus are not likely to be impressed by suggestions that don't make a real contribution towards improving the situation.

Hi Jim,

Sorry, DP is an independent entity. They are affiliated with PG; yes, they contribute to the cause and are mentioned by PG proper. But DP does not define PG policies, and vice versa. I have noticed that there are those working in/with DP who would like to see it differently, and who participate here trying to get DP's ways adopted here. I am not a lemming; that is, I do not just follow the one in front of me, who is following ..., blindly. As for the masses, a friend of mine has a nice saying: "Millions of flies can't be wrong: eat shit."

Please do not get me wrong: DP is doing a tremendous job, nor am I talking about the quality of their work. I am just saying that even though DP is the main contributor to PG, that does not mean theirs is the best way. Like I have said many times, I do not care what DP thinks or does; my suggestions are aimed at PG. To what extent PG accepts DP's ways, or says DP wins and others lose, is not my concern or for me to decide. Besides, since DP's work is so well documented, it should not be hard to convert whatever they produce to a PG master format. I see no necessity for PG to use DP's master format, nor for DP to adopt PG's.

regards, Keith

_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

Keith>DP does not define PG policies and vice versa.

Well, we fundamentally disagree on this subject, so there is no point in arguing. My position is that PG and DP are strongly tied, simply because the vast majority of PG content comes from DP, and it would be silly to pretend otherwise. Also, DP has greater practical knowledge of these issues than PG has. Note that *I* do not work with DP anymore [so there is no point in trying to tar me with that feather]; I gave DP up since books that take years to produce do not fit well with my lifestyle. I produce independently, because that is something I can in practice do.

I thought this was PG: a PG list, meant to improve PG.
regards, Keith
If you want to improve something, don't you logically start by understanding what it is you are trying to improve? If you want to improve DP/PG's process, don't you start with any information you can find on that process? There it is, hopefully at your appropriate level of abstraction and domain of interest.

Hi Don,

I have gathered all the information I need:

1) PG does not have a sufficiently specified set of constraints on the contents of the files and formats that it is willing to accept!
2) DP does have a well-documented set of constraints, yet it is so polymorphic that it cannot be used to produce consistent data.

Or, to put it differently:

1) PG has not concerned itself with the complexity of the problem of creating ebooks. That is left to the contributors!
2) DP has artificially introduced excessive complexity into the problem.

Now, the complexity introduced by DP stems from their belief that it is easier to digitize books by syntactically marking up the semantics of a book and its texts. This assumption is false, because there is no semantic information in a book; the extraction of such information is a function of the human brain. All that is needed is to identify the syntactic structures of the layout of the book in order to digitize it into an ebook.

Another problem is that classical book semantics and text semantics do not mix well, and I assume that they have improperly mixed them.

regards, Keith

On Thu, Feb 9, 2012 at 4:16 PM, Keith J. Schultz <schultzk@uni-trier.de> wrote:
Now, the complexity introduced by DP stems from their belief that it is easier to digitize books by syntactically marking up the semantics of a book and its texts. This assumption is false, because there is no semantic information in a book; the extraction of such information is a function of the human brain.
All that is needed is to identify the syntactic structures of the layout of the book in order to digitize it into an ebook.
First, there's no syntactic structure in a book. There's only ink. Anything above that level is purely an artifact of the human mind. Secondly, I have no idea what you're talking about. What markup do you not think DP should be doing?
Another problem is that classical book semantics and text semantics do not mix well, and I assume that they have improperly mixed them.
Of course. When having a discussion with someone else, point out that something is hard, and therefore you can assume they must have screwed it up. That's exactly what makes for a working discussion. -- Kie ekzistas vivo, ekzistas espero.

Hi David,

I will assume that you are not giving me the idiot treatment, so I will explain to you why you are wrong!

First, the distribution of ink is not arbitrary. By analyzing the distribution of the ink, you will observe recurring structures (forms); we will call them letters. Again, these letters are not distributed arbitrarily: further analysis reveals another form, lines. Yet further analysis reveals even more structure! These structures are not an artifact of the human mind, but truly physically existent. As one continues the analysis of the ink, one will identify more recurring structures and segmentation. Eventually, you end up with a syntactic description of the structure. This has nothing to do with an artifact of the human mind.

regards, Keith

On 10.02.2012 at 10:43, David Starner wrote:
First, there's no syntactic structure in a book. There's only ink. Anything above that level is purely an artifact of the human mind.
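The layered analysis Keith describes (ink, then letters, then lines, then larger structure) can be illustrated with a toy sketch. This is my own illustration, not any DP or PG tool: a horizontal projection profile over a binary "ink" bitmap recovers the line bands directly from the ink distribution, with no semantic interpretation involved:

```python
# Toy layout analysis: find "text lines" in a binary page image by
# summing ink per row (a horizontal projection profile) and taking
# the maximal runs of non-empty rows as line bands.

def find_lines(bitmap):
    """Return (start_row, end_row) spans whose rows contain any ink."""
    profile = [sum(row) for row in bitmap]   # ink count per row
    spans, start = [], None
    for y, ink in enumerate(profile):
        if ink and start is None:
            start = y                        # entering an inked band
        elif not ink and start is not None:
            spans.append((start, y - 1))     # leaving the band
            start = None
    if start is not None:                    # band runs to the last row
        spans.append((start, len(profile) - 1))
    return spans

# Two "text lines" separated by one blank row:
page = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
]
print(find_lines(page))   # -> [(0, 1), (3, 3)]
```

Whether one calls the recovered bands "physically existent structure" or "an artifact of the analyzing mind" is exactly the philosophical point the two posters dispute; the sketch only shows that the segmentation step itself is purely syntactic.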
participants (10)
- Bowerbird@aol.com
- David Starner
- don kretz
- Greg Newby
- James Adcock
- Jim Adcock
- Jimmy O'Regan
- Karen Lofstrom
- Keith J. Schultz
- Lee Passey