
Sorry to repeat myself, but you can get the RST and PGTEI tool sets (Epubmaker and Gnutenberg Press) installed easily on Windows systems, bundled with guiguts-win, here: http://sourceforge.net/projects/guiguts/files/guiguts/. There are many interesting development opportunities here: fleshing out a GUI interface to these tools, making it easier for users to generate EPUB-friendly HTML while providing warnings when they have not, and tweaking the code used to create HTML so it creates PGTEI. Contact me with a SourceForge ID and I can add you as a developer. On the last idea, the first step would be to modify lib\Guiguts\HTMLConvert.pm to replace a line like
[code]$textwindow->ntinsert( "$step.0", '<h1>' );[/code]
with
[code]$textwindow->ntinsert( "$step.0", $chapterheader );[/code]
then create a file html.rc with lines like
[code]$chapterheader='<h1>';[/code]
which gets read in at the start of htmlautoconvert(). Then you create pgtei.rc with
[code]$chapterheader='<header>';[/code]
You would then have a routine to produce TEI that is already far better than the one on the DP web site. A minor change to another routine (is_paragraph_open) could ensure that the <div> for the previous chapter is closed.
Hunter
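A minimal sketch of that .rc-driven approach, assuming nothing about the real Guiguts internals beyond the lines quoted above; the file names and the $chapterheader variable come from the description, everything else is illustrative:

[code]
#!/usr/bin/perl
# Sketch only, not actual Guiguts code: htmlautoconvert() would load a
# per-format .rc file that defines the markup strings, then emit those
# variables in place of hard-coded tags.
use strict;
use warnings;

our $chapterheader;

sub load_markup_rc {
    my ($rcfile) = @_;    # 'html.rc' or 'pgtei.rc'
    do $rcfile or die "Could not load $rcfile: " . ( $@ || $! );
}

load_markup_rc('html.rc');    # html.rc contains:  $chapterheader = '<h1>';
                              # pgtei.rc contains: $chapterheader = '<header>';

# The conversion routine then writes the variable instead of a literal tag:
#   $textwindow->ntinsert( "$step.0", $chapterheader );
print "Chapter headings will open with: $chapterheader\n";
[/code]

Switching the output format would then be a matter of which .rc file gets loaded, with no change to the conversion routine itself.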

I'd like to see someone consider using one of the many UI frameworks available these days rather than coming up with another - I guess to try to avoid another guiguts situation. One-man software tools have such obvious shortcomings no matter what their technical merit. The first that comes to my mind is Eclipse, but there are dozens of others in the general category. The CMS packages (Django, WordPress, Joomla, etc.) have a lot of what's needed and are variously easy to adapt. Epubmaker could be plugged in. But so much of this has already been done that it's likely we can avoid reinventing some major wheels. On Sun, Feb 5, 2012 at 11:23 AM, <hmonroe.pglaf@huntermonroe.com> wrote:

Don>I'd like to see someone consider using one of the many UI frameworks available these days rather than coming up with another - I guess to try to avoid another guiguts situation. One-man software tools have such obvious shortcomings no matter what their technical merit. Sorry, I'm not sure I get what you are suggesting based on looking at these websites, but it seems like you are going beyond requiring volunteers to submit to a "master" source format to now also "master" them when it comes to the choice of tools they are allowed to use to develop books in that "master" source format? I can see how this might work if they were well-paid employees, not volunteers. Or are you saying that you are volunteering to make a new set of tools that the volunteers can *choose* to use -- either the tools you provide, or guiguts, or Notepad++ -- or whatever the heck the volunteer wants to use which *they* feel is the most efficient, and/or the most fun, to help them get *their* book effort done?

You simply don't force volunteers to use tools. We're trying that now with RST. People use guiguts because it's the only thing available (other than your preferred editor and personal html sense, which is not a common resource.) I am developing tools that people will only use if a) they're easier than the alternative, b) they are more productive than the alternative and c) they increase the likelihood of success when facing the whitewashers and the PG software filters. But why would I want to develop tools that didn't? Unlike some, I'm not committed to any technology in the workflow, I'm only committed to the people without whom the workflow wouldn't flow. Hence my insistence that the only markup options worth considering are the ones that have the possibility of being used by DP volunteers with minimal objection.

You'll see one example when I get the eb.tbicl.org clone up (probably within a week - shipping the database files over and getting them transplanted is the most tedious step.) It's by default an html editor, but with some helper abstractions added to simplify some common markup patterns and enable non-techs to write successful html.

Interesting fact - there are really only three contexts that provide html editing tools that non-technical people seem to tolerate. One is email editors, and their output is so dumbed-down as to be, I think, useless. The other two are wikis and blog engines. Wikis have developed a comprehensive markup of their own that no one seems to be interested in discussing. Nor blog engines, which have a larger variety of options.

Most of the discussion here centers around how technically well some markup will represent the text, while capturing the data unambiguously enough to be mined for alternative outputs. I think this is a futile exercise based on incorrect priorities. I think the only path to success is to provide a user interface that as much as possible hides the markup so people don't need to deal with it directly. Then as long as the information is there, it can be extracted into any format you want. That's what software is for - to adapt to what people need to do. Not vice versa (yes I'm repeating myself.) That's the main point of one of those wikisource papers - they have decided that the most important inhibitor to their ability to grow and succeed is an inadequate editing interface. I can't think of a single successful word processing product that exposes its underlying text representation to the user.

So I have one project that's built on the blogger software that has the reputation for the least intimidating user interface (WordPress). And I picked it pretty much solely for that reason. It's a lot easier to get the text into whatever markup I want, serialized any way I need to, once I get people to edit it into some unambiguously structured format, whatever way they find easiest. (And very few people can get it into unambiguously structured form of any kind using just Notepad.) Guiguts has unstructured ambiguity as one of its biggest weaknesses.

WordPress by default provides the user with a TinyMCE interface, which looks like any rich text editor and produces html. I find it personally unusable. The backup editor is a customizable html editor that doesn't inhibit the complexity of the html or other markup you enter, so that's what I'm building on. They also have editor modules for Markdown, FCKeditor, Textile, and RST (although the RST editor appears to have been abandoned.)
I think RST had the potential to be morphed into a variety of contexts, but it has never caught on and is now primarily a Python documentation tool, applied to other purposes by pythonisti. Eclipse is quite commonly used as an easier-to-use UI shell in front of editors (mainly of the programmer type, but the target audience of the preferred toolset provided by PG is unquestionably programmer-types.) It comes with pretty good tools for highlighting html, javascript, or any other syntax you can name. On Sun, Feb 5, 2012 at 10:31 PM, Jim Adcock <jimad@msn.com> wrote:

On 02/06/2012 08:32 AM, don kretz wrote:
You simply don't force volunteers to use tools. We're trying that now with RST.
False. We are giving them the option to post in RST. Like we gave them the option to post in TEI. If you don't like either, you can use the traditional posting channel. -- Marcello Perathoner webmaster@gutenberg.org

What if DP revised its workflow so that one person combined all the pages, checked for inconsistencies and typos, and marked all the chapter headings and section headings with some easy markup? This text could then be passed to someone who specialized in markup. This would basically split the PP process into two parts. It would resemble a real-world publishing workflow in which editors are asked to mark text divisions before passing the text to the layout person. If we did that, we wouldn't have to teach all the PPers RST or TEI. Just a few. -- Karen Lofstrom

Hi Karen, Why teach them RST or TEI? Give them a WYSIWYG editor. The PPers would only need to know how to use a simple editor and need not worry about what format the file is saved in, nor worry about making mistakes using the format. Another benefit is that if for whatever reason the "master format" changes, the PPers do not have to learn something new; we just rewrite the editor to output the new format. regards Keith. On 06.02.2012 at 17:04, Karen Lofstrom wrote:

Keith<>Why teach them RST or TEI? Give them a WYSIWYG editor. The PPers would only need to know how to use a simple editor and need not worry about what format the file is saved in, nor worry about making mistakes using the format. You mean like Sigil?

On 06.02.2012 at 20:17, Jim Adcock wrote:
No, not Sigil! But something along the lines of Sigil, more sophisticated and geared towards a master format rather than a device format. Because what do we do when new devices and standards come out? regards Keith.

Keith> No, not Sigil! But something along the lines of Sigil, more sophisticated and geared towards a master format rather than a device format. Because what do we do when new devices and standards come out? But EPUB *is* a master format, just not one you apparently like, and when I suggest a real tool (Sigil), which has a real and large user base of real people making real books, you counterpropose something which doesn't exist.

Hi James, What do you expect from me? You asked me what I was thinking about. My view is that EPUB is not a viable master format. If I followed your way of arguing, let's all just take one of the most widely used formats on the planet: docx! Just for the sake of argument. EPUBs are not real books; they are virtual, or at most digital. regards Keith. On 06.02.2012 at 23:00, James Adcock wrote:

Some of the Publish-On-Demand services recommend you do exactly that. And their requirements are quite similar. On Mon, Feb 6, 2012 at 3:50 PM, Jim Adcock <jimad@msn.com> wrote:
If I followed your way of arguing, let's all just take one of the most widely used formats on the planet: docx!
Actually a better idea than most floated here.

One shouldn't need to examine the source code of anyone's program to find out what it does. Which could be something else entirely tomorrow. On Mon, Feb 6, 2012 at 3:53 PM, don kretz <dakretz@gmail.com> wrote:

Unless, of course, it is intended that you not know, and you should assume it's a black box, and that what it does and how it does it isn't anything you should count on in the future. Which is a legitimate approach; but it should be made explicit. On Mon, Feb 6, 2012 at 3:57 PM, don kretz <dakretz@gmail.com> wrote:

Hi Don, Are we talking OOP and encapsulation here? Then you should only need to know what goes in and what comes out! regards Keith. On 07.02.2012 at 01:00, don kretz wrote:

Yes, you need to trust that the person in charge of the inside is doing what you expect. Which sometimes you only find out if he tells you what he's doing, because he may not know or care what you expect. On Mon, Feb 6, 2012 at 4:55 PM, Keith J. Schultz <schultzk@uni-trier.de> wrote:

If PG is to have a hope of citizen-provided proofing, crowd or otherwise, there are two non-negotiable design requirements. The page images must be available as the basis for resolving all questions about content (understanding they aren't always unambiguous, but no images is hopeless.) The text must have the page numbers embedded as data so the error submission process includes the ability to easily confirm the text with the image. People will self-verify if they are given the tools. If they can't, the whitewashers are screwed again.
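As a concrete sketch of what "page numbers embedded as data" can look like: many DP-produced HTML texts mark page breaks with anchors along the lines of <span class="pagenum" id="Page_23"> (assumed here, not prescribed anywhere above); an error-report form could use an index like the one below to tie a reported passage back to its page image.

[code]
# Sketch only, assuming pagenum-style anchors in the HTML: build an
# offset -> page-number index so a correction can be checked against the
# image of the page that covers the disputed text.
use strict;
use warnings;

sub index_pages {
    my ($html) = @_;
    my @pages;
    while ( $html =~ /<span\s+class="pagenum"[^>]*\bid="Page_(\d+)"/g ) {
        push @pages, { page => $1, offset => pos($html) };
    }
    return \@pages;
}

sub page_for_offset {
    my ( $pages, $offset ) = @_;
    my $page;
    for my $p (@$pages) {
        last if $p->{offset} > $offset;
        $page = $p->{page};
    }
    return $page;    # undef if the offset precedes the first page anchor
}
[/code]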

Hi Don, Agreed! regards Keith. On 08.02.2012 at 07:39, don kretz wrote:

Don>If PG is to have a hope of citizen-provided proofing, crowd or otherwise, there are two non-negotiable design requirements.

I would hope that no one on this forum is in a position to issue non-negotiable demands. Rather I would hope we could have a polite conversation of ideas on the basis of their merits. I would hope PG would be willing to accept *any* text which they can reasonably assure is out of copyright, and thus the public has a "non-negotiable" right to copy that work wherever and whenever they want, PG being part of the public. If PG is not willing to accept all such texts then PG is effectively working *with* those organizations which engage in copyfraud.

But PG and would-be submitters face a realistic problem: some other organizations assert "copyright" over some or all parts of photographic page images from works which are clearly out of copyright. In my opinion those assertions are fraudulent, aka "copyfraud", but as a pragmatic response to this problem PG and submitters may wish to avoid copying, storing, and re-transmission of these "copyfraud" photocopies. Even these "copyfraud" organizations agree that a textual representation of these out-of-copyright texts in no way infringes on their blessed photocopies. So a simple solution in these cases is that PG posts the textual representation, but does not host a copy of the "copyfraud" images. Asking submitters to tackle or put themselves in harm's way of these "copyfraud" organizations would not be a reasonable requirement. What the submitter often does instead is tell the whitewashers where these pages are hosted, in case the whitewashers need to verify that the submitted text *is* in fact the text of a book which *is* out of copyright, and which therefore the submitter, and PG, have every legal right in the world to store, transfer and copy as they see fit.

Besides, we are not really talking about "citizen-provided proofing" -- that is what DP does -- rather we are talking about human-powered formatting to overcome the limitations of PG tool-driven formatting. Agreed, still, that such formatting is done better while viewing page images; otherwise one is doing "blind formatting" -- which in fact is what the PG tools *are* doing, except perhaps when they are only slightly fixing problems with submitter HTML code.
The text must have the page numbers embedded as data so the error submission process includes the ability to easily confirm the text with the image.
Submitters, including yours truly, often do not record page numbers because doing so is often a pain -- page numbers being one of the things which OCR does worst. People who insist on page numbers do so because page numbers fit well with *their* choice of workflow. For those of us who do not use that workflow, putting page numbers back in doesn't work well. Checking submitted text against page images without page numbers is *in practice* no big deal and something that we who do not use page numbers do "all the time" -- we simply use textual searches to match what we are working on against a portion of the page image. For example, when I OCR I also generate a PDF page image file which contains the OCR text. Usually I am "in sync" with the PDF so *I* know where I am. If I put the work away and come back from two weeks of vacation, I can simply "re-sync" by doing a textual search on the PDF. Again, if I have to provide page images then I might as well just provide the page images to DP and let them "mechanical turk" the work, because making a set of page images that would keep DP or PG happy is a ton of work -- and I would rather use the time and energy to do something fun and constructive. And some of us believe that page numbers in e-books are a very, very bad idea. If you want page numbers, you do the work. Don't make us do it.

On Wed, Feb 8, 2012 at 10:04 AM, Jim Adcock <jimad@msn.com> wrote:
Don>If PG is to have a hope of citizen-provided proofing, crowd or otherwise, there are two non-negotiable design requirements.
I would hope that no one on this forum is in a position to issue non-negotiable demands. Rather I would hope we could have a polite conversation of ideas on the basis of their merits.
Read the post: "If PG is to have a hope of citizen-provided proofing ... then ..." It isn't so because I or anyone else declares it so. You sound like you are admirably well equipped to correct your own texts with only text search. I have no way to know, but I'm willing to take your word for it. I apologize if I misunderstood that we were having a conversation at least partially about the topic of widely available techniques for the public to submit corrections to PG texts -- at least as vigorously as about how you proof your texts and on what terms PG accepts them, which is clearly an important topic for you.

EPUB is a wonderful display format. It is NOT a master format.
Well, you would have to define what *you* mean by "master format", since clearly your idea of what a "master format" is differs from mine. Not sure what you mean by a "display format." I can clearly see that PDF is a display format, and HP Printer Code is a display format. By "master format" I mean something that can be used to provide 100% of the information needed to create a "real book".

On 8 February 2012 17:26, Jim Adcock <jimad@msn.com> wrote:
EPUB is a wonderful display format. It is NOT a master format.
Well, you would have to define what *you* mean by "master format", since clearly your idea of what a "master format" is differs from mine.
It's generally accepted that a master format is one from which other formats are generated. This is analogous to a master tape in music, or a master reel in cinema.
Not sure what you mean by a "display format." I can clearly see that PDF is a display format, and HP Printer Code is a display format.
EPUB is defined as a delivery format, rather than a display format: it's meant to (more or less) coordinate how the various files in other formats (HTML, SSML, OpenType, etc.) form a book. (HP's PCL is actually not a display format -- it's not meant to be displayed, after all -- it's a programming language, as is Adobe's Postscript).
By "master format" I means something that can be used to provide 100% of the information needed to create a "real book"
I can see how you would arrive at that idea, if you've never been introduced to the term (and it makes some of your other pronouncements a lot less... odd), but 'master format' is an accepted term, and it seems quite clear that it's what everyone else has in mind. -- <Sefam> Are any of the mentors around? <jimregan> yes, they're the ones trolling you

It's generally accepted that a master format is one from which other formats are generated. This is analogous to a master tape in music, or a master reel in cinema.
OK, well it's relatively trivial to convert EPUB to HTML, MOBI, KF8, and txt70. What more do you want? HTML to EPUB doesn't really work, because HTML doesn't contain all the elements of EPUB, forcing PG to "fake" this conversion, as one can easily see by taking a good hard look at the push-ups Marcello goes through in epubmaker to generate EPUB from HTML -- let alone generating MOBI. Epubmaker can only "fake" this conversion based on additional conventions PG loads unsupported on top of HTML.

On 08.02.2012 at 20:16, James Adcock wrote:
HTML to EPUB doesn't really work, because HTML doesn't contain all the elements of EPUB, forcing PG to "fake" this conversion, as one can easily see by taking a good hard look at the push-ups Marcello goes through in epubmaker to generate EPUB from HTML -- let alone generating MOBI. Epubmaker can only "fake" this conversion based on additional conventions PG loads unsupported on top of HTML.
?????? ???????????? ?????????????????? ??????? ?????????? ????? ??????? ??????? ??????? ??????? ?????? ?????? ?????? ??????? ?????? ?????? ????????? ???????????? ???????? ???? Regards Keith

Jim>> Epubmaker can only "fake" this conversion based on additional conventions PG loads unsupported on top of HTML. Keith> (vacuous huge question mark) Read the code to Epubmaker where Marcello makes it abundantly clear where he is having to "fake" things in order to try to make them work, paying particular attention to exceptions and recovery. Such as for example trying to gleam information from source file names -- which is certainly NOT part of HTML. Or trying to extract information from assumed PG header conventions. Again, not part of HTML.
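For illustration only (this is not epubmaker's actual code), a sketch of what "extracting information from assumed PG header conventions" amounts to, and why it is fragile:

[code]
# Sketch: guess title and author by pattern-matching the customary PG
# header lines. Nothing in HTML guarantees these lines exist, so any change
# in the boilerplate wording silently breaks the extraction.
use strict;
use warnings;

sub guess_metadata_from_header {
    my ($text) = @_;
    my %meta;
    ( $meta{title} )  = $text =~ /^Title:\s*(.+?)\s*$/m;
    ( $meta{author} ) = $text =~ /^Author:\s*(.+?)\s*$/m;
    return \%meta;    # fields come back undef when the conventions are not followed
}
[/code]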

On 8 February 2012 19:16, James Adcock <jimad@msn.com> wrote:
It's generally accepted that a master format is one from which other formats are generated. This is analogous to a master tape in music, or a master reel in cinema.
OK, well it's relatively trivial to convert EPUB to HTML, MOBI, KF8, and txt70. What more do you want?
It's relatively trivial to burn a CD from MP3s and a DVD from MP4s. That doesn't make an MP3 a master tape, or an MP4 a master reel.
HTML to EPUB doesn't really work, because HTML doesn't contain all the elements of EPUB, forcing PG to "fake" this conversion, as one can easily see by taking a good hard look at the push-ups Marcello goes through in epubmaker to generate EPUB from HTML -- let alone generating MOBI. Epubmaker can only "fake" this conversion based on additional conventions PG loads unsupported on top of HTML.
That's not so much a problem in converting HTML to EPUB -- most EPUB files are just HTML in a zip file, with some metadata -- the problem is inferring this metadata from HTML. Inferring semantic information of any kind from presentation-level details is, at best, unreliable. -- <Sefam> Are any of the mentors around? <jimregan> yes, they're the ones trolling you
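A small illustration of the "HTML in a zip file, with some metadata" point, using the CPAN module Archive::Zip; the file name is a placeholder.

[code]
# List what a typical .epub contains: a mimetype entry, META-INF/container.xml,
# an .opf package file carrying the Dublin Core metadata, and the XHTML
# content documents themselves.
use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES );

my $zip = Archive::Zip->new();
$zip->read('example.epub') == AZ_OK
    or die "Could not read example.epub as a zip archive\n";
print "$_\n" for $zip->memberNames();
[/code]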

On Wed, February 8, 2012 1:03 pm, Jimmy O'Regan wrote:
That's not so much a problem in converting HTML to EPUB -- most EPUB files are just HTML in a zip file, with some metadata -- the problem is inferring this metadata from HTML. Inferring semantic information of any kind from presentation-level details is, at best, unreliable.
One of the first things that Ms. Lofstrom suggested in response to Mr. Hutchinson's original proposal was updates to the metadata associated with a text. I would think that any automated generation process should be extracting the metadata associated with a text - and if the metadata is incorrect in the resulting file then obviously we need to improve the master metadata. And I think this raises an interesting issue: not only do we need a master document format, we also need a master /metadata/ format.

Hi Lee, Well, I would go further and say that the metadata should be part of the master format. Something I should put into ANA. Thanx regards Keith. On 08.02.2012 at 21:11, Lee Passey wrote:

"Metadata" is a loaded word in this context. In particular, DP has it's own legacy set of definitions. It's worth taking the time to explain what you think the metadata is in pretty complete and precise terms, lest you be misunderstood and differing assumptions be made particularly about the difficulty of collecting it. On Wed, Feb 8, 2012 at 1:58 PM, Keith J. Schultz <schultzk@uni-trier.de>wrote:

Hi Don, What I was saying is that we need to discuss the matter of metadata, as I am planning on implementing or suggesting particular metadata to be in the master format. What is important, though, is that if you want metadata it should be in the master format and not in some other file or database. Naturally, there are ways to keep things hooked up, but it would complicate matters if changes to the metadata have to be propagated to all those versions, and someone has to decide which version they should be connected to! As to collecting it: that is a different matter and depends on the data you want. regards Keith. On 09.02.2012 at 00:27, don kretz wrote:
"Metadata" is a loaded word in this context. In particular, DP has it's own legacy set of definitions.
It's worth taking the time to explain what you think the metadata is in pretty complete and precise terms, lest you be misunderstood and differing assumptions be made particularly about the difficulty of collecting it.
On Wed, Feb 8, 2012 at 1:58 PM, Keith J. Schultz <schultzk@uni-trier.de> wrote: Hi Lee,
Well, I would go further and say that the meta data should be part of the master format.
Something I should put into ANA.
Thanx
regards

On Wed, February 8, 2012 4:27 pm, don kretz wrote:
"Metadata" is a loaded word in this context. In particular, DP has it's own legacy set of definitions.
It's worth taking the time to explain what you think the metadata is in pretty complete and precise terms, lest you be misunderstood and differing assumptions be made particularly about the difficulty of collecting it.
Fair enough. Let me start by simply exposing, not discussing, two basic concepts: Metadata is data about data. Thus, when you start to talk about "metadata" you first have to identify the "data" that you're "meta-ing."

The bibliographic people at the International Federation of Library Associations and Institutions have come up with the WEMI model: Work, Expression, Manifestation, Item. "Work" is the abstract concept from which all Expressions are derived (one cannot have a "Work" without at least one "Expression"). _Huckleberry Finn_ by Mark Twain is a "Work." The 1902 Authorized Edition of _Huck Finn_ with corrections by the author is an "Expression", as is a translation of Huck Finn into another language. Every "Expression" has at least one Manifestation (the 1902 Authorized edition published in New York by Dunlop). An "Item" is the specific book, on a specific shelf in a specific library. We don't care about "Items".

So, just brainstorming here, there can be at least three "data" that we might want to "meta".

First, there's WEM data. This is the data usually captured in the <meta> section of an ePub's .opf file, and what is traditionally thought of when someone uses the term "metadata." Most electronic publications use the Dublin Core model to record this metadata, but some do not. WEM metadata is the sole focus of Internet Archive's Open Library project (openlibrary.org).

ePub is an interesting use case, because an ePub is fundamentally a collection of files, which only together can be considered a "publication"; individually, these files are just, well, individual files. In this case, there needs to be metadata that describes which files are part of the publication, and how they go together. This is metadata not about the work or its expression, but metadata about the ePub publication structure. In the case of ePub, publication metadata is also stored in the .opf file; indeed, that is the .opf file's primary function.

Every e-text in PG's corpus came from somewhere. Some individual created the file, and usually other individuals have modified it over the course of its lifetime. At some point in time someone decided that PG wouldn't be sued for publishing its version on the internet. And sometimes some automated process may have been triggered that changed the nature of the e-text. The collection of data that describes PG's processes is also metadata. Much of this data has been lost in the sands of time (download statistics) and other data that should have been lost (Al Haines' credit line) has been preserved. But it is what it is, and what it is is metadata.

I've identified here three relevant types of metadata: WEM metadata (the most important), publication metadata (also important), and PG metadata (some important, some not). Going forward I will try to differentiate between these different types of metadata. When I do not, everyone may assume that I'm talking about WEM metadata. I'm particularly interested in hearing from Ms. Lofstrom with suggestions about what WEM metadata should be collected, and how it might be structured and retained.
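A hedged sketch of those three layers as a plain data structure; the field names and values are illustrative placeholders, not a proposed schema (the 1902 Dunlop edition is the example used above).

[code]
# Sketch: the three kinds of metadata distinguished above, attached to one
# hypothetical e-text. All field names and values are placeholders.
use strict;
use warnings;

my %etext = (
    wem_metadata => {    # about the Work / Expression / Manifestation
        title    => 'Adventures of Huckleberry Finn',
        creator  => 'Twain, Mark',
        language => 'en',
        source   => '1902 Authorized Edition, New York: Dunlop',
    },
    publication_metadata => {    # how the files form one publication
        spine => [ 'cover.xhtml', 'ch01.xhtml', 'ch02.xhtml' ],
        toc   => 'toc.ncx',
    },
    pg_metadata => {    # PG's own process history
        ebook_no => 12345,
        cleared  => '2012-01-01',
        credits  => 'Produced by a hypothetical volunteer',
    },
);

print "$etext{wem_metadata}{title}\n";
[/code]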

On Thu, Feb 9, 2012 at 7:48 AM, Lee Passey <lee@novomail.net> wrote:
I'm particularly interested in hearing from Ms. Lofstrom with suggestions about what WEM metadata should be collected, and how it might be structured and retained.
I only know that metadata are important, from dipping into librarian blogs and also from my own struggles to find, retrieve, and store ebooks. Lee, you clearly know more than I do about the librarian end of things. Librarians who are awake to the value of information technology and struggling to organize their own institutional digital repositories would be the best resources here. I'm thinking of Dorothea Salo, who used to blog about such issues before ... well, I'm reasonably certain that her bosses ordered her not to rock the boat. [Sidenote: a few years ago, I attended a conference of Pacific librarians. What I saw was a gathering of timid bureaucrats, dependent on public funding, who were not going to do anything new unless it had been proven SAFE by someone else. That's what Dorothea was bucking.]

From my USER point of view, PG gives next to no usable information to people searching for books. Organized by author and title. That's all. I would want info re date and place of publication, publisher, and genre. Birth and death dates of author. If a serial publication, volume and number, and the run of the publication. (I might want to look at all the books and periodicals published in London in 1882. No way to do it now, and it ought to be easy.) I would want LOC and Dewey Decimal and other such numbers (dunno about schemes used in non-American libraries) stored with the book info so that I could find all versions of a book, e, paper, whatever. I would want to know WHEN the book was first digitized and WHEN it had been corrected, if at all. (That's important for judging the reliability of the text.)

I would want a good library search engine. When you use a library database, you're stuck with antiquated interfaces that only recognize certain terms in a certain order. Nothing at all like Google, which can give you answers even if your queries are loose and impressionistic. Finally, I'd want a recommendation system like the ones used by Amazon or Netflix.

Ideally, I'd also like to be able to download software to organize my books on my home computer. Right now I have several thousand books that are labeled only by title and author (Villette - Bronte) and "shelved" in a homegrown folder system. Frex, Villette would be in the folder Bronte C, which would be in the folder 19th century - England. I don't think this is the best system; it's not even a good one. But it's what I can do without having to code something myself. (I'm not a programmer; I got the only C I ever got in my life in second-semester Java programming.) If PG books had all the necessary metadata and I had a program that could read that data, the program could organize my books for me.

Oh, and I'd like to be able to download ALL of an author's books at once, instead of having to do it painfully and slowly one by one. I suppose I still think like an academic. If I'm interested in Joseph Altsheler, I want to see everything he published. That's a user's POV, not a librarian's and not a programmer's. I don't think that anything in my list is undoable. It's all been done. It's just a matter of assembling the pieces.

Yes, it looks easy from the outside. I'm guilty of "why don't we" -- often encountered in meetings -- which translates to "I have a keen idea, why don't YOU do it." Still, I've put in eight years and thousands of pages at DP, so it's not all "you do the work". -- Karen Lofstrom

Karen, First, don't denigrate the work you've done at DP compared to what people in this group have done or might do. Your stuff is what makes this worth doing (and even talking about.) It should also be what largely determines the requirements and priorities for PG and DP development.

I think "librarianship" is a good metaphor for PG's role for purposes of your question. To what degree does PG function as a library - assemble books in a common store, make sure they are in good shape, and make it easy to find and acquire books (for free!). To the degree that's accurate, then your understanding of metadata is pretty good - as it applies to PG more than to the construction of ebooks.

But the DP production line is a good place to collect and identify some of that metadata, because it's inherent in the title page and other areas we work on. DP's markup could make it a lot easier to do this. Currently, though, it mostly just associates it with display formatting (size, font, etc.) and the opportunity is lost. I personally already include it as specific tags in the editor I'm building for eb.tbicl.org. Here's a specific example of a project title page as it currently appears.

[title-page]
[title]ASTRONOMY WITH AN OPERA-GLASS [/title]
[subtitle]A POPULAR INTRODUCTION TO THE STUDY OF THE STARRY HEAVENS WITH THE SIMPLEST OF OPTICAL INSTRUMENTS [/subtitle]
[subsubtitle] WITH MAPS AND DIRECTIONS TO FACILITATE THE RECOGNITION OF THE CONSTELLATIONS AND THE PRINCIPAL STARS VISIBLE TO THE NAKED EYE [/subsubtitle]
BY
[author]GARRETT P. SERVISS[/author]
<epigram>
"Known are their laws; in harmony unroll
The nineteen-orbed cycles of the Moon.
And all the signs through which Night whirls her car
From belted Orion back to Orion and his dauntless Hound,
And all Poseidon's, all high Zeus' stars
Bear on their beams true messages to man."
<attribution>Poste's Aratus.</attribution>
</epigram>
[edition]THIRD EDITION[/edition]
[publisher]
NEW YORK
D. APPLETON AND COMPANY
London: Caxton House, Paternoster Square
[pub-date]1890[/pub-date]
[/publisher]
[copyright]
Copyright, 1888,
By D APPLETON AND COMPANY.
[/copyright]
[/title-page]

On Thu, Feb 9, 2012 at 11:43 AM, Karen Lofstrom <klofstrom@gmail.com> wrote:
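A sketch of why tags like these make later conversion easy (the mapping below is illustrative, not the actual eb.tbicl.org code): once the structure is unambiguous, turning it into HTML, TEI, or anything else is a small mechanical step.

[code]
# Sketch: map a few of the bracketed structure tags to HTML equivalents.
# The tag list and the HTML element choices are illustrative only.
use strict;
use warnings;

my %html_for = (
    title    => 'h1',
    subtitle => 'h2',
    author   => 'h3',
);

sub tags_to_html {
    my ($text) = @_;
    for my $tag ( keys %html_for ) {
        my $element = $html_for{$tag};
        $text =~ s{\[\Q$tag\E\]}{<$element>}g;
        $text =~ s{\[/\Q$tag\E\]}{</$element>}g;
    }
    return $text;
}

print tags_to_html('[title]ASTRONOMY WITH AN OPERA-GLASS [/title]'), "\n";
# prints: <h1>ASTRONOMY WITH AN OPERA-GLASS </h1>
[/code]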

On Thu, Feb 09, 2012 at 09:43:05AM -1000, Karen Lofstrom wrote:
From my USER point of view, PG gives next to no usable information to people searching for books. Organized by author and title. That's all. I would want info re date and place of publication, publisher, and genre. Birth and death dates of author. If a serial publication, volume and number, and the run of the publication.
(I might want to look at all the books and periodicals published in London in 1882. No way to do it now, and it ought to be easy.)
This approach would be problematic, since the PG titles are considered as published when *we* release them. Published in Urbana, Illinois. Published by Project Gutenberg. It might be desirable to include something about original publication dates of the source material(s) we used, but I think that's not consistent with (say) Dublin Core metadata for our books. I believe you'll find that cataloging rules for reprints and derivatives are to catalog the reprint or derivative, not the original item. (We see this all the time with facsimile reprints, in fact.) In case you are thinking, "but what if we try to accurately represent the old printed book, in our new eBook," it's still not appropriate to claim that the old book's metadata applies to our new eBook. Furthermore, we'll have scholars and other riffraff complaining that our book is not, in fact, the same.
I would want LOC and Dewey Decimal and other such numbers (dunno about schemes used in non-American libraries) stored with the book info so that I could find all versions of a book, e, paper, whatever.
We do reasonably well with this, but could use more. We use LCSH. It's not in the book, but is in the RDF in the cache/generated subset (see my earlier notes on this), and of course in our catalog (i.e., the XML/RDF or MARC in "offline catalogs").
I would want to know WHEN the book was first digitized and WHEN it had been corrected, if at all. (That's important for judging the reliability of the text.)
Well, that's in there. Original is in the book, and the metadata. Updates are in the book. Your suggestions are all good ones, and I'm only providing feedback on the few where you might have some misunderstandings. -- Greg

On Thu, Feb 9, 2012 at 11:45 PM, Greg Newby <gbnewby@pglaf.org> wrote:
This approach would be problematic, since the PG titles are considered as published when *we* release them. Published in Urbana, Illinois. Published by Project Gutenberg.
It might be desirable to include something about original publication dates of the source material(s) we used, but I think that's not consistent with (say) Dublin Core metadata for our books.
Then we bolt on fields that give original publication date, original publisher and place of publication. I'm a scholar. I *need* that info. The original PG intent, to obscure the source of the texts (and in some cases to combine texts), was a mistake, and I'm glad that the policy has been abandoned.
In case you are thinking, "but what if we try to accurately represent the old printed book, in our new eBook," it's still not appropriate to claim that the old book's metadata applies to our new eBook. Furthermore, we'll have scholars and other riffraff complaining that our book is not, in fact, the same.
But there should be a place IN the metadata to credit the source of the PG version. How far we can depart from the original version is another discussion. We have to balance fidelity to the original with readability in the present (and usability on ereaders). There's ambiguity and difficulty there. I am happy that DP has been moving towards more scholarly editions, with many PPers noting which changes have been in the text, as emendations of typos in the original.
Well, that's in there. Original is in the book, and the metadata. Updates are in the book.
But none of that is easily searchable! The PG interface allows us to search only by au and title. Google and IA and Manybooks aren't any better, really. Is there any organization out there that has done it right? -- Karen Lofstrom

On 02/10/2012 11:04 AM, Karen Lofstrom wrote:
But there should be a place IN the metadata to credit the source of the PG version.
Enter *distributed proofreaders* into the PG search.
How far we can depart from the original version is another discussion. We have to balance fidelity to the original with readability in the present (and usability on ereaders). There's ambiguity and difficulty there. I am happy that DP has been moving towards more scholarly editions, with many PPers noting which changes have been in the text, as emendations of typos in the original.
This is a myth. Citing from PG in an academic context will probably get you caned and expelled. I think you mean "hobbyists who like to call themselves scholars".
But none of that is easily searchable! The PG interface allows us to search only by au and title.
False. You never tried, did you? -- Marcello Perathoner webmaster@gutenberg.org

1. These are the MARC fields we currently keep in the database:

010 - Library of Congress Control Number (LCCN) | LoC No.
020 - International Standard Book Number (ISBN) | ISBN
022 - International Standard Serial Number (ISSN) | ISSN
035 - System Control Number | Ebook No.
041 - Language Code | Language
050 - Library of Congress Classification (LoCC) | LoCC
082 - Dewey Decimal Classification Number | Dewey
240 - Uniform title | Uniform Title
245 - Title Statement | Title
246 - Varying form of title | Alternate Title
250 - Edition Statement | Edition
260 - Publication, Distribution, etc. (Imprint) | Imprint
300 - Physical description | Source Description
440 - Series statement / Added Entry -- Title | Series Title
500 - General note | Note
505 - Formatted Contents Note | Contents
508 - Creation / Production Credits Note | Credits
520 - Summary, etc. Note | Summary
534 - Original Version Note | Original Version
540 - Terms Governing Use and Reproduction Note | Copyright Notice
546 - Language Note -- Uncontrolled | Language Note
600 - Subject Added Entry -- Personal Name | Subject
610 - Subject Added Entry -- Corporate Name | Subject
611 - Subject Added Entry -- Meeting Name | Subject
630 - Subject Added Entry -- Uniform Title | Subject
650 - Subject Added Entry -- Topical Term | Subject
651 - Subject Added Entry -- Geographic Name | Subject
653 - Index Term -- Uncontrolled | Subject
901 - Cover or Frontispiece Image URL | Cover Art
902 - Title Page Image URL | Title Page
903 - Verso Image URL | Verso

Some fields are controlled, e.g. subjects. That means you get to insert only values from a subject list approved by the LoC. Refer to the MARC docs.

2. Some of the data are automatically populated from the PG header in the posted ebook. Most of them need to be added manually after the fact. We have an online interface for that and people working on it. If you want to volunteer, speak up. (And great silence ensued ...)

Alternatively I can also batch insert data into the database from a CSV file formatted thus:

ebook;marc;value
12345;260;New York, 1921
12345;300;300 pages, 8x13 inches
12346;260;Chicago, 1854

3. For submissions in the RST format you can add all that data like this:

.. meta::
   :DC.Title: Witches
   :MARC.250: First Edition
   :MARC.260: Salem, Mass. 1692

-- Marcello Perathoner webmaster@gutenberg.org

And I strongly disagree with this example below, for the reason that Greg described well earlier. We should *not* be presenting this type of information about our source text as if it were information about the PG text. --Andrew On Fri, 10 Feb 2012, Marcello Perathoner wrote:
Alternatively I can also batch insert data into the database from a CVS file formatted thus:
ebook;marc;value
12345;260;New York, 1921
12345;300;300 pages, 8x13 inches
12346;260;Chicago, 1854

On Fri, Feb 10, 2012 at 1:45 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
False. You never tried, did you?
Why should I try when it's not clear to me that such a search is possible? When I'm only presented with certain info, I can only believe that this is all the info possible. The interface needs a redesign. -- Karen Lofstrom

On 02/10/2012 06:54 PM, Karen Lofstrom wrote:
On Fri, Feb 10, 2012 at 1:45 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
False. You never tried, did you?
Why should I try when it's not clear to me that such a search is possible? When I'm only presented with certain info, I can only believe that this is all the info possible. The interface needs a redesign.
Why should you *not* try to search for something before saying it isn't possible? The interface says: "Search All" and it does exactly that. It searches all data we have. -- Marcello Perathoner webmaster@gutenberg.org

"Marcello" == Marcello Perathoner <marcello@perathoner.de> writes:
Marcello> Why should you *not* try to search for something before saying it isn't possible?
Marcello> The interface says: "Search All" and it does exactly that. It searches all data we have.

And if I search for "distributed proofreaders" (following your advice) the search returns 53 matches. About 0.2% of the expected results.

Already finding that there is an advanced search is not trivial. Instead of having a link in the search column, immediately after Book search, you have to scroll. And in the Book search page there is no mention of the advanced search page.

Carlo

On Fri, Feb 10, 2012 at 10:09 AM, Carlo Traverso <traverso@posso.dm.unipi.it> wrote:
Already finding that there is an advanced search is not trivial. Instead of having a link in the search column, immediately after Book search, you have to scroll. And in the Book search page there is no mention of the advanced search page.
The PG page isn't the best interface in the world. But then, we haven't actively searched for a good interface designer/usability expert who is willing to work for nothing :) This is not to criticize Marcello (who keeps things running against all odds) because he doesn't have that skillset. Perhaps he would be willing to work with someone else, if we could attract that someone. -- Karen Lofstrom

On 02/10/2012 09:09 PM, Carlo Traverso wrote:
And if I search for "distributed proofreaders" (following your advice) the search returns 53 matches. About 0.2% of the expected results.
So we need a volunteer to enter the missing 99.8% credits.
Already finding that there is an advanced search is not trivial. Instead of having a link in the search column, immediately after Book search, you have to scroll. And in the Book search page there is no mention of the advanced search page.
Actually the old "advanced search" is far less capable than the new default search. It is kept only for compatibility with the catalog team. -- Marcello Perathoner webmaster@gutenberg.org

Marcello Perathoner <marcello@perathoner.de> writes:
Why should you *not* try to search for something before saying it isn't possible?
The interface says: "Search All" and it does exactly that. It searches all data we have.
Marcello, you can do better. Just agree that the search interface is rather limited.

I've no idea what I must enter if I'm interested in works by Goethe published before 1833 (*goethe -1833* does not seem to do the trick). Or works by Goethe in German? *goethe german*? No, you must select the language first (which is the last option...).

You will probably argue that nearly all our books are in English, and thus it might make sense to offer "language" as a second- or third-class criterion. But then, the language selection is sticky, yet the search interface does not say so...

As long as there are just 50,000 or 100,000 books (and you are not interested in "goethe faust" or "shakespeare hamlet") the interface is probably good enough. There is no need to blame Marcello ;-) The searches are pretty fast and you can simply glance into all the books that are presented as search results. -- Karl Eichwalder

On 02/10/2012 09:41 PM, Karl Eichwalder wrote:
I've no idea what I must enter if I'm interested in works by Goethe published before 1833 (*goethe -1833* does not seem to do the trick).
Ranges are not implemented. There's no point in implementing those because we have publication data for only a dozen books. But *goethe faust* works. -- Marcello Perathoner webmaster@gutenberg.org

Marcello Perathoner <marcello@perathoner.de> writes:
On 02/10/2012 09:41 PM, Karl Eichwalder wrote:
I've no idea what I must enter if I'm interested in works by Goethe published before 1833 (*goethe -1833* does not seem to do the trick).
Ranges are not implemented. There's no point in implementing those because we have publication data for only a dozen books.
This is strange because you must enter the copyright or release/publication date, if you do the copyright clearance.
But *goethe faust* works.
Yes, but there the problem is that you see several "Goethe, Faust" editions without further details. Neither the language nor the publication date is listed. It is the same with "Shakespeare, Hamlet" listings. -- Karl Eichwalder

On 02/11/2012 03:56 PM, Karl Eichwalder wrote:
This is strange because you must enter the copyright or release/publication date, if you do the copyright clearance.
That is a different database. It's not trivial to match copyright clearance records to ebooks because the ebook no. gets affixed late in the process. -- Marcello Perathoner webmaster@gutenberg.org

"Marcello" == Marcello Perathoner <marcello@perathoner.de> writes:
Marcello> On 02/11/2012 03:56 PM, Karl Eichwalder wrote:
>> This is strange because you must enter the copyright or release/publication date, if you do the copyright clearance.

Marcello> That is a different database. It's not trivial to match copyright clearance records to ebooks because the ebook no. gets affixed late in the process.

I understand that going from the clearance to the book might be difficult, since the clearance may cover more than one book, but the converse should be straightforward: we need a clearance key to post a book. But the data are discarded by the WW-ers, and are hence difficult to retrieve.

All the ww procedures toss away data. To get a clearance one has to enter the publisher name and location and publishing date; these are present in the clearance record, and can be consulted. The full clearance itself contains the publication date, and is part of the upload note, but the publisher is gone (unless the preparer transcribes the front page, which is now usual). And the posted note discards this information.

It would be enough to include the full clearance metadata in the posting note, or at least the clearance number, that gives access to the clearance records, and the catalogue team would have all the data.

Carlo

On Sat, Feb 11, 2012 at 8:01 AM, Carlo Traverso <traverso@posso.dm.unipi.it>wrote:
"Marcello" == Marcello Perathoner <marcello@perathoner.de> writes:
All the ww procedures toss away data. To get a clearance one has to enter the publisher name and location and publishing date; these are present in the clearance record, and can be consulted. The full clearance itself contains the publication date, and is part of the upload note, but the publisher is gone (unless the preparer transcribes the front page, that now is usual). And the posted note discards this information.
And it's remarkable someone would try to reconstruct the data and yet be so indifferent about having the page images to refer to, or accessible from the text.

On 2/11/2012 11:18 AM, don kretz wrote:
And it's remarkable someone would try to reconstruct the data and yet be so indifferent about having the page images to refer to, or accessible from the text.
What I find remarkable is that after 2 decades anyone would expect the Project Gutenberg old guard to do anything other than the same thing they've been doing.

On Sat, Feb 11, 2012 at 12:15:41PM -0700, Lee Passey wrote:
On 2/11/2012 11:18 AM, don kretz wrote:
And it's remarkable someone would try to reconstruct the data and yet be so indifferent about having the page images to refer to, or accessible from the text.
What I find remarkable is that after 2 decades anyone would expect the Project Gutenberg old guard to do anything other than the same thing they've been doing.
....Which is to leave such decisions to the eBook's submitter(s). -- Greg

On Mon, February 13, 2012 12:15 am, Greg Newby wrote:
On Sat, Feb 11, 2012 at 12:15:41PM -0700, Lee Passey wrote:
On 2/11/2012 11:18 AM, don kretz wrote:
And it's remarkable someone would try to reconstruct the data and yet be so indifferent about having the page images to refer to, or accessible from the text.
What I find remarkable is that after 2 decades anyone would expect the Project Gutenberg old guard to do anything other than the same thing they've been doing.
....Which is to leave such decisions to the eBook's submitter(s).
Precisely my point. So all you complainers out there, know that your suggestions to improve PG will fall on deaf ears. As far as PG is concerned, what was is what will be.

In 1971, Michael Hart was given an operator's account with $100,000,000 of computer time in it by the operators of the mainframe at the University of Illinois. When he first started entering texts into that system he asked no one's permission -- he "just did it." Today that philosophy still holds.

Mr. Newby has offered disk space and a fat pipe to anyone who wants to improve Project Gutenberg. That offer is equivalent to the operator's time Mr. Hart started with. Now anyone who wants to improve PG can "just do it." No permission is needed. So stop trying to improve Project Gutenberg, and just start developing the alternative.

Don> What I find remarkable is that after 2 decades anyone would expect the
Project Gutenberg old guard to do anything other than the same thing they've been doing.
Greg>....Which is to leave such decisions to the eBook's submitter(s).

Again, I have written software that would allow one to back-align PG works to "the original text" even when they are not "identical" texts, and can reintroduce page numbers and "original" line breaks. It's in a crude state right now, because no one has actually expressed an interest. My intent was that it could be used by DP to reprocess old crufty PG files back through their system (which it could be used for) if they wanted to [so that no one at DP really has ANY excuse to complain about independently produced books], or it could be used by someone wanting to back-submit to archive.org. Or it could be used to pursue "more scholarly" versions.

The software "works" by taking one "polished" PG text and one "unpolished" text, say raw OCR, Levenshtein-matching them on word tokens, and then cloning the formatting whitespace from the one to the other. It can also clone over the page numbers.

In general, obviously, if you want to, say, produce a "scholarly" edition from a PG text you're going to have to re-proof your book after performing such back matching. My software can help with that too.
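For anyone who wants to experiment with the general idea before Jim's code is in shape for public consumption, here is a minimal sketch in Python. It is an approximation only: it uses the standard difflib matcher instead of an explicit Levenshtein computation, drops words found only in the polished text, and does not attempt the windowing, page-break handling, or prefix/suffix trimming that the real tool needs.

[code]
# Minimal sketch of the general idea, not a reimplementation of pgdiff:
# align the two texts on word tokens and re-emit the "polished" words
# using the raw OCR text's line breaks.

import difflib

def clone_linebreaks(polished, raw):
    """Return the polished words re-wrapped on the raw text's line breaks."""
    pol_words = polished.split()
    raw_lines = raw.splitlines()
    raw_words = raw.split()

    sm = difflib.SequenceMatcher(None, pol_words, raw_words, autojunk=False)
    out_words = [''] * len(raw_words)     # one output slot per raw word
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ('equal', 'replace'):
            for k, j in enumerate(range(j1, j2)):
                # fall back to the raw word when a replace span differs in length
                out_words[j] = pol_words[i1 + k] if i1 + k < i2 else raw_words[j]
        elif tag == 'insert':             # words present only in the raw text
            for j in range(j1, j2):
                out_words[j] = raw_words[j]
        # tag == 'delete': polished-only words are dropped in this sketch

    # reassemble using the raw text's words-per-line counts
    lines, pos = [], 0
    for n in (len(line.split()) for line in raw_lines):
        lines.append(' '.join(out_words[pos:pos + n]))
        pos += n
    return '\n'.join(lines)

# e.g. open('rewrapped.txt', 'w').write(
#          clone_linebreaks(open('pg1342.txt').read(), open('ia.txt').read()))
[/code]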

"Jim" == Jim Adcock <jimad@msn.com> writes:
Don> What I find remarkable is that after 2 decades anyone would expect the Project Gutenberg old guard to do anything other than the same thing they've been doing.

Greg> ....Which is to leave such decisions to the eBook's submitter(s).

Jim> Again, I have written software that would allow one to back-align PG works to "the original text" even when they are not "identical" texts, and can reintroduce page numbers and "original" line breaks. It's in a crude state right now, because no one has actually expressed an interest. My intent was that it could be used by DP to reprocess old crufty PG files back through their system (which it could be used for) if they wanted to [so that no one at DP really has ANY excuse to complain about independently produced books], or it could be used by someone wanting to back-submit to archive.org. Or it could be used to pursue "more scholarly" versions.

Jim> The software "works" by taking one "polished" PG text and one "unpolished" text, say raw OCR, Levenshtein-matching them on word tokens, and then cloning the formatting whitespace from the one to the other. It can also clone over the page numbers.

Jim> In general, obviously, if you want to, say, produce a "scholarly" edition from a PG text you're going to have to re-proof your book after performing such back matching. My software can help with that too.

Much simpler, one can use wdiff (or dwdiff) and preserve whitespace from one and non-whitespace from the other through regexp. I am interested in comparing your tool with my wdiff approach.

Carlo

I am interested in comparing your tool with my wdiff approach.
See: http://freekindlebooks.org/Dev/pgdiff.cpp

where this might be called using something like:

pgdiff -linebreaks -w 1000 pg1342.txt ia.txt > rewrapped.txt

which clones the linebreaks from ia.txt onto pg1342.txt, creating rewrapped.txt.

The -w 1000 parameter specifies that the longest continuous stretch of mismatched words to search for a match should be about 1000 words -- since this is an n^2 algorithm. This is useful in versioning, for example, where one version of the text may contain a paragraph not found in the other text.

I haven't worked much on the issue of pagebreaks. Right now I just have a hard-wired assumption that pagebreaks are marked with "PAGEBREAK" in ia.txt, in which case those pagebreaks are also passed through to rewrapped.txt.

Long uncommon prefixes and suffixes such as PG legalese and/or generated TOCs should be removed. The first word of each text and the last word of each text should be identical to "get the algorithm off on the right foot." Often I simply put in a dummy word at the start and end of each text, such as "START" and "END", to force this match. This requirement is a bug that I need to sort out.

More commonly I use this program in versioning to compare two different versions of "the same" text -- not necessarily identical editions -- by doing a:

pgdiff -w 1000 pg1342.txt ia.txt > rewrapped.txt

which marks areas of disagreement between pg1342.txt and ia.txt.

On 02/15/2012 10:16 PM, James Adcock wrote:
I am interested in comparing your tool with my wdiff approach.
See:
1. Doesn't even compile because of the gratuitous use of M$ proprietary functions like fopen_s and _flushall.

2. Finally compiles but doesn't even get to print the help screen before it dumps core because of this:

while (*pSep != chFileSeparator) --pSep;

What happens if we have no separator or the separator is '/'?

$ ./a.out
Segmentation fault
$

3. And it can't do proper math:

$ ./a.out -w 9999 ../publish/76/76.rst ~/Documents/76-test.rst
Setting a -w parameter larger than 10,000 results in excessive run times
$

4. And when it finally runs, it runs slooooooooow and gives us this:

{ “*Dah,* | “DAH, } now, Huck, what I tell { you? } { — } { what | you?—what } I tell you up dah on Jackson islan'? { I | I } { TOLE | *tole* } you I got a hairy breas', en what's de sign un it; en I { TOLE | *tole* } you I ben rich wunst, en gwineter to be rich { AGIN; | *agin;* } en it's come true; en heah she { is! | *is! } { DAH, | Dah,* } now! { doan’ | doan' } talk to { *me* } { — } { signs | ME—signs } is { SIGNS, | *signs,* } mine I tell you; en I knowed { jis’ | jis' } { ‘s | 's } well { ‘at | 'at } I { ‘uz | 'uz } gwineter be rich agin as I's { a-stannin’ | a-stannin' } heah dis minute!”

What do I do with this mess? There are not even line numbers in this mess.

-- Marcello Perathoner webmaster@gutenberg.org

Any real webmaster will know about apt-get wdiff. Check with your system administrator. The output may be too cryptic for some not familiar with unix conventions. It's pretty terse sometimes.

Doesn't even compile because of the gratuitous use of M$ proprietary functions like fopen_s and _flushall.
If you had been following the conversation you would have read that I was in no way claiming my software was in a state ready for public consumption.
And it can't do proper math:
$ ./a.out -w 9999 ../publish/76/76.rst ~/Documents/76-test.rst
Setting a -w parameter larger than 10,000 results in excessive run times
$

Not sure what "it can't do proper math" means. This algorithm, like all Levenshtein string matching, is n^2 in the maximum distance of string mismatches you want to handle. If you have less than 10,000 words in a row different between your two documents, then set a smaller maximum mismatch distance; 1,000 is very generous for most document sets. If the routines you are used to run faster than this, it is because they crap out when the input file sets aren't closely similar. Of course this also depends on how fast a machine you have and how much memory you have.
What do I do with this mess? There are not even line numbers in this mess.
I'm not sure what you want to do with this software, so I can't guess how to help you. Usually in versioning I run it without the -justdiffs option, in which case it tags regions of the output .txt file *in context* that I need to look more closely at with the { ... }. I use the -justdiffs option just as a quick visual summary "sanity check" of the edit changes I have made. Suggest taking a look at the output *without* the -justdiffs option in order to better understand -justdiffs.

On 02/17/2012 06:20 PM, Jim Adcock wrote:
And it can't do proper math:
$ ./a.out -w 9999 ../publish/76/76.rst ~/Documents/76-test.rst Setting a -w parameter larger than 10,000 results in excessive run times $
Not sure what "it can't do proper math" means.
It should not terminate with that warning when you set -w 9999. Because in my math book 9999 is < 10000.
This algorithm like all Levenshtein string match is n^2 in the maximum distance of string mismatches you want to handle.
What is a "Levenshtein string match"??? even Google doesn't know.
If you have less than 10,000 words in a row different between your two documents, then set a smaller maximum mismatch distance, such as 1,000 is very generous for most documents sets. If the routines you are used to run faster than this it is because they crap out when the input files sets aren't closely similar.
There are routines that run faster and can adapt dynamically. No need to trial-and-error.
4.
What do I do with this mess? There are not even line numbers in this mess.
I'm not sure what you want to do with this software, so I can't guess how to help you.
Line numbers in the output so that if I run this animal inside emacs or vi I can go from one mismatch to the next. -- Marcello Perathoner webmaster@gutenberg.org

It should not terminate with that warning when you set -w 9999. Because in my math book 9999 is < 10000.
Don't know why it is terminating on your machine. "9999 vs. 10000" is just a coincidence -- the 10000 number is hardwired into the usage prompt. Suggest you try a parameter like -w 1000 unless you know you have two *very* different input texts.

Again, any large unmatching prefixes and suffixes such as PG legalese, mismatched TOCs, "scholarly introductions", etc. should be removed first, and right now the code has a "known bug" where if the first words and the last words of the two texts don't match it may not synchronize (which I fix just by inserting dummy tokens such as "START" and "END").

It's been a while since I've worked on this, but I think it expects a word dict "GutDicEN.txt" and expects it in more-or-less sort order. It is slow if the dict isn't more-or-less in sort order.
What is a "Levenshtein string match"??? even Google doesn't know.
There are routines that run faster and can adapt dynamically. No need to trial-and-error.

Strange. Your copy of Google works differently than my copy of Google, which gives: http://en.wikipedia.org/wiki/Levenshtein_distance

In the case of word diff routines, the string token is basically a word, not a char.

Not sure what you mean by "trial and error", but the other routines I have tried just crapped out when I tried them on "real world" tasks.
Line numbers in the output so that if I run this animal inside emacs or vi I can go from one mismatch to the next.
Give me a ref to your choice of diff output format and I will see if I can help you if you are serious about *actually* wanting to use this.

On Sat, Feb 11, 2012 at 10:18:21AM -0800, don kretz wrote:
On Sat, Feb 11, 2012 at 8:01 AM, Carlo Traverso <traverso@posso.dm.unipi.it>wrote:
> "Marcello" == Marcello Perathoner <marcello@perathoner.de> writes:
All the ww procedures toss away data. To get a clearance one has to enter the publisher name and location and publishing date; these are present in the clearance record, and can be consulted. The full clearance itself contains the publication date, and is part of the upload note, but the publisher is gone (unless the preparer transcribes the front page, that now is usual). And the posted note discards this information.
And it's remarkable someone would try to reconstruct the data and yet be so indifferent about having the page images to refer to, or accessible from the text.
The issue of whether, or how, to include information about sources used has been contentious as long as there has been a Project Gutenberg. The policy forever (at least since the first version of the "small print", in the early 1990s or late 1980s) is found in every single eBook and elsewhere:

"Project Gutenberg-tm eBooks are often created from several printed editions, all of which are confirmed as Public Domain in the U.S. unless a copyright notice is included. Thus, we do not necessarily keep eBooks in compliance with any particular paper edition."

It has long been recognized that some eBook producers prefer to have their work adhere to a particular print edition. This is perfectly fine. It is certainly acceptable to include any quantity of information about source(s) used in a given eBook. Witness the practice of including scans and transcriptions of the TP&V and other in-book metadata, as part of an eBook submission.

For producers to include such information in a more structured format seems fine to me. I don't recall anyone ever presenting an eBook in such a format (say, with a snippet of Dublin Core XML at the end).

Keep in mind that it is very much our policy and intent to NOT maintain any particular adherence of an eBook to a print item. For example, if the print edition had an error that was fixed in later editions, we certainly would apply that correction if it were submitted to the errata process. (Ok, I can think of one or two exceptions, such as our Shakespeare first folios.)

All that said: the idea that PG could catalog our items, and derive their *primary* metadata as based on one or more print editions used as sources is just not consistent with the policy and practice cited above. Our #140 was *not* published in 1906, it was published in 1994. (Hmmm...interesting example, since the catalog doesn't have this right, either.)

We'd get beat up about it. Librarians would complain. Publishers would have a basis to complain about us mis-using their trademarks. And, it would be false. The PG editions are *not* their print sources.

The idea of structured metadata about sources makes sense. Only if it's clearly a search for source material(s) used, not for the PG titles. And, carrying such information from the copyright clearance through the eBook submission is something that current producers could do today, and often do (though not in a way that is structured to be easily machine-parsable).

In short, I see some technical problems and solutions to making source metadata easier to (a) keep with an eBook, and (b) search for. I don't see any policy in the way. The issue of whether a PG eBook must adhere to a particular print edition, or is the same as a print edition, was settled decades ago. Those with scholarly interests or other special purposes that require study of a particular print edition are invited, and have always been invited, to find other resources to supplement or replace those of Project Gutenberg.

-- Greg

On 02/13/2012 08:13 AM, Greg Newby wrote:
The policy forever (at least since the first version of the "small print", in the early 1990s or late 1980s) is found in every single eBook and elsewhere:
"Project Gutenberg-tm eBooks are often created from several printed editions, all of which are confirmed as Public Domain in the U.S. unless a copyright notice is included. Thus, we do not necessarily keep eBooks in compliance with any particular paper edition."
This policy is a de-facto non-policy as the majority of our books are now produced from one specific paper edition. And that is what most producers want.

I think we should drop this "policy" entirely and make the catalog entry reflect the choice of the producer by saying either:

- Project Gutenberg edition transcribed from different sources, or
- include metadata of edition X.
For producers to include such information in a more structured format seems fine to me. I don't recall anyone ever presenting an eBook in such a format (say, with a snippet of Dublin Core XML at the end).
The main point here was to have a standard way to include metadata information in every file and have it go thru the WW process unscathed.
All that said: the idea that PG could catalog our items, and derive their *primary* metadata as based on one or more print editions used as sources is just not consistent with the policy and practice cited above. Our #140 was *not* published in 1906, it was published in 1994. (Hmmm...interesting example, since the catalog doesn't have this right, either.)
We'd get beat up about it. Librarians would complain. Publishers would have a basis to complain about us mis-using their trademarks. And, it would be false. The PG editions are *not* their print sources.
This argument is wrong on 3 counts:

1. A MARC catalog entry for a reproduction can either describe the source or the reproduction. This is intentionally left as a choice for the library. If the main catalog entries describe the source, then MARC field 533 describes the reproduction. See:

http://www.loc.gov/marc/bibliographic/bd533.html

If PG decides so, it is perfectly legal MARC to encode the original source description in the main body and use 533 to describe our electronic edition data.

2. It is not clear why the portion of our ebooks that are faithful reproductions of one paper edition should not get the metadata for their editions included. OTOH those produced from many editions should clearly state: PG edition. Both ways should be acceptable.

3. The argument "somebody could sue us for no reason" is always invalid.
The idea of structured metadata about sources makes sense. Only if it's clearly a search for source material(s) used, not for the PG titles.
I think it is clear enough for everybody from the context of the search that they are searching for an ebook and not a physical book. -- Marcello Perathoner webmaster@gutenberg.org
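To make the first of Marcello's three points concrete, a rough illustration only -- not a real catalog record. The 533 subfield layout is simplified from the LoC page linked above, the 260/300 values are invented stand-ins for a paper source, and the 1906 / Urbana / Project Gutenberg / 1994 values echo the #140 example discussed elsewhere in this thread:

[code]
260    New York : Some Paper Publisher, 1906.
300    300 pages ; 8x13 inches.
533    Electronic reproduction. $b Urbana, Ill. : $c Project Gutenberg, $d 1994.
[/code]

Here the main body of the record describes the print source, while field 533 describes the PG electronic edition.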

On Mon, Feb 13, 2012 at 12:03:33PM +0100, Marcello Perathoner wrote:
On 02/13/2012 08:13 AM, Greg Newby wrote:
The policy forever (at least since the first version of the "small print", in the early 1990s or late 1980s) is found in every single eBook and elsewhere:
"Project Gutenberg-tm eBooks are often created from several printed editions, all of which are confirmed as Public Domain in the U.S. unless a copyright notice is included. Thus, we do not necessarily keep eBooks in compliance with any particular paper edition."
This policy is a de-facto non-policy as the majority of our books are now produced from one specific paper edition. And that is what most producers want.
I think we should drop this "policy" entirely and make the catalog entry reflect the choice of the producer by saying either:
- Project Gutenberg edition transcribed from different sources, or
- include metadata of edition X.
The current policy allows this.
For producers to include such information in a more structured format seems fine to me. I don't recall anyone ever presenting an eBook in such a format (say, with a snippet of Dublin Core XML at the end).
The main point here was to have a standard way to include metadata information in every file and have it go thru the WW process unscathed.
The current policy allows this.
All that said: the idea that PG could catalog our items, and derive their *primary* metadata as based on one or more print editions used as sources is just not consistent with the policy and practice cited above. Our #140 was *not* published in 1906, it was published in 1994. (Hmmm...interesting example, since the catalog doesn't have this right, either.)
We'd get beat up about it. Librarians would complain. Publishers would have a basis to complain about us mis-using their trademarks. And, it would be false. The PG editions are *not* their print sources.
This argument is wrong on 3 counts:
1.
A MARC catalog entry for a reproduction can either describe the source or the reproduction. This is intentionally left as a choice for the library. If the main catalog entries describe the source, then MARC field 533 describes the reproduction. See:
http://www.loc.gov/marc/bibliographic/bd533.html
If PG decides so, it is perfectly legal MARC to encode the original source description in the main body and use 533 to describe our electronic edition data.
I don't think this is a correct interpretation of what MARC allows, but would like to become better informed. Reproductions are about facsimiles, photographs and microfilms. We're certainly not making pictures of books, or facsimiles.
2.
It is not clear why the portion of our ebooks that are faithful reproductions of one paper edition should not get the metadata for their editions included.
(Note that we'd need some measure of faithfulness. My notion of faithfulness is not yours. For example, one of us might not think page numbers are important. Another might believe that typesetting errors should always be preserved. Etc.) (Theoretically, if the RST master includes page numbers and graphics, and the EPUB derivative does not, is it still as faithful a reproduction?)
OTOH those produced from many editions should clearly state: PG edition.
Both ways should be acceptable.
We already do both. The producers often do include such source metadata, and it is allowed (even if it's a less-than-faithful reproduction). But it's not in an easily machine-parseable format. Such a format would be a fine idea. -- Greg
3.
The argument "somebody could sue us for no reason" is always invalid.
The idea of structured metadata about sources makes sense. Only if it's clearly a search for source material(s) used, not for the PG titles.
I think it is clear enough for everybody from the context of the search that they are searching for an ebook and not a physical book.

On 02/13/2012 12:55 PM, Greg Newby wrote:
Reproductions are about facsimiles, photographs and microfilms. We're certainly not making pictures of books, or facsimiles.
It is no surprise the MARC standard talks about microforms, because that's what brick and mortar libraries do. But it illustrates the concept that the main body of MARC data need not describe the actual item, but may as well describe the item it is derived from.

Of course, if PG does not want to make the site more usable for our readers, it is a perfectly valid choice to once again leave that to other entities, like WorldCat etc.

-- Marcello Perathoner webmaster@gutenberg.org

Marcello> This policy is a de-facto non-policy as the majority of our books are now produced from one specific paper edition. And that is what most producers want.

I work very hard to produce a book which is faithful to one paper edition. I would be very disappointed if PG were introducing "improvements" without checking that those "improvements" remain faithful to the edition I created from. Part of the problem with many ebook editions (including commercial ones) is that they *aren't* "faithful" to *anything.* I believe we "lose" books over the course of time through their cumulative divergence from the original (including blind formatting).

Particularly disappointing to me are "guess my mind" "fixes" where someone thinks they have found, say, a typesetter's error in the book where they can "correctly" guess the mind of the author and fix it. Sometimes you can, sometimes you can't. And sometimes the "repairer" guesses *one* solution to the problem when in fact there could be many.

On 02/11/2012 05:01 PM, Carlo Traverso wrote:
It would be enough to include the full clearance metadata in the posting note, or at least the clearance number, that gives access to the clearance records, and the catalogue team would have all the data.
The posting note is of no use. The key has to be in the posted files, but there's no agreed-upon place to stick it in.

RST has a metadata block that is well suited to transport all sorts of metadata from the producer to the catalog. It may well be sufficient to provide the LoC Call Number, if the book is based upon one specific edition, and the catalog could snarf the rest from the LoC via Z-whats-its-number. -- Marcello Perathoner webmaster@gutenberg.org

This is strange because you must enter the copyright or release/publication date, if you do the copyright clearance.
That is a different database. It's not trivial to match copyright clearance records to ebooks because the ebook no. gets affixed late in the process.
Don't know but that the copyright clearance doesn't work on the "burden of evidence" basis sometimes?

On 2/10/2012 11:20 AM, Marcello Perathoner wrote:
The interface says: "Search All" and it does exactly that. It searches all data we have.
So I conclude from this statement that it searches all of the GUTINDEX files, and catalog.rdf, and catalog.marc. The result of all these searches appears to return the e-text numbers of all matching e-texts. When I follow one of these links, I see a tab entitled Bibrec which contains some data. May I assume that what is presented here is a concatenation of all the matching data from GUTINDEX*, catalog.rdf, and catalog.marc? (That is, after all, all the data you have.) Do you retrieve this data by searching the files, or is it cached somewhere?

On 02/11/2012 08:27 PM, Lee Passey wrote:
On 2/10/2012 11:20 AM, Marcello Perathoner wrote:
The interface says: "Search All" and it does exactly that. It searches all data we have.
So I conclude from this statement that it searches all of the GUTINDEX files, and catalog.rdf, and catalog.marc. The result of all these searches appears to return the e-text numbers of all matching e-texts.
When I follow one of these links, I see a tab entitled Bibrec which contains some data.
May I assume that what is presented here is a concatenation of all the matching data from GUTINDEX*, catalog.rdf, and catalog.marc? (That is, after all, all the data you have.)
Do you retrieve this data by searching the files, or is it cached somewhere?
All stored in an SQL database. catalog.rdf is generated from the db and not the other way. -- Marcello Perathoner webmaster@gutenberg.org

On 2/11/2012 3:02 PM, Marcello Perathoner wrote:
On 02/11/2012 08:27 PM, Lee Passey wrote:
On 2/10/2012 11:20 AM, Marcello Perathoner wrote:
The interface says: "Search All" and it does exactly that. It searches all data we have.
So I conclude from this statement that it searches all of the GUTINDEX files, and catalog.rdf, and catalog.marc. The result of all these searches appears to return the e-text numbers of all matching e-texts.
When I follow one of these links, I see a tab entitled Bibrec which contains some data.
May I assume that what is presented here is a concatenation of all the matching data from GUTINDEX*, catalog.rdf, and catalog.marc? (That is, after all, all the data you have.)
Do you retrieve this data by searching the files, or is it cached somewhere?
All stored in an SQL database. catalog.rdf is generated from the db and not the other way.
How does your search algorithm work against the database? I've always thought that full text searches (full text of the metadata, not full text of the pg e-text) were quite inefficient. Do you generate an ad-hoc SQL query every time someone does a search?

On 02/11/2012 11:30 PM, Lee Passey wrote:
How does your search algorithm work against the database? I've always thought that full text searches (full text of the metadata, not full text of the pg e-text) were quite inefficient. Do you generate an ad-hoc SQL query every time someone does a search?
One text search column in the books table is enough. You fill that with all metadata you have about the book. -- Marcello Perathoner webmaster@gutenberg.org

On 2/11/2012 3:58 PM, Marcello Perathoner wrote:
On 02/11/2012 11:30 PM, Lee Passey wrote:
How does your search algorithm work against the database? I've always thought that full text searches (full text of the metadata, not full text of the pg e-text) were quite inefficient. Do you generate an ad-hoc SQL query every time someone does a search?
One text search column in the books table is enough. You fill that with all metadata you have about the book.
So, select etextid from books where upper(metadata) like '%SEARCH-TERM%'; ??

On 02/12/2012 12:03 AM, Lee Passey wrote:
select etextid from books where upper(metadata) like '%SEARCH-TERM%';
SELECT books.pk as url, books.title, books.author, books.downloads,
       books.release_date, books.fk_categories
  FROM v_appserver_books_categories_2 as books
 WHERE books.tsvec @@ to_tsquery ('english', E'Struww:*')
 ORDER BY downloads DESC
 LIMIT 25

-- Marcello Perathoner webmaster@gutenberg.org
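As an aside, a sketch of how such a single search column might be kept filled. The table and column names (books, pk, tsvec) are the ones visible in the query above; the connection details, the update job, and where the metadata strings come from are assumptions, not PG's actual code.

[code]
# Sketch: keep one search column filled with everything we know about a
# book, so a single "Search All" query like the one above can match any
# of it. Only the names books/pk/tsvec are taken from the quoted query.

import psycopg2

def refresh_tsvec(conn, pk, metadata_fields):
    """metadata_fields: strings such as title, author, subjects, imprint, notes."""
    blob = ' '.join(f for f in metadata_fields if f)
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE books SET tsvec = to_tsvector('english', %s) WHERE pk = %s",
            (blob, pk))
    conn.commit()

# usage sketch (placeholders throughout):
# conn = psycopg2.connect(dbname='gutenberg', user='catalog')
# refresh_tsvec(conn, 12345, ['Some Title', 'Some Author',
#                             'New York, 1921', '300 pages, 8x13 inches'])
[/code]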

On 2/11/2012 4:45 PM, Marcello Perathoner wrote:
SELECT books.pk as url, books.title, books.author, books.downloads, books.release_date, books.fk_categories FROM v_appserver_books_categories_2 as books WHERE books.tsvec @@ to_tsquery ('english', E'Struww:*') ORDER BY downloads DESC LIMIT 25
Very good, thank you. This suggests that most of the data on the Bibrec screen comes from the books.fk_categories column. On the other hand, your naming convention suggests that the fk_categories tuple is actually foreign keys into another table. Is this correct? If so, what is the other table?

Mr. Newby -- Is this database mirrored to readingroo.ms? If not, can it be?

Mr. Perathoner -- Can you post, or point me to, the schema definition of this entire database?

Unfortunately there are some important errors in your knowledge about a few things, but also some good ideas. I'm not sure where you got some of these impressions, but I will try to respond: On Fri, Feb 10, 2012 at 12:04:10AM -1000, Karen Lofstrom wrote:
On Thu, Feb 9, 2012 at 11:45 PM, Greg Newby <gbnewby@pglaf.org> wrote:
This approach would be problematic, since the PG titles are considered as published when *we* release them. Published in Urbana, Illinois. Published by Project Gutenberg.
It might be desirable to include something about original publication dates of the source material(s) we used, but I think that's not consistent with (say) Dublin Core metadata for our books.
Then we bolt on fields that give original publication date, original publisher and place of publication. I'm a scholar. I *need* that info.
Adding such metadata is a possibility. The particular fields etc. are not something I'm personally motivated to work on, but I think it's do-able. As you know, we do collect such information when the copyright clearance is given. Adding this to the upload & posting process would not be too difficult (even as an option).
The original PG intent, to obscure the source of the texts (and in some cases to combine texts) was a mistake, and I'm glad that the policy has been abandoned.
The policy is, and always has been, to leave this to the discretion of the person doing the digitization. There is no requirement that source data be included in a PG eBook, nor is there a prohibition against it. Similarly, there is no requirement of adherence to various details (such as whether to include typos as-is, or page numbers, or a prohibition on modernizing spelling). Note that all of this upsets some people with deep expertise and scholarly interests. That's a hard crowd to please.

A PG subset or spinoff that *does* enforce some set of requirements like those mentioned above would, of course, be welcome and encouraged.
In case you are thinking, "but what if we try to accurately represent the old printed book in our new eBook," it's still not appropriate to claim that the old book's metadata applies to our new eBook. Furthermore, we'll have scholars and other riffraff complaining that our book is not, in fact, the same.
But there should be a place IN the metadata to credit the source of the PG version.
This is possible, as mentioned above.
How far we can depart from the original version is another discussion. We have to balance fidelity to the original with readability in the present (and usability on ereaders). There's ambiguity and difficulty there. I am happy that DP has been moving towards more scholarly editions, with many PPers noting which changes have been in the text, as emendations of typos in the original.
The PG policy is the PG policy, as mentioned above. Individual or collective contributors are, of course able to make their own decisions.
Well, that's in there. Original is in the book, and the metadata. Updates are in the book.
But none of that is easily searchable! The PG interface allows us to search only by au and title. Google and IA and Manybooks aren't any better, really. Is there any organization out there that has done it right?
Huh? http://www.gutenberg.org/catalog/world/search Hugs & puppies, -- Greg

On Fri, Feb 10, 2012 at 8:15 AM, Greg Newby <gbnewby@pglaf.org> wrote:
But the main page says that advanced search covers only author name, title, language or subjects. The page you linked adds a few more things. Then Marcello says that I can search on original date and place of publication. How am I supposed to know that this is possible? The interface is user-hostile. -- Karen Lofstrom

On Fri, February 10, 2012 11:23 am, Karen Lofstrom wrote:
The interface is user-hostile.
Yes, but you missed the most important part of Mr. Newby's response:
A PG subset or spinoff that *does* enforce some set of requirements like those mentioned above would, of course, be welcome and encouraged.
Mr. Newby is not going to enforce metadata requirements. Mr. Perathoner probably could not create a user-friendly interface if his life depended on it. But all the data is there, and if you can design a new interface, we can implement it. Don't complain about PG's policies, ignore them and start over.

On 02/09/2012 08:43 PM, Karen Lofstrom wrote:
From my USER point of view, PG gives next to no usable information to people searching for books. Organized by author and title. That's all. I would want info re date and place of publication, publisher, and genre. Birth and death dates of author. If a serial publication, volume and number, and the run of the publication.
The database can hold all that data. Do you volunteer to enter them? No, eh?
(I might want to look at all the books and periodicals published in London in 1882. No way to do it now, and it ought to be easy.)
Do enter *Berlin 1922* into the PG search.
I would want LOC and Dewey Decimal and other such numbers (dunno about schemes used in non-American libraries) stored with the book info so that I could find all versions of a book, e, paper, whatever.
Do enter *99004276* into the PG search.
I would want to know WHEN the book was first digitized and WHEN it had been corrected, if at all. (That's important for judging the reliability of the text.)
That is why I'm proposing a source repository. You can get all that plus see which lines were corrected.
Finally, I'd want a recommendation system like the ones used by Amazon or Netflix.
We have a light-weight recommendation system based on book downloads only. If you want a better one, then PG must keep a lot of personal data about you. That is something I want to avoid.
Oh, and I'd like to be able to download ALL of an author's books at once, instead of having to do it painfully and slowly one by one. I suppose I still think like an academic. If I'm interested in Joseph Altsheler, I want to see everything he published.
If you show me how to start multiple downloads on all browsers I'll do it.
That's a user's POV, not a librarian's and not a programmer's. I don't think that anything in my list is undoable. It's all been done. It's just a matter of assembling the pieces.
People went to the moon, why don't we? It's just a matter of ... -- Marcello Perathoner webmaster@gutenberg.org

I'm particularly interested in hearing from Ms. Lofstrom with suggestions about what WEM metadata should be collected, and how it might be structured and retained.
I would also like to suggest that, in keeping with the PG charter to preserve "books" not "parts of books", the "WEM" data (not implying literally the WEM data) needs to be part of the distribution, at least that part of the distribution going out of PG to the end customer. In the case of HTML (used only for an example) that would require PG layering even more PG traditions on top of HTML, because HTML doesn't contain this stuff. It could go at the end, and it could go in in a "hidden" manner (except that that which is hidden is often not hidden as we unfortunately keep rediscovering in the case of pagenums) or it could go in the body in sensible places, such as part of the title page, if there are suitable tagging conventions.

On Thu, February 9, 2012 2:24 pm, James Adcock wrote:
In the case of HTML (used only for an example) that would require PG layering even more PG traditions on top of HTML, because HTML doesn't contain this stuff. It could go at the end, and it could go in in a "hidden" manner (except that that which is hidden is often not hidden as we unfortunately keep rediscovering in the case of pagenums) or it could go in the body in sensible places, such as part of the title page, if there are suitable tagging conventions.

Lee>See: http://dublincore.org/documents/dc-html/

Yes, I think you're right -- it seems like the HTML "head profile" + Dublin Core would be the place to put it:

<head profile="http://dublincore.org/documents/2008/08/04/dc-html/">
  <title>Services to Government</title>
  ....
</head>
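For what it's worth, a sketch of how a producer's metadata might be rendered into such a head block, following the DC-HTML convention of a schema.DC link plus DC.* meta names. The MARC.* names mirror the RST ".. meta::" block shown earlier in this thread and are only a suggestion, not an agreed PG convention; the helper and field values are illustrative.

[code]
# Sketch: render a metadata dict into <link>/<meta> elements for <head>,
# per the DC-HTML convention (schema.DC link plus DC.* names). MARC.*
# names carried the same way are a suggestion only.

from html import escape

def dc_head_block(fields):
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />']
    for name, value in fields.items():
        lines.append('<meta name="%s" content="%s" />' % (escape(name), escape(value)))
    return '\n'.join(lines)

print(dc_head_block({
    'DC.title':  'Services to Government',
    'MARC.250':  'First Edition',
    'MARC.260':  'Salem, Mass. 1692',
}))
[/code]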

Jimmy> the problem is inferring this metadata from HTML. That metadata is commonly accepted as being part of what it means to be "a book." Little details like title, author, publisher, publishing dates, copyright info, title page, TOC....

Off the top of my head, I'd define a master format as something that can be completely and losslessly converted to other formats as needed.
Well, obviously, that never happens. First, there is a loss when the transcriber transcribes from the "display format" of the printed page to the "master format", which doesn't contain formatting information -- but needs to, because at least some of the information in the original book *is* encoded in formatting choices -- and then there is a second loss in converting from the "master format" to the "other format" when there is not a one-to-one match of elements, which means that the conversion has to "fake it."

There are also losses when the transcriber does not completely understand, or does not care to completely understand, the "master format." And there are also losses when the writers of the translation tools don't bother to understand all the details of the other format, which leads them to "fake it" more than they need to [or less than they need to, depending on the reliability of the actual target]. So in practice each and every conversion leads at least potentially to a loss.

On Tue, February 7, 2012 4:45 pm, Joshua Hutchinson wrote:
This is the statement you keep making that is driving me nuts. EPUB is a wonderful display format. It is NOT a master format.
Heck, it's not even a display format; the display parameters are determined by the underlying HTML. ePub is an encapsulation and distribution format, approximately equal to creating a TAR archive. To use ePub as a master you would have to first extract and merge all the encapsulated HTML -- so why not just use the HTML as the master in the first place? But what can you do? Mr. Adcock doesn't understand this, and is going to continue to insist that ePub is an appropriate master file type. I suspect that the right thing to do is to say "yes, we'll consider ePub as a master file," then just never do anything to support it. I suspect that if we come up with a good tool chain to get a master file into a good ePub, the whole issue will just evaporate.

On 8 February 2012 18:19, Lee Passey <lee@novomail.net> wrote:
On Tue, February 7, 2012 4:45 pm, Joshua Hutchinson wrote:
This is the statement you keep making that is driving me nuts. EPUB is a wonderful display format. It is NOT a master format.
Heck, it's not even a display format; the display parameters are determined by the underlying HTML. ePub is an encapsulation and distribution format, approximately equal to creating a TAR archive.
That's not quite right either, because the container format for EPUB is a zip archive. -- <Sefam> Are any of the mentors around? <jimregan> yes, they're the ones trolling you

Karen>What if DP revised its workflow so that one person combined all the pages, checked for inconsistencies and typos, and marked all the chapter headings and section headings with some easy markup? This text could then be passed to someone who specialized in markup. This would basically split the PP process into two parts. It would resemble a real-world publishing workflow in which editors are asked to mark text divisions before passing the text to the layout person.

Mapping what you are saying to the HTML world, what you are saying is that instead of including the CSS inline in the markup HTML file, we should have that CSS separated into a separate CSS file, and if we needed to target very small machines, or EPUB machines, or MOBI machines then we could just have separate CSS files for those machines, or just do pragmatic marking of some small regions of those CSS files using @media statements and queries so we don't actually even have to create a separate CSS file for each major class of machine.

And if you did this you would then have exactly the same thing the rest of the world is already doing to support small machines, and would be using the same exact features that are already built into HTML5 and CSS3 to fix these kinds of problems, because the rest of the world has also already discovered these kinds of problems, and proposed and implemented language standards to deal with them, and HTML browser support for them, and is actively writing their own HTML to address the relatively simple issues required to refactor HTML code so it looks good on big machines, and on small machines, and even on printed paper, and on display-less version for sightless readers, etc. The rest of the world has already done this. It's just that PG and DP haven't caught on yet. Because there is a class of people at PG who always wants to reinvent the square wheel from scratch.
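To make the @media part of that concrete, a small sketch of the kind of thing Jim is describing -- one stylesheet whose media blocks re-tune the same HTML for small screens and for print. The selectors and values are invented for illustration, not taken from the DP pseudo-standard CSS or from epubmaker.

[code]
/* Illustrative only -- not any actual DP/PG stylesheet. */

.poem     { margin-left: 10%; }           /* comfortable on a 20" monitor */
.sidenote { float: right; width: 20%; }

@media screen and (max-width: 600px) {    /* small e-reader screens */
  .poem     { margin-left: 1em; }
  .sidenote { float: none; width: auto; }
}

@media print {                             /* paper output */
  .pagenum  { display: none; }
}
[/code]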

Hi Jim,

How rude to say "… you are saying…" when she did not say what you said that she said.

regards
Keith.

On 06.02.2012 at 20:14, Jim Adcock wrote:
Karen>What if DP revised its workflow so that one person combined all the pages, checked for inconsistencies and typos, and marked all the chapter headings and section headings with some easy markup? This text could then be passed to someone who specialized in markup. This would basically split the PP process into two parts. It would resemble a real-world publishing workflow in which editors are asked to mark text divisions before passing the text to the layout person.
Mapping what you are saying to the HTML world, what you are saying is that instead of including the CSS inline in the markup HTML file, we should have that CSS separated into a separate CSS file, and if we needed to target very small machines, or EPUB machines, or MOBI machines then we could just have separate CSS files for those machines, or just do pragmatic marking of some small regions of those CSS files using @media statements and queries so we don't actually even have to create a separate CSS file for each major class of machine. And if you did this you would then have exactly the same thing the rest of the world is already doing to support small machines, and would be using the same exact features that are already built into HTML5 and CSS3 to fix these kinds of problems, because the rest of the world has also already discovered these kinds of problems, and proposed and implemented language standards to deal with them, and HTML browser support for them, and is actively writing their own HTML to address the relatively simple issues required to refactor HTML code so it looks good on big machines, and on small machines, and even on printed paper, and on display-less version for sightless readers, etc. The rest of the world has already done this. It's just that PG and DP haven't caught on yet. Because there is a class of people at PG who always wants to reinvent the square wheel from scratch.

On Mon, February 6, 2012 12:14 pm, Jim Adcock wrote:
Mapping what you are saying to the HTML world, what you are saying is that instead of including the CSS inline in the markup HTML file, we should have that CSS separated into a separate CSS file, and if we needed to target very small machines, or EPUB machines, or MOBI machines then we could just have separate CSS files for those machines, or just do pragmatic marking of some small regions of those CSS files using @media statements and queries so we don't actually even have to create a separate CSS file for each major class of machine.
If I have correctly understood this terribly run-on sentence... yes.

I would assume PG must be doing something like this now for html files for ereaders. If you know an html file is going to an ereader format, and you know the default css has, say, irrational margins, you adjust the css. It's what css is for. Isn't that currently happening? Especially for DP books, the pseudo-standard css would I'm sure have default adjustments that would work in most cases. Or am I missing something? On Mon, Feb 6, 2012 at 1:24 PM, Lee Passey <lee@novomail.net> wrote:
On Mon, February 6, 2012 12:14 pm, Jim Adcock wrote:
Mapping what you are saying to the HTML world, what you are saying is that instead of including the CSS inline in the markup HTML file, we should have that CSS separated into a separate CSS file, and if we needed to target very small machines, or EPUB machines, or MOBI machines then we could just have separate CSS files for those machines, or just do pragmatic marking of some small regions of those CSS files using @media statements and queries so we don't actually even have to create a separate CSS file for each major class of machine.
If I have correctly understood this terribly run-on sentence...
yes.

Don>I would assume PG must be doing something like this now for html files for ereaders. If you know an html file is going to an ereader format, and you know the default css has, say, irrational margins, you adjust the css. It's what css is for. Isn't that currently happening? Especially for DP books, the pseudo-standard css would I'm sure have default adjustments that would work in most cases.

Not really what is currently happening, and epubmaker adds its own tag-on css file, but I can't really figure out how what is in there got in there, because it doesn't make a whole lot of sense to me. Some people at DP understand small machines, and implicitly write to those machines, and other people at DP say "PG says they want HTML, so what I will write to is the copy of Moz I have on my PC as it displays on my 20" monitor." And people read books about how to make fixed-layout HTML home pages and then assume that what they read there represents "good" HTML practices for writing books.

:) :) TeX is not necessarily a word processor, but a layout system! ;-)))) ;-)))))

regards
Keith.

On 06.02.2012 at 17:19, Marcello Perathoner wrote:
On 02/06/2012 08:32 AM, don kretz wrote:
I can't think of a single successful word processing product that exposes its underlying text representation to the user.
TeX.
-- Marcello Perathoner webmaster@gutenberg.org

TeX is markup, not an editor. TeX *has* editors that hide the markup: http://en.wikipedia.org/wiki/Comparison_of_TeX_editors
participants (14)

- Andrew Sly
- don kretz
- Greg Newby
- hmonroe.pglaf@huntermonroe.com
- James Adcock
- Jim Adcock
- Jimmy O'Regan
- Joshua Hutchinson
- Karen Lofstrom
- Karl Eichwalder
- Keith J. Schultz
- Lee Passey
- Marcello Perathoner
- traverso@posso.dm.unipi.it