Re: Bowerbird's software projects

gardner said:
i scraped them from canadiana.org myself... :+) but hey, do you still have your original o.c.r.? or a latest version that has the original linebreaks intact? ***
I don't believe that a single pass is feasible,
ok, i should elaborate. multiple passes, to check different aspects, will be required. but multiple passes to check the _same_ aspect are inefficient.
I don't believe that a single pass is feasible, in particular for mismatched quotes and spaced quotes.
i believe you're wrong, and that i can show you.
in particular for mismatched quotes and spaced quotes. You fix the open quote, or in my case close more often than not, then that reveals another quote problem further down/up.
i have already demonstrated that you can fix spacey quotes, and -- in the vast majority of cases -- fix 'em automatically. leading and trailing spacey quotes are easy to fix, of course.

from there, it's a simple matter of segmenting the text into _paragraphs_, and counting quotemarks in each paragraph, making sure that the odd ones are open, and the even closed. then when you come upon a spacey quote, fix it to be open if it is an odd one, and fix it to be close if it is an even one.

if you come up against a case where there is an odd number of quotes in a paragraph, and the next paragraph does not start with a quote, then you have a case you need to look at. similarly, if any of the quotemarks come up as the wrong type (an odd that's close, or an even that's open), you need to look.

you can test this for yourself. you'll find that it's very robust. usually there's no need to spend much time on spacey quotes.
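The parity scheme described above can be sketched in a few lines of Python. This is only an illustration of the approach, not anyone's actual tool; the function name and the flagging details are assumptions of this sketch.

```python
def fix_spacey_quotes(text):
    """Parity check per paragraph: odd-numbered straight double
    quotemarks should be open (hug the following word), even ones
    close (hug the preceding word).  Spacey quotes -- whitespace on
    both sides -- are repaired automatically; quotes attached on the
    wrong side, and paragraphs with an odd quote count, are flagged
    for a human to look at.  (A quotation that legitimately continues
    into the next paragraph will also be flagged; a fuller version
    would check whether the next paragraph opens with a quote, as
    described above.)
    """
    fixed, flags = [], []
    for p_num, para in enumerate(text.split("\n\n"), start=1):
        out, count, i = [], 0, 0
        while i < len(para):
            ch = para[i]
            if ch == '"':
                count += 1
                want_open = (count % 2 == 1)
                before = para[i - 1] if i > 0 else " "
                after = para[i + 1] if i + 1 < len(para) else " "
                if before.isspace() and after.isspace():
                    if want_open:
                        out.append('"')
                        i += 1
                        while i < len(para) and para[i].isspace():
                            i += 1      # glue the quote to the next word
                        continue
                    while out and out[-1].isspace():
                        out.pop()       # glue the quote to the previous word
                    out.append('"')
                else:
                    if want_open and not before.isspace() and i > 0:
                        flags.append((p_num, "odd quote looks closed"))
                    elif not want_open and not after.isspace() and i + 1 < len(para):
                        flags.append((p_num, "even quote looks open"))
                    out.append(ch)
            else:
                out.append(ch)
            i += 1
        if count % 2 == 1:
            flags.append((p_num, "odd number of quotemarks"))
        fixed.append("".join(out))
    return "\n\n".join(fixed), flags
```

Running it on a paragraph like `he said " hello " to me` glues both spacey quotes to the right words and reports nothing to review.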
In any event I am not troubled by multiple passes.
ok.
Well it *is* a Gutenberg text after all.
right. that point wasn't directed at you, as you correctly realized.
Thanks.
well, the fact that you haven't wasted your time is only _part_ of the equation. the fact that you won't get much credit down the line (because _your_ text will be discarded because you threw away info that people will want) is yet another (bigger) part of the equation.
Sure. The book *is* public domain after all. Do what you like.
i think you missed the point. you can mount a version of your work that doesn't throw away the important information, and then no one will have to re-do it, in which case they will be happy to continue to give you the credit. -bowerbird

I'll concur on the spacey quotes. Twister has a tab just for those. You pick whether to visit all quotes or only anomalies, based on the open/close count restarting every paragraph. I never bother to visit all any more. It just pops from one bad quote-pair to the next, highlights the whole thing, and offers a button to realign it correctly. If it's just a spacing problem (not a quote missing at one end or the other, and not some other usage, e.g. inches or dittos), it always gets it right. This is probably the fastest of all the regex checks that require visual inspection.

On 19-Feb-2010 23:57, Bowerbird@aol.com wrote:
(because _your_ text will be discarded because you threw away info that people will want) is yet another (bigger) part of the equation.
you can mount a version of your work that doesn't throw away the important information, and then no one will have to re-do it,
I'm going to take this as a jumping-off point for a more general question about whether the pagination of a published edition is worth saving. Obviously there is a range of opinion. I'll give you mine.

What I believe, philosophically, I am shooting for is to capture the core content, and reject the details that have mainly to do with the medium of publication. So at the top level, I think the text itself and notions like block quotations, poetry layout, italics and such I keep. Stuff that is a function of the fact it was printed on little rectangles of paper -- hyphenation, page numbering, line ends -- I believe I do not have any use for.

Maybe there are possible future uses of my text that would want the things that I left out. I tend to doubt that this could ever be very important. If I take for example what scholarly editions tend to do, they focus on the text, tend to combine information from different printings and editions, and winnow out and reject the artefacts of hyphenation and pagination. They seek out and highlight even small differences in the text, but go to pains to filter out hyphenation artefacts.

In the grand scheme of things, there were undoubtedly interesting things in earlier versions of a book than what we have -- the author's manuscript, the editor's notes, even the setter's notes all would be very interesting things to have. But if I think about what value I could get from having the author's manuscript, I do not picture knowing the pagination or line endings of a longhand manuscript as being of foremost importance.

Obviously others feel like preserving page numbers is worthwhile -- I see that most PG-Canada texts have this. As an individual contributor I do not feel that my time is best spent capturing and encoding that, and so I don't. And I am happy that PG finds my efforts acceptable despite this deficiency. I haven't done any sort of real research, but a quick look shows me that not many texts attempt to preserve line endings in any way.
Preserving line endings seems quite unpopular. My question to the pagination-preservers is: what is the difference? Hyphenation, line endings and pagination are all mainly artefacts of the physical medium -- hyphenation and line endings of its width, pagination of its height. Bowerbird wants to keep both; I see no need to keep either. But what is the reasoning behind keeping one (pagination) and not the other?

============================================================
Gardner Buchanan <gbuchana@teksavvy.com>          Ottawa, ON
FreeBSD: Where you want to go.  Today.

On Sat, Feb 20, 2010 at 8:52 AM, Gardner Buchanan <gbuchana@teksavvy.com> wrote:
My question to the pagination-preservers is: what is the difference?
Pagination is crucial if you're talking about the text to someone else -- whether in a scholarly context, or just referring to a certain passage when writing a review. If you say, "Nina is called a gypsy on page 89 of the 1899 edition," someone else can find the passage and check your assertion. If you say, "Somewhere in the first third of the book, Nina is called a gypsy," people won't be able to find it. Even if you are reading on a device that does search easily, you'd still have to pull up and scour all mentions of gypsies.

Pagination isn't a perfect reference method. If you're in a class where they're reading Gaskell's North and South, say, and the teacher is referring to a modern reprint and you've got an ebook version of the first edition, with the first edition pagination, you're going to have to do some searching. You'll probably find what you want within a range of a few pages, however.

The best method is the one used for religious texts: giving chapter and verse. That reference is invariant across all versions. Perhaps we'll adopt that eventually for ALL texts. Until then, pagination is the next best thing.

-- Karen Lofstrom

On 2/20/2010 11:52 AM, Gardner Buchanan wrote:
My question to the pagination-preservers is: what is the difference? Both hyphenation, line-endings and pagination are mainly artefacts of the physical medium -- one of width and the other of height. Bowerbird wants to keep both; I see no need to keep either. But what is the reasoning behind keeping one (pagination) and not the other?
As with most things, your position depends on your perspective.

As a reader (consumer) of e-books, I want to get /all/ the production artifacts out of my way; a line should wrap wherever I would expect it to depending on the size of my viewport (screen), hyphenation should only occur between syllables at the right edge of the viewport, and a page should end at the bottom of my viewport--no sooner, no later. Page numbers, if any, should reflect the number of /virtual/ pages there are in the book I'm reading, i.e. the number of viewports required to complete the book. These page numbers should not be embedded in the text, but should be displayed somewhere else in the User Agent where I can refer to them if I want to, but otherwise they are inconspicuous. Of course, if I change fonts or the viewport size I would expect the page numbers to be updated to reflect that change.

BUT ...

As a producer of e-books, it is my self-appointed task to create an e-book whose reading experience matches, as nearly as possible, a specific instance of a historical paper book. Clearly this doesn't mean that in the final product the page- or line-endings have to match the source, as that would in many cases lead to an awkward "ouija board" reading experience, but it does mean that I want to maintain markers throughout the e-document that can 1.) /create/ a view where the page- and line-endings match, so I can do a side-by-side comparison of a page image with my electronic version, and 2.) lead me efficiently back to a particular page scan if there is any question about the correctness of the electronic edition.

This apparent conflict between the two perspectives leads to two follow-on questions: 1.) where and how broad is the line between production and consumption?, and 2.) is it possible to create a single electronic document that can satisfy both needs?
In the case of the PG/DP co-dependency, I think the line is clear and narrow: Distributed Proofreaders is /only/ a producer of electronic documents and its only consumer is Project Gutenberg. Project Gutenberg is /only/ a consumer of electronic texts, and while DP is its primary producer it is not the only one.

According to Al Haines, one of PG's whitewashers, the PG 'errata' mechanism "is informal, at best, and there's no list of old submissions that would benefit from being re-done." Errata resolution at PG is handled via e-mail messages to a very small handful of whitewashers. According to Mr. Haines, "My PG priorities are my own productions first, followed by WWing, then Errata and Reposts." In yet another post, after detailing multiple problems with an old DP contribution, he states:
Is it worth it? Personally speaking, no. It's going to take hours to fix this text, time that I'd far rather spend on my own productions, but there's currently no mechanism except for the Whitewashers, a.k.a. Errata Team, to fix this kind of thing. (Probably simpler to just re-do this text from scratch, which is something *I'm* not about to do.)
This is precisely the reason that DP puts such an emphasis on having a /completed/ text. Once an electronic document passes over from DP to PG there is almost /no/ chance that it will ever be improved, revised, or corrected. This is not to cast aspersions on the hard and dedicated work of the whitewashers; it is simply an acknowledgment of the fact that it is not a high priority for them and there is no formal mechanism to help it get done. Because Project Gutenberg is the /only/ consumer of Distributed Proofreaders' production, preservation of line- and page-breaks should be of little importance in the current DP->PG work flow.

/If/, on the other hand, what you're doing lies outside of the DP->PG work flow (as it appears yours does), then the calculation changes. For example, what happens to the page scans from whence your text is derived? If those scans are not, and will not ever be, publicly available, then encoding markers in the text that refer back to page scans and the original text layout may not be necessary or important. Likewise, if PG is your only distribution point then you will probably be the only one who will ever make changes, corrections or improvements to the text. If you expect that once you have completed a task and transferred responsibility to Project Gutenberg you are finished, perhaps even deleting the original scans and your intermediate work files (please don't do this; I'm sure that the Internet Archive would be willing to take them off your hands), then preservation of markers referring back to the original text is probably not necessary.
By contrast, if you are preparing files for broader distribution than simply via Project Gutenberg, or if you anticipate that someday a work flow may develop either inside or outside of the DP->PG chain that will support continuous improvement of your original work, then I would think that creating and preserving text markers, including original page-breaks, line-breaks, and page numbers referencing the original scan set would be advisable. This is particularly true as it is always easier to preserve data, even data of dubious value, than it is to try and recover data that has been lost or discarded.

This leads us to the second question: is it possible to create a single electronic document which can satisfy the needs of both readers and producers? I believe that it is, but it requires the use of a markup language having at least the capability of marking some text as invisible, and a user agent that is capable of recognizing that markup and /not/ rendering it as indicated. I'm sure there are a number of markup languages that could satisfy this requirement, but I have chosen to use XHTML (with one small cheat).

When ABBYY FineReader saves its OCR output in HTML format it has the option of placing a break (<br>) at the end of each line, and a horizontal rule (<hr>) between each page (an alternative is to save each scanned page as a separate file, but I find that less convenient). I then wrote a short program (could probably be done just as easily with a perl script, or even sed) that replaced each <hr> with an anchor tag indicating the page number (<a name="page##" />), and replaced each <br> with <lb />. Now <lb> is not a valid HTML element (hence the cheat), but I know of no user agent that will fail to render an HTML file just because it has an invalid element in it.

FineReader is quite good at recognizing when line-ending hyphenation is due to splitting long words or when it is required as a part of a compound word.
In the first case, when it saves using line breaks it saves the line-ending hyphenation either as a hard hyphen or a soft hyphen. When the soft hyphens are replaced by &shy; (and the following white space is removed) you have recorded a line-ending hyphen which will not be displayed (although different user agents sometimes do things differently).

Originally I was in the "collapse page- and line-break" camp, but because I never submit my texts to PG, and I have hope that someday some sort of continuous improvement process may evolve (and because maintaining the data is cheap and simple), I'm moving into the "preserve everything" camp.
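The short program described above isn't shown in the thread, but the transformation it performs can be sketched in a few lines of Python. The regexes, the page-numbering start, and the function name are assumptions of this sketch, not the original code.

```python
import re

def mark_breaks(html, first_page=1):
    """Post-process FineReader's save-with-line-breaks HTML output:
    turn each page-separating <hr> into a named anchor carrying the
    page number, and each end-of-line <br> into an invisible <lb />
    marker.  Numbering pages from first_page is an assumption;
    adjust it to match the scan set.
    """
    page = [first_page]               # mutable counter for the closure
    def hr_to_anchor(match):
        page[0] += 1                  # an <hr> starts the next page
        return '<a name="page%d" />' % page[0]
    # Replace page separators first, then line breaks.
    html = re.sub(r'<hr\s*/?>', hr_to_anchor, html, flags=re.IGNORECASE)
    html = re.sub(r'<br\s*/?>', '<lb />', html, flags=re.IGNORECASE)
    return html
```

In the same spirit, the soft hyphens FineReader emits at line-ending word splits could be rewritten to &shy; in a third substitution, so the join survives rewrapping.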

On Mon, Feb 22, 2010 at 3:45 PM, Lee Passey <lee@novomail.net> wrote:
When ABBYY FineReader saves its OCR output in HTML format it has the option of placing a break (<br>) at the end of each line, and a horizontal rule (<hr>) between each page (an alternative is to save each scanned page as a separate file, but I find that less convenient). I then wrote a short program (could probably be done just as easily with a perl script, or even sed) that replaced each <hr> with an anchor tag indicating the page number (<a name="page##" />), and replaced each <br> with <lb />. Now <lb> is not a valid HTML element (hence the cheat), but I know of no user agent that will fail to render an HTML file just because it has an invalid element in it.
Since the user agent will take care of rewrapping, you could just leave the linebreaks where they are. If you really want to have them encoded, I'd opt for some CSS:

    br.lb { display: none }

in your <style> section. Then put <br class="lb" /> wherever you're currently putting <lb />.

On 2/22/2010 4:02 PM, Scott Olson wrote:
On Mon, Feb 22, 2010 at 3:45 PM, Lee Passey <lee@novomail.net <mailto:lee@novomail.net>> wrote:
When ABBYY FineReader saves its OCR output in HTML format it has the option of placing a break (<br>) at the end of each line, and a horizontal rule (<hr>) between each page (an alternative is to save each scanned page as a separate file, but I find that less convenient). I then wrote a short program (could probably be done just as easily with a perl script, or even sed) that replaced each <hr> with an anchor tag indicating the page number (<a name="page##" />), and replaced each <br> with <lb />. Now <lb> is not a valid HTML element (hence the cheat), but I know of no user agent that will fail to render an HTML file just because it has an invalid element in it.
Since the user agent will take care of rewrapping, you could just leave the linebreaks where they are.
I considered that, but I'm not in favor of invisible markup. What do I mean by invisible? We know that the HTML spec says that multiple white space can be collapsed unless it is specifically identified as "non-breaking," and we know that spaces, tabs and newlines are all white space and sometimes very hard to distinguish from each other. This means that my HTML tools might have a tendency to wrap these lines up if I'm not extremely diligent. And because it's still white space there's a good likelihood that I may not even notice if it gets screwed up. I like markup that's in my face, and obviously not part of the text. Markup rules like "three blank lines indicate a minor header, but four blank lines indicate a major header" and "one space at the beginning of a line means don't wrap this line, but two spaces means wrap this line but do a block indent" just make me shudder. If it's markup it should be markup, and if it's not it shouldn't pretend that it is. This is kind of a specific instance of the general rule that a markup element should do one thing, and one thing only.
If you really want to have them encoded, I'd opt for some CSS. br.lb {display: none} in your <style> section Then <br class="lb" /> wherever you're currently putting <lb />.
This is an option I have tried, and it is not a bad idea. I prefer the invalid-element idea simply because many user agents I'm familiar with, particularly phones and handheld devices, simply have not yet figured out how to do CSS. At one point Mobipocket claimed CSS support, but on closer examination I discovered that their publisher tool simply went through and replaced CSS styles with elements that their UA actually recognized. Your "display:none" trick simply wouldn't have worked in the old Mobipocket reader.

Now I know that the Kindle software was based largely (if not entirely) on the old Mobipocket reader. What is the effect of trying to use CSS to turn off the display of line-breaks after the file has been converted to .mobi? Maybe someone with a Kindle could enlighten us?

Love it!!!

Michael

On Mon, 22 Feb 2010, Scott Olson wrote:
Since the user agent will take care of rewrapping, you could just leave the linebreaks where they are. If you really want to have them encoded, I'd opt for some CSS. br.lb {display: none} in your <style> section Then <br class="lb" /> wherever you're currently putting <lb />.
participants (7)
-
Bowerbird@aol.com
-
don kretz
-
Gardner Buchanan
-
Karen Lofstrom
-
Lee Passey
-
Michael S. Hart
-
Scott Olson