Re: [gutvol-d] epubeditor.sourceforge.net

here's that post to lee which i started over a week ago, and part of what started making me feel that "despair".

this post started out as a straightforward review of the difference between lee's tool and my methods. but i felt i needed to preface that with commentary on why a discussion with lee is always so frustrating, and why i eventually had to put him in my kill-folder, and how i wish i hadn't reviewed his app, and the "preface" soon overwhelmed the "review"...

so... if you want to skip the back-and-forth scratching, jump down to the two long lines of asterisks below, surrounding a section saying "take a deep breath"...

***

lee's post (fri, oct 21st, at 17:47) can be found here:
http://lists.pglaf.org/mailman/private/gutvol-d/2011-October/008200.html

dear lee...

ok, first, lee, let me be perfectly clear to you... i understand all of your points -- every one -- about your program in your latest reply to me. indeed, i understood all of those points when you made them in your _previous_ reply to me. so you didn't need to make them _again_, and you won't need to make them again in a reply to _this_. because i understand 'em. honest! every one! totally and completely, lee! really!

i had simply forgotten how tedious it can be to have a "conversation" with you, even when you're _not_ trying to spin it or sabotage me. but now i remember. so i will give you a few reminders, and other people here can see what i'm talking about.

i said:
or, ya know, you can always give 'em _your_ source-code.
lee said:
But that's exactly what I did!
yes, lee. i knew your code was open-source. i downloaded your code from sourceforge. and sourceforge is a host for open-source. anyone with minimal experience knows that. anyone who's been around a while knows that. anyone who can read the blurb that describes sourceforge, on the download page, knows that. so yes, i knew your code was open-source...

and if you had thought about it for a second, or given me 1/10 of the credit i "deserve" for paying attention, or being a programmer, or putting in time, or being a web-surfer who isn't totally asleep, you woulda _known_ that i knew that your code was open-source, and you wouldn't have made the reply you made. so i wondered why you made that reply. but i stopped wondering, after two seconds or so, because i've learned it simply doesn't matter...

what it means, though, is that you missed that my suggestion was _ironic_, a bit _sarcastic_, and thus you missed the point i was making, which is that you -- and others just like you -- make noise about the "open-source" aspect, when -- in actuality -- the overwhelming mass of open-source projects _don't_ get the treatment that you so-called "advocates" are so fond of talking about, namely that the code is worked over by a large number of people, who not only ensure that it is solid but also continually extend it to all kinds of new uses.

oh sure, that happens with _some_ programs. but the vast majority of them are maintained by one person, who does all the work, until they tire of it, and then they respond to further requests with a "you can do it yourself". but nobody ever does. you know who's gonna work on your app, lee? you. and only you, lee. you. and nobody else.

when i told d.p. i would code a _spellchecker_ for them, they told me they weren't interested, because "it won't be open-source"... so they went without a spellchecker instead, for years. and when they decided to take up the task of adapting an open-source spellchecker, it took a ton of time for them to get it to work, and it _still_, to this very day, doesn't always do what they would like it to do. and guess what? have they _ever_ gone in to rewrite that code, so it would behave like they want it to behave? no, they haven't. as it would be too difficult.

is open-source a good idea? yes sir, it sure is! is free software an even better idea? you betcha! but let's not confuse the _real_ with the _ideal_. just because somebody else _can_ work on it does _not_ mean that they _will_ do that. ever.

that's the long explanation of the point that i was making with my simple "suggestion"... i didn't want to have to type all of _that_, though, because then _i_ would've been the tedious one. but as you don't "get the joke", even having any discussion with you becomes very tedious...

***

here's another example, for you, and the others... i said:
isn't it the "trivially easy" tasks that we want our computers to be performing _for_ us?
lee said:
No, I don't think so. First you have to understand that there are tasks that are trivially easy /for a human bean/ that are extraordinarily complex for a computer. And there are tasks that are enormously complex for a "human bean" (primarily because they are so detailed) that are trivially easy for a computer.
well, gee, lee, thanks for the grand exposition there. i bet your friends think that you're really really smart. i grant that you're thorough, even as you manage to miss the point completely, and by 3.85 country miles. because we were talking about a task that is:

1. trivial for a human being.
2. trivial for a computer.
3. trivial for a human being to code a computer to do.

and i think that we can all agree that your exposition is completely overblown in regard to that type of task. yet that's what we were talking about. (go look it up, if you need to, but the task was exactly like one that you'd just talked about by saying your program did it, so as "to relieve the tedium and avoid simple errors".)

***

plowing through these diversions becomes very tiring. it's as if you're intentionally _trying_ to miss the point. (i'm not saying that you _are_ doing it "intentionally", mind you, because that _might_ be the "fundamental attribution error" raising its ugly head... but i have had to slash through the underbrush of these dodges so often that they sure do _seem_ to be "intentional". if they are not, then you would appear to have some serious problems when it comes to staying on-point.)

i said:
one of the "themes" of the event is "beautiful books".
lee said:
Hopefully, you have your tongue planted firmly in your cheek.
not only do you _not_ get my humor when i put it out, you _think_ i'm joking when i'm relating a simple fact, combined with a link which you must not have checked. these little misunderstandings accumulate into great frustration.

i will say that yes, i did find that theme to be _ironic_... so maybe you can catch irony when i direct it at others, but not when i direct it at you. however, i wasn't poking _fun_ at that theme; i was appalled they would choose it! that being the case, though, there's no need for your "hopefully". their text is ugly, and thus they have no right to even use the term "beautiful" in conjunction with their text-versions.

***

lee said:
I have no Mac, no access to a Mac, and little interest in the Mac. The promise of Java was "write once run everywhere." Well, I wrote it once, now we'll just have to hope that some Mac developer out there can troubleshoot the problem (and tell me what the solution is once s/he figures it out).
this is exactly the type of attitude i was making fun of, in my post to which you wrote this response... for the record, let us note that "the promise of java" has once again gone unfulfilled in a real-life instance. *** i'll wrap this up, focusing on lee's post on wed oct 26 07:31.
http://lists.pglaf.org/mailman/private/gutvol-d/2011-October/008216.html
/Your/ file is
http://ia700600.us.archive.org/16/items/artofbook00holm/artofbook00holm_djvu....
There it is, the text, the whole text, and nothing but the text.
i discussed why that text-file -- and all of the .djvu.txt files over at archive.org -- have problems, but my post might have _followed_ this one, in which case we couldn't blame lee for not knowing that. except that lee _should_ know all that. he has heard it before.

nonetheless, he keeps trying to distort what i mean by "a text file". he keeps trying to talk about text-files as if _all_ of them had the deficiencies of the archive.org text-files, as if _all_ of them were lacking any structural information, and as if this was _required_...

you can make a text-file "smart" if you want to, and it does _not_ require any angle-brackets at all. and anything that someone can do with angle-brackets can _also_ be achieved _some_other_way_, in a plain-text file, and it's just ridiculous to say that it cannot... there's nothing magical about angle-brackets... nothing at all...
I'm just being a little more demanding.
no, lee. you just _misinterpret_ what i am "demanding" as being much less than it really is, and then you think you have "more"... a direct and one-to-one correspondence can be made between what _you_ are asking for, and what _i_ am asking for. the job has some inherent demands, and if those demands are satisfied, then both of us can do the job... but if not, neither of us can...
What /I/ want is the output from FineReader as though the "Save as HTML" option was selected, with all the markup that FineReader was able to intuit
if i get "all the markup that finereader was able to intuit", then i can do the job just as well as you can. maybe better. the point is that archive.org isn't giving us that information; they tell us we need to trawl their pile of x.m.l. crap to get it.
Does anyone want to furnish me a *nix server with a fat pipe?
i had pointed out that, although it would be _possible_ to run a script against all 3 million books at archive.org, the machinery and bandwidth required make it impractical. lee's solution? ask someone to "furnish" all that to him... i guess it never hurts to ask, eh? good luck with that, lee.

***

anyway, there are your examples, folks... like i said, _tedious_. and then he repeats everything. this is why i put lee in my kill-file. now i remember. so let's bring this to a close.

************************************************************

take a deep breath to clear your system...
take another deep breath to clear your system...
take a third deep breath to clear your system...

************************************************************

i now direct the remainder of this post _back_ to the audience at large, not lee specifically...

***

the main point of departure between lee and me is that he _starts_ with "the text is in .html form". _then_ his tool takes over. which is fine, i guess... except for the fact that it doesn't match the reality of how us regular humans actually make e-books. it doesn't describe the task that is being done by post-processors over at distributed proofreaders. it doesn't even reflect how the e-book designers who do the job _professionally_ go about the task.

because _we_ all start with text. maybe the text is in a word-processing file, maybe it's raw ascii, but it's most assuredly not already marked up in .html. that's what _we_ have to do, to make it an _e-book_. if it was already in .html format, we'd call it "done", or mighty close to it.

you might have noticed, up above, when i said the text _might_ be "in a word-processing file", yeah? so maybe you're just thinking that we could ask the word-processing app to convert the text into .html? well, yes, we could. and some of us novices do... but the professional book-designers don't do that. and they strongly advise even us amateurs not to... because what they have found is that the .html which is applied by word-processing apps is _very_crappy_. it gives poor results in most all the e-book viewers, and it is extremely difficult to work with when you need to make changes. (and you almost always do.) so the admonition is fairly universal: don't do that!

what do the professionals advise us amateurs to do? they advise us to save the file as plain-ascii text, and then to apply the .html to that plain text, including the reapplication of styling (e.g., italics) which gets _lost_ when the file is saved in plain-ascii format... indeed, that is precisely what those professionals do. preach what they practice, practice what they preach. (if you don't trust me, ask me to provide some links. or research it yourself. it's easy to find such advice. joshua tallent, liz castro, or thebookdesigner.com...)

now, i think it's utterly ridiculous to strip away styling and then have to _reapply_ it. but that's what they do. the application of good solid .html, though, is wise, so _that_ part of the advice i can thoroughly second... even if you do it by hand, it's more economical than letting a word-processor apply crap, which you then waste more time -- in the long run -- trying to "improve".

now, the truth is that those pros have "scripts" that apply the markup automatically. plus they _know_ .html already, well, so this comes naturally to 'em, even if they have to do some of the work manually. but their advice is still good advice for us amateurs, because we get _totally_ confused by crappy .html...
without the slightest notion of how to "improve" it, or even to make those inevitably-required changes.

but whether you are a professional or an amateur, the reality of making an e-book these days is that you _start_ with text, which you mark up in .html... (actually, for .epub, it's .xhtml, but we don't need to bother making such fine-grained distinctions.) sometimes -- as with d.p. -- the text is from o.c.r. other times, it was "born digital". whatever the case, however, the reality is that we all start with _text_... and the nature of the _job_ is doing .html markup...

it _is_ true that -- once you have done the markup into .html -- there's _still_ a bit more work after that. and it's also true that this "bit more work" is often _very_ confusing and time-consuming, _especially_ to us amateurs, because the i.d.p.f. -- which is the organization that maintains the .epub standard -- has _never_ provided solid information concerning just exactly what this "bit more work" really entails. even the pros get confused, sometimes hopelessly. (i'd give a link, but i don't want to embarrass them.)

however, if you do enough grunt work, and are ready (if not willing) to power through frustrating failures that can number in the dozens, or even _hundreds_, you too can eventually discover the things that work, and you can develop templates that ease future pain. after you've done that, the "bit more work" that is required _after_ you've done your .html markup is fairly easy -- it's basically just filling in information that's included in some "auxiliary" files in an .epub. two of those files are the .opf file and the .ncx file. you might recognize those extensions, since they're the files about which i've lately been speaking to lee.

his epubeditor produces the auxiliary files for you, and helps you put the required information in them. so if you're one of the amateurs who are _struggling_ with the proper creation of these files, lee's program would be a _godsend_ to you, saving time and hassle. if you're a professional, you're not spending any time or energy producing these files anyway, because you already have scripts which make them automatically. so you might use lee's tool to do occasional reviews of your .epub files, or make minor corrections, but it probably won't be an app you consider "crucial".

more to the point, though, is that there are lots of programs out there that already create .epub files -- from text -- and which generate the auxiliary files (like .opf and .ncx) required _inside_ the .epub file. they apply the .html _and_ create the auxiliary files.

so, to sum up, there are two steps to making an .epub:

1. transform the text of your book into an .html file.
2. create the auxiliary files required inside an .epub.

total novices, with no tools or experience, will spend _much_ time on the first, and _much_ on the second... professionals, operating with their pro tools, will spend a good amount of time on the first, little on the second. and amateurs, with the decent tools out now, will spend a good amount of time on the first, little on the second.

in other words, lee's tool helps with the second step, but no one except inexperienced novices spends time on the second step to begin with. the second step is the "paperwork" that must be done to "finish the job", as the old expression puts it. lee's tool totally ignores the first step, .html markup, which is where everybody spends most of their time...
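to show just how little "paperwork" that second step involves, here's a minimal sketch (in python, purely as illustration) of the auxiliary files an .epub (2.0) wraps around your marked-up text. it assumes a chapter1.xhtml already exists from step one; the title, identifier, and file names are placeholders, not the output of any tool discussed here.

# minimal sketch of "step 2": wrapping an already-marked-up .xhtml file
# in the auxiliary files an .epub (2.0) container requires.
# assumes chapter1.xhtml exists from step 1; names and metadata are placeholders.
import zipfile

OPF = """<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="bookid" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Example Title</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="bookid">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
  </metadata>
  <manifest>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="ch1"/>
  </spine>
</package>"""

NCX = """<?xml version="1.0" encoding="utf-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head><meta name="dtb:uid" content="urn:uuid:00000000-0000-0000-0000-000000000000"/></head>
  <docTitle><text>Example Title</text></docTitle>
  <navMap>
    <navPoint id="n1" playOrder="1">
      <navLabel><text>Chapter 1</text></navLabel>
      <content src="chapter1.xhtml"/>
    </navPoint>
  </navMap>
</ncx>"""

CONTAINER = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

with zipfile.ZipFile("example.epub", "w") as z:
    # the mimetype entry must come first and must be stored uncompressed
    z.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
    z.writestr("META-INF/container.xml", CONTAINER)
    z.writestr("OEBPS/content.opf", OPF)
    z.writestr("OEBPS/toc.ncx", NCX)
    z.write("chapter1.xhtml", "OEBPS/chapter1.xhtml")

that's the whole of the "second step" for a one-chapter book -- the .opf manifest/spine and the .ncx nav-map are exactly the information lee's epubeditor helps you fill in.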
this makes me suspect that lee simply doesn't know how real people in the real world make real e-books. namely, we start with text, and we mark it up in .html. then we do whatever little dance needed to turn it into an e-book file that's viewable on our e-book machines.

now, if we only had some kind of a program that would take plain old text, and automagically turn it into .html, plus then create the auxiliary files required in an .epub, _then_ we'd have an app we could call "an epub editor". wait, isn't there such an app coming out real soon now? well, yes, son, there is. called "jaguar". real soon now.

in the meantime, while you are eagerly awaiting that, if you're on a mac, you might want to buy a program called "multimarkdown composer", by fletcher penney, which is an editor that incorporates "multimarkdown". multimarkdown, also penney's, is a variant of markdown. markdown is a light-markup system which converts text into .html output that validates as standards-compliant. thus "composer" is a great tool to help with the first step listed above -- the hard step that takes most of the time. "composer" is new in the app store, and it's just $7.99...

or, you know, you can use sigil. free/free. and it works. as far as i know, it works fine. couldn't be much better.

-bowerbird
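As a concrete illustration of that "first step" with a light-markup system: the sketch below uses the Python "markdown" package, chosen only for illustration (it is not MultiMarkdown Composer, Sigil, or any tool endorsed above), to turn a plain-text file into the .xhtml that the auxiliary files then wrap. File names are placeholders.

# sketch of "step 1" with a light-markup converter: plain text in, xhtml out.
# the python "markdown" package stands in for multimarkdown here;
# file names are placeholders.
import markdown

with open("book.txt", encoding="utf-8") as f:
    source = f.read()

body = markdown.markdown(source)          # light-markup text -> html body

page = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">\n'
    "<head><title>Example</title></head>\n"
    "<body>\n" + body + "\n</body>\n</html>\n"
)

with open("chapter1.xhtml", "w", encoding="utf-8") as f:
    f.write(page)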

and if you had thought about it for a second, or given me 1/10 of the credit i "deserve" for paying attention, or being a programmer....
I think, BB, that the rest of us will start giving you credit for being a programmer when you actually produce what you have been talking about for many years, and make a product useful for people working on PG that people are actually willing to buy -- because you are claiming that what you are planning to release is going to be good enough that people will actually buy it. If you do that, then certainly I will begin to treat you with respect. But obviously, right now, I think the outcome, if and when it happens, will differ from your predictions.

On 10/30/2011 12:35 PM, Bowerbird@aol.com wrote [massive snippage]
you know who's gonna work on your app, lee? you. and only you, lee. you. and nobody else.
Yes, that is what I have always believed. My main purpose for creating an "open source" project is not to attract "adherents," but simply for purposes of transparency (and backups in "the cloud"). If someone wants to help out, and it's someone who shares my vision, I'm happy for the help. If someone just wants to "steal" my ideas (for whatever they're worth) that's fine too. [much, much more snippage]
What /I/ want is the output from FineReader as though the "Save as HTML" option was selected, with all the markup that FineReader was able to intuit
if i get "all the markup that finereader was able to intuit", then i can do the job just as well as you can. maybe better.
Then we're both in luck. As I've been looking more carefully at the HTML output produced by the IA script, I'm discovering more and more useful information. When one uses FineReader, the post-recognition process brings up a side-by-side view of the image and the recognized text. The recognized text highlights words that FineReader is uncertain about or which do not appear in its dictionary. In the IA output, I'm discovering that that data has been preserved. I think with some effort, it would be possible to use this data to build a web interface substantially identical to the proofreading interface provided by FineReader. So Alex, all that talk a while back about how I wanted a "leaner, meaner" file? Forget about it. I think I like it just the way it is. I can select out what I need, and it has some potential. Why don't you talk IA into hiring me, so I can work on this full time? :-) [yet more snippage]
what do the professionals advise us amateurs to do?
they advise us to save the file as plain-ascii text, and then to apply the .html to that plain text, including the reapplication of styling (e.g., italics) which gets _lost_ when the file is saved in plain-ascii format...
An interesting assertion, although a bit thin on actual evidence. Apparently Liz Castro advocates using Adobe's InDesign (shudder) to generate the HTML to create ePubs, and Josh Tallent talks about using Microsoft Word's "Save as HTML" as the first step, and then cleaning up the resultant HTML (he goes on to point out that "HTML is a very simple language to learn"). I, personally, have started with "Just Bare" ASCII only one time, and gave it up before I was done because it was just too painful. Of course, I'm obviously not a professional, but I would /never/ advocate that an amateur start with Just Bare ASCII; what with macros and global search and replace, even cleaning up clumsy HTML is easier than adding it all back in by hand.
the application of good solid .html, though, is wise, so _that_ part of the advice i can thoroughly second...
[snipped assertion I happen to disagree with]
now, the truth is that those pros have "scripts" that apply the markup automatically. plus they _know_ .html already, well, so this comes naturally to 'em, even if they have to do some of the work manually
You make a good point here. 99.9% of the time when I'm creating an e-book I start with the HTML output from FineReader. But as has been pointed out elsewhere, FineReader produces SGML/HTML, not the XML/HTML required by ePub. So the very first thing I have to do is convert the FineReader output to XHTML. (It's possible to use HTMLTidy to accomplish this, but I wrote my own program, derived from the Tidy code base, which not only does the conversion but also does some other useful transformations.)

When I first started designing ePubEditor, I made the conscious decision /not/ to try and write or integrate Yet Another HTML Editor (or Yet Another CSS Editor, or Yet Another JPEG Editor, or Yet Another SVG Editor, or Yet Another NCX editor, ...). While not a participant in the great vi vs. emacs religious war, I am aware of the history; I wanted a tool that could incorporate virtually any user's preferred HTML editor and not force her to accept my preferences. Thus, ePubEditor has an editable set of preferences where a specific editor can be specified for each file's media-type.

Your comment got me thinking that perhaps in addition to media-type-specific editors I should have user-configurable, media-type-specific /transformers/ as well. So I added this configuration preference, together with a "Transform" button on the Manifest pane; these additions are included in the most recent changes I uploaded to SourceForge tonight. The configuration dialog provides for a media-type, a transformation program, a program command line, a /transformed/ media-type, and a new extension.

For example, in the case of FR output, I can set "text/html+vnd.abbyy" as the media-type, fr2html.exe as the program and "text/html" as the new media-type. Then, I add the FR output file to the Manifest and set the media-type to "text/html+vnd.abbyy". When that file is selected, and the "Transform" button is activated, my selected tool runs, which transforms the file from FR format to XHTML. The media-type is then reset to "text/html" and I can either perform more transformations, or open it in the associated "text/html" editor (in my case, Microsoft's Visual Web Developer Express).

As another example, if you had a corpus of works marked up with z.m.l., and had a script that would convert from z.m.l. markup to HTML markup, you could set that script as the transformer for files with the media-type of "text/plain+x-zml". You could add .zml files to the Manifest (setting the media-type appropriately, of course), and convert them to HTML using the transform function, resetting the media-type and renaming the file with a ".html" extension. You could then perform other transformations on those files, edit them with your choice of editor, or split them into more manageable segments using the "Split HTML" function. A simple "Save As..." and you would have an ePub file (although it would not be guaranteed to have valid content; for that, you would want to run the built-in "epubcheck" report).

Thanks for the idea. [last of the snippage]
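For illustration only, here is a hedged Python sketch of the same media-type-to-transformer idea described above. This is not Lee's Java code or ePubEditor's actual configuration; the command names merely echo the examples above and are placeholders.

# Hedged sketch of a media-type -> transformer mapping. The command names
# (fr2html, zml2html) are placeholders, not real installed tools.
import subprocess
from pathlib import Path

TRANSFORMERS = {
    # source media-type: (command template, resulting media-type, new extension)
    "text/html+vnd.abbyy": (["fr2html", "{infile}", "{outfile}"], "text/html", ".html"),
    "text/plain+x-zml":    (["zml2html", "{infile}", "{outfile}"], "text/html", ".html"),
}

def transform(path, media_type):
    """Run the configured external transformer; return (new_path, new_media_type)."""
    template, new_type, new_ext = TRANSFORMERS[media_type]
    out = str(Path(path).with_suffix(new_ext))
    cmd = [part.format(infile=path, outfile=out) for part in template]
    subprocess.run(cmd, check=True)          # hand the file to the external program
    return out, new_type

# e.g. transform("scan_output.xml", "text/html+vnd.abbyy") -> ("scan_output.html", "text/html")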
so if you're one of the amateurs who are _struggling_ with the proper creation of these files, lee's program would be a _godsend_ to you, saving time and hassle.
Thank you. That's what I am attempting to accomplish.

Apparently Liz Castro advocates using Adobe's InDesign (shudder) to generate the HTML to create ePub's and Josh Tallent talks about using Microsoft Word's "Save as HTML" as the first step, and then cleaning up the resultant HTML (he goes on to point out that "HTML is a very simple language to learn").
I suggest checking out Open Office, *if* one carefully starts with a new HTML document in Open Office HTML mode in the first place, and does *not* create an Open Office native format document and then convert that to HTML. Open Office HTML does *some* goofy things during editing of HTML documents -- but nothing compared to, say, MS Word.

You could try what is arguably the most widely used html editor/platform for non-techies - the WordPress blog engine. Besides its default mediocre sorta-wysiwyg UI, you can get adapters to input markdown, RST, textile, etc. (No z-m-l). Full support for storing and sharing your work elsewhere on the internet. Free with most hosting services. Pure, unadulterated HTML. On Wed, Nov 9, 2011 at 9:39 AM, Jim Adcock <jimad@msn.com> wrote:
Apparently Liz Castro advocates using Adobe's InDesign (shudder) to generate the HTML to create ePub's and Josh Tallent talks about using Microsoft Word's "Save as HTML" as the first step, and then cleaning up the resultant HTML (he goes on to point out that "HTML is a very simple language to learn").
I suggest checking out Open Office, *if* one carefully starts with a new HTML document in Open Office HTML mode in the first place, and does *not* create an Open Office native format document and then convert that to HTML. Open Office HTML does *some* goofy things during editing of HTML documents -- but nothing compared to, say, MS Word.

On Wed, Nov 9, 2011 at 12:28 AM, Lee Passey <lee@novomail.net> wrote:
In the IA output, I'm discovering that that data has been preserved. I think with some effort, it would be possible to use this data to build a web interface substantially identical to the proofreading interface provided by FineReader.
So Alex, all that talk a while back about how I wanted a "leaner, meaner" file? Forget about it. I think I like it just the way it is. I can select out what I need, and it has some potential.
Here's something someone at the archive is working on (after hours, since it's not an official project yet). He'd love to hear your thoughts. http://edwardbetts.com/correct (Note you can't actually submit edits yet, but he's working on getting there) Already I find that proofing like that is a LOT easier than proofing on pgdp, but maybe that's me.

"Alex" == Alex Buie <abuie@kwdservices.com> writes:
Alex> Here's something someone at the archive is working on (after
Alex> hours, since it's not an official project yet). He'd love to
Alex> hear your thoughts.
Alex> http://edwardbetts.com/correct
Alex> (Note you can't actually submit edits yet, but he's working
Alex> on getting there)
Alex> Already I find that proofing like that is a LOT easier than
Alex> proofing on pgdp, but maybe that's me.

This is very interesting. Synchronizing the image and the text is a problem that is well known to be the key to a better proofreading environment, and the solution given here, using data available in the Abbyy files, is brilliant.

Still, some of pgdp's solutions are better at spotting errors; here I see too much visual clutter, and difficulty in spotting errors in spacing and in identifying typos like "arid" for "and". But basically the problem here is having too much of a good thing; combining this with pgdp experience, I believe that we are near a real revolution in distributed proofreading.

Carlo

Alex> http://edwardbetts.com/correct

God I hope someone at DP sees this and picks up on it! Combining the best of this with the "best" of the current DP interface would be a killer improvement over the current DP situation!

On 11/11/2011 12:42 AM, Alex Buie wrote:
On Wed, Nov 9, 2011 at 12:28 AM, Lee Passey<lee@novomail.net> wrote:
In the IA output, I'm discovering that that data has been preserved. I think with some effort, it would be possible to use this data to build a web interface substantially identical to the proofreading interface provided by FineReader.
So Alex, all that talk a while back about how I wanted a "leaner, meaner" file? Forget about it. I think I like it just the way it is. I can select out what I need, and it has some potential.
Here's something someone at the archive is working on (after hours, since it's not an official project yet). He'd love to hear your thoughts.
I'm working on a similar project but I've opted for a desktop application. I think that every proofing application should do 99% of the work automatically and pass only the 1% of dubious cases to the human proofer. My app needs lots of processing power to give aids to the proofer and so a desktop app is the more natural choice. Collaborative proofing will be achieved either by passing only actual edits to a central server or by XML replication. I wonder if putting a white-on-transparent OCR overlay on top of the scanned layer lets you find errors just by looking at the black spots. -- Marcello Perathoner webmaster@gutenberg.org

"Marcello" == Marcello Perathoner <marcello@perathoner.de> writes:
Marcello> I'm working on a similar project but I've opted for a
Marcello> desktop application. I think that every proofing
Marcello> application should do 99% of the work automatically and
Marcello> pass only the 1% of dubious cases to the human
Marcello> proofer.

How are you supposed to recover when the application is absolutely sure of being right, but it is wrong instead? I see this frequently enough, especially in accents and punctuation (for example, comma/period errors driven by a capitalized word in the middle of a sentence).

Marcello> .........................

Marcello> I wonder if putting a white-on-transparent OCR overlay
Marcello> on top of the scanned layer lets you find errors just by
Marcello> looking at the black spots.

It is frequent enough to have cases in which the original print contains, for example, a broken l that looks like an i, or a fuzzy i that looks like an l. The OCR will be wrong, and the overlay will look OK. There are also a lot of cases in which a character overlays a different one (for example, a period overlaid by any of , ! ? : ; or an unaccented letter overlaid by the same letter with an accent). This would moreover require that the fonts exactly match the font in the original, both in form and size. Most of the unevenness in the display at edwardbetts.com comes from an imperfect matching of fonts and image.

I see the interest of the application as a replacement of the pgdp proofreading interface with a solid synchronization of image and text; this was one of the features of fadedpage that worked only occasionally, and works very well at edwardbetts.com. Of course, it will work with any correction workflow, and this reusability is an excellent feature.

Carlo

On 11/11/2011 04:20 PM, Carlo Traverso wrote:
"Marcello" == Marcello Perathoner<marcello@perathoner.de> writes:
Marcello> I'm working on a similar project but I've opted for a
Marcello> desktop application. I think that every proofing
Marcello> application should do 99% of the work automatically and
Marcello> pass only the 1% of dubious cases to the human
Marcello> proofer.
How are you supposed to recover when the application is absolutely sure of being right, but it is wrong instead?
You edit it back and push the [sic] button, to tell the app to leave it alone. -- Marcello Perathoner webmaster@gutenberg.org

?? But you don't know about it because it's in the 99% the application didn't pass you.

On Fri, Nov 11, 2011 at 8:36 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
On 11/11/2011 04:20 PM, Carlo Traverso wrote:
"Marcello" == Marcello Perathoner<marcello@**perathoner.de<marcello@perathoner.de>>
> writes: >
Marcello> I'm working on a similar project but I've opted for a
Marcello> desktop application. I think that every proofing
Marcello> application should do 99% of the work automatically and
Marcello> pass only the 1% of dubious cases to the human
Marcello> proofer.
How are you supposed to recover when the application is absolutely sure of being right, but it is wrong instead?
You edit it back and push the [sic] button, to tell the app to leave it alone.
-- Marcello Perathoner webmaster@gutenberg.org

On 11/11/2011 06:28 PM, don kretz wrote:
?? But you don't know about it because it's in the 99% the application didn't pass you.
The idea is to assign a confidence to every word and then have a heat map overlay over the text, and a function to skip to the next word with confidence less than X. But you can always page through the text in the old-fashioned way. Also there will be a scoring system which helps you decide which auto-correction plugins work best. (Just the same way DP decides which proofers are reliable. Only it is much easier to score an algorithm that performs the same every day.) -- Marcello Perathoner webmaster@gutenberg.org
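A hedged sketch of that "skip to the next word with confidence less than X" function, with a crude heat bucket per word (the word/confidence pairs below are invented; in practice they would come from the OCR engine):

# Sketch of confidence-driven proofing: find the next dubious word and bucket
# each confidence score for a heat-map overlay. Data below is invented.
words = [("It", 0.99), ("was", 0.97), ("tbe", 0.41), ("best", 0.95), ("of", 0.93), ("tirnes", 0.38)]

def next_dubious(words, start=0, threshold=0.75):
    """Index of the next word whose confidence falls below the threshold, else None."""
    for i in range(start, len(words)):
        if words[i][1] < threshold:
            return i
    return None

def heat(confidence):
    """Coarse heat-map bucket for a confidence score."""
    if confidence >= 0.95:
        return "ok"
    if confidence >= 0.75:
        return "warm"
    return "hot"

i = next_dubious(words)
while i is not None:
    word, conf = words[i]
    print(f"review word {i}: {word!r} (confidence {conf:.2f}, {heat(conf)})")
    i = next_dubious(words, i + 1)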

I wonder if putting a white-on-transparent OCR overlay on top of the scanned layer lets you find errors just by looking at the black spots.
You might do better by putting a colored transparent text overlay over a colored text base, such that one can spot the color change where the (additive, please, not subtractive) colors "don't mix." For example, a green base text layer with an additive red transparent text OCR overlay layer gives you a yellow letter where the colors mix -- a "hit" -- but green or red where there is a "miss". One can still read what's there in any case, but "hit vs. miss" would become pretty obvious, I would think.
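Here is a hedged sketch of that additive overlay using the Pillow imaging library; the font path, sizes, and sample strings are placeholders, and in a real proofing tool the base layer would be the page scan or a rendering matched to it.

# Sketch of the additive red/green overlay: render the base text in green and
# the OCR text in red, then add the channels. Coinciding glyphs come out
# yellow ("hit"); red-only or green-only pixels mark a "miss".
from PIL import Image, ImageDraw, ImageFont, ImageChops

font = ImageFont.truetype("DejaVuSerif.ttf", 36)   # placeholder: any installed font

def render(text, color, size=(600, 60)):
    img = Image.new("RGB", size, "black")
    ImageDraw.Draw(img).text((5, 5), text, fill=color, font=font)
    return img

base = render("The quick brown fox", (0, 255, 0))    # green: base text layer
ocr  = render("Tbe quick brown fox", (255, 0, 0))    # red: OCR overlay layer

ImageChops.add(base, ocr).save("overlay.png")        # yellow = hit, red/green = miss

As the next reply points out, matching the original font and choosing color-blind-safe colors are the hard parts; the compositing itself is the easy bit.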

On Fri, Nov 11, 2011 at 4:55 PM, James Adcock <jimad@msn.com> wrote:
I wonder if putting a white-on-transparent OCR overlay on top of the scanned layer lets you find errors just by looking at the black spots.
You might do better by putting a colored transparent text overlay over a colored text base, such that one can spot the color change where the (additive please not subtractive) colors "don't mix."
For example a green base text layer with an additive red transparent text OCR overlay layer give you a yellow letter where the colors mix -- "hit", but green or red where there is a "miss". One can still read what's there in any case, but "hit vs. miss" would become pretty obvious, I would think.
I'd say it depends more on how well you can match the exact font, kerning, etc., of the original book across multiple browsers, DOMs, and OSes. Even with precise letter positions from the OCR, not simple. Be interesting if someone can pull it off, though. Also, personally, I would not touch that particular color combination; I've got a moderate case of protanomaly[0], and that color combination is camouflage. -Bob [0] Somewhere around 5% of the population has some sort of color blindness.

On Thu, November 10, 2011 4:42 pm, Alex Buie wrote:
On Wed, Nov 9, 2011 at 12:28 AM, Lee Passey <lee@novomail.net> wrote:
In the IA output, I'm discovering that that data has been preserved. I think with some effort, it would be possible to use this data to build a web interface substantially identical to the proofreading interface provided by FineReader.
So Alex, all that talk a while back about how I wanted a "leaner, meaner" file? Forget about it. I think I like it just the way it is. I can select out what I need, and it has some potential.
Here's something someone at the archive is working on (after hours, since it's not an official project yet). He'd love to hear your thoughts.
http://edwardbetts.com/correct
(Note you can't actually submit edits yet, but he's working on getting there)
Already I find that proofing like that is a LOT easier than proofing on pgdp, but maybe that's me.
This is interesting, but not exactly what I had in mind. What Mr. Betts has developed here (I use the past tense, because I obviously don't know what he has in mind for the future, only what has already been accomplished) is another mechanism for comparing a line of text with a segment of an image that purports to represent that text. Even when the text becomes editable, you still only have a mechanism to help ensure that the ASCII encoding of an image is accurate. This is just a refinement on the Distributed Proofreaders proofing model, variations of which others have done before.

On the other hand, if you have used the Abbyy FineReader interface, you know it provides two side-by-side windows, one containing an image of a scanned page, and the other a representation of the text that was derived from the OCR. It is more than just a method of editing the text, however. For one thing, it shows paragraphs as paragraphs, usually indented. It also can highlight, using various user-definable colors, words that it could not find in its dictionary, and "uncertain" words (FineReader's "best guess" but not meeting a certain threshold of reliability). When you select a word in the editing window, the image of the word is highlighted on the scanned image, and usually there is a "zoom" window at the bottom of the screen that shows the selected text on a single "zoomed" line.

Having carefully looked through the contents of one of the "*_abbyy.xml" files made available on IA, I have come to the conclusion that the results of Ken H.'s script are /not/ derived from the data from those files. Not only does Ken's script preserve geometry information (which permits a "click on a word, see it highlighted on the scan" function), it also shows which words FineReader could not find in its dictionary, and which words FineReader was uncertain of. It also marks line-ending hyphenation, and whether FineReader considered it "soft" hyphenation, which should be removed if word wrapping no longer puts it on the end of a line, or "hard" hyphenation which should be displayed in all cases. I suspect that the file also contains other useful information that I simply haven't discovered yet.

With this information, and a little JavaScript programming, I believe it would be possible to put together a web-based interface which could mimic the behavior of the FineReader interface.

It is also interesting to note that the script output is /not/ valid XHTML, but it is virtually identical to the SGML/HTML produced by FineReader. I'm forced to conclude that the source for Ken's fromabbyy.php script is /not/ any of the files that are part of the standard download location. If the files he uses /are/ publicly available, so far we don't know where they are stored.

For the last couple of weeks the script has been non-responsive. It would be nice if we could encourage Ken to troubleshoot that script so it was working again; even better would be if we could get access to those files which form the input source for his script in the first place.

On Mon, Nov 21, 2011 at 3:25 PM, Lee Passey <lee@novomail.net> wrote:
I'm forced to conclude that the source for Ken's fromabbyy.php script is /not/ any of the files that are part of the standard download location. If the files he uses /are/ publically available, so far we don't know where they are stored.
For the last couple of weeks the script has been non-responsive. It would be nice if we could encourage Ken to troubleshoot that script so it was working again; even better would be if we could get access to those file which form the input source for his script in the first place.
The script uses only the gzipped abbyy output to generate the html (i.e., what's available here http://ia600600.us.archive.org/16/items/artofbook00holm/artofbook00holm_abby...) The issue with it not working is a timeout on the public server side. I'll bang on the ops people tomorrow to see if we can increase the upstream timeout.

Alex

--
Alex Buie
Network Coordinator / Server Engineer
KWD Services, Inc
Media and Hosting Solutions
+1(703)445-3391 +1(480)253-9640 +1(703)919-8090
abuie@kwdservices.com
ज़रा

On Mon, November 21, 2011 4:19 pm, Edward Betts wrote: [NB: all of Mr. Betts' reply has been included here, even those parts which I am not replying to, so everyone can get the benefit of his response.]
Hi Lee, I sent the message below to the mailing list; it is waiting to be moderated.
On 21/11/11 12:25, Lee Passey wrote:
This is interesting, but not exactly what I had in mind. What Mr. Betts has developed here (I use the past tense, because I obviously don't know what he has in mind for the future, only what has already been accomplished) is another mechanism for comparing a line of text with a segment of an image that purports to represent that text.
Still working on it. The code is here: https://github.com/edwardbetts/corrections
You can now login using an Open Library or Internet Archive account. Saving works, but fixes aren't yet visible on the edit screen. I need to add more editing beyond changing single words.
My emphasis is on building something that will save edits and maybe generate epubs using these fixes. I'm keen to retain word coordinates with edits, which makes things harder. With coordinates we can highlight corrected words in the original page images when using the Internet Archive book reader search and read aloud features.
Having carefully looked through the contents of one of the "*_abbyy.xml" files made available on IA, I have come to the conclusion that the results of Ken H.'s script are /not/ derived from the data from those files. Not only does Ken's script preserve geometry information (which permits a "click on a word see it highlighted on the scan" function), it also shows which words FineReader could not find in it's dictionary, and which words FineReader was uncertain of. It also marks line-ending hyphenation, and whether FineReader considered it "soft" hyphenation, which should be removed if word wrapping no longer puts it on the end of a line, or "hard" hyphenation which should be displayed in all cases. I suspect that the file also contains other useful information that I simply haven't discovered yet.
All this information is in the abbyy file, it is in the charParams tags, for example:
<charParams l="316" t="51" r="336" b="76" suspicious="true" wordStart="true" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="50" serifProbability="12" wordPenalty="36" meanStrokeWidth="107">J</charParams> <charParams l="331" t="46" r="343" b="76" suspicious="true" wordStart="false" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="53" serifProbability="100" wordPenalty="36" meanStrokeWidth="107">t</charParams> <charParams l="339" t="57" r="348" b="76" suspicious="true" wordStart="false" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="44" serifProbability="255" wordPenalty="36" meanStrokeWidth="107">i</charParams> <charParams l="343" t="53" r="384" b="76" suspicious="true" wordStart="false" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="48" serifProbability="255" wordPenalty="36" meanStrokeWidth="107">o</charParams> <charParams l="379" t="46" r="402" b="76" suspicious="true" wordStart="false" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="40" serifProbability="255" wordPenalty="36" meanStrokeWidth="107">K</charParams>
The interesting attributes are:
wordStart: tells if abbyy thinks a character is the first letter in a word
wordFromDictionary: tells you if the word is in the abbyy dictionary
wordIdentifier: tells you if abbyy thinks the word is some kind of identifier
charConfidence: gives a score for how confident abbyy is about the character
These are certainly /some/ interesting attributes, but they are by no means the /only/ interesting attributes. At least as interesting are "suspicious" and "proofed". And what the heck do the values for "l", "t", "r" and "b" mean? The complete schema definition for the Abbyy.xml file can be found at http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml, but it suffers from the serious flaw that most schema definitions do, which is that while it explains what the various components /are/, it contains little explanation of what those components /mean/. I'm pretty sure FineReader understands what all those elements mean, but is there some official way for /me/ to learn what they mean?
You can use wordStart to find the line-ending hyphenation.
Actually, you can't. If a <charParams> element contains a single hyphen, /and/ if it is the last <charParams> element of a <line> element, /then/ you can conclude that it is line ending hyphenation. But just knowing it is line-ending isn't enough. The FineReader interface (and the HTML file generated from the "fromabbyy.php" script) also indicates whether line-ending hyphenation is hard or soft. In the file I'm looking at right now, I can't see any way to distinguish between the hyphen in "hard-working" (which is hard) and the hyphen in "prop-erty" (which is soft). How do I use the Abbyy markup to make this distinction?
With this information, and a little JavaScript programming, I believe it would be possible to put together a web-based interface which could mimic the behavior of the Fine Reader interface.
I think the hard part is how to store the edits.
Storing the edits is the easiest part (depending, of course, on how the edits are created). All you need is an unambiguous way to link a character in the editor with a character in the file. The "l,t,r,b" attributes may provide just such a unique identifier. It's also possible that such a thing could be accomplished using XPATH. Setting the "proofed" attribute on a <charParams> element should be a signal that "a human being has accepted this, don't mess with it again."

When I built my proof-of-concept web-based editing system last year, I simply split the document into page-sized HTML files, and passed an entire page file to Kupu, which is a JavaScript-based HTML editor (part of one of the Apache projects, I forget which). When the user was done editing and pressed the "Save" button, the HTML document was posted back to the server where a very simple Java servlet committed the file into a CVS repository. The working directory of the repository was set as an Apache document directory, so CVS would maintain a versioning history (to protect against cyber-vandalism) but the Web Server would always return the most current version.

Saving incremental edits is a problem that has been solved over and over again. By leveraging other peoples' efforts, I'm sure that this would be the easiest part of the entire scheme.
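To illustrate the coordinate-keyed idea (and only to illustrate it -- the storage format and field names below are my own invention, not anything Kupu, CVS, or ePubEditor actually does), an edit record could be as simple as:

# Hedged sketch: store each accepted correction keyed by leaf (page) number
# and the word's (l, t, r, b) bounding box, one JSON object per line.
# Field names and the file layout are invented for illustration.
import json

def record_edit(store_path, leaf, box, old, new, proofed=True):
    """Append one correction, keyed by leaf number and (l, t, r, b) box."""
    entry = {"leaf": leaf, "box": box, "old": old, "new": new, "proofed": proofed}
    with open(store_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

# e.g. record_edit("book.edits.jsonl", 15, [316, 46, 402, 76], "tbe", "the")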
I'm forced to conclude that the source for Ken's fromabbyy.php script is /not/ any of the files that are part of the standard download location. If the files he uses /are/ publically available, so far we don't know where they are stored.
The abbyy.xml is the source for Ken's fromabbyy.php
If you say so, I've no choice but to believe you. But the output from the script has FineReader fingerprints all over it. It is so similar to the HTML output I see directly from FineReader that I have a hard time believing that FineReader itself did not generate the HTML. Why would the script take well-formed XML and convert it to non-well-formed SGML/HTML? If the input to the "fromabbyy.php" script is a *_abbyy.xml file, then the script must be passing that file to the FineReader engine to produce the HTML output. How is that done? I own a copy of FineReader, can I do it too?
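As an aside: pulling word-level data out of one of those gzipped *_abbyy.xml files takes nothing more than a streaming XML parse. The sketch below is hedged -- it is not Ken's script and not FineReader; it uses only the charParams attributes quoted above, the file name is a placeholder, and grouping characters into words by wordStart is a simplification.

# Hedged sketch: stream a gzipped IA *_abbyy.xml file and yield one record
# per word, using the charParams attributes quoted above.
import gzip
import xml.etree.ElementTree as ET

def words_from_abbyy(path):
    """Yield (text, (l, t, r, b), min_char_confidence, in_dictionary) per word."""
    chars = []      # data for the characters of the word being accumulated

    def flush():
        text = "".join(c[0] for c in chars).strip()
        if not text:
            return None
        box = (min(c[1] for c in chars), min(c[2] for c in chars),
               max(c[3] for c in chars), max(c[4] for c in chars))
        return text, box, min(c[5] for c in chars), chars[0][6]

    with gzip.open(path) as f:
        for _, elem in ET.iterparse(f):
            if elem.tag.rsplit("}", 1)[-1] != "charParams":   # ignore namespaces
                continue
            if elem.get("wordStart") == "true" and chars:     # a new word begins
                word = flush()
                if word:
                    yield word
                chars = []
            chars.append((elem.text or "",
                          int(elem.get("l")), int(elem.get("t")),
                          int(elem.get("r")), int(elem.get("b")),
                          int(elem.get("charConfidence", "100")),
                          elem.get("wordFromDictionary") == "true"))
            elem.clear()                                      # free what we've copied
    word = flush()
    if word:
        yield word

# e.g. flag anything dubious for a human proofer:
#   for text, box, conf, in_dict in words_from_abbyy("example_abbyy.xml.gz"):
#       if conf < 50 or not in_dict:
#           print(text, box, conf)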

On 2011-11-22 10:07, Lee Passey wrote:
These are certainly /some/ interesting attributes, but they are by no means the /only/ interesting attributes. At least as interesting are "suspicious" and "proofed". And what the heck do the values for "l", "t", "r" and "b" mean?
Coordinates of the character in the source image. l: left, t: top, r: right, b: bottom
You can use wordStart to find the line-ending hyphenation.
Actually, you can't. If a <charParams> element contains a single hyphen, /and/ if it is the last <charParams> element of a <line> element, /then/ you can conclude that it is line-ending hyphenation. But just knowing it is line-ending isn't enough. The FineReader interface (and the HTML file generated from the "fromabbyy.php" script) also indicates whether line-ending hyphenation is hard or soft. In the file I'm looking at right now, I can't see any way to distinguish between the hyphen in "hard-working" (which is hard) and the hyphen in "prop-erty" (which is soft). How do I use the Abbyy markup to make this distinction?
The 'w' in hard-working is tagged with wordStart="true" to show a hard hyphen. The 'e' in property is tagged with wordStart="false" because it is a soft hyphen.
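In code, that rule is only a few lines. A hedged sketch follows; the line representation here (a list of (character, wordStart) pairs per OCR line) is my own simplification of the charParams data, not Abbyy's format.

def rejoin(lines):
    """Join OCR lines into running text, resolving line-ending hyphens by the
    rule above: wordStart on the next character decides hard vs. soft."""
    text = ""
    for i, line in enumerate(lines):
        chunk = "".join(ch for ch, _ in line)
        if chunk.endswith("-") and i + 1 < len(lines) and lines[i + 1]:
            if not lines[i + 1][0][1]:      # wordStart false: soft hyphen, drop it
                text += chunk[:-1]
            else:                           # wordStart true: hard hyphen, keep it
                text += chunk
            continue
        text += chunk + " "
    return text.strip()

# [("h",True),...,("-",False)] + [("w",True),...]  -> "hard-working"  (hyphen kept)
# [("p",True),...,("-",False)] + [("e",False),...] -> "property"      (hyphen dropped)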
With this information, and a little JavaScript programming, I believe it would be possible to put together a web-based interface which could mimic the behavior of the Fine Reader interface.
I think the hard part is how to store the edits.
Storing the edits is the easiest part (depending, of course, on how the edits are created). All you need is an unambiguous way to link a character in the editor with a character in the file. The "l,t,r,b" attributes may provide just such a unique identifier. It's also possible that such a thing could be accomplished using XPATH. Setting the "proofed" attribute on a <charParams> element should be a signal that "a human being has accepted this, don't mess with it again."
When I built my proof-of-concept web-based editing system last year, I simply split the document into page-sized HTML files, and passed an entire page file to Kupu, which is a JavaScript-based HTML editor (part of one of the Apache projects, I forget which). When the user was done editing and pressed the "Save" button, the HTML document was posted back to the server where a very simple Java servlet committed the file into a CVS repository. The working directory of the repository was set as an Apache document directory, so CVS would maintain a versioning history (to protect against cyber-vandalism) but the Web Server would always return the most current version.
Saving incremental edits is a problem that has been solved over and over again. By leveraging other peoples' efforts, I'm sure that this would be the easiest part of the entire scheme.
Can you maintain the coordinates of words on the source image using this scheme?
The abbyy.xml is the source for Ken's fromabbyy.php
If you say so I've no choice but to believe you. But the output from the script has FineReader fingerprints all over it. It is so similar to the HTML output I see directly from FineReader that I have a hard time believing that FineReader itself did not generate the HTML. Why would the script take well-formed XML and convert it to non-well-formed SGML/HTML?
If the input to the "fromabbyy.php" script is a *_abbyy.xml file, then the script must be passing that file to the FineReader engine to produce the HTML output. How is that done? I own a copy of FineReader, can I do it too?
The script does not pass the file to the FineReader engine to produce the HTML output. -- Edward.

On Mon, November 21, 2011 1:56 pm, Alex Buie wrote:
The issue with it not working is a timeout on the public server side. I'll bang on the ops people tomorrow to see if we can increase the upstream timeout.
Still can't seem to get any response from the server. Any idea about what's going on? Any chance of seeing the source for fromabbyy.php?

It's a non-priority, so I don't know when, if ever, they'll get it working. fromabbyy.php is just a wrapper around a python script that does the actual work. I don't know if it's GPL licensed or not, will have to check with one of the other guys. Alex -- Alex Buie Network Coordinator / Server Engineer KWD Services, Inc Media and Hosting Solutions +1(703)445-3391 +1(480)253-9640 +1(703)919-8090 abuie@kwdservices.com ज़रा On Fri, Dec 2, 2011 at 11:39 AM, Lee Passey <lee@novomail.net> wrote:
On Mon, November 21, 2011 1:56 pm, Alex Buie wrote:
The issue with it not working is a timeout on the public server side. I'll bang on the ops people tomorrow to see if we can increase the upstream timeout.
Still can't seem to get any response from the server. Any idea about what's going on?
Any chance of seeing the source for fromabbyy.php?

On 21/11/11 12:25, Lee Passey wrote:
This is interesting, but not exactly what I had in mind. What Mr. Betts has developed here (I use the past tense, because I obviously don't know what he has in mind for the future, only what has already been accomplished) is another mechanism for comparing a line of text with a segment of an image that purports to represent that text.
Still working on it. The code is here: https://github.com/edwardbetts/corrections You can now login using an Open Library or Internet Archive account. Saving works, but fixes aren't yet visible on the edit screen. I need to add more editing beyond changing single words. My emphasis is building something that will save edits and maybe generate epubs using these fixes. I'm keen to retain word coordinates with edits, which makes things harder. With coordinates we can highlight corrected words in the original page images when using the Internet Archive book reader search and read aloud features.
Having carefully looked through the contents of one of the "*_abbyy.xml" files made available on IA, I have come to the conclusion that the results of Ken H.'s script are /not/ derived from the data from those files. Not only does Ken's script preserve geometry information (which permits a "click on a word see it highlighted on the scan" function), it also shows which words FineReader could not find in it's dictionary, and which words FineReader was uncertain of. It also marks line-ending hyphenation, and whether FineReader considered it "soft" hyphenation, which should be removed if word wrapping no longer puts it on the end of a line, or "hard" hyphenation which should be displayed in all cases. I suspect that the file also contains other useful information that I simply haven't discovered yet.
All this information is in the abbyy file, it is in the charParams tags, for example:

<charParams l="316" t="51" r="336" b="76" suspicious="true" wordStart="true" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="50" serifProbability="12" wordPenalty="36" meanStrokeWidth="107">J</charParams>
<charParams l="331" t="46" r="343" b="76" suspicious="true" wordStart="false" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="53" serifProbability="100" wordPenalty="36" meanStrokeWidth="107">t</charParams>
<charParams l="339" t="57" r="348" b="76" suspicious="true" wordStart="false" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="44" serifProbability="255" wordPenalty="36" meanStrokeWidth="107">i</charParams>
<charParams l="343" t="53" r="384" b="76" suspicious="true" wordStart="false" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="48" serifProbability="255" wordPenalty="36" meanStrokeWidth="107">o</charParams>
<charParams l="379" t="46" r="402" b="76" suspicious="true" wordStart="false" wordFromDictionary="false" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="40" serifProbability="255" wordPenalty="36" meanStrokeWidth="107">K</charParams>

The interesting attributes are:

wordStart: tells if abbyy thinks a character is the first letter in a word
wordFromDictionary: tells you if the word is in the abbyy dictionary
wordIdentifier: tells you if abbyy thinks the word is some kind of identifier
charConfidence: gives a score for how confident abbyy is about the character

You can use wordStart to find the line-ending hyphenation.
With this information, and a little JavaScript programming, I believe it would be possible to put together a web-based interface that mimics the behavior of the FineReader interface.
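On the "click on a word, see it highlighted on the scan" behaviour mentioned above, one plausible browser-side approach (again a hypothetical sketch, assuming the page scan is shown in an <img> inside a relatively positioned parent, and that pageWidth and pageHeight are the pixel dimensions abbyy recorded for the page) is to scale the word's box to the displayed image and drop a translucent overlay on it:

    // Overlay a translucent rectangle on the page image at the word's
    // position, scaling from abbyy page coordinates to the displayed size.
    function highlightWord(
      img: HTMLImageElement,
      box: { l: number; t: number; r: number; b: number },
      pageWidth: number,
      pageHeight: number,
    ): HTMLDivElement {
      const scaleX = img.clientWidth / pageWidth;
      const scaleY = img.clientHeight / pageHeight;
      const overlay = document.createElement("div");
      overlay.style.position = "absolute";
      overlay.style.left = `${img.offsetLeft + box.l * scaleX}px`;
      overlay.style.top = `${img.offsetTop + box.t * scaleY}px`;
      overlay.style.width = `${(box.r - box.l) * scaleX}px`;
      overlay.style.height = `${(box.b - box.t) * scaleY}px`;
      overlay.style.background = "rgba(255, 255, 0, 0.4)"; // translucent yellow
      overlay.style.pointerEvents = "none"; // clicks still reach the image
      img.parentElement?.appendChild(overlay);
      return overlay;
    }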
I think the hard part is how to store the edits.
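For what it's worth, one possible shape for a stored edit (an illustrative assumption, not the actual schema of the corrections project) keeps the original OCR text, the replacement, and the word's abbyy coordinates, so a corrected word can still be highlighted in the book reader later:

    // Hypothetical record for one saved correction.
    interface WordEdit {
      identifier: string;    // IA item identifier, e.g. "artofbook00holm"
      leaf: number;          // leaf/page number within the item
      box: { l: number; t: number; r: number; b: number };  // abbyy pixel coordinates
      ocrText: string;       // what FineReader produced
      correctedText: string; // what the editor typed
      editor: string;        // Open Library / IA account that made the fix
      created: string;       // ISO 8601 timestamp
    }

The cost of keeping the box with each edit is that boxes have to be merged or split whenever a correction crosses word boundaries.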
I'm forced to conclude that the source for Ken's fromabbyy.php script is /not/ any of the files available at the standard download location. If the files he uses /are/ publicly available, we don't yet know where they are stored.
The abbyy.xml is the source for Ken's fromabbyy.php -- Edward.

Some of my impressions on Edward's tool. First, it is great. This said, here are some of the shortcomings that I noticed.

Editing word by word is often insufficient; there is no way to rejoin words in which a space has been erroneously inserted (this is frequent, e.g. when apostrophes are involved, but not only then), or to remove spaces between words and punctuation (e.g. spaces before a comma, which depend on the line justification).

Sometimes, especially in books with a smaller font, the text display font is too large, and the text part is readable only with difficulty. See e.g. http://edwardbetts.com/correct/leaf/artofbook00holm/15 For example, see the lines ABCD...; in one line the J is a word, separated from the others; in another the whole alphabet is one word. This depends on the kerning of the different fonts.

The use of sans-serif proportional fonts gravely degrades the visibility of some kinds of recognition errors (I and l, uppercase i vs. lowercase L; ri vs. n, etc.), especially when the font is too large and the letters fall one above the other.

I would suggest displaying and editing line by line, with a fixed-width font. Moreover, one should show the difference between a soft and a hard hyphen (a distinction on which the OCR is often hopeless, as is a corrector working on a single line or page: is it to-day or today once the lines are rejoined?).

A problem might arise when the OCR has given up on a part of a page: one relatively often finds lines missing altogether, or, for example, a missing one-letter word "O" (uppercase oh; this happens in Italian, where "O" means "Or"). This might be easy to fix with line editing, but a missing line is harder. Since the image is sliced, and the slices do not cover the whole original page, it may even happen that a part is missed completely. This is frequent enough with page headers or the page signature. See http://edwardbetts.com/correct/leaf/ilcavalieredello00vero/8 vs 9.

Reading a line of text on a page, I tend to associate it with the image immediately below it, which of course doesn't match. When I correctly focus on the pair of matching lines, I mostly read the first of the two, and I find it hard to focus on the text (the second of the matching lines). I wonder how it would be to have the line of text first, then the matching line of image.

Carlo

On 2011-11-28 04:15, Carlo Traverso wrote:
Some of my impressions on Edward's tool. First, it is great. This said, here are some of the shortcomings that I noticed.
Editing word by word is often insufficient; there is no way to rejoin words in which a space has been erroneously inserted (this is frequent, e.g. when apostrophes are involved, but not only then), or to remove spaces between words and punctuation (e.g. spaces before a comma, which depend on the line justification).
Agreed. It needs a join word feature.
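A join-words operation is straightforward to sketch if each word keeps its abbyy box (hypothetical code, same idea as the earlier sketches): concatenate the text and take the union of the two bounding boxes, so the geometry survives the merge.

    interface JoinableWord { text: string; box: { l: number; t: number; r: number; b: number }; }

    // Merge two adjacent OCR words into one, e.g. "don'" + "t" -> "don't".
    function joinWords(a: JoinableWord, b: JoinableWord): JoinableWord {
      return {
        text: a.text + b.text,
        box: {
          l: Math.min(a.box.l, b.box.l),
          t: Math.min(a.box.t, b.box.t),
          r: Math.max(a.box.r, b.box.r),
          b: Math.max(a.box.b, b.box.b),
        },
      };
    }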
Sometimes, especially in books with a smaller font, the text display font is too large, and the text part is readable only with difficulty. See e.g. http://edwardbetts.com/correct/leaf/artofbook00holm/15 For example, see the lines ABCD...; in one line the J is a word, separated from the others; in another the whole alphabet is one word. This depends on the kerning of the different fonts.
The placing of the characters is naive, I can think of ways to improve it.
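One possible way to make the placement less naive (a sketch under the assumption that each displayed word still carries its abbyy box, and not necessarily what Edward has in mind) is to position every word absolutely at its scaled horizontal offset within the line, rather than letting the browser flow the text and guess at spacing:

    // Place a word's <span> under the image of its line, horizontally
    // aligned with the scan. `lineBox` is the abbyy box of the whole line;
    // `container` must have position: relative and the same displayed
    // width as the line image.
    function placeWord(
      container: HTMLElement,
      word: { text: string; box: { l: number; r: number } },
      lineBox: { l: number; r: number },
    ): void {
      const scale = container.clientWidth / (lineBox.r - lineBox.l);
      const span = document.createElement("span");
      span.textContent = word.text;
      span.style.position = "absolute";
      span.style.left = `${(word.box.l - lineBox.l) * scale}px`;
      container.appendChild(span);
    }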
The use of sans-serif proportional fonts gravely degrades the visibility of some kinds of recognition errors (I and l, uppercase i vs. lowercase L; ri vs. n, etc.), especially when the font is too large and the letters fall one above the other.
Good point, I can switch to serif.
I would suggest displaying and editing line by line, with a fixed-width font. Moreover, one should show the difference between a soft and a hard hyphen (a distinction on which the OCR is often hopeless, as is a corrector working on a single line or page: is it to-day or today once the lines are rejoined?).
I'm not sure about your argument for a fixed-width font. You're right about hyphens.
A problem might arise when the OCR has given up on a part of a page: one relatively often finds lines missing altogether, or, for example, a missing one-letter word "O" (uppercase oh; this happens in Italian, where "O" means "Or"). This might be easy to fix with line editing, but a missing line is harder. Since the image is sliced, and the slices do not cover the whole original page, it may even happen that a part is missed completely. This is frequent enough with page headers or the page signature. See http://edwardbetts.com/correct/leaf/ilcavalieredello00vero/8 vs 9.
Agreed. This is a problem with my model.
Reading a line of text on a page, I tend to associate it with the image immediately below it, which of course doesn't match. When I correctly focus on the pair of matching lines, I mostly read the first of the two, and I find it hard to focus on the text (the second of the matching lines). I wonder how it would be to have the line of text first, then the matching line of image.
I could switch the order of text and images. Thanks for the feedback. -- Edward.

"Edward" == Edward Betts <edward@archive.org> writes:
>> The use of sans-serif proportional fonts gravely degrades the visibility of some kinds of recognition errors (I and l, uppercase i vs. lowercase L; ri vs. n, etc.), especially when the font is too large and the letters fall one above the other.

Edward> Good point, I can switch to serif.

>> I would suggest displaying and editing line by line, with a fixed-width font. Moreover, one should show the difference between a soft and a hard hyphen (a distinction on which the OCR is often hopeless, as is a corrector working on a single line or page: is it to-day or today once the lines are rejoined?).

Edward> I'm not sure about your argument for a fixed-width font. You're right about hyphens.

The point about proportional vs. fixed-width fonts is the following: some typical misrecognitions happen because certain letter combinations look like others in a typical proportional font (e.g. n might be recognized as ri, m as rn or even rri). Using a font that reproduces the typographical appearance of the original, it is easy to read "arid" as "and" if the context calls for "and", but it is much more difficult with a fixed-width font.

It might be useful to have two displays: one with a font matching the original, easier to read and to find omissions, and another with a fixed-width font making the misrecognitions more visible. Of course, switchable with javascript.

Carlo

On 2011-12-30 22:00, Carlo Traverso wrote:
"Edward" == Edward Betts<edward@archive.org> writes:
>> The use of sans-serif proportional fonts gravely degrades the visibility of some kinds of recognition errors (I and l, uppercase i vs. lowercase L; ri vs. n, etc.), especially when the font is too large and the letters fall one above the other.
Edward> Good point, I can switch to serif.
>> I would suggest displaying and editing line by line, with a fixed-width font. Moreover, one should show the difference between a soft and a hard hyphen (a distinction on which the OCR is often hopeless, as is a corrector working on a single line or page: is it to-day or today once the lines are rejoined?).
Edward> I'm not sure about your argument for a fixed-width font. You're right about hyphens.
The point about proportional vs. fixed-width fonts is the following: some typical misrecognitions happen because certain letter combinations look like others in a typical proportional font (e.g. n might be recognized as ri, m as rn or even rri). Using a font that reproduces the typographical appearance of the original, it is easy to read "arid" as "and" if the context calls for "and", but it is much more difficult with a fixed-width font.
It might be useful to have two displays: one with a font matching the original, easier to read and to find omissions, and another with a fixed-width font making the misrecognitions more visible. Of course, switchable with javascript.
Good point. I should add javascript to switch to a monospace font. -- Edward.
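The switch Carlo asks for is only a few lines of client-side code. A sketch, assuming the editable lines carry a class name such as "ocr-line" (an invented name, not one from the actual tool):

    // Toggle the correction text between a serif font (closer to the
    // original typography, easier to read) and a monospace font (makes
    // I/l and ri/n confusions stand out).
    function toggleMonospace(useMonospace: boolean): void {
      const font = useMonospace ? '"Courier New", monospace' : "Georgia, serif";
      document.querySelectorAll<HTMLElement>(".ocr-line").forEach(el => {
        el.style.fontFamily = font;
      });
    }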
participants (10)
- Alex Buie
- Bowerbird@aol.com
- don kretz
- Edward Betts
- James Adcock
- Jim Adcock
- Lee Passey
- Marcello Perathoner
- Robert Cicconetti
- traverso@posso.dm.unipi.it