[gutvol-d] Re: Co-operative proofreading

17 Mar 2010

      First of all, let me say that I am gratified by, and appreciative of, 
those who have visited the site and offered feedback. Be aware that what 
you are seeing is more than a mock-up and less than a prototype; it is, 
in fact, my workbench. As my development process proceeds, I deploy 
software and files to that site for testing and evaluation. There is no 
guarantee that the behavior or appearance today will be the behavior or 
appearance tomorrow. What I intended was to provide a window into my 
development process. When I invited people to watch me flail about, that 
was exactly what I meant.

On 3/16/2010 2:57 PM, Juliet Sutherland wrote:
...
> I tried it with Chrome and couldn't get the text box to let me edit.
> It worked fine in Firefox, except that I didn't find a way to save the
> work, aside from asking for another page and then saying yes when it
> asked if I wanted to save.

The save button is the little "floppy disk" icon in the formatting 
toolbar, next to the "block type" drop down box.
...
> Also, it insisted on indenting the line at
> the top of the page, even when it wasn't the beginning of the paragraph.

This is an artifact of the OCR process. To the best of my knowledge, no 
OCR program is capable of starting a page and recognizing that the text 
is, in fact, a continuation of the text on a previous page.

As bowerbird has suggested, I have named my working files sequentially 
and in synchronization with the image files. My intent is to enhance my 
post-processing program a bit so that it will look at the first 
paragraph of one page together with the last paragraph on the preceding 
page. If the first does /not/ begin with a majuscule and the last does 
/not/ end with line-terminating punctuation, I would mark the first 
paragraph with class="continuation". The editor's CSS would not indent 
paragraphs of that class, and the merge program (which would create a 
single file of all the component files) would merge paragraphs when the 
class was encountered.
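
The check and merge described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual post-processing program: the function names, the TERMINATORS set, and the choice to work on plain paragraph strings rather than HTML are all assumptions, and the two steps (marking, then merging) are collapsed into one pass for brevity.

```python
# Hypothetical set of characters counted as "line terminating punctuation".
TERMINATORS = ".!?:\"'"


def is_continuation(prev_last_paragraph, next_first_paragraph):
    """Guess whether the first paragraph of a page continues the last
    paragraph of the preceding page."""
    prev_text = prev_last_paragraph.rstrip()
    next_text = next_first_paragraph.lstrip()
    if not prev_text or not next_text:
        return False
    starts_with_majuscule = next_text[0].isupper()
    ends_with_terminator = prev_text[-1] in TERMINATORS
    return not starts_with_majuscule and not ends_with_terminator


def merge_pages(pages):
    """Merge per-page paragraph lists into one list, joining a page's
    first paragraph onto the previous paragraph when it looks like a
    continuation."""
    merged = []
    for page in pages:
        for index, paragraph in enumerate(page):
            if index == 0 and merged and is_continuation(merged[-1], paragraph):
                merged[-1] = merged[-1].rstrip() + " " + paragraph.lstrip()
            else:
                merged.append(paragraph)
    return merged
```

The sketch does not rejoin words hyphenated across a page break, and it makes no attempt to correct the OCR errors that would mislead it.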

Of course, this algorithm could create false positives where the OCR 
drops punctuation, or doesn't recognize capitalization, and create false 
negatives where sentences, but not paragraphs, begin on a new page. 
There will need to be a yet-to-be-determined method for the user 
interface to allow a proofreader to make this distinction.
...
> I, too, find a horizontal interface works better for me.
> JulietS

So before continuing, let me explain a little of my strategy and tactics.

I am a firm believer in markup. Like Mr. Frank, I believe that the 
markup should be carried through with the text at every stage of the process.

I am a firm believer in internet standards, even unofficial, de-facto 
internet standards. No re-inventing any wheels for me.

Lastly, I am an extraordinarily lazy programmer. I'm not going to write 
any new code unless I absolutely have to.

I will not, however, use any code infected by the GNU General Public 
License. Standalone programs are fine, but I won't touch GPL code with Mr. 
Haines' ten-foot pole.

So...

There is nothing any nearer to a standard for e-books than HTML. I 
decided that the original OCR should produce HTML output and that the 
markup should stick with the text until the final single file is 
created. Because of this decision, the final single file could be 
created over and over as small tweaks to the component files were made; 
there would be no need for any concept of finality or "doneness."

I discovered that both the Plone and the Apache Lenya content management 
systems used a JavaScript-based visual HTML editor called Kupu. Kupu is 
now part of the Apache Lenya project and the source is available from 
apache.org under the Apache license.

The editor you see at my website is Kupu, unmodified except for 
modifications to the CSS file that governs how it is displayed. I am 
assuming that at some point I will have to make slight modifications to 
the Kupu code, but that will be among the last things I do. I need to 
get the underlying workflow nailed down first. If there is anyone who 
wants to help out by tackling the Kupu interface (cough, Carel, cough) I 
would welcome the help.

Your comments about the behavior of my site with Chrome make me wonder 
how well Chrome is supported by the Apache Lenya project; maybe I should 
ping them to try it out.

I needed a repository to track all the individual files for each 
project, and the changes thereto. Well, there are tons of software and 
applications that support CVS, so CVS it is. The current plan is to 
have /three/ mostly-identical CVS repositories for each project. As 
registered users select a project, each will be assigned to whichever 
repository has been least used. While the editor contents can be 
saved as many times as a user wants, when the user leaves a page (or 
after an appropriate timeout) the file will be committed to its repository.
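
As a rough illustration of the assignment rule, here is a minimal Python sketch. It assumes "least-used" means the fewest users assigned so far; the repository names and the assign_repository function are hypothetical, and in practice the counts would live in the data store rather than in a dictionary.

```python
def assign_repository(usage_counts):
    """Pick the repository with the fewest assigned users and record the
    new assignment. Ties are broken alphabetically so the result is
    deterministic."""
    chosen = min(sorted(usage_counts), key=usage_counts.get)
    usage_counts[chosen] += 1
    return chosen


# Hypothetical usage counts for the three repositories of one project.
counts = {"repo_a": 2, "repo_b": 0, "repo_c": 1}
first = assign_repository(counts)   # repo_b has the fewest users
second = assign_repository(counts)  # repo_b and repo_c now tie at 1
third = assign_repository(counts)   # repo_c is now alone at 1
```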

Upon commitment a file will be "diffed" against the other two 
repositories. When conflicts are found, a voting algorithm will resolve 
the conflict, if possible, and the changes will be committed to /all 
three/ repositories. The algorithm will not be a pure "two out of 
three," but will be weighted based on the number of users who have viewed 
a page. Hopefully, this kind of algorithm can minimize the problem of 
e-graffiti. If a "vote" is too close to call, both options will be 
placed in all the committed files in a manner similar to that proposed 
by Mr. Adcock. My biggest problem here is finding a "diff" application 
that can work as I need it to.
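
One possible shape for the weighted vote, sketched in Python under stated assumptions: each repository contributes its reading of a disputed passage together with the number of users who have viewed the page there, and a reading wins only if its lead exceeds a margin. The resolve_conflict function, the 0.2 margin, and the return format are all hypothetical. How both options would be written into the committed files is left out, since that format is not settled here.

```python
from collections import defaultdict


def resolve_conflict(readings, margin=0.2):
    """readings: list of (text, viewers) pairs, one per repository.

    Returns (winner, candidates). winner is the chosen text, or None when
    the vote is too close to call; candidates lists the distinct readings
    from heaviest to lightest."""
    totals = defaultdict(int)
    for text, viewers in readings:
        totals[text] += viewers
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    candidates = [text for text, _ in ranked]
    if len(ranked) == 1:
        return candidates[0], candidates
    total_weight = sum(totals.values())
    if total_weight == 0:
        return None, candidates
    # Lead of the top reading over the runner-up, as a share of all weight.
    lead = (ranked[0][1] - ranked[1][1]) / total_weight
    if lead >= margin:
        return candidates[0], candidates
    return None, candidates
```

With readings of ("colour", 1), ("color", 5) and ("colour", 3), a pure two-out-of-three vote would pick "colour", while this weighted version sees 4 against 5 and reports the vote as too close to call.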

Hmm, if I'm going to make users register and login, and if I'm going to 
track things like which repositories they have been assigned to, I'm 
going to need some kind of data store. My site uses the Apache web 
server, and has MySQL installed. Apache's authdb module can use MySQL as 
the authentication database. I guess all the data I need to track will 
be stored in MySQL (and I haven't even /started/ to think about how to 
define the tables I need).

Now all I need is some glue to hold all the pieces together. I'm an 
accomplished Java programmer, familiar with JDBC and servlets. My site's 
server has Apache Tomcat installed and available. I guess that 
decision's a no-brainer.

So, there's my strategy and some of the tactics to the extent I have 
worked them out. Now a few specific responses.
...
On 3/16/2010 3:39 PM, Jon Ingram wrote:
...
> Interesting, and it's good to see someone using a rich text editor for
> the text, rather than expecting proofers to mess around with <i>, etc.

As pointed out above, the editing window technically is not a rich text 
editor (which produces output in RTF format). It is the Kupu HTML 
editor, which I am still not very familiar with. But I agree that 
proofreaders need a tool where they can make the proofed text look like 
the scanned image. One of the things I like about Kupu is the little 
"scroll" button, which brings up a plain text editor where you /can/ 
edit the HTML source directly if you desire. I also need to add a method 
to add internal anchors, and a method to build tables of contents.
...
...
> I'm not sure how the page was supposed to look, however --
> - I'm using a widescreen 1680x1050 monitor, and there was still
> material off the bottom of the page. This is using Google Chrome.

It appears that either Chrome hasn't figured out how to use the CSS 
"percentage" value, or its understanding simply differs from that of 
Mozilla (from your description, I would guess the latter). 
I could go on at length about how /I/ think it should be implemented, 
but I won't.
...
...
> - I couldn't see any way to resize the image, so as to see the page
> width rather than the zoomed in image, which gives me about 4 words
> before I have to scroll to the right

Resizing images is a problem. Right now, images are the size that 
FineReader exported them. Firefox autosizes the images into the 
constraining box, and provides a "zoom" function. Apparently Chrome does 
not have any sort of similar function (IE definitely does not), and 
Opera works even worse than IE. Maybe I can come up with an automated 
tool to resize the images into a set of standard resolutions (e.g. 25%, 
50%, 75%, 100%). Then each user could individually set the preferences 
for the image size that works best for her or him. If the editor and 
image boxes are going to be fixed sizes perhaps I could add those 
parameters to a set of preferences as well.
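
The planning half of such a resizing tool might look like the following Python sketch. It only computes target dimensions and output names for the standard scales mentioned above; the scaled_variants function and the "_25"-style file naming are hypothetical, and the actual resampling would need an imaging library, which is not shown.

```python
# Standard scale factors, taken from the percentages suggested above.
STANDARD_SCALES = (25, 50, 75, 100)


def scaled_variants(filename, width, height, scales=STANDARD_SCALES):
    """Return (output_name, width, height) for each standard scale of an
    image with the given pixel dimensions."""
    stem, dot, extension = filename.rpartition(".")
    if not dot:
        # No extension present: treat the whole name as the stem.
        stem, extension = filename, ""
    suffix = f".{extension}" if extension else ""
    variants = []
    for percent in scales:
        new_width = max(1, round(width * percent / 100))
        new_height = max(1, round(height * percent / 100))
        variants.append((f"{stem}_{percent}{suffix}", new_width, new_height))
    return variants
```

A user's stored preference would then only need to record which of the scales to serve.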
...
...
> - I couldn't see any way to change the font in the text window,
> preferably to dpcustommono, which is ugly, but is the best font I've
> yet used for proofing.

Well, you wouldn't want to set the font face or size for the file being 
saved, as that is a highly subjective matter. Unfortunately I have yet 
to see a browser that allows a user to override a page's stylesheet 
decision (although Opera is getting close). I've no experience yet with 
Chrome; does it do so? What I envision is allowing a user to select 
among a set of standard CSS style sheets as a sticky preference, or to 
actually upload his or her own for personal use.
...
...
> - It would be nice to have some instructions. What exactly are you
> expecting me to do to the page? Do you want headers/footers/page
> numbers to be kept? Do you want end of line hyphens kept? Do you want
> paragraphs joined?

Guilty as charged. I'm thinking of adding a "Proofing Guidelines" button 
to each page, which would pop up a separate window with those 
instructions. Of course, at this stage of development I have virtually 
no idea as to what those instructions would be, but it might be a good 
idea to add it now anyway, even if the instructions are as simple as "I 
know I have to add this in the future."
...
...
> - It would be nice to have (the option of) a horizontal rather than
> vertical layout. I used to really like the vertical layout, but found
> I was more accurate at proofing with a horizontal one.

This could be handled by (yet another) user preferred stylesheet.
...
...
> - I really prefer block paragraphs rather than indented ones for
> computer-based text.

A user preferred stylesheet could handle this issue as well, although if 
you were to do it you would need to figure out how to insert a visual 
signal when a paragraph is a "continuation" paragraph as opposed to a 
"real" paragraph.
...
...
> A very good implementation so far -- I'll await developments.

Thank you.

Just as a reminder, however, I suspect the user interface portion will 
not receive much attention until the latter stages of development; for 
now, I only need it to work well enough for me to test other parts of 
the workflow.

Lee Passey