
First of all, let me say that I am gratified by, and appreciative of, those who have visited the site and offered feedback. Be aware that what you are seeing is more than a mock-up and less than a prototype; it is, in fact, my workbench. As my development process proceeds, I deploy software and files to that site for testing and evaluation. There is no guarantee that the behavior or appearance today will be the behavior or appearance tomorrow. What I intended was to provide a window into my development process. When I invited people to watch me flail about, that was exactly what I meant.
On 3/16/2010 2:57 PM, Juliet Sutherland wrote:
I tried it with Chrome and couldn't get the text box to let me edit. It worked fine in Firefox, except that I didn't find a way to save the work, aside from asking for another page and then saying yes when it asked if I wanted to save.
The save button is the little "floppy disk" icon in the formatting toolbar, next to the "block type" drop down box.
Also, it insisted on indenting the line at the top of the page, even when it wasn't the beginning of the paragraph.
This is an artifact of the OCR process. To the best of my knowledge, no OCR program is capable of starting a page and recognizing that the text is, in fact, a continuation of the text on a previous page. As bowerbird has suggested, I have named my working files sequentially and in synchronization with the image files. My intent is to enhance my post-processing program a bit so that it will look at the first paragraph of one page together with the last paragraph on the preceding page. If the first paragraph does /not/ begin with a majuscule and the last paragraph of the preceding page does /not/ end with line-terminating punctuation, I would mark the first paragraph with class="continuation". The editor's CSS would not indent paragraphs of that class, and the merge program (which would create a single file of all the component files) would merge paragraphs when the class was encountered. Of course, this algorithm could create false positives where the OCR drops punctuation or doesn't recognize capitalization, and false negatives where a new sentence, but not a new paragraph, begins on a new page. There will need to be a yet-to-be-determined method in the user interface to allow a proofreader to make this distinction.
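A minimal sketch of the continuation check described above, written in Python purely for brevity. The function names, the set of characters treated as line-terminating punctuation, and the assumption that each paragraph arrives as plain text (and that a paragraph tag is a bare `<p>`) are all illustrative guesses, not part of the actual post-processing program.

```python
import re

# Assumption: a paragraph "ends" if its last visible character is one of . ! ? :
# optionally followed by closing quotes or brackets.
TERMINAL = re.compile(r"""[.!?:]["')\]\u201d\u2019]*\s*$""")


def is_continuation(prev_last_para: str, page_first_para: str) -> bool:
    """Guess whether the first paragraph of a page continues the last
    paragraph of the preceding page.

    True only when the new paragraph does NOT start with a capital letter
    AND the preceding paragraph does NOT end with terminal punctuation.
    """
    preceding = prev_last_para.strip()
    following = page_first_para.strip()
    if not preceding or not following:
        return False
    # First alphabetic character, skipping quotes, digits and the like.
    first_letter = next((ch for ch in following if ch.isalpha()), None)
    if first_letter is None:
        return False
    starts_with_capital = first_letter.isupper()
    ends_with_terminal = bool(TERMINAL.search(preceding))
    return not starts_with_capital and not ends_with_terminal


def mark_continuation(paragraph_html: str) -> str:
    """Add class="continuation" to a paragraph that opens with a bare <p>."""
    return paragraph_html.replace("<p>", '<p class="continuation">', 1)


if __name__ == "__main__":
    last = "the quick brown fox jumped over the"
    first = "lazy dog and ran away."
    if is_continuation(last, first):
        print(mark_continuation("<p>" + first + "</p>"))
```

As noted above, a check like this will misfire when the OCR drops punctuation or capitalization, so a proofreader would still need a way to override the result.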
I, too, find a horizontal interface works better for me.
JulietS
So before continuing, let me explain a little of my strategy and tactics. I am a firm believer in markup. Like Mr. Frank, I believe that the markup should be carried through with the text at every stage of the process. I am a firm believer in internet standards, even unofficial, de facto internet standards. No re-inventing any wheels for me. Lastly, I am an extraordinarily lazy programmer. I'm not going to write any new code unless I absolutely have to. I will not, however, use any code infected by the GNU General Public License. Standalone programs are fine, but I won't touch GPL code with Mr. Haines' ten-foot pole.
So... There is nothing any nearer to a standard for e-books than HTML. I decided that the original OCR should produce HTML output and that the markup should stick with the text until the final single file is created. Because of this decision, the final single file could be created over and over as small tweaks to the component files were made; there would be no need for any concept of finality or "doneness."
I discovered that both the Plone and the Apache Lenya content management systems used a JavaScript-based visual HTML editor called Kupu. Kupu is now part of the Apache Lenya project and the source is available from apache.org under the Apache license. The editor you see at my website is Kupu, unmodified except for modifications to the CSS file that governs how it is displayed. I am assuming that at some point I will have to make slight modifications to the Kupu code, but that will be among the last things I do. I need to get the underlying workflow nailed down first. If there is anyone who wants to help out by tackling the Kupu interface (cough, Carel, cough) I would welcome the help. Your comments about the behavior of my site with Chrome make me wonder how well Chrome is supported by the Apache Lenya project; maybe I should ping them to try it out.
I needed a repository to track all the individual files for each project, and the changes thereto.
Well, there are tons of software and applications that support CVS, so CVS it is. The current plan is to have /three/ mostly-identical CVS repositories for each project. As registered users select a project, each will be assigned to whichever repository has been least used. While the editor contents can be saved as many times as a user wants, when the user leaves a page (or after an appropriate timeout) the file will be committed to its repository. Upon commitment a file will be "diffed" against the other two repositories. When conflicts are found, a voting algorithm will resolve the conflict, if possible, and the changes will be committed to /all three/ repositories. The algorithm will not be a pure "two out of three," but will be weighted based on the number of users who have viewed a page. Hopefully, this kind of algorithm can minimize the problem of e-graffiti. If a "vote" is too close to call, both options will be placed in all the committed files in a manner similar to that proposed by Mr. Adcock. My biggest problem here is finding a "diff" application that can work as I need it to.
Hmm, if I'm going to make users register and log in, and if I'm going to track things like which repositories they have been assigned to, I'm going to need some kind of data store. My site uses the Apache web server, and has MySQL installed. Apache's authdb module can use MySQL as the authentication database. I guess all the data I need to track will be stored in MySQL (and I haven't even /started/ to think about how to define the tables I need).
Now all I need is some glue to hold all the pieces together. I'm an accomplished Java programmer, familiar with JDBC and servlets. My site's server has Apache Tomcat installed and available. I guess that decision's a no-brainer.
So, there's my strategy and some of the tactics to the extent I have worked them out. Now a few specific responses.
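To make the weighted vote described above a little more concrete, here is a minimal sketch in Python (chosen only for brevity). The actual weighting has not been worked out, so everything here is an assumption: each repository's version of a conflicting passage is weighted by the number of users who have viewed the page in that repository, identical versions pool their weight, and the hypothetical `margin` parameter decides when a vote is too close to call.

```python
from collections import defaultdict
from typing import Optional, Sequence, Tuple


def weighted_vote(
    candidates: Sequence[Tuple[str, int]], margin: float = 0.2
) -> Optional[str]:
    """Pick a winner among competing versions of a passage.

    candidates: one (text, viewer_count) pair per repository.
    margin: assumed minimum lead, as a fraction of the total weight,
            that the winner needs over the runner-up.
    Returns the winning text, or None when the vote is too close to call
    (the caller would then keep both options in the committed files).
    """
    totals = defaultdict(int)
    for text, viewers in candidates:
        totals[text] += viewers  # identical versions pool their weight

    if len(totals) == 1:
        # Every repository agrees; nothing to resolve.
        return next(iter(totals))

    total_weight = sum(totals.values())
    if total_weight == 0:
        return None  # no viewing data to weigh the versions with

    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    lead = (ranked[0][1] - ranked[1][1]) / total_weight
    return ranked[0][0] if lead >= margin else None


if __name__ == "__main__":
    # Two repositories agree and carry most of the weight.
    print(weighted_vote([("the cat", 5), ("the cat", 3), ("the cot", 1)]))
    # Evenly weighted disagreement: too close to call.
    print(weighted_vote([("the cat", 3), ("the cot", 3), ("the cut", 1)]))
```

Returning `None` rather than forcing a winner leaves room for the "keep both options" behavior; how those options would be written into the files is deliberately left out of the sketch.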
On 3/16/2010 3:39 PM, Jon Ingram wrote:
Interesting, and it's good to see someone using a rich text editor for the text, rather than expecting proofers to mess around with <i>, etc.
As pointed out above, the editing window technically is not a rich text editor (which produces output in RTF format). It is the Kupu HTML editor, which I am still not very familiar with. But I agree that proofreaders need a tool where they can make the proofed text look like the scanned image. One of the things I like about Kupu is the little "scroll" button, which brings up a plain text editor where you /can/ edit the HTML source directly if you desire. I also need to add a method to add internal anchors, and a method to build tables of contents.
I'm not sure how the page was supposed to look, however --
- I'm using a widescreen 1680x1050 monitor, and there was still material off the bottom of the page. This is using Google Chrome.
It appears that either Chrome hasn't figured out how to use the CSS "percentage" value either, or perhaps its understanding simply differs from that of Mozilla (from your description, I would guess the latter). I could go on at length about how /I/ think it should be implemented, but I won't.
- I couldn't see any way to resize the image, so as to see the page width rather than the zoomed in image, which gives me about 4 words before I have to scroll to the right
Resizing images is a problem. Right now, images are the size that FineReader exported them. Firefox autosizes the images into the constraining box, and provides a "zoom" function. Apparently Chrome does not have any sort of similar function (IE definitely does not), and Opera works even worse than IE. Maybe I can come up with an automated tool to resize the images into a set of standard resolutions (e.g. 25%, 50%, 75%, 100%). Then each user could individually set the preferences for the image size that works best for her or him. If the editor and image boxes are going to be fixed sizes perhaps I could add those parameters to a set of preferences as well.
- I couldn't see any way to change the font in the text window, preferably to dpcustommono, which is ugly, but is the best font I've yet used for proofing.
Well, you wouldn't want to set the font face or size for the file being saved, as that is a highly subjective matter. Unfortunately I have yet to see a browser that allows a user to override a page's stylesheet decision (although Opera is getting close). I've no experience yet with Chrome; does it do so? What I envision is allowing a user to select among a set of standard CSS style sheets as a sticky preference, or to actually upload his or her own for personal use.
- It would be nice to have some instructions. What exactly are you expecting me to do to the page? Do you want headers/footers/page numbers to be kept? Do you want end of line hyphens kept? Do you want paragraphs joined?
Guilty as charged. I'm thinking of adding a "Proofing Guidelines" button to each page, which would pop up a separate window with those instructions. Of course, at this stage of development I have virtually no idea as to what those instructions would be, but it might be a good idea to add it now anyway, even if the instructions are as simple as "I know I have to add this in the future."
- It would be nice to have (the option of) a horizontal rather than vertical layout. I used to really like the vertical layout, but found I was more accurate at proofing with a horizontal one.
This could be handled by (yet another) user-preferred stylesheet.
- I really prefer block paragraphs rather than indented ones for computer-based text.
A user-preferred stylesheet could handle this issue as well, although if you were to do it you would need to figure out how to insert a visual signal when a paragraph is a "continuation" paragraph as opposed to a "real" paragraph.
A very good implementation so far -- I'll await developments.
Thank you. Just as a reminder, however, I suspect the user interface portion will not receive much attention until the latter stages of development; for now, I only need it to work well enough for me to test other parts of the workflow.