February 2010 - gutvol-d - lists.pglaf.org

Re: let us not be confused
by cmiske＠ashzfall.com 25 Feb '10

>Jim said:
>Simply my personal experience finishing a book taking this exact approach.
>When you take out the easy stuff, you are left with the stuff that typically
>the P2s and P3s find -- hopefully! When you do the machine marking of errors
>you double the amount of errors that need to be fixed -- since each input
>file contributes its errors. But now many of the "P1" type errors are
>marked, which could be a win if the P1 people were presented with a
>highlighted version of those errors, similar to what is currently presented
>in WordCheck. To me a surprising amount of work goes into fixing
>hyphenation/dehyphenation linebreak errors, where it should be possible to
>make a more intelligent tool to fix most of these problems -- or does
>someone claim to already have such an intelligent tool?

I made several such tools many years ago. I'm not sure I can locate them in my backups. In any case, I am in the process of creating some tools to assist in the process of proofing and formatting. These would be for the use of 'solo' producers as well as DP users. I do not believe in full automation for anything complex (as it can introduce as many errors as it fixes), so many of these tools would primarily prompt the user to make choices about things that 'confuse' the software.

Carel

_______________________________________________
gutvol-d mailing list
gutvol-d(a)lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d
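For illustration only (this is not Carel's tool, which is not shown in the thread), a prompt-rather-than-guess approach to line-end hyphenation might be sketched as follows. The function name `rejoin_hyphenated` and the `known_words` word list are assumptions made for this example; a real tool would load a full dictionary and handle more edge cases.

```python
import re

# A word fragment followed by a hyphen at the very end of a line.
END_HYPHEN = re.compile(r"([A-Za-z]+)-\s*$")
# The first word on the next line, plus any punctuation glued to it.
START_WORD = re.compile(r"^([A-Za-z]+)(\S*)\s*")


def rejoin_hyphenated(lines, known_words):
    """Rejoin words split across line ends when the word list settles it.

    Returns (fixed_lines, undecided). Each entry in `undecided` is
    (line_index, first_half, second_half) for a split the word list
    could not settle; those lines are left untouched so that a human
    can make the choice.
    """
    out = list(lines)
    undecided = []
    for i in range(len(out) - 1):
        head = END_HYPHEN.search(out[i])
        tail = START_WORD.match(out[i + 1])
        if not head or not tail:
            continue
        first, second, trailing = head.group(1), tail.group(1), tail.group(2)
        joined = first + second
        hyphenated = first + "-" + second
        joined_ok = joined.lower() in known_words
        hyphen_ok = hyphenated.lower() in known_words
        if joined_ok == hyphen_ok:
            # Both forms known, or neither: do not guess, ask the user.
            undecided.append((i, first, second))
            continue
        word = joined if joined_ok else hyphenated
        # Move the second half up to the first line and drop it from the next.
        out[i] = out[i][:head.start()] + word + trailing
        out[i + 1] = out[i + 1][tail.end():]
    return out, undecided
```

The point of the design is the `undecided` list: anything the word list cannot settle is handed to a person instead of being "fixed" automatically.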

Re: the d.p. opinion on "prerelease" of e-texts
by Bowerbird＠aol.com 25 Feb '10

al said:
> there would need to be some coordination to remove them from
> that environment when they're posted into PG as finished products.

michael said:
> Greg Newby takes care of deleting from PrePrints, should be
> easy enough just to send him notes of what has been completed

my gawd, what a remarkable exchange.

we're talking about _files_, on the _internet_, with _known_ u.r.l., going through a standard progression of states. and y'all think you need to handle this _manually_, by "sending notes"?

you haven't learned one thing about why all of your processes are _so_ inefficient, have you? not a single thing!

-bowerbird

Re: let us not be confused
by Bowerbird＠aol.com 25 Feb '10

jim said:
> Not that I totally disagree, but
> when you take out the easy stuff,
> the stuff that's left is harder to find.

where's your data that validates that?

i think when you clear the page of "the easy stuff", it becomes _easier_ to find "the stuff that's left", because you're not distracted by the easy stuff.

it might _feel_ "less effective", since you're making fewer corrections, but the ones you do make are more vital.

what usually happens is that it takes one round of proofing to remove the easy stuff, and another for the rest...

what i say is to make the first round a tool-aided round, to preserve your human resources, so the humans only have to do one word-by-word round, which is the difficult process.

besides, it's not as if zero satisfaction comes in the tool-aided round. myself, i feel _greater_ satisfaction there, since my efficiency is boosted _considerably._

and in a roundless system, anything that moves a page closer to "finished status" is a good thing, because that's the goal.

just _offer_ people a good tool to use; you will find they enjoy it immensely...

by the way, dkretz has a new version of his tool available now, at the usual place:
> http://code.google.com/p/dp50/downloads/list

-bowerbird

let us not be confused
by Bowerbird＠aol.com 25 Feb '10

ok, we've got a couple of different topics running around, so let us take a minute to make sure we are not confused...

first of all, let's talk about my campaign for preprocessing...

i have demonstrated, over and over and over again, that d.p. (and rfrank) should be doing _much_ better preprocessing...

i've shown how they can use _very_simple_means_ to do that, and how -- if they did -- they could reduce the error-counts in their books to a ridiculously small amount, even _before_ their text went in front of proofers.

i have talked about how it is a huge _waste_ of the generous donations of volunteers (in both time and energy) not to do aggressive preprocessing, which automatically locates errors to make them easy to fix...

again, the crux of my argument -- and i have proven it to be absolutely true, again and again -- is that it's _easy_ to do this.

indeed, when i have shown the steps taken to locate the errors, it becomes painfully obvious how ridiculously simple they are...

they include obvious checks, like a number embedded in a word, or a lowercase letter followed by a capital letter, or two commas in a row, or a period at the beginning of a line. _obvious_ stuff!

this isn't rocket science. it's not even _hard_... it's dirt-simple!

and yet neither d.p. nor rfrank has instituted such preprocessing.

***

let's contrast this with gardner's request, which was to compile a list of reg-ex tests that will locate all possible errors in any random book.

this request -- as worthy as it might seem -- is _much_ more difficult to realize. in fact, it's almost impossible.

a friend of mine over in england, nick hodson, is a very prolific digitizer. all by himself, he has done some 500 books or more.

nick collected an extensive set of checks over the years. i can't remember exactly how many there were, but roughly about 200.

however, once nick upgraded his o.c.r. program, he found that about half of his checks were no longer required. they had been necessary essentially as an artifact of an outdated o.c.r. program.

the type of books nick was digitizing hadn't changed, and neither had the quality of the scans, or the resolution of the scans, or the digital retouching that he performed on the scans -- none of that. he was the same person, using the same computer and scanner, and he was doing the same things exactly as he had done before.

the only thing that changed was the version of his o.c.r. program. yet he found many checks he formerly needed became unnecessary.

so, for an operation like d.p., who intakes all kinds of scans and uses a wide variety of o.c.r. programs, operated by users with a huge range of expertise, their results will be all over the board. they're _never_ gonna get a definitive list of checks to be made. it would be _immensely_ difficult, to the point of being impossible.

but that's totally beside our other point, about preprocessing...

because the fact of the matter is that a few dozen _simple_ tests are all that d.p. needs in order to reduce the number of errors to a level where they can be handled easily by their human proofers.

they're never gonna get 100%. but they could find 90% so easily that it's criminal negligence that they aren't doing that already...

heck, spell-check by itself will locate 50% of the errors for you...

-bowerbird
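The four "obvious checks" named in the message above can be sketched in a few lines. This is only an illustration of the idea, not the actual checks used by bowerbird or nick hodson (which the thread does not show); the labels, the exact regular expressions, and the function name `flag_suspects` are assumptions. Hits are candidates for a human to review, not certain errors: a name like "McDonald" trips the lowercase-then-capital check, for example.

```python
import re

# (label, pattern) pairs mirroring the four examples in the message.
# Labels and exact patterns are assumptions made for this sketch.
CHECKS = [
    ("number embedded in a word", re.compile(r"[A-Za-z]\d+[A-Za-z]")),
    ("lowercase followed by capital", re.compile(r"[a-z][A-Z]")),
    ("two commas in a row", re.compile(r",,")),
    ("period at start of line", re.compile(r"^\.")),
]


def flag_suspects(text):
    """Yield (line_number, label, matched_text) for every suspicious spot."""
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in CHECKS:
            for match in pattern.finditer(line):
                yield number, label, match.group(0)


if __name__ == "__main__":
    sample = "The c1ock struck,, and heBegan\n.to speak"
    for hit in flag_suspects(sample):
        print(hit)
```

Because each check is just an entry in a list, checks that stop being useful (say, after an o.c.r. upgrade) can be dropped without touching the rest.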

Re: Inkmesh
by Carel 25 Feb '10

I meant that it would be nice if the site had a version for smartphones. I failed to realize for a moment that mobi has many meanings.

Carel

Sent from my HTC on the Now Network from Sprint!

----- Reply message -----
From: "James Adcock" <jimad(a)msn.com>
Date: Thu, Feb 25, 2010 3:38 PM
Subject: [gutvol-d] Re: Inkmesh
To: "'Project Gutenberg Volunteer Discussion'" <gutvol-d(a)lists.pglaf.org>

>It would be nice if it had a mobi version.

Not sure I understand this comment. If the result says “Kindle” and it’s a free version, then it is a .mobi file, or else is a .prc file -- which in practice is the same as a .mobi file if you change the file extension.

Re: Inkmesh
by Carel 25 Feb '10

Very nice resource. Thank you! It would be nice if it had a mobi version.

Carel

Sent from my HTC on the Now Network from Sprint!

----- Reply message -----
From: "James Adcock" <jimad(a)msn.com>
Date: Thu, Feb 25, 2010 2:51 PM
Subject: [gutvol-d] Inkmesh
To: "'Project Gutenberg Volunteer Discussion'" <gutvol-d(a)lists.pglaf.org>

Just found out about this site, which is a cross-site search engine for free ebooks in various formats:

http://inkmesh.com

Re: so what is so important about pagination?
by Bowerbird＠aol.com 25 Feb '10

michael said:
> Won't ANY kind of header, footer, page indicator
> "disturb when doing a search"

only if the viewer-app is too stupid to know it should skip such things when searching... (unless you tell it specifically to include 'em).

-bowerbird

Re: d.p.'s undeserved superior attitude about "quality"
by cmiske＠ashzfall.com 25 Feb '10

Bowerbird:
>carel said:
>> You don't need to talk to me about having your time wasted.
>> I am the somewhat infamous Carel that left DP nearly a decade ago
>> (after spending around 500 hours creating a front-end and
>> revamping the horrific code that resided on the back-end).
>
>oh, hey, glad to meet you. :+)
>
>i do believe i am the one who has grabbed the "infamous" crown
>at this point in time, however. i see they still let you post there.
>myself, i've been banished, so as not to corrupt the youth there.

LOL. It's nice to meet you as well. I've read your posts for many years, but I don't believe we have ever directly 'spoken.'

>> DP has managed to produce a huge number of etexts
>and they could have produced 3 times as many if their workflow
>were 3 times more efficient, something it could _easily_ attain...
>according to brewster, internet archive scans 1,000 books a day.
>(google probably scans twice that many before their lunch break.)
>d.p. digitizes 2400 books a _year_. i predict they cannot keep up.

I'm not sure the most efficient system on earth could keep up with the scan rate. No matter how much you automate pre- and post-processing and formatting, it comes down to the speed of humans when it comes to proofing. And, if you speed things up by skipping the manual proofing and use automation completely for the first release, then you will have text little to no better than what google already has. I realize that is not what you are proposing, but what google has done is the most efficient with little regard to the potential for the quality added by human interaction. Wikimedia then takes this further by adding the ability for humans to modify the text in a never-ending cycle.

>> Their etexts also have a decent level of quality.
>
>i'm with michael hart on this. (even if he refuses to see that.)
>
>i think the first mission is to get text up with acceptable quality.
>(which, by the way, most of the in-process stuff at d.p. is _not_)
>and _then_ we can use the _public_ to drive it toward perfection.

I am very much with you on this one. However, it rather duplicates the efforts of the wikimedia foundation (at least the last time I looked into their efforts), so I am not sure of the 'need' for it. In what manner do you suggest that this would be differentiated from the current wiki book editing process at wikimedia that justifies the creation of the project for the purpose of PG? I ask because I am considering giving it a try if I can find a single reason to do so that would not make me feel as though I am reinventing the wheel. I have not been able to find an acceptable reason on my own. I love PG and promote it to all that will listen, but love alone is not enough: There must be logic as well. :)

For that matter, it wouldn't be that hard to cull the data from google books, cut their 'plain text' into pages (and/or re-ocr and diff) and just slap the result into a wiki (which is about what it looks like wikimedia has done). I believe the google use license allows for this for non-commercial entities such as PG? Then, someone can verify the google 'plain text' images as inline images or as content that could not be processed by the OCR (type in missing text and annotate images). Tidy it up (scannos, etc.), add some assumed markup for lines, paragraphs, pages, etc. using scripts with no human interaction. Have a human assisted 'spell check' fix the majority of remaining errors and end of line hyphenation issues, etc. and add some basic markup for italics and whatnot (although the google OCR text appears to already contain this). Run the text through an 'X number of people say this page matches the scan' proofing process (roundless process). Have a human assisted markup process to wrap the content in properly formed (but easy) XML. Convert the document to plain text (and any other formats that are desired).

On a 300 page work of uncomplicated fiction, depending on the scan quality and speed of the humans doing the interactive processes, I would estimate this would take about 2-6 hours _plus_ the proofing time to create a document ready for first release and wiki storage. Then, you let it sit in a wiki for eternity being edited into perfection (or until a very large X number of people again say a page is perfect; then create a final edition when all pages in a book pass this test, post the output, archive the XML version and scans and consider the book done for all time - or at least until a PG reader reports an error that must be checked).

For user supplied scans, same process with OCR done on the server (if desired, can diff against the scan submitter's OCR, if supplied).

I haven't thought about the process in a while and the new press for speed cuts out some of the way I recall my original ideas. I'm probably completely forgetting a topic or two, but this would pretty much work (after a few logic tweaks) to accelerate the release process and should retain some quality factors in the first release.

As you know, ultimately, in regards to quality, you will always have to rely on the quality judgements of individuals working on the page or project to determine the quality of the final output. Quality is a human concept and so there is no way to automate it. I wish there were. :)

I am very interested in your text processing scripts though and always have been. I just haven't been processing any OCR text for a while.

>> If the volunteers at DP are blissfully unaware that their time
>> is being wasted then they do not feel like victims of the process
>
>but they _are_ victims, whether they _feel_like_ victims or not...

They are not victims in the sense that an abused child is an abused child whether the child realizes it or not. We are talking about proofreading and wasting manual hours doing tasks that could have been done by (or assisted by) automation. I may lose sleep over the former, but I'll never lose sleep over the latter. Not that I do not regret the wasting of volunteer time, I just don't have it at the top of my 'causes' list.

>and, like i said, despite a constant influx of volunteers sent there
>by p.g., the steady d.p. base seems to be a fairly constant number,
>which means they are experiencing severe "churn" and "burnout",
>neither of which bodes well for the future. the well _will_ go dry.
>(especially since more and more people will come to see google as
>"the source" of the books that they try to find online, since google
>offers people several million more titles than project gutenberg.)

Loss of volunteers is not good, obviously. Just as with any venture, it is twice as hard to win someone back as it was to gain them in the first place. The last I checked, the wiki book project also appears to have very few active members, so it is also possible that book proofing has a cap on the number of people willing to do it with any regularity that has nothing to do with the format of presentation or the process.

As I said before, I don't believe it is possible to catch up to google. If DP produces 2400 books per year, then even if the process I outlined above (or your own or anyone else's) increased the production by 5x or 10x or even 25x (for first releases), we would still be far behind the scanners. (Although I do believe your scan figures for daily output by IA and google include a lot of material that is not public domain.) I would very much love for someone to prove me wrong on this point and have a million texts in PG in no time flat.

>> If the inefficiency bothers you and me (and others),
>> it's a moot point because it is not our time that is being wasted.
>
>but the potential of our society _is_ being spent inefficiently, and
>the time of _good_people_ is being wasted, for no good reason.
>
>so i think the point is _not_ moot. i believe i should _speak_up_.

Yes, you should do as you please. We all should. You do fight the good fight as they say. As much as I may agree with you most of the time, it just isn't in me to voice out about it. After all, you took my crown. ;)

>> And, we have no power to change what is at DP.
>> At least that is the way I look at it.
>
>i'm not looking for "power". i'm looking to enlighten people.

'Power' is often the easier goal to attain. ;) I merely meant that we are not in a position to have any say in what DP does with itself. Nor in what PG does with itself.

Carel
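The 'X number of people say this page matches the scan' rule from the message above can be made concrete with a small sketch. Nothing here comes from DP's or PG's actual code; the `Page` class, the function names, and in particular the policy that a correction discards earlier confirmations (with the corrector counting as the first new confirmation) are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """One page of text plus the set of proofers who say it matches the scan."""
    text: str
    confirmed_by: set = field(default_factory=set)

    def confirm(self, proofer):
        # The proofer read the page and found nothing to change.
        self.confirmed_by.add(proofer)

    def correct(self, proofer, new_text):
        # Assumed policy: a real change invalidates earlier confirmations,
        # because those proofers approved a different text.
        if new_text != self.text:
            self.text = new_text
            self.confirmed_by = {proofer}
        else:
            self.confirmed_by.add(proofer)


def page_done(page, required):
    """A page is finished once `required` distinct people have confirmed it."""
    return len(page.confirmed_by) >= required


def book_done(pages, required):
    """A book is ready for release when every page passes the same test."""
    return all(page_done(page, required) for page in pages)
```

There are no rounds in this model: each page carries its own count, so a clean page can finish quickly while a difficult one keeps collecting corrections. The same functions with a larger `required` value would cover the "very large X" test for a final edition.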
