[SPAM] re: Re: Bowerbird's software projects

gardner said:
Perhaps not, but over time you have described checks that your tools can do and fixes that you can automatically make that sound a little to me like a super-duper gutcheck.
yes, except that those checks and fixes are most often programmed only into one-off versions of programs... it's usually the case that making those checks and fixes useful in the general case, against any random book, is a more difficult matter. this isn't an apology of any sort; it's just that my intentions (for the most part) are to show that a particular check can be accomplished, and is useful. so far i haven't concentrated on building them into my app, because nobody's really expressed much interest in the app. the app has a general spellcheck ability, and that captures a very high percentage of the errors that occur within a text.
Also the workflow I picture is a little like gutcheck -- I am thinking of text-in text-out command line tools, not something that needs to look at image scans or makes me talk to it in a fancy U/I.
i'm not sure what you mean by the workflow you "picture". i was asking about your _actual_ workflow, the one whereby you currently digitize books. are you saying that you now do your digitizations without ever looking at image scans? because i have a hard time imagining how you can do that. you should also know i am a mac person, for good reason. for me, the interface is prime. if you're looking for tools that work on a command-line, in a text-in-text-out way, i'm the wrong tree for you to be barking up, that's for sure. i certainly wouldn't call my interface "fancy". to the contrary, it's extremely utilitarian, and not very pretty, not pretty at all. but it _is_ an interface, with buttons and menus and all that nice stuff that makes the program a lot easier to work with...
a Linux build would, I think. Windows would be fine too.
i'll send you both.
This is still fairly accurate:
ok, that was very useful... my tool assumes that the page-scans are in the same folder as the app, which is easy enough to satisfy. the tool also assumes that your text is all in one file, and that the page-boundary is of a certain type. i'd assume that your vi skills will enable you to satisfy this assumption in a fairly simple manner. other than that, i'd say you'll be good to go.
The last couple of books I've done instead by scanning, bulk OCR and then proof from the scans and raw OCR text which I can do on the road with my laptop or anywhere I can mount a USB key for a couple of hours.
that's how you'll want to operate with my software, yes.
After OCR I have a few basic things that I do via regular expressions in vi:
you can continue to do those things in vi if you like. global changes in vi are much quicker than going through one-by-one changes in the interface.
The thing is that I do not have a specific set of checks and fixes that I consistently do.
that's something you'll want to remedy. i did a series here a couple years back where i collected a list of checks that was necessary for the book i tested, and somebody turned that list into a set of reg-ex tests. you can find that set on the download page for don's app:
indeed, since you are already using reg-ex, you'll probably find that you prefer don's tool over mine, since his program lets you actually _build_in_ your own list of reg-ex checks...
I rely a lot on jeebies and gutcheck.
so, when you get a report from them on the possible errors, you enter vi and use search to locate each one of the errors?
I would like something perhaps with a wider range of things that it can find so I don't have to know all the things to look for.
well, yes. and you can find some very extensive lists of reg-ex checks, right on d.p. the problem is that many of those checks have a low signal-to-noise ratio, in that they create far too many false-alarms. this is a problem even with gutcheck and heebe-jeebe, if i'm not mistaken. so you really have to fine-tune your list of checks to the particular corpus on which you are working, to be useful. this is why don's app is so useful, because you can build in the list of checks you want to do, and modify it at will, and even enter in a specific reg-ex to see if it returns any hits.
Over the years you have mentioned several automated checks and fixes that sounded sensible enough to me.
sounds like you really want to use that reg-ex list that was based on the month-long series that i did.
I'm not keen enough to go back through the archives, find them and implement them -- but I am nevertheless interested in trying a tool like this out on a project to see if it adds value for what I do.
having heard all this, i'd guess don's app is the one for you.
i'll send mine to you too, but his is based on reg-ex checks...
http://www.gutenberg.org/dirs/3/1/2/1/31212/31212-8.txt and just tell me what you find. I have no doubt there is lots
if the scans are online too, or can be, i'll certainly take a look at it... without looking at them, i can't know if something is an error or not.
-bowerbird

On 18-Feb-2010 21:21, Bowerbird@aol.com wrote:
it's usually the case that making those checks and fixes useful in the general case, against any random book, is a more difficult matter.
That's kind of my experience, I guess. Several fixes will suggest themselves, in the context of a given specific text. The next one might need different fixes. But that doesn't mean a long list of fixups can't be tried, when there's no cost to just adding tests/fixes to the list.
for me, the interface is prime. if you're looking for tools that work on a command-line, in a text-in-text-out way, i'm the wrong tree for you to be barking up, that's for sure.
I see.
i'll send you both.
Better stick to Windoze, if it's a GUI.
ok, that was very useful... my tool assumes that the page-scans are in the same folder as the app, which is easy enough to satisfy.
the tool also assumes that your text is all in one file, and that the page-boundary is of a certain type. i'd assume that your vi skills will enable you to satisfy this assumption in a fairly simple manner.
other than that, i'd say you'll be good to go.
Text in one file -- check. I favour marking page boundaries with "===00123" these days, but a global search/replace can fix that.
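A minimal sketch of that global search/replace, in Python rather than vi. The thread never says what page-boundary format the tool expects, so the `template` below is a placeholder to be swapped for the real one; only the "===00123" marker comes from the message above.

```python
import re

# Gardner's marker: a line made of "===" followed by a zero-padded page number.
PAGE_MARKER = re.compile(r"^===(\d+)[ \t]*$", re.MULTILINE)

def convert_page_markers(text, template="[page {num}]"):
    """Rewrite every ===NNNNN line using `template`.

    The default template is a placeholder, not the format any tool
    in this thread is known to expect.
    """
    return PAGE_MARKER.sub(lambda m: template.format(num=int(m.group(1))), text)

sample = "===00123\nFirst line of page.\n===00124\nNext page.\n"
converted = convert_page_markers(sample)
print(converted)
```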
i did a series here a couple years back where i collected a list of checks that was necessary for the book i tested, and somebody turned that list into a set of reg-ex tests.
you can find that set on the download page for don's app:
Yes. Looking at that. I am not 100% sure I want to mess with Twister exactly, but the list of regular expressions looks interesting. I'm picturing building a perl script that applies all of these fixes, then creates a patch set based on the differences it has introduced. I could then edit the patch set as a file, nuking changes that are wrong, and finally apply the patches for the changes I like.
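A rough sketch of that idea, written in Python instead of perl. The two entries in `FIXES` are made-up examples, not the regex list from the download page, and `book.txt` is a hypothetical file name.

```python
import difflib
import re

# Hypothetical fix list: (pattern, replacement) pairs for illustration only.
FIXES = [
    (re.compile(r"\btbe\b"), "the"),      # a common OCR misread
    (re.compile(r" +([,;:!?])"), r"\1"),  # stray space before punctuation
]

def apply_fixes(text):
    """Run every fix over the whole text, in list order."""
    for pattern, replacement in FIXES:
        text = pattern.sub(replacement, text)
    return text

def make_patch(original, fixed, name="book.txt"):
    """Return the changes as a unified diff that can be reviewed and edited."""
    return "".join(difflib.unified_diff(
        original.splitlines(keepends=True),
        fixed.splitlines(keepends=True),
        fromfile=name + ".orig",
        tofile=name,
    ))

original = "It was tbe best of times ,\nit was the worst of times.\n"
fixed = apply_fixes(original)
patch = make_patch(original, fixed)
print(patch)
```

The diff text can be saved to a file, trimmed by deleting whole hunks for changes that turn out to be wrong, and then applied with the standard `patch` utility. Nearby changes land in the same hunk, so a smaller context (the `n` argument of `difflib.unified_diff`) may make individual changes easier to drop.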
I rely a lot on jeebies and gutcheck.
so, when you get a report from them on the possible errors, you enter vi and use search to locate each one of the errors?
Kind of. Jeebies and gutcheck reference specific line numbers. So I go through the output of these bottom up. For each hit I go to the specified line number and see what's up, fix if needed and then move to the previous hit. I work bottom to top so that changes I make don't invalidate the line numbers in the gutcheck output as I go. I find it takes a good couple of passes before I am satisfied I have all the genuine hits covered. Invariably the WW finds things I've missed anyhow.
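The bottom-to-top trick can be sketched in a few lines of Python. The `hits` list here is a hypothetical stand-in of (line number, message) pairs; it does not reproduce the real gutcheck or jeebies output format, and `demo_fix` is an invented example of a fixer.

```python
def apply_bottom_up(lines, hits, fix):
    """Visit hits from the last line to the first.

    `fix(line, message)` returns a list of replacement lines (empty to
    delete, several to split), so the line count may change without
    disturbing the line numbers of the hits still to be visited.
    """
    result = list(lines)
    for lineno, message in sorted(hits, key=lambda h: h[0], reverse=True):
        result[lineno - 1:lineno] = fix(result[lineno - 1], message)
    return result

def demo_fix(line, message):
    # Invented fixer: delete blank lines, split lines joined with " / ".
    if message == "blank line":
        return []
    if message == "split":
        return line.split(" / ")
    return [line]

lines = ["alpha", "", "beta / gamma", "delta"]
hits = [(2, "blank line"), (3, "split")]  # 1-based line numbers
print(apply_bottom_up(lines, hits, demo_fix))
```

Processed top-down, deleting line 2 would shift the "split" hit onto the wrong line; processed bottom-up, every remaining line number still points where the report said it would.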
sounds like you really want to use that reg-ex list that was based on the month-long series that i did.
Yeah. Got those. Like I say -- I will turn it into a perl script and see where that takes me.
i'll send mine to you too, but his is based on reg-ex checks...
Would be great. Thanks.
http://www.gutenberg.org/dirs/3/1/2/1/31212/31212-8.txt and just tell me what you find. I have no doubt there is lots
if the scans are online too, or can be, i'll certainly take a look at it...
Lots of choices there.
http://www.canadiana.org/ECO/ItemRecord/48293?id=16c79d4f15394e51
http://www.archive.org/details/advocateanovel00heavgoog
http://books.google.com/books?id=ot4OAAAAYAAJ&oe=UTF-8
There are no page numbers in the Gutenberg text though.
See you,
============================================================
Gardner Buchanan <gbuchana@teksavvy.com>
Ottawa, ON
FreeBSD: Where you want to go. Today.

For what it's worth, Twister comes out of pretty much your approach, Gardner. I worked for a long time from regexes in vi and am writing Twister to make as much of it as "batchy" as I can. For instance, when you load a regex file, you can click a button to get a count of each of the regexes in your list. I'm currently adding the ability to choose a regex and get a list of occurrences with 3 lines of context. The goal is to make it transparent how it works, and let you adjust it to make it work the way you do.
But no guarantees - it's still buggy. :( You might want to wait two or three days for a newer version.
(I sure don't miss the requirement in vi to add all those backslashes you don't need in any other regex context I know of...)
On Thu, Feb 18, 2010 at 7:39 PM, Gardner Buchanan <gbuchana@teksavvy.com> wrote:
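For anyone who wants the same two reports from a script, here is a small Python sketch. It is not Twister's code; the function names are invented, and "3 lines of context" is read here as a number of lines on each side of the hit, which is an assumption.

```python
import re

def count_hits(patterns, text):
    """Return {pattern: number of matches} for each regex string."""
    return {p: sum(1 for _ in re.finditer(p, text)) for p in patterns}

def occurrences(pattern, text, context=3):
    """Yield (line_number, block) for each line matching `pattern`.

    `block` holds the matching line plus up to `context` lines on
    either side. Line numbers are 1-based.
    """
    lines = text.splitlines()
    rx = re.compile(pattern)
    for i, line in enumerate(lines):
        if rx.search(line):
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            yield i + 1, lines[lo:hi]

text = "one\ntwo\nthree tbe\nfour\nfive\nsix\nseven tbe tbe\n"
counts = count_hits([r"\btbe\b", r"\bhe\b"], text)
print(counts)
for lineno, block in occurrences(r"\btbe\b", text, context=1):
    print(lineno, block)
```

Note that the count is per match while the listing is per line, so a line with two hits is counted twice but listed once.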

Gardner Buchanan wrote:
On 18-Feb-2010 21:21, Bowerbird@aol.com wrote:
it's usually the case that making those checks and fixes useful in the general case, against any random book, is a more difficult matter.
The tragic bb in a nutshell. He gets one easy text, then builds a "program" that finds the bugs in that one easy text and proclaims it the ultimate fixing tool. ... Everybody laughs. ... BB waits one year. ... Repetitur.
To build a useful tool you have to:
1. get two random samples of scans, say two sets of 100 complete book scans, using different scan techniques and different OCR on books of different ages and provenience. You could get those out of google or IA.
2. build a bug list of those OCRed texts against proven good copies.
3. build a program using the texts and error lists of the first group. You are not allowed to look at the second group texts.
4. run the program against your blind group and record the percentage of positives and negatives it finds.
5. run any known tools against the blind group and see if yours performs significantly better.
6. If better then brag else shut up.
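The scoring in steps 4 and 5 could look something like the Python sketch below. It assumes each tool's output and the bug list have already been reduced to sets of line numbers for one blind-group text; that representation, the function name, and the sample numbers are all invented for illustration.

```python
def score(flagged, known_errors):
    """Compare the lines a tool flagged with the lines known to be wrong."""
    flagged, known_errors = set(flagged), set(known_errors)
    hits = flagged & known_errors          # real errors the tool found
    false_alarms = flagged - known_errors  # flagged, but actually fine
    missed = known_errors - flagged        # real errors the tool did not flag
    return {
        "hits": len(hits),
        "false_alarms": len(false_alarms),
        "missed": len(missed),
        # Share of flags that were real errors.
        "precision": len(hits) / len(flagged) if flagged else 0.0,
        # Share of real errors that were flagged.
        "recall": len(hits) / len(known_errors) if known_errors else 0.0,
    }

# Invented sample: line numbers flagged by a tool vs. the proven bug list.
result = score(flagged={3, 10, 42, 57}, known_errors={3, 42, 99})
print(result)
```

Running the same function over the output of each existing tool on the same blind group gives numbers that can be compared side by side.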
That's kind of my experience, I guess. Several fixes will suggest themselves, in the context of a given specific text. The next one might need different fixes. But that doesn't mean a long list of fixups can't be tried, when there's no cost to just adding tests/fixes to the list.
If you have to enter the regexes manually you should use any editor that supports them.
--
Marcello Perathoner
webmaster@gutenberg.org
participants (4):
- Bowerbird@aol.com
- don kretz
- Gardner Buchanan
- Marcello Perathoner