gardner said:
> Perhaps not, but over time you have described
> checks that your tools can do and
> fixes that you can automatically make
> that sound a little to me like a super-duper gutcheck.
yes, except that those checks and fixes are most often
programmed only into one-off versions of programs...
it's usually the case that making those checks and fixes
useful in the general case, against any random book, is
a more difficult matter. this isn't an apology of any sort;
it's just that my intentions (for the most part) are to show
that a particular check can be accomplished, and is useful.
so far i haven't concentrated on building them into my app,
because nobody's really expressed much interest in the app.
the app has a general spellcheck ability, and that captures
a very high percentage of the errors that occur within a text.
> Also the workflow I picture is a little like gutcheck --
> I am thinking of text-in text-out command line tools,
> not something that needs to look at image scans or
> makes me talk to it in a fancy U/I.
i'm not sure what you mean by the workflow you "picture".
i was asking about your _actual_ workflow, the one whereby
you currently digitize books. are you saying that you now
do your digitizations without ever looking at image scans?
because i have a hard time imagining how you can do that.
you should also know i am a mac person, for good reason.
for me, the interface is prime. if you're looking for tools
that work on a command-line, in a text-in-text-out way,
i'm the wrong tree for you to be barking up, that's for sure.
i certainly wouldn't call my interface "fancy". to the contrary,
it's extremely utilitarian, and not very pretty, not pretty at all.
but it _is_ an interface, with buttons and menus and all that
nice stuff that makes the program a lot easier to work with...
> a Linux build would, I think. Windows would be fine too.
i'll send you both.
> This is still fairly accurate:
ok, that was very useful... my tool assumes that the page-scans
are in the same folder as the app, which is easy enough to satisfy.
the tool also assumes that your text is all in one file, and that the
page-boundary is of a certain type. i'd assume that your vi skills
will enable you to satisfy this assumption in a fairly simple manner.
other than that, i'd say you'll be good to go.
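
if your ocr output is one file per page rather than one big file, the joining step could be sketched roughly like this. it's only a minimal sketch in python: the function name `join_pages` and the boundary format `=== page N ===` are made-up placeholders, not the page-boundary my tool actually expects, so swap in the real marker.

```python
def join_pages(pages, boundary="=== page {n} ==="):
    """Join per-page texts into one string, one boundary line before each page.

    The default boundary format is a hypothetical placeholder; replace it
    with whatever page-boundary marker the target tool really requires.
    """
    parts = []
    for n, text in enumerate(pages, start=1):
        parts.append(boundary.format(n=n))   # boundary line for page n
        parts.append(text.strip("\n"))       # page text, outer blank lines trimmed
    return "\n".join(parts) + "\n"


if __name__ == "__main__":
    # two tiny stand-in pages; real input would be read from the ocr files
    print(join_pages(["first page text\n", "second page text\n"]), end="")
```

the same result is easy to get with a few vi commands; the point is only that every page ends up in one file with a consistent boundary line.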
> The last couple of books I've done instead by scanning,
> bulk OCR and then proof from the scans and raw OCR text
> which I can do on the road with my laptop or
> anywhere I can mount a USB key for a couple of hours.
that's how you'll want to operate with my software, yes.
> After OCR I have a few basic things that I do
> via regular expressions in vi:
you can continue to do those things in vi if you like.
global changes in vi are much quicker than going through
one-by-one changes in the interface.
> The thing is that I do not have
> a specific set of checks and fixes that I consistently do.
that's something you'll want to remedy.
i did a series here a couple years back where i collected
a list of checks that were necessary for the book i tested,
and somebody turned that list into a set of reg-ex tests.
you can find that set on the download page for don's app:
> http://code.google.com/p/dp50/downloads/list
indeed, since you are already using reg-ex, you'll probably
find that you prefer don's tool over mine, since his program
lets you actually _build_in_ your own list of reg-ex checks...
> I rely a lot on jeebies and gutcheck.
so, when you get a report from them on the possible errors,
you enter vi and use search to locate each one of the errors?
> I would like something perhaps with
> a wider range of things that it can find
> so I don't have to know all the things to look for.
well, yes. and you can find some very extensive lists of
reg-ex checks, right on d.p. the problem is that many
of those checks have a low signal-to-noise ratio, in that
they create far too many false-alarms. this is a problem
even with gutcheck and jeebies, if i'm not mistaken.
so you really have to fine-tune your list of checks to the
particular corpus on which you are working, for it to be useful.
this is why don's app is so useful, because you can build in
the list of checks you want to do, and modify it at will, and
even enter in a specific reg-ex to see if it returns any hits.
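
to make the idea of an editable list of reg-ex checks concrete, here is a minimal sketch in python. it is not how don's app works internally, and the three patterns in `CHECKS` are hypothetical examples rather than the list from my series. the per-check counts from `summarize` are one rough way to spot the checks that raise too many false-alarms on a given book.

```python
import re

# Hypothetical example checks; a real list would be tuned to the book at hand.
CHECKS = [
    ("space before punctuation", r" [,;:!?]"),
    ("doubled word", r"\b(\w+) \1\b"),
    ("digit inside a word", r"[A-Za-z][0-9][A-Za-z]"),
]


def run_checks(text, checks=CHECKS):
    """Return {check name: [(line number, line), ...]} for every check."""
    compiled = [(name, re.compile(pattern)) for name, pattern in checks]
    hits = {name: [] for name, _ in checks}
    for number, line in enumerate(text.splitlines(), start=1):
        for name, regex in compiled:
            if regex.search(line):
                hits[name].append((number, line))
    return hits


def summarize(hits):
    """Return (hit count, check name) pairs, noisiest check first."""
    return sorted(((len(found), name) for name, found in hits.items()),
                  reverse=True)


if __name__ == "__main__":
    sample = "the the cat\nhello , world\nfine line\nw0rd\n"
    for count, name in summarize(run_checks(sample)):
        print(count, name)
```

a check that fires on hundreds of lines that turn out to be fine is a candidate for tightening or for dropping from the list for that book.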
> Over the years you have mentioned
> several automated checks and fixes
> that sounded sensible enough to me.
sounds like you really want to use that reg-ex list
that was based on the month-long series that i did.
> I'm not keen enough to go back through the archives,
> find them and implement them -- but I am
> nevertheless interested in trying a tool like this out
> on a project to see if it adds value for what I do.
having heard all this, i'd guess don's app is the one for you.
> http://code.google.com/p/dp50/downloads/list
i'll send mine to you too, but his is based on reg-ex checks...
> http://www.gutenberg.org/dirs/3/1/2/1/31212/31212-8.txt
> and just tell me what you find. I have no doubt there is lots
if the scans are online too, or can be, i'll certainly take a look at it...
without looking at them, i can't know if something is an error or not.
-bowerbird