here are the suggestions that should be accepted

25 Jul 2005

      jon said:
Well, we've both had our say, 
   so let the few others who
   are following this topic 
   decide for themselves.

how about instead we have people
forget about what you said so they
can pay attention to what i said?              :+)

the system you want to lay on people
is so convoluted from the standpoint of
their _actual_ workflow right now that
they reject what you say out of hand,
and end up unwilling to discuss the topic.

and that makes it all the more difficult
to get them to accept reasonable ideas.

as david put it:
That is true. That will continue to be true 
   what ever standard is agreed upon, since 
   people aren't going to suddenly change 
   their scanning patterns based on what the 
   final product, a long ways down the line, looks like.

now, the reality of the situation is that
a change in their "scanning patterns"
might well _improve_ their work flow,
make them more efficient, and so on.

but since you don't understand that flow
-- and since they know that you don't --
you can't really hope to talk at that level.

i'm sorry to have to keep on harping that
you don't know what you're talking about.

i'm aware that some people hear that and
think i'm some mean person dumping on you.

the truth of the matter is i don't like saying it.
i don't like that you make it necessary to do so.

but _somebody_ has to level with you, jon.

take another example, from your latest post:
But your system requires that each image, 
   when it is first saved, needs a human being 
   to eyeball the page, determine the publisher
   supplied page number (if any; may be implied), 
   and then manually save the page 
   using the publisher number.

so i tell you again that you need to look at the
auto-naming feature in your scanning program.
it saves the page automatically, then increments
the page-counter in anticipation of the next file.

(if you don't have such a capability, you really
need to get yourself a more modern program.
but since it was common even a decade ago,
i assume that your program has this feature.)

auto-naming is what most all d.p. scanners use.
they don't enter a name for each individual file,
as that would waste energy and slow them down.
they get into a dream routine where their body is
flying in a ritual of page-turning and positioning.
some of 'em can scan hundreds of pages an hour.

if you had scanned in more than a few books, and
scanned 'em in the manner that d.p. scanners do,
you would know this, and you wouldn't say things
so inconsistent with the reality of their processes.
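
for anyone who hasn't watched auto-naming at work,
here is a minimal sketch of the idea in python.
the "page" prefix, the three-digit counter, and
the .png extension are all assumptions made up
for illustration; each scanning program has its
own settings, so don't read this as any one of them.

```python
from itertools import count

def auto_namer(prefix="page", start=1, digits=3, ext=".png"):
    """Yield file names like page001.png, page002.png, and so on.

    All four parameters are hypothetical defaults, not the settings
    of any particular scanning program.
    """
    for n in count(start):
        # zero-pad the counter so the names sort in page order
        yield f"{prefix}{n:0{digits}d}{ext}"

# each "scan" takes the next name; the counter increments by itself
names = auto_namer(start=1)
first_three = [next(names) for _ in range(3)]
print(first_three)
```

the scanner never types a file name; the only
decision is where the counter starts, and whether
that lines up with the page-numbers in the book.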

***

here's what _really_ needs to be done
to take page-image scan-sets public...

1.   scanning should be done such that
the auto-naming of the scanned images
fits the outcome we want them to have.
however, if the scanner for some reason
cannot or did not do this, it's no problem,
as our tools can rename the files _easily_.
i developed such a renamer using an e-text
that had unnumbered plates and blank pages
(some of which were scanned and some not),
plus a whole raft of missing pages (part of an
english/hawaiian facing-translations section),
and so on.   and the renaming tool worked fine.

2.   quality-control needs to be done
_during_ the process of scanning, so
errors (like missing pages or duplicates)
are caught when they can be fixed easily.
i believe d.p. scanners have learned this.
you do a visual inspection of each page,
to make sure it is acceptable quality, and
then a check where you step through all
of the images checking their page-numbers
to confirm you got every single one of 'em.
again, regular d.p. scanners do this already,
because they've learned that it's worthwhile.
finding bad pages after the fact = pain in the ass.

3.   all of the scans need to be _cleaned_.
they need to be deskewed and regularized,
both in terms of size, and placement of text.
since this job can be done fairly automatically,
without much need for human intervention, and
since it will help the o.c.r. accuracy, it should be
done _before_ the o.c.r.   the original files can be
discarded; the foibles of scanning are unimportant.
i have not lobbied on behalf of such cleanup before
because i figured d.p. would learn it sooner or later.
(now if only the large scanning projects would too!)
but if you take scans public, they must be cleaned.
but again, if this wasn't done originally, it can easily
be done later, since it is generally rather automatic.

4.   the o.c.r. process should output individual files,
so as to retain the linkage between text and its scan.
at this point, any weirdnesses in the page-numbering
can be addressed, and blank scans inserted if needed.
i won't get into the reasons why now, but these files
should have all styling (margins, italics, etc.) saved!
folks, do _not_ strip your o.c.r. files to plain-ascii!

5.   the posted e-text-files -- ascii, html, all of 'em --
need to have the page-breaks indicated in some way,
so that end-users and their tools can easily correlate
the text with the exact page-scan from which it came.
so pagebreaks must be retained every step of the way!
stop discarding this useful information from the files!
if you need input on how to represent them, ask me!

6.   all the way through the cleaning-up of the text,
a person should be able to deal with one page of text
_or_ with the entire book, whichever is convenient
for the particular clean-up task at hand at the time.
this is, of course, a function of the _tools_ they use,
rather than the scan-set per se, but i mention it now
since the implication is you'll need to be receptive to
feedback and requests you get from your toolmakers;
an example of that is my input on naming of the files.

7.   every page-scan should be accessible separately.
bundles -- such as djvu and .zip -- can also be made,
but the world must be able to grab each page-scan
_individually_ when one is all that they want to see;
it's unreasonable to expect 'em to download 'em all.

8.   a system of "continuous proofreading" should be
initiated to make the best use of these public scans,
as it will leverage the immense power of user eyeballs
to march the e-texts toward an error-free perfection.

9.   any all-image-format should be easily ported to
the current and future platforms that can utilize it,
like the sony playstation portable and the nokia 770.
it would be great if you offered such porting online.
i'll be offering it as a feature in my viewer-program,
with the text-file as input, but it would be nice if you
could offer it using the original page-scans as well...

10.   whenever possible, the converted e-text should
accompany the scans, to overcome the weaknesses
of an image-format (e.g., the lack of searchability,
the inability to copy-and-paste text, and so forth...).
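
to make points 2 and 5 a bit more concrete, here is
a rough sketch in python, not a finished tool.
it assumes scans are named like "page007.png" and
that page-breaks in the text are marked by a line
like "[page 7]". both conventions are hypothetical,
picked only for this example. the first function
looks at file names alone (it is no substitute for
the visual inspection); the second splits a text
back into pages so each can be matched to its scan.

```python
import re

# hypothetical naming scheme: "page" + page number + ".png"
PAGE_NAME = re.compile(r"^page(\d+)\.png$")

def check_page_sequence(filenames, first, last):
    """Return (missing, duplicates, unrecognized) for pages first..last."""
    seen = {}
    unrecognized = []
    for name in filenames:
        m = PAGE_NAME.match(name)
        if m is None:
            unrecognized.append(name)
            continue
        n = int(m.group(1))
        seen[n] = seen.get(n, 0) + 1
    # page numbers in the expected range that no file claims
    missing = [n for n in range(first, last + 1) if n not in seen]
    # page numbers claimed by more than one file
    duplicates = sorted(n for n, c in seen.items() if c > 1)
    return missing, duplicates, unrecognized

# hypothetical page-break marker: a line consisting of "[page N]"
PAGE_BREAK = re.compile(r"^\[page (\d+)\]$", re.MULTILINE)

def split_by_page(text):
    """Map each page number to the text that follows its marker."""
    # re.split with one capturing group returns
    # [text before first marker, number, body, number, body, ...]
    parts = PAGE_BREAK.split(text)
    pages = {}
    for i in range(1, len(parts) - 1, 2):
        pages[int(parts[i])] = parts[i + 1].strip()
    return pages

scans = ["page001.png", "page002.png", "page02.png", "page004.png", "cover.jpg"]
missing, duplicates, unrecognized = check_page_sequence(scans, 1, 5)
print("missing:", missing)
print("duplicates:", duplicates)
print("unrecognized:", unrecognized)

pages = split_by_page("front matter\n[page 1]\naloha\n[page 2]\nmahalo\n")
print(pages)
```

text before the first marker is dropped here on purpose;
a real tool would have to decide what to do with it.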

-bowerbird

Bowerbird＠aol.com
