
On 26 October 2011 22:09, <Bowerbird@aol.com> wrote:
> jimmy said:
[I'll just snip out the parts where you can imagine a nodding head and 'uhuh']
> and when they are confronted about their dreadful o.c.r. text, they respond with "well, you could trawl our x.m.l. instead...".
I've given equivalent answers myself, and always meant them as "this is something I've been meaning to try, but haven't found the time for".
> as if throwing _all_ of the data at us is an acceptable answer to our reasonable request for _better_ text in the first place.
It could be a "Field of Dreams" ploy - if we provide it, they will come (and fix it).
> (and again, i love that alex actually provided a script to us, or at least the location of a script we can run against every book. like i said, i wish there were more people like him over there. he is an excellent example of the way that they _should_ be.)
I met some of the Wikimedia Foundation's developers over the weekend, and was shocked by how few of them there are. Maybe there just aren't enough of them in general. (And maybe a weekend in the sun has made me overly positive :)
>> so I'm glad they provide it
> i'm glad they provide it too, for the one-tenth-of-one-percent of the population who _might_ be expected to make use of it...
I think you're being generous with that amount :)
> but i'm not glad they're ignoring the needs of the much-larger percentage of people who would make use of cleaned-up text.
Sure.
> why are they just now, after having collected 3 million books, bothering to develop a script to give people "useable" output?
>> not as useful as corrected text, granted
> let me be clear that i'm not asking for "corrected text"...
> i am asking for text without any globally-replaceable flaws.
I'm not a data scientist, and I don't play one on the internet, but if they were trying to do it the right way, I imagine they would collect a representative sample of their collection, test on that, and if the results were 1) positive and 2) statistically significant, they would then (and only then) attempt it on the entire collection. For testing, they would need a good baseline, so it would ideally be proofread text. I imagine most would turn to PG for this, but this may be the first hurdle - PG is not homogeneous, so it would require significant effort to normalise the collection.

A representative sample of books in IA would also include a relatively large number of foreign-language books. The same set of global replacements for English does not, at least in my experience, apply to other languages - so if they had tried, without considering language, they could easily have come up with a negative result (and been baffled by it). (Feel free to take that with a pinch of the overly positive salt, and sorry if any or all of it has come up before).
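To make that concrete, here's roughly the kind of harness I have in mind. The replacement pairs are invented and the scoring is crude, so treat it as a sketch of the method, not anything the Archive actually runs:

    import difflib
    import re

    # Hypothetical global replacement pairs -- the kind of thing that has
    # to be *tested*, not assumed: "tbe" is almost certainly safe in
    # English, but plenty of plausible pairs would corrupt real words.
    REPLACEMENTS = {
        "tbe": "the",
        "wliich": "which",
        "Avith": "with",
    }

    def apply_replacements(text):
        """Apply each pair at word boundaries only."""
        for wrong, right in REPLACEMENTS.items():
            text = re.sub(r"\b%s\b" % re.escape(wrong), right, text)
        return text

    def error_rate(ocr, truth):
        """Crude character-level error estimate via difflib."""
        return 1.0 - difflib.SequenceMatcher(None, ocr, truth).ratio()

    def evaluate(sample):
        """sample: list of (ocr_text, proofread_text) pairs drawn from a
        representative subset of the collection."""
        before = sum(error_rate(o, t) for o, t in sample) / len(sample)
        after = sum(error_rate(apply_replacements(o), t)
                    for o, t in sample) / len(sample)
        print("mean error rate: %.4f -> %.4f" % (before, after))

    evaluate([("tbe art of tbe book", "the art of the book")])

The word-boundary anchoring is exactly the language problem above in miniature: a pair that is safe in English prose can still clobber a foreign-language book.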
> i am asking for text that has some reasonable resemblance to the structure that finereader can _easily_ give its output.
> look at the o.c.r. that is offered for "the art of the book" -- with faults that are typical of _all_ of the archive.org books.
> there's no clear indication of pagebreaks, for crying out loud. yes, the pagebreaks have 3 blank lines, and that's consistent. but there are many other places in the book with 3 blank lines, places which are _not_ pagebreaks. so you can't count on that. really, if they can't even get _that_ right, what are we left with? (coincidentally, it's _possible_ for them to get it right, because the .djvu file itself, which produces the text-file, gets it right. just have abbyy put a formfeed at the end of each page, stupid.)
Have you tried reporting that as an individual bug? Specific issues like that are more easily acted upon than general issues, and maybe they can add up to a general improvement.
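For what it's worth, if they did emit a form feed at the end of each page, recovering page boundaries downstream would be trivial. A minimal sketch, assuming the text file used \f as a separator (which, as you point out, it currently doesn't):

    def split_pages(path):
        """Split a plain-text OCR dump into pages on form feeds, assuming
        the OCR engine was told to emit one at the end of each page."""
        with open(path, encoding="utf-8") as f:
            return f.read().split("\f")

    pages = split_pages("artofbook00holm_djvu.txt")
    for num, page in enumerate(pages, start=1):
        print("page %d: %d lines" % (num, len(page.splitlines())))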
> or consider that, where there were illustrations, this text is littered with nonsense characters. why weren't they filtered? again, it's not as if finereader has no knowledge of pictures... many of the illustrations are saved out specifically, so abbyy had to know the locations, and could've ignored those areas. but the people who devised the workflow simply did not care. they made garbage. and now we have to live in their garbage.
I haven't used FineReader for quite some time, but I do remember that it got the image segmentation quite badly wrong from time to time, so perhaps it was safer to accept some gibberish than to risk dropping real text?
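That said, since the XML records what FineReader decided each block was, filtering the gibberish after the fact shouldn't be hard either. A sketch, with the element and attribute names (block, blockType, charParams) written from memory, so double-check them against a real file:

    import xml.etree.ElementTree as ET

    def localname(tag):
        # strip the namespace ABBYY puts on every element
        return tag.rsplit("}", 1)[-1]

    def text_without_pictures(path):
        """Emit characters only from blocks FineReader itself classified
        as Text, skipping Picture blocks entirely -- the filtering the
        workflow could have done for free."""
        out = []
        for _, block in ET.iterparse(path):
            if localname(block.tag) != "block":
                continue
            if block.get("blockType") == "Text":
                chars = [c.text or " " for c in block.iter()
                         if localname(c.tag) == "charParams"]
                out.append("".join(chars))
            block.clear()
        return "\n".join(out)

    print(text_without_pictures("artofbook00holm_abbyy.xml"))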
> or try this on for size. here are two scans from the scan-set:
> http://z-m-l.com/misc/artofbook00holm_0108.jpg
> http://z-m-l.com/misc/artofbook00holm_0109.jpg
> ok, i understand, mistakes happen, yadda yadda, and i note that the pages did get rescanned, except without the hand...
> my question is this: why were these 2 scans still included in the scan-set which stands as the representation of this book?
> why weren't they simply discarded?
Well, that's just bizarre. Maybe the other hand was holding a crack pipe?
> and since they didn't get included in any of the output that was generated by archive.org, that means that archive.org has in its workflow a mechanism that says "ignore these"... which seems to my mind like an unnecessary complication, a grafted-on non-solution to a problem that _should_have_ been solved by addressing its cause, not one of its effects...
> and now every other developer who wants to "add value" to that scan-set is going to have to build in an "ignore these".
> it's true. the more you know about the archive.org workflow, the more you realize that it's just one jerry-rigged _disaster_.
> it's all dressed up, in the emperor's new x.m.l. clothes, but the guys who designed it obviously have little competence.
Are you referring to another XML file here? The excerpt you posted was of FineReader's own XML, which is more or less just a dump of its internal state.
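Incidentally, if the "ignore these" mechanism is what I think it is, it's the per-leaf flags in scandata.xml, and honouring them is only a few lines. The element names here (page, leafNum, addToAccessFormats) are from memory, so treat them as assumptions:

    import xml.etree.ElementTree as ET

    def usable_leaves(path):
        """Return the leaf numbers the workflow keeps; leaves flagged
        addToAccessFormats=false (like the hand shots) get skipped by
        every derived format -- the grafted-on 'ignore these' list."""
        keep = []
        for page in ET.parse(path).getroot().iter("page"):
            flag = page.findtext("addToAccessFormats", default="true")
            if flag.strip().lower() == "true":
                keep.append(int(page.get("leafNum")))
        return keep

    print(usable_leaves("artofbook00holm_scandata.xml"))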
> or i dunno, maybe their main priority was on "job security" -- nudge, nudge, wink, wink, know what i mean? -- but if that was the case, boy, are they in for a surprise, because archive.org laid off more people than they hired in 2011, if my guess is correct, and as the economy gets _worse_ -- which it certainly will -- they will "let go" even more...
> and we will be left with a complex mess that no one can even _understand_, let alone _fix_ so it works _correctly_.
> we might as well not even bother spending the money to create a cyberlibrary, if all we're going to do is waste it...
>> but I think the clearest example of the value of such data is reCAPTCHA, which (in part of its operation) compares the output of two OCR systems, and extracts images from the coordinates where they disagree.
> you might not know that google bought recaptcha,
The OCR project I'm loosely affiliated with, Tesseract, is also a Google product. I was aware of that, and I was told a few things about it, but as I can't remember which parts I was asked not to repeat, I'd prefer to stay on the safe side and not mention anything about it :)
> shortly after which internet archive stopped using it.
Now that I didn't know. Seems counterproductive.
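To make the comparison idea concrete, here's the word-level alignment step, minus the coordinate lookup. This is my own sketch of the technique, not reCAPTCHA's actual code:

    import difflib

    def disagreements(ocr_a, ocr_b):
        """Align two OCR readings of the same page word-by-word and
        return the spans where they disagree -- the words a
        reCAPTCHA-style system crops out (via their coordinates) and
        serves as challenges."""
        a, b = ocr_a.split(), ocr_b.split()
        sm = difflib.SequenceMatcher(None, a, b)
        return [(a[i1:i2], b[j1:j2])
                for op, i1, i2, j1, j2 in sm.get_opcodes()
                if op != "equal"]

    print(disagreements("the art of tbe book is fine",
                        "the art of the book is flne"))
    # -> [(['tbe'], ['the']), (['fine'], ['flne'])]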
> but yeah, that's one example of using coordinate data.
> the _clearest_ example, for archive.org, is using the data to highlight a hit found if you search in their flip-books.
The XML data is quite poor for that purpose, as it lacks word coordinates, so I chose not to mention it. (I think they made public the scripts they use to convert the data into a usable form for that purpose, but I'm not sure.)
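Though where per-character boxes do exist, as in FineReader's own output, word boxes can be derived by merging them. A sketch, once more assuming ABBYY-style charParams elements with l/t/r/b attributes (check against a real file before trusting it):

    import xml.etree.ElementTree as ET

    def word_boxes(path):
        """Merge per-character boxes into word bounding boxes -- roughly
        the data a search-hit highlighter needs. Assumes charParams
        elements carrying l/t/r/b attributes, with whitespace characters
        between words."""
        words, chars, boxes = [], [], []

        def flush():
            if chars:
                box = (min(b[0] for b in boxes), min(b[1] for b in boxes),
                       max(b[2] for b in boxes), max(b[3] for b in boxes))
                words.append(("".join(chars), box))
                chars.clear()
                boxes.clear()

        for elem in ET.parse(path).getroot().iter():
            if elem.tag.rsplit("}", 1)[-1] != "charParams":
                continue
            ch = elem.text or " "
            if ch.isspace():
                flush()
            else:
                chars.append(ch)
                boxes.append(tuple(int(elem.get(k, "0")) for k in "ltrb"))
        flush()
        return words

    # highlight every occurrence of a search term
    for word, box in word_boxes("artofbook00holm_abbyy.xml"):
        if word.lower().strip(".,;") == "book":
            print(word, box)

--
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you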