jimmy said:
> if you're only interested in the text content, it's quite useless.
that's right.
> It is useful for OCR research to have that data
that's right too. i'm not saying they shouldn't make that data
available. but there's a huge middle-ground they're ignoring.
they provide dirty o.c.r. as their "text", and this bloated x.m.l.
and when they are confronted about their dreadful o.c.r. text,
they respond with "well, you could trawl our x.m.l. instead...".
as if throwing _all_ of the data at us is an acceptable answer
to our reasonable request for _better_ text in the first place.
(and again, i love that alex actually provided a script to us, or
at least the location of a script we can run against every book.
like i said, i wish there were more people like him over there.
he is an excellent example of the way that they _should_ be.)
> so I'm glad they provide it
i'm glad they provide it too, for the one-tenth-of-one-percent
of the population who _might_ be expected to make use of it...
but i'm not glad they're ignoring the needs of the much-larger
percentage of people who would make use of cleaned-up text.
why are they just now, after having collected 3 million books,
bothering to develop a script to give people "useable" output?
> not as useful as corrected text, granted
let me be clear that i'm not asking for "corrected text"...
i am asking for text without any globally-replaceable flaws.
i am asking for text that has some reasonable resemblance
to the structure that finereader can _easily_ give to its output.
look at the o.c.r. that is offered for "the art of the book" --
with faults that are typical of _all_ of the archive.org books.
there's no clear indication of pagebreaks, for crying out loud.
yes, the pagebreaks have 3 blank lines, and that's consistent.
but there are many other places in the book with 3 blank lines,
places which are _not_ pagebreaks. so you can't count on that.
really, if they can't even get _that_ right, what are we left with?
(incidentally, it's entirely _possible_ for them to get it right, because
the .djvu file itself, which produces the text-file, gets it right.
just have abbyy put a formfeed at the end of each page, stupid.)
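(to make that concrete, here's a minimal sketch, in python,
of how trivially the text-file would split into pages if each
page really did end with a formfeed. the filename follows the
archive.org _djvu.txt pattern but is a hypothetical stand-in.)

# sketch: unambiguous page-splitting on formfeed (\f, 0x0c),
# versus guessing from runs of 3 blank lines, which can also
# occur mid-page. the filename is hypothetical.
def split_pages(path):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    pages = text.split("\f")
    # a trailing formfeed leaves one empty chunk at the end
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages

pages = split_pages("artofbook00holm_djvu.txt")
print(len(pages), "pages")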
or consider that, where there were illustrations, this text is
littered with nonsense characters. why weren't they filtered?
again, it's not as if finereader has no knowledge of pictures...
many of the illustrations are saved out specifically, so abbyy
had to know the locations, and could've ignored those areas.
but the people who devised the workflow simply did not care.
they made garbage. and now we have to live in their garbage.
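(and here's how little code that filtering would have taken.
a sketch, assuming word-boxes and picture-boxes come in as
simple (left, top, right, bottom) tuples; the real x.m.l.
fields differ, so treat these structures as hypothetical.)

# sketch: drop o.c.r. "words" whose boxes overlap a known
# picture region. boxes are (left, top, right, bottom) tuples,
# hypothetical stand-ins for the coordinate data in the x.m.l.
def overlaps(a, b):
    al, at, ar, ab = a
    bl, bt, br, bb = b
    return not (ar < bl or br < al or ab < bt or bb < at)

def drop_picture_noise(words, pictures):
    # words: list of (text, box); pictures: list of box
    return [(text, box) for (text, box) in words
            if not any(overlaps(box, pic) for pic in pictures)]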
or try this on for size. here are two scans from the scan-set:
> http://z-m-l.com/misc/artofbook00holm_0108.jpg
> http://z-m-l.com/misc/artofbook00holm_0109.jpg
ok, i understand, mistakes happen, yadda yadda, and i note
that the pages did get rescanned, except without the hand...
my question is this: why were these 2 scans still included in
the scan-set which stands as the representation of this book?
why weren't they simply discarded?
and since they didn't get included in any of the output that
was generated by archive.org, that means that archive.org
has in its workflow a mechanism that says "ignore these"...
which seems to my mind like an unnecessary complication,
a grafted-on non-solution to a problem that _should_have_
been solved by addressing its cause, not one of its effects...
and now every other developer who wants to "add value" to
that scan-set is going to have to build in an "ignore these".
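(something like this hypothetical skip-list, which is trivial
to write but should never have been necessary in the first place:)

# sketch of the grafted-on "ignore these" step that every
# developer now has to reinvent. the skip-list holds the two
# bad scans above; discarding them at the source would have
# made this code unnecessary.
SKIP = {
    "artofbook00holm_0108.jpg",
    "artofbook00holm_0109.jpg",
}

def usable_scans(filenames):
    return [name for name in filenames if name not in SKIP]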
it's true. the more you know about the archive.org workflow,
the more you realize that it's just one jury-rigged _disaster_.
it's all dressed up in the emperor's new x.m.l. clothes, but
the guys who designed it obviously have little competence.
or i dunno, maybe their main priority was on "job security"
-- nudge, nudge, wink, wink, know what i mean? -- but if
that was the case, boy, are they in for a surprise, because
archive.org laid off more people than they hired in 2011,
if my guess is correct, and as the economy gets _worse_
-- which it certainly will -- they will "let go" even more...
and we will be left with a complex mess that no one can
even _understand_, let alone _fix_ so it works _correctly_.
we might as well not even bother spending the money to
create a cyberlibrary, if all we're going to do is waste it...
> but I think the clearest example of
> the value of such data is reCAPTCHA,
> which (in part of its operation)
> compares the output of two OCR systems,
> and extracts images from
> the coordinates where they disagree.
you might not know that google bought recaptcha,
shortly after which internet archive stopped using it.
but yeah, that's one example of using coordinate data.
the _clearest_ example, for archive.org, is using that data
to highlight the hits found when you search their flip-books.
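(a rough sketch of that use, assuming word-level coordinate
data; the (text, box) structure is a hypothetical stand-in
for whatever the x.m.l. actually stores per word.)

# sketch: collect the boxes to highlight for a search hit in
# a page-image viewer, given (text, box) pairs for each word.
def hit_boxes(words, query):
    q = query.lower()
    return [box for (text, box) in words
            if text.lower().strip(".,;:!?\"'()") == q]

# e.g. hit_boxes(page_words, "culture") -> boxes to draw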
and speaking of searching the flip-books, that
reminds me about this screenshot:
> http://z-m-l.com/misc/prime-example.jpg
oops! there's a prime example of a search fail!
go ahead. try it yourself:
> http://www.archive.org/stream/booksculture00mabiuoft#page/236
the stupid. it's everywhere. (but jimmy's point was good.)
-bowerbird