
that means that archive.org has in its workflow a mechanism that says "ignore these"... which seems to my mind like an unnecessary complication, a grafted-on non-solution to a problem that _should_have_ been solved by addressing its cause, not one of its effects...
Yes, part of the bookscanning process (at the end) is reviewing all the images and hitting (basically) "good" or "rescan", and then replacements are added. The bad ones are then marked in the (XML?) file that describes the images for the OCR system, but not deleted. (_scandata.xml iirc)

--
Alex Buie
Network Coordinator / Server Engineer
KWD Services, Inc
Media and Hosting Solutions
+1(703)445-3391 +1(480)253-9640 +1(703)919-8090
abuie@kwdservices.com

On Wed, Oct 26, 2011 at 5:09 PM, <Bowerbird@aol.com> wrote:
jimmy said:
if you're only interested in the text content, it's quite useless.
that's right.
It is useful for OCR research to have that data
that's right too. i'm not saying they shouldn't make that data available. but there's a huge middle-ground they're ignoring. they provide dirty o.c.r. as their "text", and this bloated x.m.l.
and when they are confronted about their dreadful o.c.r. text, they respond with "well, you could trawl our x.m.l. instead...".
as if throwing _all_ of the data at us is an acceptable answer to our reasonable request for _better_ text in the first place.
(and again, i love that alex actually provided a script to us, or at least the location of a script we can run against every book. like i said, i wish there were more people like him over there. he is an excellent example of the way that they _should_ be.)
so I'm glad they provide it
i'm glad they provide it too, for the one-tenth-of-one-percent of the population who _might_ be expected to make use of it...
but i'm not glad they're ignoring the needs of the much-larger percentage of people who would make use of cleaned-up text.
why are they just now, after having collected 3 million books, bothering to develop a script to give people "useable" output?
not as useful as corrected text, granted
let me be clear that i'm not asking for "corrected text"...
i am asking for text without any globally-replaceable flaws.
i am asking for text that bears some reasonable resemblance to the structure that finereader can _easily_ give to its output.
look at the o.c.r. that is offered for "the art of the book" -- with faults that are typical of _all_ of the archive.org books.
there's no clear indication of pagebreaks, for crying out loud. yes, the pagebreaks have 3 blank lines, and that's consistent. but there are many other places in the book with 3 blank lines, places which are _not_ pagebreaks. so you can't count on that. really, if they can't even get _that_ right, what are we left with? (coincidentally, it's _possible_ for them to get it right, because the .djvu file itself, which produces the text-file, gets it right. just have abbyy put a formfeed at the end of each page, stupid.)
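here's a minimal sketch, in python, of the difference. the formfeed version assumes the text-file actually closed each page with \f, which it doesn't today -- that's the whole complaint -- while the blank-line version is the guessing you're stuck with now, and it mis-splits anywhere 3 blank lines happen to fall inside a page:

    import re

    def split_pages_on_formfeed(text):
        # unambiguous: one formfeed per page-end, nothing else looks like it
        return [page.strip("\n") for page in text.split("\f")]

    def split_pages_on_blank_lines(text):
        # unreliable: 3 blank lines (4 newlines in a row) also occur
        # inside pages, so some "pages" here are really fragments
        return [p for p in re.split(r"\n{4,}", text) if p.strip()]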
or consider that, where there were illustrations, this text is littered with nonsense characters. why weren't they filtered? again, it's not as if finereader has no knowledge of pictures... many of the illustrations are saved out specifically, so abbyy had to know the locations, and could've ignored those areas. but the people who devised the workflow simply did not care. they made garbage. and now we have to live in their garbage.
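and the filtering itself is trivial. here's a rough python sketch -- the word and picture rectangles below are made-up (x0, y0, x1, y1) structures standing in for coordinate data the o.c.r. engine already produces, so treat the names as assumptions, not as anybody's actual format:

    def overlap_area(a, b):
        # area of intersection of two (x0, y0, x1, y1) rectangles
        w = min(a[2], b[2]) - max(a[0], b[0])
        h = min(a[3], b[3]) - max(a[1], b[1])
        return max(w, 0) * max(h, 0)

    def keep_word(word_box, picture_boxes, threshold=0.5):
        # keep a word unless most of its box falls inside a picture block
        area = (word_box[2] - word_box[0]) * (word_box[3] - word_box[1])
        if area <= 0:
            return False
        covered = max((overlap_area(word_box, p) for p in picture_boxes),
                      default=0)
        return covered / area < threshold

    def filter_page(words, picture_boxes):
        # words is a list of (text, box) pairs; return only the real text
        return [(t, b) for (t, b) in words if keep_word(b, picture_boxes)]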
or try this on for size. here are two scans from the scan-set:
http://z-m-l.com/misc/artofbook00holm_0108.jpg
http://z-m-l.com/misc/artofbook00holm_0109.jpg
ok, i understand, mistakes happen, yadda yadda, and i note that the pages did get rescanned, except without the hand...
my question is this: why were these 2 scans still included in the scan-set which stands as the representation of this book?
why weren't they simply discarded?
and since they didn't get included in any of the output that was generated by archive.org, that means that archive.org has in its workflow a mechanism that says "ignore these"... which seems to my mind like an unnecessary complication, a grafted-on non-solution to a problem that _should_have_ been solved by addressing its cause, not one of its effects...
and now every other developer who wants to "add value" to that scan-set is going to have to build in an "ignore these".
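just to show how small the wheel is that everybody gets to re-invent: here's a sketch, in python, of an "ignore these" filter driven by the _scandata.xml alex mentioned. the element names (page, leafNum, addToAccessFormats) are my assumptions about that file's layout, from memory, so check them against a real copy before you trust it:

    import xml.etree.ElementTree as ET

    def good_leaves(scandata_path):
        # return the leaf numbers that were not flagged for rescan
        tree = ET.parse(scandata_path)
        keep = []
        for page in tree.iter("page"):
            flag = page.findtext("addToAccessFormats", default="true")
            if flag.strip().lower() == "true":
                keep.append(page.get("leafNum"))
        return keep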
it's true. the more you know about the archive.org workflow, the more you realize that it's just one jerry-rigged _disaster_.
it's all dressed up, in the emperor's new x.m.l. clothes, but the guys who designed it obviously have little competence. or i dunno, maybe their main priority was "job security" -- nudge, nudge, wink, wink, know what i mean? -- but if that was the case, boy, are they in for a surprise, because archive.org laid off more people than they hired in 2011, if my guess is correct, and as the economy gets _worse_ -- which it certainly will -- they will "let go" even more...
and we will be left with a complex mess that no one can even _understand_, let alone _fix_ so it works _correctly_.
we might as well not even bother spending the money to create a cyberlibrary, if all we're going to do is waste it...
but I think the clearest example of the value of such data is reCAPTCHA, which (in part of its operation) compares the output of two OCR systems, and extracts images from the coordinates where they disagree.
you might not know that google bought recaptcha, shortly after which internet archive stopped using it.
but yeah, that's one example of using coordinate data.
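and for what it's worth, the core of that trick is only a few lines. here's a rough python sketch -- align two o.c.r. readings of a page, and wherever they disagree, keep the coordinates of the suspect words so their images can be clipped out later. the (text, box) pairs are invented for illustration, not anybody's real format:

    import difflib

    def disagreements(ocr_a, ocr_b):
        # each reading is a list of (text, box) pairs; return the boxes
        # (from reading a) where the two outputs differ
        words_a = [t for (t, _) in ocr_a]
        words_b = [t for (t, _) in ocr_b]
        matcher = difflib.SequenceMatcher(a=words_a, b=words_b)
        suspect = []
        for op, a0, a1, b0, b1 in matcher.get_opcodes():
            if op != "equal":
                suspect.extend(box for (_, box) in ocr_a[a0:a1])
        return suspect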
the _clearest_ example, for archive.org, is using the data to highlight a hit found if you search in their flip-books.
which, by the way, reminds me about this screenshot:
oops! there's a prime example of a search fail!
go ahead. try it yourself:
http://www.archive.org/stream/booksculture00mabiuoft#page/236
the stupid. it's everywhere. (but jimmy's point was good.)
-bowerbird