
On 26 October 2011 22:09, <Bowerbird@aol.com> wrote:
> jimmy said:
[I'll just snip out the parts where you can imagine a nodding head and 'uhuh']
> and when they are confronted about their dreadful o.c.r. text, they respond with "well, you could trawl our x.m.l. instead...".
I've given equivalent answers myself, and always meant them as "this is something I've been meaning to try, but haven't found the time for".
> as if throwing _all_ of the data at us is an acceptable answer to our reasonable request for _better_ text in the first place.
It could be a "Field of Dreams" ploy - if we provide it, they will come (and fix it).
> (and again, i love that alex actually provided a script to us, or at least the location of a script we can run against every book. like i said, i wish there were more people like him over there. he is an excellent example of the way that they _should_ be.)
I met some of the Wikimedia Foundation's developers over the weekend, and was shocked by how few of them there are. Maybe there just aren't enough of them in general. (And maybe a weekend in the sun has made me overly positive :)
>> so I'm glad they provide it
> i'm glad they provide it too, for the one-tenth-of-one-percent of the population who _might_ be expected to make use of it...
I think you're being generous with that amount :)
> but i'm not glad they're ignoring the needs of the much-larger percentage of people who would make use of cleaned-up text.
Sure.
> why are they just now, after having collected 3 million books, bothering to develop a script to give people "useable" output?
>> not as useful as corrected text, granted
> let me be clear that i'm not asking for "corrected text"...
> i am asking for text without any globally-replaceable flaws.
I'm not a data scientist, and I don't play one on the internet, but if they were trying to do it the right way, I imagine they would collect a representative sample of their collection, test on that, and if the results were 1) positive and 2) statistically significant, they would then (and only then) attempt it on the entire collection. For testing, they would need a good baseline, so it would ideally be proofread text. I imagine most would turn to PG for this, but this may be the first hurdle - PG is not homogeneous, so it would require significant effort to normalise the collection.

A representative sample of books in IA would also include a relatively large number of foreign-language books. The same set of global replacements for English does not, at least in my experience, apply to other languages - so if they had tried, without considering language, they could easily have come up with a negative result (and been baffled by it). (Feel free to take that with a pinch of the overly positive salt, and sorry if any or all of it has come up before).
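To make that concrete, here's roughly the kind of harness I have in mind. The replacement pairs are invented and the scoring is crude, so treat it as a sketch of the method, not anything the Archive actually runs:

    import difflib
    import re

    # Hypothetical global replacement pairs -- the kind of thing that has
    # to be *tested*, not assumed: "tbe" is almost certainly safe in
    # English, but plenty of plausible pairs would corrupt real words.
    REPLACEMENTS = {
        "tbe": "the",
        "wliich": "which",
        "Avith": "with",
    }

    def apply_replacements(text):
        """Apply each pair at word boundaries only."""
        for wrong, right in REPLACEMENTS.items():
            text = re.sub(r"\b%s\b" % re.escape(wrong), right, text)
        return text

    def error_rate(ocr, truth):
        """Crude character-level error estimate via difflib."""
        return 1.0 - difflib.SequenceMatcher(None, ocr, truth).ratio()

    def evaluate(sample):
        """sample: list of (ocr_text, proofread_text) pairs drawn from a
        representative subset of the collection."""
        before = sum(error_rate(o, t) for o, t in sample) / len(sample)
        after = sum(error_rate(apply_replacements(o), t)
                    for o, t in sample) / len(sample)
        print("mean error rate: %.4f -> %.4f" % (before, after))

    evaluate([("tbe art of tbe book", "the art of the book")])

The word-boundary anchoring is exactly the language problem above in miniature: a pair that is safe in English prose can still clobber a foreign-language book.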
> i am asking for text that has some reasonable resemblance to the structure that finereader can _easily_ give its output.
> look at the o.c.r. that is offered for "the art of the book" -- with faults that are typical of _all_ of the archive.org books.
> there's no clear indication of pagebreaks, for crying out loud. yes, the pagebreaks have 3 blank lines, and that's consistent. but there are many other places in the book with 3 blank lines, places which are _not_ pagebreaks. so you can't count on that. really, if they can't even get _that_ right, what are we left with? (coincidentally, it's _possible_ for them to get it right, because the .djvu file itself, which produces the text-file, gets it right. just have abbyy put a formfeed at the end of each page, stupid.)
Have you tried reporting that as an individual bug? Specific issues like that are more easily acted upon than general issues, and maybe they can add up to a general improvement.
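For what it's worth, if they did emit a form feed at the end of each page, recovering page boundaries downstream would be trivial. A minimal sketch, assuming the text file used \f as a separator (which, as you point out, it currently doesn't):

    def split_pages(path):
        """Split a plain-text OCR dump into pages on form feeds, assuming
        the OCR engine was told to emit one at the end of each page."""
        with open(path, encoding="utf-8") as f:
            return f.read().split("\f")

    pages = split_pages("artofbook00holm_djvu.txt")
    for num, page in enumerate(pages, start=1):
        print("page %d: %d lines" % (num, len(page.splitlines())))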
> or consider that, where there were illustrations, this text is littered with nonsense characters. why weren't they filtered? again, it's not as if finereader has no knowledge of pictures... many of the illustrations are saved out specifically, so abbyy had to know the locations, and could've ignored those areas. but the people who devised the workflow simply did not care. they made garbage. and now we have to live in their garbage.
I haven't used FineReader for quite some time, but I do remember that it got the image segmentation quite badly wrong from time to time, so perhaps it was safer to accept some gibberish than to risk dropping real text?
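That said, since the XML records what FineReader decided each block was, filtering the gibberish after the fact shouldn't be hard either. A sketch, with the element and attribute names (block, blockType, charParams) written from memory, so double-check them against a real file:

    import xml.etree.ElementTree as ET

    def localname(tag):
        # strip the namespace ABBYY puts on every element
        return tag.rsplit("}", 1)[-1]

    def text_without_pictures(path):
        """Emit characters only from blocks FineReader itself classified
        as Text, skipping Picture blocks entirely -- the filtering the
        workflow could have done for free."""
        out = []
        for _, block in ET.iterparse(path):
            if localname(block.tag) != "block":
                continue
            if block.get("blockType") == "Text":
                chars = [c.text or " " for c in block.iter()
                         if localname(c.tag) == "charParams"]
                out.append("".join(chars))
            block.clear()
        return "\n".join(out)

    print(text_without_pictures("artofbook00holm_abbyy.xml"))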
> or try this on for size. here are two scans from the scan-set:
> http://z-m-l.com/misc/artofbook00holm_0108.jpg
> http://z-m-l.com/misc/artofbook00holm_0109.jpg
> ok, i understand, mistakes happen, yadda yadda, and i note that the pages did get rescanned, except without the hand...
> my question is this: why were these 2 scans still included in the scan-set which stands as the representation of this book?
> why weren't they simply discarded?
Well, that's just bizarre. Maybe the other hand was holding a crack pipe?
> and since they didn't get included in any of the output that was generated by archive.org, that means that archive.org has in its workflow a mechanism that says "ignore these"... which seems to my mind like an unnecessary complication, a grafted-on non-solution to a problem that _should_have_ been solved by addressing its cause, not one of its effects...
> and now every other developer who wants to "add value" to that scan-set is going to have to build in an "ignore these".
> it's true. the more you know about the archive.org workflow, the more you realize that it's just one jerry-rigged _disaster_.
> it's all dressed up, in the emperor's new x.m.l. clothes, but the guys who designed it obviously have little competence.
Are you referring to another XML file here? The excerpt you posted was of FineReader's own XML, which is more or less just a dump of its internal state.
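Incidentally, if the "ignore these" mechanism is what I think it is, it's the per-leaf flags in scandata.xml, and honouring them is only a few lines. The element names here (page, leafNum, addToAccessFormats) are from memory, so treat them as assumptions:

    import xml.etree.ElementTree as ET

    def usable_leaves(path):
        """Return the leaf numbers the workflow keeps; leaves flagged
        addToAccessFormats=false (like the hand shots) get skipped by
        every derived format -- the grafted-on 'ignore these' list."""
        keep = []
        for page in ET.parse(path).getroot().iter("page"):
            flag = page.findtext("addToAccessFormats", default="true")
            if flag.strip().lower() == "true":
                keep.append(int(page.get("leafNum")))
        return keep

    print(usable_leaves("artofbook00holm_scandata.xml"))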
> or i dunno, maybe their main priority was on "job security" -- nudge, nudge, wink, wink, know what i mean? -- but if that was the case, boy, are they in for a surprise, because archive.org laid off more people than they hired in 2011, if my guess is correct, and as the economy gets _worse_ -- which it certainly will -- they will "let go" even more...
> and we will be left with a complex mess that no one can even _understand_, let alone _fix_ so it works _correctly_.
> we might as well not even bother spending the money to create a cyberlibrary, if all we're going to do is waste it...
>> but I think the clearest example of the value of such data is reCAPTCHA, which (in part of its operation) compares the output of two OCR systems, and extracts images from the coordinates where they disagree.
> you might not know that google bought recaptcha,
The OCR project I'm loosely affiliated with, Tesseract, is also a Google product. I was aware of that, and I was told a few things about it, but as I can't remember which parts I was asked not to repeat, I'd prefer to stay on the safe side and not mention anything about it :)
> shortly after which internet archive stopped using it.
Now that I didn't know. Seems counterproductive.
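To make the comparison idea concrete, here's the word-level alignment step, minus the coordinate lookup. This is my own sketch of the technique, not reCAPTCHA's actual code:

    import difflib

    def disagreements(ocr_a, ocr_b):
        """Align two OCR readings of the same page word-by-word and
        return the spans where they disagree -- the words a
        reCAPTCHA-style system crops out (via their coordinates) and
        serves as challenges."""
        a, b = ocr_a.split(), ocr_b.split()
        sm = difflib.SequenceMatcher(None, a, b)
        return [(a[i1:i2], b[j1:j2])
                for op, i1, i2, j1, j2 in sm.get_opcodes()
                if op != "equal"]

    print(disagreements("the art of tbe book is fine",
                        "the art of the book is flne"))
    # -> [(['tbe'], ['the']), (['fine'], ['flne'])]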
> but yeah, that's one example of using coordinate data.
> the _clearest_ example, for archive.org, is using the data to highlight a hit found if you search in their flip-books.
The XML data is quite poor for that purpose, as it lacks word coordinates, so I chose not to mention it. (I think they made public the scripts they use to convert the data into a usable form for that purpose, but I'm not sure.)
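Though where per-character boxes do exist, as in FineReader's own output, word boxes can be derived by merging them. A sketch, once more assuming ABBYY-style charParams elements with l/t/r/b attributes (check against a real file before trusting it):

    import xml.etree.ElementTree as ET

    def word_boxes(path):
        """Merge per-character boxes into word bounding boxes -- roughly
        the data a search-hit highlighter needs. Assumes charParams
        elements carrying l/t/r/b attributes, with whitespace characters
        between words."""
        words, chars, boxes = [], [], []

        def flush():
            if chars:
                box = (min(b[0] for b in boxes), min(b[1] for b in boxes),
                       max(b[2] for b in boxes), max(b[3] for b in boxes))
                words.append(("".join(chars), box))
                chars.clear()
                boxes.clear()

        for elem in ET.parse(path).getroot().iter():
            if elem.tag.rsplit("}", 1)[-1] != "charParams":
                continue
            ch = elem.text or " "
            if ch.isspace():
                flush()
            else:
                chars.append(ch)
                boxes.append(tuple(int(elem.get(k, "0")) for k in "ltrb"))
        flush()
        return words

    # highlight every occurrence of a search term
    for word, box in word_boxes("artofbook00holm_abbyy.xml"):
        if word.lower().strip(".,;") == "book":
            print(word, box)

--
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you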