jimmy, thanks for your input. it's cool.
but i'm guessing you must be new... very new...
because some of these things we just went over.
i'll run through them again, but only very quickly,
because i'm afraid of boring the lurkers... :+)
> It could be a "Field of Dreams" ploy -
> if we provide it, they will come (and fix it).
nope. almost all of my requests were targeted at
_creating_ a system for users to correct their text.
but i couldn't get the minimal cooperation necessary
from the archive.org folks to _build_ such a system...
they don't seem to care in the least to have it "fixed".
> Maybe there just aren't enough of them in general.
there seems to be no shortage of them when it comes to
making the mess in the first place. only in cleaning it up.
> I'm not a data scientist, and I don't play one on the internet,
> but if they were trying to do it the right way, I imagine they
> would collect a representative sample of their collection,
> test on that, and if the results were 1) positive and
> 2) statistically significant, they would then (and only then)
> attempt it on the entire collection.
well, yeah, that would be "research methodology" level 101.
unfortunately, archive.org is still stuck back in kindergarten.
as i have gone over recently, the o.c.r. has _correctable_ stuff
like spacey commas -- literally commas preceded by a space.
you don't have to "do research" to know how to correct those.
you just need to run a reg-ex across the entire text collection.
like i said, jimmy, kindergarten. nap time and everything.
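and just so nobody thinks i'm hand-waving, here's the kind
of thing i mean -- a rough little python sketch, where the
"collection" folder and the fix-list are my own invented
examples, not anything archive.org actually has or runs:

    import re
    from pathlib import Path

    # hypothetical fix-list: drop the stray space before commas
    # (and semicolons). these patterns, and the "collection"
    # folder name, are purely illustrative.
    FIXES = [
        (re.compile(r' +,'), ','),
        (re.compile(r' +;'), ';'),
    ]

    def clean_file(path):
        text = path.read_text(encoding='utf-8', errors='replace')
        for pattern, replacement in FIXES:
            text = pattern.sub(replacement, text)
        path.write_text(text, encoding='utf-8')

    # walk the whole collection and clean every plain-text file
    for txt in Path('collection').glob('**/*_djvu.txt'):
        clean_file(txt)

that's it. the hard part is deciding which patterns are safe
to apply blindly, not writing the code.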
> so if they had tried, without considering language,
> they could easily have come up with a negative result
even in kindergarten, it's easy to tell if a book is in english.
but it doesn't matter, because what you are describing is
_way_ over their head. it's _miles_ over their head. really.
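(and in case anybody wonders how even a kindergartner tells
that a book is in english, here's the crude trick i have in
mind -- just a sketch, with a made-up word-list and cutoff,
not anybody's actual code:)

    # crude english check: what fraction of the words are common
    # english function words? the word list and the 0.2 cutoff
    # are just illustrative guesses.
    ENGLISH_STOPWORDS = {'the', 'of', 'and', 'to', 'a',
                         'in', 'that', 'is', 'was', 'it'}

    def looks_like_english(text, threshold=0.2):
        words = text.lower().split()
        if not words:
            return False
        hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
        return hits / len(words) >= threshold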
to bring the discussion back to the level of actual reality,
consider that the text for _many_ archive.org books lost
its em-dashes. the em-dashes were accidentally deleted
sometime _after_ the o.c.r., before the text-file was posted...
for an example of such a book, look here:
> http://ia700303.us.archive.org/10/items/booksandculture00mabirich/booksandculture00mabirich_djvu.txt
go down to page 30, and then compare the text to this:
> http://z-m-l.com/go/mabie/mabiep030.jpg
voila. you'll see that the em-dash has simply vanished...
(ditto with the end-of-line hyphen, as well as a couple of
apostrophes, but we'll just focus on em-dashes for now,
because that glitch can't be solved by a software algorithm.)
this problem has been exceedingly common at archive.org.
i pointed out this glitch to them many _years_ ago... and
they didn't care! they simply didn't care. i'm not kidding.
it took me years -- literally _years_ -- of pestering them
before they "agreed" to fix this flaw. and -- as we see --
years after they promised that, things _still_ ain't fixed...
i mean, seriously, jimmy. if they are unwilling to fix a bug
that's _this_ serious -- across tens of thousands of books --
they certainly won't do the type of stuff you're talking about.
kindergarten, jimmy. kindergarten. that's where they are.
and every year they keep being "held back" to _repeat_ it...
> Have you tried reporting that as an individual bug?
you make me chuckle, jimmy. :+)
if they won't replace missing em-dashes, you can bet
your very last dollar they don't care about pagebreaks.
when i initially reported the em-dash glitch, i was very
nice, as i thought they would be very appreciative, and
thank me for the error-report, and be embarrassed by
the lapse in their quality-control which allowed such a
grievous problem to _exist_, let alone be widespread...
instead, they turned _me_ into "the bad guy", and they
ignored me, and when they could no longer ignore me,
they demonized me, and eventually they _banned_ me.
they are not interested in "bug-reports", jimmy...
as far as i can tell, that's the last thing they wanna hear.
> Specific issues like that are more easily acted upon
> than general issues, and maybe they can
> add up to a general improvement.
well, it's hard for me to imagine something more "specific"
than tens of thousands of books missing their em-dashes.
(not to mention the end-of-line dashes and apostrophes.)
and i am sure this problem could be "easily acted upon"...
somewhere along the line, they made an encoding error,
and it couldn't be all _that_ difficult to track down where.
(just look at every change, and see where they disappear.)
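and the check itself is trivial. something like this, assuming
you have the text as it existed at each stage of their workflow
-- and note that the stage filenames below are pure invention
on my part, since i have no idea what their pipeline looks like
on the inside:

    # count em-dashes in the text produced at each (invented)
    # stage of the workflow, and report the first stage where
    # the count suddenly drops.
    STAGES = ['ocr_output.txt', 'post_processing.txt',
              'published_djvu.txt']

    def emdash_count(path):
        with open(path, encoding='utf-8', errors='replace') as f:
            return f.read().count('\u2014')

    counts = [(name, emdash_count(name)) for name in STAGES]
    for (prev_name, prev_n), (name, n) in zip(counts, counts[1:]):
        if n < prev_n:
            print(f'em-dashes: {prev_n} in {prev_name}, '
                  f'only {n} in {name}')
            break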
but i can't describe to you the stunning immensity of their
total, complete apathy about solving this specific problem.
except, perhaps, to point out that -- years later -- i can still
give a u.r.l. that points you directly to a book with the issue.
(and i assure you with great confidence it's not the only one.)
> I haven't used Finereader for quite some time,
> but I do remember that it got the image segmentation
> quite badly wrong from time to time, so perhaps it was
> safer to trade some gibberish for missing text?
they had to get the picture locations to clip the pictures.
> Well, that's just bizarre.
> Maybe the other hand was holding a crack pipe?
you funny. :+)
but on a serious note, running one of their scanners is
an extremely challenging job, because it's both boring
_and_ demanding of attention. takes a special person.
so errors like that happen. they are to be fully expected.
what's _wrong_ with this picture is that those mis-scans
should have been discarded, not included in the final set.
that's a problem with workflow. not the person scanning.
> Are you referring to another XML file here?
> The excerpt you posted was of Finereader's own XML,
> which is more or less just a dump of its internal state.
right. finereader took a "dump", and archive.org archived it.
whole. and now they suggest we trawl it, to find any good stuff.
including all of those em-dashes which their workflow "lost".
i mean, if there were _diamonds_ in this dump, i'd do it...
but em-dashes? um, no thanks.
in many cases, it would be quicker just to re-do the o.c.r.
> The OCR project I'm loosely affiliated with, Tesseract,
> is also a Google product. I was aware of that, and
> I was told a few things about it, but as I can't remember
> which I was asked not to repeat, I'd just prefer to
> stay on the safe side and not mention anything about it :)
i'll save you the trouble. tesseract is a piece of crap. :+)
but its lousy performance will mean that google has to
do a lot of very hard work to write the routines that can
correct the lousy output turned out by that piece of crap.
which will turn out to be a _great_ thing in the long run.
but hey, i would _love_ an open-source o.c.r. program.
love love love... so if you can turn it into a worthy app,
i'm sure a lot of people like me will heap praise on you.
ok, yes, i admit it, i haven't actually evaluated tesseract
in the last year or so, so maybe it improved immensely.
_maybe_. but i really doubt it. it needed a lot of work.
but hey, let me know if i'm wrong, ok? and i'll review it.
> Now that I didn't know. Seems counterproductive.
if my enemy buys my friend, my friend is no longer my friend.
> The XML data is quite poor for that purpose, as
> it lacks word coordinates, so I chose not to mention it.
> (I think they made public the scripts they use to
> convert the data into a usable form for that purpose,
> but I'm not sure).
hmm... i guess i'll have to take another look at that.
as soon as i work up the nerve; i'm allergic to x.m.l.
-bowerbird