
Sorry, what I was trying to say is that the current script isn't pre-run across every item; it runs on each request. So it should be relatively easy for someone to write one that outputs whatever they want from an abbyy.gz, and then I can get it pushed into the tree. (Although I get the feeling this still isn't the resolution you [or I] are looking for.)

Alex

On Mon, Oct 24, 2011 at 5:29 PM, <Bowerbird@aol.com> wrote:
alex said:
> There's nothing stopping someone writing an abbyy.gz to plain (non-encumbered) html converter, aside from effort, you know...
this is the type of response we often get from archive.org.
because it insulates them from criticism of their workflow. and allows them to float mindlessly in their little bubble...
look, people, it's _stupid_ to embed the important data in a mass of unimportant data. it's the _wrong_ way to do it!
_stupid_. and _wrong_. do you understand those words?
yes, of course, we can clean the crucial data out of the mess.
but why should we have to do that?
why are _you_ taking action that _requires_ us to do that?
i mean, really, when you boast with pride that you've made "three million books available to the public", do you really want us to have to decode that boast to its more accurate "we've put three million books inside of a big pile of mud, and there's nothing stopping you from pulling them out..."
is that _really_ what you want to say?
because that is _really_ what you are _actually_ saying now.
besides, even though it might be rather simple to _write_ such a "converter" -- although it shouldn't output "html", as you said, but rather the text directly -- it's another thing entirely to then execute that converter on _3_million_texts_.
_any_ kind of bottleneck, no matter how small, grows large when you have to do it a million times -- let alone 3 million.
and if you are talking about a _big_ bottleneck, like parsing x.m.l. files that are 58 megabytes (like the current example) -- again, across a corpus that includes millions of books -- then we're talking about very _significant_ processor-time, not to mention bandwidth issues if we have to retrieve them, and storage concerns (even to save only the smaller output). not to mention that then we've essentially forked your library.
so _you_ doing it _right_ in the _first_place_ is very important.
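
For reference, the kind of converter being argued about could be sketched roughly as below. This is only an illustration, not the archive.org script: it assumes the abbyy.gz file is gzip-compressed XML in which each recognized character is the text of a `charParams` element nested inside `line`, `par`, and `page` elements, and the function names `abbyy_to_text` and `convert_file` are made up for this example. It streams the file rather than loading it whole, since the files discussed here can run to tens of megabytes.

```python
import gzip
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def abbyy_to_text(stream):
    """Yield one string per OCR line from an ABBYY-style XML stream.

    Assumed layout (not verified against every file):
    page > ... > par > line > ... > charParams, one character each.
    """
    chars = []
    # iterparse reads incrementally, so the whole tree is never built at once.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "charParams":
            chars.append(elem.text or "")
        elif name == "line":
            yield "".join(chars)
            chars = []
        elif name == "par":
            yield ""  # blank line between paragraphs
        elif name == "page":
            elem.clear()  # discard the finished page's subtree to bound memory


def convert_file(path):
    """Return the plain text of one gzip-compressed ABBYY XML file."""
    with gzip.open(path, "rb") as fh:
        return "\n".join(abbyy_to_text(fh))
```

Calling `convert_file("some_item_abbyy.gz")` (a hypothetical filename) would return the text with one OCR line per output line. Writing something like this is the easy part; the cost of running it over millions of files is the separate issue raised above.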
> (not trying to snark at BB)
nor am i trying to snark back.
-bowerbird
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d