jon hurst said:
>   The other day I noticed that DP has recently
>   uploaded a reworking of Frankenstein (#41445),
>   the 12th most downloaded "work" on PG.
>   I am assuming that it wasn't cross-checked against #84,
>   so I've started the process of doing this.

ok, well, you're gonna get schooled now, that's for sure...        :+)

"frankenstein" was a stalking-horse that jon noring and lee passey
rode for a _very_ long time in their assault on p.g. some years back.

basically, they destroyed any reputation that #84 might have had,
and tried to extend the collateral damage as far as it'd possibly go.

so i assume that post from lee that's sitting in my spam box now
will have him crawling back up on that hobby-horse to spout off.

the thing is, #84 was always a worst-case example, _very_ atypical.

was it bad?  heck yes.  it was _awful_.  just _dreadful_.

and it _remains_ awful, to this very day.  and _oft-downloaded._

(don't let the "updates" fool you, either.  if you battle through a
diff operation, you'll see the edits were relatively minor, akin to
putting band-aids on a crashed motorcyclist with broken bones,
simply because anyone can see the scratches, but not the bones.)

so i guess the noring/passey gang lost the battle _and_ the war.

because frankenstein #84 is alive, and well, and doing just fine...
clearly, it takes more than 2 guys and a horse to kill frankenstein.

and the new one is gonna plod along with relatively few downloads,
because... well, because it has fewer downloads: p.g.'s vicious circle.

and, for the reasons i posted, nothing about that will change.  ever.

welcome to the machine...  the p.g. sausage-grinder machine.

***

but there are other reasons a "comparison" would be bogus...

mainly because even if you could point to a _difference_, or 1000,
there's no way to _resolve_ the difference, because #84 is a mirage.
there is no source to check against, at least as far as anyone knows.

indeed, if you check the "old" versions, you'll find that the book was
done in separate chunks, by different people.  and they might have
used entirely different editions.  so, you know, have fun with _that_.

but guess what?  #41445 isn't much more clear about _its_ source!
i mean, there's a print-date on a scan that we might _infer_ has
come from the title-page of the work which served as the source,
but i'd feel a whole lot more comfortable with an actual _link_ to
a scan-set hosted at the internet archive or (if it's the case) google.

even if you post the scans, and i do believe that's a great workflow,
you still need to link to the external source from where you got 'em,
so the end-user doesn't have to take your scans just on faith alone.

***

so jon, here's feedback on your stuff...  my main impression is that
you lack some experience, so i'll explicate that to speed you along.

first of all, if you used the text from #41445, you introduced some
errors to it...  you should do a diff to see what, so as to learn why,
such that you can go back to your code and squash the bugs in it.
also, when you rejoin end-line hyphenates, you're doing it wrong.
(and the "clothing" of end-line em-dashes is completely ridiculous.)
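
for the diff step, here's a minimal sketch in python, using the standard difflib module.  the function name and the sample lines are made up for illustration (they're not from either p.g. file), and a real run would read the two texts from disk.  it just lists each span where the two versions disagree:

```python
import difflib

def changed_lines(original, reworked):
    """return (tag, old_lines, new_lines) for every span that differs."""
    a = original.splitlines()
    b = reworked.splitlines()
    # autojunk=False so long books aren't subject to the "popular line" heuristic
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, a[i1:i2], b[j1:j2]))
    return changes

# hypothetical sample text, invented for this example
original = "it was on a dreary night\nof november that i beheld\nmy man completed"
reworked = "it was on a dreary night\nof november that i behe1d\nmy man completed"

for tag, old, new in changed_lines(original, reworked):
    print(tag, old, new)
```

each reported span is a place to go look at your code and ask why the text changed.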

second, all of the page-scans from a book should be the same size.
plucking each text-block is fine, but lay it on a standardized canvas.
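
a rough sketch of the canvas arithmetic, in python.  the assumptions here are mine: the canvas is sized to the largest text-block in the book, and each block is centered on it (top-left anchoring would work just as well).  the sizes are invented, and the actual pasting would need an image library, which is left out:

```python
def canvas_size(block_sizes):
    """smallest canvas (width, height) that fits every text-block."""
    width = max(w for w, h in block_sizes)
    height = max(h for w, h in block_sizes)
    return width, height

def placement(block_size, canvas):
    """top-left offset that centers one block on the canvas."""
    w, h = block_size
    cw, ch = canvas
    return (cw - w) // 2, (ch - h) // 2

# hypothetical pixel sizes of three cropped text-blocks
blocks = [(1200, 1900), (1180, 1950), (1210, 1800)]
canvas = canvas_size(blocks)
for size in blocks:
    print(size, "goes at", placement(size, canvas), "on", canvas)
```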

third, do _not_ remove the page-number from a scan; that's stupid.
that page-number constitutes the auto-documentation for the scan.
which is why you should also never ever remove the running-heads.

fourth, don't remove the page-numbers from the text, either, for
the very same reason.  they are a grounding mechanism for _you_,
as you are working with the content...  otherwise, sooner or later,
you _will_ get lost, and figuring your way out is a waste of energy.

fifth, _never_ever_ give _any_ files generic names, like "003.png".
again, sooner or later, you _will_ encounter problems with that...
every book in your library should have a unique prefix, all its own,
and each file associated with the book should start with that prefix.
i use a 5-letter prefix, meaning it can handle over 3 million books.
with generic filenames, you can't confidently handle _two_ books...
every file should have a unique filename that identifies it _alone_.
(you can have the same file saved under two different filenames,
but you should _never_ have different files with the same name.)
oh, and yes, the filename should also reflect the page-number,
the _real_ page-number, the one that was printed on the page...
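
one way the naming rule might look in python.  the exact layout (prefix, then "p", then the printed page-number padded to 3 digits) is an assumption made for this sketch, as is the "frank" prefix; the only parts taken from the advice above are the 5-letter prefix and the real page-number:

```python
import re

def scan_filename(prefix, printed_page, ext="png"):
    """build a unique filename from a book prefix and the printed page-number."""
    # assumption: prefixes are exactly 5 lowercase letters
    if not re.fullmatch(r"[a-z]{5}", prefix):
        raise ValueError("prefix must be 5 lowercase letters: %r" % prefix)
    label = str(printed_page)
    if label.isdigit():
        label = label.zfill(3)   # pad arabic numbers so files sort in page order
    return f"{prefix}p{label}.{ext}"

print(scan_filename("frank", 3))      # arabic page-number
print(scan_filename("frank", "xii"))  # roman-numeral front-matter, kept as printed
```

the point of the design is that the name alone tells you which book and which page, with no folder context needed.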

sixth, don't split the book's text up into a file for every single page.
put it all into a single file, so it's easy for the humans to work with.
the computer can split and join with ease, so design for the people.
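
to show how cheap the split and join are for the computer, a small python sketch.  the marker format ("[[page 001]]" on a line of its own) is hypothetical, picked only for this example; any unambiguous page-marker would do, and it doubles as the page-number kept in the text:

```python
import re

# hypothetical page-marker: a line like [[page 001]]
MARKER = re.compile(r"^\[\[page (\S+)\]\]$", re.MULTILINE)

def join_pages(pages):
    """combine (label, text) pairs into one book-length string."""
    parts = []
    for label, text in pages:
        parts.append(f"[[page {label}]]\n{text.strip()}\n")
    return "\n".join(parts)

def split_pages(book):
    """recover the (label, text) pairs from the single-file form."""
    pieces = MARKER.split(book)
    # pieces[0] is whatever precedes the first marker;
    # after that, labels and page texts alternate
    labels = pieces[1::2]
    texts = pieces[2::2]
    return [(label, text.strip()) for label, text in zip(labels, texts)]

pages = [("001", "first page text"), ("002", "second page\nmore text")]
book = join_pages(pages)
print(book)
print(split_pages(book) == pages)
```

so the person edits one file, and per-page files can be regenerated whenever a tool wants them.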

seventh, i've done it both ways over the years, so i can tell you that
a viewer-program that puts the whole book on one page is _not_
the one you will end up using for most operations, simply because
it bogs down the browser once you've loaded multiple page-scans.
all-on-one-page is nice as an option, but much less useful overall.
don't get me wrong -- your viewer is nice.  but its design is faulty.
and if you actually use it enough, you'll agree from the experience.

***

so there you go.  7 tips is quite enough for one day, i would say.

especially because i know you're not too keen on receiving _any_...

which is ok.  i did it anyway, so someone else could learn from it,
either right away, or sometime down the line.  however it happens.

-bowerbird

p.s.  good luck with getting p.g. to sponsor your viewer-program.
i have offered them much better ones, which they never accepted,
so i'm not optimistic for you, but it's certainly a thing p.g. needs...

p.p.s.  just exactly what is "a kindle-sized .pdf"?  i'd like to know.
because kindles now come in a wide size variety, they really do.