this thread explains how to digitize a book quickly and easily.
search archive.org for "booksculture00mabiuoft" for the book.
***
the original o.c.r. text-file, copied from archive.org, is here:
> http://zenmarkuplanguage/grapes000.txt
the file, after its first round of editing by me, is here:
> http://zenmarkuplanguage/grapes001.txt
here are the changes i made in my text-editor:
eliminate: front-matter (for now)
eliminate: back-matter
eliminate: blank-lines in front of numbers-in-column-1
auto-change: delete trailing spaces
auto-change: replace 3-blank-line sequences with === line
auto-change: double-quotes at line-start
auto-change: double-quotes at line-end
auto-change: floating semicolons
auto-change: floating colons
search: floating commas
search: floating periods
search: period at line-start
search: comma at line-start
search: single-quotes at line-start
search: single-quotes at line-end
search: letter-number combo
search: number-letter combo
search: number-space-number combo
search: one-letter-lines
search: two-letter-lines
search: three-letter-lines
search: four-letter-lines
search: one-character-lines
search: two-character-lines
search: two-number-lines (line #s from 10 to 99)
search: chapter-headers & subheads & dropcaps/smallcaps
search: singlequote-singlequote
search: floating doublequotes
search: floating singlequotes
search: opening singlequotes
search: closing singlequotes
search: lowercase-uppercase
search: uppercase-uppercase (roman numerals)
search: 11 ii vv
search: lowercase-linebreak-linebreak-lowercase
search: comma-linebreak-linebreak
search: [@#$%^&*()_+{}\|]
auto-change: utf8 emdashes
fixes: various fixes of stuff i encountered along the way
they should all be pretty obvious, but if you have any questions,
don't hesitate to ask me. this is the important "first step" here.
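
as a rough illustration, a few of the steps listed above could be
scripted in python like this. it is only a sketch, not the text-editor
procedure that produced grapes001.txt. the function names (auto_change,
find_suspects), the regexes, and the exact form of the === replacement
are all assumptions. the emdash step is left out, since the list does
not say which direction that change goes.

```python
import re


def auto_change(text):
    """apply three of the 'auto-change' steps (assumed interpretations)."""
    # delete trailing spaces and tabs on every line
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # a run of exactly 3 blank lines is 4 newlines in a row;
    # the blank line kept on each side of === is a guess
    text = re.sub(r"(?<!\n)\n{4}(?!\n)", "\n\n===\n\n", text)
    # floating semicolons and colons: "word ;" becomes "word;"
    text = re.sub(r"(?<=\S) +([;:])", r"\1", text)
    return text


# a handful of the 'search' items, one regex per item (all guesses)
SEARCHES = {
    "floating comma or period": re.compile(r"(?<=\S) +[,.]"),
    "comma or period at line-start": re.compile(r"^[.,]"),
    "letter-number combo": re.compile(r"[A-Za-z][0-9]"),
    "number-letter combo": re.compile(r"[0-9][A-Za-z]"),
    "number-space-number combo": re.compile(r"[0-9] [0-9]"),
    "one-to-four-letter line": re.compile(r"^[A-Za-z]{1,4}$"),
    "lowercase-uppercase": re.compile(r"[a-z][A-Z]"),
}


def find_suspects(text):
    """return (line number, label, line) for every line a search flags."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SEARCHES.items():
            if pattern.search(line):
                hits.append((number, label, line))
    return hits
```

the searches only flag lines for a human to look at; nothing is changed
automatically, since many hits (like a legitimate "1 2" in a table)
will turn out to be fine.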
***
your mission, if you choose to accept it, is to edit the text-file,
exactly as i described above, and compare your results to mine.
a good file-comparison utility is essential when you digitize,
so that you can compare subsequent versions of your text-file,
to ensure that you made (1) the changes that you intended, and
(2) no _unintended_ changes. find a good file-comparison tool.
i use and like diffmerge, which is cross-platform and cost-free:
> http://www.sourcegear.com
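
if you have no comparison tool installed yet, python's standard
difflib module can stand in for a quick check. this is a minimal
sketch, not a replacement for diffmerge; the function name
changed_lines is made up for the example.

```python
import difflib


def changed_lines(old_text, new_text):
    """list every removed ("- ") and added ("+ ") line between two versions."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? "); skip those
    return [line for line in diff if line.startswith(("- ", "+ "))]


# usage: a trailing space was deleted on the first line, nothing else changed
for line in changed_lines("one \ntwo\n", "one\ntwo\n"):
    print(repr(line))
```

reading down that list is how you confirm both halves of the rule:
every change you meant to make shows up, and nothing else does.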
***
i get backchannels. i'd much rather talk frontchannel, because
dialog is one of the objects here, but i do get backchannels too.
like this:
> Why not put your project documentations in a blog/tumblr?
> It seems it has far wider general interest than the PG project.
i might put some stuff out in a location that's a bit more public.
or i might not. it doesn't seem to me that there are many people
looking for tools to help them digitize books. i wish there were.
but mainly it's project gutenberg with distributed proofreaders...
for a while, i posted in the forums at distributed proofreaders...
i carefully documented many problems with their workflow, and
i volunteered to write software for them, software they _needed_
-- desperately -- way back then, and ever since, to this very day.
but they didn't like my constructive criticism, so they banned me.
that's right, they _banned_ me from posting... can you imagine?
merely because i was telling them they were doing it _wrong_,
they decided to declare me persona non grata, stick their heads
"in the sand" (if you know what i mean), and simply pretend that
i hadn't offered to help them -- a fact they continue to ignore...
meanwhile, their volunteers have struggled with primitive tools,
again _to_this_very_day_. indeed, here's a recent forum thread:
> http://www.pgdp.net/phpBB2/viewtopic.php?p=788130
as background, you should know that "guiguts" is the name of
the software they currently use to do "post-processing", which
is shortened to "pp", so "a new pp-er" is a person who wants to
learn post-processing, and has thus set out to install "guiguts".
the thread begins like this:
> As a new PP-er who has just struggled, cursing all the while,
> through installing Guiguts under MacOSX, could I endorse
> the observation that:
>
> "It would be of more benefit to find a middle ground.
> We should not be so dependent upon difficult or
> buggy software that it interferes with our production."
>
> And dare I look forward to the development of a Mac version
> of Guiguts that is as simple to install as (say) TextWrangler?
> If prohibitively difficult to encompass all of Guiguts'
> bells and whistles could a subset be useful and feasible?
this trouble installing guiguts (and then using it as well) has
been experienced by tons of people over the past many years.
and thus d.p. volunteers have been crying out for better tools.
the "powers that be" at d.p. -- that's what they call themselves,
it's not (just) a disparaging term that i have applied to them --
keep saying "we don't have programmers to code better tools."
but that's a bald-faced lie, and they know it. they chased me
from their midst, and they did it intentionally (and with malice).
and meanwhile, their volunteers suffer with outdated tools...
so one reason i post here, and not there, is because i'm banned
over there... but i can still rub their noses in it by posting here...
their people moan, the "powers that be" respond by saying that
they have no programmers, and i come here and say "i program,
and i code apps that execute on the full range of offline systems,
and i've written clean-up programs, and you can have them _if_
you ask me for them", and of course nobody ever asks for them.
it goes deeper than that, though.
i have a respect for project gutenberg, and for michael hart.
when i was being attacked by the wolf-pack here on this list
-- for a period of _years_ -- michael defended my presence.
the idiots here tried to get me banned here, but _michael_
wouldn't let them get their way. he knew i had something
to offer to the cause, and he refused to let them bully him
into banning me. so, in my mind, he _deserves_ my posts.
he _earned_ 'em. (it's hard to speak of him in past tense.)
but it goes even deeper than that.
the cyber-archeologists of the future will want to study the
evolution of thinking about e-books, and this listserv will
be one of their major watering holes. so i want to be here,
and i want to have my posts be here. (of course, one of the
main reasons why this list's archives will be so valuable to
these future scholars will be _because_ my posts are here...
so there's a certain degree of circularity going on with this.)
and then, of course, we have the dynamics of the past...
if i put my stuff on an open blog, it would mostly be ignored.
of course, i'm ignored here as well. but at least when i post
to this list, i _know_ the actual _people_ who're ignoring me.
and i know they're ignoring me at odds with their own interest.
which amuses me, making the whole thing that much sweeter.
so i just gradually notch it up, bit by bit, as the years go by...
you know, kinda like turning up the heat on a lobster in a pot.
plus it makes it all the more difficult for the d.p. "powers that be"
to continue to lie to their volunteers about having no options.
they know they're lying, and at some point in the future, the
entire sordid mess will be exposed, and those d.p. volunteers
will realize that their "managers" _intentionally_ lied to them,
and karma will do what it always does -- punish the guilty...
so about every 6 months, i put forth yet another initiative,
aimed at helping d.p. -- if the "powers that be" are simply
smart enough to put aside their stupid stubborn "pride" --
or -- when they are not -- showcasing their bad decision...
if i have time sometime, i might document the full history
of these initiatives, but most recently, i made an attempt
in september of last year, to do a "collaboration" that was
designed to program an online post-processing system,
named "babelfish", except nobody else did any work on it,
so the so-called "collaboration" collapsed because of that.
(which is pretty much exactly what i figured would happen,
but i wanted to demonstrate that _i_ was willing to do it.)
then, in spring of this year, i displayed how an online app
could be coded that would convert a clean text to an .epub,
without using any markup sludge except for my "zen magic".
(d.p. has the proofers and formatters put in a bunch'o'crap
which the post-processors then have to take out; so dumb.)
a year before that -- in the spring of 2010 -- i used perl to
code an online text-cleanup system that could've been used
in a collaborative fashion, just to prove that it could be done.
but, as usual, the "powers that be" over at d.p. just ignored it,
all the while claiming there was no one to code such a thing.
not that they just ignore me, either. years back, when i was
arguing that they could code an offline app for cleaning text
-- because i had already programmed such a thing -- one of
the d.p. volunteers (dkretz) went to work on his own version
of such an app. he did a good job of showing the potential,
but the "powers that be" hate him, too, so they ignored that.
what was really astounding, though, was when rfrank -- yet
another d.p. volunteer -- built a distributed proofing system
that worked online. his system challenged a lot of the myths
of the d.p. system, and it produced some very good results...
but you know the story -- the "powers that be" just ignored it.
the thing is, rfrank had been one of the _darlings_ over at d.p.
he was (and remains) a very popular member of the community,
and was even voted on to the board of directors for the site, but
that evidently wasn't enough to get the attention that he needed
in order to propagate change within the system as it now exists.
so, you know, if rfrank can't even get the system fixed over there,
there's no chance that _i_ can.
but i come here anyway, and post, and show proof-of-concept,
and then come back later and show a working prototype, and
then come back even later with a finished product, all because
i know those future cyber-archeologists will be digging here...
so this is when i pause and say "hi, future cyber-archeologists!"
***
here's another piece of an e-mail from the backchannel:
> I haven't specifically looked around for such, but
> I have a vague idea of starting some sort of site
> where folks can submit ebooks for the general public
> (haven't decided if it should allow selling of them or not)
> and having a way to convert uploaded text
> directly to epub sure would be useful.
lots of similar sites already exist:
> http://www.smashwords.com
> http://www.pressbooks.com
> http://www.pandamian.com
> http://www.leanpub.com
> http://www.gluejar.com
there are more, too, but that's off the top of my head...
> I have my doubts that any system can effectively
> produce an error-free html version,
you are mistaken. look up "light markup". it's easy.
> I have my doubts that any system can effectively
> produce an error-free html version,
> but hand-editing html isn't really all that hard to do,
> so I'm not worried about that one myself.
> On the other hand, epub is something loads of folks ask for,
> and I don't think it'd be much of a site if it didn't allow such.
take a look at what's already out there.
if you decide you can add something, do it... :+)
-bowerbird