[gutvol-d] My spanking <laugh/>, and my reply

12 Nov 2004

      Greg wrote:
...
Jon wrote:
Wow, my backside is really sore from the spanking Greg just
administered to me. Some of the spanking was deserved, but some of it
was not, imho. More on that later, but first I'd like to give
some thoughts on the problems with the gutvol-d list and archive
before answering several of Greg's comments. (Walking slowly...)
...
(I redirected to gutvol-d@lists.pglaf.org.  Who sent this
to Lyris @ listserv.unc.edu?  That server is broken, the list
there is defunct.  I have been trying to delete the list there
for months, but the software is perpetually non-responsive)
I'm not sure. I think I directed all my replies to the right place.

Btw, I tried to search the gutvol-d archives with regards to the
FAQ0 and FAQ1 issue (that is, how much was it really discussed on the
lists as Greg said it was?), and noticed that indeed the archive
appears broken -- everything before August is gone. James Linden told
me that the older archives may be lost for good, at least the Lyris
version.

Did anyone here keep their own copy of the gutvol-d (and I suppose
other gut*) archives? I've kept full backup archives of the several
dozen mailing lists I've run since 1992 (by simply collecting all the
emails sent out in plain text unix mbox format), but not lists I don't
run, thinking that those who administer them do as I do and create
redundant backups in a universal plain text format (as Michael would
approve!)
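
A minimal sketch of that kind of backup, assuming Python's standard
`mailbox` module. The helper names `append_to_mbox` and `make_message`
are hypothetical, and a real archiver would also lock the file while
writing:

```python
import mailbox
from email.message import EmailMessage


def append_to_mbox(mbox_path, messages):
    """Append messages to a unix mbox file and return the new message count."""
    box = mailbox.mbox(mbox_path)  # the file is created if it does not exist
    try:
        for msg in messages:
            box.add(msg)
        box.flush()  # write the pending messages to disk
        return len(box)
    finally:
        box.close()


def make_message(sender, subject, body):
    """Build a plain-text message (illustrative stand-in for a list post)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = "gutvol-d@lists.pglaf.org"
    msg["Subject"] = subject
    msg.set_content(body)
    return msg
```

Because mbox is one plain-text file per list, the backup can be
copied, grepped, or read by most mail clients without special tools.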

Since I've lately been sticking my nose into various affairs here some
think I should not, I may as well do it one more time and give another
opinion that the various Gutenberg lists be moved to YahooGroups (with
2-3 people designated as backup archivists in unix mbox format -- I'll
gladly volunteer to be one of the backup archivists since I already do
that for over twenty lists I run and co-administer.)

Why YahooGroups and not some listserv software running on PG's own
server?

1) I've had experience running various listserv software since 1992,
   and I find a lot of time is saved when someone else does it for me
   as YahooGroups does.

2) YahooGroups is actually very good and reliable, and since so many
   people now subscribe to one or more YG lists, it's easy to
   subscribe to one more. My decision to move The eBook Community, now
   with over 2400 subscribers, to YahooGroups in 1999 has proven to be
   the right decision.

   With a custom listserv run by PG, it's just another list I have to
   separately subscribe to, and if I have to change my email address,
   it's another separate service I have to contend with. YahooGroups
   consolidates all my subscribed lists into an easy, manageable form
   that no other listserv software comes close to in power and
   convenience.

3) YahooGroups includes other useful services, such as a Files Area
   and facilitated YahooIM Chat.

4) It's free! It doesn't take up any diskspace or bandwidth on the
   local server. (There are the insufferable ads, though, but these
   are easily ignored.)

5) It is possible to extract plain text (with full headers) for every
   posted message to any YahooGroup.

6) It archives messages for quite a while back. The eBook Community
   presently has 21289 messages available in the online archive dating
   from 1999 -- I don't know when YahooGroups will begin lopping off
   the oldest ones to save space, but it hasn't yet. I also keep a
   separate mirror of the archive in unix mbox format.

7) It has web access for those who prefer that over receiving email.

8) Administration by the moderators is a breeze.

**********************************************************************

O.k., to address some of the issues Greg brings up. He is certainly
angry with several of my comments today. As noted above, some of it
I deserved, either in what I said or how I said something...
...
My view is that Jon will not be content until all the people
working on PG are ousted, in favor of his preferred organization,
governance, fundraising, production rules, and collection
guidelines.  This is not going to happen anytime soon, and 
other than being critical of the status quo, Jon has contributed
nothing towards making it happen anyway.  Instead, Jon has
repeatedly been offered the ability -- with support and encouragement --
to create the organization or content he so strongly desires.
There are several related points I'd like to address here, since Greg
brings up a couple I didn't really want to talk about (who otherwise
cares about my motivation for being here and for what I've brought up
recently?):

*****

My motivation is certainly not to "take over" PG and build a
dictatorship, and to kick out the old guard. Those who know me know
that I'm the opposite and in fact fear the same things Michael does
with respect to proprietary interests trying to defang the growth of a
robust and fully available digital public domain. The OpenReader
Project, which I co-founded, clearly shows my focus on open standards,
open source, and creating an ebook future founded on these principles.
In personality type, I am definitely a Fighting Idealist, for better
and for worse. I am definitely not very politically savvy and not
very diplomatic with my words, again for better and oftentimes for
worse.

For example, I commented earlier today, in response to a message
Juliet posted, that maybe DP should consider a policy that if they
don't get unencumbered page scans to put freely online (because some
group is anal about their beloved source document of a public domain
work), then they should not accept that situation and work around it.
Who's the idealist here? (referring to PG's FAQ0 or FAQ1.)

(But DP has their way of doing things and policies, which is fine. I
greatly admire DP for what they have accomplished and are now doing,
and I fully support their vision for going to the next level with an
XML-based system. Juliet is doing an extraordinary job and has not
been thanked enough for what she and her volunteers have accomplished,
which borders on the remarkable. I am working with Juliet and Charles
(who's currently on "sabbatical") to help them, as I can, with the
organizational challenges in their wish to move to the next level, both in
XML implementation, and in increasing their capacity to meet the
challenges for the intriguing "Million Digital Texts Project.")

I make no bones I have strong feelings based on the bigger picture as
I see it -- and I honestly believe my vision is even bigger than
Michael's. I don't believe the ad-hoc, everyone does it their own way
approach for producing etexts is sufficient any more to accomplish
this Big Vision, and in fact will work against the Big Vision. Greg no
doubt disagrees with me, as FAQ0/1/3 outlines, but so be it -- history
will be the ultimate arbiter of our differing world views.

I see how inadequate the current PG collection is for the future. This
evaluation is based upon three different ventures I've been involved
with since 1999 (including one now in development) where this Big
Vision has been, and is now being, researched by some really sharp
technical people who are nailing down the many architectural and
technical requirements. There are many more subtle requirements than
one would at first imagine -- I'm only now beginning to understand
them in a holistic sense -- and they reflect themselves all the way
back to the fundamental structure of the texts themselves, and the
associated metadata/catalog information.

I see millions of high-quality, uniform digital texts, both public
domain and Creative Commons, in a single repository which allows
people to access them, annotate them, and link them together and with
other texts and with other types of multimedia content in other
repositories in very powerful ways that would take too long to
describe here. That's one reason I state the master texts must be in
well-structured XML, since that will enable the advanced features this
repository will have. Properly done XML also confers many other
benefits too numerous to mention here. Both DP and PG have blessed the
right XML approach (e.g., as exemplified by Marcello's PGTEI), which is
very encouraging. But there's more.

For reasons I won't go into here (again for brevity's sake), this Big
Vision also sets slightly more stringent requirements on both metadata
and cataloging than is currently done in PG, and it's the spinning
wheels of the current discussion on metadata and cataloging that led
to my posts this afternoon out of sheer frustration. I see no
*requirements* mentioned, and no vision as to *what* the metadata/
catalog information is to be used for. How can one fix the metadata
requirements without a discussion of what the metadata will be used
for, and useful for?

It is frustrating to see all this ad-hoc activity happening with no
guidance as to the who, what, when, where, why and to what extent --
the purpose of the metadata -- being resolved based on general
requirements, which in turn are derived from the full and detailed
vision (which is NOT given in the FAQs) of why PG exists and what it
produces. Certainly I could try to force my way further into the
discussion (more than I have now) and try to provide answers to these
questions, but then I'll just become another voice to add to the ad-hoc
cacophony we now have where the one who produces something first wins,
even if it ends up not meeting the full long-term goals.

This is the result of the FAQ0 and FAQ1 philosophy, which does not
always give the results one hopes for. To get resolution on tough
issues it is oftentimes necessary for the leadership to take charge
and to firmly guide discussion to logically resolve what must be
done. In some ways, it may be that the "leadership" simply doesn't
have the time (because it is voluntary) to formalize the process to
force a structured approach to fast decision-making and buy-in to
the result. Understandable, but sad.

What I fear the most, and this I've expressed to Brewster Kahle (who I
meet again next week about Project Gramophone) and to JD Lasica (who's
launching the ourmedia project and I'm assisting with the metadata/
cataloging side) is that many people will develop these wonderful
repositories of digital content (I'm also working on Project
Gramophone/Sound Preserve to transfer and archive millions of old
sound recordings), with billions of digital objects, which simply
won't and can't "talk" with each other, because everyone is "doing
their own thing" PG-style. Wheeee, the late 60's all over again.
<smile/>

Let me give a small example to illustrate just a corner of what the
world could be like if everything is done properly:

   Imagine someone creating a video for ourmedia where someone is
   playing the piano, say "Take the 'A' Train", composed by Billy
   Strayhorn and which became Duke Ellington's theme song. We would
   want to be able to allow the viewer to link, if they so choose,
   with the song lyric repository, with various wikipedia entries,
   and to Sound Preserve to bring up orchestral recordings of "Take
   The 'A' Train" by Duke Ellington and others. We'd also like to link
   to the Project Gutenberg collection for any works, such as Duke
   Ellington's book "Music is My Mistress" (assuming PG got permission
   to add it, likely not.) And of course we'd allow the end-user to
   join special communities built around any particular topic
   connected with that song -- such as Ellington communities, jazz
   communities, Strayhorn communities, etc.

Doing all of this (and a lot more) confers a few added requirements,
especially with regards to metadata information (text has the
redeeming grace that it is fairly easy to dig out some information by
full text searching -- but not standardized subject matter fields! --
but it is much harder with video and audio so the metadata and
cataloging requirements for video and audio will likely be more
stringent and extensive.)

PG's self-enforced isolation, because of its seeming fear of working
with the Big Boys (which is somewhat understandable) is working
against PG in various ways in seeing the bigger picture of how the
text production activities it is catalyzing will mesh with this much
bigger, more wonderful world. But if the various repositories don't do
it right from the start, including Project Gutenberg, and they end up
with millions and billions of digital objects *not done right*, then
the interlinkage will be much more difficult and nowhere near as
powerful and useful as it could be. It will be essentially impossible
to fix after the fact. JD Lasica now recognizes this and is supporting
somewhat expanded metadata standards to assure inter-repository
linkage, but I don't see the PG "leadership" seeing this, nor am I
confident it can because of the FAQ0/1/3 constraints.

Note how PG is having difficulty fixing the metadata and catalog info
for a *measly* 10,000 or so texts. Imagine having a million of them
*not done right* (especially with regards to metadata and catalog
information requiring human input -- for some digital objects, if the
data is not collected right at the start, it will be impossible to
figure it out much later, even with human intervention. So much for
the power of our digital future.)

(Part of the Big Vision calls for aiding integration using James
Linden's very interesting "Open Genesis" concept, currently under
development. James is probably not yet ready to discuss this, but it
is best described as the "Semantic Web Done Right From the Start."
The requirements Open Genesis confers upon digital content
repositories are surprisingly quite minimal -- but it is needed to
have a standardized framework to improve inter-repository and
inter-object linking. Marcello's effort to bring RDF into the mix is
laudable and will certainly aid more robust intra- and inter-
repository linking.)

I'd love to see PG take the lead to make this happen for the text side
of the house, and that's my motivation in pressing a lot of issues
here to the point where I may become persona non grata, but it won't
happen until PG realizes that it needs to confer more requirements on
the texts and metadata it catalyzes and collects from the many
volunteers (outside of DP, which is doing things mostly right by my
reckoning), as well as to more actively work with other repositories
-- to become a part of the bigger world rather than isolating itself
as it seems to. It needs one or two full-time people -- this costs
some $$$ -- and this requires a somewhat higher level of organization
and maybe a slightly different governance to even be given this $$$ (or
to develop some ongoing revenue stream.) And if it wants to play a
major role in the "Million Digitized Texts Project" (should it get
successfully launched), it *has* to change its governance and how it
interacts with the world at large.

Frankly, the FAQ0 and FAQ1 documents are actually quite hostile by
implying the world at large is somehow evil and out to get PG. Yes,
some parts of the world at large are hostile to PG and wish it gone,
but not all of them. The wisdom is to associate with your friends and
those who share the same vision, not drive them away by painting
everyone with the same "evil" brush. If you don't believe FAQ0 and
FAQ1 send this message to those in various outside groups, I suggest
the wording of FAQ0/1 be looked at again for what it doesn't say but
should say. For example, there's little in there about building
close strategic partnerships with other like-minded organizations,
or about working together on common standards and common
goals. Nothing there is mentioned about joining standards and other
types of organizations so as to promote PG's interests. PG has become
disturbingly quite xenophobic in orientation -- it acts as if the rest
of the world does not exist or does exist and is evil, and that magic
will always automatically happen if you simply let everyone do their
own thing. Magic does happen often, but magic can also run out.

To answer Greg's "I don't take Yes For an Answer" (which is,
interestingly enough, what the New York Times' William Safire used
today to describe Arafat's 1999 refusal of unbelievable concessions by
the Israelis), let me say that I am working hard on the vision. I'm
coordinating with ourmedia, with Project Gramophone (now called Sound
Preserve), and working with another venture dedicated to tying this
all together and to launch the "Million Digitized Texts Project." Will
we succeed in at least launching MDTP? Maybe. Maybe not. But I am
taking Greg's "Yes for an answer" to heart and I am working on it as
I envision it -- it's just that it is not restricted to the closed
world of PG so that's why it seems somewhat out of lockstep with what
is going on here. But if we do succeed in launching MDTP and the Bigger
Vision it will be a part of, and if PG wants to play a *major* role
with MDTP -- and I'd certainly welcome PG and its "leader volunteers"
to jump onboard for many obvious reasons -- PG will have to change in
certain ways simply to work as a major player with the MDTP project.
If PG decides it would rather not change its governance and focus by
increasing the acceptable text and metadata standards (which really
are not that much), then that's totally understandable -- PG could
still play a role, but it would essentially be peripheral and the
parade may end up marching by it.

*****

On another point, if I expressed wording reflecting hostility to those
who have contributed texts to the PG collection over the years, this
was not my intent, and I apologize for this. I've typed in whole books
by hand, and then laboriously proofed them, marked them up, and
converted them into ebooks, so I am familiar firsthand with this labor
of love. Digitizing some of the books being talked about here -- the
very difficult 17th/18th century texts -- is a remarkable achievement
(and so is marking them up.) The commitment many people here have to
digitizing texts amazes me.

My comments were directed at the leadership for not following what I
believe are slightly more stringent policies with regards to metadata
and text formatting requirements (some of which are understandable
given where things were in the early 1990's). I'm a firm believer in
the principle of "the buck stops here". That is, if there are
problems, it is the responsibility of the PG leadership due to their
prior decisions and established system. It may be unfair at times
since it is impossible to accurately predict the future and to develop
the right approach to meet that future (e.g., Michael Hart's early
allergy to including source information in texts appeared to be a
protection mechanism against copyright infringement claims.) But
nevertheless it is up to the leadership to take responsibility, adjust
accordingly, and to pro-actively "fix it". Maybe some of the problems
are best solved by the ad-hoc, hands-off approach as given in FAQ0/1/3,
but I don't believe all problems with the PG Collection will be solved
by this approach, especially when looking at the useful linkage of the
PG collection with other content repositories as outlined above, which
requires an integrated approach, and working cooperatively with other
groups.

*****

On a point related to what I wrote earlier, I'm troubled by this view
that PG's collection should be focused toward a particular use niche,
rather than to be designed to be useful for just about every use. As
I've analyzed things, the added requirements to make PG digital texts
useful not only for general reading, but for scholarship and research
(plus linking to other repositories) are so few that to ignore them is
downright puzzling. What is needed? Well, require the source info be
included in the metadata -- that's the major one. The next one is to
work hard to acquire and preserve page scans. There are likely a few
other requirements which are even less burdensome.
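
To make those two requirements concrete, here is a rough sketch of a
catalog record that carries source information and a page-scan flag,
plus a check for what is still missing. Every field and function name
(`SourceInfo`, `TextRecord`, `missing_requirements`) is an assumption
for illustration, not an existing PG or DP schema:

```python
from dataclasses import dataclass, field


@dataclass
class SourceInfo:
    """Where the paper copy came from (illustrative fields only)."""
    publisher: str = ""
    place: str = ""
    year: str = ""
    edition: str = ""


@dataclass
class TextRecord:
    """One digital text plus the extra information discussed above."""
    title: str
    author: str
    source: SourceInfo = field(default_factory=SourceInfo)
    has_page_scans: bool = False


def missing_requirements(record):
    """Return the names of the requirements this record does not yet meet."""
    missing = []
    # Treat publisher and year as the minimum source information.
    for name in ("publisher", "year"):
        if not getattr(record.source, name).strip():
            missing.append("source." + name)
    if not record.has_page_scans:
        missing.append("page_scans")
    return missing
```

A check like this could run when a text is submitted, which is the
point at which the information is cheapest to collect.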

The vast majority of the effort to produce digital texts from paper
copy goes into scanning (or typing in) the book and then proofreading
it. The rest of the added stuff to make the texts more useful is
minuscule by comparison, both time- and effort-wise.

This reminds me of a Minnesota-Norwegian joke about the Norwegian
who tried to swim across a lake -- when he got 95% of the way to the
other side, he decided he couldn't make it, and swam back. It's
ludicrous not to make that extra 5% effort, and elevate the PG
collection to a significantly higher plane of usefulness, quality, and
better digital integrity (talked about next). This is especially
tragic given the hundreds of thousands of hours already devoted to the
PG collection, when that extra 5% (if that) would have made a
significant improvement.

*****

And about digital integrity, I stick to my position that anything
which PG requires to increase the digital integrity of the text
itself to the original source is a Good Thing (tm). Certainly
deviations from the source must be allowed, such as correcting some
obvious typesetting errors (as an aside, has PG established a uniform
policy for what types of edits/corrections in the digital text are
allowed? Or is this again one of those FAQ0/1 "let's not interfere
with anyone" type of things?)

But what I mean by digital integrity has to do with the faithfulness,
or more importantly, the perception of faithfulness, of the *meaning*
of the text to the original source. It's a legitimate question to ask
whether those involved in producing digital texts took more liberties
with the text than they should have. This is not a trivial issue when
we look at history where censorship is the norm. Certainly, as Greg
pointed out, the source texts themselves may have been grossly edited
contrary to the author's original intent (if it were not the first
edition, for example), but we must not add to this problem in any way
(instead, let's also do the first edition!) In addition, I believe one
intent of PG is to assist with the effort to assure the digital texts
will survive into the distant future, to hopefully survive wars,
revolutions, totalitarianism, digital "book burnings", etc. As the
centuries roll by, the issue of digital integrity becomes more and
more important for the integrity of the information being passed on to
future generations.

That is why I believe it is necessary for PG to establish policies for
new texts, and to begin working on upgrading some of the existing
texts at the appropriate time, to standardize the digital integrity
requirements as much as possible, and more importantly to acquire and
preserve the original page scans whenever possible.

Having the original page scans available side-by-side with the digital
texts also benefits everyone (and the Big Vision) by resolving any
difficulties in presentation of the digital texts (we all know how
weird some texts are), and for fighting against claims of copyright
infringement. Contrary to Michael Hart's early policies of hiding the
pedigree of digital texts, having the page scans available, so long as
our copyright clearance procedure is sufficient, actually strengthens
PG against claims of copyright infringement.

*****

As a final note, I do agree with several who responded today about
my call for redoing the older PG texts, saying we should wait until
DP moves to the next-generation XML-based system before redoing these
texts. I definitely agree as I think about it. What I think could be
done, however, is to prepare for this eventuality by 1) flagging those
texts we'd like to redo someday, 2) searching for higher-quality source
books which will give us *unencumbered* page scans, and then 3) filing
those page scans away in the archive for later conversion to digital
text at the appropriate time. There's nothing wrong with decoupling
the scanning stage from the proofreading stage.
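
One way to picture that decoupling is as a status kept per text, so
that scanning and proofreading can proceed on separate schedules. This
is only a sketch; the stage names and the `ready_for_proofing` helper
are invented for illustration:

```python
from enum import Enum


class Stage(Enum):
    FLAGGED = 1      # marked as a text we'd like to redo someday
    SCANS_FILED = 2  # unencumbered page scans found and archived
    PROOFED = 3      # converted to digital text and proofread


def ready_for_proofing(texts):
    """Return, in sorted order, titles whose scans are filed but not yet proofed."""
    return sorted(
        title for title, stage in texts.items() if stage is Stage.SCANS_FILED
    )
```

Texts can sit in the scans-filed stage for as long as needed until the
proofreading system is ready for them.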

No doubt my answers will not satisfy everyone, and may not satisfy
anyone. But after my spanking, I needed to reply, and in one case
apologize.

Jon Noring