James: Thanks for sharing more with us regarding what your UniBook system is like. However, we don't need to get too carried away yet. I'd like to point out (with all due respect) that many people have developed a "perfect markup system" for electronic books, and then become frustrated when it doesn't catch on with many (or any) other people. That's just the way things go. However, I am willing to put plenty of effort into this as a "proof of concept" of what a PG-type effort can be like. I do sincerely hope it will not be just another "let's reformat PG texts to our own specifications" effort. What I would ask for right now, in the beta stage, is a documented markup that we can use, and the ability to consistently produce a PG-type plain text file from it. I know that XML holds the promise of all sorts of wonderful possibilities, but let's leave that for the middle term.
> My system uses pseudo markup
I'm curious what you mean by _pseudo_ markup. I'd love to take a look at it soon.
> As long as the content is imported into UniBook using this syntax
Which brings up the inevitable possibility of human error: getting the syntax wrong, or enclosing the wrong material within the syntactical delimiters.
> should I mention some of the other cool things that can be done
I'll put it on a list of future plans...

Thanks,
Andrew

On Fri, 14 Jan 2005, James Linden wrote:
First off, Andrew is correct -- I do not want my project, UniBook, to be under any sort of PG umbrella -- I wrote it for a far bigger purpose. PG is just one of many projects that can make use of it. I have no problem providing the source code (once I clean it up a bit) under an open source license.
My system uses pseudo markup, and is actually _easier_ to do than PG's vanilla text (in my opinion). I still have to write full documentation on the syntax, something I've held off doing because of the aforementioned political BS.
As long as the content is imported into UniBook using this syntax, it can be automatically parsed with accuracy. Obviously, all imports would be vetted by humans, but that'd be a minimal amount of work.
I should mention that the demo at ibiblio.org/edison is very rough, and doesn't have all the format support that I've actually written and have backed up on CD. That CD also has the search engine, browsing by title/author/date/genre/LOC heading/style, etc.
When I last worked on the code (over a year ago), I had full output support for 6 formats, and beta-level output for another 3. There are 4 more still on my list to write after those 3 beta ones are finished. Once a text is in the system, outputting takes an average of half a second per format (TXT and XML are much faster, but TEI and PDF are a bit slower). So, assuming the code is done for all 13 formats, it would take less than 7 seconds to (re)generate all formats for each text (assuming the text is 1MB in size) in the archive. It averages out (based on current texts in PG) to about 3 seconds per text, because many of them are well under 1MB.
Assuming we have 15,000 items (as MH says), which we actually do NOT have, that'd take about 32 hrs to regenerate the entire library in 13 formats.
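As a rough sanity check on the timing claims above (taking the stated per-format and per-text averages at face value), the arithmetic can be sketched like this; the figures are back-of-envelope only, and slower formats like TEI and PDF would push the totals higher:

```python
# Back-of-envelope check of the regeneration estimates above.
# Assumes the stated average of 0.5 s per output format; TEI and
# PDF are slower, which would raise these totals.

FORMATS = 13
SECONDS_PER_FORMAT = 0.5          # stated average per format
TEXTS = 15_000                    # the "15,000 items" figure

per_text = FORMATS * SECONDS_PER_FORMAT          # seconds per 1 MB text
full_regen_hours = TEXTS * per_text / 3600       # every text at 1 MB
avg_regen_hours = TEXTS * 3.0 / 3600             # at the 3 s/text average

print(f"{per_text} s per text, "
      f"{full_regen_hours:.1f} h at 1 MB each, "
      f"{avg_regen_hours:.1f} h at the 3 s average")
```

At the stated averages this lands between roughly 12 and 27 hours; the slower TEI/PDF formats presumably account for the gap up to the quoted figure.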
Adding new output formats is very easy -- it's just a PHP class with a single required function which accepts one parameter -- the document content. What that function does is irrelevant as long as it returns the final output or filename as a string. This means it can either build the output itself, or call an external program, etc.
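The real plugins are PHP classes, but the contract described above (one required method that takes the document content and returns the output, or a filename, as a string) can be sketched in Python; all class and function names here are hypothetical, not UniBook's actual API:

```python
# Hypothetical sketch of the one-method output-plugin contract
# described above. The real UniBook plugins are PHP classes; the
# names here are illustrative only.

class TxtOutput:
    def render(self, content: str) -> str:
        # Build the output directly and return it as a string.
        return content.strip() + "\n"

class UpperOutput:
    def render(self, content: str) -> str:
        # A plugin may transform the content however it likes -- or
        # shell out to an external program and return a filename.
        return content.upper()

def regenerate(content: str, plugins: dict) -> dict:
    # The archive code only depends on the shared contract.
    return {name: p.render(content) for name, p in plugins.items()}

outputs = regenerate("Hello, Gutenberg",
                     {"txt": TxtOutput(), "upper": UpperOutput()})
print(outputs["upper"])   # HELLO, GUTENBERG
```

Because the archive code only calls the one shared method, a new format is a new class and nothing else has to change.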
Let's say that PG's desired master format is TEI; UniBook can output it as mentioned. If that TEI spec ever changes, we just have to change the output function and regenerate the archive in only that format.
Maintaining the archive becomes child's play as well -- make any edits to the database record(s) that are needed, then regenerate the output formats. This makes it extremely easy to implement a user-submitted error-correction system in which admins just verify the items to be changed, instead of having to go through the files manually.
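The correction workflow described above amounts to "edit the master record once, then rerun the output plugins." A minimal sketch of that loop, with hypothetical names (this is not the real admin interface or database schema):

```python
# Hypothetical sketch of the edit-then-regenerate workflow: fix the
# master database record once, and every derived format is rebuilt
# from it. Names and schema are illustrative, not UniBook's.

database = {101: "It was the best of tmies."}   # record with a typo

def regenerate_formats(text: str) -> dict:
    # Stand-ins for the real output plugins (TXT, XML, TEI, ...).
    return {"txt": text, "upper": text.upper()}

def apply_correction(record_id: int, old: str, new: str) -> dict:
    # An admin verifies the user-submitted correction, the master
    # record is edited, and all formats are regenerated from it.
    database[record_id] = database[record_id].replace(old, new)
    return regenerate_formats(database[record_id])

outputs = apply_correction(101, "tmies", "times")
print(outputs["txt"])
```

The point is that no one ever edits the generated files by hand; every format is always a pure function of the master record.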
Here's where UniBook currently stands:
1) Need some code cleanup (I pretty much have to do that since I wrote it). After that, we can CVS/SVN it for cooperative maintenance.
2) Need administration interface (web based) for importing files, confirming imports, managing extra catalog data (LOC headings, etc). I can handle this as well if needed.
3) Need GUI for building the importable files. I've written several different versions of such an app in VB, but it really needs to be done in Java, so it's portable as an app, and embeddable as an applet for web-based interface. This is where I need help -- I don't know enough Java to write GUIs from scratch. I can provide a fully functioning VB GUI (with code if desired) that would just need to be reproduced in Java. The whole interface is relatively simple - a WYSIWYG with limited functionality.
Once a GUI is written, it'd be child's play to get ALL of PG's current text imported into the system - by volunteers interested in doing it - along with all new text being done with it natively.
Oh yeah, should I mention some of the other cool things that can be done with this system as the base? Like automatically generating CD ISO images for any combination of texts? For example: we can do a CD for each year's new/updated texts, without wasting space on ones that haven't changed. Or, we can generate a CD image for all of Shakespeare, etc. People can build their own list and have an ISO automatically generated for them to download, with the texts in the format(s) of their choice...
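The "CD of each year's new/updated texts" idea reduces to a query over the catalog's modification dates, with the resulting file list handed to an ISO builder (e.g. mkisofs/genisoimage). A hedged sketch of just the selection step, with a hypothetical catalog layout:

```python
# Hypothetical sketch of picking the texts for a "2004 new/updated"
# CD image. The catalog layout is illustrative only; the selected
# IDs would then be passed to an external ISO builder.

catalog = [
    {"id": 1, "title": "Hamlet",       "updated": "2003-06-01"},
    {"id": 2, "title": "Frankenstein", "updated": "2004-02-14"},
    {"id": 3, "title": "Walden",       "updated": "2004-11-30"},
]

def texts_for_year(records: list, year: int) -> list:
    # Only texts new or changed in the given year go on the disc,
    # so no space is wasted on unchanged texts.
    return [r["id"] for r in records
            if int(r["updated"][:4]) == year]

print(texts_for_year(catalog, 2004))   # [2, 3]
```

The Shakespeare disc, or a user's custom list, is the same routine with a different filter.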
...the list goes on and on...
--
James

_______________________________________________
Project Gutenberg of Canada
Website: http://www.projectgutenberg.ca/
List: pgcanada@lists.pglaf.org
Archives: http://lists.pglaf.org/private.cgi/pgcanada/