[PGCanada] Text management/formatting (was: Moving ahead with PGCanada)

Andrew Sly sly at victoria.tc.ca
Sat Jan 15 02:22:19 PST 2005


James:

Thanks for sharing more with us regarding what your UniBook
system is like.

However, we don't need to get too carried away yet.

I'd like to point out (with all due respect) that many people
have developed a "perfect markup system" for electronic
books, and then become frustrated when it doesn't catch on
with many (or any) other people. That's just the way things go.

However, I am willing to put plenty of effort into this as
a "proof-of-concept" of what a PG-type effort can be like.
I do sincerely hope it will not be just another "let's reformat
PG texts to our own specifications" effort.

What I would ask for right now, in the beta stage, is a documented
markup that we can use, and the ability to consistently produce a
PG-type plain text file from it.

I know that XML has the promise of all sorts of wonderful possibilities,
but let's leave that for the middle term.

>>My system uses pseudo markup

I'm curious what you mean by _pseudo_ markup. I'd love to take
a look at it soon.

>> As long as the content is imported into UniBook using this syntax

Which brings up the inevitable possibilities of human error in
getting the syntax wrong, or enclosing incorrect material within
syntactical delimiters.

>>should I mention some of the other cool things that can be done

I'll put it on a list of future plans...


Thanks,
Andrew

On Fri, 14 Jan 2005, James Linden wrote:

>
>    First off, Andrew is correct -- I do not want my project, UniBook, to
> be under any sort of PG umbrella -- I wrote it for a far bigger purpose.
> PG is just one of many projects that can make use of it. I have no
> problem providing the source code (once I clean it up a bit) under an
> open source license.
>
>    My system uses pseudo markup, and is actually _easier_ to do than
> PG's vanilla text (in my opinion). I still have to write full
> documentation on the syntax, something I've held off doing because of
> aforementioned political BS.
>
>    As long as the content is imported into UniBook using this syntax, it
> can be automatically parsed with accuracy. Obviously, all imports would
> be vetted by humans, but that'd be a minimal amount of work.
>
>    I should mention that the demo at ibiblio.org/edison is very rough,
> and doesn't have all the format support that I've actually written and
> have backed up on CD. That CD also has the search engine, browse by
> title/author/date/genre/LOC heading/style, etc.
>
>    When I last worked on the code (over a year ago), I had full output
> support for 6 formats, and beta level output for another 3. There are 4
> more still on my list to write after those 3 beta ones are finished.
> Once a text is in the system, outputting takes an average of 1/2 second
> per format (TXT and XML are much faster, but TEI and PDF are a bit
> slower). So, assuming the code is done for all 13 formats, that'd take
> less than 7 seconds to (re)generate all formats for each text (assuming
> the text is 1MB in size) in the archive. It averages out (based on
> current texts in PG) to be about 3 seconds per text, because many of
> them are well under 1MB.
>
>    Assuming we have 15,000 items (as MH says), which we actually do NOT
> have, that'd take about 32 hrs to regenerate the entire library in 13
> formats.
>
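
The timing estimate above can be checked with a few lines of arithmetic. This is only a sketch in Python: the per-format time, format count, and catalogue size are the figures quoted in the message, while the function and variable names are made up for illustration.

```python
# Figures quoted in the message; names are illustrative assumptions.
SECONDS_PER_FORMAT = 0.5   # average output time per format for a 1 MB text
FORMATS = 13               # 6 finished + 3 beta + 4 still to write
TEXTS = 15_000             # the catalogue size under discussion


def regeneration_hours(texts: int, seconds_per_text: float) -> float:
    """Total wall-clock hours to regenerate every text, one after another."""
    return texts * seconds_per_text / 3600


# Worst case per text: every format takes the average half second.
worst_case_per_text = SECONDS_PER_FORMAT * FORMATS

print(worst_case_per_text)                                       # 6.5
print(round(regeneration_hours(TEXTS, worst_case_per_text), 1))  # 27.1
print(round(regeneration_hours(TEXTS, 7.0), 1))                  # 29.2
print(round(regeneration_hours(TEXTS, 3.0), 1))                  # 12.5
```

Under these assumptions the "about 32 hrs" figure reads as a rounded-up upper bound: 7 seconds per text gives roughly 29 hours, and the 3-second average gives about 12.5 hours.
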
>    Adding new output formats is very easy -- it's just a PHP class with
> a single required function which accepts one parameter -- the document
> content. What that function does is irrelevant as long as it returns the
> final output or filename as a string. This means it can either build the
> output itself, or call an external program, etc.
>
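
The plug-in contract described above (one class, one required function, one parameter, a string result) might look roughly like this. The message says the real thing is a PHP class; this is a Python sketch, and every name in it (OutputFormat, render, PlainTextFormat, FileBackedFormat, regenerate) is hypothetical, not UniBook's actual API.

```python
import tempfile
import textwrap
from abc import ABC, abstractmethod


class OutputFormat(ABC):
    """Hypothetical base class: a single required method taking the content."""

    @abstractmethod
    def render(self, content: str) -> str:
        """Return either the final output or the name of a file holding it."""


class PlainTextFormat(OutputFormat):
    """Builds the output itself: wraps each paragraph to 72 columns."""

    def render(self, content: str) -> str:
        paragraphs = content.split("\n\n")
        return "\n\n".join(textwrap.fill(p, width=72) for p in paragraphs)


class FileBackedFormat(OutputFormat):
    """Writes the result to disk and returns the filename instead."""

    def render(self, content: str) -> str:
        with tempfile.NamedTemporaryFile(
            "w", suffix=".out", delete=False, encoding="utf-8"
        ) as handle:
            handle.write(content)
            return handle.name


def regenerate(content: str, formats: list[OutputFormat]) -> dict[str, str]:
    """Run every registered format over one document."""
    return {type(fmt).__name__: fmt.render(content) for fmt in formats}
```

Because callers only see a returned string, a format that shells out to an external program would fit the same shape as FileBackedFormat: run the tool, then return the path it wrote.
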
>    Let's say that PG's desired master format is TEI; UniBook can output
> it, as mentioned. If that TEI spec ever changes, we just have to change
> the output function, and regenerate the archive in only that format.
>
>    Maintaining the archive becomes child's play as well -- make any
> edits to the database record(s) that are needed, then re-generate the
> output formats. This makes it extremely easy to implement a
> user-submitted error-correction system in which "admins" just verify the
> items to be changed, instead of having to go through the files manually, etc.
>
>    Here's where UniBook currently stands:
>
>     1) Need some code cleanup (I pretty much have to do that myself,
> since I wrote it). After that, we can put it in CVS/SVN for cooperative
> maintenance.
>
>     2) Need administration interface (web based) for importing files,
> confirming imports, managing extra catalog data (LOC headings, etc). I
> can handle this as well if needed.
>
>     3) Need GUI for building the importable files. I've written several
> different versions of such an app in VB, but it really needs to be done
> in Java, so it's portable as an app, and embeddable as an applet for
> web-based interface. This is where I need help -- I don't know enough
> Java to write GUIs from scratch. I can provide a fully functioning VB
> GUI (with code if desired) that would just need to be reproduced in
> Java. The whole interface is relatively simple - a WYSIWYG with limited
> functionality.
>
>    Once a GUI is written, it'd be child's play to get ALL of PG's
> current text imported into the system - by volunteers interested in
> doing it - along with all new text being done with it natively.
>
>    Oh yeah, should I mention some of the other cool things that can be
> done with this system as the base? Like automatically generating CD ISO
> images for any combination of texts? For example: we can do a CD for
> each year's new/updated texts, without wasting space on ones that
> haven't changed. Or, we can generate a CD image for all of Shakespeare,
> etc. People can build their own list and have an ISO automatically
> generated for them to download, with the texts in the format(s) of their
> choice...
>
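
The "one CD per year of new/updated texts" idea comes down to a filter over catalogue records plus a size check. The sketch below is Python and entirely hypothetical: the TextRecord fields, the sample entries, and the 650 MB capacity are assumptions for illustration, and it stops short of actually building an ISO image.

```python
from dataclasses import dataclass
from datetime import date

# Assumed capacity of a standard CD-R; not a figure from the message.
CD_CAPACITY_BYTES = 650 * 1024 * 1024


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical catalogue entry; real UniBook records may differ."""
    title: str
    updated: date
    size_bytes: int


def changed_in_year(records: list[TextRecord], year: int) -> list[TextRecord]:
    """Keep only texts added or updated during the given year."""
    return [r for r in records if r.updated.year == year]


def fits_on_one_cd(records: list[TextRecord],
                   capacity: int = CD_CAPACITY_BYTES) -> bool:
    """True if the combined size of the selection fits on a single disc."""
    return sum(r.size_bytes for r in records) <= capacity


# Placeholder data, invented purely to exercise the functions.
catalogue = [
    TextRecord("Sample text A", date(2004, 3, 1), 400_000),
    TextRecord("Sample text B", date(2005, 1, 10), 1_000_000),
    TextRecord("Sample text C", date(2005, 1, 12), 250_000),
]

selection = changed_in_year(catalogue, 2005)
print([r.title for r in selection])
print(fits_on_one_cd(selection))
```

A user-built list would work the same way, with the year filter replaced by whatever selection the person made.
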
>    ...the list goes on and on...
>
> -- James
> _______________________________________________
> Project Gutenberg of Canada
> Website: http://www.projectgutenberg.ca/
> List: pgcanada at lists.pglaf.org
> Archives: http://lists.pglaf.org/private.cgi/pgcanada/
>


