[PGCanada] Text management/formatting (was: Moving ahead with PGCanada)

James Linden jlinden at pglaf.org
Fri Jan 14 11:31:28 PST 2005


Andrew Sly wrote:
>>>I would like PG of Canada to be more restrictive in its "guidelines",
>>>even if this does mean that fewer titles will be contributed.
>>
>>Can you offer some guidance as to more restrictive, how?
> 
> To start with, assuming we are going to use James' system for our
> files, only taking texts which are in whatever markup format we use,
> instead of saying "plain text first, and then additional formats
> whenever people are interested in preparing them."
> 
> I can tell you from the point of view of those of us who have
> been making corrections, improving cataloging records, etc. for
> older PG files, having various files in different formats can
> be something of a headache to deal with.
> 
> And assuming we can get this whole xml system to work out,
> I would cautiously say that it may help avoid some of
> the inconsistency I mentioned in my last message
> (such as marking italics in different ways)

   First off, Andrew is correct -- I do not want my project, UniBook, to 
be under any sort of PG umbrella -- I wrote it for a far bigger purpose. 
PG is just one of many projects that can make use of it. I have no 
problem providing the source code (once I clean it up a bit) under an 
open source license.

   My system uses pseudo markup, and is actually _easier_ to do than 
PG's vanilla text (in my opinion). I still have to write full 
documentation on the syntax, something I've held off doing because of 
the aforementioned political BS.

   As long as the content is imported into UniBook using this syntax, it 
can be automatically parsed with accuracy. Obviously, all imports would 
be vetted by humans, but that'd be a minimal amount of work.

   I should mention that the demo at ibiblio.org/edison is very rough, 
and doesn't have all the format support that I've actually written and 
have backed up on CD. That CD also has the search engine, browse by 
title/author/date/genre/LOC heading/style, etc.

   When I last worked on the code (over a year ago), I had full output 
support for 6 formats, and beta level output for another 3. There are 4 
more still on my list to write after those 3 beta ones are finished. 
Once a text is in the system, outputting takes an average of 1/2 second 
per format (TXT and XML are much faster, but TEI and PDF are a bit 
slower). So, assuming the code is done for all 13 formats, that'd take 
less than 7 seconds to (re)generate all formats for each text (assuming 
the text is 1MB in size) in the archive. It averages out (based on 
current texts in PG) to be about 3 seconds per text, because many of 
them are well under 1MB.

   Assuming we have 15,000 items (as MH says), which we actually do NOT 
have, that'd take at most about 29 hrs (at 7 seconds per text), and 
closer to 12.5 hrs at the 3-second average, to regenerate the entire 
library in 13 formats.
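
   A quick sketch of that arithmetic, in Python, using only the figures 
quoted in this message (13 formats, about half a second per format, 
under 7 seconds per 1MB text, about 3 seconds per average text, 15,000 
items). The variable names are made up for illustration.

```python
# Figures taken from the message; everything else is plain arithmetic.
NUM_FORMATS = 13
SECONDS_PER_FORMAT = 0.5       # stated average per output format
UPPER_BOUND_PER_TEXT = 7.0     # "less than 7 seconds" for a 1MB text
AVG_SECONDS_PER_TEXT = 3.0     # stated average over current PG texts
ITEMS = 15_000                 # hypothetical library size


def hours(total_seconds: float) -> float:
    """Convert a duration in seconds to hours."""
    return total_seconds / 3600.0


per_1mb_text = NUM_FORMATS * SECONDS_PER_FORMAT
upper_bound_hours = hours(ITEMS * UPPER_BOUND_PER_TEXT)
average_hours = hours(ITEMS * AVG_SECONDS_PER_TEXT)

print(f"per 1MB text, all formats: {per_1mb_text:.1f} s")
print(f"whole library at 7 s per text: {upper_bound_hours:.1f} h")
print(f"whole library at 3 s per text: {average_hours:.1f} h")
```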

   Adding new output formats is very easy -- it's just a PHP class with 
a single required function which accepts one parameter -- the document 
content. What that function does is irrelevant as long as it returns the 
final output or filename as a string. This means it can either build the 
output itself, or call an external program, etc.
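
   To make the plug-in idea concrete, here is a minimal sketch. It is 
written in Python rather than PHP, and every name in it (OutputFormat, 
render, REGISTRY, generate) is hypothetical, not UniBook's actual code. 
It only illustrates the shape described above: one class per format, 
one required function, one parameter, a string returned.

```python
from abc import ABC, abstractmethod
from html import escape


class OutputFormat(ABC):
    """Hypothetical base class: one required method, one parameter."""

    @abstractmethod
    def render(self, content: str) -> str:
        """Return the final output (or a filename) as a string."""


class PlainText(OutputFormat):
    def render(self, content: str) -> str:
        return content


class MinimalHtml(OutputFormat):
    def render(self, content: str) -> str:
        # This format builds its output itself; another one could just as
        # well call an external program and return the resulting filename.
        paragraphs = [p.strip() for p in content.split("\n\n") if p.strip()]
        body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
        return f"<html><body>\n{body}\n</body></html>"


# Adding a format means adding one entry here (illustrative registry).
REGISTRY = {"txt": PlainText(), "html": MinimalHtml()}


def generate(content: str, only=None) -> dict:
    """Render every registered format, or just the one named in `only`."""
    names = [only] if only else list(REGISTRY)
    return {name: REGISTRY[name].render(content) for name in names}
```

   Calling generate(text, only="html") rebuilds a single format, which 
is the case where one output spec changes and the others are left alone.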

   Let's say that PG's desired master format is TEI. UniBook can output 
it, as mentioned. If that TEI spec ever changes, we just have to change 
the output function, and regenerate the archive in only that format.

   Maintaining the archive becomes child's play as well -- make any 
edits to the database record(s) that are needed, then re-generate the 
output formats. This makes it extremely easy to implement a 
user-submitted error correction system in which "admins" just verify the 
items to be changed, instead of having to go through the files manually, etc.
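
   A rough sketch of that correction workflow, in Python. The classes 
and fields (Correction, Archive, pending, approve) are assumptions made 
up for illustration, and the two "formats" are trivial stand-ins. The 
point is only the order of events: a user submits a fix, an admin 
approves it, the master record changes, and the outputs are rebuilt.

```python
from dataclasses import dataclass, field


@dataclass
class Correction:
    text_id: int
    old: str      # the erroneous passage as reported
    new: str      # the proposed replacement


@dataclass
class Archive:
    records: dict = field(default_factory=dict)   # text_id -> master content
    outputs: dict = field(default_factory=dict)   # text_id -> {format: output}
    pending: list = field(default_factory=list)   # corrections awaiting review

    def regenerate(self, text_id: int) -> None:
        """Rebuild every output format for one text from its record."""
        content = self.records[text_id]
        # Stand-in renderers; real output formats would be plugged in here.
        self.outputs[text_id] = {"txt": content, "html": f"<p>{content}</p>"}

    def submit(self, correction: Correction) -> None:
        """Queue a user-submitted correction for an admin to verify."""
        self.pending.append(correction)

    def approve(self, correction: Correction) -> bool:
        """Apply a verified correction, then rebuild that text's outputs."""
        self.pending.remove(correction)
        content = self.records[correction.text_id]
        if correction.old not in content:
            return False   # stale report: the passage is no longer there
        self.records[correction.text_id] = content.replace(
            correction.old, correction.new, 1)
        self.regenerate(correction.text_id)
        return True
```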

   Here's where UniBook currently stands:

    1) Need some code cleanup (I pretty much have to do that since I 
wrote it). After that, we can CVS/SVN it for cooperative maintenance.

    2) Need administration interface (web based) for importing files, 
confirming imports, managing extra catalog data (LOC headings, etc). I 
can handle this as well if needed.

    3) Need GUI for building the importable files. I've written several 
different versions of such an app in VB, but it really needs to be done 
in Java, so it's portable as an app, and embeddable as an applet for 
a web-based interface. This is where I need help -- I don't know enough 
Java to write GUIs from scratch. I can provide a fully functioning VB 
GUI (with code if desired) that would just need to be reproduced in 
Java. The whole interface is relatively simple - a WYSIWYG with limited 
functionality.

   Once a GUI is written, it'd be child's play to get ALL of PG's 
current texts imported into the system - by volunteers interested in 
doing it - along with all new texts being done with it natively.

   Oh yeah, should I mention some of the other cool things that can be 
done with this system as the base? Like automatically generating CD ISO 
images for any combination of texts? For example: we can do a CD for 
each year's new/updated texts, without wasting space on ones that 
haven't changed. Or, we can generate a CD image for all of Shakespeare, 
etc. People can build their own list and have an ISO automatically 
generated for them to download, with the texts in the format(s) of their 
choice...
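
   As an illustration of the selection step only, here is a small 
Python sketch that picks texts by update year or author and lists the 
files a disc would hold. The Text fields, the directory layout and the 
sample catalog are all made up for the example. Building the actual ISO 
image would be handed to an external tool and is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    text_id: int
    author: str
    title: str
    updated_year: int


def select(catalog, *, year=None, author=None):
    """Pick the texts for one disc by update year and/or author."""
    return [t for t in catalog
            if (year is None or t.updated_year == year)
            and (author is None or t.author == author)]


def manifest(texts, formats):
    """List the files to put on the disc: one per text per format."""
    return [f"{t.text_id}/{t.text_id}.{fmt}"
            for t in texts for fmt in formats]


# Made-up sample catalog, for illustration only.
catalog = [
    Text(1, "Shakespeare, William", "Hamlet", 2004),
    Text(2, "Shakespeare, William", "Macbeth", 2005),
    Text(3, "Montgomery, L. M.", "Anne of Green Gables", 2005),
]

yearly_disc = manifest(select(catalog, year=2005), ["txt"])
shakespeare_disc = manifest(
    select(catalog, author="Shakespeare, William"), ["txt", "tei"])
```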

   ...the list goes on and on...

-- James


