Andrew Sly wrote:
I would like PG of Canada to be more restrictive in its "guidelines", even if this does mean that fewer titles will be contributed.
Can you offer some guidance as to more restrictive, how?
To start with, assuming we are going to use James' system for our files: only accepting texts that are in whatever markup format we use, instead of saying "plain text first, and then additional formats whenever people are interested in preparing them."
I can tell you, from the point of view of those of us who have been making corrections, improving cataloging records, etc. for older PG files, that having various files in different formats can be something of a headache to deal with.
And assuming we can get this whole XML system to work out, I would cautiously say that it may help avoid some of the inconsistency I mentioned in my last message (such as marking italics in different ways).
First off, Andrew is correct -- I do not want my project, UniBook, to be under any sort of PG umbrella -- I wrote it for a far bigger purpose. PG is just one of many projects that can make use of it. I have no problem providing the source code (once I clean it up a bit) under an open source license.

My system uses pseudo markup, and is actually _easier_ to do than PG's vanilla text (in my opinion). I still have to write full documentation on the syntax, something I've held off doing because of the aforementioned political BS. As long as the content is imported into UniBook using this syntax, it can be automatically parsed with accuracy. Obviously, all imports would be vetted by humans, but that'd be a minimal amount of work.

I should mention that the demo at ibiblio.org/edison is very rough, and doesn't have all the format support that I've actually written and have backed up on CD. That CD also has the search engine, browse by title/author/date/genre/LOC heading/style, etc. When I last worked on the code (over a year ago), I had full output support for 6 formats, and beta-level output for another 3. There are 4 more still on my list to write after those 3 beta ones are finished.

Once a text is in the system, outputting takes an average of 1/2 second per format (TXT and XML are much faster, but TEI and PDF are a bit slower). So, assuming the code is done for all 13 formats, it'd take less than 7 seconds to (re)generate all formats for each text in the archive (assuming the text is 1MB in size). It averages out (based on current texts in PG) to about 3 seconds per text, because many of them are well under 1MB. Assuming we have 15,000 items (as MH says), which we actually do NOT have, that'd take about 32 hrs to regenerate the entire library in 13 formats.

Adding new output formats is very easy -- it's just a PHP class with a single required function which accepts one parameter -- the document content.
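To give a feel for that plug-in shape, here is a rough sketch. It is in Python rather than UniBook's PHP, since the real source isn't published yet, and every name in it (OutputFormat, render, regenerate, the two sample formats, the file-naming scheme) is made up for illustration and is not UniBook's actual API:

```python
from abc import ABC, abstractmethod
from typing import Dict
from xml.sax.saxutils import escape


class OutputFormat(ABC):
    """Hypothetical plug-in: one class per format, one required method."""

    name = "base"

    @abstractmethod
    def render(self, content: str) -> str:
        """Take the document content, return the final output as a string."""


class PlainText(OutputFormat):
    name = "txt"

    def render(self, content: str) -> str:
        # Simplest possible format: hand the content back unchanged.
        return content


class SimpleXml(OutputFormat):
    name = "xml"

    def render(self, content: str) -> str:
        # Illustrative only: escape XML special characters and wrap the
        # text in a single element. A real format would do far more.
        return "<text>" + escape(content) + "</text>"


def regenerate(archive: Dict[str, str], fmt: OutputFormat) -> Dict[str, str]:
    """Re-run ONE format over every stored text; other formats are untouched."""
    return {
        text_id + "." + fmt.name: fmt.render(content)
        for text_id, content in archive.items()
    }


if __name__ == "__main__":
    archive = {"hamlet": "To be & not to be"}
    print(regenerate(archive, SimpleXml()))
```

The point of keeping the interface this small is that the caller never needs to know how a format builds its output, so swapping in a changed format only means re-running regenerate() for that one class.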
What that function does is irrelevant as long as it returns the final output or filename as a string. This means it can either build the output itself, or call an external program, etc.

Let's say that PG's desired master format is TEI; UniBook can output it, as mentioned. If that TEI spec ever changes, we just have to change the output function, and regenerate the archive in only that format.

Maintaining the archive becomes child's play as well -- make any edits to the database record(s) that are needed, then regenerate the output formats. This makes it extremely easy to implement a user-submitted error correction system in which "admins" just verify the items to be changed, instead of having to go through the files manually, etc.

Here's where UniBook currently stands:

1) Needs some code cleanup (I pretty much have to do that since I wrote it). After that, we can CVS/SVN it for cooperative maintenance.

2) Needs an administration interface (web based) for importing files, confirming imports, managing extra catalog data (LOC headings, etc). I can handle this as well if needed.

3) Needs a GUI for building the importable files. I've written several different versions of such an app in VB, but it really needs to be done in Java, so it's portable as an app, and embeddable as an applet for a web-based interface. This is where I need help -- I don't know enough Java to write GUIs from scratch. I can provide a fully functioning VB GUI (with code if desired) that would just need to be reproduced in Java. The whole interface is relatively simple -- a WYSIWYG with limited functionality.

Once a GUI is written, it'd be child's play to get ALL of PG's current texts imported into the system -- by volunteers interested in doing it -- along with all new texts being done with it natively.

Oh yeah, should I mention some of the other cool things that can be done with this system as the base? Like automatically generating CD ISO images for any combination of texts?
For example: we can do a CD for each year's new/updated texts, without wasting space on ones that haven't changed. Or, we can generate a CD image for all of Shakespeare, etc. People can build their own list and have an ISO automatically generated for them to download, with the texts in the format(s) of their choice...

...the list goes on and on...

-- James