re: Re: [gutvol-d] Re: barriers to XML posting

joshua said:
Honestly, I see an easy compromise here.
oh-oh, something smells bad here... :+)
As long as a conformant TEXT file and a conformant HTML file show up with the XML file, I say post all three.
well then i'm sure glad you don't _have_ a say...
Granted, right now we don't have a method for the WW'ers to verify the XML file is valid
all these minor details, eh?
so if you want to put a disclaimer to that effect in the file ... fine.
a disclaimer? that's your "compromise". yeah, right.
for those on the outside, who wonder why this fuss is being made whether an .xml file can be "posted", it's because the x.m.l. people want the imprimatur of being "official". why is that so important? because that's how the x.m.l. ponzi game is being played these days. people are adopting x.m.l. not because they think it's the best route, but rather because they've been "convinced" that it is "inevitable". even though -- in too many situations -- it just plain doesn't work, this "inevitability" makes people shrug and say, "ok, give me some." after all, you don't want to miss out on the ground floor, do you?
they will tell you time and time again how there are "so many tools" for dealing with x.m.l., how x.m.l. is gonna be able to do conversions for whatever format anyone wants, but they can't even demonstrate a simple ability to convert out a text file and an .html version now. and when you call them on that, they whine about how unreasonable you are being, and how unfair it is to expect "150% perfection". bull.
it is _far_ more sensible -- especially in, as jim delicately put it, a "production environment rather than an experimental one" -- to make the process _work_ before you put it in play. but the x.m.l. people don't want to be bothered with that "technicality" before the fact. that's something that someone will figure out "later". yeah, right.
if x.m.l. gets the stamp of approval here, what's the motivation for x.m.l. experts to come make it work? after all, there's no money in it for them here. they're off being high-paid consultants, telling the next mark, "look, even project gutenberg is using x.m.l. now too."
as marcello puts it:
At this point we need to set a signal that the TEI era has started.
he's not interested in actually making t.e.i. _work_ in reality -- he tried, and got a grand total of two simple e-texts done -- he just wants to "set a signal" that the "era has started" here... -bowerbird
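On the point quoted above that the whitewashers have no method to verify the XML file: a minimal sketch of the weakest useful check, namely whether a file is well-formed XML at all. It uses only Python's standard library. It does not validate against the TEI schema, and the function name and sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, "") if xml_text parses, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, ""


# Hypothetical samples: the second one never closes its <p> element.
good = "<text><body><p>Alice was beginning to get very tired.</p></body></text>"
bad = "<text><body><p>Unclosed paragraph.</body></text>"

print(check_well_formed(good))
print(check_well_formed(bad)[0])
```

A check like this only catches broken markup. Whether the markup is correct TEI, or converts cleanly to text and HTML, is a separate question.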

Bowerbird@aol.com wrote:
as marcello puts it:
At this point we need to set a signal that the TEI era has started.
he's not interested in actually making t.e.i. _work_ in reality -- he tried, and got a grand total of two simple e-texts done --
Bzzzzt. Wrong. But thank you for playing. He has *25* titles marked up in 3 languages. Ranging from Alice (illustrated), to Life on the Mississippi (tables and footnotes), Faust and Wallenstein (plays), Deutschland. Ein Wintermärchen (lyrics) and a technical manual about, guess what? PGTEI. Go to http://www.gnutenberg.de/search/titles/results/ and eat crow, Bowerbird. -- Marcello Perathoner webmaster@gutenberg.org

The problem is how to have beta-testing AND respect PG tradition of posting only definitive stuff.
Would this be useful?
I might offer web space, computing and bandwidth to post XML, convert it to txt and html and whatever else, and submit the result to whitewashing.
You will be able to have all the software to handle the conversion installed, and have submissions converted by automatic procedures.
This might be seen as a beta-test of xml whitewashing procedures. I am at most neutral to xml (I recognize its unavoidability, but I deplore the trend, I would prefer a more human-friendly markup). So it will not be pro-XML biased. And I am authorized to whitewash, so this can be seen as doing my whitewashing in public.
The posting-and-converting should be automatic: a web interface to submit a zip/tar.gz/tar.bz2 file, semi-automatic unzipping and conversion, poster and site administrator OK to make the posting public. Then the whitewashing could start WITHOUT corrections: if anything in the result is wrong, then one should repeat the submission. Whether the post is XML + converted, or converted only, will be PG's choice. The posts, complete with XML, will remain indefinitely on the test site. Of course, an additional line will be included to warn that the file is not an official PG file but only an intermediate working file. But except for this line, everything should be identical to a PG file, header and footer, PG number and filename included.
Drawback: the server is located in Italy, so I cannot do it for items that are not clearable in the EU. You'll have to submit clearance for death+70 (with procedures to decide, but a copy of a LOC authority record or an encyclopaedia article will of course be enough).
Carlo
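The unpacking step of the proposal above could be sketched roughly as follows. This is an illustration only, not Carlo's actual setup: the function name, directory layout and the file name `12345.xml` are assumptions. `shutil.unpack_archive` from Python's standard library picks the format (zip, tar.gz, tar.bz2) from the file extension. A real service would also have to guard against hostile archives, which this sketch does not do.

```python
import shutil
import tempfile
from pathlib import Path


def unpack_submission(archive_path, work_dir):
    """Unpack a .zip, .tar.gz or .tar.bz2 submission and list its XML files."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    # unpack_archive chooses the unpacker from the archive's file extension.
    shutil.unpack_archive(str(archive_path), str(work_dir))
    return sorted(p.relative_to(work_dir).as_posix() for p in work_dir.rglob("*.xml"))


# Build a throwaway submission (hypothetical file names) and unpack it.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    (src / "12345.xml").write_text("<text/>", encoding="utf-8")
    (src / "README.txt").write_text("notes", encoding="utf-8")
    archive = shutil.make_archive(str(Path(tmp) / "submission"), "zip", root_dir=src)
    found = unpack_submission(archive, Path(tmp) / "work")
    print(found)
```

The list of XML files found would then be handed to whatever conversion tools are installed on the test site. That part is left out here because the thread does not say which tools those are.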

A question (possibly better put over on the DP list): Is it possible to OCR a scan directly to XML? Or is the output from OCR always going to be text? If the first, then we need two processes -- one to deal with new scans (OCR to XML), one to deal with existing plain texts (to convert them to XML). But if the output of OCR is still going to be plain text, then we can use the same process to convert both existing and new books to XML. Steve -- Stephen Thomas, Senior Systems Analyst, Adelaide University Library ADELAIDE UNIVERSITY SA 5005 AUSTRALIA Tel: +61 8 8303 5190 Fax: +61 8 8303 4369 Email: stephen.thomas@adelaide.edu.au URL: http://staff.library.adelaide.edu.au/~sthomas/
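If OCR output stays plain text, the shared conversion step Steve describes might start from something like the sketch below. It only wraps blank-line-separated paragraphs in `<p>` elements under a `<body>` element and escapes special characters. The element names and the function name are assumptions chosen for illustration. Real TEI markup (headings, footnotes, tables, verse) needs far more than this.

```python
import re
import xml.etree.ElementTree as ET


def text_to_xml(plain_text):
    """Wrap blank-line-separated paragraphs in <p> elements under <body>."""
    body = ET.Element("body")
    for block in re.split(r"\n\s*\n", plain_text.strip()):
        para = ET.SubElement(body, "p")
        # Rejoin hard-wrapped lines into one run of text.
        para.text = " ".join(block.split())
    # Serialising through ElementTree escapes &, < and > in the text.
    return ET.tostring(body, encoding="unicode")


sample = "Down the\nRabbit-Hole.\n\nAlice was tired & bored.\n"
print(text_to_xml(sample))
```

The same function would serve both existing plain texts and fresh OCR output, which is the single-process case in the question.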

----- Original Message ----- From: "Steve Thomas" <stephen.thomas@adelaide.edu.au> To: "Project Gutenberg Volunteer Discussion" <gutvol-d@lists.pglaf.org> Sent: Friday, October 22, 2004 5:47 AM Subject: [gutenberg] Re: [gutvol-d] Re: barriers to XML posting
A question (possibly better put over on the DP list):
Is it possible to OCR a scan directly to XML? Or is the output from OCR always going to be text?
Abbyy Finereader 7.0 has the capability of saving each page of OCR in Microsoft Word XML format. I have not experimented with it, and am not even knowledgeable about XML yet, but if at some point PGDP wanted to use XML as a source format, it could be done, if the project manager has this software to work with. Abbyy 7 can also output its OCR as HTML, Excel spreadsheet, and many other formats.
Ronald Holder PGDP volunteer

It's nice that it can output it in some sort of XML, but I wouldn't want it in Microsoft Word XML format. Microsoft file formats are proprietary and have a tendency to change at every release.
Ronald Holder writes:
Abbyy Finereader 7.0 has the capability of saving each page of OCR in Microsoft Word XML format. I have not experimented with it, and am not even knowledgeable about XML yet, but if at some point PGDP wanted to use XML as a source format, it could be done, if the project manager has this software to work with. Abbyy 7 can also output its OCR as HTML, Excel spreadsheet, and many other formats.

Carlo Traverso wrote:
The problem is how to have beta-testing AND respect PG tradition of posting only definitive stuff.
I believe the PG policy is (or at least, has been at some point) to encourage the posting of preliminary material. From the PG header:
Please note: neither this list nor its contents are final till midnight of the last day of the month of any such announcement. The official release date of all Project Gutenberg Etexts is at Midnight, Central Time, of the last day of the stated month. A preliminary version may often be posted for suggestion, comment and editing by those who wish to do so. To be sure you have an up to date first edition [xxxxx10x.xxx] please check file sizes in the first week of the next month.
That is exactly what we want to do: post a preliminary version for suggestion, comment and editing. I don't understand why this is not possible for a TEI file.
I might offer web space, computing and bandwidth to post XML, convert it to txt and html and whatever else, and submit the result to whitewashing.
Thank you. As for the server I can also offer one located in Germany, so the same limitations apply. But this is sooo tedious! We have to replicate the exact setup of gutenberg.org *and* pglaf.org to get reliable results from the beta-test. Example: my servers are all debian and have perl 5.8 whereas ibiblio is redhat enterprise with perl 5.6. This has often given me headaches before, because programs that ran at home mysteriously failed at ibiblio. -- Marcello Perathoner webmaster@gutenberg.org

"Marcello" == Marcello Perathoner <marcello@perathoner.de> writes:
Marcello> Example: my servers are all debian and have perl 5.8
Marcello> whereas ibiblio is redhat enterprise with perl 5.6. This
Marcello> has often given me headaches before, because programs that
Marcello> ran at home mysteriously failed at ibiblio.
That's one of the points. The conversion tools are mature when they are independent of the exact version of the software that you have. And having a "neutral" site for testing is one of the important points: you cannot rely on your own configuration. PG has to rely on tools that are stable, not on the bleeding edge.
Carlo

Carlo Traverso wrote:
That's one of the points. The conversion tools are mature when they are independent of the exact version of the software that you have.
I was referring to the scripts that run the catalog.
PG has to rely on tools that are stable, not on the bleeding edge.
This is open source development. We don't have enough resources to test the tools everywhere before releasing. We need bug reports and patches from the people out there. I don't even have a Winsloth machine ... The mentality of "everything has to be perfect before we start" doesn't work. Linus didn't post Linux when it was ready, he posted it when it was no more than a filesystem with a bit of memory management attached. Tim Berners-Lee didn't start with XHTML 1.1. He started with what he had and refined it later. Michael Hart didn't wait till he got a computer that understood lower case. He started with upper case only and fixed that later. Success stories. Not an argument, but maybe an illustration. -- Marcello Perathoner webmaster@gutenberg.org

If Carlo or someone else is willing to help with administering it, we can provide webspace, computing, and bandwidth on either the PGDP server or our test server to be used for this same purpose. Being located in the US, we would be following the same copyright rules as PG. We would also be happy to keep XML versions of any projects until PG is ready to accept them.
JulietS
----- Original Message ----- From: "Carlo Traverso" <traverso@dm.unipi.it> To: <gutvol-d@lists.pglaf.org> Sent: Friday, October 22, 2004 5:06 AM Subject: Re: [gutvol-d] Re: barriers to XML posting
The problem is how to have beta-testing AND respect PG tradition of posting only definitive stuff.
Would this be useful?
I might offer web space, computing and bandwidth to post XML, convert it to txt and html and whatever else, and submit the result to whitewashing.
You will be able to have all the software to handle the conversion installed, and have submissions converted by automatic procedures.
This might be seen as a beta-test of xml whitewashing procedures. I am at most neutral to xml (I recognize its unavoidability, but I deplore the trend, I would prefer a more human-friendly markup). So it will not be pro-XML biased. And I am authorized to whitewash, so this can be seen as doing my whitewashing in public.
The posting-and-converting should be automatic: a web interface to submit a zip/tar.gz/tar.bz2 file, semi-automatic unzipping and conversion, poster and site administrator OK to make the posting public. Then the whitewashing could start WITHOUT corrections: if anything in the result is wrong, then one should repeat the submission. Whether the post is XML + converted, or converted only, will be PG's choice. The posts, complete with XML, will remain indefinitely on the test site. Of course, an additional line will be included to warn that the file is not an official PG file but only an intermediate working file. But except for this line, everything should be identical to a PG file, header and footer, PG number and filename included.
Drawback: the server is located in Italy, so I cannot do it for items that are not clearable in the EU. You'll have to submit clearance for death+70 (with procedures to decide, but a copy of a LOC authority record or an encyclopaedia article will of course be enough).
Carlo
participants (7): Bowerbird@aol.com, Bruce Albrecht, Carlo Traverso, Juliet Sutherland, Marcello Perathoner, Ronald Holder, Steve Thomas