
Dear fellow PG volunteers,
I know that discussing issues of markup in PG files is a pointless argument that rarely goes anywhere. Still, I must ask: is it generally acceptable to most PG volunteers to have HTML files in the collection with massive amounts of redundant white space in them?
By this point in time, there are megabytes of storage space in the PG archive which consist of only spaces because of heavy indentation in HTML files.
Take a look at the HTML source of the recently released Edward Lear "A Book of Nonsense" to see an example a little more extreme than most I've seen:
http://www.gutenberg.net/etext/13646
Andrew

On Sat, 9 Oct 2004 01:26:13 -0700 (PDT), Andrew Sly <sly@victoria.tc.ca> wrote:
| Dear fellow PG volunteers,
|
| I know that discussing issues of markup in PG files
| is a pointless argument that rarely goes anywhere.
| Still, I must ask: is it generally acceptable to
| most PG volunteers to have HTML files in the collection
| with massive amounts of redundant white space in them?
|
| By this point in time, there are megabytes of storage
| space in the PG archive which consist of only spaces
| because of much indentation in html files.
|
| Take a look at the html source of the recently released
| Edward Lear "A Book of Nonsense" to see an example a little
| more extreme than most I've seen:
|
| http://www.gutenberg.net/etext/13646
Just had a look at it and IMO it appears to be *very* well done. The indentation is only *two* spaces per level, whereas some would use *eight* spaces per level.
As anyone who has hand-coded HTML or any computer language knows, the indenting and other white space in the code is *absolutely* essential for understanding the code, especially after a year or two, when you have forgotten everything about it. The white space is even more essential when modifying other people's code.
-- Dave Fawthrop <dave hyphenologist co uk>
Don't eat cousin Banana she shares 50% of your genes. Do not kill cousin House Mouse, it is not his fault he is doubly incontinent. Flies need your help. Killing cousin salmonella with bleach is murder, he is as much alive as you are. ;-)

Thank you for everyone's feedback. A closer look at the file I mentioned shows that it uses tabs, not spaces for indenting, so it will appear differently depending on what program you use to view it. (the main body of the text is all indented by eight tabs, which for me, made it appear to start in the 64th column) Thanks, Andrew

On Sat, 9 Oct 2004, Andrew Sly wrote:
Thank you for everyone's feedback.
A closer look at the file I mentioned shows that it uses tabs, not spaces for indenting, so it will appear differently depending on what program you use to view it. (the main body of the text is all indented by eight tabs, which for me, made it appear to start in the 64th column)
That's why TABS are not recommended in any Project Gutenberg file. Michael
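The effect Andrew describes can be reproduced with a short sketch. It assumes a viewer that places tab stops at a fixed width; the sample line and the widths tried are made up for illustration and are not taken from the actual file.

```python
# Sketch: how much blank space a tab-indented line gets at different tab widths.
# The sample line and the tab widths are illustrative assumptions.

def leading_blank_columns(line: str, tab_width: int) -> int:
    """Return the number of blank columns before the first visible character."""
    expanded = line.expandtabs(tab_width)
    return len(expanded) - len(expanded.lstrip(" "))

# Eight tabs of indentation, as reported for the main body of the text.
sample = "\t" * 8 + "<p>There was an Old Man with a beard,</p>"

for width in (2, 4, 8):
    blanks = leading_blank_columns(sample, width)
    print(f"tab width {width}: {blanks} blank columns before the text")
```

The same bytes therefore look modestly indented in one editor and pushed far to the right in another, which is the inconsistency behind the recommendation against tabs.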

"Andrew" == Andrew Sly <sly@victoria.tc.ca> writes:
Andrew> Dear fellow PG volunteers,
Andrew> I know that discussing issues of markup in PG files is a
Andrew> pointless argument that rarely goes anywhere. Still, I
Andrew> must ask: is it generally acceptable to most PG
Andrew> volunteers to have HTML files in the collection with
Andrew> massive amounts of redundant white space in them?
Andrew> By this point in time, there are megabytes of storage
Andrew> space in the PG archive which consist of only spaces
Andrew> because of much indentation in html files.
Andrew> Take a look at the html source of the recently released
Andrew> Edward Lear "A Book of Nonsense" to see an example a
Andrew> little more extreme than most I've seen:
I have taken the file, unzipped it, replaced every run of multiple whitespace with a single whitespace and rezipped; the saving was 365 bytes (out of 640KB). Andrew's message, as received by me, with all the headers etc., was 3247 bytes.
Although one might discuss logical indenting in HTML sources versus 75-column texts, I don't think that space is the issue; discussing bytes, or even megabytes, when the archive is terabytes, is discussing 0.00001% savings.
Carlo
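Carlo's experiment can be approximated with the sketch below. It is not his procedure: it uses Python's zlib (DEFLATE, the method zip normally applies) instead of a zip tool, and the input is a made-up stand-in for an indented HTML file, so the numbers it prints will not match his 365 bytes.

```python
import re
import zlib

def collapse_whitespace(text: str) -> str:
    """Replace every run of spaces and tabs with a single space."""
    return re.sub(r"[ \t]+", " ", text)

def compressed_size(text: str) -> int:
    """Size in bytes after DEFLATE compression at the highest level."""
    return len(zlib.compress(text.encode("utf-8"), 9))

# Hypothetical stand-in for a heavily indented HTML file.
html = "".join(
    "\t" * 8 + f"<p>Limerick number {i}, indented by eight tabs.</p>\n"
    for i in range(2000)
)
collapsed = collapse_whitespace(html)

print("raw bytes:       ", len(html), "->", len(collapsed))
print("compressed bytes:", compressed_size(html), "->", compressed_size(collapsed))
```

Comparing the two printed lines shows how much of the raw saving survives compression, which is the comparison Carlo is making.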

--- Andrew Sly <sly@victoria.tc.ca> wrote:
Dear fellow PG volunteers,
I know that discussing issues of markup in PG files is a pointless argument that rarely goes anywhere. Still, I must ask: is it generally acceptable to most PG volunteers to have HTML files in the collection with massive amounts of redundant white space in them?
By this point in time, there are megabytes of storage space in the PG archive which consist of only spaces because of much indentation in html files.
Just as much space is wasted by the pointless way we insert newlines into text editions to keep line lengths down to 80 characters. Much more space is wasted by the odd decision to include in PG poor-quality computerized 'readings' of PG material. The easiest way to drastically reduce the amount of wasted space used by PG is to get rid of the multiple editions, transition to one decently marked up XML master format, and convert to required output formats on the fly. This has approximately zero chance of happening any time soon.
-- Jon Ingram

Jonathan Ingram wrote:
The easiest way to drastically reduce the amount of wasted space used by PG is to get rid of the multiple editions, transition to one decently marked up XML master format, and convert to required output formats on the fly. This has approximately zero chance of happening any time soon.
I am a big supporter of XML, but I challenge you to automatically create an acceptable ASCII version from one of my XML files without manual intervention... One small warning: they have loads of tables and other challenging stuff. I think it can be done, but it is far from trivial. Jeroen.

Jeroen Hellingman <jeroen@bohol.ph> writes:
I am a big supporter of XML, but I challenge you to automatically create an acceptable ASCII version from one of my XML files without manual intervention...
Don't waste your time on so-called ASCII versions. Simple HTML as a replacement for the traditional ASCII version is "good enough" - then tools like lynx or w3m or links(?) can do the dirty work. I do not know whether there are special HTML devices for the blind, but I know some blind people use lynx to browse (parts of) the web.
One small warning, they have loads of tables and other challenging stuff. I think it can be done, but it is far from trivial.
First, these text browsers can display tables, and if this is not good enough, you can always press a magic key and view the HTML source. Of course, if people want to spend their time on ASCII versions, it is their business. But the XML version must be the source for all other formats. -- | ,__o | _-\_<, http://www.gnu.franken.de/ke/ | (*)/'(*)

On Sat, Oct 09, 2004 at 06:50:38PM +0200, Karl Eichwalder wrote:
Jeroen Hellingman <jeroen@bohol.ph> writes:
I am a big supporter of XML, but I challenge you to automatically create an acceptable ASCII version from one of my XML files without manual intervention...
Don't waste your time on so-called ASCII versions. Simple HTML as a replacement for the traditional ASCII version is "good enough" - then tools like lynx or w3m or links(?) can do the dirty work. I do not know whether there are special HTML devices for the blind, but I know some blind people use lynx to browse (parts of) the web.
One small warning, they have loads of tables and other challenging stuff. I think it can be done, but it is far from trivial.
First, these text browsers can display tables, and if this is not good enough, you can always press a magic key and view the HTML source.
Of course, if people want to spend their time on ASCII versions, it is their business. But the XML version must be the source for all other formats.
I'm just writing to point out that Karl's statements are not consistent with how Project Gutenberg processes and distributes eBooks. See more in our FAQ at gutenberg.net
In short:
- we *require* plain text, except in cases where the format, language or other aspects make it impossible or highly difficult
As Jeroen mentioned, we're anxious to have an automatic transformation from XML to HTML and from XML to plain text. These have proven more difficult than expected, although both Jeroen & Marcello have solutions that are pretty good. People who think they know how to accomplish this task should send a URL to documentation & a demonstration.
-- Greg
PS: People who want PDF-only, XML-only, HTML-only, TeX-only, etc. are welcome to start their own projects. PG might even be willing to license our name to you (more on this in http://gutenberg.net/about).

Greg Newby <gbnewby@pglaf.org> writes:
- we *require* plain text, except in cases where the format, language or other aspects make it impossible or highly difficult
In cases where plain text is not impossible or highly difficult, use lynx's or w3m's -dump option. Problem solved. Most of the time this will look better than hand-crafted .txt files.
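The dump approach mentioned here could be scripted roughly as follows. This is a sketch, not PG's workflow: the helper names and the file name in the usage note are hypothetical, and it assumes lynx or w3m is installed and on the PATH. The `-dump` option exists in both browsers; lynx's `-nolist` suppresses the numbered link list it otherwise appends to a dump.

```python
import shutil
import subprocess

def dump_command(browser: str, html_path: str) -> list[str]:
    """Build the command line for a plain-text dump of a local HTML file."""
    if browser == "lynx":
        # -nolist leaves out the list of link targets at the end of the dump.
        return ["lynx", "-dump", "-nolist", html_path]
    if browser == "w3m":
        return ["w3m", "-dump", html_path]
    raise ValueError(f"unsupported browser: {browser}")

def html_to_text(html_path: str, browser: str = "lynx") -> str:
    """Run the text browser and return its rendering as a string."""
    if shutil.which(browser) is None:
        raise RuntimeError(f"{browser} is not installed")
    result = subprocess.run(
        dump_command(browser, html_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `html_to_text("book.html", browser="w3m")` on a hypothetical local file would return the rendered text; whether the result is acceptable as a PG plain-text edition is exactly what the rest of the thread disputes.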
PS: People who want PDF-only, XML-only, HTML-only, TeX-only, etc. are welcome to start their own projects.
That's what I do. But this does not mean I did not try to cooperate.
PG might even be willing to license our name to you (more on this in http://gutenberg.net/about).
The name is not important for my (little) project. -- | ,__o | _-\_<, http://www.gnu.franken.de/ke/ | (*)/'(*)

At 06:50 PM 10/9/2004 +0200, you wrote:
Jeroen Hellingman <jeroen@bohol.ph> writes:
I am a big supporter of XML, but I challenge you to automatically create an acceptable ASCII version from one of my XML files without manual intervention...
Don't waste your time on so-called ASCII versions. Simple HTML as a replacement for the traditional ASCII version is "good enough" - then tools like lynx or w3m or links(?) can do the dirty work. I do not know whether there are special HTML devices for the blind, but I know some blind people use lynx to browse (parts of) the web.
Hello. Yes, I am blind and I still use Lynx regularly. However, it does not create clean ASCII files. Every page I convert has two blank spaces at the beginning of every line, and it inserts junk to mark links and image placeholders. Also, more and more sites no longer work with text browsers, so using Lynx or Links is becoming a thing of the past. Please don't even get me started on how poorly Internet Explorer does at plain text dumps; however, it is currently the most accessible graphical browser.
One thing I really like about the current PG model is that I can quickly go to the ftp site, grab a file, unzip it and have readable plain text. I would not want to have to download a master XML file and convert it, or have the PG site convert it on the fly and try to download it with my browser.
Let's not lose sight of the goal of PG: to make as many ebooks available to as many people on as many platforms as possible. I can download the same file to my Windows or Linux machines and they are just as accessible. I can load it into a portable notetaker for the blind and it is still just as accessible. I can even put it on my old Apple II and yes, it's still accessible. I hope this doesn't change.

Tony Baechler <tb@baechler.net> writes:
However, it does not create clean ASCII files. Every page I convert has two blank spaces at the beginning of every line, and it inserts junk to mark links and image placeholders.
I appreciate your feedback very much! I guess with a little bit post-processing we can improve the output. Or we should use 'w3m' for creating txt files.
One thing I really like about the current PG model is that I can quickly go to the ftp site, grab a file, unzip it and have readable plain text.
Yes, I don't want you to produce txt files on your own. We should change the way we create txt files. Doing txt files by hand is too slow. Often it is necessary to improve a text (typos, missing parts, random garbage); if you have to apply the same correction to various files manually, you spend more time than necessary, and such a procedure is error-prone in itself. -- | ,__o | _-\_<, http://www.gnu.franken.de/ke/ | (*)/'(*)
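The "little bit of post-processing" suggested above might look like the sketch below. It assumes the two-space margin Tony reports and a few bracketed marker shapes (`[12]` for links, `[INLINE]`, `[IMAGE]`, `[LINK]` or `[name.gif]` for images); the exact markers depend on the lynx version and settings, so the patterns are assumptions to be adjusted against real output.

```python
import re

# Assumed marker shapes; check them against the dumps you actually get.
LINK_MARKER = re.compile(r"\[\d+\]")
IMAGE_MARKER = re.compile(r"\[(?:INLINE|IMAGE|LINK|[^\]\s]+\.(?:gif|jpe?g|png))\]")

def clean_dump(text: str, margin: int = 2) -> str:
    """Remove the left margin and bracketed markers from a text-browser dump."""
    cleaned = []
    for line in text.splitlines():
        # Drop at most `margin` leading spaces so deeper indentation survives.
        leading = len(line) - len(line.lstrip(" "))
        line = line[min(margin, leading):]
        line = LINK_MARKER.sub("", line)
        line = IMAGE_MARKER.sub("", line)
        cleaned.append(line.rstrip())
    return "\n".join(cleaned) + "\n"

sample = "  See [1]the book [INLINE]\n    indented verse\n"
print(clean_dump(sample), end="")
```

A filter like this also removes any bracketed number that belongs to the book itself, such as a footnote marker, so its output would still need a human check.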

On Sun, 10 Oct 2004, Karl Eichwalder wrote:
Tony Baechler <tb@baechler.net> writes:
However, it does not create clean ASCII files. Every page I convert has two blank spaces at the beginning of every line, and it inserts junk to mark links and image placeholders.
I appreciate your feedback very much! I guess with a little bit post-processing we can improve the output. Or we should use 'w3m' for creating txt files.
One thing I really like about the current PG model is that I can quickly go to the ftp site, grab a file, unzip it and have readable plain text.
Yes, I don't want you to produce txt files on your own. We should change the way we create txt files. Doing txt files by hand is too slow. Often it is necessary to improve a text (typos, missing parts, random garbage); if you have to apply the same correction to various files manually, you spend more time than necessary, and such a procedure is error-prone in itself.
When I was faced with these problems, I just wrote macros for my word processor to take out leading and trailing spaces. If there were sections of poetry or songs that looked better indented, then I just changed the spaces in those to @'s and then did a global search and replace [after first searching for @'s already there]. These steps all combined take less time than I spent writing this. Michael Hart
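The macro steps described above translate fairly directly into a script. This is a sketch of the same idea, not Michael's macros: the function name is hypothetical, and which lines count as poetry is still decided by hand and passed in as line indexes.

```python
def strip_margins(lines, keep_indent=frozenset(), placeholder="@"):
    """Strip leading and trailing spaces, preserving indentation on chosen lines.

    `keep_indent` holds the 0-based indexes of lines (e.g. verse) whose
    leading spaces should survive; they are masked with `placeholder` first.
    """
    # First check that the placeholder is not already used in the text.
    if any(placeholder in line for line in lines):
        raise ValueError("placeholder already occurs in the text; pick another")

    # Step 1: turn the leading spaces of protected lines into placeholders.
    protected = []
    for number, line in enumerate(lines):
        if number in keep_indent:
            indent = len(line) - len(line.lstrip(" "))
            line = placeholder * indent + line[indent:]
        protected.append(line)

    # Step 2: the global strip of leading and trailing spaces.
    stripped = [line.strip(" ") for line in protected]

    # Step 3: turn the placeholders back into spaces.
    restored = []
    for line in stripped:
        body = line.lstrip(placeholder)
        restored.append(" " * (len(line) - len(body)) + body)
    return restored

page = ["  prose line  ", "    verse line ", "   more prose"]
print(strip_margins(page, keep_indent={1}))
```

The up-front check mirrors the step of searching for @'s already present before using them as a stand-in.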

One more suggestion: there are many brands of word processors and other programs that include file conversion [and the kinds of macros I had mentioned earlier], so I should think it would be easy enuf to find one that met your specifications. My own suggestion would be to start with something such as the Word Perfect versions; don't just try one version, as they are quite different from version to version. Michael

Michael Hart <hart@pglaf.org> writes:
there are many brands of word processors and other programs that include file conversion [and the kinds of macros I had mentioned earlier], so I should think it would be easy enuf to find one that met your specifications.
Converting any file into HTML is easy, but that's not the point. If you are interested in good HTML or PDF, you must start with a semantically tagged file (these days that's mostly an XML file). -- | ,__o | _-\_<, http://www.gnu.franken.de/ke/ | (*)/'(*)

Most of the white space in the HTML is tab characters, so that cuts things down quite a bit. Also, if a few tabs make the HTML source more readable (and editable), then why not? In any event, the tabs use up far less space than the pictures.
nwolcott2@post.harvard.edu
Friar Wolcott, Gutenberg Abbey, Sherwood Forrest
----- Original Message -----
From: "Andrew Sly" <sly@victoria.tc.ca>
To: <gutvol-d@lists.pglaf.org>
Sent: Saturday, October 09, 2004 4:26 AM
Subject: [gutvol-d] Extra spaces in html files
Dear fellow PG volunteers,
I know that discussing issues of markup in PG files is a pointless argument that rarely goes anywhere. Still, I must ask: is it generally acceptable to most PG volunteers to have HTML files in the collection with massive amounts of redundant white space in them?
By this point in time, there are megabytes of storage space in the PG archive which consist of only spaces because of much indentation in html files.
Take a look at the html source of the recently released Edward Lear "A Book of Nonsense" to see an example a little more extreme than most I've seen:
http://www.gutenberg.net/etext/13646
Andrew
participants (10): Andrew Sly, Carlo Traverso, Dave Fawthrop, Greg Newby, Jeroen Hellingman, Jonathan Ingram, Karl Eichwalder, Michael Hart, Norm Wolcott, Tony Baechler