From john_redmond at optusnet.com.au  Mon Dec 14 19:44:58 2009
From: john_redmond at optusnet.com.au (John Redmond)
Date: Tue, 15 Dec 2009 14:44:58 +1100
Subject: [gutvol-p] Getting Involved
Message-ID: <1260848698.2347.84.camel@localhost.localdomain>

Hello all:

I write code, specifically code to process XML/XHTML, etc, as well as
lightly marked-up text. I am keen to get involved with the Gutenberg
activities, and I think that I can contribute best by providing software
to value-add on existing offerings.

The best way to have a look at what I do is to go to my site
(www.limpidsoft.com), where I:

1. Outline, in a couple of PDF files, what I have been doing recently;
2. Give some rather crude examples of PDF files I have adapted from
existing Gutenberg files;
3. Provide downloads of free, command-line, utilities (for Linux and
Windows) to convert text or XML to LaTeX and PDF files.

I would like to get advice on how to get involved with the Gutenberg
activities: how to submit files, what declarations and other inclusions
are required.

John Redmond
Sydney, Australia

From ajhaines at shaw.ca  Tue Dec 15 21:04:50 2009
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Tue, 15 Dec 2009 21:04:50 -0800
Subject: [gutvol-p] Re: Getting Involved
References: <1260848698.2347.84.camel@localhost.localdomain>
Message-ID: <6FA83B2D449948D4ACAB9764AB75F242@alp2400>

John, I've taken the liberty of re-posting this message to Project Gutenberg's general discussion forum (gutvol-d). (It's usually a more active and vociferous forum than gutvol-p.)

You don't say precisely what you mean by "submitting files", but if you mean submitting new ebooks, you should probably start by reading PG's various How-To's and FAQ's at http://www.gutenberg.org/wiki/Category:How-To and http://www.gutenberg.org/wiki/Category:FAQ, respectively.

You'll need to obtain a copyright clearance for the book you want to produce, described here: http://www.gutenberg.org/wiki/Gutenberg:Copyright_How-To.
(You'll also need to create an account for yourself at http://upload.pglaf.org/ to submit copyright clearance requests and to upload your finished ebook.)

For general information and standards for the text and HTML versions, respectively, see the Volunteers' FAQ and the HTML FAQ at http://www.gutenberg.org/wiki/Category:FAQ.

For checking text files, the three utilities Gutcheck, Jeebies, and Gutspell are indispensable. They're free, and downloadable at http://gutcheck.sourceforge.net/etc.html

PG requires at least a plain text version of the book. (See the Volunteers' FAQ, especially section 7.) HTML is optional, but desirable. It's pretty much required if the book has illustrations or graphical content. Other formats (doc, rtf, pdf, etc) are optional. See the File Formats FAQ at the above FAQ link for more information.

Comments on your assorted utilities and conversions I leave to others.

Al Haines
Project Gutenberg.

----- Original Message -----
From: "John Redmond"
To:
Sent: Monday, December 14, 2009 7:44 PM
Subject: [gutvol-p] Getting Involved

> Hello all:
>
> I write code, specifically code to process XML/XHTML, etc, as well as
> lightly marked-up text. I am keen to get involved with the Gutenberg
> activities, and I think that I can contribute best by providing software
> to value-add on existing offerings.
>
> The best way to have a look at what I do is to go to my site
> (www.limpidsoft.com), where I:
>
> 1. Outline, in a couple of PDF files, what I have been doing recently;
> 2. Give some rather crude examples of PDF files I have adapted from
> existing Gutenberg files;
> 3. Provide downloads of free, command-line, utilities (for Linux and
> Windows) to convert text or XML to LaTeX and PDF files.
>
> I would like to get advice on how to get involved with the Gutenberg
> activities: how to submit files, what declarations and other inclusions
> are required.
>
> John Redmond
> Sydney, Australia
>
>
> _______________________________________________
> gutvol-p mailing list
> gutvol-p at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-p
>

From bowerbird at aol.com  Thu Dec 17 00:49:06 2009
From: bowerbird at aol.com (bowerbird at aol.com)
Date: Thu, 17 Dec 2009 03:49:06 EST
Subject: [gutvol-p] Re: Getting Involved
Message-ID:

john said:
> I would like to get advice on how to
> get involved with the Gutenberg activities

john, before you give up for lack of meaningful advice,
try one more time next week and i'll give you feedback.

-bowerbird

From ajhaines at shaw.ca  Thu Dec 17 09:39:45 2009
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Thu, 17 Dec 2009 09:39:45 -0800
Subject: [gutvol-p] Re: Getting Involved
References:
Message-ID:

Actually, I've already done so. I reposted his message to gutvol-p into gutvol-d (check the gutvol-d archives for Dec 15). I then got an email from him, to which I responded.

Al

----- Original Message -----
From: bowerbird at aol.com
To: gutvol-d at lists.pglaf.org ; gutvol-p at lists.pglaf.org ; john_redmond at optusnet.com.au ; bowerbird at aol.com
Sent: Thursday, December 17, 2009 12:49 AM
Subject: [gutvol-p] Re: Getting Involved

john said:
> I would like to get advice on how to
> get involved with the Gutenberg activities

john, before you give up for lack of meaningful advice,
try one more time next week and i'll give you feedback.

-bowerbird
From john_redmond at optusnet.com.au  Fri Dec 18 18:35:10 2009
From: john_redmond at optusnet.com.au (John Redmond)
Date: Sat, 19 Dec 2009 13:35:10 +1100
Subject: [gutvol-p] Re: Getting Involved
In-Reply-To: <55C90F6EDDFF43B28A45D40C719558B0@alp2400>
References: <1260848698.2347.84.camel@localhost.localdomain>
	<6FA83B2D449948D4ACAB9764AB75F242@alp2400>
	<1261003378.2340.17.camel@localhost.localdomain>
	<55C90F6EDDFF43B28A45D40C719558B0@alp2400>
Message-ID: <1261190110.2340.204.camel@localhost.localdomain>

Hello Al, Juliet and Jim:

Thanks for your detailed replies. I now start to understand the train of dependencies. The credit line was what had bothered me in particular, given that a PDF file cannot be altered after submission.

More generally, I now understand that, if I can contribute at all, it would be to help in packaging content for others to convert to PDF, XML, or whatever other format may become fashionable.

In this connection, the reply from Juliet looks very enticing. To explain: I am one of the XML true believers and TEI, or something like it, is ultimately the way to go. But TEI seeks to be all-inclusive and doomed to be very big and complicated (think SGML). And working with XML is so painful and error-prone for humans. I don't know how big the PGTEI subset is, but there is a good chance that it might be expressible in lightly marked-up text, which can easily be parsed into XML. If that were the case, I can become usefully involved at the DP end.

To state Basil Fawlty's bleedingly obvious, it might then become PG's long-term aim to provide PGTEI versions of all texts, from which all styled versions can be derived--and the only one version to be maintained.

But where is a spec for PGTEI? And samples? If I could have a look at them, I could very quickly decide whether I could be of any use.
In summary, without being very clear about it, I had thought that I might be able to contribute to PG by generating more refined documents from existing books (gratuitous, I admit); but now I suspect that I might be more useful by wrestling with the software.

(Notes for Juliet:
1. I could not find a spec for PGTEI on the pgdp.net site. Is one available?
2. I am a Linux user by choice, but should I presume that all software is required for Windows? )

Note for all responders: Thanks for your thoughtful responses; I am starting to learn the issues!

In conclusion: many thanks to Al, Juliet and Jim for their detailed responses.

John Redmond

On Wed, 2009-12-16 at 15:58 -0800, Al Haines (shaw) wrote:
> If a PDF, or any other format, is generated from an existing PG text, it
> won't get a new number. It would be bundled in with all other files for
> that etext number, and would appear in PG's catalog as an additional filetype.
>
> To use Copperfield as an example, if it was in PG originally as only a text
> file, then at a later date an HTML version was generated from the text file,
> the text and HTML files would appear as two filetype entries under that
> particular Copperfield. If a PDF file was then added, generated from either
> that Copperfield's text or HTML file, the PDF file would appear as another
> filetype.
>
> New numbers are given to ebooks that are new to PG, or are created from a
> different edition, with significant enhancements/differences, than a current
> PG ebook.
>
> On occasion, a new number is assigned if a new set of files is created from
> the same source edition, but the new version has significant enhancements,
> e.g. illustrations, an index, etc, that may have been omitted from the
> current PG version. This usually applies only to PG's oldest texts, before
> HTML/images/ISO files were commonly provided.
>
> One other point, again with Copperfield as the example.
> You say that yours
> was generated from one of PG's editions, but you appear to have stripped out
> the producer's credit line ("Produced by..."). Some PG files, usually the
> older ones, may not have originally had such a credit line, but if the file
> is cleaned up and reposted some time after its original submission, it's
> standard practice to add "Produced by an anonymous Project Gutenberg
> volunteer".
>
> Whatever the case, stripping out a credit line is a distinct no-no. The
> original producers always get credit for the original production, with the
> producer of the new format getting additional credit. For example, PG#552
> (The People that Time Forgot) was produced in 1996 by Judith Boss. In July
> 2008, I created an HTML version from her text file. She retains basic
> credit; I took credit only for the HTML file. If you created a PDF file
> from either of those two files, your credit would be added to the other two.
> These credit lines are respected by most harvesters of PG files.
>
>
> ----- Original Message -----
> From: "John Redmond"
> To: "Al Haines (shaw)"
> Sent: Wednesday, December 16, 2009 2:42 PM
> Subject: Re: [gutvol-p] Getting Involved
>
>
> > Hello Al:
> >
> > Thanks for responding. I will certainly work through all the links that
> > you have listed. I can't help feeling, though, that what I want to do is
> > somewhat different from the usual:
> >
> > 1. I see my contribution, apart from providing the software, is to
> > value-add on existing books. For example, the files on my site
> > (www.limpidsoft.com) are derived from PG books, but I presume that they
> > will have new catalog numbers.
> >
> > 2. I can provide XHTML files -- although there is no shortage of them in
> > PG. So my particular contribution would be PDF files, possibly with the
> > associated LaTeX files. Now, because PDF files are locked, it will not
> > be possible to include any statements after they are built.
> > As I see it,
> > then, I would need to tie up all this detail before submitting.
> >
> > 3. Plain text versions are automatically accounted for (see 1. above),
> > but it would probably be appropriate to identify these somewhere in the
> > PDF.
> >
> > John Redmond
> >
> >
> >

From marcello at perathoner.de  Fri Dec 18 22:56:10 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Sat, 19 Dec 2009 07:56:10 +0100
Subject: [gutvol-p] Re: Getting Involved
In-Reply-To: <1261190110.2340.204.camel@localhost.localdomain>
References: <1260848698.2347.84.camel@localhost.localdomain>
	<6FA83B2D449948D4ACAB9764AB75F242@alp2400>
	<1261003378.2340.17.camel@localhost.localdomain>
	<55C90F6EDDFF43B28A45D40C719558B0@alp2400>
	<1261190110.2340.204.camel@localhost.localdomain>
Message-ID: <4B2C790A.3000704@perathoner.de>

John Redmond wrote:

> I don't know how big the
> PGTEI subset is,

PGTEI is basically TEI-Lite.

> 1. I could not find a spec for PGTEI on the pgdp.net site. Is one
> available?

http://pgtei.pglaf.org/marcello/0.4/
http://www.gnutenberg.de/pgtei/0.5/examples/

> 2. I am a Linux user by choice, but should I presume that all software
> is required for Windows?

PGTEI is developed on Linux. Maybe you could figure out a port to
Windows + a GUI.

--
Marcello Perathoner
webmaster at gutenberg.org

From i30817 at gmail.com  Wed Dec 23 14:41:14 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Wed, 23 Dec 2009 22:41:14 +0000
Subject: [gutvol-p] Hi, about filtering the license information.
Message-ID: <212322090912231441w3f4566a8v65e0f99a610242eb@mail.gmail.com>

Right now I'm preparing the filtering of the license information from books downloaded from Project Gutenberg, and I have 2 doubts:
1) I understand I can replace the license with this string ""This Ebook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever.
You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included in this eBook or online at www.gutenberg.net"" is that enough?
2) Is there any other markup besides the '_' to represent italic boundaries? And does this markup only occur in txt files?

From i30817 at gmail.com  Wed Dec 23 15:55:59 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Wed, 23 Dec 2009 23:55:59 +0000
Subject: [gutvol-p] Re: Hi, about filtering the license information.
In-Reply-To: <212322090912231441w3f4566a8v65e0f99a610242eb@mail.gmail.com>
References: <212322090912231441w3f4566a8v65e0f99a610242eb@mail.gmail.com>
Message-ID: <212322090912231555j115f1a63g5966930a5e490c6@mail.gmail.com>

Also, can an italic markup span more than one line/paragraph?

On Wed, Dec 23, 2009 at 10:41 PM, Paulo Levi wrote:

> Right now I'm preparing the filtering of the license information from
> books downloaded from Project Gutenberg, and I have 2 doubts:
> 1) I understand I can replace the license with this string ""This Ebook is
> for the use of anyone anywhere at no cost and with almost no restrictions
> whatsoever. You may copy it, give it away or re-use it under the terms of
> the Project Gutenberg License included in this eBook or online at
> www.gutenberg.net"" is that enough?
> 2) Is there any other markup besides the '_' to represent italic boundaries?
> And does this markup only occur in txt files?
>

From i30817 at gmail.com  Wed Dec 23 15:58:40 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Wed, 23 Dec 2009 23:58:40 +0000
Subject: [gutvol-p] Re: Hi, about filtering the license information.
In-Reply-To: <212322090912231555j115f1a63g5966930a5e490c6@mail.gmail.com>
References: <212322090912231441w3f4566a8v65e0f99a610242eb@mail.gmail.com>
	<212322090912231555j115f1a63g5966930a5e490c6@mail.gmail.com>
Message-ID: <212322090912231558i36f4cf27gec9c04c90269c683@mail.gmail.com>

Answering myself. They can.

On Wed, Dec 23, 2009 at 11:55 PM, Paulo Levi wrote:

> Also, can an italic markup span more than one line/paragraph?
>
>
> On Wed, Dec 23, 2009 at 10:41 PM, Paulo Levi wrote:
>
>> Right now I'm preparing the filtering of the license information from
>> books downloaded from Project Gutenberg, and I have 2 doubts:
>> 1) I understand I can replace the license with this string ""This Ebook is
>> for the use of anyone anywhere at no cost and with almost no restrictions
>> whatsoever. You may copy it, give it away or re-use it under the terms of
>> the Project Gutenberg License included in this eBook or online at
>> www.gutenberg.net"" is that enough?
>> 2) Is there any other markup besides the '_' to represent italic
>> boundaries? And does this markup only occur in txt files?
>>
>
>

From i30817 at gmail.com  Thu Dec 24 13:53:36 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Thu, 24 Dec 2009 21:53:36 +0000
Subject: [gutvol-p] Re: Hi, about filtering the license information.
In-Reply-To: <212322090912231558i36f4cf27gec9c04c90269c683@mail.gmail.com>
References: <212322090912231441w3f4566a8v65e0f99a610242eb@mail.gmail.com>
	<212322090912231555j115f1a63g5966930a5e490c6@mail.gmail.com>
	<212322090912231558i36f4cf27gec9c04c90269c683@mail.gmail.com>
Message-ID: <212322090912241353p160e5375n4029413384d8f531@mail.gmail.com>

This is the code I ended up with. The context is that I don't control the lines (so I can't use a regex - I don't know if I have all the input).
I suppose, to be sure, I should employ a StringBuilder to concatenate the lines while they don't match the end of the "tag".

    boolean isMarkupStart = line.startsWith("\n***");
    if (isMarkupStart) {
        isStart = line.contains(START_TAG);
        isEnd = line.contains(END_TAG);
        inTag = isStart || isEnd;
    }
    if (isInValidText && !inTag) {
        super.insertString(offset, line, attr);
    }
    //best i can do. If the string breaks exactly at *\n**
    //or something, i suppose this will break horribly.
    if (inTag && line.endsWith("***")) {
        isInValidText = isStart && !isEnd;
        inTag = false;
    }

From i30817 at gmail.com  Thu Dec 24 14:00:12 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Thu, 24 Dec 2009 22:00:12 +0000
Subject: [gutvol-p] Re: Hi, about filtering the license information.
In-Reply-To: <212322090912241353p160e5375n4029413384d8f531@mail.gmail.com>
References: <212322090912231441w3f4566a8v65e0f99a610242eb@mail.gmail.com>
	<212322090912231555j115f1a63g5966930a5e490c6@mail.gmail.com>
	<212322090912231558i36f4cf27gec9c04c90269c683@mail.gmail.com>
	<212322090912241353p160e5375n4029413384d8f531@mail.gmail.com>
Message-ID: <212322090912241400y2b29de0amf517b9d99785e6a0@mail.gmail.com>

Removed that possibility by using a StringBuilder.

    boolean isMarkupStart = line.startsWith("\n***");
    if (isMarkupStart) {
        isStart = line.contains(START_TAG);
        isEnd = line.contains(END_TAG);
        inTag = isStart || isEnd;
    }
    if (inTag) {
        tagForMatch.append(line);
    } else if (isInValidText) {
        super.insertString(bypass, offset, line, attr);
    }
    //The StringBuilder is to be sure *** is not broken into many lines
    if (inTag && tagForMatch.toString().endsWith("***")) {
        isInValidText = isStart && !isEnd;
        inTag = false;
        tagForMatch.setLength(0);
    }
From i30817 at gmail.com  Thu Dec 24 14:18:34 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Thu, 24 Dec 2009 22:18:34 +0000
Subject: [gutvol-p] Re: Hi, about filtering the license information.
In-Reply-To: <212322090912241400y2b29de0amf517b9d99785e6a0@mail.gmail.com>
References: <212322090912231441w3f4566a8v65e0f99a610242eb@mail.gmail.com>
	<212322090912231555j115f1a63g5966930a5e490c6@mail.gmail.com>
	<212322090912231558i36f4cf27gec9c04c90269c683@mail.gmail.com>
	<212322090912241353p160e5375n4029413384d8f531@mail.gmail.com>
	<212322090912241400y2b29de0amf517b9d99785e6a0@mail.gmail.com>
Message-ID: <212322090912241418mc544474gcf71a75317ebe3@mail.gmail.com>

Also, using _ to represent italics is ... awkward to say the least. Many URLs and other data contain underscores.
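
The buffering idea from the last few messages in this thread can be sketched as a small standalone Java class. This is only a sketch under stated assumptions, not the code from the messages above: the marker prefixes "*** START OF" and "*** END OF" stand in for the START_TAG and END_TAG constants, whose values the thread never gives; the class name LicenseFilter and the methods feed, finish and markItalics are hypothetical; and the <i> tags are a placeholder for whatever styling a reader would apply. Incoming chunks may be cut anywhere, so text is held back until a full line is available before it is checked against the markers. The italic rule only accepts underscores that sit at word boundaries, which is one possible way to leave URLs such as a_b_c.txt alone.

```java
import java.util.regex.Pattern;

/** Sketch: keeps only the text between the START and END marker lines. */
public class LicenseFilter {
    // Assumed marker prefixes; the thread only names START_TAG and END_TAG.
    static final String START_TAG = "*** START OF";
    static final String END_TAG = "*** END OF";

    // Hypothetical rule: an underscore pair counts as italics only when the
    // opening '_' does not follow a word character, '/' or '.', and the
    // closing '_' is not followed by a word character or '/'.
    static final Pattern ITALIC =
        Pattern.compile("(?<![\\w/.])_(\\S(?:.*?\\S)?)_(?![\\w/])", Pattern.DOTALL);

    private final StringBuilder pending = new StringBuilder(); // incomplete last line
    private final StringBuilder body = new StringBuilder();    // accepted text
    private boolean inBody = false;

    /** Accepts a chunk that may end anywhere, even in the middle of a marker. */
    public void feed(String chunk) {
        pending.append(chunk);
        int newline;
        // Only complete lines are examined; the remainder waits for more input.
        while ((newline = pending.indexOf("\n")) >= 0) {
            handleLine(pending.substring(0, newline));
            pending.delete(0, newline + 1);
        }
    }

    /** Handles a final line without a trailing newline and returns the body. */
    public String finish() {
        if (pending.length() > 0) {
            handleLine(pending.toString());
            pending.setLength(0);
        }
        return body.toString();
    }

    private void handleLine(String line) {
        String trimmed = line.trim();
        if (trimmed.startsWith(START_TAG)) {
            inBody = true;            // marker line itself is not copied
        } else if (trimmed.startsWith(END_TAG)) {
            inBody = false;           // everything after this is dropped
        } else if (inBody) {
            body.append(line).append('\n');
        }
    }

    /** Wraps _text_ in <i> tags; DOTALL lets a span cross line breaks. */
    public static String markItalics(String text) {
        return ITALIC.matcher(text).replaceAll("<i>$1</i>");
    }
}
```

A caller would pass each chunk to feed() as it arrives and call finish() once at the end. Older Project Gutenberg files may use differently worded markers, so the two prefixes would need checking against the files actually being processed.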