Unicode UTF-8 Compatible Version of Gutcheck

I've created a Unicode UTF-8 compatible version of Gutcheck, calling it gutcheck_u -- if anyone wants to try it. This was primarily an exercise in finding 8-bit char coding dependencies and changing them to 16-bit widechar coding dependencies, but it did require some additional coding. Currently it is a somewhat Windows-dependent implementation, so running it on another OS would take a little bit of work. I find it more pleasant to use if one's development file format of choice is UTF-8, and/or if one is doing such things as left-handed / right-handed quotes. Testing it on PG released files, I am in fact finding a fair number of left / right handedness errors that are not currently being discovered. Let me know if you want to try it; the distribution terms are the same as the original's. Jim Adcock jimad@msn.com
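To make the 8-bit versus 16-bit distinction concrete, here is a minimal Python sketch. It is not taken from gutcheck_u (a C program whose source is not shown in this thread); the helper name `widths` is made up for illustration. It counts the same text as UTF-8 bytes, as 16-bit code units, and as Unicode code points.

```python
def widths(text):
    """Return (UTF-8 bytes, UTF-16 code units, code points) for `text`."""
    utf8_bytes = len(text.encode("utf-8"))
    utf16_units = len(text.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units, len(text)


# A word wrapped in left and right curly double quotes.
quoted = "\u201cQUEECHY,\u201d"
print(widths(quoted))

# A byte-oriented test never sees a quote character, only its three bytes.
print(quoted.encode("utf-8")[:3])

# Characters outside the Basic Multilingual Plane take two 16-bit units,
# so a 16-bit widechar program cannot assume one unit per character either.
print(widths("\U0001D11E"))  # MUSICAL SYMBOL G CLEF
```

Each curly quote is a single 16-bit unit but three UTF-8 bytes, which is why a per-character test written for 8-bit chars has to be reworked rather than just recompiled.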

I'd like to give it a try. Can you upload it somewhere? Or you can send me a copy as a zipped attachment. And if you can mention the etext numbers of several of the files you tested it against, I can cross-check them with Gutcheck and Bookloupe (http://www.juiblex.co.uk/pgdp/bookloupe/index.html).
Al
(In reply to James Adcock's message of Wednesday, February 25, 2015, 7:49 PM.)

Off the top of my head, try 48325, which demonstrates the handedness issue.
(In reply to Al Haines's message of Wednesday, February 25, 2015, 9:37 PM.)

Interesting--gutcheck_u reported several mismatched double quotes, but all were explainable, e.g. poem fragments inside a quoted paragraph, or two quoted paragraphs with an illustration between. On the other hand, there were many mismatched left/right single quotes--evidently DP's conversion of ASCII single quotes to Unicode left/right single quotes needs some work. I'll do some more testing of gutcheck_u while WWing submissions. It would be very handy if someone could upgrade Gutspell, which doesn't work properly on texts containing Unicode quotes (single and double), em-dashes, etc.
Al
(In reply to James Adcock's message of Thursday, February 26, 2015, 10:24 AM.)
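As a rough illustration of the kind of report described above, the following Python sketch counts left and right curly double quotes per paragraph. It is an assumption about how such a check could work, not gutcheck_u's actual logic; the function name and the blank-line paragraph split are hypothetical choices.

```python
LEFT, RIGHT = "\u201c", "\u201d"


def unbalanced_double_quotes(text):
    """Return (paragraph_number, opened, closed) where the counts differ."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    problems = []
    for i, para in enumerate(paragraphs):
        opened, closed = para.count(LEFT), para.count(RIGHT)
        if opened == closed:
            continue
        # Typographic convention: a speech continuing into the next paragraph
        # omits its closing quote, and the next paragraph re-opens with one.
        nxt = paragraphs[i + 1].lstrip() if i + 1 < len(paragraphs) else ""
        if opened == closed + 1 and nxt.startswith(LEFT):
            continue
        problems.append((i + 1, opened, closed))
    return problems


sample = (
    "\u201cHello,\u201d she said.\n\n"
    "\u201cThis speech runs on.\n\n"
    "\u201cAnd it ends here.\u201d\n\n"
    "\u201cOops, never closed.\n\n"
    "Plain narration."
)
print(unbalanced_double_quotes(sample))
```

A check this simple would still flag the explainable cases mentioned above, such as a poem fragment or an illustration sitting between two quoted paragraphs, so its output is a list of places to look rather than a list of errors.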

Huh. Not obvious to me what he is doing in bookloupe. Jim.

It may take me a little while, but I'm happy to try and answer any questions you have about bookloupe... Ali. On Thu, Feb 26, 2015, at 06:30 PM, James Adcock wrote:
Huh. Not obvious to me what he is doing in bookloupe.
Jim.
_________________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

Bookloupe supports UTF8, but does the same checks as Gutcheck. It doesn't check Unicode quotes. No idea what its developer's plans may be for adding such functions. There's a thread for it in DP's Tool Development forum (http://www.pgdp.net/phpBB2/viewforum.php?f=13). It's one of the Sticky topics.
Al
(In reply to James Adcock's message of Thursday, February 26, 2015, 10:30 AM.)

The alpha versions do have (some) support for additional checks that have been requested and approved by Charlie. I'm hoping to get back to it in April, finish off the pending requests and get a new version released.
Ali.
On Thu, Feb 26, 2015, at 08:15 PM, Al Haines wrote:
Bookloupe supports UTF8, but does the same checks as Gutcheck. It doesn't check Unicode quotes. No idea what its developer's plans may be for adding such functions. There's a thread for it in DP's Tool Development forum (http://www.pgdp.net/phpBB2/viewforum.php?f=13). It's one of the Sticky topics.
Al

On 2/26/2015 1:30 PM, James Adcock wrote:
Huh. Not obvious to me what he is doing in bookloupe.
Bookloupe was intended to be the UTF-8-aware replacement for gutcheck. However, if I remember correctly, at the time the author wrote it, PG wanted it to report all the same errors as gutcheck would, or they would not agree to switch over to using it. So it ended up only partially UTF-8-aware, since with full UTF-8 awareness it wouldn't report some errors that gutcheck would have reported. But as far as I know, Bookloupe is what PG now uses, not gutcheck. -- Walt

Would like to try gutcheck_u too - a copy anyone? - thanks!
Marc dH FreeLiterature.org <http://www.freeliterature.org> @FreeLitOrg <https://twitter.com/FreeLitOrg/> (on Twitter) @gutenberg_org <https://twitter.com/gutenberg_org> Project Gutenberg <https://www.facebook.com/project.gutenberg> (official facebook page) Project Gutenberg G+ <https://plus.google.com/u/1/b/102625479490584889340/+gutenberg/posts/p/pub>
(In reply to the message from Walt Farrell <walt.farrell@charter.net> of 26 February 2015, 20:07.)

Actually, the WWers pretty much have had to use both since PG decided to accept only a single text file of whatever character set was needed to express the source material.
To use 48325 as an example, Gutcheck sees this line (line 27 of the submitted file):
_“THE WIDE, WIDE WORLD,” “QUEECHY,”
like this:
_â€œTHE WIDE, WIDE WORLD,â€ â€œQUEECHY,â€
resulting in a "missing space" warning. If the line is too long, Gutcheck will also issue a "line too long" warning. Bookloupe sees the line properly, but doesn't/can't check the quotes. Gutspell sees the strange characters as part of the word, and flags the word as a misspelling. This isn't limited to quotes--words adjacent to, or that contain, Unicode em-dashes, apostrophes, or ellipses cause the same problem.
To check quotes, and to have Gutspell work properly, the WWers have to use Unitame to generate a temporary Latin1 text file from the UTF8 file, and run Gutcheck/Jeebies/Gutspell on the Latin1 file.
As a WWer, I don't mind UTF8 text files that contain necessary Unicode characters, e.g. Greek, etc, but UTF8 text files that contain only Unicode quotes, apostrophes, em-dashes, and similar NON-necessary Unicode characters, are just a nuisance. If a text file doesn't need Unicode characters, submit a Latin1 text file. It's a lot easier on the WWers.
Al
(In reply to Walt Farrell's message of Thursday, February 26, 2015, 11:07 AM.)
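The effect described for line 27 of 48325 can be reproduced with a short Python sketch. It assumes the 8-bit tool treats each UTF-8 byte as one Windows-1252 character, which is a guess about the display side and not something stated in the thread; the function name is hypothetical.

```python
def as_seen_by_8bit_tool(text, codec="cp1252"):
    """Decode the UTF-8 bytes of `text` as if each byte were one character."""
    # errors="replace" covers bytes the single-byte code page leaves
    # undefined, such as 0x9D in Python's cp1252 codec.
    return text.encode("utf-8").decode(codec, errors="replace")


line = "_\u201cTHE WIDE, WIDE WORLD,\u201d \u201cQUEECHY,\u201d"
seen = as_seen_by_8bit_tool(line)
print(seen)
print(len(line), len(seen))
```

Every curly quote turns into three single-byte characters glued to the neighbouring word, which is consistent with both the "missing space" warning and the spell checker treating the debris as part of the word.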

Depends on the character(s). I occasionally encounter Unicode characters in my projects, and their handling is a judgement call. For example, I render oe-ligature as either "oe" or "[oe]". Greek words or short phrases, I transliterate. (Lots of Greek, I avoid.) I recently did a book that contained symbols for the sun and planets, but the book didn't explain what the symbols meant--I had to research which symbol meant which planet. I decided to render the symbols as words, e.g. [Sun], [Earth], etc, figuring that would be easier on the average reader.
PG's posting software, as opposed to its text-checking software, can handle any UTF8 file (text and HTML) in any language. I've posted all manner of Greek, Chinese, Japanese, and even a couple of Tagalog submissions.
As for "upgrading the toolset", none of the WWers has the programming skills to upgrade software like Gutcheck/Jeebies/Gutspell to properly handle UTF8 files, check curly quotes, etc, etc.
Al
On Thursday, February 26, 2015 5:21 PM, James Adcock wrote:
If a text file doesn't need Unicode characters, submit a Latin1 text file.
Whenever I've checked, I find I have a couple chars that fall outside of Latin1. Do I pimp the submission then, or god-forbid should the WW'ers upgrade their toolset?
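Whether a file really needs more than Latin1 is easy to check mechanically. The sketch below is a hypothetical helper, not part of any PG tool: it lists every character that Latin-1 cannot encode, with a count and its Unicode name, so a submitter can see whether the exceptions are Greek, ligatures, or only curly quotes and dashes.

```python
import unicodedata


def outside_latin1(text):
    """Map each character Latin-1 cannot encode to (count, Unicode name)."""
    found = {}
    for ch in text:
        if ord(ch) > 0xFF:  # Latin-1 covers U+0000 through U+00FF only
            count, name = found.get(ch, (0, unicodedata.name(ch, "UNNAMED")))
            found[ch] = (count + 1, name)
    return found


sample = "C\u0153ur \u2014 \u201cna\u00efve\u201d caf\u00e9"
for ch, (count, name) in sorted(outside_latin1(sample).items()):
    print(f"U+{ord(ch):04X} x{count} {name}")
```

In the sample, the accented letters pass because they are in Latin-1, while the oe-ligature, the em-dash and the curly quotes are reported.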

On Thu, Feb 26, 2015 at 5:43 PM, Al Haines <ajhaines@shaw.ca> wrote:
As for "upgrading the toolset", none of the WWers has the programming skills to upgrade software like Gutcheck/Jeebies/Gutspell to properly handle UTF8 files, check curly quotes, etc, etc.
Nobody needs text files any longer. Drop the requirement and drop the software needed to check the text files. Ask submitters to send well-formed XHTML, which can be created with many flavors of software. Format as epub and mobi. That's really all that is needed these days. ebooks@adelaide is doing it right.
The original point of the text files was that they were supposed to be a fallback for people who only had old, slow, desktop computers. Perhaps running CP/M :) No longer the case. Poor Third-World users of PG files want epubs they can read on their phones. They cannot afford desktops, or laptops, but they can afford cheap refurbished phones.
Distributed Proofreaders could drop the arcane rules of the formatting stage (required to format text files to PG requirements). They could stop requiring PPers to install outdated software that may be a huge hassle to install on modern operating systems. I could take P3 output, check it with Perfectit, run it through NoteTab Pro for the XHTML, drop it into an epub structure in Oxygen, and get a book out in a few days. Less time if I had a partner to do the HTML.
Perhaps we could just fork the workflow and see how a new, modern process would work.
-- Karen Lofstrom

As for "upgrading the toolset", none of the WWers has the programming skills to upgrade software like Gutcheck/Jeebies/Gutspell to properly handle UTF8 files, check curly quotes, etc, etc.
Granted, Windows-only right now, but I did gutcheck_u [Unicode version of gutcheck], and now I am sending Al a gutspell_u [ask me if anyone else wants it.] I don't think it's hard to get you better tools if you ask for them on this forum. Just don't say for example "it's got to work exactly like gutcheck except on Unicode" - because that description is an oxymoron. Firstly it's hard to track down and exactly duplicate all the bugs that are already in there, and secondly, the checks one would want on a Unicode file are not identical to the checks one would want on an ASCII file.

Huh. Not obvious to me what he is doing in bookloupe.
What I meant was that it was not obvious to me what kind of coding approach was being used in bookloupe. What I did in gutcheck_u was a straightforward translation of the programming from being an 8-bit char program to being a 16-bit widechar program. There were some places in gutcheck where valid glyph-range assumptions were hard-wired in for the 127-255 range (an amalgamation of code pages), and those needed to be changed to something more “reasonable”, assuming that people are using Unicode, and using Unicode, hopefully, for some sensible reason. And there were a large number of implied-handedness tests for straight-quote and straight-double-quote and/or straight-apostrophe, and those tests all become somewhat easier and more sensible on the Unicode curly handedness versions.
In any case gutcheck_u attempts to check the sanity of both the straight and curly versions, whatever it finds, and if it finds a mix [at least within a paragraph], it should report that too. The straights tests are the same as before, assuming I didn’t accidentally break something.
Probably what gutcheck_u ought to have is some switches to say which of the Latin code pages are being used, vs. not used, so that the testing can be more reasoned. It should also probably do a better job of passing common non-alphas and querying uncommon non-alphas, since there are a lot more non-alphas out there in the Unicode world which PG submitters don’t actually intentionally use very often.
In any case I’m thinking that submitters are going to be wanting to submit curlies more and more often. The straights are beginning to look increasingly anachronistic.
Best. Jim A.
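Two of the checks described above, reporting a mix of straight and curly quotes within a paragraph and querying uncommon non-alphabetic characters, can be sketched in Python as follows. This is only an illustration under stated assumptions: gutcheck_u is a C program whose source is not shown in this thread, and the `COMMON` whitelist here is invented for the example.

```python
import unicodedata

STRAIGHT = {'"', "'"}
CURLY = {"\u2018", "\u2019", "\u201c", "\u201d"}
# Hypothetical whitelist of non-letters that pass without comment.
COMMON = set(" \n.,;:!?()[]-_*/&") | STRAIGHT | CURLY | {"\u2014", "\u2026"}


def check_paragraph(para):
    """Return a list of human-readable warnings for one paragraph."""
    warnings = []
    chars = set(para)
    if chars & STRAIGHT and chars & CURLY:
        warnings.append("mixes straight and curly quotes")
    for ch in sorted(chars):
        if ch.isalnum() or ch in COMMON:
            continue
        # Anything else is queried by code point and name, not rejected.
        name = unicodedata.name(ch, "UNNAMED")
        warnings.append(f"query U+{ord(ch):04X} {name}")
    return warnings


print(check_paragraph("\u201cHello,\u201d he said. \"Bye.\""))
print(check_paragraph("The sign \u2609 means the Sun."))
```

Using a whitelist plus a query for everything else avoids hard-wiring a numeric range such as 127-255, which is the kind of assumption that had to be removed from the 8-bit code.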
participants (6)
- Al Haines
- J. Ali Harlow
- James Adcock
- Karen Lofstrom
- Marc D'Hooghe
- Walt Farrell