
jon said:
What type of global corrections were these?
the type that is made easy by my tool. that's all i'll say for now.
One area is how to handle hyphenation, and whether there was a short dash in the compound word in the first place before the typesetter hyphenated the word.
as i said, i ignored the issue of hyphenation for the time being. my tool will give a number of ways to deal with hyphenation, but the routines haven't been brought into the current version. but i can give a general overview. end-line hyphenation is removed. the hyphen in compound words is retained. to tell the difference, when there is ambiguity, you look at the rest of the text, to see if the word was handled consistently there. if it was, you match that. if not, you have more work to do. that's where it gets interesting. to go any further is to give too much information for here and now.
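(for concreteness, a minimal sketch of the consistency check just described, in python. it is illustrative only, not the actual tool; the function name and the assumption that the full text is available as one string are mine.)

    import re

    def resolve_eol_hyphen(head, tail, full_text):
        """decide how to rejoin a word split as 'head-' at the end of a line.

        returns the closed-up form if it is attested elsewhere in the text,
        the hyphenated compound if that form is attested, and None when the
        rest of the text gives no guidance (a human decision is needed)."""
        closed = head + tail
        hyphenated = head + "-" + tail
        closed_hits = len(re.findall(r"\b%s\b" % re.escape(closed), full_text, re.IGNORECASE))
        hyphen_hits = len(re.findall(r"\b%s\b" % re.escape(hyphenated), full_text, re.IGNORECASE))
        if closed_hits and not hyphen_hits:
            return closed        # end-line hyphenation: drop the hyphen
        if hyphen_hits and not closed_hits:
            return hyphenated    # genuine compound word: keep the hyphen
        return None              # ambiguous either way: flag for review

    # resolve_eol_hyphen("to", "morrow", text_of_book) returns "to-morrow" or
    # "tomorrow", depending on which spelling the rest of the book actually uses.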
Hopefully you used the 600 dpi bitonal which should OCR the best.
i did.
Antialiasing actually causes problems (notwithstanding the much lower resolution.)
right. i first thought the periods misrecognized as commas were the effect of anti-aliasing, but i used the 600-dpi scans. so it must be something else causing that problem.
One thing you could do is to look at the 600 dpi pages at 100% size for which the punctuation was not correctly discerned. You probably will see some errant pixels that fooled the OCR into thinking it was some other punctuation mark than it is.
i didn't care that much, really. the post-o.c.r. software can solve the problem well enough. i mentioned it for the record, for the sake of full disclosure, and to see if anybody knew why.
punctuation is a toughie for OCR to exactly get right,
even if the recognition is admittedly somewhat difficult, i expect abbyy to correct "mr," and "mrs,", for instance. but even if abbyy doesn't, that's easy for me to program.
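(purely illustrative, not abbyy and not the tool in question: a comma right after a known abbreviation is almost certainly a misread period, so a one-line global fix handles it. the abbreviation list here is a made-up sample.)

    import re

    ABBREVS = ["Mr", "Mrs", "Dr", "St", "Jr"]   # sample list, not exhaustive

    def fix_abbrev_commas(text, abbrevs=ABBREVS):
        """replace 'Mr,' / 'Mrs,' etc. with 'Mr.' / 'Mrs.' throughout."""
        pattern = r"\b(" + "|".join(abbrevs) + r"),"
        return re.sub(pattern, r"\1.", text)

    # fix_abbrev_commas("Mr, Shimerda greeted Mrs, Burden.")
    #   ->  "Mr. Shimerda greeted Mrs. Burden."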
Resolving this usually requires a human being to go over it, especially for Works from the 18th and 19th centuries, where compound words with dashes were much more common
if you want to retain those arcane spellings, it's difficult. if you wanna update them, the computer does it very easily. "to-day" and "to-morrow" become "today" and "tomorrow". instantly.
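(one way such a global change could look, as a sketch only; the word list is a small sample, not exhaustive, and not anyone's actual code.)

    MODERNIZE = {"to-day": "today", "to-morrow": "tomorrow", "to-night": "tonight"}

    def modernize_compounds(text, table=MODERNIZE):
        """apply the archaic-to-modern spellings in one pass, preserving
        sentence-initial capitalization."""
        for old, new in table.items():
            text = text.replace(old, new)
            text = text.replace(old.capitalize(), new.capitalize())
        return text

    # modernize_compounds("To-morrow we finish to-day's work.")
    #   ->  "Tomorrow we finish today's work."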
Sometimes one has to see what the author did elsewhere in the text.
is there some reason you think the computer can't do that?
In a few cases a guess is necessary based on understanding what the author did in similar cases in the text.
oh, i see. it takes "understanding". one of those rare precious human-being things. well then, i guess there's no way to program it.
Some of this can be automated. In other cases it requires a human being to make a final decision. I followed the UNL Cather Edition here.
it's always easier to let other people make the decision, isn't it? ;+)
Whether or not it is an "unnecessary" distraction, it is better to preserve the original text in the master etext version.
well see, jon, that's where i differ with you. and other people do too. but like i said, as long as it's just one global change away, no big deal. i see lots of other cases, as well, where you diverge from the paper. a good many of the quotation-marks are set apart from their words. you're making editorial decisions whether you acknowledge it or not.
My thinking is that if someone wants to produce a derivative "modern reader" edition of "My Antonia", they are welcome to do so and add it to the collection because the original faithful rendition is *already* there.
whose "collection" are we talking about here jon? yours? do you have any intention of adding more "my antonia" editions? specifically a "derivative modern reader"? if so, i will submit mine. but surely you don't mean michael hart's project gutenberg collection? because, according to you anyway, he doesn't have a "faithful" rendition in his library, not even one, not *already* anyway. just a mangled one. another difference between your collection and michael's is you have 1 book in your collection and he has 10-15 thousand in his collection, depending on who is in charge of defining how the official counting is tabulated these days, it appears. whether you like it or not, that's a comment on the philosophies.
indicating this was more of a typesetter's convention rather than something Cather specified.
well that's a convenient dodge, isn't it? and of course you have no real _evidence_ that this is the case, do you? so you _really_ should enter each case as it _appears_, shouldn't you? at least if you want to stick to your philosophy?
In addition, the UNL Cather Edition closed off all the apostrophe s (no spaces), but kept the space for many of the " n't" words. So here again I followed the UNL Cather Edition.
and that's the difficulty with following an authority, ain't it? there are often so many, it's hard to know which one to follow! i know i can't keep up even with the editions of this one book! so how would a person possibly keep up on tens of thousands! and before you know it, you're having arguments about _that_! and not reading the book, or digitizing it, or playing at the park. and i don't know about you, jon, but i don't think you're being consistent. you said you were reproducing what is right there in black-and-white on the page itself, even made high-resolution scans to prove it to us, and now you're making judgment calls that are easy to spot. and to justify it, you're quoting some other figure of "authority". that's inconsistent. but heck, i have to be honest here. even if you _were_ consistent, and kept all of those quirks from the paper-book that _i_ consider to be distracting, the first thing i'm gonna do is global-change 'em. so all that hard work you did was for no good purpose to me.
Cather wanted the line length to be fairly short, so this puts extra pressure on typesetters who will either have to extend character spacing for a particular line or scrunch it up more than usual, depending upon the situation with the rest of the typesetting on the page, and whether certain words can be hyphenated or not.
oh!, hold it!, wait!, did i just hear you say what you just said? i think i did! yes, i'm quite sure i did! "cather wanted the line length to be fairly short". wow. you mean author-intent can go to _the_length_of_lines_? do you realize how significant that is to your philosophy, jon? it means you will need to respect willa's wishes on the matter. none of the long lines you might get in a web-browser! no sir! willa wanted short lines! (is that why the book looks so narrow?)
You mean accented characters?
if they aren't in the lower-128 of the ascii range ("true ascii"), yes.
Accented characters are *always* important to preserve under all situations.
according to you, maybe. according to me, it depends. in this case, i say no. that's my prerogative as an editor. (and i _do_ consider myself an editor, not just a copyist.)
There's no need anymore, in these days of Unicode and the like to stick with 7-bit ASCII.
until unicode works flawlessly on every machine used by all the people i know, for texts like this that have only the occasional character outside the lower 128, where the meaning isn't changed, i'll stick to plain ascii.
I sense that you don't want to properly deal with accented characters
first of all, jon, i define what "properly" means for me, you don't. you can define it for yourself. but i won't let you define it for me.
I sense that you don't want to properly deal with accented characters since this poses extra problems with OCRing and proofing,
nope. it's just that i see them as _unnecessary_ to this book. if a reader thinks it _is_ necessary, make the global-change.
something you are trying to avoid in your zeal to get everything to automagically work. To me, that's going too far in simplifying.
i'm not "simplifying". i'm consciously making a choice to use something that will work on the broad range of machines out there, as opposed to something that -- in far too many cases -- fails badly. it's a pragmatic decision based on real-life knowledge of the actual infrastructure of machines that exist out here in our real world. it's the same pragmatic decision that michael made when he crafted the philosophy guiding the building of this library of 10,000+ e-texts, in sharp contrast to your philosophy, which has built a 1-book library.
Preserving accented characters is important.
in some cases, i'd agree with you. in others, not. in this case, not.
punctuation changes can sometimes subtly affect the meaning.
you know, as a writer, i'd really like to think that's possible. as a person who uses a lot of commas, i _want_ to believe it. but i'll be darned if i can think of that many good examples. if you can, i would _love_ to hear them. and if you can show me _any_ in "my antonia", any at all, i'd give you extra bonus points. as it is, though, i just have to resign myself to the position that o.c.r. punctuation errors are a distraction, but make no difference. i'll still root them out, due to my sense of professionalism, but i sure wish it felt _fun_, instead of feeling like _doing_chores_. and to the extent that i can automate the chores, i'll be _happy_.
They are hopefully caught by human proofers/readers when grammar checkers don't (I do use Word to help find both spelling and punctuation errors -- when they find something, I then manually check it in the page scans and the master XML.)
oh, so you _do_ use an assist from your tools at times. that's good.
They are "sometimes" easy to spot. Other times the automatic routines will not catch errors
maybe the automatic routines you are using are just inferior. use my tool. if it doesn't spot something it should, let me know.
Usually true, but there are some rare exceptions where an abbreviation can be mistaken for an end of a sentence.
not if your routines are as smart as mine are.
Then there's the ellipsis issue
i'm three-dozen layers deep on some of these issues, and you want to talk about level 2. i'm not interested. use my tool. if it doesn't give you the results you want, let me know.
This is also true, but as found in "My Antonia", there are exceptions to pure nesting, such as when a quotation spills over into several paragraphs where the intermediate paragraphs are not terminated by an end quotation mark (whether single or double.)
is it really your considered opinion that i don't know this? that i haven't factored it into my thinking _and_ my tools? maybe you're grandstanding to the lurkers, but my goodness, jon, do you really think that _they_ are that stupid too?
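(for concreteness, a rough sketch of a plain-text quote check that tolerates the multi-paragraph convention jon describes above. it is illustrative only, nobody's actual routine, and it assumes the text has already been split into paragraphs using straight double quotes.)

    def check_double_quotes(paragraphs):
        """flag paragraphs whose double-quoting looks suspicious, while
        tolerating a quotation that spans several paragraphs: each new
        paragraph re-opens with a quote mark, but only the last one closes."""
        open_carryover = False
        for i, para in enumerate(paragraphs):
            if open_carryover and not para.lstrip().startswith('"'):
                yield i, "previous paragraph left a quotation open"
            # an odd count of straight quote marks means the quotation
            # is still open going into the next paragraph
            open_carryover = (para.count('"') % 2 == 1)
        if open_carryover:
            yield len(paragraphs) - 1, "quotation still open at end of text"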
Also, apostrophes are sometimes confused with single right quote marks.
ditto.
With a smart enough grammar and parser, the above might be properly parsed and the
blah blah blah. use my tool. if it doesn't figure out your stuff, let me know.
But still, real-world texts tend to throw a lot of curve balls that are sometimes hard to correctly machine process.
i know how to hit 87 different pitches, from both sides of the plate, and you're telling me to "watch out for the curve balls". i laugh at you.
OCR is quite fast. It's making and cleaning up the scans which is the human and CPU intensive part.
wait! i thought you said _proofreading_ and _mark-up_ were the steps that take up the most time. didn't you? or do i have you confused with someone else?
Well, not all of the pages have been doubly proofed. The team is not finished, and I plan to post a plea somewhere for more eyeballs to go over it.
have you heard about distributed proofreaders? might be able to find some people there... (ok, now you see what it feels like.)
I would like to receive error reports as well for this text,
i'll tell you the same thing i told michael about project gutenberg: set up a system for the checking, reporting, correction, and logging of errors, a system that is transparent to the general public, and i will be more than happy to report errors to you, and help you out. otherwise, you waste my time, as i figure someone else can do it. which, by the way, is what everyone else is thinking. which is why errors in the texts are not being reported at nearly the frequency that they should be being reported. but i've got another message sitting here waiting to be sent where i discuss that topic in more detail, so i'll stop here now.
since Brewster wants highly proofed texts for some experiments he plans to run similar to yours.
i'll have to ask him about his tests.
But if I have to use the version you donate to PG, so be it. :^)
probably, yep. if michael wants it. they say he'll take just about anything...
I did find one error in my text based on the list you gave. Thanks.
you're welcome. but that's not the one i was talking about. :+)
I assume you discovered the several different paragraph breaks in the PG edition?
nope. i didn't even invoke the routines to examine paragraph-breaks. i considered doing so, once you said that there were differences, but decided it was just too inconsequential to even bother with it. it's another one of those things where i would very much like to see a case where it made a difference, because i'd love to believe it _could_, but in the absence of a case (or even an _imaginary_ possibility, which i confess i can't come up with, not off the top of my head), i am forced to relegate it to the "too trivial to think about" pile. as above, i'll make the corrections, but i ain't gonna sweat 'em... -bowerbird

Bowerbird wrote:
jon said:
Some of this can be automated. In other cases it requires a human being to make a final decision. I followed the UNL Cather Edition here.
it's always easier to let other people make the decision, isn't it? ;+)
It's always *smarter* to leverage the experience and knowledge of others. The idea behind the "Trusted Edition" concept is to mobilize the help of both professional scholars and amateur enthusiasts, using community-oriented tools and processes, to assist with understanding the specific and unique bibliographic details of any particular Work. (Interestingly, when it comes to the more obscure public domain Works, which is the vast majority of them, they were only published once in one printing, so with respect to figuring out which edition is "acceptable" or "authoritative", it is pretty cut-and-dried. It's the famous classics, especially the much older ones which are written in some archaic fashion or in another language, where it can get quite complicated as to what is/are the acceptable editions to use as source(s). Nevertheless, for the classics most of this has already been hashed out, and where there is no agreement between any two, do *both* of them!)
Whether or not it is an "unnecessary" distraction, it is better to preserve the original text in the master etext version.
well see, jon, that's where i differ with you. and other people do too.
And there are people who also agree, at least in general, with my position. I'm not alone on this. You make it out to be like I'm alone on this, like a "John the Baptist" in the desert.
but like i said, as long as it's just one global change away, no big deal.
The problem is that sometimes global changes are easy to do in one direction, and much harder to do the other. When information is removed, such as converting accented characters to 7-bit ASCII with no traceback information, it is harder to go in the other direction because information has been lost.
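For concreteness, here is the easy, lossy direction as a minimal sketch in Python, using the standard unicodedata module (illustrative only, not part of any toolchain discussed here):

    import unicodedata

    def strip_accents(text):
        """Fold accented characters to plain ASCII by decomposing them and
        dropping the combining marks. This direction is trivial; nothing in
        the output records which letters carried marks, so reversing it
        needs outside information (scans, a word list, or a change log)."""
        decomposed = unicodedata.normalize("NFD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # strip_accents("My Ántonia was naïve")  ->  "My Antonia was naive"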
i see lots of other cases, as well, where you diverge from the paper. a good many of the quotation-marks are set apart from their words. you're making editorial decisions whether you acknowledge it or not.
There's not lots, but a few. The focus is to produce a *textually* accurate rendition which is presentationally-agnostic wherever possible. We took to heart a lot of the online information provided by the UNL Cather Edition because that is the smart thing to do. We *are* in contact with a couple of scholars of Willa Cather's works besides the UNL folks. To ignore expert advice is, to put it bluntly, stupid. And we are putting together a preliminary list of the top 500/1000 classic public domain works, and should the project launch, we plan to get these rigorously converted along the lines of "My Antonia", and to mobilize the help of the professional *and amateur* enthusiasts to help guide the process.
My thinking is that if someone wants to produce a derivative "modern reader" edition of "My Antonia", they are welcome to do so and add it to the collection because the original faithful rendition is *already* there.
whose "collection" are we talking about here jon?
yours?
The collection (of one so far, and it is essentially a working demo for learning purposes) does not state to be "Jon Noring's" collection. Go to http://www.openreader.org/myantonia/ and tell me what it says there, and if it prominently mentions my name. There's another name given to it. Just because I'm the most visible person with regards to it here, does not mean it is mine. It is not. It is part of a fairly visible project mobilizing a group of people (but not visible on the particular forums you frequent, and not by the specific name, which doesn't matter.)

Should this project go into production mode, what is produced will belong to the world. It's not going to be elitist or exclusive as some other etext projects are (I'm not talking about PG obviously) -- all work product will be made publicly available, as it should be since it is from the Public Domain.

Anyway, what is this strange obsession with ownership and competition? Why do you keep talking about PG being Michael Hart's (more on this below)?
do you have any intention of adding more "my antonia" editions? specifically a "derivative modern reader"? if so, i will submit mine.
Sure. So long as the changes from the original acceptable source are sufficiently noted in the text file, such as an "Editor's Introduction", some boilerplate, or whatever you want to add. I'm not sure if you have an interest in taking the time to provide such editorial information, but we'll be happy to take your edited version and mark it the "Bowerbird Modernized Edition" or whatever. I am thinking of providing my own modernized edition as well (which will have very few changes in the case of "My Antonia".) ("Sufficiently noted" does not mean to spell in gory detail each and every change, but enough info so the reader will have a good general idea of how it was "modernized". Readers will appreciate the thoroughness expended to modernize a text for them, and will have warm fuzzies that it is "accurate" when the editor *takes the time* to explain what they did. This builds *trust* with the reader.)
but surely you don't mean michael hart's project gutenberg collection?
So? What do you care? Is there a law saying any digital text version of a public domain work *must* be submitted to PG? Does PG have a government monopoly on the Public Domain? Of course not.

And about this strange fixation you have on "ownership", PG is no longer "Michael Hart's". You seem to fail to understand that PG now belongs to the hundreds/thousands who have materially contributed to building it. (DP has greatly increased the ownership of PG several fold by its cool way of mobilizing thousands of volunteers.) Michael Hart is the pioneer and founder of the PG idea, but PG has gone well beyond him. He can die tomorrow (hopefully not!), and what he has started will continue unabated. If it were still his, it may die with him. [An outside example is the World Wide Web -- does it still "belong" to Tim Berners-Lee because he invented the general idea and some of the early standards and tools for it? If Tim Berners-Lee dies tomorrow, will the plug be pulled on the Web?]

When you produce this magical "toolset" of yours and give it away to others to use (or do you plan to sell it?), it will no longer be "yours". So, should you die tomorrow (hopefully not!), will there be a community of people who will take all your ideas and code and continue on where you are now? Or will it die with you? So much for the benefits of ownership and control.

This is why just about everything we've done for "My Antonia" is *already* online and downloadable, even though it is still an early beta/demo shake-out of things. There is more to put up. The Bible mentions "casting one's bread upon the waters, and it will be returned to you." The complementary logic of this is that those who develop their tools in secret, who don't strive to build partnerships with other like-minded folk, who are not transparent, etc., etc., are not casting their bread upon the water, and thus may not find the kinds of rewards they seek.

Interestingly, Michael Hart cast his bread upon the water, and it has returned more than a hundred-fold. Of all the great contributions Michael Hart has made, it is to inspire a volunteer movement. I do have problems with how the earlier PG collection has been assembled (which DP has mostly, but not completely, resolved), but I recognize that Michael Hart has accomplished a lot *because he cast his bread upon the waters*. He did not do his thing in secret, and he welcomed volunteers from the beginning. DP is a result of his vision, of his casting his bread upon the waters.

Even his PGII concept (which I think is ill-conceived for various reasons not germane to this particular discussion) is an attempt to expand the PG collection by embracing other collections into one big happy tent. And he talks about giving away trillions and trillions of etexts for free. I like this attitude. He is giving away, not taking. He is open and transparent -- he does not keep everything secret. If he were developing software, he would immediately open source it and ask for others to help write it. It will be free for all from the start. He does not keep his light underneath a blanket.

So if Michael Hart is your hero, then consider emulating his example. I think you catch the drift. That's why I keep asking when you plan to start a SourceForge or similar open source project to develop your system.
because, according to you anyway, he doesn't have a "faithful" rendition in his library, not even one, not *already* anyway. just a mangled one.
With respect to PG's current "My Antonia". Yes, it is mangled. More importantly, it is not trustworthy, which goes beyond just errors or differences. I discussed this on TeBC, which I know you've read (either from an anonymous account or via a friend who forwards messages. I don't really care.) And of course, in my discussion of the whole PG corpus, I carefully differentiate between the DP and the non-DP portions of it -- I've done this from the beginning. How convenient you ignore this important fact.
another difference between your collection and michael's is you have 1 book in your collection and he has 10-15 thousand in his collection, depending on who is in charge of defining how the official counting is tabulated these days, it appears. whether you like it or not, that's a comment on the philosophies.
Hmmm. <laugh/> Sounds a lot like a school yard taunt: "Let's compare yours and mine and we'll see whose is bigger -- drop your pants..." So what? How many etexts did Michael have in "his" collection in 1991? Every journey starts with the first step.

And why do you say "my collection" (in reference to the LibraryCity "My Antonia" project)? Why this obsession with possession and ownership: "My tool", "My idea", "My whatever"? And why do you view everything in a competitive light, rather than complementary and collaborative? In these days of open source development, collaborative efforts, etc., your approach to do everything in secret is really odd and out-of-synch. Why don't you cast your bread upon the waters and see what happens? Or are you afraid your bread won't return to you multiplied?
indicating this was more of a typesetter's convention rather than something Cather specified.
well that's a convenient dodge, isn't it?
No.
i know i can't keep up even with the editions of this one book! so how would a person possibly keep up on tens of thousands!
The idea of "Trusted Editions" as an archetype is that it won't rely on any one person. It is part of a bigger picture of building communities around noted etexts. To mobilize people. To not only bring digital texts to people (as PG has been doing), but to also bring people and community to digital texts (which PG is NOT doing now.) But so far I don't see much interest in your "calculus" to understand the important role people play in etexts, from creation to final use. And that the most viable contributions to Mankind come from when people are mobilized in a cooperative/community way (either in a non-profit open source approach, or in a private for-profit approach using employees and contractors.) Technology is to provide tools to make a community of people work better together for a common end-goal, not to replace community. And the word "trust" is an important core human concept -- society works only when there is sufficient trust between people, and trust in the various products of their labors. So any human endeavor which does not put "trust" as #1 is prone to eventually fail.
Cather wanted the line length to be fairly short, so this puts extra pressure on typesetters who will either have to extend character spacing for a particular line or scrunch it up more than usual, depending upon the situation with the rest of the typesetting on the page, and whether certain words can be hyphenated or not.
oh!, hold it!, wait!, did i just hear you say what you just said? i think i did! yes, i'm quite sure i did!
"cather wanted the line length to be fairly short".
wow. you mean author-intent can go to _the_length_of_lines_?
do you realize how significant that is to your philosophy, jon?
*rolls eyes*
it means you will need to respect willa's wishes on the matter.
none of the long lines you might get in a web-browser! no sir!
willa wanted short lines! (is that why the book looks so narrow?)
You really need to read less selectively. I've used the phrase "textually faithful" many times the last couple weeks for a reason. The reason? Because it is important that texts transcend the visual as much as possible, to become agnostic with respect to presentation type, yet contain sufficient structure and semantics so quite authentic visual presentation is possible. This is necessary not only for accessibility, but repurposeability and usability. (And this helps Michael Hart's long-term vision of universal language translations of digital texts.)

With the right style sheet, most of Cather's stated preferences are possible to duplicate. There's a reason why the texts are marked up in XML. With one tiny change in the CSS for our "My Antonia" demo, we can duplicate quite well Willa Cather's apparent preference in visual presentation of her book.

Interestingly, the UNL Cather Edition (the print version published by UNL's publishing house) uses longer line lengths and smaller print than Cather specified. They did not deem the exact visual presentation of the content to be as important as the textual faithfulness, even though they discuss it on their web site.
Accented characters are *always* important to preserve under all situations.
according to you, maybe. according to me, it depends. in this case, i say no. that's my prerogative as an editor. (and i _do_ consider myself an editor, not just a copyist.)
Sure, you can call yourself an editor, and do what editors do. But to throw away the richness of the expanded Western character set which many, many public domain books use -- is simply bizarre. This richness is what adds to the aesthetics of the text, and builds a better reading experience. It also *adds* trust because people will see the care you took in doing this -- in sweating out the details.
There's no need anymore, in these days of Unicode and the like to stick with 7-bit ASCII.
until unicode works flawlessly on every machine used by all the people i know, for texts like this that have only the occasional character outside the lower-127, where the meaning isn't changed, i'll stick to plain ascii.
I believe this is a copout. You can convert most of the western-based Unicode characters to ISO-8859 (the "8-bit ASCII") if you want, and to other encoding schemes, so you have even more encoding options to handle just about everything everyone uses. Today's web browsers handle Unicode very well. And since you are building your own ebook viewer, you can implement Unicode in it quite trivially (at least be able to handle, to start out with, the Latin-1, Latin-Extended and Greek character sets.)

The problem with throwing away the higher characters is that, contrary to what you say, it is not easy to reinsert them as they appeared in the original, unless you re-OCR the texts and the OCR accurately finds them. I can tell you that OCR, even Abbyy, still has some problems with accented characters, especially those which use very subtle accent marks that can easily be mistaken for serifs.

As an example, I'm curious to know if Abbyy 7 will correctly recognize *all* the accented characters in the current "My Antonia" scans -- I listed them in my prior message. If you want, I will be happy to go through and list the actual page numbers they are found on. For example, the umlauted "i" in "naïve" and "naïvety" -- this is a particularly difficult character to recognize (it is often incorrectly recognized as a capital 'Y'), and it is often (as are most accented characters) used in words which will not be found in some lookup dictionary.
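To make the encoding options concrete (again, just a sketch, not anyone's production code), the handful of accented characters in this book fit comfortably in Latin-1, so nothing forces a fall back to 7-bit ASCII:

    text = "My Ántonia was naïve"               # sample with Latin-1 accents

    latin1 = text.encode("iso-8859-1")          # fits: b'My \xc1ntonia was na\xefve'
    utf8   = text.encode("utf-8")               # always fits, for any Unicode text
    ascii7 = text.encode("ascii", "replace")    # lossy: b'My ?ntonia was na?ve'

    # A viewer only has to decode with the matching charset, e.g.
    # latin1.decode("iso-8859-1") gives back the accented text unchanged.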
I sense that you don't want to properly deal with accented characters
first of all, jon, i define what "properly" means for me, you don't. you can define it for yourself. but i won't let you define it for me.
A lot of people consider accented characters important to preserve. Since, as you say, it is easy to translate from accented characters to non-accented characters (but not vice-versa), then you can meet more people's needs (including those odd few who prefer *not* to read accented characters) by recognizing and preserving these characters. I'd like feedback from the DP folk as to their policy regarding reproducing the non-ASCII characters (Latin 1, Latin Extended, Greek, etc.) It would not surprise me if DP, as a matter of policy, reproduces them.
I sense that you don't want to properly deal with accented characters since this poses extra problems with OCRing and proofing,
nope. it's just that i see them as _unnecessary_ to this book. if a reader thinks it _is_ necessary, make the global-change.
How? Unless you somehow record that information on accented characters in some master document, you can't go in the other direction. You are assuming all the words using accented characters are found in some dictionary, which is not true.
something you are trying to avoid in your zeal to get everything to automagically work. To me, that's going too far in simplifying.
i'm not "simplifying". i'm consciously making a choice to use something that will work on the broad range of machines out there, as opposed to something that -- in far too many cases -- fails badly.
Yes, but this is the fundamental flaw. You appear to be taking short-cuts to try to prove that people don't matter in the process to produce high-quality etexts that are repurposeable and trustworthy. Certainly it is much preferred to have better and more accurate tools, and hopefully the tools you are producing will make life easier for many *people* involved in creating structured digital texts of public domain works.
it's the same pragmatic decision that michael made when he crafted the philosophy guiding the building of this library of 10,000+ e-texts, in sharp contrast to your philosophy, which has built a 1-book library.
As previously discussed, did Michael immediately go from 1 text to 10,000 etexts in two weeks? And did this growth occur solely by his own sweat of the brow? And note that almost half of the PG collection is done mostly right because Distributed Proofreaders *does* follow "my philosophy" fairly closely (or maybe better put I follow their philosophy fairly closely.) There's another purpose behind the "Trusted Editions" project. It is not intended to be a competitor to PG or other text projects, but to further benefit the various users of public domain texts. More options are better than fewer options.
Preserving accented characters is important.
in some cases, i'd agree with you. in others, not. in this case, not.
Can you explain how you decide when accented characters are to be reproduced? Or is this impossible to explain using an unambiguous, objective rule? (And will your toolset handle the full Western portion of the Unicode set? If so, then why not process *all* texts using the full character set? Why the need to reduce some of them to irreversible 7-bit ASCII?)
as it is, though, i just have to resign myself to the position that o.c.r. punctuation errors are a distraction, but make no difference. i'll still root them out, due to my sense of professionalism, but i sure wish it felt _fun_, instead of feeling like _doing_chores_. and to the extent that i can automate the chores, i'll be _happy_.
What's interesting is that there are lots of people who *enjoy* doing this. That's what makes DP so successful, because it brings together people with different interests. Does DP do what it does the best possible way at this time? Of course not. Is DP as good as it could ever be? Of course not. Charles himself noted that to me last year. DP is still a "beta" in progress, or maybe a version 1.0. But DP recognizes that mobilizing people is a critical requirement of success. Juliet could talk for hours about how important the people side of producing etexts really is.

And note that there are millions of texts that *cannot* be handled by your toolset, such as handwritten records, horribly tabulated data with poor and ambiguous structure, etc. These texts are held by historical and genealogical societies, local governments, etc., etc. DP, or a DP-like process, properly cloned, is the best way to convert these texts to useful structured digital texts.

Not only that, these local groups have a lot of enthusiastic supporters who will volunteer to scan and proof these texts. It will be done by people power, enabled by technology, and not solely by machine power -- unless, of course, someone soon invents truly sentient AI machines with real human intelligence, personalities and even emotions.
They are hopefully caught by human proofers/readers when grammar checkers don't (I do use Word to help find both spelling and punctuation errors -- when they find something, I then manually check it in the page scans and the master XML.)
oh, so you _do_ use an assist from your tools at times. that's good.
Of course! I use tools when I can, but I don't blindly use them. Do you think I use 3x5 cards for everything I do? <laugh/>
They are "sometimes" easy to spot. Other times the automatic routines will not catch errors
maybe the automatic routines you are using are just inferior.
*Shrug* After all, I put together "My Antonia" for the project by kludging together sub-optimum tools, hardware and processes (e.g., not having a high-quality sheet feed scanner). "My Antonia" is simply a pre-beta to test out several (but not all) of the important concepts, to shake down various things for the next stage effort. It is showing us the kind of tools and applications we will need to go into production (this includes the high-quality scanning and image preparation processes.)

The discussion, both here and on TeBC, both critical and supportive, both public and private, has been extremely useful at helping us to better understand various things. This feedback has shown things we've done wrong, things that could be improved, and different ways of looking at the various issues.

So your assumption that we've finalized the "formula" and the "process" is incorrect. We feel comfortable in "casting our bread upon the waters", so we can inspire many people, supporters and critics, to provide valuable feedback. We obviously inspired you to reply -- your feedback has been very valuable.
This is also true, but as found in "My Antonia", there are exceptions to pure nesting, such as when a quotation spills over into several paragraphs where the intermediate paragraphs are not terminated by an end quotation mark (whether single or double.)
is it really your considered opinion that i don't know this? that i haven't factored it into my thinking _and_ my tools?
maybe you're grandstanding to the lurkers, but my goodness, jon, do you really think that _they_ are that stupid too?
You seem to have blind faith that you will be able to sufficiently cover most every important "exception" found in most texts, and I don't believe it is yet possible. If you do, that'll be wonderful. But your apparent dismissal of the importance of universal handling of extended character sets is alone a show-stopper, in my opinion. Now if you do plan to soon universally support the Unicode character set (or at least the European subset of it), then I believe it will make your toolset much more valuable.
Well, not all of the pages have been doubly proofed. The team is not finished, and I plan to post a plea somewhere for more eyeballs to go over it.
have you heard about distributed proofreaders? might be able to find some people there...
I should have written "to post a plea to a few places", because yes, I plan to post a message to the DP forums about "My Antonia". But I want to do some more preliminary assessments before approaching them. Anyway, I've already posted here for some help, and have done some back channel chatting, so a few DPers already know about "My Antonia". :^)
I would like to receive error reports as well for this text,
i'll tell you the same thing i told michael about project gutenberg: set up a system for the checking, reporting, correction, and logging of errors, a system that is transparent to the general public, and i will be more than happy to report errors to you, and help you out.
Now, I agree with you on this. Part of the community aspect of the bigger vision is a system for follow-on proofing. But we also, for the short-term, want to improve the "My Antonia" text the old-fashioned way of manual error report submissions. The error feedback and updating system has to be properly designed and integrated with the other community aspects of the digital texts, since these are inextricably linked -- in addition, the "manual" process helps in better understanding the community-based system.
which, by the way, is what everyone else is thinking. which is why errors in the texts are not being reported at nearly the frequency that they should be being reported. but i've got another message sitting here waiting to be sent where i discuss that topic in more detail, so i'll stop here now.
I agree with you on this. And the error reporting system is an important aspect of building user trust in any etext collection.
since Brewster wants highly proofed texts for some experiments he plans to run similar to yours.
i'll have to ask him about his tests.
brewster@archive.org Not sure what his current status is on this.
I did find one error in my text based on the list you gave. Thanks.
you're welcome. but that's not the one i was talking about. :+)
*shrug*. It will be found, unless it's something that you believe is an error in how we transcribed the original first edition, and we do not consider it to be an error. You alluded to that in your prior message (such as mentioning the small space that precedes a few question marks -- inspection of a large number of pages where question marks appear strongly supports my contention that this is a typesetting issue and not anything specified by Willa. Anyway, the original communications by Cather on her many preferences for "My Antonia" *exist* and scholars have pored over them with a fine-toothed comb. The UNL Cather Edition does not place any spaces before any question marks, nor do they place a space anywhere before an apostrophe s used in contractions.)

However, the " 's" contraction issue is one I'm going to look at again today. One of my proofers noted this to me the other day, so with her feedback and yours, it will be looked at again. See, the system, primitive as it is at present, *is* working (even if it is currently a manual, short-term hack.)

Jon

On Sat, 5 Mar 2005 11:35:52 -0700, Jon Noring <jon@noring.name> wrote:
The collection (of one so far, and it is essentially a working demo for learning purposes) does not state to be "Jon Noring's" collection. Go to http://www.openreader.org/myantonia/ and tell me what it says there, and if it prominently mentions my name.
The comment about DjVu and IE6 seems out of place; there are plugins for Netscape there too. It seems like an interesting project. I'm not sure I have the time or ability to help, but I'm willing to make the offer.
Readers will appreciate the thoroughness expended to modernize a text for them, and will have warm fuzzies that it is "accurate" when the editor *takes the time* to explain what they did. This builds *trust* with the reader.)
I got into a bit of a flame war on bookpeople by suggesting that a translation might stand a few words on why.
So? What do you care? Is there a law saying any digital text version of a public domain work *must* be submitted to PG? Does PG have a government monopoly on the Public Domain? Of course not.
I've cared because a central library makes it easier to find a work, instead of having to search in several places. Also, Project Gutenberg has a long history, indicating it will be around tomorrow and the day after that, and it's decentralized, meaning that if it's not, everything won't just disappear.
And the word "trust" is an important core human concept -- society works only when there is sufficient trust between people, and trust in the various products of their labors. So any human endeavor which does not put "trust" as #1 is prone to eventually fail.
I don't agree. PG has not made "trust" an explicit concept, but people being as they are, they trust that the PG works are done competently. When I gave my sister a copy of "A Doll's House", I didn't check editions and quality of translation; I just bought a random copy. You want works to be verifiable, but most people just don't worry about that; they "trust" others to do a good job.
I'd like feedback from the DP folk as to their policy regarding reproducing the non-ASCII characters (Latin 1, Latin Extended, Greek, etc.) It would not surprise me if DP, as a matter of policy, reproduces them.
We mangle the Greek via transliteration still, but we always get Latin-1 right, and we more or less get Latin Extended correct. (OE is usually broken, but accents are recorded, and I assume most PMs are aware enough to catch the weird characters.) Hebrew, Arabic and friends are usually, hopefully, handled by the PPer.
nope. it's just that i see them as _unnecessary_ to this book. if a reader thinks it _is_ necessary, make the global-change.
Why judge that on a book-by-book basis? In fact, you can't, since your programs don't tend to support "accented" characters in any texts. Certainly, the majority of pre-1850 works have at least one Greek quote that ASCII will horribly and irrevocably mangle. French quotes aren't exactly uncommon in our era of books, either.