
ok, let me get to jon's "challenges" before i'm banned here... :+) yes, it's long. as the saying goes, "the difficult we do immediately; the impossible takes a little longer." jon says that what i am doing is "impossible". as you'll see, it's just a determined application of some good old common sense... but explaining it all made this long. sorry. if you don't feel like reading it, just go on to the next post. but then don't come around crying later, saying i didn't explain it... *** jon said:
For an example of this "multiple uses of the same typography", the venerable "italics" is a good one. Italics are used for: linguistic emphasis literal emphasis (not the same as linguistic emphasis!) names of ships titles of certain types of books and documents word used as a word foreign phrases sometimes used for headers etc., etc., etc., etc., etc. ,etc., etc., etc.
well, my first reaction is that you've got a pretty good list going there, jon, i'm impressed. italics certainly seem to be a workhorse, don't they? my second reaction is, wow, this list must just be the tip of the iceberg! because _look_ at all those etc.s at the end of the list! even one etc. means that "there are more items in this list", so _8_ (count 'em, 8!) must mean that there are _lots_and_lots_ of more items in this list! so now you've got my curiosity inflamed, jon! please, could you give us the _complete_ list, the one that has all of those etc.s fully delineated? considering how overloaded italics are, how do humans figure them out? but you know what? we do. personally, i can't remember a single time -- not a single one -- where i was ever _confused_ about why something was italicized. so of course, my response is to "just italicize them as the p-book did, and let the reader sort 'em out, just like the readers of the p-book did". so the full list isn't necessary, jon, except you've really got me curious! i can't even _remember_ the last time i saw 8 etc.s in a row like that! as for "accessibility" concerns, i think those readers are just as capable as sighted readers at filling in the semantic inferences of the story, and i will not give them less credit than they deserve. it would be insulting. if sight-impaired readers tell me they want elucidation of these things, i'll be responsive. but in the absence of that, i will assume capability... for the intellectually curious, though, yeah, it would certainly be possible to write routines that could make a good prediction about which one of the possible meanings was the one that was responsible for the italics. and with the right set of look-up tables, the predictions might be sharp.
in that regard, aside from "signs" in the current case, a block element would include: notes, invitations, death-threats, placards, notices, flyers, correspondence, diary-entries, suicide-notes, love-letters, cease-and-desist notices, warnings, equations, eviction notices, lyrics, poems, newspaper-clippings, quotations, out-of-order signs, magazine/journal articles, pull-quotes, ledgers, tables, and what else? *** so this italics example is like other variables that _freak_jon_out_, and which would be terribly expensive and time-consuming to code, but for which there is _simply_no_demand_from_the_users_ to have, so it makes no sense for us to worry about doing the coding of them. for the few people -- academics or dilettantes or pedants or whatever -- who are concerned with this specificity of detail, let _them_ do the work. *** in spite of lots more verbiage about how the whole idea is impossible, jon's only other real "challenge" was involving the detection of headers. first of all, let me say that people are very good at recognizing headers, just as they are very good at figuring out the specific reason for italics within the context of a story. so one form of my reaction could be to say, "don't worry about it, because the readers will be able to figure it out fine." however, the headers in a book _do_ need to be detectable by a viewer-app, to make them big and bold, create a hotlinked table of contents, and so on. so, for headers, we _do_ need to think about how they can be signified. fortunately, this is very easy to do. zen markup language calls for headers to be indicated unequivocally by a set of 4 or more empty lines preceding them, so detection is simple. use more and fewer blank lines to indicate headers of different levels. problem solved.
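for the curious, the blank-line rule can be sketched in a few lines of python. this is only an illustration, not the actual viewer-app code: the names `find_headers` and `min_blank` are made up here, and since the exact mapping from blank-line counts to header levels isn't spelled out above, the sketch just reports the count and leaves the level up to the caller.

```python
def find_headers(text, min_blank=4):
    """Return (line_number, blank_lines_before, line) for each candidate header.

    A line counts as a candidate header when at least `min_blank` empty
    lines come immediately before it. `min_blank=4` follows the rule
    described in the post; the function itself is a hypothetical sketch.
    """
    headers = []
    blank_run = 0  # number of consecutive empty lines seen so far
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            blank_run += 1
            continue
        if blank_run >= min_blank:
            headers.append((number, blank_run, line.strip()))
        blank_run = 0  # any non-blank line resets the run
    return headers


sample = "intro paragraph\n\n\n\n\nCHAPTER ONE\n\nbody text\n\nmore body\n"
for number, blanks, title in find_headers(sample):
    print(number, blanks, title)
```

a caller that wants levels could then group the results by the blank-line count, on the assumption that each distinct count in a given file corresponds to one level.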
a hotlinked table of contents is created automatically, and placed in a "contents" menu on the menubar for further convenience; headers are displayed in big and bold text, just like readers expect, and sections begin on fresh screens in the best practice of typography, etc. of course, this requires that the person who creates the digitized text takes a concrete action on each header -- i.e., puts in those blank lines. since that's such a dirt-simple task, i don't consider it to be a big deal. it shouldn't take but just a few minutes, even in a rather-complex book; time well spent in the _4_hours_ i allotted to digitize the average book. besides, in the grand tradition of "zen markup language" philosophy, these blank lines are _something_that_people_already_do_anyway_, for the most part. it's only natural to give a header "breathing room". it feels like the right thing to do. indeed, one of the formatting rules for project gutenberg e-texts is to precede a header with blank lines. although i don't recall what the guidelines specify about the _number_ of blank lines to use -- especially with headers of different levels -- most e-texts end up being able to be evaluated by using this variable... *** jon _might_ have been referring to auto-recognition in o.c.r. output. no problem. because headers _are_ so important, i spent some time on recognizing them, and it ends up that i can now handle that very well too. if you've scanned the book with the right settings, the information about the "space above" a specific line -- in terms of number of blank lines -- might well be available to your routines, so they can judge on that basis. but even if that information is missing -- as, for instance, it is when you copy text out of a .pdf, which often loses all info about blank lines -- it's still possible to evaluate a line to determine whether it's a header. 
my program has a 30-item checklist to determine if a line is a header, and -- even though i haven't optimized it yet -- it works like a charm... i'll eventually run it against the p.g. library, and check its performance. maybe i'll find i don't even need to optimize it. but if i do, then i will. (the optimization would involve a multiple-regression on the 30 items; in the non-optimized checklist, most items are weighted fairly equally.) in regard to jon's made-up example, i don't usually bother to deal with fabricated tests, because they are easily manipulated, and my concern is how well my routines work on actual e-texts, not on fake examples. nonetheless, jon's examples look to be representative of real-life text, so i will answer them. i won't give full details on the whole checklist, because i'm still debating whether or not to keep it as "a trade secret". but i can reveal enough of it to deal with the two examples jon gives. so let's look at jon's first example:
Here's one very small example to illustrate what I'm saying:
(exhibit 1 -- I made this up, inspired by Sherlock Holmes)
******************************************************
... she walked up to the door, and on the door was a small sign with a message in stark, bold black letters which read:
NO SOLICITORS OR SALES PEOPLE
Ignoring the sign as if it wasn't there, she knocked on the door, intending to make the sale...
*********************************************************
this example -- which is not a header, but a sign -- is recognizable to my checklist as not-a-header, because the paragraph that precedes it ends in a colon. if a paragraph ends with a colon, that indicates that some kind of block-text follows, so we can be assured that that section has not been terminated. real paragraphs simply do not end with a colon. i made this point over on one of the forums at distributed proofreaders, when people were wondering what to do in the absence of typographical indication of a block-quote. although the item -- a diary entry -- was clearly understood by a reader to be distinct from the regular body-text, it was set in the same type, so the question was whether it should be marked as a block-quote. it ends up the typographical indication _was_ indeed present -- and big as day in that colon -- they just didn't see it, because they weren't knowledgeable about this typographical nicety... ***
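the colon rule is simple enough to show as a sketch. this is just one check written out in python for illustration, not the real checklist (which stays under wraps); the function name `follows_open_paragraph` is invented for this example.

```python
def follows_open_paragraph(lines, index):
    """True if the nearest non-blank line before lines[index] ends with a colon.

    A paragraph that ends in a colon announces that block-text follows,
    so a line for which this returns True is treated as not-a-header.
    """
    for previous in reversed(lines[:index]):
        stripped = previous.strip()
        if stripped:
            return stripped.endswith(":")
    return False  # nothing but blank lines (or nothing at all) above


lines = [
    "with a message in stark, bold black letters which read:",
    "",
    "NO SOLICITORS OR SALES PEOPLE",
    "",
    "Ignoring the sign as if it wasn't there, she knocked on the door.",
]
print(follows_open_paragraph(lines, 2))  # the sign line
print(follows_open_paragraph(lines, 4))  # the paragraph after the sign
```

on jon's exhibit 1, the check fires for the sign line, which is why that line gets ruled out as a header even though it is short and in all-caps.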
(exhibit 2; adapted from an Encyclopaedia Britannica article)
*********************************************************
...Government weakness allowed the mutiny to spread; and although order was eventually restored in Istanbul and more quickly elsewhere, a force from Macedonia (the Action Army) led by Mahmud Sevket Pasa marched on Istanbul and occupied the city (April 24).
DISSOLUTION OF THE EMPIRE
Abdulhamid was deposed and replaced by Sultan Mehmed V (ruled 1909-18), son of Abdulmecid. The constitution was amended to transfer real power to the Parliament...
*********************************************************
jon reports this is a header, and it would almost certainly be recognized as such by my 30-item checklist, since it fits very many of the criteria. there does remain a possibility that it would be judged not-a-header, depending on whether the other lines in the file that are judged to be headers exhibit similarities to this one. my thrust is to look not just at a line in isolation, but within the context of all lines (and headers) in the file, since headers almost always bear some kind of similarity to each other, particularly headers at the same level. however, given the information that i have, for this line in isolation, yes, i would likely call it a header. there is another consideration here as well that might have relevance. given an outline-structure that includes several _levels_ of headers, some minor-level headings might not get full "header" treatment, e.g., they might not start on a new page, or be listed in the table of contents. (a table of contents that stretches over more than 4 pages starts to lose its value as a tool that gives the reader an overview of the book. in that case, the main table of contents would include major sections, and then each major section would start with its sub-contents page. sorry to get so picayune here, but remember that i was "challenged".) since jon "adapted" this from an encyclopedia article, that is likely to be the case here. this line certainly isn't a major-level heading, but it's impossible to know where it fits in the hierarchy of levels without a fuller analysis of the entire file. whether or not this line might be one of those minor-level headings is something i cannot say without knowing the outline of the whole book. but i include it here to give you an idea of the depth of my analyses... back to the line, though, and the things that help identify it as a header.
some obvious indicators from the checklist that i can mention here are the all-caps nature of the line, the absence of trailing punctuation, and -- as you should've gleaned -- a properly-closed paragraph preceding it. *** i'll end with an observation. when i set out to create my header checklist, i expected there might be a half-dozen indicators. i found two-and-a-half dozen. thus i found the task is surprisingly simple. in retrospect, i realized why. headers are -- by their very essence -- _intended_to_be_conspicuous_. if a header doesn't "stick out", it is -- by definition -- not doing its job. because they are _designed_ to be obvious, they become very easy to find. indeed, this allowed me to develop a brain-dead way to find the headers -- and in the process of doing so, create an outline of the publication -- one that works with surprising efficiency with all kinds of documents: delete all the lines from the book with any non-bold characters, and the lines that remain will likely constitute the book's outline. try it some time, with any kind of document -- paper or electronic; for instance, although i couldn't get to the site to confirm it, in the past i have found that this approach works perfectly on noring's own website:
pulling out the oversized text is another good way to go about the task. of course, once you've thought about it, it's rather obvious -- is it not? -- that headers tend to be big and/or bold. in retrospect, it's all too simple. that observation made me think that structures which were trying to be _inconspicuous_ might be the ones that were really the hard ones to find. but i was wrong about this. it ends up that, if you consciously _look_for_ the lines that are trying to be "inconspicuous", they're easy to find as well. a good example is _footnotes_. they _try_ to hide, putting themselves at the bottom of the page, and rendering themselves in a smaller point-size. pretty sneaky little devils, aren't they? but not too smart, it ends up. because if you _look_ for inconspicuous -- small type, at page-bottom -- it's remarkably easy to find footnotes too. even if you have stripped out the fontsize information from your o.c.r. file -- which, just by the way, is an _incredibly_stupid_ thing to do, even though many "experts" do it, including the good people over at distributed proofreaders -- you can _still_ recognize the footnotes easily, since their smaller type meant that their lines end up having more characters than a line of body-text. so if you count the characters, to look for longish lines, there they are! (this assumes that you've kept the original linebreaks, which the good people over at distributed proofreaders _do_, but other "experts" do not. what can i say?, except that if you make _enough_ stupid decisions, yes, you too _can_ make the digitization job _difficult_, if you _want_ to...) *** anyway, jon, i hope that answers your "challenge" on header-recognition. it's an easy task. i know you've spent years telling people that it is hard, that it's a job requiring "artificial intelligence" that will not be available for the next two or three decades -- "if then" -- but you are plain wrong. 
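the line-length trick for footnotes can be sketched too. a rough python illustration, assuming the original linebreaks were kept: take the median length of the non-blank lines on a page as the "typical" body-text line, and flag anything noticeably longer. the name `longish_lines` and the 1.2 ratio are assumptions made up for this example, not tuned values.

```python
from statistics import median


def longish_lines(page_lines, ratio=1.2):
    """Return the lines on a page that are noticeably longer than typical.

    Smaller type fits more characters on a line, so with original
    linebreaks preserved, footnote lines tend to be longer than body
    lines. `ratio` is an arbitrary illustrative threshold.
    """
    lengths = [len(line) for line in page_lines if line.strip()]
    if not lengths:
        return []
    typical = median(lengths)  # short paragraph-ending lines barely move the median
    return [line for line in page_lines if len(line) > typical * ratio]


# a made-up page: five 60-character body lines, then two longer lines
page = ["a" * 60] * 5 + ["b" * 85, "c" * 80]
for line in longish_lines(page):
    print(len(line), line[:10])
```

a fuller version would also check that the flagged lines sit at the bottom of the page, since that is the other half of what makes a footnote "inconspicuous".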
indeed, i put up an entry in my blog some time back on this very topic, announcing "the fastest-growing quiz sensation in america", which is the game "am i a header?", where i ask people to submit lines to me and have me guess whether or not that line is a header. come play sometime! my blog is at: http://journals.aol.com/bowerbird/bowerbirdseyeview oh, and jon, if you have any more "challenges" for me, let me know, but hurry up, because you never know when i'll be banned from here. oh, and also, if you ever want to take up my "challenge" to you, and mark up that test-suite of mine, you should -- of course -- feel free. and i'm still waiting for that list, from you and/or networker, on what kinds of semantic structures you think need to be marked up. -bowerbird p.s. it wasn't my "logic" that john ockerbloom had a problem with. he just didn't want me to say that "my resolution for 2005 is to get off of the jon noring merry-go-round, because it's so senseless to keep on debating the same old stuff over and over and over and over." but since people _here_ want you and me to get off the merry-go-round, i'm guessing their only objection is that i haven't done a better job of keeping my resolution...