lee said:
> The question is a bit ambiguous.
only if you haven't been following the drama for the last year-and-a-half.
welcome to this listserve, lee. have you dropped the handle for good now?
it's been some time since we chatted, especially frontchannel...
> What are you trying to detect headers _from_?
> AFAICT, Gutenberg e-texts don't have big and don't have bold,
> so neither can be the hallmark of a header in Gutentexts.
that's right. so for that i need to call on some of the other items
in my 30-item checklist. the very best way to detect headers in
a p.g. e-text is to test for blank-lines above the line in question.
three blank lines will grab almost all of the headers, as well as
a dose of false-alarms. the job then is to toss the false-alarms,
and to do the best job possible of discerning the missed headers.
and actually, in perhaps 25%-30% of project gutenberg's e-texts,
pulling lines that start with "chapter" will net most headers. :+)
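just to make those two tests concrete, here's a rough little sketch
in python -- not the actual checklist code, mind you; the cutoff of
three blank lines and the "chapter" test are exactly as stated above,
everything else (names, return shape) is just placeholder scaffolding:

def candidate_headers(lines):
    """return (line_number, text) pairs that look like headers."""
    candidates = []
    blank_run = 0                       # count of blank lines just seen
    for number, line in enumerate(lines):
        stripped = line.strip()
        if not stripped:                # blank line: extend the run
            blank_run += 1
            continue
        if blank_run >= 3 or stripped.lower().startswith("chapter"):
            candidates.append((number, stripped))
        blank_run = 0                   # any non-blank line resets the run
    return candidates

# usage:
# with open("etext.txt") as f:
#     for num, text in candidate_headers(f.read().splitlines()):
#         print(num, text)

that grabs the headers plus the false-alarms; the rest of the checklist
is what sorts the one from the other.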
> Presumably, therefore, you are trying to detect headers in some
> marked-up text that uses some sort of presentational markup.
"markup" doesn't usually enter into the equation. it can, of course,
but if something has been marked up, a good way to find the headers
is to examine the markup. nonetheless, i _can_ use my system on
the _presentation_ of text that has been marked-up; many of my
examples will be just that. as such, it can be used in cases where
the mark-up is not available, for one reason (print) or another (.pdf),
but its presentation is. of more direct concern to this listserve,
however, is its application to the task that many people here do,
which for the most part is to digitize text from scans of paper-books.
a routine that recognizes headers in o.c.r. output -- because they are
relatively big and/or set in bold -- saves the digitizer from that chore.
i haven't yet discussed why header-recognition matters,
so it might not seem like a big deal. but it is indeed rather important.
(any e-book programmer, like yourself, lee, knows why it's important.)
and, getting back again to the existing e-texts -- some 16,000+ now --
a routine for determining the headers in them would be quite valuable...
if you're looking for a general overview, i focus on 3 distinct arenas:
1. strict z.m.l., where header-structure is defined by certain rules.
2. "fuzzy" mode, where texts are somewhat consistent, but not always.
3. "wild" texts, where all bets are off and you do the best that you can.
project gutenberg's e-texts generally fall in the second category.
as the examples i give will show, it would be relatively easy for me
to make software that inputs text from the second category and then
modifies it and outputs a file conforming to the strict first category.
but nobody from project gutenberg took me up on my offer to do that...
i've done enough work on arena #3 to know that it will be possible,
although you can't expect perfect output from the tool on a wild text.
i largely abandoned arena #2 when project gutenberg people passed,
although this arena will have wide-ranging applicability to
texts with some kind of regularity in them, such as listserve digests.
but my main focus now is on spreading the gospel of arena #1 -- z.m.l.
in z.m.l., headers are indicated simply by having blank lines above them.
(and the more blank lines, the higher the priority-level of the heading,
so it's a cinch to handle even the most complex of heading-structures.)
this simplicity means that it's easy to write fast code to find headers
in a z.m.l. file, and it's simple for users to understand how to make 'em.
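here's a tiny sketch of that rule in python. the blank-line counts and
the "level" mapping below are placeholders for illustration, not the
actual z.m.l. spec -- the only point is that the whole test is a few
lines of code, which is exactly the simplicity i'm talking about:

def zml_headings(lines):
    """return (level, line_number, text) for lines preceded by blank lines.
    here the level is just the blank-line count: a bigger count means
    a more important heading. (placeholder mapping, not the real spec.)"""
    headings = []
    blank_run = 0
    for number, line in enumerate(lines):
        if not line.strip():
            blank_run += 1
            continue
        if blank_run >= 2:              # placeholder threshold
            headings.append((blank_run, number, line.strip()))
        blank_run = 0
    return headings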
a big explosion of self-publishing is still on its way,
and i want to spare all those new writers the pains of doing mark-up.
i'd much rather have them concentrating on their _content_ instead!
once i've got all the tools in place to do what i want with arena #1,
i'll return to arena #3. being able to take text "from the wild" and
ascertain its underlying structure, and then output it in strict z.m.l.,
so it can be handled with my tools, will be an awesome achievement.
again, this is an arena where markup is impractical, perhaps impossible.
consider all the content that is being generated _every_single_day_ on
yahoogroups. nobody's going to mark-up all that content, so we need to
have a way of pulling it into our e-books and have it be nicely formatted.
> Given your assumption that headers are
> 1. conspicuous,
> 2. hard to miss, and
> 3. easy to find
> (all variations on a theme)
thanks for noticing the theme... ;+)
but it's not really an _assumption_. (nice try to spin it that way, though.)
it's actually an _observation_ on the very _nature_ of _being_ a _header_,
one of those things that seems totally obvious once realized and verbalized.
and of course, once you have realized that headers are _hard-to-miss_,
it becomes very silly to maintain that it is "impossible" to detect them.
of course you can detect them -- because they stick out like sore thumbs!
> it seems to me that the best way to detect a header is to
> determine the general characteristics of the majority of
> all paragraphs in a document (size, indentation, amount of
> punctuation, location of punctuation, capitalization, etc.) and
> identify as headers any "paragraphs" which fall way outside the mean.
now you're thinking.
looks like you're on your way to replicating my 30-item checklist.
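to sketch your idea in python -- the thresholds here are arbitrary
placeholders i just made up, not items lifted from my checklist --
you profile every paragraph and flag the ones that sit far from
the typical profile:

import statistics

def outlier_headers(paragraphs):
    """flag paragraphs whose profile sits far from the typical one."""
    if not paragraphs:
        return []
    lengths = [len(p) for p in paragraphs]
    mean = statistics.mean(lengths)
    stdev = statistics.pstdev(lengths) or 1.0
    flagged = []
    for p in paragraphs:
        much_shorter = (len(p) - mean) / stdev < -1.0   # well below average length
        no_terminal = not p.rstrip().endswith((".", "!", "?", '"'))
        if much_shorter and (no_terminal or p.isupper()):
            flagged.append(p)
    return flagged

size, indentation, punctuation, capitalization -- each of those is one
more feature you can fold into that same flag-the-outliers loop.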
> I presume you have a reliable way to identify paragraphs
> (not always possible when using text derived from PDF files).
well, yes.
and the fact that text copied out of a .pdf loses its blank lines --
which makes paragraph-detection considerably more difficult --
does indeed make the detection of headers more difficult as well.
which means you have to solve the paragraph-detection problem first,
as best as you can, anyway, with text that you've copied out of a .pdf.
restoring the paragraphs is a much bigger task than detecting headers.
if you can't perform that hard task for end-users, why do the easy one?
but the solution isn't as hard as you might think, although it's not 100%.
when i'm done discussing headers, if you want to discuss this, we can...
and besides, dealing with text copied out of a .pdf is not a high priority.
the best way to deal with _that_ kind of text is to go to the producer
and say, "can i instead have the file that you used to produce the .pdf?"
but even without having solved this .pdf paragraph-detection problem,
-- i.e., with all blank-lines removed -- my checklist does pretty well...
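since i said the solution isn't as hard as you might think, here's a
rough sketch of one such heuristic in python. the cutoffs are placeholders
and the whole thing is definitely not 100% -- it just shows the flavor:
start a new paragraph when a line begins with an indent, or when a line
ends a sentence and is noticeably shorter than the usual wrapped line.

def restore_paragraphs(lines):
    """rejoin hard-wrapped lines (no blank lines) into paragraphs."""
    typical = max((len(l) for l in lines if l.strip()), default=0)
    paragraphs, current = [], []
    for line in lines:
        if current and line[:1] in (" ", "\t"):   # indented line: new paragraph
            paragraphs.append(" ".join(current))
            current = []
        current.append(line.strip())
        ends_sentence = line.rstrip().endswith((".", "!", "?", '"'))
        is_short = len(line) < 0.7 * typical       # placeholder cutoff
        if ends_sentence and is_short:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs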
> Consider the shortest verse of the Bible: "Jesus wept."
> Biblical verses are merely numbered paragraphs.
> Can your algorithm determine that it is a paragraph and not a header?
um yeah. "headers" in the bible are "paragraphs" that are not numbered.
and -- as you yourself just pointed out -- the actual verses are. voila.
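in code, that test is practically a one-liner; the pattern below is just
an illustration i'm making up here, not the checklist item verbatim:

import re

VERSE_NUMBER = re.compile(r"^\s*\d+[:.]?\s")    # e.g. "35 ", "35. ", "11:35 "

def looks_like_verse(paragraph):
    return bool(VERSE_NUMBER.match(paragraph))

# looks_like_verse("35 Jesus wept.")       -> True   (numbered: a verse)
# looks_like_verse("The Gospel of John")   -> False  (unnumbered: header candidate)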
> This is the problem of the false positive:
> it is as important to identify not-headers as it is to identify headers.
yes it is. and much of the 30-item checklist is attuned to that issue.
once you've accepted that this is part of the job, it's not all that hard.
> You would be much more likely to
> increase your list of special cases
> if you would share the thirty-odd
> special cases you have already identified.
i haven't identified "thirty-odd special cases".
i've abstracted 30 rules that act in combination
to answer the question at hand -- is this a header?
and it wasn't that hard. you can probably come up with 10-15
right off the top of your head, without even thinking too much.
and if you subjected those to empirical testing on lots of e-texts,
as i have over the course of the last 2-3 years, you would probably
discover the rest of my 30 items. and then you too would be saying,
"it's not impossible, folks, and in fact, it's not even all that difficult."
there's no magic here. just hard work...
-bowerbird