Re: [gutvol-d] header detection revisited

it's been one month since my post about "detecting headers", in response to jon noring's "challenge" in that specific regard.
in case you've forgotten, here's a quick recap:
big and bold. that's what headers look like. conspicuous. real hard to miss. easy to find.
as i said last month, i have developed a 30-item checklist. that's how many ways a header can make itself conspicuous. but the main way -- by far -- is simply to be big and/or bold.
so it's time now for part 2.
but first, any questions? don't be shy, step right up, because headers are the first step toward detecting all types of things. (which is why we need to discuss them in some more detail.)
-bowerbird
The question is a bit ambiguous. What are you trying to detect headers _from_? AFAICT, Gutenberg e-texts don't have big and don't have bold, so neither can be the hallmark of a header in Gutentexts. Presumably, therefore, you are trying to detect headers in some marked-up text that uses some sort of presentational markup.

Given your assumption that headers are 1. conspicuous, 2. hard to miss, and 3. easy to find (all variations on a theme), it seems to me that the best way to detect a header is to determine the general characteristics of the majority of all paragraphs in a document (size, indentation, amount of punctuation, location of punctuation, capitalization, etc.) and identify as headers any "paragraphs" which fall way outside the mean.

I presume you have a reliable way to identify paragraphs (not always possible when using text derived from PDF files). Consider the shortest verse of the Bible: "Jesus wept." Biblical verses are merely numbered paragraphs. Can your algorithm determine that it is a paragraph and not a header? This is the problem of the false positive: it is as important to identify not-headers as it is to identify headers.

You would be much more likely to increase your list of special cases if you would share the thirty-odd special cases you have already identified.
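To make the "fall way outside the mean" idea concrete, here is a rough Python sketch. It is only an illustration of the outlier approach described above, not anyone's actual algorithm: the function names, the three features chosen (length, punctuation density, share of uppercase letters) and the threshold of 2 standard deviations are all assumptions made for the example, and it takes for granted that the text has already been split into paragraphs.

```python
import statistics

def paragraph_features(par):
    """Measure a few surface traits of one paragraph (hypothetical feature set)."""
    letters = [c for c in par if c.isalpha()]
    return {
        "length": len(par),
        # share of characters that are common punctuation marks
        "punctuation": sum(c in ".,;:!?" for c in par) / max(len(par), 1),
        # share of letters that are uppercase
        "uppercase": sum(c.isupper() for c in letters) / max(len(letters), 1),
    }

def header_candidates(paragraphs, threshold=2.0):
    """Return indices of paragraphs that lie far from the document mean
    on at least one feature. The threshold is an assumed cutoff, in
    standard deviations."""
    if not paragraphs:
        return []
    feats = [paragraph_features(p) for p in paragraphs]
    flagged = set()
    for name in feats[0]:
        values = [f[name] for f in feats]
        mean = statistics.mean(values)
        spread = statistics.pstdev(values)
        if spread == 0:
            continue  # every paragraph is identical on this feature
        for i, value in enumerate(values):
            if abs(value - mean) / spread > threshold:
                flagged.add(i)
    return sorted(flagged)
```

Note that a sketch like this only flags candidates. A very short body paragraph such as "Jesus wept." could be flagged on length alone, which is exactly the false-positive problem: something further would have to rule such cases out.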
participants (1): Lee Passey