a concrete example might help...
here's the table of contents from
"free culture" by lawrence lessig,
in zen markup language format,
generated automatically from a
simple, straightforward analysis
in about one-half of a second...
even though there are 3 levels
of headers, they are very clear,
indicated by varying indentation
(which stands in for the varying
number of blank lines preceding
each header in the text itself).
text-structures even more complex
than the one shown in this outline
can be communicated easily by the
number of preceding blank lines --
_if_ the rule is followed _consistently_
-- and grokked by routines consisting
of just a few lines of dirt-simple code...
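
a minimal sketch of that blank-line rule, in python.
this is not the actual routine described above, just
an illustration under stated assumptions: that a line
preceded by 2 or more blank lines is a header, and that
_more_ blank lines means a _higher-level_ header.
the names find_headers, to_outline, and min_blank are
made up for this example.

```python
def find_headers(text, min_blank=2):
    """Return [(blank_count, line)] for lines preceded by
    at least min_blank blank lines (assumed to be headers)."""
    headers = []
    blank_run = 0
    for line in text.splitlines():
        if not line.strip():
            blank_run += 1
            continue
        if blank_run >= min_blank:
            headers.append((blank_run, line.strip()))
        blank_run = 0
    return headers


def to_outline(headers, indent=4):
    """Turn blank-line counts into indentation.
    Assumption: more preceding blank lines = higher level,
    so the largest count gets the least indentation.
    Every line keeps at least one leading space."""
    counts = sorted({n for n, _ in headers}, reverse=True)
    depth = {n: i for i, n in enumerate(counts)}
    return [" " * (1 + indent * depth[n]) + title
            for n, title in headers]
```

feed the result of find_headers to to_outline and
you get an indented outline, one header per line.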
by the way, just to say something "obvious"
that lee probably had not considered before,
one of the many ways my routines determine
the headers in a digitized text is to look for a
"table of contents" section -- usually toward
the start of the file, and usually marked with
"contents" or "table of contents" as a header --
and then examine that section quite carefully.
turns out it does a very good job of telling you
what specific phrases "might be" header-lines.
and if you're cleaning up the o.c.r. of a p-book,
for instance, there are usually _page-numbers_
there too, telling what _page_ each header is on.
pretty handy, eh? indeed, in the .pdf of this book,
which you can download at http://www.lessig.org,
you will see that the page-numbers _are_ there, and
chapter 11, chimera, for instance, starts on page 177.
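
here is a rough sketch of reading such a section, in
python. it is an illustration, not the routine itself.
assumptions: the section starts at a line reading
"contents" or "table of contents", it ends at the first
run of 2 or more blank lines after some entries, and
a page-number is set off from its title by at least
two spaces or dot-leaders (so a bare "chapter 11" is
not misread as page 11). read_toc and ENTRY are
names invented for this example.

```python
import re

TOC_MARKERS = {"contents", "table of contents"}

# title, then 2+ spaces/dots, then a trailing page number
ENTRY = re.compile(r"^(?P<title>.+?)[\s.]{2,}(?P<page>\d+)$")


def read_toc(lines):
    """Return [(title, page_or_None)] from the first
    contents section found in a list of text lines."""
    start = None
    for i, line in enumerate(lines):
        if line.strip().lower() in TOC_MARKERS:
            start = i + 1
            break
    if start is None:
        return []

    entries, blank_run = [], 0
    for line in lines[start:]:
        text = line.strip()
        if not text:
            blank_run += 1
            # assumed end of section: 2+ blank lines after entries
            if entries and blank_run >= 2:
                break
            continue
        blank_run = 0
        m = ENTRY.match(text)
        if m:
            entries.append((m.group("title"), int(m.group("page"))))
        else:
            # no page number found; still a candidate header phrase
            entries.append((text, None))
    return entries
```

each title that comes back is a phrase that "might be"
a header-line, and the page, when present, says where
to look for it.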
and if you know what a header is likely to be,
and on what page it is located, it's fairly easy to find.
indeed, people have been using the "table of contents"
for precisely that reason for several hundred years now.
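
a sketch of that lookup, in python, again only as
an illustration. it assumes you already have the lines
of the page in question (how the o.c.r. gets split
into pages is not covered here), and it uses fuzzy
matching from the standard difflib module so that a
scanning error or two doesn't hide the header. the
names norm and locate, and the 0.8 cutoff, are
choices made for this example.

```python
import difflib
import re


def norm(s):
    """Lowercase, drop punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", s.lower()).split())


def locate(title, page_lines, cutoff=0.8):
    """Return the index of the line on this page that
    best matches title, or None if nothing is close."""
    cleaned = [norm(line) for line in page_lines]
    hits = difflib.get_close_matches(norm(title), cleaned,
                                     n=1, cutoff=cutoff)
    return cleaned.index(hits[0]) if hits else None
```

if the page-number is missing or off by one, the same
call can simply be tried on the neighboring pages.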
this is just one of the reasons why it ain't that hard
to write routines to ascertain the headers in a book.
like i said, it sounds very obvious when you hear it.
but have you ever heard anyone say it here before?
-bowerbird
---------------------------------------------
TABLE OF CONTENTS
 Free Culture
 Table of Contents
 License
 Publisher Page
 Library of Congress Cataloging
 Dedication
 Preface
 Introduction
 'Piracy'
     Chapter 1: Creators
     Chapter 2: "Mere Copyists"
     Chapter 3: Catalogs
     Chapter 4: "Pirates"
         Film
         Recorded Music
         Radio
         Cable TV
     Chapter 5: "Piracy"
         Piracy I
         Piracy II
 'Property'
     Chapter 6: Founders
     Chapter 7: Recorders
     Chapter 8: Transformers
     Chapter 9: Collectors
     Chapter 10: "Property"
         Why Hollywood Is Right
         Beginnings
         Law: Duration
         Law: Scope
         Law and Architecture: Reach
         Architecture and Law: Force
         Market: Concentration
         Together
 Puzzles
     Chapter 11: Chimera
     Chapter 12: Harms
         Constraining Creators
         Constraining Innovators
         Corrupting Citizens
 Balances
     Chapter 13: Eldred I
     Chapter 14: Eldred II
 Conclusion
 Afterword
     Us, Now
         Rebuilding Freedoms Previously Presumed: Examples
         Rebuilding Free Culture: One Idea
     Them, Soon
         More Formalities
         Shorter Terms
         Free Use Vs. Fair Use
         Liberate the Music -- Again
         Fire Lots of Lawyers
 Footnotes
 Hyperlinks
 Acknowledgments
 Index
 About the Author
 Jacket
 Typos Corrected
 Permissions
 The Dead-Tree Hardback Version of this Work
 zero markup language -- z.m.l. -- the future of electronic-books
---------------------------------------------
p.s. extra points for everyone who realized why
every line in the table of contents section is
prefaced with at least one leading space: those
lines are not to be rewrapped...
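
a small sketch of that rewrap rule, in python, as an
illustration only. it assumes paragraphs are separated
by blank lines and that any line starting with a space
must be passed through untouched. the name rewrap and
the default width are invented for this example.

```python
import textwrap


def rewrap(text, width=60):
    """Rewrap ordinary paragraphs to width, but leave any
    line that begins with a space exactly as it is."""
    out, para = [], []

    def flush():
        # emit the paragraph collected so far, rewrapped
        if para:
            out.extend(textwrap.wrap(" ".join(para), width=width))
            para.clear()

    for line in text.splitlines():
        if not line.strip():
            flush()
            out.append("")        # keep blank lines as-is
        elif line.startswith(" "):
            flush()
            out.append(line)      # leading space: do not rewrap
        else:
            para.append(line.strip())
    flush()
    return "\n".join(out)
```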