and here's today's post...
again, i'm talking about the scan-set that exists at:
> http://www.openreader.org/myantonia
those scans were subjected to o.c.r. by finereader v7,
with the resultant text-file located here:
> http://snowy.arsc.alaska.edu/bowerbird/myantonia1.txt
one of the things you should notice about this text-file
is that the original linebreaks were retained by the o.c.r.
linebreak retention is very important when it comes to
comparing the text-file against the page-scans because
it makes that comparison many times easier to accomplish.
one of the hassles with those original linebreaks, however,
is with the words that were hyphenated at the end of a line.
you can't just rejoin all words that have end-line hyphens,
because some of 'em might be words that are hyphenated
even when they're in the middle of a line. you _could_ use
a dictionary to determine whether the word is a compound, but
a better way is to use the actual text of this particular book,
in case the author had some idiosyncrasies. so, in order to
identify such words, we need a computer routine that finds
all of the words that were hyphenated in the middle of a line.
it's not hard to write such a routine, and the results from it
-- for the "my antonia" text-file -- are appended to this post...
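
here is a minimal sketch of such a routine in python. it is not the routine that produced the appended list; the names `truly_hyphenated` and `HYPHENATED` are made up for illustration, and it assumes that a "word" is a run of letters and apostrophes joined by single hyphens. it only collects words that sit wholly inside one line, and it skips any piece that belongs to a word broken at the end of a line, so it does not produce the double-dash entries you will see in the appended list.

```python
import re

# assumption: a hyphenated word is letters/apostrophes joined by single hyphens
HYPHENATED = re.compile(r"[A-Za-z']+(?:-[A-Za-z']+)+")

def truly_hyphenated(text):
    """return the sorted set of words hyphenated in the middle of a line."""
    found = set()
    prev_broken = False  # did the previous line end in a single hyphen?
    for line in text.splitlines():
        line = line.strip()
        for m in HYPHENATED.finditer(line):
            # skip the tail of a word that was broken on the previous line
            if prev_broken and m.start() == 0:
                continue
            # skip the head of a word that is broken at the end of this line
            if line[m.end():] == "-":
                continue
            found.add(m.group(0))
        prev_broken = line.endswith("-") and not line.endswith("--")
    return sorted(found)
```

the default sort puts capitalized words first, which is the same ordering the appended list uses.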
once you have these "truly-hyphenated" words, you can then
write a routine to correctly rejoin end-of-line hyphenated words.
of course, you would only do this _after_ you are done proofing.
(it is _here_ that you would do a compare against a dictionary,
to make sure that you were doing the rejoins "correctly" and
to "flag" any exceptions, so they could be checked more closely,
to ensure that they were indeed idiosyncratic to the book/author.)
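
the rejoin step could be sketched along these lines, again in python and again with made-up names (`rejoin`, `_core`, `keep_hyphen`, `dictionary`). the assumptions: `keep_hyphen` is a lowercased set built from the "truly-hyphenated" list, `dictionary` is an optional lowercased set of ordinary words, and a line ending in a single hyphen marks a broken word. if the rejoined word with its hyphen is in the book's own list, the hyphen stays; otherwise the hyphen is dropped, and the joined piece is flagged if the dictionary doesn't know it.

```python
import re

def _core(word):
    """strip surrounding punctuation, keeping letters, apostrophes, hyphens."""
    return re.sub(r"^[^A-Za-z']+|[^A-Za-z']+$", "", word)

def rejoin(lines, keep_hyphen, dictionary=None):
    """rejoin end-of-line hyphenated words; return (new_lines, flagged)."""
    lines = [ln.rstrip() for ln in lines]
    flagged = []
    for i in range(len(lines) - 1):
        line = lines[i]
        if not line.endswith("-") or line.endswith("--"):
            continue
        # head = broken word's first half, tail = its second half
        before, sep, head = line[:-1].rpartition(" ")
        tail, _, rest = lines[i + 1].lstrip().partition(" ")
        if _core(head + "-" + tail).lower() in keep_hyphen:
            joined = head + "-" + tail      # the book itself uses this hyphen
        else:
            joined = head + tail            # ordinary end-of-line break
            # check only the piece on either side of the removed hyphen
            piece = _core(head.rsplit("-", 1)[-1] + tail.split("-", 1)[0]).lower()
            if dictionary is not None and piece not in dictionary:
                flagged.append(piece)       # needs a closer look by a human
        lines[i] = before + sep + joined    # whole word moves up a line
        lines[i + 1] = rest
    return lines, flagged
```

the number of lines is left unchanged (a line can end up empty), so the output still lines up with the page-scans line by line.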
you probably should also retain a copy with the original linebreaks,
just in case it will come in handy for someone else down the line.
another option would be to code the original linebreaks so that
a viewer-app could keep them if the end-reader wants them, or
delete them (i.e., convert 'em to spaces) if that's what's desired.
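
one way to picture that option, as a rough python sketch: the marker string `<lb>` and the function names `encode_breaks` and `render` are arbitrary choices made for this example, not an existing format. linebreaks inside a paragraph are replaced by the marker, blank lines between paragraphs are left alone, and the viewer-app decides at display time what the marker turns into.

```python
MARK = "<lb>"  # hypothetical marker; assumed never to occur in the text itself

def encode_breaks(text):
    """replace linebreaks inside paragraphs with a marker, keep blank lines."""
    paragraphs = text.split("\n\n")
    return "\n\n".join(p.replace("\n", MARK) for p in paragraphs)

def render(coded, keep_breaks):
    """show original linebreaks if wanted, otherwise turn them into spaces."""
    return coded.replace(MARK, "\n" if keep_breaks else " ")
```

since `render(coded, True)` gives back the original text, nothing is thrown away by storing the coded form.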
after all, isn't the best general attitude to _retain_information_
that can't be reproduced, and not throw it away, if at all possible?
there has been a growing realization that page-break info should
be saved. why isn't the same realization extended to linebreaks?
some people will want to reproduce the original p-book as closely
as possible, and linebreaks would be an important aspect of that.
distributed proofreaders keeps linebreaks up until the last minute --
because they understand that linebreaks help proofing immensely
-- but then they throw them away, seemingly oblivious to the fact
that they also could be extremely useful to someone else later on...
don't throw this valuable information away! someone might want it!
***
as your bonus for today, i've uploaded the .rtf file that finereader
generates automatically out of the o.c.r. process. as i remarked
wednesday, _styling_ information -- which is included in the .rtf --
can be very valuable in helping ascertain the book's structure...
the .rtf can be found at:
> http://snowy.arsc.alaska.edu/bowerbird/myantonia1.rtf
for those who prefer a .zip, it's at:
> http://snowy.arsc.alaska.edu/bowerbird/myantonia1-rtf.zip
as illustration of today's lesson, however, i have uploaded an .rtf
that does _not_ retain the original linebreaks. if you compare it
to the .txt version -- whose text is otherwise identical -- you'll see just how
difficult the comparison can be when that linebreak information is
discarded, and all unnecessarily so, because it could easily be kept.
***
so, was anyone able to "figure out" the heuristics that i was using
in wednesday's routines to ferret out the headers in "my antonia"?
speak up, folks, don't be shy, or you won't get any source-code...
***
see you on monday, with part 4...
-bowerbird
p.s. here is the list of "truly-hyphenated" words in "my antonia":
(this is the list before some unwanted one-of-a-kind exceptions
have been culled from it, such as the first two listed. a double-dash is
an end-of-line hyphen that was part of an otherwise hyphenated word.)
An'-ton-ee-ah
An-tonia
Ar-r-r-mond
Good-bye
Good-evening
Good-morning
High-School
Indian-like
Jake-y
May-basket
Preserving-time
Snow-White
Sunday-School
Te-e-ach
To-day
Wash-day
a-goin'
a-going
a-quiver
absent-minded
alligator-skin
ashy-gray
awkward-looking
ball-gloves
bar-room
bar-tender
barbed-wire
bath-water
bay-window
be-frogged
bed-ticking
bee-bush
black--and-white-check
bluff-top
boarding-house
boarding-houses
boot-tops
bottle-tomatoes
bow-legged
box-couch
box-elder
brand-new
bread-board
breast-pocket
bright-flowered
bringing-up
broad-backed
broad-minded
bronze-green
buffalo-peas
buggy-riding
bull-snake
bull-snakes
bunk-bed
butter-maker
cabinet-maker
cabinet-work
cake-cut--ters
candle-moulds
car-window
car-windows
cared-for
carpenter-shop
case-hardened
cattle-pond
cave-house
cheek-bones
cheerful-looking
chicken-bone
chicken-house
cigar-stand
clap-trap
clasp-knife
close-clipped
close-cropped
close-growing
cloud-hung
coat-tail
coffee-cake
coffee-pot
cologne-scented
cone-flower
cone-flowers
cook-stove
copper-red
corn-knife
court-plaster
cow-pumpkin
crab-apples
cross-roads
cuff-buttons
curly-headed
cutting-tables
day-coaches
deep-seated
deep-seeing
deep-set
developing-room
dining-room
dog-town
dog-towns
door-bell
door-crack
down-feathers
down-hearted
draft-horse
dragon-slayer
draw-bank
draw-bottom
draw-head
draw-side
dressing-room
dress--ing-room
drug-store
dry-goods
dumb-bell
dun-shaded
dusty-smelling
earth-owls
easy-blowing
easy-chair
elder-hunting
ever-falling
every-day
faint-hearted
fair-skinned
fancy-work
far-away
far-seeing
farm-boy
farm-work
fashion-plates
faun-like
feather-bed
feather-beds
feed-bag
fellow-countryman
fellow-countrymen
fifty-five
finger-bowls
finger-exercises
finger-tips
fire-box
first-cabin
flat-chested
flat-topped
forget-me-nots
full-grown
fur-worker
gaming-tables
garment-makers'
ghost-moon
gold-green
gold-headed
gold-washed
golden-rod
good-bye
good-byes
good-looking
good-nature
good-naturedly
good-night
goods-box
grain-cars
grain-sack
grammar-school
grape-arbor
green-topped
griddle-cakes
ground-cherries
ground-cherry
grown-up
guinea-hens
half-buried
half-ear
half-fares
half-grown
half-mile
half-past
half-section
half-story
half-swooning
half-window
half-windows
hand-painted
hand-sheller
hard-and-fast
hard-worked
hay-cave
hay-fights
head-first
heart-broke
heavy-odored
hide--and-seek
high-collared
high-heeled
hitch-bar
hitching-bar
hollow-chested
home-coming
hoo-hoo
horse-collar
horse-pond
horse-sense
house-cleaning
husking-gloves
ice-cold
ice-cream
ink-smeared
iron-gray
ironing-boards
jack-rabbits
kawn-tree
kawntree-man
ketch-on
lady-teacher
lard-pail
large-minded
level-headed
liberal-minded
lifting-up
light-hearted
lightning-flashes
lining-silk
living-room
lodging-house
lone--some-like
long-winded
low-branching
lunch-basket
merry-go-around
merry-go-round
merry-making
mid-day
mid-ocean
milk--ing-pails
mixing-bowl
money-lender
moth-like
mountain-ash
mournful-looking
music-master
natural-born
natural-like
nev-er
never-ending
newly-created
non-existent
nose-bleed
nut-cake
oil-can
oil-cloth
oilcloth-covered
oil--cloth-covered
old-
old-fashioned
old-maidishness
one-night
open-grazing
orange-colored
out-of-door
over-considerate
over-delicate
owl-feather
pagoda-like
pale-blue
pale-gold
paper-hanger
paper-rack
passers-by
pea-vines
peck-measure
picture-books
pig-yards
pillow-cases
plough-handles
plum-patch
pocket-mirror
pocket-money
porch-posts
post-office
prairie-dog
prayer-book
prayer-meeting
pump-handle
quick-changing
quick-footed
rabbit-skin
race-course
raw-boned
reading-table
ready-made
reap--ing-hook
red-lined
ripple-marks
rocking-chair
roll-call
round-collared
round-eyed
round-shouldered
run-over
rye-field
saddle-bags
sand-blast
school-teachers
scroll-work
scrub-oaks
sea-coast
section-lines
self-asser--tiveness
self-possessed
self-possession
self-sacrifice
shaving-mug
sheep-fold
sheet-iron
shirt-sleeves
shooting-coat
side-road
silver-rimmed
sing-song
sitting-room
sky-line
slack-wire
sleigh-bells
sleigh-ride
slim-waisted
small-town
snake-cane
snake-skins
snow-banks
snow-blocked
snow-covered
snow-field
snow-men
snow-on-the-mountain
snow-white
soft-hearted
stage-driver
steel-gray
stop-watch
stopping-place
store-boxes
storm-win--dows
straw-colored
strong-minded
sturdy-looking
sugar-cakes
sun-down
sun-up
sun-warmed
sunflower-bordered
supper-table
sway-backed
te-e-ach
tea-kettle
ten-dollar
thick-set
thoughtful-look--ing
thrashing-machines
title-page
to-day
to-morrow
to-night
train-crew
tree-tops
trom--bone-player
trotting-horse
trot--ting-buggy
trout-fishing
trumpet-blowing
turning-lathe
twenty-dollar
twenty-four
twenty-six
un-wishful
under-side
up-bringing
violin-teacher
violin-teachf
waffle-irons
wagon-box
wagon-seat
wagon-tracks
wagon-trail
wall-paper
war-times
warm-blooded
warm-hearted
wash-basin
wash-boiler
watch-case
watch-charm
water-pipes
water-spaniel
water-tap
water-tin
weak-minded
wedding-ring
week-day
well-behaved
well-dressed
well-fed
well-grown
well-made
well-planted
well-prepared
well-set-up
well-to-do
west-bound
wheel-rim
wheel-ruts
whinny-laugh
white-handed
white--and-gold
wild-eyed
wild-looking
win-ter
windmill-frame
window-blinds
window-shade
wine-colored
wine-stains
wo-man
wolf-cry
women-folk
work-basket
work-horse
work-horses
work-room
work-table
work-team
work-teams
working-man
working-slippers
yard-measure
yoke-mate