re: [gutvol-d] ok then, let's go to work (part 1)

marcello said:
what an original idea ...
yeah, i thought so. strangely, jon noring didn't do that, though. he grabbed an existing version of the text, and proofread _that_ against the scans... that's why he's still got _errors_ in his text, errors i found by comparing that text with the output that i got from finereader's o.c.r. (thus, the .pdf does not contain jon's errors...) crosschecking independently-derived e-texts in this manner is a great way to do proofing.
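the crosscheck described above can be sketched in a few lines. this is only a minimal illustration using python's standard difflib module, not the actual procedure used on this book: the function name `crosscheck` and the two sample strings are made up for the example, and a real run would read the two e-texts from files.

```python
import difflib


def crosscheck(text_a, text_b):
    """Return word-level disagreements between two e-texts of the same book.

    Each item is a pair (words_from_a, words_from_b) covering one spot where
    the two independently derived versions differ.
    """
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False keeps difflib from ignoring very frequent words in long texts
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    diffs = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(words_a[a1:a2]), " ".join(words_b[b1:b2])))
    return diffs


# hypothetical sample lines, invented for illustration only
proofed = "the quick brown fox jumps over the lazy dog"
ocr_output = "the quick brown fox jumps ovcr the lazy dog"

for from_a, from_b in crosscheck(proofed, ocr_output):
    print(f"a: {from_a!r}   b: {from_b!r}")
```

every reported pair still has to be checked against the scan, since a disagreement only says that at least one of the two versions is wrong, not which one.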
Now, as you yourself said, the pdf format is useless for further editing.
well, actually, i didn't say that. because there are ways now to edit a .pdf. it's still clumsy, though. plus, if you were to correct the errors under the finereader interface, and _then_ generate the .pdf, the output would be fine.
mostly i don't care much for the .pdf format. but some people seem to like it, even prefer it. truth be told, in this case, where each page of the book will fit comfortably on a monitor -- a two-page spread fills the screen nicely -- .pdf is a perfectly good way to read this book. and as the world moves toward tablet p.c.'s, some disadvantages of acrobat will fade away. (except for its inability to copy out clean text, which might be a problem that adobe finds is extremely difficult to overcome, given its m.o.)
but the reason i commented on the fact that finereader does an excellent job of capturing the _formatting_ is that -- believe it or not -- too many digitization efforts, including d.p.!, often just _throw_away_ all that formatting, outputting o.c.r. results to raw-ascii plain-text. this is such a waste! there is extremely valuable information in that formatting, information about the _structure_ of the e-text. indeed, in a good many ways, the art of typesetting is the communication of information about the text's underlying structure. (in a phrase, "some content is _in_ the presentation".)
So why don't you try to output something useful, like, say, an XML file.
since you are one of the x.m.l. boosters here, why don't you? and then take that x.m.l. output and run it through all your xslt conversions, and we'll see what we get! i'm curious to see how robust your methodology is...
i highly suspect, though, that you will find that the x.m.l. that finereader outputs is a kind that you are unable to use. and _that_ box of problems is one you don't want to open. it's much more amenable to your x.m.l. hype machine to continue to pretend that any kind of "x.m.l." can be mapped to any other kind of "x.m.l." of course, if that were _really_ true, y'all wouldn't be having the .tei vs. .xhtml vs. .docbook discussions, you'd just use one format and convert to the other.
If you were able to get a TEI file out of your finereader, you could actually
and if pigs could fly, we could use them for air transportation, and not worry about muslim terrorists hijacking the "planes"... (i hope muslims will forgive me for that bit of pig humor...) ;+)
If you were able to get a TEI file out of your finereader, you could actually proofread the thing and produce a pdf file out of it that looks a *lot* better than the one you posted.
well, as one of the .tei boosters here, why don't _you_ do that? then we'll compare your .pdf with my .pdf -- the one i create after having proofed the text and created a .zml file out of it -- and we'll see which .pdf looks the best of all... my .pdf will be up within the next two weeks, depending on how quickly i get through the thread i have just created, so you've got a little time to do your .tei markup. be my guest...
Besides you would get an html and a plain text file into the bargain, ready for posting.
finereader will create an .html file too, automatically. i'll be uploading that in a few days as well for you to see. and it does a pretty good job of creating an html-book, at least from the standpoint of how the thing _looks_. and of course, it will output a plain-text file. that one will be uploaded tomorrow.
it'll also create an .rtf file, which is important because that retains much of the formatting information that i just noted is very important in ascertaining structure. so i'll upload that eventually as well. and i'll be making commentary on all of these uploads. (heck, finereader will even create a powerpoint-book. if only all these formats weren't plagued by scannos.)
so fasten your seatbelts, folks, it's gonna be a ride...
-bowerbird
p.s. i probably won't bother responding to marcello much more, since most of his comments are usually not just not worthwhile, but an actual waste of time. so if he makes a point that you want me to address, it would probably be best if you'd reiterate that point, in your own words. in which case, i'll be happy to reply.