
carlo said:
How do you include the information in the files if it has been removed?
go back to a source and get it, that's how. where applicable, information about page-breaks can be obtained from the d.p. proofed text-files; it's a simple matter of matching up image-scans with the text they contained. (see the page that contains a table of the scans with their text-files.) that's why i offered a demo using that specifically. (nonetheless, it's positively _criminal_ that we should even have to do _anything_ to re-gain this information, since it was _willfully_ discarded. when is this bad practice going to be halted?) for books not done by distributed proofreaders, it's as easy as loading the text-file into my viewer and clicking on each word that starts a new page as you get that information by viewing a paper-copy. (my viewer will then save an updated copy of the file.) this process can be facilitated by setting the leading so the lines-per-page is equivalent to the paper-copy, making the task almost trivially easy (but still useful!).
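a minimal sketch of that matching step, in python, for the d.p. case. it assumes you have the per-page proofed text in reading order, and that the first few words of each page survived post-processing unchanged. the function name, the `probe_words` parameter, and the `[page: ...]` marker format are all made up for illustration; they are not part of any d.p. or p.g. tool.

```python
import re

def restore_page_breaks(final_text, pages, probe_words=4):
    """Insert a marker into final_text where each proofed page begins.

    pages is a list of (scan_name, page_text) tuples in reading order.
    The "[page: ...]" marker is a hypothetical format, not a standard.
    """
    hits = []
    cursor = 0
    for scan_name, page_text in pages:
        words = page_text.split()[:probe_words]
        if not words:
            continue  # blank page: nothing to match
        # Allow any whitespace between words, since line breaks get rewrapped.
        pattern = re.compile(r"\s+".join(re.escape(w) for w in words))
        m = pattern.search(final_text, cursor)
        if m is None:
            continue  # opening words were edited; leave for manual fixing
        hits.append((m.start(), scan_name))
        cursor = m.end()  # later pages must match further along

    # Rebuild the text with a marker in front of each matched position.
    out = []
    last = 0
    for pos, scan_name in hits:
        out.append(final_text[last:pos])
        out.append("[page: %s]\n" % scan_name)
        last = pos
    out.append(final_text[last:])
    return "".join(out)
```

pages whose opening words were changed (a rejoined hyphenated word, say, or a moved footnote) are simply skipped here, so they would still need the click-on-the-word treatment by hand.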
And moreover, how do you find the correct page when some material (e.g. the footnotes) has been moved, and the page contents are no longer consecutive?
footnotes are easy. (my viewer displays them on the page where they are called anyway, so there's no problem there.) and if you point me to some examples of the other "material" that is moved, i'll be happy to tell you how i'd deal with that.
I have a solution to both problems for DP-produced books using the files output by DP before the post-processing stage;
right.
these files correspond to individual pages of the original book, and you can find the image corresponding to a fragment of text through a grep on the DP-file.
that's one way of doing it. but why not run the process systematically, one time, restoring the page-break information in the text-files, and incorporating the ability to grab the image-scans -- automatically and simply -- using that information. i'm sure you know that the eyes of most users glaze over when you start talking about "grep". besides, what needs to be done is to _thoroughly_incorporate_ the error-reporting process _into_ the end-user's reading-experience, so as to maximize the eyeballs of all the people reading the e-texts. it's just a shame that -- at the same time readers are condemning the e-texts because "they are full of errors" -- practically _nothing_ is being done to harness their ability to _catch_ and _report_ errors.
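as a rough sketch of what "automatically and simply" could mean: once page-break markers are sitting in the text-file, finding the scan for any spot in the text is a lookup, with no grep needed. the code assumes a hypothetical marker of the form `[page: 001.png]` on the line before each page's text; the function names and the shape of the report are likewise assumptions for illustration.

```python
import re

# Hypothetical marker format: "[page: <scan file name>]".
MARKER = re.compile(r"\[page: ([^\]]+)\]")

def scan_for_offset(text, offset):
    """Return the scan named by the last marker at or before offset."""
    scan = None
    for m in MARKER.finditer(text):
        if m.start() > offset:
            break
        scan = m.group(1)
    return scan  # None if the offset comes before the first marker

def error_report(text, offset, suggestion):
    """Bundle a reader's suggested fix with the scan needed to check it."""
    return {
        "scan": scan_for_offset(text, offset),
        "offset": offset,
        "suggestion": suggestion,
    }
```

a viewer could call `error_report` with the position of whatever word the reader clicked on, so the person checking the report gets the right image-scan without searching for it.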
The concept has been implemented recently by a student, and a test of 300 recently posted PG ebooks should be publicly available before the end of this week. This is part of a system for ebook maintenance (a user can submit a proposed correction to a text through a web page, after consulting the original images, and an administrator can later accept or reject the proposals and automatically obtain a corrected version).
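A minimal sketch of that submit, review, and apply cycle, in Python. It is not the student's implementation, whose details are not given here; the `Proposal` class, its field names, and the status values are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    page: str                # scan name the user consulted (hypothetical field)
    old: str                 # text as it currently reads
    new: str                 # text the user proposes instead
    status: str = "pending"  # "pending", "accepted", or "rejected"

def apply_accepted(pages, proposals):
    """Return a corrected copy of pages with accepted proposals applied.

    pages maps scan name to page text. The input dict is not modified.
    """
    corrected = dict(pages)
    for p in proposals:
        if p.status != "accepted":
            continue
        text = corrected.get(p.page, "")
        # Skip empty, stale, or ambiguous proposals rather than guess.
        if not p.old or text.count(p.old) != 1:
            continue
        corrected[p.page] = text.replace(p.old, p.new)
    return corrected
```

Requiring the old text to occur exactly once on the page is a deliberately cautious choice: a proposal that no longer matches, or that matches in several places, is left for a person to sort out.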
sounds like a process i described in great detail months ago here. i'm glad somebody is programming it for you guys, because i'll be leaving here shortly. but i intend to write the app anyway, because users who want to grab content from the million-book-project will need it to turn those scans into nicely-proofed and formatted text...

-bowerbird