re: [gutvol-d] Copyright Verification?

jon said:
My focus on scanning goes beyond just OCR purposes -- I think if substantial work is being expended to acquire and scan a book, it takes only a little extra effort to scan at archival quality, which is at least 600 dpi optical (and 256 color greyscale for bitonal and even better 24-bit color.)
i used to agree with you on this, jon. when a page was originally printed as black ink on white paper, i do not see why to scan it as anything but b/w. you're just increasing the difficulty. it was your example project of scanning "my antonia" that convinced me of this. :+) the huge size of those scans made them unnecessarily hard to create and process, in terms of time, difficulty, and resources. (far more than "only a little extra effort".) the realization struck home more fully with your "kama sutra" scans, when you asked for advice in what was essentially converting them _back_ to black/white. color pages? certainly. scan them in color. black-and-white pages? black-and-white scans. and 300 d.p.i. is probably a high enough resolution. if the future wants 'em higher, let the future scan 'em.
http://www.geoffhorton.com/pictureocr/instructions.html has an explanation of what I do.
geoff, i certainly didn't mean to negate your efforts at using a digital camera to scan. as i've said before, i think if you've figured out the factors that will lead to better results when taking that approach, you've learned some very important information, and i applaud you for stepping forward and sharing it with people... :+) -bowerbird
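The size gap between a 300 dpi black-and-white scan and a 600 dpi full-color scan can be estimated with simple arithmetic. The sketch below is only an illustration: the 5 x 8 inch page and the function name `raw_scan_bytes` are assumptions, not figures from the thread, and the numbers are uncompressed sizes (real files will be smaller after compression).

```python
# Assumed page dimensions, for illustration only (not from the thread).
PAGE_WIDTH_IN = 5.0
PAGE_HEIGHT_IN = 8.0


def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one page scan, ignoring file headers."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    # Round up to whole bytes (matters only for 1-bit bitonal data).
    return (pixels * bits_per_pixel + 7) // 8


# The scan settings discussed in the thread: bitonal, 8-bit grey, 24-bit color.
settings = [
    ("300 dpi bitonal", 300, 1),
    ("600 dpi bitonal", 600, 1),
    ("600 dpi 8-bit greyscale", 600, 8),
    ("600 dpi 24-bit color", 600, 24),
]

for label, dpi, bpp in settings:
    size = raw_scan_bytes(PAGE_WIDTH_IN, PAGE_HEIGHT_IN, dpi, bpp)
    print(f"{label:>24}: {size / 1_000_000:6.2f} MB per page")
```

Doubling the resolution quadruples the pixel count, and going from 1 bit to 24 bits per pixel multiplies the raw data by 24, so under these assumptions the 600 dpi color page is 96 times the raw size of the 300 dpi bitonal page.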

Hi Everybody,
I am enjoying this thread.
But do not forget you only need 144dpi for good scanning and OCR. Whether you believe it or not, even if the original is poor you may get scans and fewer errors at lower resolution. In general, the more resolution, the poorer OCR becomes unless you tweak the scans and/or the OCR software, so digital cameras will work fine for the general OCR needs of PG.
With that said, do not forget PG does want to create commercial quality or professional quality archives.
regards Keith.

I think it depends on the text and the OCR software. In my experience, one does much better at 300 or 400 DPI than 150, especially when there are footnotes or accents. And I personally would not accept less than 600 DPI as the original scan resolution for illustrations, even if the web version is only 75 DPI.
R C
(If anyone saw my handful of micro-card projects, those were scanned at roughly the equivalent of 130 DPI; even with resampling at 4x size and edge enhancement, the scans are very poor... and so was the OCR.)
On 7/13/05, Keith J. Schultz <schultzk@uni-trier.de> wrote:
Hi Everybody,
I am enjoying this thread.
But do not forget you only need 144dpi for good scanning and OCR. Whether you believe it or not, even if the original is poor you may get scans and fewer errors at lower resolution. In general, the more resolution, the poorer OCR becomes unless you tweak the scans and/or the OCR software, so digital cameras will work fine for the general OCR needs of PG.
With that said, do not forget PG does want to create commercial quality or professional quality archives.
regards Keith.
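One way to see why footnotes and accents suffer at lower resolutions is to work out how many pixels tall small type ends up at each DPI setting. This is a rough sketch, assuming the desktop-publishing point of 1/72 inch (traditional printers' points differ slightly); `type_height_px` is a hypothetical helper, and the result is the full body height of the type, so an accent or a comma covers only a fraction of it.

```python
POINTS_PER_INCH = 72.0  # assumption: DTP point; printers' points are slightly smaller


def type_height_px(point_size, dpi):
    """Approximate body height in pixels of type at a given scan resolution."""
    return point_size * dpi / POINTS_PER_INCH


# Resolutions mentioned in the thread, applied to 5-point and 10-point type.
for dpi in (130, 144, 150, 300, 400, 600):
    small = type_height_px(5, dpi)
    body = type_height_px(10, dpi)
    print(f"{dpi:>3} dpi: 5 pt type ~ {small:5.1f} px, 10 pt type ~ {body:5.1f} px")
```

The arithmetic says nothing about what any particular OCR engine needs; it only shows how few pixels are available for fine print at the lower settings.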

Robert Cicconetti wrote:
I think it depends on the text and the OCR software. In my experience, one does much better at 300 or 400 DPI than 150, especially when there are footnotes or accents. And I personally would not accept less than 600 DPI as the original scan resolution for illustrations, even if the web version is only 75 DPI.
This is my understanding as well. Very fine print (such as 4-5 point) requires a little higher resolution to get a reasonable OCR. And then, as Robert mentions, there is proper legibility of the "fine stuff", such as punctuation and accent marks. I've thought of performing the experiment myself: finding some page with very fine print and lots of accents/small punctuation, scanning it at various resolutions (as well as resampling down from hi-res for further comparison), and running it through OCR.
Regarding illustrations, a minimum of 600 dpi is necessary to catch the finer detail of the "dots", both to make the picture look nicer "as is" and to improve the quality of merging the dots during "demoire" processing.
*****
As a final note, the possibility that the scans have usefulness beyond just OCR just doesn't register with many. That is, to them, the purpose of the scans is solely for OCR -- the scans themselves are ephemeral. However, the scans themselves have value, including for direct reading. Scanning at higher resolution gives more to play with in image restoration for direct reading uses.
For the "My Antonia" demo project I chose 600 dpi. I've already put up the cleaned-up scans (but not the original scans, which are full color, 600 dpi and thus huge) at http://www.openreader.org/myantonia/ in a few forms, including merged in a DjVu file (working on a PDF as well). It is certainly possible, as Bowerbird lamented, to produce other resolutions and color depths as there is need for. If one starts with high resolution, one can always sample downwards to fit different needs. But if one starts with low resolution, one can't recover the higher-rez information -- it is lost forever.
I've always believed that every produced structured digital text (SDT) should have the page scans available side-by-side (need not be the original scans, but something readable such as DjVu or PDF). This serves to make it easier for people to verify transcription errors; to provide authentication of the SDT (that 1. the source is Public Domain and 2. it was transcribed properly); in a few cases to resolve content ambiguities introduced in going from rich typography to simple text; and to provide a good feel of the original typography as well as assisting those who may wish to produce a rich typography edition which emulates the original. There are probably a couple of other benefits as well in making the scans available.
For ease of reading, it does help to have made the scans at higher-rez so image processing magic can be done to maintain the highest quality readability even when reduced in resolution. All of these purposes are better served by starting out with higher-resolution scans.
Certainly, a few years ago it was more impractical to save the scans for thousands of books, much less scan at 600 dpi or higher. But today, with higher-speed Internet, archives willing to hold scans (such as IA), much greater personal disk capacities, personal backup on DVD, etc., it has become eminently practical to keep the scans and to scan at higher resolution.
Maybe a few of us need to band together to accept the scans from the activities of PG and DP, so they can be separately preserved and have their own catalog/metadata system -- sort of a scan repository for safe-keeping. For the time being the scans can simply be dumped on a 300 gig RAID system (with DVD temporary offsite backup or whatever makes sense), as well as uploaded to IA in a special area with uniform metadata. Later we could make them available. We'd call for those at PG and DP who are scanning to consider upping the resolution to 600 dpi. We could even assist with resampling them down for people.
Just some crazy ideas to ponder....
Jon
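The point that one can always sample downwards but never recover the lost detail can be shown with a toy example. The sketch below is an illustration under stated assumptions, not how any scanning tool actually resamples: it works on a single row of 8-bit grey values, downsamples by averaging neighbouring pairs (a simple box filter), and upsamples by repeating pixels. The function names are hypothetical.

```python
def downsample_2x(row):
    """Halve the resolution of one pixel row by averaging adjacent pairs."""
    return [(row[i] + row[i + 1]) // 2 for i in range(0, len(row) - 1, 2)]


def upsample_2x(row):
    """Double the resolution of one pixel row by repeating each pixel."""
    return [value for value in row for _ in range(2)]


# Three different "high-res" rows: two fine dot patterns and a flat grey.
dots_a = [0, 255, 0, 255]
dots_b = [255, 0, 255, 0]
flat_grey = [127, 127, 127, 127]

for name, row in [("dots_a", dots_a), ("dots_b", dots_b), ("flat_grey", flat_grey)]:
    low = downsample_2x(row)
    print(f"{name:>9}: {row} -> {low} -> {upsample_2x(low)}")
```

All three rows collapse to the same low-resolution values, so nothing in the low-resolution data can tell them apart afterwards. Going the other way, any of the lower resolutions can still be produced from the high-resolution original whenever there is a need for it.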
participants (5)
- Bowerbird@aol.com
- Geoff Horton
- Jon Noring
- Keith J. Schultz
- Robert Cicconetti