
On Jan 5, 2012, at 2:30 AM, Keith J. Schultz wrote:
I believe I can help you set up the heuristics that you will need, as this kind of work is right up my alley ... What I am uncertain about is whether you are using scans or text files.
Hi Keith. I'm using a single text file. In particular, I made it work with the concatenated text file that comes out of DP. My goal is to make it easier for someone to get into post-processing. I know many want to but are overwhelmed by PPing, especially the parts that involve formats other than text.

The format recognition code already works pretty well, and I'm surprised how fast I can go from a DP text file to HTML. I haven't written back-end generators for other formats, but that would be next.

Since it is a two-step process, and since I can manually mark up the infrequent special cases, it's not that important to work on the heuristics to get everything just right automagically. It would be important if this were going to be applied as part of a script to a whole catalog collection, which I won't be doing anytime soon.

So thanks for the offer to improve the first pass (text to 'p-code' equivalent). I could send you the source code for this (Python 3) anytime; I would just have to document the intermediate form, and you could make it more accurate. However, unless that is just academically interesting to you, it's probably not necessary.

--Roger
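
For illustration only, here is a rough Python 3 sketch of the two-step idea described above: a first pass that guesses block types from plain text, and a back end that turns the result into HTML. This is not the actual source. The function names, the list-of-tuples intermediate form, the all-caps heading rule, and the [[kind]] manual override marker are all assumptions made up for this example.

```python
import html
import re

# Hypothetical manual marker, e.g. "[[quote]] some text", that overrides the heuristics.
MARKER = re.compile(r"^\[\[(\w+)\]\]\s*(.*)$", re.S)


def recognize(text):
    """Pass 1: turn plain text into a list of (kind, content) blocks."""
    blocks = []
    for chunk in re.split(r"\n\s*\n", text.strip()):
        if not chunk.strip():
            continue  # skip empty input
        lines = [ln.strip() for ln in chunk.splitlines()]
        joined = " ".join(lines)
        m = MARKER.match(joined)
        if m:
            # Manually marked special case: trust the marker.
            blocks.append((m.group(1), m.group(2)))
        elif len(lines) == 1 and joined.isupper():
            # Assumed heuristic: a lone all-caps line is a heading.
            blocks.append(("heading", joined))
        else:
            blocks.append(("para", joined))
    return blocks


def to_html(blocks):
    """Pass 2: one possible back end, intermediate form to HTML."""
    tags = {"heading": "h2", "para": "p", "quote": "blockquote"}
    out = []
    for kind, content in blocks:
        tag = tags.get(kind, "p")  # unknown kinds fall back to a paragraph
        out.append(f"<{tag}>{html.escape(content)}</{tag}>")
    return "\n".join(out)


if __name__ == "__main__":
    sample = "CHAPTER I\n\nIt was a dark\nand stormy night.\n\n[[quote]] Some quoted text."
    print(to_html(recognize(sample)))
```

Because the back end only sees the intermediate blocks, a generator for another format could be added without touching the recognition pass, and a hand-placed marker settles any case the heuristics get wrong.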