Re: [gutvol-d] Fwd: Re: epubeditor.sourceforge.net

i'm not directing _any_ of this at you, alex... none... these are things i've said before, and will say again, here and elsewhere in cyberspace, many many times.

you seem swell, alex, all reasonable and everything. i only wish there were more like you at archive.org... especially the people who _decide_ on the workflow.

and yes, one-off corrections don't impress me any. until the entire library is fixed, no one can develop apps that will add value to the library qua library...

just in case anyone is unclear about what i've been talking about, i've appended the x.m.l. data saved for a page in "the art of the book" that has nothing more than a heading on it that says "great britain".
http://www.archive.org/stream/artofbook00holm#page/n12/mode/1up
i think we can agree that that's a whole lot of mud that we need to scrape off a page that has 2 words.

-bowerbird

p.s. i changed the angle-brackets to square-brackets, so as not to confuse them with any .html processing...

----------------------------------------------------

[page width="2935" height="4285" resolution="300" originalCoords="true"]
 [block blockType="Text" l="594" t="802" r="2082" b="966"]
  [region]
   [rect l="594" t="802" r="2082" b="966"] [/rect]
  [/region]
  [text]
   [par lineSpacing="128"]
    [line baseline="958" l="608" t="832" r="2064" b="960"]
     [formatting lang="EnglishUnitedStates" ff="Times New Roman" fs="44." bold="true" spacing="-16"]
      [charParams l="608" t="836" r="736" b="960" wordStart="true" wordFromDictionary="true" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="100" serifProbability="40" wordPenalty="0" meanStrokeWidth="205"] G [/charParams]
      [charParams l="748" t="832" r="852" b="956" wordStart="false" wordFromDictionary="true" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="100" serifProbability="100" wordPenalty="0" meanStrokeWidth="205"] R [/charParams]
      [charParams l="864" t="832" r="964" b="960" wordStart="false" wordFromDictionary="true" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="100" serifProbability="100" wordPenalty="0" meanStrokeWidth="205"] E [/charParams]
      [charParams l="972" t="836" r="1096" b="956" wordStart="false" wordFromDictionary="true" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="69" serifProbability="98" wordPenalty="0" meanStrokeWidth="205"] A [/charParams]
      [charParams l="1100" t="832" r="1232" b="956" wordStart="false" wordFromDictionary="true" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="100" serifProbability="255" wordPenalty="0" meanStrokeWidth="205"] T [/charParams]
      [charParams l="1232" t="832" r="1304" b="958" wordStart="false" wordFromDictionary="false" wordNormal="false" wordNumeric="false" wordIdentifier="false" charConfidence="255" serifProbability="255" wordPenalty="0" meanStrokeWidth="0"] [/charParams]
      [charParams l="1304" t="836" r="1404" b="956" wordStart="true" wordFromDictionary="true" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="100" serifProbability="98" wordPenalty="0" meanStrokeWidth="236"] B [/charParams]
      [charParams l="1416" t="832" r="1524" b="956" wordStart="false" wordFromDictionary="true" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="100" serifProbability="100" wordPenalty="0" meanStrokeWidth="236"] R [/charParams]
      [charParams l="1536" t="836" r="1584" b="956" wordStart="false" wordFromDictionary="true" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="100" serifProbability="100" wordPenalty="0" meanStrokeWidth="236"] I [/charParams]
      [charParams l="1596" t="832" r="1728" b="956" wordStart="false" wordFromDictionary="true" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="100" serifProbability="255" wordPenalty="0" meanStrokeWidth="236"] T [/charParams]
      [charParams l="1728" t="836" r="1852" b="956" wordStart="false" wordFromDictionary="true" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="100" serifProbability="98" wordPenalty="0" meanStrokeWidth="236"] A [/charParams]
      [charParams l="1864" t="836" r="1916" b="956" wordStart="false" wordFromDictionary="true" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="72" serifProbability="100" wordPenalty="0" meanStrokeWidth="236"] I [/charParams]
      [charParams l="1932" t="836" r="2064" b="960" wordStart="false" wordFromDictionary="true" wordNormal="true" wordNumeric="false" wordIdentifier="false" charConfidence="82" serifProbability="100" wordPenalty="0" meanStrokeWidth="236"] N [/charParams]
     [/formatting]
    [/line]
   [/par]
  [/text]
 [/block]
[/page]
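
For anyone who wants to see how little it takes to scrape the mud off, here is a minimal sketch in Python. It assumes the square-bracket form posted above, turns it back into angle brackets with a naive replace (which would break if the page text itself contained brackets), and joins the per-character elements of each line. The function name `page_text` and the abbreviated `SAMPLE` are made up for illustration; the files archive.org actually serves use angle brackets and may declare an XML namespace, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hypothetical sample in the square-bracket form used in this post.
# Only a few attributes are kept; the real dump carries many more per character.
SAMPLE = (
    '[page width="2935" height="4285"]'
    '[block blockType="Text"][text][par][line baseline="958"]'
    '[charParams l="608" charConfidence="100"] G [/charParams]'
    '[charParams l="1232" charConfidence="255"] [/charParams]'
    '[charParams l="1304" charConfidence="100"] B [/charParams]'
    '[/line][/par][/text][/block][/page]'
)


def page_text(bracketed: str) -> str:
    """Return the plain text of one page, one output line per [line] element."""
    # Undo the square-bracket substitution (naive: assumes no literal brackets in the text).
    xml = bracketed.replace("[", "<").replace("]", ">")
    root = ET.fromstring(xml)
    lines = []
    for line in root.iter("line"):
        chars = []
        for cp in line.iter("charParams"):
            ch = (cp.text or "").strip()
            # An empty charParams element is how the dump encodes a space.
            chars.append(ch if ch else " ")
        lines.append("".join(chars))
    return "\n".join(lines)


print(page_text(SAMPLE))
```

Run against the full dump above, the same function should yield the two words of the heading; everything else in the file is layout and confidence data.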

On 24 October 2011 23:53, <Bowerbird@aol.com> wrote:
> http://www.archive.org/stream/artofbook00holm#page/n12/mode/1up
> i think we can agree that that's a whole lot of mud that we need to scrape off a page that has 2 words.

Sure - if you're only interested in the text content, it's quite useless. It is useful for OCR research to have that data, so I'm glad they provide it - not as useful as corrected text, granted, but I think the clearest example of the value of such data is reCAPTCHA, which (in part of its operation) compares the output of two OCR systems, and extracts images from the coordinates where they disagree.

-- 
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you
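
To make the coordinate argument concrete, here is a rough sketch in Python of the "compare two OCR outputs and keep the boxes where they disagree" idea. It is not reCAPTCHA's actual method, only an illustration under stated assumptions: each engine is assumed to yield words with `l, t, r, b` boxes, the words are aligned with the standard library's `difflib.SequenceMatcher`, and the `Word` type, the function name, and the second engine's misreading are all invented for the example. The boxes are the word extents from the dump quoted earlier in the thread.

```python
from difflib import SequenceMatcher
from typing import List, NamedTuple, Tuple

Box = Tuple[int, int, int, int]  # l, t, r, b in page pixels


class Word(NamedTuple):
    text: str
    box: Box


def disagreement_boxes(a: List[Word], b: List[Word]) -> List[Box]:
    """Return boxes (in engine A's coordinates) where the two word sequences differ."""
    matcher = SequenceMatcher(None, [w.text for w in a], [w.text for w in b], autojunk=False)
    boxes = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # Skip agreements, and pure insertions that have no box on engine A's side.
        if tag == "equal" or i1 == i2:
            continue
        span = a[i1:i2]
        # Union of the boxes of all of engine A's words in the disputed span.
        boxes.append((
            min(w.box[0] for w in span),
            min(w.box[1] for w in span),
            max(w.box[2] for w in span),
            max(w.box[3] for w in span),
        ))
    return boxes


# Engine A reads the heading correctly; hypothetical engine B confuses "I" with "l".
engine_a = [Word("GREAT", (608, 832, 1232, 960)), Word("BRITAIN", (1304, 832, 2064, 960))]
engine_b = [Word("GREAT", (608, 832, 1232, 960)), Word("BRITAlN", (1304, 832, 2064, 960))]

print(disagreement_boxes(engine_a, engine_b))
```

Each returned box could then be used to crop the page image for a human to transcribe, which is only possible because the per-character coordinates were kept.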
participants (2): Bowerbird@aol.com, Jimmy O'Regan