it's monday, after noon, let's get the crowdsourcing started

greg- i'm ready with my crowdsourcing thing. are you gonna provide the pagescans? or do we have to dig 'em up ourselves? too bad all the p.g. e-texts are rewrapped... that's gonna make re-proofing very difficult. any book(s) in particular you want to start with? -bowerbird

My suggestion: start with etext03 (etexts 3601-4800). So far as I know, they were all done by independents (I think DP's productions didn't start until etext04). A lot of them contain no publisher/copyright info, and they're nasties that David Widger and I gave up on during our repost project of a couple of years ago. I'd further suggest leaving alone any copyrighted texts.

Al

I'll pull out just the tables...

Where do we find page images?

On 1/30/2012 4:33 PM, don kretz wrote:
Where do we find page images?
I pull them from Internet Archive. Find the book you're interested in, and on its page you should see a little link on the left that says "All Files: HTTP." The link will take you to the list of files making up that entry. The zip file containing the page scans will usually be "*jp2.zip" or "*jp.zip", depending on whether the jpegs are the old format or JPEG2000. Once in a while you can find TIFF files, which tend to be more useful.
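
For scripting it, something like this usually pulls down the scan zip (an untested sketch; the identifier below is just a placeholder, and it assumes the usual archive.org/download/<identifier>/ layout):

    # Sketch: fetch the page-scan zip for an Internet Archive item.
    # Assumes the usual https://archive.org/download/<identifier>/ layout;
    # the identifier below is a placeholder, not a real book.
    import urllib.request
    import urllib.error

    identifier = "somebook00auth"   # hypothetical IA identifier
    for suffix in ("_jp2.zip", "_jp.zip", "_tif.zip"):
        url = f"https://archive.org/download/{identifier}/{identifier}{suffix}"
        try:
            urllib.request.urlretrieve(url, identifier + suffix)
            print("got", url)
            break
        except urllib.error.HTTPError:
            continue   # that naming convention isn't there; try the next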

I didn't realize they have us so completely covered. Including matching edition? I'm a bit familiar with TIA sources.

Check this out and click on a page number at the far right: http://eb.tbicl.org/fouriers-series/

That's accomplished by embedding

sd ssdsd ssd khkjkj [pgnum]1153[/pgnum] khjh Ssd wwShk

at exactly the right spot in the text flow. You can do more powerful stuff if you don't embed immutable html markup.

On Mon, January 30, 2012 10:40 pm, don kretz wrote:
That's accomplished by embedding

sd ssdsd ssd khkjkj [pgnum]1153[/pgnum] khjh Ssd wwShk

at exactly the right spot in the text flow. You can do more powerful stuff if you don't embed immutable html markup.
This is virtually the same thing I do, except I place the page number in the "title" attribute instead of in phrasing content.

This leads me to my first rule of HTML construction (mirrored in the DP "Proofreader's Guide to EPUB"), which is:

1. Do not use markup that requires CSS. Style sheets are encouraged and recommended, but every HTML file must be acceptably legible when CSS is disabled.

In your case, a CSS style of "a.pgnum {display:none}" would cause the inline page number to disappear, but it would still disrupt the text when CSS is disabled. I don't know if KindleGen supports the "display:none" style, but if it doesn't you would have page numbers throughout the text.

I haven't given it a lot of thought, but I believe you could get a page number display similar to your example by using JavaScript and reading the "title" attribute of the <a> objects. Fancy browsers that support JavaScript could get the floating page number display, but they wouldn't be obvious in a less capable User Agent.
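
A rough, untested sketch of that kind of conversion -- turning the [pgnum] markers from the earlier example into empty anchors that carry the number only in the title attribute (the class and id names here are just illustrative, not anyone's actual convention):

    # Sketch: convert [pgnum]1153[/pgnum] markers into empty anchors whose
    # page number lives only in the title attribute, so nothing shows up in
    # the text flow even with CSS disabled. Class/id names are illustrative.
    import re

    MARKER = re.compile(r'\[pgnum\](\d+)\[/pgnum\]')

    def markers_to_anchors(text: str) -> str:
        return MARKER.sub(r'<a class="pgnum" id="pg\1" title="\1"></a>', text)

    print(markers_to_anchors("khkjkj [pgnum]1153[/pgnum] khjh"))
    # -> khkjkj <a class="pgnum" id="pg1153" title="1153"></a> khjh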

"a.pgnum {display:none}"
What I have seen is that neither display:none nor visibility:hidden works on small machines -- the page numbers stay frustratingly there, in the wrong places (meaning in the body text), until you get out your regex editor and physically excise all the page numbers. Out, out, damned page number!

Again, consider an XML spec instead of an "HTML" one, but then consider *in practice* what that would require.
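
For what it's worth, the excision itself is nearly a one-liner. A rough sketch, using the a.pgnum form quoted above (adjust the pattern to whatever markup is actually in the file):

    # Sketch: strip inline page-number anchors before building for devices
    # that ignore display:none. The "pgnum" class is from the example above;
    # adjust the pattern to match whatever markup is actually used.
    import re

    def strip_page_numbers(html: str) -> str:
        return re.sub(r'<a\s+class="pgnum"[^>]*>.*?</a>', "", html,
                      flags=re.DOTALL | re.IGNORECASE)

    print(strip_page_numbers('before <a class="pgnum" id="p153">153</a> after'))
    # -> before  after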

too bad all the p.g. e-texts are rewrapped... that's gonna make re-proofing very difficult.
I don't understand why it would be very difficult. I told this forum some years ago that I have written software that will take rewrapped e-texts and unwrap them back to match the original or reconstructed OCR, even if that OCR is very messy.

Find an example of the reconstruction of the original line breaks from an extremely scanno-ridden txt file at: http://freekindlebooks.org/Dev/Huck.txt

The text part is PG 76, and the line breaks are taken from the extremely-scanno file at IA, adventureshuckle00twaiiala. The result looks like a mess only because adventureshuckle00twaiiala is a real mess of an OCR (like a lot at IA), but the line breaks are pretty much at the correct positions if you check it out (it looks messy mainly because the IA OCR includes a tremendous amount of vacuous vertical whitespace). Also, the line lengths jump around -- because the line lengths DO jump around in the original text, as the original text wraps text around "floating" images.

If I were going to do this "for real" I would probably make the effort to re-OCR the IA posting, since it's usually possible to get a much cleaner OCR than what IA posts. This linebreak reconstruction took me literally about 10 minutes, complicated only by the fact that the start of 76.txt is *extremely* unfaithful to the original text. (And in general 76 is pretty unfaithful to the original text.)

Granted, life would be simpler if PG were to request that submitters retain the original line-break locations in the txt and html submissions, rather than asking people to rewrap the text files at 70 chars and to run the html through tidy. But linebreak reconstruction isn't *that* hard.

Not sure why you think you want to do this in the first place, though? I had imagined linebreak reconstruction for the case where DP wants to take an old crusty PG book and run it all the way back through their system again -- perhaps skipping a round or two. But why would any of you want to do this?
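
For the curious, here is a very rough sketch of one naive way such a reconstruction could work. This is not the software described above; a real tool would need fuzzy word alignment to survive scannos, insertions, and deletions:

    # Naive sketch of linebreak reconstruction: re-break the clean
    # (rewrapped) PG text so each output line carries as many words as the
    # corresponding line of the messy OCR. A real implementation would
    # align the two word streams fuzzily instead of assuming they match
    # one-for-one.
    def reflow_to_ocr_lines(clean_text: str, ocr_text: str) -> str:
        clean_words = clean_text.split()
        out, i = [], 0
        for ocr_line in ocr_text.splitlines():
            n = len(ocr_line.split())
            out.append(" ".join(clean_words[i:i + n]))
            i += n
        out.append(" ".join(clean_words[i:]))   # any leftover words
        return "\n".join(out)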

Which brings up another point. Perhaps we should reinsert page breaks as part of the exercise.

On 1/30/2012 9:46 PM, don kretz wrote:
Which brings up another point. Perhaps we should reinsert page breaks as part of the exercise.
That would be among the very first things I would do. My strategy is to use <a class="page" id="pg0nnn" title="nnn" /> to indicate a new page. I do this for four reasons:

1. it allows me to extract a page at a time from an HTML file for display/editing;
2. it allows the HTML to be viewed without page breaks when there is no break in the narrative;
3. it provides an anchor for any indexes or concordances; and
4. it can be used to build a <pageList> in .ncx files when building ePubs.
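
A rough sketch of how point 1 might work with those anchors (this is an illustration, not the actual tool):

    # Sketch of point 1: pull a single "page" out of an HTML body by
    # splitting on the <a class="page" id="pg0nnn" ...> anchors described
    # above. Illustration only, not the actual tool.
    import re

    PAGE_ANCHOR = re.compile(r'<a\s+class="page"\s+id="pg0*(\d+)"[^>]*/?>')

    def extract_page(html_body: str, page: int) -> str:
        pieces = PAGE_ANCHOR.split(html_body)
        # pieces = [before-first-anchor, num1, body1, num2, body2, ...]
        for num, body in zip(pieces[1::2], pieces[2::2]):
            if int(num) == page:
                return body
        raise ValueError(f"no page {page} found")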

On Tue, January 31, 2012 8:06 am, Jim Adcock wrote:
<a class="page" id="pg0nnn" title="nnn" >
A lot of tools will automatically throw away unref'ed anchors,
If so (IIRC this is the case with the CKEditor), that is the fault of the tool, not of the markup language. The tool /I/ have written not only allows this construct, but relies on it to extract "pages" from an HTML file.
and I believe "legal placement locations" of anchors becomes much more problematic in HTML5.
First of all, it is important to note that while HTML5 is on the horizon, it is not yet even an official specification. Second, while I have seen much discussion tangential to this issue, I have seen nothing to definitively indicate that empty anchors which have 'class' and 'id' attributes have been deprecated or abolished in HTML5 (although I seem to recall that it /does/ specifically require end tags for <a> even when empty).

One web site recommends not using <a> as a target, instead adding an 'id' attribute to the next nearest element. This of course makes it impossible to specify a precise point in "phrasing content." Others have suggested that to achieve this precise targeting an empty <span> element could be inserted having the same 'id', 'class', and 'title' attributes as would have otherwise been attached to the <a> tag. In my opinion, this "cure" is worse than the disease.

In any case, my goal is to develop an HTML "master" format that could then be used to derive other formats. Even if <a> becomes deprecated as an "anchor" in HTML5, it will still serve a useful purpose in the "master" format, and HTML5-compliant User Agents will be compelled to at worst ignore it, which would be completely acceptable at that point in its life-cycle.
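
For completeness, a rough, untested sketch of that <span> substitution when deriving an HTML5 flavor from such a master file (it only handles the empty-anchor form shown earlier in the thread):

    # Sketch: when deriving HTML5 output, rewrite the empty page anchors as
    # empty spans carrying the same attributes. Untested illustration of the
    # workaround discussed above; handles only the empty-anchor form.
    import re

    ANCHOR = re.compile(r'<a\s+(class="page"[^>]*?)\s*/?>(?:</a>)?')

    def anchors_to_spans(html: str) -> str:
        return ANCHOR.sub(r'<span \1></span>', html)

    print(anchors_to_spans('<a class="page" id="pg0153" title="153" />'))
    # -> <span class="page" id="pg0153" title="153"></span>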

Lee> In any case, my goal is to develop an HTML "master" format that could then be used to derive other formats.

If you are specifying an "HTML" "master" format which for a variety of reasons "doesn't work" in practice, then it begs the question: why not just specify in XML and make that specification do what you want in the first place -- and make a specification which is easier for PG tools to process? For example, XML using the 100 tags/rules DP already has in use. In that case you live and die by having PG around and continuing to provide tool support for that particular XML, but that is the same basic problem as your "HTML" "master" format, since you are claiming that you intend to derive other formats from it in any case. ???
participants (6)

- Al Haines
- Bowerbird@aol.com
- don kretz
- James Adcock
- Jim Adcock
- Lee Passey