these grapes are sweet -- lesson #06

this thread explains how to quickly and easily digitize a book.

i did the book "books and culture", by mabie, which you can find by searching archive.org for "booksculture00mabiuoft".

and i will share my secrets about how to do the job _fast_...

***

i've done this lotsa times in the past, but the difference now is that i pledge to do most of the changes with just a text-editor, whereas usually i custom-program tools to help myself out...

this time, if i do need a custom tool, i pledge that i will code it and furnish you the code so that you can run the tool yourself.

don't fret if you're not a programmer, or not a very good one.

i'm not a very good one either, but i can get the job done, and remember -- i'll be coding the tool, and handing you the code.

moreover, to force myself to keep things as simple as possible, i will code any tools in _python_, which i've never used before...

so since i'm learning it from scratch, i can teach it from scratch, and answer any just-a-beginner questions that you might have.

***

so, to get our toes wet, i've appended our first program...

it grabs the mabie text that we're using from archive.org and lists it inside your browser, with handy-dandy line-numbers...

you can run this program here:
you can also get the source-code at a similar address:
(i'll follow this naming convention throughout this thread.)

as you can see when you run this program, the o.c.r. had quite a severe problem with floating semicolons. there are _444_ of them. but who's counting? ;+)

-bowerbird

#!/usr/bin/python
import urllib
import re

f = urllib.urlopen("http://ia700300.us.archive.org/1/items/booksculture00mabiuoft/booksculture00mabiuoft_djvu.txt")
s = f.read()
f.close()

s = re.sub("\r\n","\n",s)
s = re.sub("\r\n","\n",s)
s = re.sub("\r","\n",s)
s = re.sub("\r","\n",s)
s = re.sub(" \n","\n",s)
s = re.sub(" \n","\n",s)
s="\n"+s
s=s+"\n"

print "Content-type: text/html\n\n"
print "<html><head><title>"
print "grapes101.py"
print "</title></head><body><pre>"

pg = s.split("\n")
maxpg=len(pg)
startat=1 # will be 236
endat=maxpg # will be 7764

for i in range(startat,endat):
    if i < 10:
        print "000"+str(i)+" "+pg[i]
    elif i < 100:
        print "00"+str(i)+" "+pg[i]
    elif i < 1000:
        print "0"+str(i)+" "+pg[i]
    elif i > 999:
        print ""+str(i)+" "+pg[i]

print "</pre><hr></pre></body></html>"
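
For readers who want to try the same idea without a web server: the script in this post is Python 2 (print statements, `urllib.urlopen`). Below is a minimal Python 3 sketch of its two core steps, normalizing line endings and prefixing zero-padded line numbers, plus a count of "floating" semicolons. The function names `normalize_text`, `number_lines`, and `count_floating_semicolons` are hypothetical, the sample text is made up, and "floating semicolon" is assumed here to mean a semicolon preceded by a space. The post does not say how the 444 figure was obtained, so this is only one plausible way to count. Fetching the file from archive.org is left out so the sketch runs offline.

```python
import re


def normalize_text(s):
    """Convert CRLF and bare CR line endings to LF, then strip spaces before newlines."""
    s = s.replace("\r\n", "\n").replace("\r", "\n")
    return re.sub(r" +\n", "\n", s)


def number_lines(s, width=4):
    """Return the lines of s, each prefixed with a zero-padded 1-based line number."""
    lines = normalize_text(s).split("\n")
    return [str(i).zfill(width) + " " + line for i, line in enumerate(lines, start=1)]


def count_floating_semicolons(s):
    """Count semicolons preceded by a space (an assumed definition of 'floating')."""
    return len(re.findall(r" ;", s))


# Made-up sample standing in for the o.c.r. text; not taken from the book.
sample = "books and culture ;\r\nby mabie  \rchapter one ; two;"

for row in number_lines(sample):
    print(row)
print(count_floating_semicolons(sample))
```

`str.zfill` replaces the chain of `if`/`elif` tests for padding: numbers shorter than four digits are padded with zeros, and longer ones are left as they are.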

I have never used Python, either. (I use Groovy --- http://groovy.codehaus.org.) If I may ask: What have you been writing your tools in until now?

-- b

Sent from my iPhone

On Sep 9, 2011, at 7:36 PM, Bowerbird@aol.com wrote:
> this thread explains how to quickly and easily digitize a book.
> [...]

_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d
participants (2)
- Benjamin Klein
- Bowerbird@aol.com