these grapes are sweet -- lesson #06

10 Sep 2011

      this thread explains how to quickly and easily digitize a book.

i did the book "books and culture", by mabie, which you can
find by searching archive.org for "booksculture00mabiuoft".

and i will share my secrets about how to do the job _fast_...

***

i've done this lotsa times in the past, but the difference now is
that i pledge to do most of the changes with just a text-editor,
whereas usually i custom-program tools to help myself out...

this time, if i do need a custom tool, i pledge that i will code it
and furnish you the code so that you can run the tool yourself.

don't fret if you're not a programmer, or not a very good one.

i'm not a very good one either, but i can get the job done, and
remember -- i'll be coding the tool, and handing you the code.

moreover, to force myself to keep things as simple as possible,
i will code any tools in _python_, which i've never used before...

so since i'm learning it from scratch, i can teach it from scratch,
and answer any just-a-beginner questions that you might have.

***

so, to get our toes wet, i've appended our first program...

it grabs the mabie text that we're using from archive.org and
lists it inside your browser, with handy-dandy line-numbers...

you can run this program here:
http://zenmarkuplanguage.com/grapes101.py

you can also get the source-code at a similar address:
http://zenmarkuplanguage.com/grapes101.txt
(i'll follow this naming convention throughout this thread.)

as you can see when you run this program, the o.c.r.
had quite a severe problem with floating semicolons.
there are _444_ of them.   but who's counting?         ;+)
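
if you'd like to count them yourself, here is a minimal sketch.
it assumes that "floating" means a semicolon with a space or tab
right in front of it, as in "word ; word". that definition is a
guess, and the function name is made up for this illustration, so
the number it gives on the real text may not match the one above.

```python
import re

def count_floating_semicolons(text):
    # assumption: a "floating" semicolon is one preceded by a space or tab.
    # hypothetical helper, not part of grapes101.py.
    return len(re.findall(r"[ \t];", text))

# tiny made-up sample: two floating semicolons, one normal one
sample = "books are good ; culture is better; so it goes ;\n"
print(count_floating_semicolons(sample))  # prints 2
```

to try it on the real text, pass in the string that the program
below reads from archive.org.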

-bowerbird

#!/usr/bin/python

import urllib
import re

f = urllib.urlopen("http://ia700300.us.archive.org/1/items/booksculture00mabiuoft/booksculture00mabiuoft_djvu.txt")

s = f.read()
f.close()

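# normalize all line-endings to "\n" and strip trailing spaces.
# the repeated passes catch doubled cases like "\r\r\n" or two trailing spaces.
s = re.sub("\r\n","\n",s)
s = re.sub("\r\n","\n",s)
s = re.sub("\r","\n",s)
s = re.sub(" \n","\n",s)
s = re.sub(" \n","\n",s)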
s="\n"+s
s=s+"\n"

print "Content-type: text/html\n\n"
print "<html><head><title>"
print "grapes101.py"
print "</title></head><body><pre>"

pg = s.split("\n")

maxpg=len(pg)

startat=1 # will be 236
endat=maxpg # will be 7764

for i in range(startat,endat):
     if i < 10:
         print "000"+str(i)+"    "+pg[i]
     elif i < 100:
         print "00"+str(i)+"    "+pg[i]
     elif i < 1000:
         print "0"+str(i)+"    "+pg[i]
     elif i > 999:
         print ""+str(i)+"    "+pg[i]

print "</pre><hr></body></html>"
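
the four-way if/elif in the loop above just pads each line-number
with zeros to 4 digits. as a sketch of a shorter way to get the
same output, python's built-in str.zfill() does the padding in a
single call. the helper name "numbered" is made up for this
illustration and is not part of grapes101.py.

```python
def numbered(lines, startat=1):
    # hypothetical helper: returns each line prefixed by its index,
    # zero-padded to 4 digits, followed by four spaces.
    out = []
    for i in range(startat, len(lines)):
        out.append(str(i).zfill(4) + "    " + lines[i])
    return out

# made-up sample; element 0 is empty, like pg[0] in the program above
pg = ["", "first line", "second line"]
for row in numbered(pg):
    print(row)
```

numbers past 9999 come through unpadded and untruncated, which
matches what the last branch of the original if/elif does.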

Bowerbird＠aol.com
