this thread explains how to quickly and easily digitize a book.
i did the book "books and culture", by mabie, which you can
find by searching archive.org for "booksculture00mabiuoft".
and i will share my secrets about how to do the job _fast_...
***
i've done this lotsa times in the past, but the difference now is
that i pledge to do most of the changes with just a text-editor,
whereas usually i custom-program tools to help myself out...
this time, if i do need a custom tool, i pledge that i will code it
and furnish you the code so that you can run the tool yourself.
don't fret if you're not a programmer, or not a very good one.
i'm not a very good one either, but i can get the job done, and
remember -- i'll be coding the tool, and handing you the code.
moreover, to force myself to keep things as simple as possible,
i will code any tools in _python_, which i've never used before...
so since i'm learning it from scratch, i can teach it from scratch,
and answer any just-a-beginner questions that you might have.
***
so, to get our toes wet, i've appended our first program...
it grabs the mabie text that we're using from archive.org and
lists it inside your browser, with handy-dandy line-numbers...
you can run this program here:
> http://zenmarkuplanguage.com/grapes101.py
you can also get the source-code at a similar address:
> http://zenmarkuplanguage.com/grapes101.txt
(i'll follow this naming convention throughout this thread.)
as you can see when you run this program, the o.c.r.
had quite a severe problem with floating semicolons.
there are _444_ of them. but who's counting? ;+)
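
if you want to check a count like that yourself, here is a minimal sketch. it is not part of grapes101.py. it assumes that a "floating" semicolon means one with a space or tab right in front of it, which is my guess at the definition, and it is written for python 3. the names floating_semicolon, count_floating_semicolons and find_floating_semicolons are made up for this illustration.

```python
import re

# assumption: "floating" = a semicolon preceded by at least one space or tab
floating_semicolon = re.compile(r"[ \t]+;")

def count_floating_semicolons(text):
    """return how many semicolons in text have a space or tab before them."""
    return len(floating_semicolon.findall(text))

def find_floating_semicolons(text):
    """return (line_number, line) pairs for lines holding a floating semicolon.

    line numbers start at 1, to match a numbered listing."""
    hits = []
    for number, line in enumerate(text.split("\n"), 1):
        if floating_semicolon.search(line):
            hits.append((number, line))
    return hits

sample = "he read much ; he thought more ;\nand he wrote little; the end"
print(count_floating_semicolons(sample))
for number, line in find_floating_semicolons(sample):
    print(number, line)
```

in the sample, the first line has two floating semicolons, and the second line has an ordinary one that is left alone. your own count on the real text will depend on exactly how you define "floating".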
-bowerbird
#!/usr/bin/python
import urllib
import re
f = urllib.urlopen("http://ia700300.us.archive.org/1/items/booksculture00mabiuoft/booksculture00mabiuoft_djvu.txt")
s = f.read()
f.close()
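# normalize line-endings: windows (\r\n) and old-mac (\r) both become \n
s = re.sub("\r\n","\n",s)
s = re.sub("\r","\n",s)
# strip trailing spaces; the "+" catches any number of them in one pass
s = re.sub(" +\n","\n",s)
# pad with a newline at each end, so pg[0] is empty and line 1 is pg[1]
s="\n"+s
s=s+"\n"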
print "Content-type: text/html\n\n"
print "<html><head><title>"
print "grapes101.py"
print "</title></head><body><pre>"
pg = s.split("\n")
maxpg=len(pg)
startat=1 # will be 236
endat=maxpg # will be 7764
# print each line with a 4-digit, zero-padded line-number in front
for i in range(startat,endat):
    if i < 10:
        print "000"+str(i)+" "+pg[i]
    elif i < 100:
        print "00"+str(i)+" "+pg[i]
    elif i < 1000:
        print "0"+str(i)+" "+pg[i]
    else:
        print str(i)+" "+pg[i]
print "</pre><hr></body></html>"
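
for anyone on python 3, where the print statement and urllib.urlopen above no longer work, here is a minimal sketch of the same two ideas: cleaning up the line-endings and putting a zero-padded number in front of each line. it is an illustration, not a port of the whole program. it skips the download and the html wrapper, the function names normalize_lines and number_lines are made up here, and it strips any number of trailing spaces per line.

```python
import re

def normalize_lines(text):
    """turn \\r\\n and bare \\r into \\n, then drop spaces before each newline."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return re.sub(r" +\n", "\n", text)

def number_lines(text, start=1, width=4):
    """return the lines of text, each prefixed by a zero-padded line-number."""
    lines = text.split("\n")
    return [f"{i:0{width}d} {line}" for i, line in enumerate(lines, start)]

raw = "books and culture \r\nby mabie\rchapter one"
for numbered in number_lines(normalize_lines(raw)):
    print(numbered)
```

the format spec does the padding that the if/elif chain does by hand, and it keeps working if the width ever needs to change.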
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d