roger, your code _could_ scale just fine, if only you'd...

roger said:
Even if I agreed with all the technical points, there is what feels like a showstopper to me that's equally if not more important. Running a cooperative workflow of volunteers involves motivating people. A leader of such an effort can't just tell them what to do, or sometimes even how to do it. They need to gain their good will, evoke their interests, stimulate their creativity, and communicate with them in an open manner.
this is, of course, very astute. and it reminds me of something that juliet said in her recent post, which i will return to shortly... but for now, i want to concentrate on this, instead:
I coded it myself and ran it on a non-commercial, personal server. Neither the quality of the code nor the bandwidth to the hardware were scalable, and it wasn't my intention to run a full production site anyway.
i do believe you are mistaken about "the quality of the code", roger, in terms of whether it could _scale_ to full production. or rather, you're correct, but you fail to see the real problem.
you followed "the conventional wisdom" which tells coders that they should _not_ use the file-system as their database, but should instead use a "proper" database, typically mysql...
but for your purposes, that was a mistake, roger.
trying to funnel all of your users, and all their transactions, through a single "gate" -- a.k.a., your mysql database -- is _what_ complicated your code and caused the scaling issues.
what you should have done instead is to write a program that can be dropped into a folder _with_ the files for a single book, which then _administers_ the transactions involving those files.
since most of the files are small, changes are not a big deal... besides, the fact is that most of the files will never be changed. even ones that _are_ changed will only undergo a few iterations. so there is nothing that's going to overwhelm the file-system...
thus with -- maybe -- _ten_ people on a book simultaneously (at the _most!_, and more typically probably just one to three), and few files being changed, there _are_ no "scaling" issues... load is distributed across separate scripts in different folders.
that was your main mistake, roger, especially vis-a-vis scaling.
(your other big mistake, which increased your complexity load unnecessarily, was your failure to use intelligent file-naming.)
your system can easily be changed to accommodate a site that simultaneously digitized hundreds of books, even a thousand... (of course, it'd be smarter to set the number to a few hundred, and not let a new book come in until a current one is finished, so as to avoid the d.p. situation where thousands sit undone.)
but you carried away the wrong lesson if you think your code "won't scale", roger...
-bowerbird
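The "program dropped into a folder with the files for a single book" idea can be illustrated roughly as follows. This is only a minimal sketch, not anyone's actual implementation: the class name `BookFolder`, the `.lock` and `.tmp` suffixes, and the check-out/check-in vocabulary are all assumptions made for the example. It relies on the file system's atomic exclusive-create to keep two proofers off the same page, so no central database is involved.

```python
import os
from pathlib import Path


class BookFolder:
    """Hypothetical per-book manager: one instance administers one folder of page files."""

    def __init__(self, folder):
        self.folder = Path(folder)

    def _lock_path(self, page):
        return self.folder / (page + ".lock")

    def check_out(self, page, user):
        """Claim a page for editing; return its text, or None if someone else holds it."""
        target = self.folder / page
        if not target.is_file():
            raise FileNotFoundError(page)
        try:
            # O_EXCL makes creation atomic: only one caller can create the lock file.
            fd = os.open(self._lock_path(page), os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return None
        with os.fdopen(fd, "w", encoding="utf-8") as lock:
            lock.write(user)
        return target.read_text(encoding="utf-8")

    def check_in(self, page, user, text):
        """Save the edited page and release the lock; only the lock holder may do this."""
        lock = self._lock_path(page)
        if not lock.exists() or lock.read_text(encoding="utf-8") != user:
            raise PermissionError(f"{user} does not hold {page}")
        tmp = self.folder / (page + ".tmp")
        tmp.write_text(text, encoding="utf-8")
        os.replace(tmp, self.folder / page)  # swap the new text in atomically
        lock.unlink()
```

Because each book gets its own folder and its own instance, contention is limited to the handful of people working on that one book, which is the load-distribution point made above.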

On 18.10.2011 at 22:28, Bowerbird@aol.com wrote:
you followed "the conventional wisdom" which tells coders that they should _not_ use the file-system as their database, but should instead use a "proper" database, typically mysql…
This wisdom is absolutely a must! The emphasis is on "[PROPER]". Read below.
but for your purposes, that was a mistake, roger.
trying to funnel all of your users, and all their transactions, through a single "gate" -- a.k.a., your mysql database -- is _what_ complicated your code and caused the scaling issues.
what you should have done instead is to write a program that can be dropped into a folder _with_ the files for a single book, which then _administers_ the transactions involving those files.
Basically, you point out the main problem of the database: it is not layered and modular. What most do not understand is that, in order to have a "proper" database for any non-trivial task, you must add layers of abstraction to administer all the transactions and functions needed. Without them, a system will eventually bog down and will not scale well. As proof of concept I refer to content management systems, which do exactly what is needed.
regards Keith
since most of the files are small, changes are not a big deal... besides, the fact is that most of the files will never be changed. even ones that _are_ changed will only undergo a few iterations. so there is nothing that's going to overwhelm the file-system...
thus with -- maybe -- _ten_ people on a book simultaneously (at the _most!_, and more typically probably just one to three), and few files being changed, there _are_ no "scaling" issues... load is distributed across separate scripts in different folders.
that was your main mistake, roger, especially vis-a-vis scaling.
[snip, snip]

On Tue, October 18, 2011 2:28 pm, Bowerbird@aol.com wrote:
what you should have done instead is to write a program that can be dropped into a folder _with_ the files for a single book, which then _administers_ the transactions involving those files.
since most of the files are small, changes are not a big deal... besides, the fact is that most of the files will never be changed. even ones that _are_ changed will only undergo a few iterations. so there is nothing that's going to overwhelm the file-system...
I agree with BowerBird here. We programmers have a tendency to over-complicate things, and reinvent wheels when the old ones are working just fine. Sometimes it's a result of trying to use the tools we're most familiar with, trying to learn new tools that we find interesting, or just an expression of the NIH syndrome (Not Invented Here). My rule of thumb is to use the simplest tool that is appropriate to the job. In this case we appear to have a simple repository of texts, with perhaps some tracking of differences. We would also need a simple network interface that allowed CRUD on the files (Create, Read, Update, Delete). No cross-table joins, complex data structures, or fancy reporting. It seems to me that using the file system as the repository with a simple revision control system (RCS, CVS, etc.) managing access is completely adequate to the job. This has the added advantage of making it simpler for any sort of web server to serve the files for read-only access without having to have database access middleware. Now I'm not prepared to say that Mr. Frank was wrong in how he set up his site; heck, I can't even say that I know /how/ he set things up. He may have had requirements that were totally different than what I envision. But if I were setting up a site to simply allow read/write access to files while tracking changes, I'd use CVS and the file system unless other requirements forced me to reconsider.
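As a rough illustration of the "file system as repository, with simple revision tracking" approach, here is a minimal sketch. It is a stand-in written for this example, not RCS or CVS themselves: the class name `TextRepository` and the layout (each text stored as a folder of numbered revision files) are assumptions. It shows the four CRUD operations plus a difference listing, using only the standard library.

```python
import difflib
from pathlib import Path


class TextRepository:
    """Hypothetical CRUD store: every saved version is kept as a numbered revision file."""

    def __init__(self, root):
        self.root = Path(root)

    def _revisions(self, name):
        # Revisions live at <root>/<name>/1, <root>/<name>/2, ...
        folder = self.root / name
        if not folder.is_dir():
            return []
        return sorted(int(p.name) for p in folder.iterdir() if p.name.isdigit())

    def _write(self, name, rev, text):
        (self.root / name / str(rev)).write_text(text, encoding="utf-8")
        return rev

    def create(self, name, text):
        if self._revisions(name):
            raise FileExistsError(name)
        (self.root / name).mkdir(parents=True, exist_ok=True)
        return self._write(name, 1, text)

    def read(self, name, revision=None):
        revs = self._revisions(name)
        if not revs:
            raise FileNotFoundError(name)
        rev = revs[-1] if revision is None else revision
        return (self.root / name / str(rev)).read_text(encoding="utf-8")

    def update(self, name, text):
        revs = self._revisions(name)
        if not revs:
            raise FileNotFoundError(name)
        return self._write(name, revs[-1] + 1, text)

    def delete(self, name):
        for rev in self._revisions(name):
            (self.root / name / str(rev)).unlink()
        (self.root / name).rmdir()

    def diff(self, name, old, new):
        """Return unified-diff lines between two revisions of the same text."""
        a = self.read(name, old).splitlines()
        b = self.read(name, new).splitlines()
        return list(difflib.unified_diff(a, b, f"{name}@{old}", f"{name}@{new}", lineterm=""))
```

Since the latest revision is an ordinary file on disk, a web server could serve it read-only without any database middleware, which is the advantage noted above. A real deployment would more likely delegate the history to an existing revision control tool than hand-roll it like this.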

On 21.10.2011 at 15:52, Lee Passey wrote:
My rule of thumb is to use the simplest tool that is appropriate to the job. In this case we appear to have a simple repository of texts, with perhaps some tracking of differences. We would also need a simple network interface that allowed CRUD on the files (Create, Read, Update, Delete). No cross-table joins, complex data structures, or fancy reporting. It seems to me that using the file system as the repository with a simple revision control system (RCS, CVS, etc.) managing access is completely adequate to the job. This has the added advantage of making it simpler for any sort of web server to serve the files for read-only access without having to have database access middleware.
Lee, I agree completely that one should use the simplest tool. Yet what does "the simplest tool" mean? For me, it means that the user/client of the system only sees and gets what they need. The complexity of the "backbone" is of no interest to them. It just works. A revision control system could work. You also mention tracking differences. Well, if you just look at the individual tasks, this kind of simplistic approach will suffice, but once you try to bring everything together, from the scans through OCR to proofing, and finally produce the ebook, you have a pretty mix of scripts for handling things, which causes problems when scaling or when changes to the system are needed.
Also, revision-control/CRUD systems are not for the casual user. How do users easily access the different versions and branches, or even find them? Sure, you can offer up scripts to make it easy to understand, yet the task in general can be done in a database system, which already knows the structure of things and can automatically offer up the information and files needed. Please understand that the database system has to be designed for administering the task. That means handling not only the work of those who do the proofing and the production of ebooks, but also the work of those who must administer the sites offering up the files and books.
regards Keith.
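The claim that a database "already knows the structure of things and can automatically offer up the information" can be illustrated with a small sketch. Everything here is an assumption made for the example: the stage names, the `pages` table, and the function names are hypothetical, and SQLite is used only because it ships with Python. The point is that one query can answer workflow questions that would otherwise need a separate script per task.

```python
import sqlite3

# Hypothetical workflow stages, from scan to finished ebook.
STAGES = ("scanned", "ocr", "proofed", "published")


def open_tracker(path=":memory:"):
    """Open (or create) the tracking database; defaults to an in-memory one."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " book TEXT, page INTEGER, stage TEXT, PRIMARY KEY (book, page))"
    )
    return db


def set_stage(db, book, page, stage):
    """Record the current stage of one page, replacing any earlier entry."""
    if stage not in STAGES:
        raise ValueError(stage)
    db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)", (book, page, stage))
    db.commit()


def pages_waiting(db, book, stage):
    """List the pages of a book currently sitting at the given stage."""
    rows = db.execute(
        "SELECT page FROM pages WHERE book = ? AND stage = ? ORDER BY page",
        (book, stage),
    )
    return [page for (page,) in rows]


def progress(db, book):
    """Count pages per stage, e.g. for a site administrator's overview."""
    rows = db.execute(
        "SELECT stage, COUNT(*) FROM pages WHERE book = ? GROUP BY stage", (book,)
    )
    return dict(rows)
```

Note that this tracks only status; the page texts themselves could still live in the file system, so the two positions in this thread are not necessarily exclusive.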
participants (3): Bowerbird@aol.com, Keith J. Schultz, Lee Passey