
I have converted the PG catalog into a MySQL database. There are some interesting anomalies. For a first example, the following ebooks have no title:

+------+
| id   |
+------+
| 1070 |
| 1071 |
| 1072 |
| 8817 |
| 9116 |
| 9117 |
| 9118 |
| 9119 |
| 9120 |
| 9121 |
| 9122 |
| 9123 |
| 9124 |
| 9125 |
| 9126 |
| 9127 |
| 9128 |
| 9129 |
| 9130 |
| 9131 |
| 9132 |
| 9133 |
| 9134 |
| 9135 |
| 9136 |
| 9137 |
| 9138 |
| 9139 |
| 9140 |
| 9141 |
| 9142 |
| 9144 |
+------+

This should allow us to answer some of the statistical-type questions about PG etexts.

Don
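A list of untitled ebooks like this could come from an anti-join between the etexts and their titles. The following is only a sketch, not the query that was actually run (the thread does not show it). It assumes a hypothetical `etext` table and a separate `title` table keyed by `etext_id`, and it uses Python's built-in sqlite3 module in place of MySQL so that it runs anywhere. The ids and the title text are placeholders.

```python
import sqlite3

# Hypothetical two-table layout; the real schema is not shown in the thread.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE etext (id INTEGER PRIMARY KEY);
    CREATE TABLE title (etext_id INTEGER REFERENCES etext(id), title TEXT);
""")
conn.executemany("INSERT INTO etext (id) VALUES (?)", [(1,), (2,), (3,)])
conn.execute("INSERT INTO title VALUES (1, 'placeholder title')")


def untitled_ids(conn):
    """Return the ids of etexts that have no row at all in the title table."""
    rows = conn.execute("""
        SELECT e.id
        FROM etext AS e
        LEFT JOIN title AS t ON t.etext_id = e.id
        WHERE t.etext_id IS NULL
        ORDER BY e.id
    """)
    return [row[0] for row in rows]


print(untitled_ids(conn))
```

The same `LEFT JOIN ... IS NULL` pattern should carry over to MySQL unchanged, provided the table and column names are adjusted to the real ones.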

1070-1072 - refer to gutindex.1997 - it notes that 1070-1073 were combined to form 1069.
8817 - catalog corrected (audio book of Jack London's "The Game")
9144 - number noted in gutindex.2003 as being unavailable.
9116-9142 - all were copyrighted audio books, that somehow/somewhen went missing. I mentioned this to Greg in August/2012. At the time, I looked for them on a number of mirror sites, but none had them.

Al

-----Original Message-----
From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of don kretz
Sent: Monday, October 08, 2012 10:30 AM
To: Project Gutenberg Volunteer Discussion
Subject: [gutvol-d] PG catalog

From decomposing the PG catalog, here is what appears to be the structure of the data.
The basic unit, of course, is the etext, which is identified by the ID number. The etext has the following associated fields, provided as shown.

 1. Download count - exactly one value.
 2. Date created - exactly one value.
 3. An indicator if it's an audio file.
 4. Zero or more titles.
 5. Zero or more "creators" (authors).
 6. Zero or more contributors.
 7. Zero or more descriptions.
 8. One or more languages.
 9. Zero or more subjects (about 1/4 of the etexts have no subject).
10. Zero or more "alternatives" - probably alternate titles.
11. Zero or one "friendly title".

Plus an indicator for each output file format provided.
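The structure above maps fairly naturally onto one parent table plus one child table per repeatable field. The sketch below shows one way that could look. All table and column names are assumptions made for illustration, and SQLite (through Python's sqlite3 module) stands in for the MySQL database actually used.

```python
import sqlite3

# One child table per repeatable field (hypothetical names).
MULTI_VALUED = ["title", "creator", "contributor", "description",
                "language", "subject", "alternative", "format"]


def create_schema(conn):
    """Create the parent etext table and one child table per repeatable field."""
    conn.execute("""
        CREATE TABLE etext (
            id             INTEGER PRIMARY KEY,         -- the etext ID number
            download_count INTEGER NOT NULL,            -- exactly one value
            date_created   TEXT    NOT NULL,            -- exactly one value
            is_audio       INTEGER NOT NULL DEFAULT 0,  -- audio indicator
            friendly_title TEXT                         -- zero or one
        )""")
    for name in MULTI_VALUED:
        conn.execute(f"""
            CREATE TABLE {name} (
                etext_id INTEGER NOT NULL REFERENCES etext(id),
                value    TEXT    NOT NULL
            )""")


conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute("INSERT INTO etext VALUES (1, 0, '2012-10-08', 0, NULL)")
conn.executemany("INSERT INTO language VALUES (?, ?)", [(1, "en"), (1, "fr")])

# "One or more languages" is not enforced by these tables, so check it with a query.
missing_language = conn.execute("""
    SELECT e.id FROM etext AS e
    WHERE NOT EXISTS (SELECT 1 FROM language AS l WHERE l.etext_id = e.id)
""").fetchall()
print(missing_language)
```

In this layout the "zero or more" fields need no constraint at all, while "one or more" has to be checked after loading, which is why the last query is there.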

Here is a count of the titles listed more than 10 times. There are obviously some anomalous titles.

| L                                                                                          |       14 |
| ée                                                                                         |       14 |
| ?????????                                                                                  |       14 |
| Mosses from an Old Manse                                                                   |       14 |
| What Will He Do with It?                                                                   |       13 |
| My Novel                                                                                   |       13 |
| ???????                                                                                    |       13 |
| The Poetical Works of Oliver Wendell Holmes                                                |       13 |
| The Last of the Barons                                                                     |       13 |
| A Thorny Path                                                                              |       13 |
| ës                                                                                         |       13 |
| History of the Decline and Fall of the Roman Empire                                        |       13 |
| The Bride of the Nile                                                                      |       13 |
| The Confessions of J. J. Rousseau                                                          |       13 |
| Original Short Stories                                                                     |       13 |
| A General History and Collection of Voyages and Travels                                    |       13 |
| Recollections of the Private Life of Napoleon                                              |       13 |
| Harold : the Last of the Saxon Kings                                                       |       13 |
| The Parisians                                                                              |       13 |
| ählungen                                                                                   |       12 |
| Noites de insomnia, offerecidas a quem n                                                   |       12 |
| The Bay State Monthly                                                                      |       12 |
| A Medium of Inter-communication for Literary Men, Artists, Antiquaries, Genealogists, etc. |       12 |
| The Wandering Jew                                                                          |       12 |
| K                                                                                          |       12 |
| Alice, or the Mysteries                                                                    |       12 |
| The Gutenberg Webster's Unabridged Dictionary                                              |       12 |
| ????????                                                                                   |       12 |
| Cleopatra                                                                                  |       12 |
| Journal des voyages et des voyageurs; 2. sem. 1860                                         |       12 |
| Uarda : a Romance of Ancient Egypt                                                         |       11 |
| Po                                                                                         |       11 |
| Op                                                                                         |       11 |
| — Volume 7                                                                                 |       11 |
| Slave Narratives: a Folk History of Slavery in the United States                           |       11 |
| From Interviews with Former Slaves                                                         |       11 |
| The Emperor                                                                                |       11 |
| Barbara Blomberg                                                                           |       11 |
| The Dor                                                                                    |       11 |
| F                                                                                          |       11 |
| The Principal Navigations, Voyages, Traffiques and Discoveries of the English Nation       |       11 |
| S                                                                                          |       11 |
| ä                                                                                          |       11 |
| Histoire de la R                                                                           |       11 |
| An Egyptian Princess                                                                       |       11 |
| De Aarde en haar Volken, 1907                                                              |       11 |
+--------------------------------------------------------------------------------------------+----------+
111 rows in set

I think those are mostly anomalies. Many of these are simply multi-volume sets (such as Decline and Fall, and the Dictionary).

Assuming it's not much harder, maybe you could list the eBook #s of the duplicates. That will let us find the *real* duplicates.

And, in some cases, we'll see why they are ambiguous duplicates. For example, "Alice's Adventures in Wonderland" has title duplicates, but none of them are actually the same.

-- Greg

On Mon, Oct 08, 2012 at 09:55:56PM -0700, don kretz wrote:
Here is a count of the titles listed more than 10 times. There are obviously some anomalous titles.
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d
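Listing the eBook numbers next to each repeated title, as suggested above, could look like the following sketch. It is not the query used for the counts in this thread. The `title` table, its columns, and the sample rows are all hypothetical, and SQLite (via Python's sqlite3 module) is used so the example is self-contained. `GROUP_CONCAT` exists in both SQLite and MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE title (etext_id INTEGER, title TEXT)")
conn.executemany("INSERT INTO title VALUES (?, ?)", [
    (101, "Example Set"), (102, "Example Set"), (103, "Example Set"),
    (201, "Standalone Work"),
])


def repeated_titles(conn, min_count):
    """Titles attached to more than min_count etexts, with the ids involved."""
    rows = conn.execute("""
        SELECT title, COUNT(1) AS n, GROUP_CONCAT(etext_id) AS ids
        FROM title
        GROUP BY title
        HAVING COUNT(1) > ?
        ORDER BY n DESC, title
    """, (min_count,))
    # GROUP_CONCAT does not promise an order, so sort the ids in Python.
    return [(title, n, sorted(int(i) for i in ids.split(",")))
            for title, n, ids in rows]


print(repeated_titles(conn, 2))
```

With the ids in hand, a multi-volume set shows up as a run of neighbouring numbers, while unrelated books that merely share a title tend to be scattered.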

"The Last of the Barons" and "Harold: the Last of the Saxon Kings" are both 12-volume sets.

"Mosses from an Old Manse" is a portion of the title of 15 different ebooks, e.g. "The Christmas Banquet (From "Mosses from an Old Manse")"

"A Thorny Path" is a 13-volume set, as is "The Parisians".

Two of the others are serial publications, e.g. "Bay State Monthly", and "Notes and Queries" (subtitled "A Medium of Inter-communication for Literary Men, Artists, Antiquaries, Genealogists, etc.")

etc., etc....

Al
-----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Greg Newby Sent: Monday, October 08, 2012 10:03 PM To: Project Gutenberg Volunteer Discussion Subject: Re: [gutvol-d] PG catalog

Greg, I'll send you the list off-line.

I don't post this to be critical - any data set collected over the history behind PG is going to have issues.

This is implemented on your server, by the way, so you can get at it directly too.

Don

On Mon, Oct 8, 2012 at 10:03 PM, Greg Newby <gbnewby@pglaf.org> wrote:

"don" == don kretz <dakretz@gmail.com> writes:
don> Here is a count of the titles listed more than 10 times. There are
don> obviously some anomalous titles.

It seems to me that most anomalies come from bad handling of non-ASCII characters. For example, "Histoire de la R" is clearly truncated at an eacute. And L is probably followed by an apostrophe.

Carlo
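The fragments in the listings are consistent with values being cut immediately before each non-ASCII character, with each piece then stored as its own row. That mechanism is a guess, not something confirmed in this thread. The sketch below reproduces the pattern on two creator names that can be pieced together from the creator counts, and adds a rough, hypothetical heuristic for flagging suspicious values. It needs Python 3.7 or later, where `re.split` accepts patterns that match the empty string.

```python
import re


def split_before_non_ascii(text):
    """Cut text immediately before every non-ASCII character."""
    return [part for part in re.split(r"(?=[^\x00-\x7f])", text) if part]


def looks_like_fragment(value):
    """Rough heuristic: very short, or starting with a non-ASCII lowercase letter."""
    if len(value) <= 2:
        return True
    first = value[0]
    return ord(first) > 127 and first.islower()


for name in ["Balzac, Honoré de, 1799-1850", "Sue, Eugène, 1804-1857"]:
    print(split_before_non_ascii(name))

for value in ["L", "é de, 1799-1850", "Cleopatra"]:
    print(value, looks_like_fragment(value))
```

A heuristic like this will also flag legitimate short titles, so it is only useful for producing a list for a person to review.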

Here is a count of the titles listed more than 10 times. There are obviously some anomalous titles.
Well, out of curiosity checking out "History of the Decline and Fall of the Roman Empire" I find that these are multiple volumes in the set where your query is not capturing that "multiple volumes" distinction for some reason. Times two because apparently two somewhat different editions have been submitted.
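One way to tell volumes apart from true duplicates is to compare all title rows attached to an etext, not one title row at a time. The sketch below assumes the volume designation is stored as a separate title row on the same etext, and that a hypothetical `seq` column gives the order of the rows. Neither is shown in the thread, and the sample data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE title (etext_id INTEGER, seq INTEGER, title TEXT)")
conn.executemany("INSERT INTO title VALUES (?, ?, ?)", [
    (1, 1, "Example History"), (1, 2, "Volume 1"),
    (2, 1, "Example History"), (2, 2, "Volume 2"),
    (3, 1, "Example History"), (3, 2, "Volume 1"),  # second edition of volume 1
])


def full_titles(conn):
    """Map each etext id to all of its title rows joined in seq order."""
    parts = {}
    query = "SELECT etext_id, title FROM title ORDER BY etext_id, seq"
    for etext_id, title in conn.execute(query):
        parts.setdefault(etext_id, []).append(title)
    return {etext_id: " : ".join(p) for etext_id, p in parts.items()}


def duplicate_groups(conn):
    """Group etext ids by full title; keep titles shared by two or more etexts."""
    groups = {}
    for etext_id, full in full_titles(conn).items():
        groups.setdefault(full, []).append(etext_id)
    return {full: ids for full, ids in groups.items() if len(ids) > 1}


print(duplicate_groups(conn))
```

Here the three etexts share the first title row, but only the two copies of volume 1 come out as a candidate duplicate, which matches the "two editions" situation described above.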

It's probably not catching it because it's a two-part title, and each part is in a different title on the same project.

I'm not suggesting it's an error, it's just a result showing how many titles are associated with how many projects. If you break the title into two parts, you probably will end up with results that need a person involved to make the final distinctions; which is fine in many circumstances.

It makes querying problematic, but I find querying on the PG site problematic, and this design may be part of the reason.

If you want to find out how many actual duplicates of the same book there are, it's hard to figure out how to ask the question.

On Tue, Oct 9, 2012 at 6:41 AM, James Adcock <jimad@msn.com> wrote:

The "creators" with more than 50 etexts:

+-----------------------------------------------------+----------+
| creator                                             | count(1) |
+-----------------------------------------------------+----------+
| Various                                             |     2493 |
| Anonymous                                           |      633 |
| Shakespeare, William, 1564-1616                     |      291 |
| Lytton, Edward Bulwer Lytton, Baron, 1803-1873      |      220 |
| Twain, Mark, 1835-1911                              |      212 |
| Ebers, Georg, 1837-1898                             |      166 |
| Dickens, Charles, 1812-1870                         |      158 |
| Unknown                                             |      143 |
| Verne, Jules, 1828-1905                             |      140 |
| Parker, Gilbert, 1862-1932                          |      133 |
| Kingston, William Henry Giles, 1814-1880            |      133 |
| Balzac, Honor                                       |      128 |
| é de, 1799-1850                                     |      128 |
| Fenn, George Manville, 1831-1909                    |      127 |
| Doyle, Arthur Conan, Sir, 1859-1930                 |      116 |
| Jacobs, W. W. (William Wymark), 1863-1943           |      112 |
| Meredith, George, 1828-1909                         |      110 |
| Motley, John Lothrop, 1814-1877                     |      103 |
| Howells, William Dean, 1837-1920                    |      102 |
| Ballantyne, R. M. (Robert Michael), 1825-1894       |      100 |
| Stevenson, Robert Louis, 1850-1894                  |       96 |
| Hawthorne, Nathaniel, 1804-1864                     |       94 |
| Dumas, Alexandre, 1802-1870                         |       92 |
| Henty, G. A. (George Alfred), 1832-1902             |       92 |
| Pepys, Samuel, 1633-1703                            |       87 |
| James, Henry, 1843-1916                             |       79 |
| Wells, H. G. (Herbert George), 1866-1946            |       74 |
| Human Genome Project                                |       74 |
| Trollope, Anthony, 1815-1882                        |       74 |
| Lang, Andrew, 1844-1912                             |       71 |
| Sand, George, 1804-1876                             |       70 |
| Conrad, Joseph, 1857-1924                           |       70 |
| MacDonald, George, 1824-1905                        |       70 |
| Baum, L. Frank (Lyman Frank), 1856-1919             |       68 |
| Goethe, Johann Wolfgang von, 1749-1832              |       68 |
| Scott, Walter, Sir, 1771-1832                       |       68 |
| M                                                   |       67 |
| Library of Congress. Copyright Office               |       67 |
| ène, 1804-1857                                      |       66 |
| Zola,                                               |       66 |
| Émile, 1840-1902                                    |       66 |
| Hope, Laura Lee                                     |       66 |
| Sue, Eug                                            |       66 |
| Churchill, Winston, 1871-1947                       |       64 |
| London, Jack, 1876-1916                             |       64 |
| J                                                   |       63 |
| Defoe, Daniel, 1661?-1731                           |       63 |
| Alger, Horatio, 1832-1899                           |       63 |
| Stratemeyer, Edward, 1862-1930                      |       63 |
| Yonge, Charlotte M. (Charlotte Mary), 1823-1901     |       62 |
| Harte, Bret, 1836-1902                              |       62 |
| Haggard, H. Rider (Henry Rider), 1856-1925          |       60 |
| Thackeray, William Makepeace, 1811-1863             |       60 |
| Plato, 427? BC-347? BC                              |       59 |
| Cervantes Saavedra, Miguel de, 1547-1616            |       58 |
| Kipling, Rudyard, 1865-1936                         |       57 |
| Burroughs, Edgar Rice, 1875-1950                    |       57 |
| Dante Alighieri, 1265-1321                          |       57 |
| Reid, Mayne, 1818-1883                              |       56 |
| Hardy, Thomas, 1840-1928                            |       55 |
| Oppenheim, E. Phillips (Edward Phillips), 1866-1946 |       55 |
| Lever, Charles James, 1806-1872                     |       55 |
| Warner, Charles Dudley, 1829-1900                   |       54 |
| Wharton, Edith, 1862-1937                           |       53 |
| Galsworthy, John, 1867-1933                         |       52 |
| Wilde, Oscar, 1854-1900                             |       52 |
| Abbott, Jacob, 1803-1879                            |       51 |
| Huxley, Thomas Henry, 1825-1895                     |       51 |
+-----------------------------------------------------+----------+
68 rows in set (0.75 sec)
participants (5)
- Al Haines
- don kretz
- Greg Newby
- James Adcock
- traverso@posso.dm.unipi.it