Created by: gwideman, Mar 1, 2012 4:44 pm
Revised by: gwideman, Mar 27, 2012 10:17 pm (20 revisions)

Overview: Food for thought

This page presents some statistics about Drupal's documentation pages, and is part of my study of this subject. This is a rather preliminary exploration of the characteristics of this documentation set. Food for thought.

Data set version

Date of documentation survey: Full spider of drupal.org/documentation on 2012-03-11.

The basic size of the documentation

DrupalVsBooks40.jpg
The challenge/problem/opportunity. :-)
I used a spider to retrieve all pages in the books (hierarchies of nodes of type 'book') that are found at or "below" the /documentation page, and are within the drupal.org domain (but not including api.d.o and groups.d.o). This does not include pages in the API documentation, but does include non-book pages linked from main book "top level" pages.
As of end of March 2012, this process retrieved slightly over 8100 pages, of which all but about 40 are book pages within 16 books.

Comparison to traditional books

Putting that quantity in context, the largest paper-based book on Drupal 7 (The Definitive Guide to Drupal 7) has approximately 1000 pages (of which about 880 are content), so drupal.org's documentation is on the order of eight to ten times as voluminous. (Clearly paper book pages are not directly comparable to web pages. Web pages may be longer or shorter than paper pages; stats on that below.)
To put it another way, drupal.org's documentation has approximately three times as many pages as all the content in five prominent Drupal 7 books put together. This prompts at least some curiosity about whether drupal.org covers proportionately more territory than the books, or what else might explain this weight difference!
The sheer volume of the documentation suggests that there are special challenges for a system which presents this much material to users, and which is also called on to support authors and other maintainers trying to achieve quality and order.
The statistics which follow may be a start at assessing these issues.

Notes

  • In the following, where a table lists books, I usually present them in the order they are to be found on or from the /documentation page, as I took this to represent decreasing significance to the "documentation universe".
  • Browsing the tree views may be helpful in getting a feel for some of the statistics below.
  • Obscure-looking identifiers in [square brackets] are just a reminder to me about which database query is involved.

Quantity of content, by Book and Depth

Pages, by tiers

The table below shows the number of pages in each book, and then those page counts broken down according to depth in the book's hierarchy. (Also bear in mind that the top page for each book is already at least one click from drupal.org/documentation.)
arpt_Pages_110_PageCounts_ByBookXDepth.PNG
[arpt_Pages_110_PageCounts_ByBookXDepth] with Excel totals
This data is plotted below. The books have been sorted in order of total number of pages, and each plot has been scaled so that its peak number of pages corresponds to 4 minor units, with this value marked on the Y axis. You can see that in general, the more pages a book has, the more the peak is shifted "deeper". This is because, in general, individual pages "fan out" to only a limited number of child pages, so a larger book tends to get deeper, rather than wider at each tier.
arpt_Pages_110_PageCounts_ByBookXDepth_plot.PNG
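As a rough illustration of how a pages-by-depth breakdown like the one above could be computed, here is a minimal Python sketch. It assumes the spider's output can be reduced to a mapping from each page to its parent page; the `parent_of` structure and the page names are hypothetical, not taken from the survey database.

```python
from collections import Counter

def depth_counts(parent_of):
    """Count pages at each depth; a book's top page counts as depth 1.

    parent_of maps page id -> parent page id (None for a book's top page).
    Assumes the hierarchy has no cycles.
    """
    depths = {}
    for page in parent_of:
        # Walk upward until we reach a top page or a page whose depth is known.
        chain = []
        node = page
        while node is not None and node not in depths:
            chain.append(node)
            node = parent_of[node]
        base = depths[node] if node is not None else 0
        # Assign depths on the way back down the chain.
        for offset, member in enumerate(reversed(chain), start=1):
            depths[member] = base + offset
    return Counter(depths.values())

# Hypothetical miniature book: one top page, two tiers of children, one deeper page.
parent_of = {
    "book": None,
    "a": "book", "b": "book",
    "a1": "a", "a2": "a",
    "a1x": "a1",
}
by_depth = depth_counts(parent_of)
```

Running this per book and tabulating the counters side by side would give a table of the same shape as the one shown, under the stated assumptions.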

Quantity of text, by tiers

This table shows the total amount of text (in thousands of characters), per book, and then again broken down according to depth in the book's hierarchy. The overall total is about 17.7 million characters. This survey included text in the div.node-content area of the page, minus the "update" area (p.updated) and navigation list (div.book-navigation).
arpt_Pages_122_ContentK_ByBookXDepth.PNG
[arpt_Pages_122_ContentK_ByBookXDepth] with Excel totals.
arpt_Pages_122_ContentK_ByBookXDepth-chart01.PNG
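The text-selection rule described above (text inside div.node-content, minus p.updated and div.book-navigation) could be sketched with Python's standard html.parser as follows. This is only an illustration of the rule, not the spider actually used for the survey; the class and function names and the sample HTML are made up, and joining text fragments with single spaces means the character counts are approximate.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class ContentText(HTMLParser):
    """Collect text inside div.node-content, skipping p.updated and div.book-navigation."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # one (tag, inside_content, skipped) entry per open element
        self.parts = []   # text fragments that count as page content

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Children inherit the flags of their parent element.
        _, inside, skipped = self.stack[-1] if self.stack else ("", False, False)
        if tag == "div" and "node-content" in classes:
            inside = True
        if (tag == "p" and "updated" in classes) or \
           (tag == "div" and "book-navigation" in classes):
            skipped = True
        self.stack.append((tag, inside, skipped))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack:
            _, inside, skipped = self.stack[-1]
            if inside and not skipped and data.strip():
                self.parts.append(data.strip())

def content_text(html):
    parser = ContentText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Made-up page fragment for illustration.
sample = ('<div class="node-content"><p>Hello world</p>'
          '<p class="updated">Last updated March 2012</p>'
          '<div class="book-navigation"><a href="#">Next page</a></div>'
          '<p>More<br>text</p></div><div class="footer">Footer</div>')
text = content_text(sample)
size = len(text)
```

Summing `len(content_text(page))` over the pages of a book, grouped by depth, would then give figures comparable in kind to the table above.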

Discussion

The distribution of pages and text shows that a preponderance of material is at the 4th, 5th and even 6th level of the page hierarchy. I suspect that material that deep presents a substantial challenge for discoverability (or discoverability of neighboring topics), and calls for sophistication in the navigation facilities.

Typical page characteristics

Text

The following frequency-histogram plots give an idea of the range of sizes of page content text. (Again, this is from the main content of each page, minus the "updated" and "child pages" areas.)
These plots show the number of pages having a content size within each particular size range. The plot on the right shows the same data as on the left, but with finer granularity. In the left plot, the first (very small) bar shows the number of pages in the range of zero to 1000 characters. The next bar (with around 2600 pages) is for 1000-2000 characters. And so on. The plot on the right has a finer breakdown, with 200-character buckets. Here the peak is at around 580 pages in the range of 400-600 characters.
arpt_Pages_901_ContentSizeFreqHist01.PNG
[Pages] -> Excel Freq hist
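For readers who want to reproduce this kind of frequency histogram outside Excel, here is a minimal Python sketch of the bucketing step at two granularities. The page sizes are invented, and the half-open buckets [lower, lower + width) are an assumption; Excel's frequency binning may place boundary values differently.

```python
from collections import Counter

def size_histogram(sizes, bucket_width):
    """Map each bucket's lower bound to the number of pages falling in it."""
    counts = Counter((size // bucket_width) * bucket_width for size in sizes)
    return dict(sorted(counts.items()))

# Made-up page content sizes, in characters.
sizes = [150, 450, 520, 580, 990, 1200, 1800, 2400, 5100]

coarse = size_histogram(sizes, 1000)  # 1000-character buckets
fine = size_histogram(sizes, 200)     # 200-character buckets
```

The same data gives a smoother, less detailed picture with the wider buckets, which is why the two plots are shown side by side.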

Range of sizes

These plots do not cover the entire range of page sizes. There are some outliers beyond the range shown, including a couple of pages over 100k.

Average content size

Average page content size: 2659 characters.
Median (the size of page 4007 of 8115, sorted by size): 1599 characters.
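The gap between the average and the middle-of-the-list figure is what one would expect when a few very large pages pull the average upward. A small Python sketch with invented sizes shows the effect; the numbers are hypothetical and only the technique (mean versus middle value of the sorted list) is the point.

```python
from statistics import mean, median

# Made-up page sizes with one large outlier, in characters.
sizes = [300, 500, 900, 1600, 2200, 4000, 9100]

average_size = mean(sizes)                   # pulled upward by the 9100-character page
middle_size = median(sizes)                  # size of the middle page when sorted
same_thing = sorted(sizes)[len(sizes) // 2]  # explicit "middle page" lookup (odd-length list)
```

Here the average lands well above the middle value even though most pages are small, which mirrors the relationship between the two figures reported above.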

Comparison to printed books

I did a very cursory examination of printed books, and came up with the following results:
arpt_books_chars-per-page.PNG
Counting the characters (including spaces) in a typical line of body text, and the number of such lines which could fit on a page, results in the "Max chars/page" figure. However, no pages are solid body text. All pages have liberal doses of other elements, such as headings, program listings, images, additional whitespace (as on first and last pages of chapters) and so on. I guesstimated that this might introduce a factor of, say, 67% to get to the actual average number of chars/page, though it might be as low as 50%. At any rate, the figures in the "67%" column are not radically different from the drupal.org average doc length of 2.7k.
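The estimate described above is simple arithmetic, sketched here in Python. The 67% and 50% factors come from the text; the characters-per-line and lines-per-page values are hypothetical stand-ins, since the measured values live in the table image.

```python
def estimated_chars_per_page(chars_per_line, lines_per_page, fill_factor):
    """Return (max chars/page, estimated actual chars/page)."""
    max_chars = chars_per_line * lines_per_page
    return max_chars, max_chars * fill_factor

# Hypothetical book layout: 85 characters per line, 45 lines per page.
max_chars, typical = estimated_chars_per_page(85, 45, 0.67)
_, low_end = estimated_chars_per_page(85, 45, 0.50)
```

With these made-up inputs the maximum is 3825 characters per page, and the 67% and 50% estimates bracket a range of roughly 1900 to 2600 characters, the same order of magnitude as the drupal.org average.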

Navigation and coarse structure

Children per page (links "down")

What sort of variation is there in the "width" of hierarchy?
First, of the approx 8120 pages, there are 6541 with no children. For the remaining 1578, the following frequency histograms show the distribution of number-of-children per page. Again the plot to the right is a finer-grained view of the same data as on the left.
arpt_Links_100_Summary01-freq-hist.PNG
[arpt_Links_100_Summary01]-> Excel Freq hist

Range of child counts

Largest number of children is 108.

Average number of children

Of pages having at least one child, the average number of children is: 5.2
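A minimal Python sketch of how the childless-page count, the average over pages with children, and the maximum could be derived from parent links. The `parent_of` mapping and page names are hypothetical, and the sketch assumes every parent also appears as a page in the mapping.

```python
from collections import Counter

def child_stats(parent_of):
    """Return (leaf pages, pages with children, average children, max children).

    parent_of maps page id -> parent page id (None for a book's top page).
    The average is taken only over pages that have at least one child.
    """
    children = Counter(p for p in parent_of.values() if p is not None)
    parents = len(children)
    leaves = len(parent_of) - parents
    average = sum(children.values()) / parents if parents else 0.0
    return leaves, parents, average, max(children.values(), default=0)

# Hypothetical miniature book.
parent_of = {
    "book": None,
    "a": "book", "b": "book",
    "a1": "a", "a2": "a",
    "a1x": "a1",
}
leaves, parents, average, largest = child_stats(parent_of)
```

Note that excluding childless pages from the denominator matters a great deal: averaged over all pages, the figure would be just under one child per page in any tree.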

Additional hierarchy within pages

This data summarizes how "h" heading tags are used in documentation pages. Points of interest:
  • This reflects authors' judgement that some hierarchy is needed within the scope of a page.
  • For some reason, perhaps font size, different authors create most-important headings starting at different h levels.
  • Hierarchy within-page is important "conceptual landscape structure" information, yet is hidden from Book's navigation trees (and mine) and consequently represents a way in which a page-granularity navigation tree falls short of being a satisfactorily informative Table of Contents.
arpt_Pages_311_hdt_combos_used_sorted.PNG
[arpt_Pages_311_hdt_combos_used_sorted]
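One way to tally which combinations of heading levels appear on pages is sketched below in Python, using the standard html.parser. This is an illustrative guess at the shape of such a query, not the one behind the table; the sample pages are made up, and a real run would presumably be limited to each page's content area.

```python
from collections import Counter
from html.parser import HTMLParser

class HeadingLevels(HTMLParser):
    """Record which of h1..h6 occur at least once in a page."""

    def __init__(self):
        super().__init__()
        self.levels = set()

    def handle_starttag(self, tag, attrs):
        # Matches h1..h6 only; tags such as "hr" or "head" are ignored.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.add(tag)

def heading_combo(html):
    parser = HeadingLevels()
    parser.feed(html)
    parser.close()
    return ",".join(sorted(parser.levels)) or "(none)"

# Made-up page bodies.
pages = [
    "<h2>Install</h2><h3>Step 1</h3><h3>Step 2</h3>",
    "<h3>Only one level</h3><hr>",
    "<p>No headings at all</p>",
    "<h2>Usage</h2><h3>Options</h3>",
]
combos = Counter(heading_combo(page) for page in pages)
```

Sorting `combos.most_common()` would give a ranking of heading combinations by how many pages use them.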

Navigation tabs

I'm not sure of the rationale behind which navigation tabs appear on which books, but this table shows the relationships. Listed in reverse order of pages-per-book.
arpt_Pages_157_Book-vs-Tabs-XTab.PNG
[arpt_Pages_157_Book-vs-Tabs-XTab]

Use of structure beyond narrative

I am interested in measures which relate to how far authors are venturing beyond unstructured narrative when discussing technical and procedural topics which often have some intrinsic conceptual structure. In other words, are they using structured descriptive devices that mirror the structure in the topics: sequence, hierarchy, 1-to-N and N-to-N relationships reflected in lists, nested lists, tables, diagrams and so on. In discussing software, there is an additional motivation to use images: screenshots with which to describe user interaction with the product.
There can be a variety of reasons why authors haven't used such devices, including:
  • Editing environment makes a particular device difficult (for example, tables or images)
  • Author is not familiar enough with the material to have grasped the relationships, and is just at the level of listing relatively disjointed facts. (needs to undergo "concept refactoring" :-) ).
  • Prohibitions on using certain devices for style reasons.
  • etc.
Anyhow, here is a start at examining some of that.

Images

A simple measure is "number of images per page". The table below shows this figure for four popular Drupal books I surveyed, alongside the count for drupal.org's pages ("Community" in the table). By this measure, the drupal.org docs look only a little less image-intensive than the books.
arpt_images_cmty_vs_books01.PNG
Notes:
  • I excluded from both the paper book and drupal.org counts trivial images such as check-mark, or attention symbols
  • Incredibly, three of the paper books scored within 1% of each other, at 0.290
However, this isn't a very complete story, because the distribution of the images is very different. In paper books, pages with images usually have only one image, though some have as many as three. So the pages-per-image is approximately the reciprocal of the average images-per-page, or around one image-equipped page per 4 pages, or 25% of pages.
In drupal.org's pages, a large fraction of the images are accounted for by a small number of pages with large numbers of images, as shown in the following frequency histograms. Indeed, only about 7.6% of drupal.org's doc pages have images.
arpt_Images_140_NonTrivImageCount_PageFreq.PNG
[arpt_Images_140_NonTrivImageCount_PageFreq]
The left plot shows all pages, with the majority (7500) having zero images, and one page having as many as 53 images. The right plot zooms in on the 7.6% of pages which have images, showing that 3.5% have one image, 1.3% have two images, and so on.
Does this vary by book, which might relate to certain topics calling for more images, or different authors?
arpt_Images_146_ImageCountFreqHisto_ByBook_Pct.PNG
[arpt_Images_146_ImageCountFreqHisto_ByBook_Pct]
The general trend for a particular book can be noted by looking in the zero column, which tells the percentage of pages in that book having no images.

Tables

This statistic looks at how commonly tables appear on pages. The short story is that about 4.7% of pages use tables, with most of these using only one.
arpt_Tables_110-TableCountByBook.PNG
[arpt_Tables_110-TableCountByBook]

Lists

[TODO: queries similar to images examples]

Page lifecycle and maintenance process

Age of content.

How recently have pages been edited? The following table shows counts of pages according to how recently they were edited. There are separate sections for counts by years (includes all pages), and for counts for the most recent 7 quarters.
arpt_PageDate_143_UpdateYrsAgo_ByBook.PNG
[arpt_PageDate_143_UpdateYrsAgo_ByBook]
This would be an interesting measure of overall "up-to-date-ness" of the documentation. Unfortunately, in this simple measure, any edit will cause the page to look recent, even if someone just moved a comma on a page last touched nine years ago. So something smarter would be useful here.

Status of content

Approximately 3800 pages have a Status indication. This table shows for each status how long since pages were last edited.
arpt_Pages_221_Status_ByLastEditQtr.PNG
This is of most interest in connection with the statuses that indicate additional attention is needed, here shaded in blue. (The TOTALS row also encompasses only the blue rows.)
[arpt_Pages_221_Status_ByLastEditQtr]

Editability of content

How easy or hard is it for would-be revisers to edit pages spontaneously? (Or conversely, what proportion of pages are protected from accidents or mischief?) This table shows counts and percentage of pages with edit permission enabled.
arpt_Pages_231_EditablePct.PNG
[arpt_Pages_231_EditablePct]

Page Comments

I was hoping that comments would provide an interesting guide to what pages generated a lot of interest, or at least discussion. However, since there's a policy to delete comments if and when they are incorporated into a page, it's not clear to me how to interpret stats in this area, or how to come up with better ones. Anyhow, here are some basic stats.

Comments per page

As expected, there are lots of pages with no comments, quite a few with one comment, then a declining number of pages with progressively more comments, with the current champion being a page with 174 comments. Here is a table showing counts of pages having particular counts of comments (up to 13 comments/page):
arpt_Comments_144_CommentCountFreqHist_ByBook.PNG
[arpt_Comments_144_CommentCountFreqHist_ByBook]
A frequency histogram for the whole body of documents (the TOTALS row in the table above) looks like this:
arpt_Comments_144_CommentCountFreqHist_ByBook-freqhist.PNG
On the left is the straightforward plot, with the ~5500 pages having zero comments plotted at the top left, and the single page having 174 comments at the bottom right. The rest is hard to read, so on the right I plotted the same data using log scales for X and Y. (I fudged the X value for zero by adding 0.1, so it can appear on the log plot.)
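The log-scale trick, including the fudge for the zero-comment bar, can be sketched in a few lines of Python. The 0.1 offset, the roughly 5500 zero-comment pages and the 174-comment page are from the text; the other counts are invented, and the function name is hypothetical.

```python
import math

def log_log_points(histogram, zero_offset=0.1):
    """Convert {comments per page: number of pages} into log10 coordinates.

    A zero on the X axis cannot be drawn on a log scale, so it is nudged
    to zero_offset. Buckets with no pages are dropped for the same reason.
    """
    points = []
    for comments, pages in sorted(histogram.items()):
        if pages <= 0:
            continue
        x = comments if comments > 0 else zero_offset
        points.append((math.log10(x), math.log10(pages)))
    return points

# Partly invented histogram: only the 0 and 174 entries echo figures from the text.
histogram = {0: 5500, 1: 1000, 10: 10, 174: 1}
points = log_log_points(histogram)
```

The zero-comment bucket ends up at x = -1 on the log axis, one decade to the left of the one-comment bucket, which is an arbitrary but harmless placement.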

Comments per author

At the time of this survey, there were comments from ~9800 authors, of whom ~6500 have one comment (again, not counting comments that have been deleted).
arpt_Comments_925_CommentCountByAuthor_FreqHist.PNG
[arpt_Comments_925_CommentCountByAuthor_FreqHist]
Once again, the straightforward frequency histogram is on the left. Changing the axes to log-log makes the data into an almost straight line. Long Tail 101, I guess.
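A near-straight line on log-log axes is what a power-law relationship would produce, and its slope can be estimated with an ordinary least-squares fit on the logged values. The Python sketch below shows one way to do that; it is not part of the original analysis, and the author counts are invented to follow an exact power law so the result is easy to check.

```python
import math

def log_log_slope(histogram):
    """Least-squares slope and intercept of log10(count) against log10(value).

    histogram maps comments-per-author -> number of authors. Entries with a
    zero key or zero count are skipped, since their logarithm is undefined.
    Needs at least two distinct usable keys.
    """
    pairs = [(math.log10(k), math.log10(n))
             for k, n in histogram.items() if k > 0 and n > 0]
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented data following authors = 6400 * comments ** -2 exactly.
histogram = {1: 6400, 2: 1600, 4: 400, 8: 100}
slope, intercept = log_log_slope(histogram)
```

For this made-up data the fitted slope is -2; applying the same fit to the real per-author counts would give a rough exponent for the long tail, with the usual caveat that a straight-looking log-log plot alone does not prove a power law.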

Comments since last edit

(TODO: what query would display this best?)