Moved doc.

2014-04-08 20:06:56 -07:00
parent 7f3674cbd8
commit b1c2bb6e25
1 changed files with 0 additions and 485 deletions
--- a/doc/jwz-threading.txt
+++ b/doc/jwz-threading.txt
@@ -1,485 +0,0 @@
-                              message threading.
-                  (C) 1997-2002 Jamie Zawinski <jwz@jwz.org>
-
-━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-
-In this document, I describe what is, in my humble but correct opinion, the
-best known algorithm for threading messages (that is, grouping messages
-together in parent/child relationships based on which messages are replies to
-which others.) This is the threading algorithm that was used in Netscape Mail
-and News 2.0 and 3.0, and in Grendel.
-
-Sadly, my C implementation of this algorithm is not available, because it was
-purged during the 4.0 rewrite, and Netscape refused to allow me to free the 3.0
-source code.
-
-However, my Java implementation is available in the Grendel source. You can
-find a descendant of that code on ftp.mozilla.org. Here's the original source
-release: grendel-1998-09-05.tar.gz; and a later version, ported to more modern
-Java APIs: grendel-1999-05-14.tar.gz. The threading code is in view/
-Threader.java. See also IThreadable and TestThreader. (The mailsum code in
-storage/MailSummaryFile.java and the MIME parser in the mime/ directory may
-also be of interest.)
-
-This is not the algorithm that Netscape 4.x uses, because this is another area
-where the 4.0 team screwed the pooch, and instead of just continuing to use the
-existing working code, replaced it with something that was bloated, slow,
-buggy, and incorrect. But hey, at least it was in C++ and used databases!
-
-This algorithm is also described in the imapext-thread Internet Draft: Mark
-Crispin and Kenneth Murchison formalized my description of this algorithm, and
-propose it as the THREAD extension to the IMAP protocol (the idea being that
-the IMAP server could give you back a list of messages in a pre-threaded state,
-so that it wouldn't need to be done on the client side.) If you find my
-description of this algorithm confusing, perhaps their restating of it will be
-more to your taste.
-
-I'm told this algorithm is also used in the Evolution and Balsa mail readers.
-Also, Simon Cozens and Richard Clamp have written a Perl version; Frederik
-Dietz has written a Ruby version; and Max Ogden has written a JavaScript
-version. (I've not tested any of these implementations, so I make no claims as
-to how faithfully they implement it.)
-
-                    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-
-First some background on the headers involved.
-
-In-Reply-To:
-
-    The In-Reply-To header was originally defined by RFC 822, the 1982 standard
-    for mail messages. In 2001, its definition was tightened up by RFC 2822.
-
-    RFC 822 defined the In-Reply-To header as, basically, a free-text header.
-    The syntax of it allowed it to contain basically any text at all. The
-    following is, literally, a legal RFC 822 In-Reply-To header:
-
-        In-Reply-To: thirty-five ham and cheese sandwiches
-
-    So you're not guaranteed to be able to parse anything useful out of
-    In-Reply-To if it exists, and even if it contains something that looks like
-    a Message-ID, it might not be (especially since Message-IDs and email
-    addresses have identical syntax.)
-
-    However, most of the time, In-Reply-To headers do have something useful in
-    them. Back in 1997, I grepped over a huge number of messages and collected
-    some damned lies, I mean, statistics, on what kind of In-Reply-To headers
-    they contained. The results:
-
-        In a survey of 22,950 mail messages with In-Reply-To headers:
-
-                  18,396   had at least one occurrence of <>-bracketed text.
-                   4,554   had no <>-bracketed text at all (just names and
-                           dates.)
-                     714   contained one <>-bracketed addr-spec and no message
-                           IDs.
-                       4   contained multiple message IDs.
-                       1   contained one message ID and one <>-bracketed
-                           addr-spec.
-
-        The most common forms of In-Reply-To seemed to be:
-
-                     31%   NAME's message of TIME <ID@HOST>
-                     22%   <ID@HOST>
-                      9%   <ID@HOST> from NAME at "TIME"
-                      8%   USER's message of TIME <ID@HOST>
-                      7%   USER's message of TIME
-                      6%   Your message of "TIME"
-                     17%   hundreds of other variants (average 0.4% each?)
-
-    Of course these numbers are very much dependent on the sample set, which,
-    in this case, was probably skewed toward Unix users, and/or toward people
-    who had been on the net for quite some time (due to the age of the archives
-    I checked.)
-
-    However, this seems to indicate that it's not unreasonable to assume that,
-    if there is an In-Reply-To field, then the first <>-bracketed text found
-    therein is the Message-ID of the parent message. It is safe to assume this,
-    that is, so long as you still exhibit reasonable behavior when that
-    assumption turns out to be wrong, which will happen a small-but-not-
-    insignificant portion of the time.
-
-    RFC 2822, the successor to RFC 822, updated the definition of In-Reply-To:
-    by the more modern standard, In-Reply-To may contain only message IDs.
-    There will usually be only one, but there could be more than one: these are
-    the IDs of the messages to which this one is a direct reply (the idea being
-    that you might be sending one message in reply to several others.)
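A minimal sketch of the "first <>-bracketed text" heuristic described above. The function name and the regular expression are illustrative assumptions, not the grammar from either RFC, and callers still have to cope with the cases where the extracted token turns out to be an email address.

```python
import re

# Matches one <...>-bracketed token. This pattern is an assumption made for
# illustration; it is looser than the RFC 2822 msg-id grammar.
_BRACKETED = re.compile(r"<[^<>\s]+>")


def first_bracketed_id(in_reply_to):
    """Return the first <>-bracketed token in the header value, or None."""
    match = _BRACKETED.search(in_reply_to)
    return match.group(0) if match else None
```

A header with no bracketed text at all (the "just names and dates" case in the survey) yields None, so the message is treated as having no usable In-Reply-To.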
-
-References:
-
-    The References header was defined by RFC 822 in 1982. It was defined in,
-    effectively, the same way as the In-Reply-To header was defined: which is
-    to say, its definition was pretty useless. (Like In-Reply-To, its
-    definition was also tightened up in 2001 by RFC 2822.)
-
-    However, the References header was also defined in 1987 by RFC 1036
-    (section 2.2.5), the standard for USENET news messages. That definition was
-    much tighter and more useful than the RFC 822 definition: it asserts that
-    this header contain a list of Message-IDs listing the parent, grandparent,
-    great-grandparent, and so on, of this message, oldest first. That is, the
-    direct parent of this message will be the last element of the References
-    header.
-
-    It is not guaranteed to contain the entire tree back to the root-most
-    message in the thread: news readers are allowed to truncate it at their
-    discretion, and the manner in which they truncate it (from the front, from
-    the back, or from the middle) is not defined.
-
-    Therefore, while there is useful info in the References header, it is not
-    uncommon for multiple messages in the same thread to have seemingly-
-    contradictory References data, so threading code must make an effort to do
-    the right thing in the face of conflicting data.
-
-    RFC 2822 updated the mail standard to have the same semantics of References
-    as the news standard, RFC 1036.
-
-In practice, if you ever see a References header in a mail message, it will
-follow the RFC 1036 (and RFC 2822) definition rather than the RFC 822
-definition. Because the References header both contains more information and is
-easier to parse, many modern mail user agents generate and use the References
-header in mail instead of (or in addition to) In-Reply-To, and use the USENET
-semantics when they do so.
-
-You will generally not see In-Reply-To in a news message, but it can
-occasionally happen, usually as a result of mail/news gateways.
-
-So, any sensible threading software will have the ability to take both
-In-Reply-To and References headers into account.
-
-Note: RFC 2822 (section 3.6.4) says that a References field should contain the
-contents of the parent message's References field, followed by the contents of
-the parent's Message-ID field (in other words, the References field should
-contain the path through the thread.) However, I've been informed that recent
-versions of Eudora violate this standard: they put the parent Message-ID in the
-In-Reply-To header, but do not duplicate it in the References header: instead,
-the References header contains the grandparent, great-grand-parent, etc.
-
-This implies that to properly reconstruct the thread of a message in the face
-of this nonstandard behavior, we need to append any In-Reply-To message IDs to
-References.
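One way to build the combined list, sketched under some assumptions: header values arrive as plain strings (or None when absent), IDs are recognized with a loose regular expression, and the In-Reply-To ID is skipped when it already is the last reference. The name `reference_ids` is hypothetical.

```python
import re

# Loose pattern for a <>-bracketed ID; an illustrative assumption only.
_BRACKETED = re.compile(r"<[^<>\s]+>")


def reference_ids(references, in_reply_to):
    """Combine References and In-Reply-To header values into one ID list."""
    ids = _BRACKETED.findall(references or "")
    # Only the first bracketed token of In-Reply-To is trusted; later ones
    # are more likely to be email addresses than message IDs.
    match = _BRACKETED.search(in_reply_to or "")
    if match and (not ids or ids[-1] != match.group(0)):
        ids.append(match.group(0))
    return ids
```

With a standards-compliant sender the parent is already last in References and nothing changes; with the behavior described above for Eudora, the parent ID is appended after the grandparent.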
-
-                    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-
-                                 The Algorithm
-
-This algorithm consists of five main steps, and each of those steps is somewhat
-complicated. However, once you've wrapped your brain around it, it's not really
-that complicated, considering what it does.
-
-In defense of its complexity, I can say this:
-
-  • This algorithm is incredibly robust in the face of garbage input, and even
-    in the face of malicious input (you cannot construct a set of inputs that
-    will send this algorithm into a loop, for example.)
-
-  • This algorithm has been field-tested by something on the order of ten
-    million users over the course of six years.
-
-  • It really does work incredibly well. I've never seen it produce results
-    that were anything less than totally reasonable.
-
-Well, enough with the disclaimers.
-
-                    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-
-Definitions:
-
-  • A Container object is composed of:
-
-        Message message;           // (may be null)
-        Container parent;
-        Container child;           // first child
-        Container next;            // next element in sibling list, or null
-
-  • A Message object only has a few fields we are interested in:
-
-        String subject;          
-        ID message_id;            // the ID of this message
-        ID *references;           // list of IDs of parent messages
-
-    The References field is populated from the ``References'' and/or
-    ``In-Reply-To'' headers. If both headers exist, take the first thing in the
-    In-Reply-To header that looks like a Message-ID, and append it to the
-    References header.
-
-    If there are multiple things in In-Reply-To that look like Message-IDs,
-    only use the first one of them: odds are that the later ones are actually
-    email addresses, not IDs.
-
-    These ID objects can be strings, or they can be any other token on which
-    you can do meaningful equality comparisons.
-
-    Only two things need to be done with the subject strings: ask whether they
-    begin with ``Re:'', and compare the non-Re parts for equivalence. So you
-    can get away with interning or otherwise hashing these, too. (This is a
-    very good idea: my code does this so that I can use == instead of strcmp
-    inside the loop.)
-
-    For the same reason, it pays to intern or hash the ID objects as well:
-    pointer-equivalence comparisons are way faster than strcmp.
-
-    The reason the Container and Message objects are separate is because the
-    Container fields are only needed during the act of threading: you don't
-    need to keep those around, so there's no point in bulking up every Message
-    structure with them.
-
-  • The id_table is a hash table associating Message-IDs with Containers.
-
-  • An ``empty container'' is one that doesn't have a message in it, but which
-    shows evidence of having existed. For whatever reason, we don't have that
-    message in our list (maybe it is expired or canceled, maybe it was deleted
-    from the folder, or any of several other reasons.)
-
-    At presentation-time, these will show up as unselectable ``parent''
-    containers. For example, if we have the thread
-
-          -- A
-             |-- B
-             \-- C
-          -- D
-
-    and we know about messages B and C, but their common parent A does not
-    exist, there will be a placeholder for A, to group them together, and
-    prevent D from seeming to be a sibling of B and C.
-
-    These ``dummy'' messages only ever occur at depth 0.
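The two subject operations mentioned in the definitions (does it begin with "Re:", and what is the non-Re part) can be sketched as below. The regular expression and the name `split_subject` are assumptions; Python's `sys.intern` stands in for the interning trick, so that equal base subjects are the same object.

```python
import re
import sys

# Leading run of reply markers such as "Re:", "RE:", "RE[5]:" or
# "Re: Re[4]: Re:". The exact pattern is an assumption for illustration.
_RE_PREFIX = re.compile(r"^(?:\s*re(?:\[\d+\])?:\s*)+", re.IGNORECASE)


def split_subject(subject):
    """Return (is_reply, base), where base is the interned non-Re part."""
    base = _RE_PREFIX.sub("", subject).strip()
    is_reply = base != subject.strip()
    return is_reply, sys.intern(base)
```

Because the base strings are interned, two subjects can be compared with `is` rather than character by character. An empty base means the subject carries no usable text.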
-
-The Algorithm:
-
- 1. For each message:
-
-     A. If id_table contains an empty Container for this ID:
-          ● Store this message in the Container's message slot.
-        Else:
-          ● Create a new Container object holding this message;
-          ● Index the Container by Message-ID in id_table.
-
-     B. For each element in the message's References field:
-
-          ● Find a Container object for the given Message-ID:
-              ● If there's one in id_table use that;
-              ● Otherwise, make (and index) one with a null Message.
-
-          ● Link the References field's Containers together in the order
-            implied by the References header.
-              ● If they are already linked, don't change the existing links.
-              ● Do not add a link if adding that link would introduce a loop:
-                that is, before asserting A->B, search down the children of B
-                to see if A is reachable, and also search down the children of
-                A to see if B is reachable. If either is already reachable as a
-                child of the other, don't add the link.
-
-     C. Set the parent of this message to be the last element in References.
-        Note that this message may have a parent already: this can happen
-        because we saw this ID in a References field, and presumed a parent
-        based on the other entries in that field. Now that we have the actual
-        message, we can be more definitive, so throw away the old parent and
-        use this new one: find this Container in the old parent's children
-        list, unlink it, and then link it under the new parent.
-
-        Note that this could cause this message to now have no parent, if it
-        has no references field, but some message referred to it as the
-        non-first element of its references. (Which would have been some kind
-        of lie...)
-
-        Note that at all times, the various ``parent'' and ``child'' fields
-        must be kept inter-consistent.
-
- 2. Find the root set.
-
-    Walk over the elements of id_table, and gather a list of the Container
-    objects that have no parents.
-
- 3. Discard id_table. We don't need it any more.
-
- 4. Prune empty containers.
-    Recursively walk all containers under the root set.
-    For each container:
-     A. If it is an empty container with no children, nuke it.
-
-        Note: Normally such containers won't occur, but they can show up when
-        two messages have References lines that disagree. For example, assuming
-        A and B are messages, and 1, 2, and 3 are references for messages we
-        haven't seen:
-
-            A has references: 1, 2, 3
-            B has references: 1, 3
-
-        There is ambiguity as to whether 3 is a child of 1 or of 2. So,
-        depending on the processing order, we might end up with either
-
-              -- 1
-                 |-- 2
-                     \-- 3
-                         |-- A
-                         \-- B
-
-        or
-
-              -- 1
-                 |-- 2            <--- non root childless container!
-                 \-- 3
-                     |-- A
-                     \-- B
-
-     B. If the Container has no Message, but does have children, remove this
-        container but promote its children to this level (that is, splice them
-        in to the current child list.)
-
-        Do not promote the children if doing so would promote them to the root 
-        set -- unless there is only one child, in which case, do.
-
- 5. Group root set by subject.
-
-    If any two members of the root set have the same subject, merge them. This
-    is so that messages which don't have References headers at all still get
-    threaded (to the extent possible, at least.)
-     A. Construct a new hash table, subject_table, which associates subject
-        strings with Container objects.
-
-     B. For each Container in the root set:
-
-          ● Find the subject of that sub-tree:
-              ● If there is a message in the Container, the subject is the
-                subject of that message.
-              ● If there is no message in the Container, then the Container
-                will have at least one child Container, and that Container will
-                have a message. Use the subject of that message instead.
-              ● Strip ``Re:'', ``RE:'', ``RE[5]:'', ``Re: Re[4]: Re:'' and so
-                on.
-              ● If the subject is now "", give up on this Container.
-              ● Add this Container to the subject_table if:
-                  ● There is no container in the table with this subject, or
-                  ● This one is an empty container and the old one is not: the
-                    empty one is more interesting as a root, so put it in the
-                    table instead.
-                  ● The container in the table has a ``Re:'' version of this
-                    subject, and this container has a non-``Re:'' version of
-                    this subject. The non-re version is the more interesting of
-                    the two.
-
-     C. Now the subject_table is populated with one entry for each subject
-        which occurs in the root set. Now iterate over the root set, and merge
-        together the containers that share a subject.
-
-        For each Container in the root set:
-
-          ● Find the subject of this Container (as above.)
-          ● Look up the Container of that subject in the table.
-          ● If it is null, or if it is this container, continue.
-
-          ● Otherwise, we want to group together this Container and the one in
-            the table. There are a few possibilities:
-
-              ● If both are dummies, append one's children to the other, and
-                remove the now-empty container.
-
-              ● If one container is empty and the other is not, make the
-                non-empty one be a child of the empty, and a sibling of the
-                other ``real'' messages with the same subject (the empty's
-                children.)
-
-              ● If that container is non-empty, and that message's subject
-                does not begin with ``Re:'', but this message's subject does,
-                then make this be a child of the other.
-
-              ● If that container is non-empty, and that message's subject
-                begins with ``Re:'', but this message's subject does not, then
-                make that be a child of this one -- they were misordered. (This
-                happens somewhat implicitly, since if there are two messages,
-                one with Re: and one without, the one without will be in the
-                hash table, regardless of the order in which they were seen.)
-
-              ● Otherwise, make a new empty container and make both messages
-                children of it. This catches the both-are-replies and
-                neither-are-replies cases, and makes them be siblings instead
-                of asserting a hierarchical relationship which might not be
-                true.
-
-                (People who reply to messages without using ``Re:'' and without
-                using a References line will break this slightly. Those people
-                suck.)
-
-        (It has occurred to me that taking the date or message number into
-        account would be one way of resolving some of the ambiguous cases, but
-        that's not altogether straightforward either.)
-
- 6. Now you're done threading!
-    Specifically, you no longer need the ``parent'' slot of the Container
-    object, so if you wanted to flush the data out into a smaller, longer-lived
-    structure, you could reclaim some storage as a result.
-
- 7. Now, sort the siblings.
-    At this point, the parent-child relationships are set. However, the sibling
-    ordering has not been adjusted, so now is the time to walk the tree one
-    last time and order the siblings by date, sender, subject, or whatever.
-    This step could also be merged in to the end of step 4, above, but it's
-    probably clearer to make it be a final pass. If you were careful, you could
-    also sort the messages first and take care in the above algorithm to not
-    perturb the ordering, but that doesn't really save anything.
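Steps 1, 2 and 4 can be sketched compactly as follows. This is an illustration under stated assumptions, not the Grendel code: each message is reduced to a (message_id, reference_ids) pair, the ID itself is stored as the message, a Python list of children replaces the child/next links, and a duplicate Message-ID simply overwrites the earlier message. Subject grouping (step 5) and sorting (step 7) are left out.

```python
class Container:
    """A thread node; message is None for an empty (placeholder) container."""

    def __init__(self):
        self.message = None
        self.parent = None
        self.children = []

    def reaches(self, other):
        """True if other is this container or one of its descendants."""
        return self is other or any(c.reaches(other) for c in self.children)


def attach(parent, child):
    """Link child under parent, keeping both directions consistent."""
    child.parent = parent
    parent.children.append(child)


def prune(nodes, parent=None):
    """Step 4: drop childless empties and splice out other empties."""
    kept = []
    for node in nodes:
        node.children = prune(node.children, node)
        promote = (not node.children            # childless: nothing survives
                   or parent is not None        # below the root: always splice
                   or len(node.children) == 1)  # at the root: only one child
        if node.message is None and promote:
            kept.extend(node.children)
        else:
            kept.append(node)
    for node in kept:
        node.parent = parent
    return kept


def thread(messages):
    """Return the pruned root set for (message_id, reference_ids) pairs."""
    id_table = {}
    for msg_id, refs in messages:
        # Step 1A: reuse a container created by an earlier reference.
        node = id_table.setdefault(msg_id, Container())
        node.message = msg_id
        # Step 1B: link referenced containers in header order, keeping
        # existing links and refusing any link that would form a loop.
        prev = None
        for ref in refs:
            cur = id_table.setdefault(ref, Container())
            if prev is not None and cur.parent is None and not cur.reaches(prev):
                attach(prev, cur)
            prev = cur
        # Step 1C: the last reference is the definitive parent.
        if node.parent is not None:
            node.parent.children.remove(node)
            node.parent = None
        if prev is not None and not node.reaches(prev):
            attach(prev, node)
    # Step 2: the root set; step 3 happens when id_table goes out of scope.
    roots = [c for c in id_table.values() if c.parent is None]
    return prune(roots)
```

Running this on the conflicting-References example from step 4 (A with references 1, 2, 3 and B with references 1, 3) gives a single empty root standing in for 1, with A and B as its children, whichever message is processed first. The recursion in `reaches` and `prune` is fine for a sketch, but very deep threads would need an iterative walk in Python.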
-
-                    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-
-You might be wondering what Netscape Confusicator 4.0 broke. Well, basically
-they never got threading working right. Aside from crashing, corrupting their
-database files, and general bugginess, the fundamental problem was
-twofold:
-
-  • 4.0 eliminated the ``dummy thread parent'' step, which is an absolute
-    necessity to get threading right in the case where you don't have every
-    message (e.g., because one has expired, or was never sent to you at all.)
-    The best explanation I was able to get from them for why they did this was,
-    ``it looked ugly and I didn't understand why it was there.''
-
-  • 4.0 eliminated the ``group similar unthreaded subjects'' step, which is
-    necessary to get some semblance of threading right in the absence of
-    References and In-Reply-To, or in the presence of mangled References. If
-    there was no References header, 4.0 just didn't thread at all.
-
-Plus my pet peeve,
-
-  • The 4.0 UI presented threading as a kind of sorting, which is just not the
-    case. Threading is the act of presenting parent/child relationships,
-    whereas sorting is the act of ordering siblings.
-
-    That is, 4.0 gives you these choices: ``Sort by Date; Sort by Subject; Sort
-    by message number; or Thread.'' Where they assume that ``Thread'' implies
-    ``Sort by Date.'' So that means that there's no way to see a threaded set
-    of messages that are sorted by message number, or by sender, etc.
-
-    There should be options for how to sort the messages; and then, orthogonal
-    to that should be the boolean option of whether the messages should be
-    threaded.
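To make the distinction concrete, here is a small sketch in which the sort key is a free parameter and the parent/child links are never touched. The `Node` class and the dictionary-shaped messages are hypothetical stand-ins, and the sketch assumes every node holds a message (an empty container would need a key of its own).

```python
class Node:
    """Minimal stand-in for a threaded container (illustrative only)."""

    def __init__(self, message, children=()):
        self.message = message
        self.children = list(children)


def sort_siblings(nodes, key):
    """Order each sibling list by key without changing who is whose parent."""
    for node in nodes:
        node.children = sort_siblings(node.children, key)
    return sorted(nodes, key=key)
```

Calling `sort_siblings(roots, key=lambda n: n.message["sender"])` and then calling it again with a message-number key reorders siblings at every depth while leaving the thread structure as it was, which is the sense in which sorting and threading are independent options.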
-
-I seem to recall there being some other problem that was a result of the thread
-hierarchy being stored in the database, instead of computed as needed in
-realtime (there was some kind of ordering or stale-data issue that came
-up?) but maybe they finally managed to fix that.
-
-My C version of this code was able to thread 10,000 messages in less than half
-a second on a low-end (90 MHz) Pentium, so the argument that it has to be in
-the database for efficiency is pure bunk.
-
-Also bunk is the idea that databases are needed for ``scalability.'' This code
-can thread 100,000 messages without a horrible delay, and the fact is, if
-you're looking at a 100,000 message folder (or for that matter, if you're
-running Confusicator at all), you're doing so on a machine that has sufficient
-memory to hold these structures in core. Also consider the question of whether
-your GUI toolkit contains a list/outliner widget that can display a million
-elements in the first place. (The answer is probably ``no.'') Also consider
-whether you have ever in your life seen a single folder that has a million
-messages in it, and that further, you've wanted to look at all at once (rather
-than only looking at the most recent 100,000 messages to arrive in that
-newsgroup...)
-
-In short, all the arguments I've heard for using databases to implement
-threading and mbox summarization are solving problems that simply don't exist.
-Show me a real-world situation where the above technique actually falls down,
-and then we'll talk.
-
-Just say no to databases!
-
-━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-