diff --git a/doc/jwz-threading.txt b/doc/jwz-threading.txt new file mode 100644 index 0000000..4a65ef1 --- /dev/null +++ b/doc/jwz-threading.txt @@ -0,0 +1,485 @@ + message threading. + (C) 1997-2002 Jamie Zawinski + +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ + +In this document, I describe what is, in my humble but correct opinion, the +best known algorithm for threading messages (that is, grouping messages +together in parent/child relationships based on which messages are replies to +which others.) This is the threading algorithm that was used in Netscape Mail +and News 2.0 and 3.0, and in Grendel. + +Sadly, my C implementation of this algorithm is not available, because it was +purged during the 4.0 rewrite, and Netscape refused to allow me to free the 3.0 +source code. + +However, my Java implementation is available in the Grendel source. You can +find a descendant of that code on ftp.mozilla.org. Here's the original source +release: grendel-1998-09-05.tar.gz; and a later version, ported to more modern +Java APIs: grendel-1999-05-14.tar.gz. The threading code is in view/ +Threader.java. See also IThreadable and TestThreader. (The mailsum code in +storage/MailSummaryFile.java and the MIME parser in the mime/ directory may +also be of interest.) + +This is not the algorithm that Netscape 4.x uses, because this is another area +where the 4.0 team screwed the pooch, and instead of just continuing to use the +existing working code, replaced it with something that was bloated, slow, +buggy, and incorrect. But hey, at least it was in C++ and used databases! + +This algorithm is also described in the imapext-thread Internet Draft: Mark +Crispin and Kenneth Murchison formalized my description of this algorithm, and +propose it as the THREAD extension to the IMAP protocol (the idea being that +the IMAP server could give you back a list of messages in a pre-threaded state, +so that it wouldn't need to be done on the client side.) If you find my +description of this algorithm confusing, perhaps their restating of it will be +more to your taste. + +I'm told this algorithm is also used in the Evolution and Balsa mail readers. +Also, Simon Cozens and Richard Clamp have written a Perl version; Frederik +Dietz has written a Ruby version; and Max Ogden has written a JavaScript +version. (I've not tested any of these implementations, so I make no claims as +to how faithfully they implement it.) + + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ + +First some background on the headers involved. + +In-Reply-To: + + The In-Reply-To header was originally defined by RFC 822, the 1982 standard + for mail messages. In 2001, its definition was tightened up by RFC 2822. + + RFC 822 defined the In-Reply-To header as, basically, a free-text header. + The syntax of it allowed it to contain basically any text at all. The + following is, literally, a legal RFC 822 In-Reply-To header: + + In-Reply-To: thirty-five ham and cheese sandwiches + + So you're not guaranteed to be able to parse anything useful out of + In-Reply-To if it exists, and even if it contains something that looks like + a Message-ID, it might not be (especially since Message-IDs and email + addresses have identical syntax.) + + However, most of the time, In-Reply-To headers do have something useful in + them. Back in 1997, I grepped over a huge number of messages and collected + some damned lies, I mean, statistics, on what kind of In-Reply-To headers + they contained. The results: + + In a survey of 22,950 mail messages with In-Reply-To headers: + + 18,396 had at least one occurrence of <>-bracketed text. + 4,554 had no <>-bracketed text at all (just names and + dates.) + 714 contained one <>-bracketed addr-spec and no message + IDs. + 4 contained multiple message IDs. + 1 contained one message ID and one <>-bracketed + addr-spec. + + The most common forms of In-Reply-To seemed to be: + + 31% NAME's message of TIME + 22% + 9% from NAME at "TIME" + 8% USER's message of TIME + 7% USER's message of TIME + 6% Your message of "TIME" + 17% hundreds of other variants (average 0.4% each?) + + Of course these numbers are very much dependent on the sample set, which, + in this case, was probably skewed toward Unix users, and/or toward people + who had been on the net for quite some time (due to the age of the archives + I checked.) + + However, this seems to indicate that it's not unreasonable to assume that, + if there is an In-Reply-To field, then the first <>-bracketed text found + therein is the Message-ID of the parent message. It is safe to assume this, + that is, so long as you still exhibit reasonable behavior when that + assumption turns out to be wrong, which will happen a small-but-not- + insignificant portion of the time. + + RFC 2822, the successor to RFC 822, updated the definition of In-Reply-To: + by the more modern standard, In-Reply-To may contain only message IDs. + There will usually be only one, but there could be more than one: these are + the IDs of the messages to which this one is a direct reply (the idea being + that you might be sending one message in reply to several others.) + +References: + + The References header was defined by RFC 822 in 1982. It was defined in, + effectively, the same way as the In-Reply-To header was defined: which is + to say, its definition was pretty useless. (Like In-Reply-To, its + definition was also tightened up in 2001 by RFC 2822.) + + However, the References header was also defined in 1987 by RFC 1036 + (section 2.2.5), the standard for USENET news messages. That definition was + much tighter and more useful than the RFC 822 definition: it asserts that + this header contain a list of Message-IDs listing the parent, grandparent, + great-grandparent, and so on, of this message, oldest first. That is, the + direct parent of this message will be the last element of the References + header. + + It is not guaranteed to contain the entire tree back to the root-most + message in the thread: news readers are allowed to truncate it at their + discretion, and the manner in which they truncate it (from the front, from + the back, or from the middle) is not defined. + + Therefore, while there is useful info in the References header, it is not + uncommon for multiple messages in the same thread to have seemingly- + contradictory References data, so threading code must make an effort to do + the right thing in the face of conflicting data. + + RFC 2822 updated the mail standard to have the same semantics of References + as the news standard, RFC 1036. + +In practice, if you ever see a References header in a mail message, it will +follow the RFC 1036 (and RFC 2822) definition rather than the RFC 822 +definition. Because the References header both contains more information and is +easier to parse, many modern mail user agents generate and use the References +header in mail instead of (or in addition to) In-Reply-To, and use the USENET +semantics when they do so. + +You will generally not see In-Reply-To in a news message, but it can +occasionally happen, usually as a result of mail/news gateways. + +So, any sensible threading software will have the ability to take both +In-Reply-To and References headers into account. + +Note: RFC 2822 (section 3.6.4) says that a References field should contain the +contents of the parent message's References field, followed by the contents of +the parent's Message-ID field (in other words, the References field should +contain the path through the thread.) However, I've been informed that recent +versions of Eudora violate this standard: they put the parent Message-ID in the +In-Reply-To header, but do not duplicate it in the References header: instead, +the References header contains the grandparent, great-grand-parent, etc. + +This implies that to properly reconstruct the thread of a message in the face +of this nonstandard behavior, we need to append any In-Reply-To message IDs to +References. + + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ + + The Algorithm + +This algorithm consists of five main steps, and each of those steps is somewhat +complicated. However, once you've wrapped your brain around it, it's not really +that complicated, considering what it does. + +In defense of its complexity, I can say this: + + • This algorithm is incredibly robust in the face of garbage input, and even + in the face of malicious input (you cannot construct a set of inputs that + will send this algorithm into a loop, for example.) + + • This algorithm has been field-tested by something on the order of ten + million users over the course of six years. + + • It really does work incredibly well. I've never seen it produce results + that were anything less than totally reasonable. + +Well, enough with the disclaimers. + + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ + +Definitions: + + • A Container object is composed of: + + Message message; // (may be null) + Container parent; + Container child; // first child + Container next; // next element in sibling list, or null + + • A Message object only has a few fields we are interested in: + + String subject; + ID message_id; // the ID of this message + ID *references; // list of IDs of parent messages + + The References field is populated from the ``References'' and/or + ``In-Reply-To'' headers. If both headers exist, take the first thing in the + In-Reply-To header that looks like a Message-ID, and append it to the + References header. + + If there are multiple things in In-Reply-To that look like Message-IDs, + only use the first one of them: odds are that the later ones are actually + email addresses, not IDs. + + These ID objects can be strings, or they can be any other token on which + you can do meaningful equality comparisons. + + Only two things need to be done with the subject strings: ask whether they + begin with ``Re:'', and compare the non-Re parts for equivalence. So you + can get away with interning or otherwise hashing these, too. (This is a + very good idea: my code does this so that I can use == instead of strcmp + inside the loop.) + + The ID objects also don't need to be strings, for the same reason. They can + be hashes or numeric indexes or anything for which equality comparisons + hold, so it's way faster if you can do pointer-equivalence comparisons + instead of strcmp. + + The reason the Container and Message objects are separate is because the + Container fields are only needed during the act of threading: you don't + need to keep those around, so there's no point in bulking up every Message + structure with them. + + • The id_table is a hash table associating Message-IDs with Containers. + + • An ``empty container'' is one that doesn't have a message in it, but which + shows evidence of having existed. For whatever reason, we don't have that + message in our list (maybe it is expired or canceled, maybe it was deleted + from the folder, or any of several other reasons.) + + At presentation-time, these will show up as unselectable ``parent'' + containers, for example, if we have the thread + + -- A + |-- B + \-- C + -- D + + and we know about messages B and C, but their common parent A does not + exist, there will be a placeholder for A, to group them together, and + prevent D from seeming to be a sibling of B and C. + + These ``dummy'' messages only ever occur at depth 0. + +The Algorithm: + + 1. For each message: + + A. If id_table contains an empty Container for this ID: + ● Store this message in the Container's message slot. + Else: + ● Create a new Container object holding this message; + ● Index the Container by Message-ID in id_table. + + B. For each element in the message's References field: + + ● Find a Container object for the given Message-ID: + ● If there's one in id_table use that; + ● Otherwise, make (and index) one with a null Message. + + ● Link the References field's Containers together in the order + implied by the References header. + ● If they are already linked, don't change the existing links. + ● Do not add a link if adding that link would introduce a loop: + that is, before asserting A->B, search down the children of B + to see if A is reachable, and also search down the children of + A to see if B is reachable. If either is already reachable as a + child of the other, don't add the link. + + C. Set the parent of this message to be the last element in References. + Note that this message may have a parent already: this can happen + because we saw this ID in a References field, and presumed a parent + based on the other entries in that field. Now that we have the actual + message, we can be more definitive, so throw away the old parent and + use this new one. Find this Container in the parent's children list, + and unlink it. + + Note that this could cause this message to now have no parent, if it + has no references field, but some message referred to it as the + non-first element of its references. (Which would have been some kind + of lie...) + + Note that at all times, the various ``parent'' and ``child'' fields + must be kept inter-consistent. + + 2. Find the root set. + + Walk over the elements of id_table, and gather a list of the Container + objects that have no parents. + + 3. Discard id_table. We don't need it any more. + + 4. Prune empty containers. + Recursively walk all containers under the root set. + For each container: + A. If it is an empty container with no children, nuke it. + + Note: Normally such containers won't occur, but they can show up when + two messages have References lines that disagree. For example, assuming + A and B are messages, and 1, 2, and 3 are references for messages we + haven't seen: + + A has references: 1, 2, 3 + B has references: 1, 3 + + There is ambiguity as to whether 3 is a child of 1 or of 2. So, + depending on the processing order, we might end up with either + + -- 1 + |-- 2 + \-- 3 + |-- A + \-- B + + or + + -- 1 + |-- 2 <--- non root childless container! + \-- 3 + |-- A + \-- B + + B. If the Container has no Message, but does have children, remove this + container but promote its children to this level (that is, splice them + in to the current child list.) + + Do not promote the children if doing so would promote them to the root + set -- unless there is only one child, in which case, do. + + 5. Group root set by subject. + + If any two members of the root set have the same subject, merge them. This + is so that messages which don't have References headers at all still get + threaded (to the extent possible, at least.) + A. Construct a new hash table, subject_table, which associates subject + strings with Container objects. + + B. For each Container in the root set: + + ● Find the subject of that sub-tree: + ● If there is a message in the Container, the subject is the + subject of that message. + ● If there is no message in the Container, then the Container + will have at least one child Container, and that Container will + have a message. Use the subject of that message instead. + ● Strip ``Re:'', ``RE:'', ``RE[5]:'', ``Re: Re[4]: Re:'' and so + on. + ● If the subject is now "", give up on this Container. + ● Add this Container to the subject_table if: + ● There is no container in the table with this subject, or + ● This one is an empty container and the old one is not: the + empty one is more interesting as a root, so put it in the + table instead. + ● The container in the table has a ``Re:'' version of this + subject, and this container has a non-``Re:'' version of + this subject. The non-re version is the more interesting of + the two. + + C. Now the subject_table is populated with one entry for each subject + which occurs in the root set. Now iterate over the root set, and gather + together the difference. + + For each Container in the root set: + + ● Find the subject of this Container (as above.) + ● Look up the Container of that subject in the table. + ● If it is null, or if it is this container, continue. + + ● Otherwise, we want to group together this Container and the one in + the table. There are a few possibilities: + + ● If both are dummies, append one's children to the other, and + remove the now-empty container. + + ● If one container is a empty and the other is not, make the + non-empty one be a child of the empty, and a sibling of the + other ``real'' messages with the same subject (the empty's + children.) + + ● If that container is a non-empty, and that message's subject + does not begin with ``Re:'', but this message's subject does, + then make this be a child of the other. + + ● If that container is a non-empty, and that message's subject + begins with ``Re:'', but this message's subject does not, then + make that be a child of this one -- they were misordered. (This + happens somewhat implicitly, since if there are two messages, + one with Re: and one without, the one without will be in the + hash table, regardless of the order in which they were seen.) + + ● Otherwise, make a new empty container and make both msgs be a + child of it. This catches the both-are-replies and + neither-are-replies cases, and makes them be siblings instead + of asserting a hierarchical relationship which might not be + true. + + (People who reply to messages without using ``Re:'' and without + using a References line will break this slightly. Those people + suck.) + + (It has occurred to me that taking the date or message number into + account would be one way of resolving some of the ambiguous cases, but + that's not altogether straightforward either.) + + 6. Now you're done threading! + Specifically, you no longer need the ``parent'' slot of the Container + object, so if you wanted to flush the data out into a smaller, longer-lived + structure, you could reclaim some storage as a result. + + 7. Now, sort the siblings. + At this point, the parent-child relationships are set. However, the sibling + ordering has not been adjusted, so now is the time to walk the tree one + last time and order the siblings by date, sender, subject, or whatever. + This step could also be merged in to the end of step 4, above, but it's + probably clearer to make it be a final pass. If you were careful, you could + also sort the messages first and take care in the above algorithm to not + perturb the ordering, but that doesn't really save anything. + + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ + +You might be wondering what Netscape Confusicator 4.0 broke. Well, basically +they never got threading working right. Aside from crashing, corrupting their +databases files, and general bugginess, the fundamental problem had been +twofold: + + • 4.0 eliminated the ``dummy thread parent'' step, which is an absolute + necessity to get threading right in the case where you don't have every + message (e.g., because one has expired, or was never sent to you at all.) + The best explanation I was able to get from them for why they did this was, + ``it looked ugly and I didn't understand why it was there.'' + + • 4.0 eliminated the ``group similar unthreaded subjects'' step, which is + necessary to get some semblance of threading right in the absence of + References and In-Reply-To, or in the presence of mangled References. If + there was no References header, 4.0 just didn't thread at all. + +Plus my pet peeve, + + • The 4.0 UI presented threading as a kind of sorting, which is just not the + case. Threading is the act of presenting parent/child relationships, + whereas sorting is the act of ordering siblings. + + That is, 4.0 gives you these choices: ``Sort by Date; Sort by Subject; Sort + by message number; or Thread.'' Where they assume that ``Thread'' implies + ``Sort by Date.'' So that means that there's no way to see a threaded set + of messages that are sorted by message number, or by sender, etc. + + There should be options for how to sort the messages; and then, orthogonal + to that should be the boolean option of whether the messages should be + threaded. + +I seem to recall there being some other problem that was a result of the thread +hierarchy being stored in the database, instead of computed as needed in +realtime (there were was some kind of ordering or stale-data issue that came +up?) but maybe they finally managed to fix that. + +My C version of this code was able to thread 10,000 messages in less than half +a second on a low-end (90 MHz) Pentium, so the argument that it has to be in +the database for efficiency is pure bunk. + +Also bunk is the idea that databases are needed for ``scalability.'' This code +can thread 100,000 messages without a horrible delay, and the fact is, if +you're looking at a 100,000 message folder (or for that matter, if you're +running Confusicator at all), you're doing so on a machine that has sufficient +memory to hold these structures in core. Also consider the question of whether +your GUI toolkit contains a list/outliner widget that can display a million +elements in the first place. (The answer is probably ``no.'') Also consider +whether you have ever in your life seen a single folder that has a million +messages in it, and that further, you've wanted to look at all at once (rather +than only looking at the most recent 100,000 messages to arrive in that +newsgroup...) + +In short, all the arguments I've heard for using databases to implement +threading and mbox summarization are solving problems that simply don't exist. +Show me a real-world situation where the above technique actually falls down, +and then we'll talk. + +Just say no to databases! + +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ + + [ up ]