diff --git a/doc/jwz-threading.txt b/doc/jwz-threading.txt deleted file mode 100644 index 4a65ef1..0000000 --- a/doc/jwz-threading.txt +++ /dev/null @@ -1,485 +0,0 @@ - message threading. - (C) 1997-2002 Jamie Zawinski - -━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ - -In this document, I describe what is, in my humble but correct opinion, the -best known algorithm for threading messages (that is, grouping messages -together in parent/child relationships based on which messages are replies to -which others.) This is the threading algorithm that was used in Netscape Mail -and News 2.0 and 3.0, and in Grendel. - -Sadly, my C implementation of this algorithm is not available, because it was -purged during the 4.0 rewrite, and Netscape refused to allow me to free the 3.0 -source code. - -However, my Java implementation is available in the Grendel source. You can -find a descendant of that code on ftp.mozilla.org. Here's the original source -release: grendel-1998-09-05.tar.gz; and a later version, ported to more modern -Java APIs: grendel-1999-05-14.tar.gz. The threading code is in view/ -Threader.java. See also IThreadable and TestThreader. (The mailsum code in -storage/MailSummaryFile.java and the MIME parser in the mime/ directory may -also be of interest.) - -This is not the algorithm that Netscape 4.x uses, because this is another area -where the 4.0 team screwed the pooch, and instead of just continuing to use the -existing working code, replaced it with something that was bloated, slow, -buggy, and incorrect. But hey, at least it was in C++ and used databases! - -This algorithm is also described in the imapext-thread Internet Draft: Mark -Crispin and Kenneth Murchison formalized my description of this algorithm, and -propose it as the THREAD extension to the IMAP protocol (the idea being that -the IMAP server could give you back a list of messages in a pre-threaded state, -so that it wouldn't need to be done on the client side.) 
If you find my -description of this algorithm confusing, perhaps their restating of it will be -more to your taste. - -I'm told this algorithm is also used in the Evolution and Balsa mail readers. -Also, Simon Cozens and Richard Clamp have written a Perl version; Frederik -Dietz has written a Ruby version; and Max Ogden has written a JavaScript -version. (I've not tested any of these implementations, so I make no claims as -to how faithfully they implement it.) - - ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ - -First some background on the headers involved. - -In-Reply-To: - - The In-Reply-To header was originally defined by RFC 822, the 1982 standard - for mail messages. In 2001, its definition was tightened up by RFC 2822. - - RFC 822 defined the In-Reply-To header as, basically, a free-text header. - The syntax of it allowed it to contain basically any text at all. The - following is, literally, a legal RFC 822 In-Reply-To header: - - In-Reply-To: thirty-five ham and cheese sandwiches - - So you're not guaranteed to be able to parse anything useful out of - In-Reply-To if it exists, and even if it contains something that looks like - a Message-ID, it might not be (especially since Message-IDs and email - addresses have identical syntax.) - - However, most of the time, In-Reply-To headers do have something useful in - them. Back in 1997, I grepped over a huge number of messages and collected - some damned lies, I mean, statistics, on what kind of In-Reply-To headers - they contained. The results: - - In a survey of 22,950 mail messages with In-Reply-To headers: - - 18,396 had at least one occurrence of <>-bracketed text. - 4,554 had no <>-bracketed text at all (just names and - dates.) - 714 contained one <>-bracketed addr-spec and no message - IDs. - 4 contained multiple message IDs. - 1 contained one message ID and one <>-bracketed - addr-spec. 
- - The most common forms of In-Reply-To seemed to be: - - 31% NAME's message of TIME <ID@HOST> - 22% <ID@HOST> - 9% <ID@HOST> from NAME at "TIME" - 8% USER's message of TIME <ID@HOST> - 7% USER's message of TIME - 6% Your message of "TIME" - 17% hundreds of other variants (average 0.4% each?) - - Of course these numbers are very much dependent on the sample set, which, - in this case, was probably skewed toward Unix users, and/or toward people - who had been on the net for quite some time (due to the age of the archives - I checked.) - - However, this seems to indicate that it's not unreasonable to assume that, - if there is an In-Reply-To field, then the first <>-bracketed text found - therein is the Message-ID of the parent message. It is safe to assume this, - that is, so long as you still exhibit reasonable behavior when that - assumption turns out to be wrong, which will happen a small-but-not- - insignificant portion of the time. - - RFC 2822, the successor to RFC 822, updated the definition of In-Reply-To: - by the more modern standard, In-Reply-To may contain only message IDs. - There will usually be only one, but there could be more than one: these are - the IDs of the messages to which this one is a direct reply (the idea being - that you might be sending one message in reply to several others.) - -References: - - The References header was defined by RFC 822 in 1982. It was defined in, - effectively, the same way as the In-Reply-To header was defined: which is - to say, its definition was pretty useless. (Like In-Reply-To, its - definition was also tightened up in 2001 by RFC 2822.) - - However, the References header was also defined in 1987 by RFC 1036 - (section 2.2.5), the standard for USENET news messages. That definition was - much tighter and more useful than the RFC 822 definition: it asserts that - this header contain a list of Message-IDs listing the parent, grandparent, - great-grandparent, and so on, of this message, oldest first.
That is, the - direct parent of this message will be the last element of the References - header. - - It is not guaranteed to contain the entire tree back to the root-most - message in the thread: news readers are allowed to truncate it at their - discretion, and the manner in which they truncate it (from the front, from - the back, or from the middle) is not defined. - - Therefore, while there is useful info in the References header, it is not - uncommon for multiple messages in the same thread to have seemingly- - contradictory References data, so threading code must make an effort to do - the right thing in the face of conflicting data. - - RFC 2822 updated the mail standard to have the same semantics of References - as the news standard, RFC 1036. - -In practice, if you ever see a References header in a mail message, it will -follow the RFC 1036 (and RFC 2822) definition rather than the RFC 822 -definition. Because the References header both contains more information and is -easier to parse, many modern mail user agents generate and use the References -header in mail instead of (or in addition to) In-Reply-To, and use the USENET -semantics when they do so. - -You will generally not see In-Reply-To in a news message, but it can -occasionally happen, usually as a result of mail/news gateways. - -So, any sensible threading software will have the ability to take both -In-Reply-To and References headers into account. - -Note: RFC 2822 (section 3.6.4) says that a References field should contain the -contents of the parent message's References field, followed by the contents of -the parent's Message-ID field (in other words, the References field should -contain the path through the thread.) However, I've been informed that recent -versions of Eudora violate this standard: they put the parent Message-ID in the -In-Reply-To header, but do not duplicate it in the References header: instead, -the References header contains the grandparent, great-grand-parent, etc. 
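The header handling described above could be sketched roughly as follows, in Python. This is an illustration under stated assumptions, not the original C or Java code: the function name `reference_ids` and the regular expression for "something that looks like a Message-ID" are hypothetical. Skipping an In-Reply-To ID that already appears in References is also an assumption made here to avoid duplicates.

```python
import re

# Assumption: "looks like a Message-ID" means <>-bracketed text with no
# whitespace or nested brackets. This also matches bracketed addr-specs,
# so callers must tolerate an occasional wrong parent.
_BRACKETED = re.compile(r"<[^<>\s]+>")


def reference_ids(references_header, in_reply_to_header):
    """Return the ancestor IDs of one message, oldest first (hypothetical helper)."""
    refs = _BRACKETED.findall(references_header or "")
    candidates = _BRACKETED.findall(in_reply_to_header or "")
    if candidates:
        parent = candidates[0]      # trust only the first bracketed token
        if parent not in refs:      # assumption: do not list an ID twice
            refs.append(parent)     # the direct parent goes last
    return refs
```

With a standards-following sender the In-Reply-To ID is already the last entry of References and nothing changes. With a sender that omits the parent from References, the parent is appended. A free-text In-Reply-To contributes nothing.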
- -This implies that to properly reconstruct the thread of a message in the face -of this nonstandard behavior, we need to append any In-Reply-To message IDs to -References. - - ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ - - The Algorithm - -This algorithm consists of five main steps, and each of those steps is somewhat -complicated. However, once you've wrapped your brain around it, it's not really -that complicated, considering what it does. - -In defense of its complexity, I can say this: - - • This algorithm is incredibly robust in the face of garbage input, and even - in the face of malicious input (you cannot construct a set of inputs that - will send this algorithm into a loop, for example.) - - • This algorithm has been field-tested by something on the order of ten - million users over the course of six years. - - • It really does work incredibly well. I've never seen it produce results - that were anything less than totally reasonable. - -Well, enough with the disclaimers. - - ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ - -Definitions: - - • A Container object is composed of: - - Message message; // (may be null) - Container parent; - Container child; // first child - Container next; // next element in sibling list, or null - - • A Message object only has a few fields we are interested in: - - String subject; - ID message_id; // the ID of this message - ID *references; // list of IDs of parent messages - - The References field is populated from the ``References'' and/or - ``In-Reply-To'' headers. If both headers exist, take the first thing in the - In-Reply-To header that looks like a Message-ID, and append it to the - References header. - - If there are multiple things in In-Reply-To that look like Message-IDs, - only use the first one of them: odds are that the later ones are actually - email addresses, not IDs. - - These ID objects can be strings, or they can be any other token on which - you can do meaningful equality comparisons. 
- - Only two things need to be done with the subject strings: ask whether they - begin with ``Re:'', and compare the non-Re parts for equivalence. So you - can get away with interning or otherwise hashing these, too. (This is a - very good idea: my code does this so that I can use == instead of strcmp - inside the loop.) - - The ID objects also don't need to be strings, for the same reason. They can - be hashes or numeric indexes or anything for which equality comparisons - hold, so it's way faster if you can do pointer-equivalence comparisons - instead of strcmp. - - The reason the Container and Message objects are separate is because the - Container fields are only needed during the act of threading: you don't - need to keep those around, so there's no point in bulking up every Message - structure with them. - - • The id_table is a hash table associating Message-IDs with Containers. - - • An ``empty container'' is one that doesn't have a message in it, but which - shows evidence of having existed. For whatever reason, we don't have that - message in our list (maybe it is expired or canceled, maybe it was deleted - from the folder, or any of several other reasons.) - - At presentation-time, these will show up as unselectable ``parent'' - containers, for example, if we have the thread - - -- A - |-- B - \-- C - -- D - - and we know about messages B and C, but their common parent A does not - exist, there will be a placeholder for A, to group them together, and - prevent D from seeming to be a sibling of B and C. - - These ``dummy'' messages only ever occur at depth 0. - -The Algorithm: - - 1. For each message: - - A. If id_table contains an empty Container for this ID: - ● Store this message in the Container's message slot. - Else: - ● Create a new Container object holding this message; - ● Index the Container by Message-ID in id_table. - - B. 
For each element in the message's References field: - - ● Find a Container object for the given Message-ID: - ● If there's one in id_table use that; - ● Otherwise, make (and index) one with a null Message. - - ● Link the References field's Containers together in the order - implied by the References header. - ● If they are already linked, don't change the existing links. - ● Do not add a link if adding that link would introduce a loop: - that is, before asserting A->B, search down the children of B - to see if A is reachable, and also search down the children of - A to see if B is reachable. If either is already reachable as a - child of the other, don't add the link. - - C. Set the parent of this message to be the last element in References. - Note that this message may have a parent already: this can happen - because we saw this ID in a References field, and presumed a parent - based on the other entries in that field. Now that we have the actual - message, we can be more definitive, so throw away the old parent and - use this new one. Find this Container in the parent's children list, - and unlink it. - - Note that this could cause this message to now have no parent, if it - has no references field, but some message referred to it as the - non-first element of its references. (Which would have been some kind - of lie...) - - Note that at all times, the various ``parent'' and ``child'' fields - must be kept inter-consistent. - - 2. Find the root set. - - Walk over the elements of id_table, and gather a list of the Container - objects that have no parents. - - 3. Discard id_table. We don't need it any more. - - 4. Prune empty containers. - Recursively walk all containers under the root set. - For each container: - A. If it is an empty container with no children, nuke it. - - Note: Normally such containers won't occur, but they can show up when - two messages have References lines that disagree. 
For example, assuming - A and B are messages, and 1, 2, and 3 are references for messages we - haven't seen: - - A has references: 1, 2, 3 - B has references: 1, 3 - - There is ambiguity as to whether 3 is a child of 1 or of 2. So, - depending on the processing order, we might end up with either - - -- 1 - |-- 2 - \-- 3 - |-- A - \-- B - - or - - -- 1 - |-- 2 <--- non root childless container! - \-- 3 - |-- A - \-- B - - B. If the Container has no Message, but does have children, remove this - container but promote its children to this level (that is, splice them - in to the current child list.) - - Do not promote the children if doing so would promote them to the root - set -- unless there is only one child, in which case, do. - - 5. Group root set by subject. - - If any two members of the root set have the same subject, merge them. This - is so that messages which don't have References headers at all still get - threaded (to the extent possible, at least.) - A. Construct a new hash table, subject_table, which associates subject - strings with Container objects. - - B. For each Container in the root set: - - ● Find the subject of that sub-tree: - ● If there is a message in the Container, the subject is the - subject of that message. - ● If there is no message in the Container, then the Container - will have at least one child Container, and that Container will - have a message. Use the subject of that message instead. - ● Strip ``Re:'', ``RE:'', ``RE[5]:'', ``Re: Re[4]: Re:'' and so - on. - ● If the subject is now "", give up on this Container. - ● Add this Container to the subject_table if: - ● There is no container in the table with this subject, or - ● This one is an empty container and the old one is not: the - empty one is more interesting as a root, so put it in the - table instead. - ● The container in the table has a ``Re:'' version of this - subject, and this container has a non-``Re:'' version of - this subject. 
The non-re version is the more interesting of - the two. - - C. Now the subject_table is populated with one entry for each subject - which occurs in the root set. Now iterate over the root set, and gather - together the difference. - - For each Container in the root set: - - ● Find the subject of this Container (as above.) - ● Look up the Container of that subject in the table. - ● If it is null, or if it is this container, continue. - - ● Otherwise, we want to group together this Container and the one in - the table. There are a few possibilities: - - ● If both are dummies, append one's children to the other, and - remove the now-empty container. - - ● If one container is empty and the other is not, make the - non-empty one be a child of the empty, and a sibling of the - other ``real'' messages with the same subject (the empty's - children.) - - ● If that container is non-empty, and that message's subject - does not begin with ``Re:'', but this message's subject does, - then make this be a child of the other. - - ● If that container is non-empty, and that message's subject - begins with ``Re:'', but this message's subject does not, then - make that be a child of this one -- they were misordered. (This - happens somewhat implicitly, since if there are two messages, - one with Re: and one without, the one without will be in the - hash table, regardless of the order in which they were seen.) - - ● Otherwise, make a new empty container and make both messages - children of it. This catches the both-are-replies and - neither-are-replies cases, and makes them be siblings instead - of asserting a hierarchical relationship which might not be - true. - - (People who reply to messages without using ``Re:'' and without - using a References line will break this slightly. Those people - suck.)
- - (It has occurred to me that taking the date or message number into - account would be one way of resolving some of the ambiguous cases, but - that's not altogether straightforward either.) - - 6. Now you're done threading! - Specifically, you no longer need the ``parent'' slot of the Container - object, so if you wanted to flush the data out into a smaller, longer-lived - structure, you could reclaim some storage as a result. - - 7. Now, sort the siblings. - At this point, the parent-child relationships are set. However, the sibling - ordering has not been adjusted, so now is the time to walk the tree one - last time and order the siblings by date, sender, subject, or whatever. - This step could also be merged in to the end of step 4, above, but it's - probably clearer to make it be a final pass. If you were careful, you could - also sort the messages first and take care in the above algorithm to not - perturb the ordering, but that doesn't really save anything. - - ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ - -You might be wondering what Netscape Confusicator 4.0 broke. Well, basically -they never got threading working right. Aside from crashing, corrupting their -database files, and general bugginess, the fundamental problem had been -twofold: - - • 4.0 eliminated the ``dummy thread parent'' step, which is an absolute - necessity to get threading right in the case where you don't have every - message (e.g., because one has expired, or was never sent to you at all.) - The best explanation I was able to get from them for why they did this was, - ``it looked ugly and I didn't understand why it was there.'' - - • 4.0 eliminated the ``group similar unthreaded subjects'' step, which is - necessary to get some semblance of threading right in the absence of - References and In-Reply-To, or in the presence of mangled References. If - there was no References header, 4.0 just didn't thread at all.
- -Plus my pet peeve, - - • The 4.0 UI presented threading as a kind of sorting, which is just not the - case. Threading is the act of presenting parent/child relationships, - whereas sorting is the act of ordering siblings. - - That is, 4.0 gives you these choices: ``Sort by Date; Sort by Subject; Sort - by message number; or Thread.'' Where they assume that ``Thread'' implies - ``Sort by Date.'' So that means that there's no way to see a threaded set - of messages that are sorted by message number, or by sender, etc. - - There should be options for how to sort the messages; and then, orthogonal - to that should be the boolean option of whether the messages should be - threaded. - -I seem to recall there being some other problem that was a result of the thread -hierarchy being stored in the database, instead of computed as needed in -realtime (there was some kind of ordering or stale-data issue that came -up?) but maybe they finally managed to fix that. - -My C version of this code was able to thread 10,000 messages in less than half -a second on a low-end (90 MHz) Pentium, so the argument that it has to be in -the database for efficiency is pure bunk. - -Also bunk is the idea that databases are needed for ``scalability.'' This code -can thread 100,000 messages without a horrible delay, and the fact is, if -you're looking at a 100,000 message folder (or for that matter, if you're -running Confusicator at all), you're doing so on a machine that has sufficient -memory to hold these structures in core. Also consider the question of whether -your GUI toolkit contains a list/outliner widget that can display a million -elements in the first place. (The answer is probably ``no.'') Also consider -whether you have ever in your life seen a single folder that has a million -messages in it, and that further, you've wanted to look at all at once (rather -than only looking at the most recent 100,000 messages to arrive in that -newsgroup...)
- -In short, all the arguments I've heard for using databases to implement -threading and mbox summarization are solving problems that simply don't exist. -Show me a real-world situation where the above technique actually falls down, -and then we'll talk. - -Just say no to databases! - -━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
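For readers who want something executable, steps 1, 2 and 4 of the algorithm described above could look roughly like this in Python. It is a sketch under stated assumptions, not the Grendel code. Each message is reduced to a hypothetical `(message_id, references)` pair, with the ID string standing in for the message itself. IDs are assumed to be unique. A Python list of children replaces the `child`/`next` pointers. Step 5 (subject grouping) and step 7 (sorting) are left out, and all function names are illustrative.

```python
class Container:
    def __init__(self, message=None):
        self.message = message   # None marks an "empty container"
        self.parent = None
        self.children = []       # a list instead of child/next pointers


def _reachable(top, target):
    """True if target is top itself or lies somewhere beneath top."""
    stack = [top]
    while stack:
        c = stack.pop()
        if c is target:
            return True
        stack.extend(c.children)
    return False


def _link(parent, child):
    """Step 1B: add parent -> child unless already linked or it would loop."""
    if child.parent is not None:
        return                   # already linked: keep the existing link
    if _reachable(child, parent) or _reachable(parent, child):
        return                   # adding this link would introduce a loop
    child.parent = parent
    parent.children.append(child)


def _prune(siblings, at_root):
    """Step 4: drop or splice out empty containers; returns the new list."""
    kept = []
    for c in siblings:
        c.children = _prune(c.children, False)
        if c.message is not None:
            kept.append(c)
        elif not c.children:
            continue             # 4A: empty and childless, discard
        elif at_root and len(c.children) > 1:
            kept.append(c)       # stays as a dummy parent in the root set
        else:
            for kid in c.children:   # 4B: promote children to this level
                kid.parent = c.parent
            kept.extend(c.children)
    return kept


def thread(messages):
    """messages: iterable of (message_id, references) pairs, refs oldest first."""
    id_table = {}
    for msg_id, refs in messages:
        c = id_table.get(msg_id)                 # step 1A
        if c is not None and c.message is None:
            c.message = msg_id
        else:
            c = id_table[msg_id] = Container(msg_id)
        prev = None
        for ref in refs:                         # step 1B
            rc = id_table.get(ref)
            if rc is None:
                rc = id_table[ref] = Container()
            if prev is not None:
                _link(prev, rc)
            prev = rc
        if c.parent is not None:                 # step 1C: drop presumed parent
            c.parent.children.remove(c)
            c.parent = None
        if prev is not None and not _reachable(c, prev):
            c.parent = prev
            prev.children.append(c)
    roots = [c for c in id_table.values() if c.parent is None]   # step 2
    return _prune(roots, True)                   # steps 3 and 4
```

Called as `thread([("A", ["1", "2", "3"]), ("B", ["1", "3"])])`, this returns a single empty root container whose children are A and B. That is the grouping the pruning step is meant to produce for the conflicting-References example in step 4.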