Moved doc.

This commit is contained in:
Bill Thiede 2014-04-08 20:06:56 -07:00
parent 7f3674cbd8
commit b1c2bb6e25

@ -1,485 +0,0 @@
message threading.
(C) 1997-2002 Jamie Zawinski <jwz@jwz.org>
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
In this document, I describe what is, in my humble but correct opinion, the
best known algorithm for threading messages (that is, grouping messages
together in parent/child relationships based on which messages are replies to
which others.) This is the threading algorithm that was used in Netscape Mail
and News 2.0 and 3.0, and in Grendel.
Sadly, my C implementation of this algorithm is not available, because it was
purged during the 4.0 rewrite, and Netscape refused to allow me to free the 3.0
source code.
However, my Java implementation is available in the Grendel source. You can
find a descendant of that code on ftp.mozilla.org. Here's the original source
release: grendel-1998-09-05.tar.gz; and a later version, ported to more modern
Java APIs: grendel-1999-05-14.tar.gz. The threading code is in view/
Threader.java. See also IThreadable and TestThreader. (The mailsum code in
storage/MailSummaryFile.java and the MIME parser in the mime/ directory may
also be of interest.)
This is not the algorithm that Netscape 4.x uses, because this is another area
where the 4.0 team screwed the pooch, and instead of just continuing to use the
existing working code, replaced it with something that was bloated, slow,
buggy, and incorrect. But hey, at least it was in C++ and used databases!
This algorithm is also described in the imapext-thread Internet Draft: Mark
Crispin and Kenneth Murchison formalized my description of this algorithm, and
propose it as the THREAD extension to the IMAP protocol (the idea being that
the IMAP server could give you back a list of messages in a pre-threaded state,
so that it wouldn't need to be done on the client side.) If you find my
description of this algorithm confusing, perhaps their restating of it will be
more to your taste.
I'm told this algorithm is also used in the Evolution and Balsa mail readers.
Also, Simon Cozens and Richard Clamp have written a Perl version; Frederik
Dietz has written a Ruby version; and Max Ogden has written a JavaScript
version. (I've not tested any of these implementations, so I make no claims as
to how faithfully they implement it.)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
First some background on the headers involved.
In-Reply-To:
The In-Reply-To header was originally defined by RFC 822, the 1982 standard
for mail messages. In 2001, its definition was tightened up by RFC 2822.
RFC 822 defined the In-Reply-To header as, basically, a free-text header.
The syntax of it allowed it to contain basically any text at all. The
following is, literally, a legal RFC 822 In-Reply-To header:
In-Reply-To: thirty-five ham and cheese sandwiches
So you're not guaranteed to be able to parse anything useful out of
In-Reply-To if it exists, and even if it contains something that looks like
a Message-ID, it might not be (especially since Message-IDs and email
addresses have identical syntax.)
However, most of the time, In-Reply-To headers do have something useful in
them. Back in 1997, I grepped over a huge number of messages and collected
some damned lies, I mean, statistics, on what kind of In-Reply-To headers
they contained. The results:
In a survey of 22,950 mail messages with In-Reply-To headers:
  18,396  had at least one occurrence of <>-bracketed text.
   4,554  had no <>-bracketed text at all (just names and dates.)
     714  contained one <>-bracketed addr-spec and no message IDs.
       4  contained multiple message IDs.
       1  contained one message ID and one <>-bracketed addr-spec.
The most common forms of In-Reply-To seemed to be:
31% NAME's message of TIME <ID@HOST>
22% <ID@HOST>
9% <ID@HOST> from NAME at "TIME"
8% USER's message of TIME <ID@HOST>
7% USER's message of TIME
6% Your message of "TIME"
17% hundreds of other variants (average 0.4% each?)
Of course these numbers are very much dependent on the sample set, which,
in this case, was probably skewed toward Unix users, and/or toward people
who had been on the net for quite some time (due to the age of the archives
I checked.)
However, this seems to indicate that it's not unreasonable to assume that,
if there is an In-Reply-To field, then the first <>-bracketed text found
therein is the Message-ID of the parent message. It is safe to assume this,
that is, so long as you still exhibit reasonable behavior when that
assumption turns out to be wrong, which will happen a small-but-not-
insignificant portion of the time.
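As a rough illustration of that heuristic, here is a minimal Python sketch. The function name first_bracketed and the regular expression are assumptions made for this example, not anything taken from the Netscape or Grendel code. It returns the first <>-bracketed token, or None when the header holds only names and dates; the caller still has to cope with the token turning out to be an address rather than a Message-ID.

```python
import re

# Assumption: a bracketed token is "<", then text with no spaces or
# angle brackets, then ">". Real headers can be messier than this.
_BRACKETED = re.compile(r"<[^<>\s]+>")

def first_bracketed(in_reply_to):
    """Return the first <>-bracketed token in the header text, or None."""
    match = _BRACKETED.search(in_reply_to)
    return match.group(0) if match else None

# One of the common forms from the survey, and the sandwich header.
print(first_bracketed("NAME's message of TIME <ID@HOST>"))
print(first_bracketed("thirty-five ham and cheese sandwiches"))
```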
RFC 2822, the successor to RFC 822, updated the definition of In-Reply-To:
by the more modern standard, In-Reply-To may contain only message IDs.
There will usually be only one, but there could be more than one: these are
the IDs of the messages to which this one is a direct reply (the idea being
that you might be sending one message in reply to several others.)
References:
The References header was defined by RFC 822 in 1982. It was defined in,
effectively, the same way as the In-Reply-To header was defined: which is
to say, its definition was pretty useless. (Like In-Reply-To, its
definition was also tightened up in 2001 by RFC 2822.)
However, the References header was also defined in 1987 by RFC 1036
(section 2.2.5), the standard for USENET news messages. That definition was
much tighter and more useful than the RFC 822 definition: it asserts that
this header contain a list of Message-IDs listing the parent, grandparent,
great-grandparent, and so on, of this message, oldest first. That is, the
direct parent of this message will be the last element of the References
header.
It is not guaranteed to contain the entire tree back to the root-most
message in the thread: news readers are allowed to truncate it at their
discretion, and the manner in which they truncate it (from the front, from
the back, or from the middle) is not defined.
Therefore, while there is useful info in the References header, it is not
uncommon for multiple messages in the same thread to have seemingly-
contradictory References data, so threading code must make an effort to do
the right thing in the face of conflicting data.
RFC 2822 updated the mail standard to have the same semantics of References
as the news standard, RFC 1036.
In practice, if you ever see a References header in a mail message, it will
follow the RFC 1036 (and RFC 2822) definition rather than the RFC 822
definition. Because the References header both contains more information and is
easier to parse, many modern mail user agents generate and use the References
header in mail instead of (or in addition to) In-Reply-To, and use the USENET
semantics when they do so.
You will generally not see In-Reply-To in a news message, but it can
occasionally happen, usually as a result of mail/news gateways.
So, any sensible threading software will have the ability to take both
In-Reply-To and References headers into account.
Note: RFC 2822 (section 3.6.4) says that a References field should contain the
contents of the parent message's References field, followed by the contents of
the parent's Message-ID field (in other words, the References field should
contain the path through the thread.) However, I've been informed that recent
versions of Eudora violate this standard: they put the parent Message-ID in the
In-Reply-To header, but do not duplicate it in the References header: instead,
the References header contains the grandparent, great-grand-parent, etc.
This implies that to properly reconstruct the thread of a message in the face
of this nonstandard behavior, we need to append any In-Reply-To message IDs to
References.
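One way to sketch that merge in Python follows. The name build_references and the choice to skip an In-Reply-To ID that already appears in References are assumptions for illustration; the text above only says to append it.

```python
import re

# Assumption: IDs are whatever sits between angle brackets.
_BRACKETED = re.compile(r"<[^<>\s]+>")

def build_references(references_header, in_reply_to_header):
    """Return the list of parent IDs, oldest first.

    Either argument may be None when the header is absent.
    """
    refs = _BRACKETED.findall(references_header or "")
    # Only the first bracketed token of In-Reply-To is trusted.
    match = _BRACKETED.search(in_reply_to_header or "")
    if match and match.group(0) not in refs:
        refs.append(match.group(0))
    return refs

# The Eudora-style case: the direct parent appears only in In-Reply-To.
print(build_references("<ggp@h> <gp@h>", "<parent@h>"))
```

With a well-behaved sender the In-Reply-To ID is already the last element of References, so the list comes back unchanged.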
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The Algorithm
This algorithm consists of five main steps, and each of those steps is somewhat
complicated. However, once you've wrapped your brain around it, it's not really
that complicated, considering what it does.
In defense of its complexity, I can say this:
• This algorithm is incredibly robust in the face of garbage input, and even
in the face of malicious input (you cannot construct a set of inputs that
will send this algorithm into a loop, for example.)
• This algorithm has been field-tested by something on the order of ten
million users over the course of six years.
• It really does work incredibly well. I've never seen it produce results
that were anything less than totally reasonable.
Well, enough with the disclaimers.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Definitions:
• A Container object is composed of:
    Message message;     // (may be null)
    Container parent;
    Container child;     // first child
    Container next;      // next element in sibling list, or null
• A Message object only has a few fields we are interested in:
    String subject;
    ID message_id;       // the ID of this message
    ID *references;      // list of IDs of parent messages
The References field is populated from the ``References'' and/or
``In-Reply-To'' headers. If both headers exist, take the first thing in the
In-Reply-To header that looks like a Message-ID, and append it to the
References header.
If there are multiple things in In-Reply-To that look like Message-IDs,
only use the first one of them: odds are that the later ones are actually
email addresses, not IDs.
These ID objects can be strings, or they can be any other token on which
you can do meaningful equality comparisons.
Only two things need to be done with the subject strings: ask whether they
begin with ``Re:'', and compare the non-Re parts for equivalence. So you
can get away with interning or otherwise hashing these, too. (This is a
very good idea: my code does this so that I can use == instead of strcmp
inside the loop.)
The ID objects also don't need to be strings, for the same reason. They can
be hashes or numeric indexes or anything for which equality comparisons
hold, so it's way faster if you can do pointer-equivalence comparisons
instead of strcmp.
The reason the Container and Message objects are separate is because the
Container fields are only needed during the act of threading: you don't
need to keep those around, so there's no point in bulking up every Message
structure with them.
• The id_table is a hash table associating Message-IDs with Containers.
• An ``empty container'' is one that doesn't have a message in it, but which
shows evidence of having existed. For whatever reason, we don't have that
message in our list (maybe it is expired or canceled, maybe it was deleted
from the folder, or any of several other reasons.)
At presentation-time, these will show up as unselectable ``parent''
containers, for example, if we have the thread
-- A
   |-- B
   \-- C
-- D
and we know about messages B and C, but their common parent A does not
exist, there will be a placeholder for A, to group them together, and
prevent D from seeming to be a sibling of B and C.
These ``dummy'' messages only ever occur at depth 0.
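The subject handling mentioned in the definitions above (ask whether a subject begins with ``Re:'', then compare the rest) might look something like the following Python sketch. The function name split_subject and the exact regular expression are assumptions for illustration. sys.intern is used so that equal subjects come back as the same object, which is the Python analogue of comparing with == instead of strcmp.

```python
import re
import sys

# Assumption: a reply prefix is "Re:" or "Re[N]:" in any letter case,
# possibly repeated, as in "Re: Re[4]: Re:".
_REPLY_PREFIX = re.compile(r"(?:\s*re(?:\[\d+\])?:\s*)+", re.IGNORECASE)

def split_subject(subject):
    """Return (interned base subject, whether a reply prefix was stripped)."""
    match = _REPLY_PREFIX.match(subject)
    if match:
        return sys.intern(subject[match.end():].strip()), True
    return sys.intern(subject.strip()), False

base, is_reply = split_subject("Re: Re[4]: Re: lunch plans")
print(base, is_reply)
```

A subject that is nothing but prefixes comes back as the empty string, which is the case where the algorithm gives up on grouping that Container by subject.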
The Algorithm:
1. For each message:
A. If id_table contains an empty Container for this ID:
● Store this message in the Container's message slot.
Else:
● Create a new Container object holding this message;
● Index the Container by Message-ID in id_table.
B. For each element in the message's References field:
● Find a Container object for the given Message-ID:
● If there's one in id_table use that;
● Otherwise, make (and index) one with a null Message.
● Link the References field's Containers together in the order
implied by the References header.
● If they are already linked, don't change the existing links.
● Do not add a link if adding that link would introduce a loop:
that is, before asserting A->B, search down the children of B
to see if A is reachable, and also search down the children of
A to see if B is reachable. If either is already reachable as a
child of the other, don't add the link.
C. Set the parent of this message to be the last element in References.
Note that this message may have a parent already: this can happen
because we saw this ID in a References field, and presumed a parent
based on the other entries in that field. Now that we have the actual
message, we can be more definitive, so throw away the old parent and
use this new one. Find this Container in the parent's children list,
and unlink it.
Note that this could cause this message to now have no parent, if it
has no references field, but some message referred to it as the
non-first element of its references. (Which would have been some kind
of lie...)
Note that at all times, the various ``parent'' and ``child'' fields
must be kept inter-consistent.
2. Find the root set.
Walk over the elements of id_table, and gather a list of the Container
objects that have no parents.
3. Discard id_table. We don't need it any more.
4. Prune empty containers.
Recursively walk all containers under the root set.
For each container:
A. If it is an empty container with no children, nuke it.
Note: Normally such containers won't occur, but they can show up when
two messages have References lines that disagree. For example, assuming
A and B are messages, and 1, 2, and 3 are references for messages we
haven't seen:
A has references: 1, 2, 3
B has references: 1, 3
There is ambiguity as to whether 3 is a child of 1 or of 2. So,
depending on the processing order, we might end up with either
-- 1
   |-- 2
       \-- 3
           |-- A
           \-- B
or
-- 1
   |-- 2            <--- non root childless container!
   \-- 3
       |-- A
       \-- B
B. If the Container has no Message, but does have children, remove this
container but promote its children to this level (that is, splice them
in to the current child list.)
Do not promote the children if doing so would promote them to the root
set -- unless there is only one child, in which case, do.
5. Group root set by subject.
If any two members of the root set have the same subject, merge them. This
is so that messages which don't have References headers at all still get
threaded (to the extent possible, at least.)
A. Construct a new hash table, subject_table, which associates subject
strings with Container objects.
B. For each Container in the root set:
● Find the subject of that sub-tree:
● If there is a message in the Container, the subject is the
subject of that message.
● If there is no message in the Container, then the Container
will have at least one child Container, and that Container will
have a message. Use the subject of that message instead.
● Strip ``Re:'', ``RE:'', ``RE[5]:'', ``Re: Re[4]: Re:'' and so
on.
● If the subject is now "", give up on this Container.
● Add this Container to the subject_table if:
● There is no container in the table with this subject, or
● This one is an empty container and the old one is not: the
empty one is more interesting as a root, so put it in the
table instead.
● The container in the table has a ``Re:'' version of this
subject, and this container has a non-``Re:'' version of
this subject. The non-re version is the more interesting of
the two.
C. Now the subject_table is populated with one entry for each subject
which occurs in the root set. Now iterate over the root set again, and
gather together the Containers that share a subject.
For each Container in the root set:
● Find the subject of this Container (as above.)
● Look up the Container of that subject in the table.
● If it is null, or if it is this container, continue.
● Otherwise, we want to group together this Container and the one in
the table. There are a few possibilities:
● If both are dummies, append one's children to the other, and
remove the now-empty container.
● If one container is empty and the other is not, make the
non-empty one be a child of the empty, and a sibling of the
other ``real'' messages with the same subject (the empty's
children.)
● If that container is a non-empty, and that message's subject
does not begin with ``Re:'', but this message's subject does,
then make this be a child of the other.
● If that container is a non-empty, and that message's subject
begins with ``Re:'', but this message's subject does not, then
make that be a child of this one -- they were misordered. (This
happens somewhat implicitly, since if there are two messages,
one with Re: and one without, the one without will be in the
hash table, regardless of the order in which they were seen.)
● Otherwise, make a new empty container and make both msgs be a
child of it. This catches the both-are-replies and
neither-are-replies cases, and makes them be siblings instead
of asserting a hierarchical relationship which might not be
true.
(People who reply to messages without using ``Re:'' and without
using a References line will break this slightly. Those people
suck.)
(It has occurred to me that taking the date or message number into
account would be one way of resolving some of the ambiguous cases, but
that's not altogether straightforward either.)
6. Now you're done threading!
Specifically, you no longer need the ``parent'' slot of the Container
object, so if you wanted to flush the data out into a smaller, longer-lived
structure, you could reclaim some storage as a result.
7. Now, sort the siblings.
At this point, the parent-child relationships are set. However, the sibling
ordering has not been adjusted, so now is the time to walk the tree one
last time and order the siblings by date, sender, subject, or whatever.
This step could also be merged in to the end of step 4, above, but it's
probably clearer to make it be a final pass. If you were careful, you could
also sort the messages first and take care in the above algorithm to not
perturb the ordering, but that doesn't really save anything.
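To make steps 1, 2, and 4 concrete, here is a minimal Python sketch. It is not the Netscape or Grendel code. The class and function names are assumptions for illustration, a Python list of children stands in for the child/next pointers, and duplicate Message-IDs are not handled. Step 5 (subject grouping) and step 7 (sorting) are left out.

```python
class Message:
    def __init__(self, message_id, references=()):
        self.message_id = message_id
        self.references = list(references)  # parent IDs, oldest first

class Container:
    def __init__(self, message=None):
        self.message = message   # None marks an empty container
        self.parent = None
        self.children = []       # stands in for the child/next pointers

def reachable(top, target):
    """True if target is top itself or anywhere beneath it."""
    stack = [top]
    while stack:
        node = stack.pop()
        if node is target:
            return True
        stack.extend(node.children)
    return False

def link(parent, child):
    parent.children.append(child)
    child.parent = parent

def unlink(child):
    if child.parent is not None:
        child.parent.children.remove(child)
        child.parent = None

def prune(siblings, at_root):
    """Step 4: return the sibling list with empty containers removed."""
    kept = []
    for c in siblings:
        c.children = prune(c.children, at_root=False)
        if c.message is not None:
            kept.append(c)
        elif not c.children:
            continue                        # 4A: empty and childless
        elif at_root and len(c.children) > 1:
            kept.append(c)                  # keep the dummy root
        else:
            for kid in c.children:          # 4B: promote the children
                kid.parent = c.parent
            kept.extend(c.children)
    return kept

def thread(messages):
    id_table = {}
    for msg in messages:
        # 1A: fill an empty container left by an earlier reference,
        # otherwise create and index a new one.
        this = id_table.get(msg.message_id)
        if this is not None and this.message is None:
            this.message = msg
        else:
            this = id_table[msg.message_id] = Container(msg)
        # 1B: link the referenced containers, oldest first. A container
        # that already has a parent is left alone, so only one direction
        # of the loop check remains to be searched.
        prev = None
        for ref_id in msg.references:
            ref = id_table.get(ref_id)
            if ref is None:
                ref = id_table[ref_id] = Container()
            if (prev is not None and ref.parent is None
                    and not reachable(ref, prev)):
                link(prev, ref)
            prev = ref
        # 1C: this message's own References override any presumed parent.
        unlink(this)
        if prev is not None and not reachable(this, prev):
            link(prev, this)
    # Step 2: the root set is every container without a parent.
    roots = [c for c in id_table.values() if c.parent is None]
    return prune(roots, at_root=True)

# The disagreeing-References example from step 4.
roots = thread([Message("A", ["1", "2", "3"]), Message("B", ["1", "3"])])
print(len(roots), [c.message.message_id for c in roots[0].children])
```

Either processing order of A and B ends with one empty root holding both messages, because the missing parents 2 and 3 are pruned away and only the root-level dummy is kept.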
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
You might be wondering what Netscape Confusicator 4.0 broke. Well, basically
they never got threading working right. Aside from crashing, corrupting their
database files, and general bugginess, the fundamental problem had been
twofold:
• 4.0 eliminated the ``dummy thread parent'' step, which is an absolute
necessity to get threading right in the case where you don't have every
message (e.g., because one has expired, or was never sent to you at all.)
The best explanation I was able to get from them for why they did this was,
``it looked ugly and I didn't understand why it was there.''
• 4.0 eliminated the ``group similar unthreaded subjects'' step, which is
necessary to get some semblance of threading right in the absence of
References and In-Reply-To, or in the presence of mangled References. If
there was no References header, 4.0 just didn't thread at all.
Plus my pet peeve,
• The 4.0 UI presented threading as a kind of sorting, which is just not the
case. Threading is the act of presenting parent/child relationships,
whereas sorting is the act of ordering siblings.
That is, 4.0 gives you these choices: ``Sort by Date; Sort by Subject; Sort
by message number; or Thread.'' Where they assume that ``Thread'' implies
``Sort by Date.'' So that means that there's no way to see a threaded set
of messages that are sorted by message number, or by sender, etc.
There should be options for how to sort the messages; and then, orthogonal
to that should be the boolean option of whether the messages should be
threaded.
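The separation argued for above can be sketched in a few lines of Python: the sort key is one parameter, and the tree shape is left entirely alone. The Node class and the function names here are assumptions for illustration, not anything from the 3.0 or 4.0 code.

```python
class Node:
    def __init__(self, sender, date, children=()):
        self.sender = sender
        self.date = date
        self.children = list(children)

def sort_siblings(siblings, key):
    """Order each sibling list by key; parent/child links are untouched."""
    siblings.sort(key=key)
    for node in siblings:
        sort_siblings(node.children, key)

def outline(siblings, depth=0):
    """Flatten the tree into indented lines for display."""
    lines = []
    for node in siblings:
        lines.append("  " * depth + node.sender)
        lines.extend(outline(node.children, depth + 1))
    return lines

roots = [Node("carol", 3, [Node("bob", 5), Node("alice", 4)]),
         Node("dave", 1)]
sort_siblings(roots, key=lambda n: n.date)     # threaded, by date
print(outline(roots))
sort_siblings(roots, key=lambda n: n.sender)   # same threads, by sender
print(outline(roots))
```

Turning threading off would just mean handing the same sort a flat list.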
I seem to recall there being some other problem that was a result of the thread
hierarchy being stored in the database, instead of computed as needed in
realtime (there was some kind of ordering or stale-data issue that came
up?) but maybe they finally managed to fix that.
My C version of this code was able to thread 10,000 messages in less than half
a second on a low-end (90 MHz) Pentium, so the argument that it has to be in
the database for efficiency is pure bunk.
Also bunk is the idea that databases are needed for ``scalability.'' This code
can thread 100,000 messages without a horrible delay, and the fact is, if
you're looking at a 100,000 message folder (or for that matter, if you're
running Confusicator at all), you're doing so on a machine that has sufficient
memory to hold these structures in core. Also consider the question of whether
your GUI toolkit contains a list/outliner widget that can display a million
elements in the first place. (The answer is probably ``no.'') Also consider
whether you have ever in your life seen a single folder that has a million
messages in it, and that further, you've wanted to look at all at once (rather
than only looking at the most recent 100,000 messages to arrive in that
newsgroup...)
In short, all the arguments I've heard for using databases to implement
threading and mbox summarization are solving problems that simply don't exist.
Show me a real-world situation where the above technique actually falls down,
and then we'll talk.
Just say no to databases!
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━