Moved doc.
parent
7f3674cbd8
commit
b1c2bb6e25
@ -1,485 +0,0 @@
message threading.
(C) 1997-2002 Jamie Zawinski <jwz@jwz.org>

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In this document, I describe what is, in my humble but correct opinion, the
best known algorithm for threading messages (that is, grouping messages
together in parent/child relationships based on which messages are replies to
which others.) This is the threading algorithm that was used in Netscape Mail
and News 2.0 and 3.0, and in Grendel.

Sadly, my C implementation of this algorithm is not available, because it was
purged during the 4.0 rewrite, and Netscape refused to allow me to free the 3.0
source code.

However, my Java implementation is available in the Grendel source. You can
find a descendant of that code on ftp.mozilla.org. Here's the original source
release: grendel-1998-09-05.tar.gz; and a later version, ported to more modern
Java APIs: grendel-1999-05-14.tar.gz. The threading code is in
view/Threader.java. See also IThreadable and TestThreader. (The mailsum code
in storage/MailSummaryFile.java and the MIME parser in the mime/ directory may
also be of interest.)

This is not the algorithm that Netscape 4.x uses, because this is another area
where the 4.0 team screwed the pooch, and instead of just continuing to use the
existing working code, replaced it with something that was bloated, slow,
buggy, and incorrect. But hey, at least it was in C++ and used databases!

This algorithm is also described in the imapext-thread Internet Draft: Mark
Crispin and Kenneth Murchison formalized my description of this algorithm, and
propose it as the THREAD extension to the IMAP protocol (the idea being that
the IMAP server could give you back a list of messages in a pre-threaded state,
so that it wouldn't need to be done on the client side.) If you find my
description of this algorithm confusing, perhaps their restating of it will be
more to your taste.

I'm told this algorithm is also used in the Evolution and Balsa mail readers.
Also, Simon Cozens and Richard Clamp have written a Perl version; Frederik
Dietz has written a Ruby version; and Max Ogden has written a JavaScript
version. (I've not tested any of these implementations, so I make no claims as
to how faithfully they implement it.)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

First some background on the headers involved.

In-Reply-To:

    The In-Reply-To header was originally defined by RFC 822, the 1982 standard
    for mail messages. In 2001, its definition was tightened up by RFC 2822.

    RFC 822 defined the In-Reply-To header as, basically, a free-text header.
    The syntax of it allowed it to contain basically any text at all. The
    following is, literally, a legal RFC 822 In-Reply-To header:

        In-Reply-To: thirty-five ham and cheese sandwiches

    So you're not guaranteed to be able to parse anything useful out of
    In-Reply-To if it exists, and even if it contains something that looks like
    a Message-ID, it might not be (especially since Message-IDs and email
    addresses have identical syntax.)

    However, most of the time, In-Reply-To headers do have something useful in
    them. Back in 1997, I grepped over a huge number of messages and collected
    some damned lies, I mean, statistics, on what kind of In-Reply-To headers
    they contained. The results:

        In a survey of 22,950 mail messages with In-Reply-To headers:

            18,396  had at least one occurrence of <>-bracketed text.
             4,554  had no <>-bracketed text at all (just names and dates.)
               714  contained one <>-bracketed addr-spec and no message IDs.
                 4  contained multiple message IDs.
                 1  contained one message ID and one <>-bracketed addr-spec.

        The most common forms of In-Reply-To seemed to be:

            31%  NAME's message of TIME <ID@HOST>
            22%  <ID@HOST>
             9%  <ID@HOST> from NAME at "TIME"
             8%  USER's message of TIME <ID@HOST>
             7%  USER's message of TIME
             6%  Your message of "TIME"
            17%  hundreds of other variants (average 0.4% each?)

    Of course these numbers are very much dependent on the sample set, which,
    in this case, was probably skewed toward Unix users, and/or toward people
    who had been on the net for quite some time (due to the age of the archives
    I checked.)

    However, this seems to indicate that it's not unreasonable to assume that,
    if there is an In-Reply-To field, then the first <>-bracketed text found
    therein is the Message-ID of the parent message. It is safe to assume this,
    that is, so long as you still exhibit reasonable behavior when that
    assumption turns out to be wrong, which will happen a
    small-but-not-insignificant portion of the time.
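
The "first <>-bracketed text" heuristic above could be sketched in Python
roughly as follows. The function name and the regular expression are
illustrative assumptions (the regex is far looser than the RFC grammar), and
the result may still turn out to be an email address rather than a Message-ID,
so callers have to tolerate a wrong guess.

```python
import re

# Assumption: any run of non-space, non-bracket characters between "<" and ">"
# counts as a candidate Message-ID. This is deliberately much looser than the
# RFC grammar, since real-world headers are messy.
_BRACKETED = re.compile(r"<[^<>\s]+>")


def first_bracketed(in_reply_to):
    """Return the first <>-bracketed token in an In-Reply-To value, or None."""
    match = _BRACKETED.search(in_reply_to)
    return match.group(0) if match else None
```

For a free-text header with no brackets at all, such as the ham and cheese
example, the function returns None and the message is simply treated as having
no known parent.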

    RFC 2822, the successor to RFC 822, updated the definition of In-Reply-To:
    by the more modern standard, In-Reply-To may contain only message IDs.
    There will usually be only one, but there could be more than one: these are
    the IDs of the messages to which this one is a direct reply (the idea being
    that you might be sending one message in reply to several others.)

References:

    The References header was defined by RFC 822 in 1982. It was defined in,
    effectively, the same way as the In-Reply-To header was defined: which is
    to say, its definition was pretty useless. (Like In-Reply-To, its
    definition was also tightened up in 2001 by RFC 2822.)

    However, the References header was also defined in 1987 by RFC 1036
    (section 2.2.5), the standard for USENET news messages. That definition was
    much tighter and more useful than the RFC 822 definition: it specifies that
    this header contains a list of Message-IDs listing the parent, grandparent,
    great-grandparent, and so on, of this message, oldest first. That is, the
    direct parent of this message will be the last element of the References
    header.

    It is not guaranteed to contain the entire tree back to the root-most
    message in the thread: news readers are allowed to truncate it at their
    discretion, and the manner in which they truncate it (from the front, from
    the back, or from the middle) is not defined.

    Therefore, while there is useful info in the References header, it is not
    uncommon for multiple messages in the same thread to have
    seemingly-contradictory References data, so threading code must make an
    effort to do the right thing in the face of conflicting data.

    RFC 2822 updated the mail standard to have the same semantics of References
    as the news standard, RFC 1036.

In practice, if you ever see a References header in a mail message, it will
follow the RFC 1036 (and RFC 2822) definition rather than the RFC 822
definition. Because the References header both contains more information and
is easier to parse, many modern mail user agents generate and use the
References header in mail instead of (or in addition to) In-Reply-To, and use
the USENET semantics when they do so.

You will generally not see In-Reply-To in a news message, but it can
occasionally happen, usually as a result of mail/news gateways.

So, any sensible threading software will have the ability to take both
In-Reply-To and References headers into account.

Note: RFC 2822 (section 3.6.4) says that a References field should contain the
contents of the parent message's References field, followed by the contents of
the parent's Message-ID field (in other words, the References field should
contain the path through the thread.) However, I've been informed that recent
versions of Eudora violate this standard: they put the parent Message-ID in the
In-Reply-To header, but do not duplicate it in the References header: instead,
the References header contains the grandparent, great-grandparent, etc.

This implies that to properly reconstruct the thread of a message in the face
of this nonstandard behavior, we need to append any In-Reply-To message IDs to
References.
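
One way to combine the two headers into a single oldest-first list is sketched
below in Python. The function name, the loose regular expression, and the
choice to skip an In-Reply-To ID that is already present in References are all
assumptions made for illustration; the text above only says that the ID should
be appended.

```python
import re

# Assumption: same loose notion of "looks like a Message-ID" as any
# <>-bracketed token without whitespace.
_BRACKETED = re.compile(r"<[^<>\s]+>")


def reference_ids(references, in_reply_to):
    """Build the oldest-first list of ancestor IDs from the two raw headers.

    Either argument may be None or empty when the header is absent.
    """
    ids = _BRACKETED.findall(references or "")
    # Only the first bracketed token of In-Reply-To is trusted.
    match = _BRACKETED.search(in_reply_to or "")
    # Assumption: do not append the ID again if References already lists it,
    # which is the normal case for standards-conforming mailers.
    if match and match.group(0) not in ids:
        ids.append(match.group(0))
    return ids
```

With the Eudora-style headers described above, the parent ID from In-Reply-To
ends up as the last element, which is where the direct parent is expected.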

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The Algorithm

This algorithm consists of five main steps, and each of those steps is somewhat
complicated. However, once you've wrapped your brain around it, it's not really
that complicated, considering what it does.

In defense of its complexity, I can say this:

  • This algorithm is incredibly robust in the face of garbage input, and even
    in the face of malicious input (you cannot construct a set of inputs that
    will send this algorithm into a loop, for example.)

  • This algorithm has been field-tested by something on the order of ten
    million users over the course of six years.

  • It really does work incredibly well. I've never seen it produce results
    that were anything less than totally reasonable.

Well, enough with the disclaimers.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Definitions:

  • A Container object is composed of:

        Message message;     // (may be null)
        Container parent;
        Container child;     // first child
        Container next;      // next element in sibling list, or null

  • A Message object only has a few fields we are interested in:

        String subject;
        ID message_id;       // the ID of this message
        ID *references;      // list of IDs of parent messages

    The References field is populated from the ``References'' and/or
    ``In-Reply-To'' headers. If both headers exist, take the first thing in the
    In-Reply-To header that looks like a Message-ID, and append it to the
    References header.

    If there are multiple things in In-Reply-To that look like Message-IDs,
    only use the first one of them: odds are that the later ones are actually
    email addresses, not IDs.

    These ID objects can be strings, or they can be any other token on which
    you can do meaningful equality comparisons.

    Only two things need to be done with the subject strings: ask whether they
    begin with ``Re:'', and compare the non-Re parts for equivalence. So you
    can get away with interning or otherwise hashing these, too. (This is a
    very good idea: my code does this so that I can use == instead of strcmp
    inside the loop.)

    The ID objects also don't need to be strings, for the same reason. They can
    be hashes or numeric indexes or anything for which equality comparisons
    hold, so it's way faster if you can do pointer-equivalence comparisons
    instead of strcmp.

    The reason the Container and Message objects are separate is that the
    Container fields are only needed during the act of threading: you don't
    need to keep those around, so there's no point in bulking up every Message
    structure with them.

  • The id_table is a hash table associating Message-IDs with Containers.

  • An ``empty container'' is one that doesn't have a message in it, but which
    shows evidence of having existed. For whatever reason, we don't have that
    message in our list (maybe it is expired or canceled, maybe it was deleted
    from the folder, or any of several other reasons.)

    At presentation-time, these will show up as unselectable ``parent''
    containers, for example, if we have the thread

        -- A
           |-- B
           \-- C
        -- D

    and we know about messages B and C, but their common parent A does not
    exist, there will be a placeholder for A, to group them together, and
    prevent D from seeming to be a sibling of B and C.

    These ``dummy'' messages only ever occur at depth 0.
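
The definitions above might be written down in Python along these lines. This
is only a sketch under stated assumptions: it swaps the child/next sibling
pointers for a plain list of children, and the method names (is_empty,
add_child, remove_child) are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Message:
    subject: str
    message_id: str
    references: List[str] = field(default_factory=list)  # oldest first


@dataclass(eq=False)  # Containers are compared by identity, not by contents
class Container:
    message: Optional[Message] = None        # None means "empty container"
    parent: Optional["Container"] = None
    # Assumption: a Python list stands in for the child/next linked list.
    children: List["Container"] = field(default_factory=list)

    def is_empty(self) -> bool:
        return self.message is None

    def add_child(self, child: "Container") -> None:
        """Attach child here, detaching it from any previous parent first."""
        if child.parent is not None:
            child.parent.remove_child(child)
        child.parent = self
        self.children.append(child)

    def remove_child(self, child: "Container") -> None:
        """Detach child, keeping the parent and child fields consistent."""
        self.children.remove(child)
        child.parent = None
```

Routing every link change through add_child and remove_child is one way to
keep the parent and child fields consistent with each other, and identity
comparison avoids recursing through the parent/child cycle when two Containers
are compared.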

The Algorithm:

 1. For each message:

     A. If id_table contains an empty Container for this ID:
          ● Store this message in the Container's message slot.
        Else:
          ● Create a new Container object holding this message;
          ● Index the Container by Message-ID in id_table.

     B. For each element in the message's References field:

          ● Find a Container object for the given Message-ID:
              ● If there's one in id_table use that;
              ● Otherwise, make (and index) one with a null Message.

          ● Link the References field's Containers together in the order
            implied by the References header.
              ● If they are already linked, don't change the existing links.
              ● Do not add a link if adding that link would introduce a loop:
                that is, before asserting A->B, search down the children of B
                to see if A is reachable, and also search down the children of
                A to see if B is reachable. If either is already reachable as a
                child of the other, don't add the link.

     C. Set the parent of this message to be the last element in References.
        Note that this message may have a parent already: this can happen
        because we saw this ID in a References field, and presumed a parent
        based on the other entries in that field. Now that we have the actual
        message, we can be more definitive, so throw away the old parent and
        use this new one. Find this Container in the parent's children list,
        and unlink it.

        Note that this could cause this message to now have no parent, if it
        has no references field, but some message referred to it as the
        non-first element of its references. (Which would have been some kind
        of lie...)

        Note that at all times, the various ``parent'' and ``child'' fields
        must be kept inter-consistent.
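
Step 1 can be sketched in Python as below. This is an illustration under
assumptions, not the Grendel code: children are kept in a list, a duplicate
Message-ID simply replaces the earlier table entry, and the loop check from
step 1B is also applied in step 1C. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Message:
    subject: str
    message_id: str
    references: List[str] = field(default_factory=list)  # oldest first


@dataclass(eq=False)  # compare Containers by identity
class Container:
    message: Optional[Message] = None
    parent: Optional["Container"] = None
    children: List["Container"] = field(default_factory=list)


def reachable(root: Container, target: Container) -> bool:
    """True if target is root itself or any descendant of root."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node is target:
            return True
        stack.extend(node.children)
    return False


def unlink(child: Container) -> None:
    """Detach child from its parent, keeping both sides consistent."""
    if child.parent is not None:
        child.parent.children.remove(child)
        child.parent = None


def link(parent: Container, child: Container) -> None:
    parent.children.append(child)
    child.parent = parent


def build_id_table(messages: List[Message]) -> Dict[str, Container]:
    id_table: Dict[str, Container] = {}
    for msg in messages:
        # 1A: fill an existing empty Container, or make and index a new one.
        container = id_table.get(msg.message_id)
        if container is not None and container.message is None:
            container.message = msg
        else:
            # Assumption: a duplicate Message-ID replaces the table entry.
            container = Container(message=msg)
            id_table[msg.message_id] = container

        # 1B: link the referenced Containers in the order given, without
        # changing existing links and without ever creating a loop.
        previous = None
        for ref_id in msg.references:
            ref = id_table.get(ref_id)
            if ref is None:
                ref = id_table[ref_id] = Container()
            if (previous is not None and ref.parent is None
                    and not reachable(ref, previous)
                    and not reachable(previous, ref)):
                link(previous, ref)
            previous = ref

        # 1C: the last reference is the real parent; drop any presumed one.
        unlink(container)
        if previous is not None and not reachable(container, previous):
            link(previous, container)
    return id_table
```

Two messages that each claim the other as parent illustrate the loop check:
whichever link is asserted first wins, and the second is silently dropped.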

 2. Find the root set.

    Walk over the elements of id_table, and gather a list of the Container
    objects that have no parents.

 3. Discard id_table. We don't need it any more.

 4. Prune empty containers.
    Recursively walk all containers under the root set.
    For each container:

     A. If it is an empty container with no children, nuke it.

        Note: Normally such containers won't occur, but they can show up when
        two messages have References lines that disagree. For example, assuming
        A and B are messages, and 1, 2, and 3 are references for messages we
        haven't seen:

            A has references: 1, 2, 3
            B has references: 1, 3

        There is ambiguity as to whether 3 is a child of 1 or of 2. So,
        depending on the processing order, we might end up with either

            -- 1
               |-- 2
                   \-- 3
                       |-- A
                       \-- B

        or

            -- 1
               |-- 2            <--- non-root childless container!
               \-- 3
                   |-- A
                   \-- B

     B. If the Container has no Message, but does have children, remove this
        container but promote its children to this level (that is, splice them
        in to the current child list.)

        Do not promote the children if doing so would promote them to the root
        set -- unless there is only one child, in which case, do.
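
The pruning pass of step 4 could look roughly like this in Python. The
stand-in Container class, the function names, and the recursive formulation
are assumptions made for the sketch; a very deep thread would need an
iterative version to stay within Python's recursion limit.

```python
class Container:
    """Hypothetical stand-in for the Container described above."""

    def __init__(self, message=None):
        self.message = message
        self.parent = None
        self.children = []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def prune(container, at_root):
    """Return the containers that should take the place of `container`."""
    kept = []
    for child in container.children:
        kept.extend(prune(child, at_root=False))
    for child in kept:
        child.parent = container
    container.children = kept

    if container.message is not None:
        return [container]        # real messages always stay
    if not kept:
        return []                 # 4A: empty and childless, so drop it
    if not at_root or len(kept) == 1:
        return kept               # 4B: promote children (caller re-parents)
    return [container]            # empty root with several children stays


def prune_root_set(roots):
    new_roots = []
    for root in roots:
        for survivor in prune(root, at_root=True):
            survivor.parent = None
            new_roots.append(survivor)
    return new_roots
```

Because every empty container below the root is either dropped or replaced by
its children, the only empty containers left afterwards sit at depth 0.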

 5. Group root set by subject.

    If any two members of the root set have the same subject, merge them. This
    is so that messages which don't have References headers at all still get
    threaded (to the extent possible, at least.)

     A. Construct a new hash table, subject_table, which associates subject
        strings with Container objects.

     B. For each Container in the root set:

          ● Find the subject of that sub-tree:
              ● If there is a message in the Container, the subject is the
                subject of that message.
              ● If there is no message in the Container, then the Container
                will have at least one child Container, and that Container will
                have a message. Use the subject of that message instead.
              ● Strip ``Re:'', ``RE:'', ``RE[5]:'', ``Re: Re[4]: Re:'' and so
                on.
              ● If the subject is now "", give up on this Container.
              ● Add this Container to the subject_table if:
                  ● There is no container in the table with this subject, or
                  ● This one is an empty container and the old one is not: the
                    empty one is more interesting as a root, so put it in the
                    table instead, or
                  ● The container in the table has a ``Re:'' version of this
                    subject, and this container has a non-``Re:'' version of
                    this subject. The non-``Re:'' version is the more
                    interesting of the two.

     C. Now the subject_table is populated with one entry for each subject
        which occurs in the root set. Now iterate over the root set again, and
        gather together the containers that share a subject.

        For each Container in the root set:

          ● Find the subject of this Container (as above.)
          ● Look up the Container of that subject in the table.
          ● If it is null, or if it is this container, continue.

          ● Otherwise, we want to group together this Container and the one in
            the table. There are a few possibilities:

              ● If both are dummies, append one's children to the other, and
                remove the now-empty container.

              ● If one container is empty and the other is not, make the
                non-empty one be a child of the empty, and a sibling of the
                other ``real'' messages with the same subject (the empty's
                children.)

              ● If that container is non-empty, and that message's subject
                does not begin with ``Re:'', but this message's subject does,
                then make this be a child of the other.

              ● If that container is non-empty, and that message's subject
                begins with ``Re:'', but this message's subject does not, then
                make that be a child of this one -- they were misordered. (This
                happens somewhat implicitly, since if there are two messages,
                one with Re: and one without, the one without will be in the
                hash table, regardless of the order in which they were seen.)

              ● Otherwise, make a new empty container and make both messages
                be children of it. This catches the both-are-replies and
                neither-are-replies cases, and makes them be siblings instead
                of asserting a hierarchical relationship which might not be
                true.

                (People who reply to messages without using ``Re:'' and without
                using a References line will break this slightly. Those people
                suck.)

        (It has occurred to me that taking the date or message number into
        account would be one way of resolving some of the ambiguous cases, but
        that's not altogether straightforward either.)
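
The subject handling in step 5 needs only two answers per subject: whether it
was a reply, and what remains once the reply markers are removed. A minimal
Python sketch follows, assuming a reply marker is ``Re'' in any case,
optionally followed by a bracketed count, then a colon; the function name and
the regular expression are illustrative.

```python
import re

# Assumption: matches one leading marker such as "Re:", "RE:" or "Re[4]:".
_RE_PREFIX = re.compile(r"^\s*re(\[\d+\])?:\s*", re.IGNORECASE)


def split_subject(subject):
    """Return (base_subject, is_reply) with all leading markers stripped."""
    base = subject
    is_reply = False
    while True:
        stripped = _RE_PREFIX.sub("", base, count=1)
        if stripped == base:      # nothing left to strip
            break
        base = stripped
        is_reply = True
    return base.strip(), is_reply
```

The base subject is what would be used as the subject_table key, and the
is_reply flag is what decides which of two same-subject containers becomes the
parent. An empty base subject means the Container is skipped.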

 6. Now you're done threading!
    Specifically, you no longer need the ``parent'' slot of the Container
    object, so if you wanted to flush the data out into a smaller, longer-lived
    structure, you could reclaim some storage as a result.

 7. Now, sort the siblings.
    At this point, the parent-child relationships are set. However, the sibling
    ordering has not been adjusted, so now is the time to walk the tree one
    last time and order the siblings by date, sender, subject, or whatever.
    This step could also be merged in to the end of step 4, above, but it's
    probably clearer to make it be a final pass. If you were careful, you could
    also sort the messages first and take care in the above algorithm to not
    perturb the ordering, but that doesn't really save anything.
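
The final sorting pass can be sketched in Python with a pluggable key, which
keeps the sort order independent of the threading itself. The stand-in
Container class, the dict-shaped message with a "date" field, and the rule
that an empty container borrows the earliest date among its children are all
assumptions for the example.

```python
class Container:
    """Hypothetical stand-in: only the fields this pass needs."""

    def __init__(self, message=None, children=None):
        self.message = message            # assumed: a dict with a "date" key
        self.children = children or []


def by_date(container):
    if container.message is not None:
        return container.message["date"]
    # Assumption: an empty container sorts by its earliest child. After
    # pruning, an empty container always has at least one child.
    return min(by_date(child) for child in container.children)


def sort_siblings(siblings, key=by_date):
    """Order one sibling list in place, then recurse into each subtree."""
    siblings.sort(key=key)
    for container in siblings:
        sort_siblings(container.children, key)
```

Passing a different key function (sender, subject, message number) changes the
sibling order without touching the parent/child structure.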

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

You might be wondering what Netscape Confusicator 4.0 broke. Well, basically
they never got threading working right. Aside from crashing, corrupting their
database files, and general bugginess, the fundamental problem had been
twofold:

  • 4.0 eliminated the ``dummy thread parent'' step, which is an absolute
    necessity to get threading right in the case where you don't have every
    message (e.g., because one has expired, or was never sent to you at all.)
    The best explanation I was able to get from them for why they did this was,
    ``it looked ugly and I didn't understand why it was there.''

  • 4.0 eliminated the ``group similar unthreaded subjects'' step, which is
    necessary to get some semblance of threading right in the absence of
    References and In-Reply-To, or in the presence of mangled References. If
    there was no References header, 4.0 just didn't thread at all.

Plus my pet peeve,

  • The 4.0 UI presented threading as a kind of sorting, which is just not the
    case. Threading is the act of presenting parent/child relationships,
    whereas sorting is the act of ordering siblings.

    That is, 4.0 gives you these choices: ``Sort by Date; Sort by Subject; Sort
    by message number; or Thread.'' Where they assume that ``Thread'' implies
    ``Sort by Date.'' So that means that there's no way to see a threaded set
    of messages that are sorted by message number, or by sender, etc.

    There should be options for how to sort the messages; and then, orthogonal
    to that should be the boolean option of whether the messages should be
    threaded.

I seem to recall there being some other problem that was a result of the thread
hierarchy being stored in the database, instead of computed as needed in
realtime (there was some kind of ordering or stale-data issue that came up?)
but maybe they finally managed to fix that.

My C version of this code was able to thread 10,000 messages in less than half
a second on a low-end (90 MHz) Pentium, so the argument that it has to be in
the database for efficiency is pure bunk.

Also bunk is the idea that databases are needed for ``scalability.'' This code
can thread 100,000 messages without a horrible delay, and the fact is, if
you're looking at a 100,000 message folder (or for that matter, if you're
running Confusicator at all), you're doing so on a machine that has sufficient
memory to hold these structures in core. Also consider the question of whether
your GUI toolkit contains a list/outliner widget that can display a million
elements in the first place. (The answer is probably ``no.'') Also consider
whether you have ever in your life seen a single folder that has a million
messages in it, and, further, that you've wanted to look at all at once (rather
than only looking at the most recent 100,000 messages to arrive in that
newsgroup...)

In short, all the arguments I've heard for using databases to implement
threading and mbox summarization are solving problems that simply don't exist.
Show me a real-world situation where the above technique actually falls down,
and then we'll talk.

Just say no to databases!

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━