Can Google Really Tell Which of Your E-mails Are Interesting?

Learn how Google decides which e-mails to store for offline viewing.
One of the most fascinating tidbits from Google's announcement last week of offline capability for Gmail was this: The company says that when it stores your messages for offline viewing, it tries to choose the most interesting ones. It's a sensible, if somewhat creepy, strategy -- after all, why waste your disk space on messages you'll never want to read again? But after looking at the results on my own e-mail account, I can't see any evidence that, in my case at least, Google is successfully separating the wheat from the chaff.

Here's exactly what Google says in its online support:

"We try to download your most recent conversations along with any conversations that seem to be important (regardless of their age). We also try not to dowload [sic] uninteresting conversations. This process is done heuristically and as with any heuristic can and will miss things. We'll continue to tune things up, but more importantly, we'll eventually provide a UI that will allow you to change the settings."

I asked a Google rep if someone could give me more detail on the process, but he declined. So we're left with this somewhat cryptic explanation. (When I first started hearing the word heuristic a few years ago, I thought it actually meant something. After hearing it applied to countless mysterious and diverse technological processes over the years, I've concluded that it's really just a polite way of saying "You wouldn't understand.")

So Google seems to be saying it analyzes your messages to figure out whether they're important or interesting (certainly not always the same thing). Does it do that by looking at the content (it already processes the content of your messages to serve up contextual ads)? Does it look at the activity a message engenders -- the number of responses, etc.? My guess is that it uses a combination of both methods.

But whatever the strategy is, it doesn't seem to be working, at least on my Gmail account. I looked at what Gmail did with uninteresting, unimportant messages and what it did with very important and interesting missives. What I found is that it seems to do exactly the same thing with both kinds of messages.

I started by looking at all the mail in my offline cache. Gmail has kept a fairly comprehensive collection of my mail going back to the beginning of December 2008, about 6,500 messages. It also kept other messages that carry a few select labels. (Google says that it chooses some labels that will be completely cached. Here's Google's baroque explanation of how it chooses those labels: "Additionally, we'll download any conversation marked with a label that contains less than 200 conversations, has at least one conversation that has been received in the last 30 days and also has at least one conversation that's outside the estimated time period. For many users, this list of labels will include Starred and Drafts.")
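
Google's label rule is easier to follow when it's written out as a three-part check. What follows is only a rough sketch in Python, not Google's actual code: the Label class, the function name and the sample dates are all made up for illustration. I'm also assuming that "outside the estimated time period" means older than the start of the recent-mail window (early December 2008 in my case), and that "the last 30 days" includes the 30th day.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class Label:
    """Hypothetical stand-in for a Gmail label (not a real Gmail API type)."""
    name: str
    received_dates: List[date]  # one received date per conversation


def is_fully_cached(label: Label, today: date, window_start: date) -> bool:
    """Apply the three conditions quoted from Google's support page.

    window_start is an assumption: the first day of the "estimated time
    period" of recent mail that Gmail downloads anyway.
    """
    dates = label.received_dates
    fewer_than_200 = len(dates) < 200
    has_recent = any(d >= today - timedelta(days=30) for d in dates)
    has_older = any(d < window_start for d in dates)  # outside the window
    return fewer_than_200 and has_recent and has_older


# Illustrative values only.
today = date(2009, 2, 3)
window_start = date(2008, 12, 1)

starred = Label("Starred", [date(2009, 1, 20), date(2008, 6, 15)])
stale = Label("Old project", [date(2008, 6, 15), date(2008, 7, 1)])

print(starred.name, is_fully_cached(starred, today, window_start))
print(stale.name, is_fully_cached(stale, today, window_start))
```

Under these assumptions, a small label qualifies only if it is both still active and reaches back past the cutoff, which would explain why Starred and Drafts make the list for many users.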

Then I looked at unimportant, uninteresting messages, for instance, a stream of collected business press releases I get daily, never read and immediately archive. The result: Every one of them is preserved in my Google cache, right back to early December. I also looked at a stream of e-mail ads from Amazon about bargains. Again, these are messages I don't open, just immediately archive. Again, each one seems to be preserved in my offline cache, back to the beginning of December.

Finally, I looked at important messages from my colleagues here at PC World (some of which are actually interesting as well). These are messages that I read closely and often respond to or forward to another editor. According to Google's description of its system, I should expect that some of those messages would be preserved even if they predate the early December cutoff for the rest of my mail. (Remember: "We try to download ... any conversations that seem to be important (regardless of their age).")

But that's not the case. The only messages in my cache that are from before early December are those that have one of the labels that Gmail decided to keep a complete record of. It doesn't matter if an e-mail thread has 15 responses or includes words like "this is important" or "that's interesting." It's still not there.

So I'm forced to conclude that for my account at least, Google's heuristic isn't working. Or perhaps it's something that I just wouldn't understand.

