12 Years of Gmail, Part 5: Mail

Posted on 05 December 2016 in Technology

This post is part of my series, 12 Years of Gmail, taking a look at the data Google has accumulated on me over the past 12 years of using various Google services and documenting the learning experience developing an open source Python project (Takeout Inspector) to analyze that data.

After taking a look at the chat data in my export, I am finally ready to move on to some of the actual mail! Much of what I will look at here is pretty similar to what I was able to turn up with chat data. I tried to branch out a bit, bringing in a new package to create word clouds, and also refactored some of the Takeout Inspector code to form the beginning of a more "formal" report generating process (instead of just spitting out a single HTML file with only a certain subset of the data). Hopefully I can continue to improve this to a point allowing for easier report generation for any user. Anyway, on to the mail data!

Top 10 Recipients

This is one of the first graphs I produced while prototyping this idea. The top recipient (person I have sent email to) by far, Crystal Meyer, is my wife, and the remaining folks are a mix of friends and family, so there is not much to dig into for this graph.

Top 10 Senders

This is at least slightly more interesting - while Crystal Meyer is the top person I send email to, she is not the top person I receive email from. So who is this mystery Lindsey Campbell? Not a person at all! This happens to be a "noreply" style address used for automated emails from a website forum dedicated to a Left 4 Dead game server that I operated for a couple of years.

Number two on this list, Craig Wilber, is also not a real person. This heavy traffic is from an automated email service from DNSstuff that I have used in the past to alert me of DNS issues for a website. For exactly one year from 2006 to 2007 I received a couple of emails from this service every day. Looking at the content of some of these emails, I am not quite sure why I received so many for such a long time. They all report odd results about changing DNS records every couple of hours. Curious...

After those two and my wife, the rest of the list rounds out with a mix of more automated email services and other friends and family members as above.

Thread Durations

This graph uses Gmail's thread IDs to determine the total duration of each mail thread (based on the Date header of the first and last email). The following query retrieves the basis for this information:

SELECT strftime('%s', MAX(`date`)) - strftime('%s', MIN(`date`)) AS duration,
COUNT(message_key) AS message_count
FROM messages
WHERE gmail_labels NOT LIKE '%Chat%'
GROUP BY gmail_thread_id
HAVING message_count > 1;

In this case I chose to only include threads of more than one email (the HAVING clause) to keep out things like newsletters and other automated email services. So this should reflect mostly real conversations with actual people.
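As a rough sketch of how the query above could feed the graph, the snippet below runs it against a tiny in-memory table and sorts each thread into "one week or less" or "more than one week". The table layout is a cut-down assumption (only the columns the query touches), the rows are made up, and the bucket names are purely illustrative rather than what Takeout Inspector actually does.

```python
import sqlite3

# Hypothetical miniature of the messages table; only the columns used by the
# query are included, and every row is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (message_key INTEGER, gmail_thread_id TEXT, "
    "gmail_labels TEXT, date TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?, ?)",
    [
        (1, "a", "Inbox", "2010-01-01 09:00:00"),
        (2, "a", "Inbox", "2010-01-01 09:30:00"),  # thread a lasts 30 minutes
        (3, "b", "Inbox", "2010-01-01 09:00:00"),
        (4, "b", "Sent", "2010-01-20 09:00:00"),   # thread b lasts 19 days
        (5, "c", "Inbox", "2010-02-01 09:00:00"),  # single email, dropped by HAVING
        (6, "d", "Chat", "2010-02-01 09:00:00"),
        (7, "d", "Chat", "2010-02-01 09:05:00"),   # chat thread, dropped by WHERE
    ],
)

QUERY = """
SELECT strftime('%s', MAX(`date`)) - strftime('%s', MIN(`date`)) AS duration,
       COUNT(message_key) AS message_count
FROM messages
WHERE gmail_labels NOT LIKE '%Chat%'
GROUP BY gmail_thread_id
HAVING message_count > 1;
"""

WEEK = 7 * 24 * 60 * 60  # one week, in seconds

# Tally threads by whether they lasted longer than a week.
buckets = {"one week or less": 0, "more than one week": 0}
for duration, message_count in conn.execute(QUERY):
    key = "one week or less" if duration <= WEEK else "more than one week"
    buckets[key] += 1

print(buckets)
```

Finer-grained buckets (an hour, a day, a month) would follow the same pattern with more thresholds.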

No duration really dominates here, but if this were re-categorized as "one week or less" and "more than one week", the latter would only account for a little more than 10% of communications.

The longest thread lasted more than seven years (221,932,765 seconds) and only had four emails. It was an alert service for Bug 285774, an issue with the Camino web browser that I was apparently interested in back in 2006. In fact, the top five longest threads are all Mozilla bug reports relating to Camino (aka not real conversations).

The first and longest real thread is only three emails wherein I send my résumé to someone and receive a response nine months later. I guess he wasn't impressed...

Thread Sizes

Again using Gmail's thread ID email header, this graph compares thread sizes by the total number of emails in the thread (at least two emails). The curve ends up being almost exactly inverse (the more emails in a thread, the fewer such threads there are) except for small deviations and a couple of anomalies leading up to one big 98-email thread.
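The post does not show the query behind this graph, so the following is only a guess at one way to get "number of threads per thread size" out of the same messages table: count emails per thread in a subquery, then count threads per size. The table here is a made-up, cut-down stand-in with invented rows.

```python
import sqlite3

# Hypothetical stand-in for the messages table with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (message_key INTEGER, gmail_thread_id TEXT, "
    "gmail_labels TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        (1, "a", "Inbox"), (2, "a", "Sent"),                    # 2 emails
        (3, "b", "Inbox"), (4, "b", "Sent"),                    # 2 emails
        (5, "c", "Inbox"), (6, "c", "Sent"), (7, "c", "Inbox"),  # 3 emails
        (8, "d", "Inbox"),                                      # 1 email, excluded
        (9, "e", "Chat"), (10, "e", "Chat"),                    # chat, excluded
    ],
)

SIZE_QUERY = """
SELECT message_count, COUNT(*) AS thread_count
FROM (
    SELECT COUNT(message_key) AS message_count
    FROM messages
    WHERE gmail_labels NOT LIKE '%Chat%'
    GROUP BY gmail_thread_id
    HAVING message_count > 1
)
GROUP BY message_count
ORDER BY message_count;
"""

# Each row is (emails in thread, number of threads of that size).
size_counts = conn.execute(SIZE_QUERY).fetchall()
for message_count, thread_count in size_counts:
    print(message_count, thread_count)
```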

The first small anomaly is nine 23-email threads (zoom in to see it). Not surprisingly, these threads have nothing to do with one another. Some are between me and the developer of Quinn back in the early 2000s, another is a vacation planning thread with some friends, one relates to my wife and me closing on our first home, and the remainder are random conversations with friends.

The more noticeable anomaly is 19 61-email threads, and it turns out these threads are in fact related. As I mentioned previously in the Top 10 Senders section, I used to run a forum for a Left 4 Dead game server. This forum was not hugely popular, but it was eventually targeted by an automated spam service that would attempt to create large swaths of accounts to spam the forum. In order to combat this, I activated a function requiring manual review of all newly created accounts. This is what led the noreply address from the forum to be far and away the top sender to my account.

Why 61 emails per thread? I'm not sure. My best guess was that Gmail grouped large threads after a set amount of time, but these threads range from about two to seven days with no two threads having the same duration. I continued digging around looking for similarities and, finding none, turned to the Internet. While I could not find any definitive source or reason for this grouping, at least a few other folks have stumbled on the oddity [1, 2].

There are only two more numbers of note after 61:

  • Two 66-email threads which are unrelated but both about planning and launching a website.
  • One 98-email thread, once again about the development process for the Quinn website.

Activity by Day of the Week

The distribution here for the days is not terribly surprising as I have never done much emailing on the weekends. But I am surprised to see the sent and received numbers almost 50/50 for every day of the week. I suspect this may be skewed a little bit because spam is cleared out automatically every 30 days and because I (regrettably) "cleaned" out a lot of old email before doing my Takeout export. Had I not cleaned things up, I suspect the sent/received distribution would be closer to 40/60 than 50/50.
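For anyone wanting to reproduce a breakdown like this, here is a minimal sketch of counting messages per weekday with SQLite's strftime('%w'), which returns 0 for Sunday through 6 for Saturday. It assumes the date column is stored in a format SQLite's date functions understand (the duration query earlier relies on the same thing) and it ignores time zones and the sent/received split, both of which the real graph has to deal with. The table and rows are invented.

```python
import sqlite3

DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]

# Hypothetical stand-in for the messages table with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (message_key INTEGER, gmail_labels TEXT, date TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        (1, "Inbox", "2016-12-05 12:00:00"),  # Monday
        (2, "Sent", "2016-12-05 13:00:00"),   # Monday
        (3, "Inbox", "2016-12-06 11:30:00"),  # Tuesday
        (4, "Inbox", "2016-12-10 20:00:00"),  # Saturday
        (5, "Chat", "2016-12-07 09:00:00"),   # chat, excluded
    ],
)

DAY_QUERY = """
SELECT CAST(strftime('%w', `date`) AS INTEGER) AS day_of_week,
       COUNT(*) AS message_count
FROM messages
WHERE gmail_labels NOT LIKE '%Chat%'
GROUP BY day_of_week
ORDER BY day_of_week;
"""

# Map SQLite's numeric weekday onto a readable name.
by_day = {DAYS[day]: count for day, count in conn.execute(DAY_QUERY)}
print(by_day)
```

Swapping '%w' for '%H' would give an hour-of-day breakdown in the same way.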

Activity by Hour of the Day

This activity falls pretty squarely in line with my chat time activity. Apparently the best time to catch me for a response of any kind is on a weekday at/around EST lunch time (11AM - 1PM EST).

Interestingly, I think both the day of week and time of day activity were a habit of Gmail in particular. Since I moved away from it, I find that I am more prone to check and respond to emails in the evenings and on weekends. Perhaps this is simply because of the lack of chat functionality with my new email provider. I wish that I had had a longer history of Gmail before chat was introduced, as I wonder if that addition had some larger effect on my overall communication habits.

Label Usage

I was never a particularly heavy user of labels so this graph is mostly dominated by the default labels that Gmail applies automatically. Removing those from the graph (click the labels in the legend) reveals the labels I did actually use:

  • The most used label, FBF, refers to Friends of Burkina Faso, an organization I provide some volunteer work to on occasion.
  • PC labeled emails relate to my time in Peace Corps; most of these emails are part of the long application process.
  • House emails relate to the many, many back-and-forth emails and documents involved in purchasing a home.
  • RAID refers to the Rogue AI Dungeon project I worked on with friends.
  • CHLHS is the Chesapeake Chapter of the U.S. Lighthouse Society, another organization I have volunteered with in the past.
  • Last but not least, haxors.com is a domain that I have owned for many years and put a lot of random projects on. By the time I started using Gmail, I used this domain less and less so there are not very many emails with this label (compared to the others).

I never found the labels concept particularly useful - my emails tend to have one primary subject and mixing labels would be a very rare occurrence. Prior to using Gmail and labels, I did make heavy use of folders with my own email services and to this day I still do with Exchange-based services. The Gmail experience did a very effective job of pushing users to the "archive it all and search as needed" philosophy thanks to Google's powerful search algorithms.

Subject Word Cloud

Most of the words that pop out here are not at all surprising given the topics covered in this post. The subject line of that automated forum email was always the same, "Activate user account", so it dominates the graphic. The rest tend to be either common English words or things related to Peace Corps or side projects I have worked on.

This graphic is generated using the wordcloud package, which has a very simple API for Python and also includes an advanced feature for building the word clouds using masks.

As this graphic is only based on email Subject headers, getting the data from Takeout Inspector's sqlite table is very simple:

SELECT subject FROM messages
WHERE gmail_labels NOT LIKE '%Chat%' AND subject != '';

From here Python is used to separate and clean the words and calculate frequency:

import re

words = {}
# c is a sqlite3 cursor that has just executed the query above
for row in c.fetchall():
    subject = row[0]
    # Strip common reply/forward prefixes
    for prefix in ['Re:', 'Fwd:']:
        subject = subject.replace(prefix, '')

    # Keep only letters, dots and spaces, then trim and lower-case
    subject = re.sub(r'[^a-zA-Z. ]', '', subject).strip().lower()

    for word in subject.split(' '):
        # Drop trailing periods first so a bare "." is not counted as a word
        word = word.rstrip('.')
        if word:
            words[word] = words.get(word, 0) + 1

  • The inner loop over 'Re:' and 'Fwd:' removes common email subject prefixes.
  • The re.sub() line filters out non-alpha characters, leaving dots (for web addresses) and space characters, and converts all letters to lower case.
  • The final loop breaks each subject into a list of words, removes periods from the end of words with rstrip (this assumes sentence structure), skips anything left empty and finally ticks the frequency count for each word.

If required, the sorted() function is great for creating sorted lists from a dictionary like words above. However, this word cloud simply filters the list for all words with more than 100 occurrences before passing the data on to wordcloud's generate_from_frequencies() method. Wordcloud can also handle the frequency calculation itself with the regular generate() method.
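To make the filtering and sorting concrete, here is a small sketch using a made-up frequency dictionary in place of the real words data. The counts, the MIN_OCCURRENCES name and the variable names are all illustrative; the wordcloud call is left as a comment since it needs the package installed, and the assumption there is that a recent wordcloud release is in use, where generate_from_frequencies() accepts a dict.

```python
# Hypothetical frequency dictionary standing in for `words`.
words = {"activate": 250, "user": 240, "account": 245, "camino": 120, "lunch": 12}

MIN_OCCURRENCES = 100

# Keep only words seen more than MIN_OCCURRENCES times.
frequent = {word: count for word, count in words.items()
            if count > MIN_OCCURRENCES}

# sorted() with a key function yields (word, count) pairs, most frequent first.
ranked = sorted(words.items(), key=lambda item: item[1], reverse=True)

print(ranked[:3])

# With wordcloud installed, the filtered dict could then be handed over
# (not executed in this sketch):
#   from wordcloud import WordCloud
#   cloud = WordCloud().generate_from_frequencies(frequent)
```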

More to Come

Many of the graphs generated for email information are very similar to what was done for chat information. I will use a future post in the series to dig deeper into some of the more technical email headers, hopefully take a look at body word usage/statistics and come up with some other ways to look at and compare more aggregate message data.