Today, we are going to do some spelunking within the deep, dark place which is your browser history.
In order to obtain the data for this tutorial from Google Chrome, go to `~/Library/Application Support/Google/Chrome/Default` on a Mac, `~/.config/google-chrome/Default` on Linux, or `%LocalAppData%\Google\Chrome\User Data\Default` on a Windows PC.
Run the following SQLite command to obtain a text file in reverse chronological order.

On Mac/Linux:

```shell
sqlite3 History "select datetime(last_visit_time/1000000-11644473600,'unixepoch'), url from urls order by last_visit_time desc" > ~/hist.txt
```

Or, on Windows:

```shell
sqlite3 History "select datetime(last_visit_time/1000000-11644473600,'unixepoch'), url from urls order by last_visit_time desc" > %userprofile%\hist.txt
```
Check your user folder. A file called `hist.txt` should be there. Move the file to a suitable place for this exercise.
(This process brought to you by the brilliant people on Stack Exchange).
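The magic numbers in that query merit a quick note: Chrome stores `last_visit_time` as microseconds since January 1, 1601 UTC (the Windows FILETIME epoch), so dividing by 1,000,000 and subtracting the 11,644,473,600-second offset between the two epochs yields a standard Unix timestamp. Here's a minimal sanity check in pure Python; the raw value below is made up to correspond to the example row we'll see shortly:

```python
from datetime import datetime, timezone

# Hypothetical raw value as Chrome would store it:
# microseconds since 1601-01-01 UTC
chrome_us = 13154994611000000

# Convert to seconds, then shift from the 1601 epoch to the 1970 Unix epoch
unix_seconds = chrome_us // 1_000_000 - 11644473600

print(datetime.fromtimestamp(unix_seconds, tz=timezone.utc))
# 2017-11-12 21:10:11+00:00
```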
Import the needed libraries, `numpy` and `pandas`:

```python
import pandas as pd
import numpy as np
```
## Clean Up Data
The data that we pulled is extremely messy. Here’s an example row:

```
2017-11-12 21:10:11|https://news.ycombinator.com/item?id=15678587
```
We need to split on that vertical bar while making sure not to split on a bar in the URL itself. Pandas’ `read_csv` splits on every occurrence of a delimiter, so let’s write a custom import function instead:
```python
# ...continued from previous code block...

# Open our file
with open('hist.txt') as f:
    content = f.readlines()

# Strip whitespace, then split on the first occurrence of the pipe character
raw_data = [line.split('|', 1) for line in [x.strip() for x in content]]

# We now have a 2D list
print(raw_data[1])
```

```
['2017-11-12 21:09:21', 'https://news.ycombinator.com/']
```
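To convince ourselves that the `maxsplit` argument of `str.split` does what we want, here’s a quick check with a made-up URL that itself contains a pipe character:

```python
# A hypothetical history line whose URL contains a pipe
line = '2017-11-12 21:10:11|https://example.com/search?q=a|b'

# maxsplit=1 splits only on the first pipe, leaving the URL intact
print(line.split('|', 1))
# ['2017-11-12 21:10:11', 'https://example.com/search?q=a|b']
```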
Using our 2D list, let’s make a Pandas DataFrame with custom column headers and make sure it is working:
```python
# ...continued from previous code block...
data = pd.DataFrame(raw_data, columns=['datetime', 'url'])
data.head(1)
```

| | datetime | url |
|---|---|---|
| 0 | 2017-11-12 21:10:11 | https://news.ycombinator.com/item?id=15678587 |
Now, we’re almost done ingesting the data. Let’s convert the datetime string column into a column of Pandas `datetime` elements and double-check that it is indeed a Pandas timestamp:
```python
# ...continued from previous code block...
data.datetime = pd.to_datetime(data.datetime)
data.datetime[0]
```

```
Timestamp('2017-11-12 21:10:11')
```
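Converting to real timestamps pays off: once the column holds `datetime` values, Pandas’ `.dt` accessor lets us group by time components. A small sketch on a toy frame (the values here are made up, standing in for the real history data) counting visits per hour of day:

```python
import pandas as pd

# Toy data standing in for the real history frame
toy = pd.DataFrame({'datetime': pd.to_datetime([
    '2017-11-12 21:10:11',
    '2017-11-12 21:30:00',
    '2017-11-12 09:05:00',
])})

# Count visits by hour of day: two visits at hour 21, one at hour 9
by_hour = toy.datetime.dt.hour.value_counts().sort_index()
print(by_hour)
```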
Finally, let’s strip each URL down to just its domain/subdomain and check our work again:
```python
# ...continued from previous code block...

# Import the urlparse function from the urllib library
from urllib.parse import urlparse

# Define a lambda function to parse the domain from the URL
parser = lambda u: urlparse(u).netloc

# Apply the lambda function to the url column
data.url = data.url.apply(parser)
data.head(1)
```

| | datetime | url |
|---|---|---|
| 0 | 2017-11-12 21:10:11 | news.ycombinator.com |
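As a standalone check, `urlparse` pulls the network location (domain plus any subdomain) out of a full URL like so:

```python
from urllib.parse import urlparse

# netloc holds everything between the scheme and the path
print(urlparse('https://news.ycombinator.com/item?id=15678587').netloc)
# news.ycombinator.com
```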
With that, our data is clean.
## Analyzing the Data
Now that the boring part is done, let’s analyze our browsing data.
### Most Visited Sites
Let’s generate a list of our top sites sorted by frequency, then print out the first two to get our most visited sites:
```python
# ...continued from previous code block...

# Aggregate domain entries
site_frequencies = data.url.value_counts().to_frame()

# Make the domain a column
site_frequencies.reset_index(level=0, inplace=True)

# Rename columns to appropriate names
site_frequencies.columns = ['domain', 'count']

# Display top 2
site_frequencies.head(2)
```

| | domain | count |
|---|---|---|
| 0 | www.google.com | 3904 |
| 1 | github.com | 1571 |
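If the `value_counts` / `reset_index` dance looks opaque, here’s the same pattern on a tiny made-up frame, so you can see each step without the full dataset:

```python
import pandas as pd

# Hypothetical mini-history: three visits to a.com, one to b.com
toy = pd.DataFrame({'url': ['a.com', 'b.com', 'a.com', 'a.com']})

# Count occurrences of each domain (sorted most-frequent first),
# then promote the domain index to an ordinary column and rename
freq = toy.url.value_counts().to_frame()
freq.reset_index(level=0, inplace=True)
freq.columns = ['domain', 'count']
print(freq)
```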
It should come as no shock that my top sites, just like any other dev’s, were Google and GitHub.

(I’m secretly surprised that Stack Overflow was not in the top ten.)
Now, let’s see our top sites in a beautiful chart:
```python
# ...continued from previous code block...
import matplotlib.pyplot as plt

topN = 20
plt.figure(1, figsize=(10, 10))
plt.title('Top {} Sites Visited'.format(topN))
pie_data = site_frequencies['count'].head(topN).tolist()
pie_labels = None
# Uncomment to get specific domain names
# pie_labels = site_frequencies['domain'].head(topN).tolist()
plt.pie(pie_data, autopct='%1.1f%%', labels=pie_labels)
plt.show()
```
What else can we do with the data? A lot, I’m sure, since this is the Holy Grail for ad tracking companies.