At Mozilla, we need to understand how Firefox is used in the wild. Knowing what “typical” profiles look like, and having automated tests that attempt to model real-world situations, is a big plus for writing well-performing code.
Just in case anyone else needs to collect data about Firefox use or model “typical” user data for performance testing, here is how Drew and I quickly put together our “Places” toolkit.
The Sprint info page is here: https://wiki.mozilla.org/Firefox/Sprints/Places_DB_Creation_Scripts
We needed:
1. a client side script that collects places.sqlite metrics
The client-side script is JavaScript written by Drew.
His script runs a bunch of aggregate SQL queries against your Places SQLite database and posts the results to the collection URL: https://places-stats.mozilla.com/stats
and
2. A server side script to generate a places.sqlite database based on the metrics we are collecting.
I focused on the database generation.
For now, we are doing this so we can create a test (mock) SQLite database with as many records as we wish, or sized according to the min, max, or average of the metrics from users who post to the places-stats collection URL.
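To give a feel for the kind of aggregate queries involved, here is a minimal sketch in Python rather than Drew's actual JavaScript. The function name `collect_metrics`, the metric names, and the choice of queries are assumptions for illustration; `moz_places` and `moz_historyvisits` are real Places tables, but the demo below runs against a tiny in-memory stand-in, not a real places.sqlite.

```python
import sqlite3

def collect_metrics(conn):
    """Run a few aggregate queries and return the results as a dict.

    The metric names and queries here are hypothetical examples.
    """
    queries = {
        "places_count": "SELECT COUNT(*) FROM moz_places",
        "visits_count": "SELECT COUNT(*) FROM moz_historyvisits",
        "max_visit_count": "SELECT MAX(visit_count) FROM moz_places",
    }
    return {name: conn.execute(sql).fetchone()[0] for name, sql in queries.items()}

# Build a tiny in-memory stand-in for places.sqlite (only the columns we query).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE moz_places (id INTEGER PRIMARY KEY, url TEXT, visit_count INTEGER);
    CREATE TABLE moz_historyvisits (id INTEGER PRIMARY KEY, place_id INTEGER);
    INSERT INTO moz_places (url, visit_count)
        VALUES ('http://a.example/', 3), ('http://b.example/', 1);
    INSERT INTO moz_historyvisits (place_id) VALUES (1), (1), (1), (2);
""")
metrics = collect_metrics(conn)
print(metrics)
```

Only counts and maxima leave the machine in a scheme like this, not the URLs themselves, which is what makes aggregate collection reasonable to ask of users.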
So the basic flow is:
1. have users visit https://places-stats.mozilla.com and run the collection script
2. get a large number of users (and varied types of users) posting their stats to the collection URL
3. be able to produce a “power user”, “average user”, and “light user” places.sqlite database on the fly from data hosted at places-stats.mozilla.com
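Step 3 of the flow above can be sketched roughly as follows. This is not the real script: `build_profiles`, the metric names, and the sample numbers are made up, and mapping min/average/max onto “light”, “average”, and “power” users is my assumption about how the profiles would be derived.

```python
from statistics import mean

def build_profiles(submissions):
    """Return {profile name: {metric: target row count}}.

    Assumes every submission is a dict with the same metric keys.
    """
    metrics = list(submissions[0].keys())
    return {
        "light user": {m: min(s[m] for s in submissions) for m in metrics},
        "average user": {m: round(mean([s[m] for s in submissions])) for m in metrics},
        "power user": {m: max(s[m] for s in submissions) for m in metrics},
    }

# Hypothetical stats posted by three users.
submissions = [
    {"places": 1000, "bookmarks": 20},
    {"places": 50000, "bookmarks": 400},
    {"places": 120000, "bookmarks": 12000},
]
profiles = build_profiles(submissions)
print(profiles["average user"])
```

Each resulting dict is just a set of row counts that a generator could take as its targets.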
I wrote a Python script for the aggregate data collection and database generation.
To make this an easy, fast exercise in software re-use, I used Django’s db module to reverse engineer the Places schema into a set of Python models.
Once you have Django set up you can run the famous ‘manage.py inspectdb’, which queries your SQLite db schema and outputs the corresponding django.db Python classes.
It’s trivial to inject new rows into the database using django.db:
place = MozPlaces(
    url=my_url,
    title=my_title,
    rev_host=reverse_host(my_url),
    visit_count=1,
    hidden=0,
    typed=1,
    favicon=new_favicon(),
    frecency=1)
place.save()
(‘MozPlaces’ is a django.db ORM class)
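The `reverse_host` helper used above is not shown in this post. A minimal sketch of what it might look like, assuming (as I understand the Places schema) that `rev_host` holds the hostname with its characters reversed plus a trailing dot; the implementation below is illustrative and written for Python 3:

```python
from urllib.parse import urlsplit

def reverse_host(url):
    """Return the URL's hostname reversed, with a trailing dot.

    Hypothetical implementation; the real helper may differ.
    """
    host = urlsplit(url).hostname or ""
    return host[::-1] + "."

print(reverse_host("http://www.mozilla.org/firefox"))
```

Storing the host reversed puts the TLD and domain first, so a prefix match on `rev_host` can find every subdomain of a site.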
Wow, that was easy, but wait, there is more to do.
We are not even attempting to create ‘real’ generated place data, we just want the rows in the database to seem real. We can generate random host, domain, and tld data like this:
import random
import string

# Assumed definition: ALPHA is not shown in this excerpt.
ALPHA = string.ascii_lowercase

def url_parts():
    """
    Return a dictionary like: {'proto': 'http',
                               'host': 'www',
                               'domain': 'foo',
                               'tld': 'com'}
    """
    protocol = ['https', 'http', 'ftp']
    # random.sample picks without replacement, so no letter repeats in a label
    host_len = random.randint(4, 26)
    host = "".join(random.sample(ALPHA, host_len))
    domain_len = random.randint(2, 26)
    domain = "".join(random.sample(ALPHA, domain_len))
    tld_len = random.randint(2, 3)
    tld = "".join(random.sample(ALPHA, tld_len))
    proto = random.choice(protocol)
    return {'proto': proto, 'host': host, 'domain': domain, 'tld': tld}
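Turning that dictionary into a URL is then a one-liner. Here is a rough sketch; `build_url` is a hypothetical name, and using `uuid4().hex` for the path is my stand-in for whatever unique ID the real script appends to each URL.

```python
import uuid

def build_url(parts, path=None):
    """Join the pieces from a url_parts()-style dict into a URL string.

    If no path is given, a random hex ID is used so every URL is unique
    (an assumption; the real script's ID scheme may differ).
    """
    if path is None:
        path = uuid.uuid4().hex
    return "{proto}://{host}.{domain}.{tld}/{path}".format(path=path, **parts)

# Example with the dictionary from the docstring above.
parts = {'proto': 'http', 'host': 'www', 'domain': 'foo', 'tld': 'com'}
print(build_url(parts, path='index'))
print(build_url(parts))
```

A unique path matters because the random hostnames alone could, in principle, collide across a hundred thousand generated places.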
Python’s random module has a ton of cool features. Output from the program shows that we end up with crazy-looking hosts:
% python builddb/generate.py
h = httplib2.Http(os.tmpnam())
########################################################
Creating 131901 Places
Creating about 191594 History Visits
Creating about 12779 Bookmarks
Creating 101 Keywords
Creating 2173 Input History Records
########################################################
131901
Place #1 created
https://rmxwunibhvqzgjfclasypedko.zjrlundpaocs.kc/00000120269538042dedec07007f000000010001
Place #2 created
http://hlbgtm.wjxbdquyraotliek.au/000001202695391f62a5444e007f000000010001
Place #3 created
http://zdlxfpavecirty.urjawdvzoxgqemcikl.fp/00000120269539d794891209007f000000010001
Place #4 created
http://viwzykb.ofwxjmvltr.oa/0000012026953ab4b233317e007f000000010001
Place #5 created
https://yphswltjfmrbqogcd.qvd.ozd/0000012026953b539bc78b95007f000000010001
Place #6 created
ftp://pncqvksgazieuhdlofwxrtbymj.oekt.rbk/0000012026953c1f28a069ce007f000000010001
Place #7 created
http://lsmqeaojpxibvgnukwztcryhfd.isryhudzoeqjxtcankfgm.sg/0000012026953ca74487966d007f000000010001
My favorite site of the lot is “yphswltjfmrbqogcd.qvd.ozd” :)
The generation script populates “Places”, History, Bookmarks, Favicons, Input History and Keywords. I still have a few more entity types to generate, but this is sufficient for the testing we need to do now.
The current patch is here: https://bug480340.bugzilla.mozilla.org/attachment.cgi?id=367263
The bug is here: https://bugzilla.mozilla.org/show_bug.cgi?id=480340
The basic lesson learned is that you can build an effective, one-off data collection/metrics tool quickly and easily. I am sure others at Mozilla need tools like this, so do not hesitate to ping me with questions.