Archives For places

JetCrawl

2010/01/16

To provide realistic data in places.sqlite, I wrote a data generator in Python that inserts many records into Places, and this was a good thing.

The data is entirely made up of random strings, so you end up with URLs like this: http://ffhjhfj.uwtgbz.wsc

The same for tags, etc…

To make things more realistic and “testable” inside xpcshell, I created a crawler using Jetpack and standard Firefox XPCOM components. I have a feeling that QA might be interested in this as well, so I posted it to the Jetpack Gallery.

It is not configurable from the outside yet, but I have plans for that. I am planning to have JetCrawl surf every night for a set period to grow the collection of data I have to work with.

I need this data to craft a new Places Query API that is fast and well tested against a rather large collection of bookmarks and history. The URLs that JetCrawl uses are taken from the Alexa Top 100, so it is a common set to boot.

Automated tests will work better with this data, since the crawler hits predictable URLs and the actual Places APIs create the data in the first place.

If you want to try it out, please use a new profile. You can stop it by closing the tab. There is no UI as of yet.

With the Firefox UI/UX team starting to crank out design ideas for “Places” (a Mozilla internal name for bookmarks and history) in Firefox 3.7 and 4.0, it’s high time the Places team revamped the query API.

Alex Faaborg has posted some initial UI concepts here: http://blog.mozilla.com/faaborg/2009/10/13/browsing-your-personal-web/

I have started to think about how to make an elegant API to do the heavy lifting of querying the Places database for bookmarks, history and related hierarchies. The current Places query API is not simple to use, and we want the new one to be simple and easily extensible by extension authors, as well as a drop-in API for Jetpack.

The bug for this work is here: https://bugzilla.mozilla.org/show_bug.cgi?id=522572

One of our non-goals is to make a snap in replacement for the current API. We get to focus on the new features, like “browsing” your bookmarks and history in content-space, as well as accessing bookmarks and history via the “awesomebar”.

I have posted the beginning stages of this work to the wiki here; the Firefox “project page” is here. We are at the stage of thinking about and sketching what this simple, elegant API might look like, and we would love to get feedback and ideas from our colleagues and the Mozilla community. The Places 3.7 meta bug is here: https://bugzilla.mozilla.org/show_bug.cgi?id=523519

At Mozilla, we need to understand how Firefox is used in the wild. Knowing what “typical” profiles are like and having automated tests that attempt to model real-world situations is a big plus for writing well-performing code.

Just in case anyone else needs to collect data about Firefox use or model “typical” user data for performance testing, here is how Drew and I quickly put together our “Places” toolkit.

The Sprint info page is here: https://wiki.mozilla.org/Firefox/Sprints/Places_DB_Creation_Scripts

We needed:

1. a client side script that collects places.sqlite metrics

The client-side script is written in JavaScript by Drew.

His script runs a bunch of aggregate SQL queries against your Places SQLite database and posts the results to the collection URL: https://places-stats.mozilla.com/stats

and

2. a server-side script to generate a places.sqlite database based on the metrics we are collecting.
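The collection step (item 1) boils down to running a handful of aggregate queries and gathering the results into one record. Drew's script is JavaScript and is not reproduced here; the following is only a minimal Python sketch of the idea. The table names (moz_places, moz_historyvisits, moz_bookmarks) come from the Places schema, but the trimmed-down columns, the metric names, and the helper names `collect_stats` and `demo_connection` are assumptions made for illustration.

```python
import sqlite3

# Metric names are hypothetical; each maps to one aggregate query.
AGGREGATES = {
    "places": "SELECT COUNT(*) FROM moz_places",
    "visits": "SELECT COUNT(*) FROM moz_historyvisits",
    "bookmarks": "SELECT COUNT(*) FROM moz_bookmarks",
    "max_visit_count": "SELECT MAX(visit_count) FROM moz_places",
}


def collect_stats(conn):
    """Run every aggregate query and return {metric_name: value}."""
    return {name: conn.execute(sql).fetchone()[0]
            for name, sql in AGGREGATES.items()}


def demo_connection():
    """Build a tiny in-memory stand-in for places.sqlite (simplified columns)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE moz_places (id INTEGER PRIMARY KEY, url TEXT, visit_count INTEGER);
        CREATE TABLE moz_historyvisits (id INTEGER PRIMARY KEY, place_id INTEGER);
        CREATE TABLE moz_bookmarks (id INTEGER PRIMARY KEY, fk INTEGER);
        INSERT INTO moz_places (url, visit_count)
            VALUES ('http://a.example/', 3), ('http://b.example/', 1);
        INSERT INTO moz_historyvisits (place_id) VALUES (1), (1), (1), (2);
        INSERT INTO moz_bookmarks (fk) VALUES (1);
    """)
    return conn


if __name__ == "__main__":
    print(collect_stats(demo_connection()))
```

Against a real profile you would open a copy of places.sqlite instead of the in-memory demo database; posting the resulting dictionary to the collection URL is left out of the sketch.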

I focused on the database generation.

For now, we are doing this so we can create a test (mock) sqlite database with as many records as we wish, or based on the min, max or average of the users that post to the places-stats collection url.

So the basic flow is:

1. have users visit https://places-stats.mozilla.com and run the collection script.
2. get a large number of users (and varied types of users) posting their stats to the collection url
3. be able to produce a “power user”, “average user”, and “light user” places.sqlite database on the fly from data hosted at places-stats.mozilla.com
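Step 3 can be sketched as a small reduction over the submitted stats. This is a hedged illustration, not the server code: the function name `build_profiles`, the shape of each submission (a dict of metric name to count), and the mapping of minimum to “light user”, mean to “average user” and maximum to “power user” are all assumptions.

```python
from statistics import mean


def build_profiles(submissions):
    """Collapse per-user stats into light (min), average (mean) and power (max) targets.

    `submissions` is a non-empty list of dicts that all share the same metric keys.
    """
    profiles = {"light": {}, "average": {}, "power": {}}
    for metric in submissions[0]:
        values = [s[metric] for s in submissions]
        profiles["light"][metric] = min(values)
        profiles["average"][metric] = round(mean(values))
        profiles["power"][metric] = max(values)
    return profiles


if __name__ == "__main__":
    sample = [
        {"places": 100, "bookmarks": 10},
        {"places": 300, "bookmarks": 20},
        {"places": 500, "bookmarks": 60},
    ]
    for name, targets in build_profiles(sample).items():
        print(name, targets)
```

Each resulting profile is just a set of row counts that a generator could aim for when filling a fresh database.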

I wrote a Python script for the aggregate data collection and database generation.

To make this an easy, fast exercise in software re-use, I used Django’s db module to reverse engineer the Places schema into a set of Python models.

Once you have Django set up you can run the famous ‘manage.py inspectdb’, which queries your SQLite db schema and outputs the corresponding django.db Python classes.
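To give a feel for what schema introspection involves, here is a rough stand-alone sketch that uses only Python's sqlite3 module and SQLite's `PRAGMA table_info`. It is not Django's implementation and its output is not valid Django code; the helper names `describe_table` and `print_model_outline` and the demo table definition are made up for illustration.

```python
import sqlite3


def describe_table(conn, table):
    """Return [(column_name, declared_type), ...] for one table."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) rows.
    rows = conn.execute('PRAGMA table_info("{}")'.format(table)).fetchall()
    return [(name, col_type)
            for _cid, name, col_type, _notnull, _default, _pk in rows]


def print_model_outline(conn, table):
    """Print a rough, model-like outline of a table (an outline, not Django code)."""
    # "moz_places".title() gives "Moz_Places"; dropping "_" gives "MozPlaces".
    print("class {}:".format(table.title().replace("_", "")))
    for name, col_type in describe_table(conn, table):
        print("    {} = <{}>".format(name, col_type or "untyped"))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Simplified stand-in for the real moz_places definition.
    conn.execute("CREATE TABLE moz_places "
                 "(id INTEGER PRIMARY KEY, url LONGVARCHAR, visit_count INTEGER)")
    print_model_outline(conn, "moz_places")
```

inspectdb goes further and emits real model classes with field types, which is why it saves so much typing.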

It’s trivial to inject new rows into the database using django.db:


place = MozPlaces(
    url=my_url,
    title=my_title,
    rev_host=reverse_host(my_url),
    visit_count=1,
    hidden=0,
    typed=1,
    favicon=new_favicon(),
    frecency=1)
place.save()

(‘MozPlaces’ is a django.db ORM class)
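The snippet above calls a `reverse_host` helper that is not shown. Places stores rev_host as the hostname reversed with a trailing dot, so one plausible version looks like the sketch below; this is an assumption about what the helper does, not the code from the actual patch.

```python
from urllib.parse import urlsplit


def reverse_host(url):
    """Return the URL's hostname reversed, with a trailing dot.

    Hypothetical stand-in for the helper used above; urlsplit lowercases
    the hostname and drops any port.
    """
    host = urlsplit(url).hostname or ""
    return host[::-1] + "."


if __name__ == "__main__":
    print(reverse_host("http://www.mozilla.org/firefox"))
```

Storing the host reversed lets a query match a domain and all of its subdomains with a simple prefix comparison.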

Wow, that was easy, but wait, there is more to do.

We are not even attempting to create ‘real’ generated place data, we just want the rows in the database to seem real. We can generate random host, domain, and tld data like this:


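import random
import string

# Assumption: ALPHA is not defined in this excerpt; the lowercase alphabet
# matches the sample output shown further down.
ALPHA = string.ascii_lowercase


def url_parts():
    """
    Return a dictionary like: {'proto': 'http',
                               'host': 'www',
                               'domain': 'foo',
                               'tld': 'com'}
    """
    protocol = ['https', 'http', 'ftp']
    # random.sample draws without replacement, so no letter repeats in a label
    host_len = random.randint(4, 26)
    host = "".join(random.sample(ALPHA, host_len))
    domain_len = random.randint(2, 26)
    domain = "".join(random.sample(ALPHA, domain_len))
    tld_len = random.randint(2, 3)
    tld = "".join(random.sample(ALPHA, tld_len))
    # pick one of the three protocols at random
    proto = random.choice(protocol)
    return {'proto': proto, 'host': host, 'domain': domain, 'tld': tld}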
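Turning such a dictionary into a full URL is a one-liner. The sketch below is an illustration only: `build_url` is a hypothetical helper, and the random hex path from `uuid4` is a stand-in for whatever unique suffix the real generator appends.

```python
import uuid


def build_url(parts, path=None):
    """Assemble proto://host.domain.tld/path from a url_parts()-style dict."""
    if path is None:
        # Stand-in for a unique suffix; the real script's scheme is not shown.
        path = uuid.uuid4().hex
    return "{proto}://{host}.{domain}.{tld}/{path}".format(path=path, **parts)


if __name__ == "__main__":
    example = {'proto': 'http', 'host': 'www', 'domain': 'foo', 'tld': 'com'}
    print(build_url(example))
```

A unique path keeps every generated URL distinct even if two random hosts happen to collide.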

Python’s random module has a ton of cool features. Output from the program shows that we end up with crazy looking hosts:


% python builddb/generate.py

h = httplib2.Http(os.tmpnam())
########################################################
Creating 131901 Places
Creating about 191594 History Visits
Creating about 12779 Bookmarks
Creating 101 Keywords
Creating 2173 Input History Records
########################################################
131901
Place #1 created
https://rmxwunibhvqzgjfclasypedko.zjrlundpaocs.kc/00000120269538042dedec07007f000000010001
Place #2 created
http://hlbgtm.wjxbdquyraotliek.au/000001202695391f62a5444e007f000000010001
Place #3 created
http://zdlxfpavecirty.urjawdvzoxgqemcikl.fp/00000120269539d794891209007f000000010001
Place #4 created
http://viwzykb.ofwxjmvltr.oa/0000012026953ab4b233317e007f000000010001
Place #5 created
https://yphswltjfmrbqogcd.qvd.ozd/0000012026953b539bc78b95007f000000010001
Place #6 created
ftp://pncqvksgazieuhdlofwxrtbymj.oekt.rbk/0000012026953c1f28a069ce007f000000010001
Place #7 created
http://lsmqeaojpxibvgnukwztcryhfd.isryhudzoeqjxtcankfgm.sg/0000012026953ca74487966d007f000000010001

My favorite site of the lot is “yphswltjfmrbqogcd.qvd.ozd”:)

The generation script populates “Places”, History, Bookmarks, Favicons, Input History and Keywords. I still have a few more entity types to generate, but this is sufficient for the testing we need to do now.

The current patch is here: https://bug480340.bugzilla.mozilla.org/attachment.cgi?id=367263

The bug is here: https://bugzilla.mozilla.org/show_bug.cgi?id=480340

The basic lesson learned is that you can build an effective, one-off data collection/metrics tool quickly and easily. I am sure others at Mozilla need tools like this, so do not hesitate to ping me with questions.

I am having a wild time trying to follow the voluminous amount of code involved in the Places component of Firefox. I finally remembered to install the Extension Developer’s Extension, which would not install by default as it has no “secure update method” (or some such complaint by Add-ons). Anyway, I added a couple of configuration options to about:config…

Note: JavaScript Shell is included in the “Extension Developer’s Extension”.

extensions.checkCompatibility (bool) = false <– because I am hacking on Trunk

extensions.checkUpdateSecurity (bool) = false <– because Ext. Devel. Ext. has no secure update?

Not sure exactly, too lazy to find out:) Coming from the Python school, I always prototype and inspect code and live objects in iPython or the Python interpreter. This is a great way to become familiar with new code even before you try to read or run the test suite, especially complex code like Firefox Places. What I wouldn’t do for a Python “inspect”-like module for JavaScript inside chrome.

So anyway, I got all of this running and was in high spirits, but I noticed that the JavaScript shell does not format anything that is spit out into its output div. Tab completion on objects shows you one long single line of completable items, making this basically unusable (for me).

I also started playing with Xush, but it has no tab completion or lightweight stand-alone window, and it had a few bugs on Linux (which I think have been rectified), so I figured I would add some CSS and JS tweaks to make JS shell more my style. Here is what I tweaked:

1. Made all fonts 1em and ‘monospace’ font-family. Like a shell should be:)

2. Check the output to see if it is a function, if so, display in a pre element.

3. Added “prettyprint” source code beautifier so the above-mentioned functions are easier on the eyes – not “emacs classic theme” or anything, but a step in the “iPython” direction.

4. Added ctrl-a and ctrl-e key commands to the input widget. Yay!

5. Open the shell in an 800 x 600 window

6. Added a prompt like the xpcshell prompt: js>

This makes jsshell just a tad bit easier to use and cleaner to boot. I like that it is so lightweight – I tried using ChromeBug too, and it is getting faster, but it can be a bit flaky.

I just think nothing can beat a very lightweight, responsive shell.

I am sure I will keep tweaking it. If you want to get a copy, I have posted my extensiondev.jar that is part of the Extension Developer’s Extension here: extensiondev.jar

postscript:

If I had bothered to look (I’m slow like that) at the Google Code page for Extension Developer’s Extension (http://code.google.com/p/extensiondev/), I would have seen that the formatting bug was reported and the submitted patch was a simple CSS fix!

Well, it was fun. I may have to join the project:)