Archives For Mozilla

JetCrawl

2010/01/16

In an effort to provide realistic data in places.sqlite, I wrote a data generator in Python that inserts many records into Places, and this was a good thing.

The data is entirely made up of random strings, and you end up with URLs like this: http://ffhjhfj.uwtgbz.wsc

The same goes for tags, etc.

In order to make things more realistic and “testable” inside xpcshell, I created a crawler using Jetpack and standard Firefox XPCOM components. I have a feeling that QA might be interested in this as well, so I posted it to the Jetpack Gallery.

It is not configurable from the outside yet, but I have plans for that. I am planning on making JetCrawl surf every night for a set period to increase the collection of data I have to work with.

I need this data for crafting a new Places Query API that is fast and well tested against a rather large collection of bookmarks and history. The URLs that JetCrawl uses are taken from the Alexa Top 100, so it is a common set to boot.

Automated tests will work better with this data since JetCrawl hits predictable URLs and it is the actual Places APIs creating the data in the first place.

If you want to try it out, please use a new profile. You can stop it by closing the tab. There is no UI as of yet.

The timeline work I mentioned earlier has been going well. The patches are pretty well on their way. Until this lands, I am keeping the instrumentation patch unbitrotted each week.

On the “startup” team, we use the Bugzilla whiteboard to mark all startup-related bugs with “[ts]”. In the meantime, before the startup patch lands, feel free to mark a bug with “[ft]” in the whiteboard and list any function names you want timed, along with cold/warm start, platform, etc.

I will run the [ft] query often, add the timers to the instrumentation patch(es), and update the bug with the details.

Let me know if you have any ideas or questions.

Counting Lines

2009/05/03

Out of curiosity, I decided to see how many lines of code are in the mozilla-central repo.

I used CLOC, which stands for “Count Lines of Code”, a fast Perl program: http://cloc.sourceforge.net


ddahl-t500 ~ % perl ~/bin/cloc.pl --exclude-dir=.hg ~/code/moz/mozilla-central/mozilla
34572 text files.
33534 unique files.
56892 files ignored.

http://cloc.sourceforge.net v 1.08 T=134.0 s (204.0 files/s, 45458.4 lines/s)
----------------------------------------------------------------------------
Language             files    blank  comment     code  scale  3rd gen. equiv
----------------------------------------------------------------------------
C++                   3096   246458   281710  1224917 x 1.51 =    1849624.67
C                     1670   166078   241676   894512 x 0.77 =     688774.24
HTML                 10113    92348    19579   625324 x 1.90 =    1188115.60
Javascript            4756   156677   221426   471609 x 1.48 =     697981.32
C/C++ Header          4109   100775   310865   413259 x 1.00 =     413259.00
IDL                   1250    12107        0   120206 x 3.80 =     456782.80
Bourne Shell           220    16398    20518   106269 x 3.81 =     404884.89
Assembly               100     4545     2967    53678 x 0.25 =      13419.50
XML                    650    14244     8639    50284 x 1.90 =      95539.60
Perl                   242     6613    11643    30146 x 4.00 =     120584.00
CSS                    463     7264    12306    30108 x 1.00 =      30108.00
Python                 128     4880    12547    20428 x 4.20 =      85797.60
m4                      22     1873      287    15602 x 1.00 =      15602.00
Java                   129     2782     5736    11613 x 1.36 =      15793.68
DTD                    141     2321     3028    10117 x 1.90 =      19222.30
Teamcenter def          43       60        2     4930 x 1.00 =       4930.00
SKILL                    5       79        3     2680 x 2.00 =       5360.00
make                    94     1554     4818     2620 x 2.50 =       6550.00
Objective C              3      235      211      959 x 2.96 =       2838.64
DOS Batch               42      175      121      621 x 0.63 =        391.23
Bourne Again Shell       5      138      452      503 x 3.81 =       1916.43
C#                       4       57      261      453 x 1.36 =        616.08
Korn Shell               4       44      249      303 x 3.81 =       1154.43
Pascal                   5       56       29      295 x 0.88 =        259.60
MATLAB                   2       48        0      277 x 4.00 =       1108.00
Lisp                     1       30       32      256 x 1.25 =        320.00
PHP                      2       75      126      248 x 3.50 =        868.00
MSBuild scripts          2        2        0      242 x 1.90 =        459.80
XSLT                     8       52       36      234 x 1.90 =        444.60
lex                      2       64       59      223 x 1.00 =        223.00
Visual Basic             7       30      349      143 x 2.76 =        394.68
yacc                     1       15       42       79 x 1.51 =        119.29
awk                      3       15       46       70 x 3.81 =        266.70
SQL                      1        2        0       56 x 2.29 =        128.24
sed                      4        1        0       52 x 4.00 =        208.00
Ada                      1        5        0       49 x 0.52 =         25.48
C Shell                  2       22       13       40 x 3.81 =        152.40
D                        2        7       98       16 x 1.70 =         27.20
Expect                   1        0        0        1 x 2.00 =          2.00
----------------------------------------------------------------------------
SUM:                 27333   838129  1159874  4093422 x 1.50 =    6124253.00
----------------------------------------------------------------------------

So: about 4.1 million lines of code, wow.

What about Chrome?


ddahl-t500 ~ % perl ~/bin/cloc.pl --exclude-dir=.hg ~/code/cpp/home/chrome-svn/tarball/chromium/src
36331 text files.
34106 unique files.
118852 files ignored.

http://cloc.sourceforge.net v 1.08 T=169.0 s (100.0 files/s, 28630.1 lines/s)
----------------------------------------------------------------------------
Language             files    blank  comment     code  scale  3rd gen. equiv
----------------------------------------------------------------------------
C++                   4567   219341   217686  1074656 x 1.51 =    1622730.56
C                      657    76387   135868   526696 x 0.77 =     405555.92
C/C++ Header          5501   142924   330982   489098 x 1.00 =     489098.00
Perl                  1463    98760   135606   270429 x 4.00 =    1081716.00
Python                1202    45766    70503   200647 x 4.20 =     842717.40
Javascript            1678    37060    71865   184861 x 1.48 =     273594.28
Bourne Shell           113    18492    22420   137002 x 3.81 =     521977.62
Assembly                 4       60       44   131490 x 0.25 =      32872.50
HTML                   741     4491     1244    44920 x 1.90 =      85348.00
m4                      18     4361      228    37068 x 1.00 =      37068.00
Objective C            185     6036     7799    29487 x 2.96 =      87281.52
IDL                    342     1666        0    13779 x 3.80 =      52360.20
XML                    208      604      712     6651 x 1.90 =      12636.90
CSS                     30     1319      437     6492 x 1.00 =       6492.00
make                    55      852      960     5860 x 2.50 =      14650.00
Tcl/Tk                  23      810     1196     5507 x 4.00 =      22028.00
yacc                     4      572      171     4532 x 1.51 =       6843.32
Expect                  15        4        2     2113 x 2.00 =       4226.00
MATLAB                  18      222        0     1937 x 4.00 =       7748.00
XSD                      3      143     1129     1587 x 1.90 =       3015.30
C#                      11      182      632     1226 x 1.36 =       1667.36
DOS Batch               31      138       86      649 x 0.63 =        408.87
Teamcenter def          16       29       88      521 x 1.00 =        521.00
XSLT                     2       10       28      294 x 1.90 =        558.60
Korn Shell               1       39       46      223 x 3.81 =        849.63
awk                      6        6       84      211 x 3.81 =        803.91
Java                     4       27        0      106 x 1.36 =        144.16
MSBuild scripts          1        0        7      100 x 1.90 =        190.00
YAML                     1        0        0       84 x 0.90 =         75.60
MUMPS                    1        2        0       29 x 4.21 =        122.09
sed                      2        0       10       26 x 4.00 =        104.00
Ruby                     1       10        7       15 x 4.20 =         63.00
D                        1        3       24       13 x 1.70 =         22.10
PHP                      1        0        0        3 x 3.50 =         10.50
----------------------------------------------------------------------------
SUM:                 16906   660316   999864  3178312 x 1.77 =    5615500.34
----------------------------------------------------------------------------

About 3.2 million lines…
Looks like they are catching up to us. By one metric anyway :P
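
For anyone who wants to double-check the arithmetic, here is a small Python sketch. The figures are copied from the SUM and C++ rows of the two reports above, and the variable names are just mine. It assumes that the “3rd gen. equiv” column is the code count multiplied by the scale factor, which is what the rows above suggest.

```python
# Figures copied from the two cloc reports in this post.
mozilla_code = 4093422    # SUM "code" column, mozilla-central
chromium_code = 3178312   # SUM "code" column, chromium/src

# Assumption: "3rd gen. equiv" = code lines x language scale factor.
cpp_code, cpp_scale = 1224917, 1.51
cpp_equiv = round(cpp_code * cpp_scale, 2)

# Chromium's size relative to mozilla-central, by raw code lines only.
ratio = chromium_code / mozilla_code

print(f"C++ 3rd gen. equiv: {cpp_equiv}")
print(f"Chromium / Mozilla code lines: {ratio:.1%}")
```

By this one metric, the Chromium tree is a bit over three quarters the size of mozilla-central.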

If this doesn’t seem correct let me know.

At Mozilla, we need to understand how Firefox is used in the wild. Knowing what “typical” profiles are like and having automated tests that attempt to model real-world situations is a big plus for writing well-performing code.

Just in case anyone else needs to collect data about Firefox use or model “typical” user data for performance testing, here is how Drew and I quickly put together our “Places” toolkit.

The Sprint info page is here: https://wiki.mozilla.org/Firefox/Sprints/Places_DB_Creation_Scripts

We needed:

1. a client side script that collects places.sqlite metrics

The client side script is JavaScript written by Drew.

His script runs a bunch of aggregate SQL queries against your Places SQLite database and posts the results to the collection URL: https://places-stats.mozilla.com/stats

and

2. A server side script to generate a places.sqlite database based on the metrics we are collecting.

I focused on the database generation.

For now, we are doing this so we can create a test (mock) sqlite database with as many records as we wish, or based on the min, max or average of the users that post to the places-stats collection url.
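
To give a feel for the kind of aggregate queries the collection side runs, here is a rough Python stand-in. Drew's real script is JavaScript and is not shown in this post, so the metric names and the exact queries below are my own assumptions. The demo uses a tiny in-memory table instead of a real places.sqlite.

```python
import sqlite3

def collect_places_stats(conn):
    """Run a few aggregate queries and return the results as a dict."""
    # Hypothetical metric names; the real script collects its own set.
    queries = {
        "place_count": "SELECT COUNT(*) FROM moz_places",
        "avg_visit_count": "SELECT AVG(visit_count) FROM moz_places",
        "max_visit_count": "SELECT MAX(visit_count) FROM moz_places",
    }
    return {name: conn.execute(sql).fetchone()[0] for name, sql in queries.items()}

# In-memory stand-in for places.sqlite with only two of the real columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE moz_places (url TEXT, visit_count INTEGER)")
conn.executemany(
    "INSERT INTO moz_places VALUES (?, ?)",
    [("http://a.example/", 1), ("http://b.example/", 3), ("http://c.example/", 8)],
)

stats = collect_places_stats(conn)
print(stats)
```

Only the aggregate numbers leave the machine in this sketch, not the URLs themselves.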

So the basic flow is:

1. have users visit https://places-stats.mozilla.com and run the collection script.
2. get a large number of users (and varied types of users) posting their stats to the collection url
3. be able to produce a “power user”, “average user”, and “light user” places.sqlite database on the fly from data hosted at places-stats.mozilla.com
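
Step 3 could look something like the sketch below. It assumes that a “light user” maps to the minimum of each metric, an “average user” to the mean, and a “power user” to the maximum. That mapping, the metric names, and the sample submissions are all made up for illustration.

```python
from statistics import mean

def build_profiles(submissions):
    """Derive light/average/power targets from the min, mean and max of each metric."""
    profiles = {"light": {}, "average": {}, "power": {}}
    for metric in submissions[0].keys():
        values = [s[metric] for s in submissions]
        profiles["light"][metric] = min(values)
        profiles["average"][metric] = round(mean(values))
        profiles["power"][metric] = max(values)
    return profiles

# Made-up submissions, for illustration only.
submissions = [
    {"places": 1200, "bookmarks": 40},
    {"places": 15000, "bookmarks": 310},
    {"places": 131901, "bookmarks": 12779},
]

profiles = build_profiles(submissions)
for name, targets in profiles.items():
    print(name, targets)
```

Each profile is then just a set of row counts that the generator can aim for.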

I wrote a Python script for the aggregate data collection and database generation.

To make this an easy, fast exercise in software re-use, I used Django’s db module to reverse engineer the Places schema into a set of Python models.

Once you have Django set up you can run the famous ‘manage.py inspectdb’, which queries your SQLite db schema and outputs the corresponding django.db Python classes.

It’s trivial to inject new rows into the database using django.db:


place = MozPlaces(
    url=my_url,
    title=my_title,
    rev_host=reverse_host(my_url),
    visit_count=1,
    hidden=0,
    typed=1,
    favicon=new_favicon(),
    frecency=1)
place.save()

(‘MozPlaces’ is a django.db ORM class)
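
The reverse_host helper is not shown above. Places stores rev_host as the hostname reversed with a trailing dot, so a minimal version might look like this. Treat it as a guess at the helper, not the code from the patch.

```python
from urllib.parse import urlsplit

def reverse_host(url):
    """Return the URL's hostname reversed, followed by a trailing dot."""
    # Hypothetical implementation; URLs without a hostname yield just ".".
    host = urlsplit(url).hostname or ""
    return host[::-1] + "."

print(reverse_host("http://www.mozilla.org/"))  # gro.allizom.www.
```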

Wow, that was easy, but wait, there is more to do.

We are not even attempting to create ‘real’ generated place data, we just want the rows in the database to seem real. We can generate random host, domain, and tld data like this:


import random
import string

# Assumption: ALPHA is the lowercase alphabet; its definition is not shown in this post.
ALPHA = string.ascii_lowercase

def url_parts():
    """
    Return a dictionary like: {'proto': 'http',
                               'host': 'www',
                               'domain': 'foo',
                               'tld': 'com'}
    """
    protocol = ['https', 'http', 'ftp']
    # random.sample picks without replacement, so no letter repeats within a part
    host_len = random.randint(4, 26)
    host = "".join(random.sample(ALPHA, host_len))
    domain_len = random.randint(2, 26)
    domain = "".join(random.sample(ALPHA, domain_len))
    tld_len = random.randint(2, 3)
    tld = "".join(random.sample(ALPHA, tld_len))
    proto = random.choice(protocol)
    return {'proto': proto, 'host': host, 'domain': domain, 'tld': tld}
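
Putting the parts together into a full URL might look like the sketch below. It is self-contained, so it repeats the random-label idea from above. The random_label and random_url names are mine, ALPHA is assumed to be the lowercase alphabet, and the uuid-based path is only a stand-in for however the real script builds its unique path.

```python
import random
import string
import uuid

ALPHA = string.ascii_lowercase  # assumption: lowercase letters only

def random_label(min_len, max_len):
    """Random string of distinct lowercase letters with a length in [min_len, max_len]."""
    return "".join(random.sample(ALPHA, random.randint(min_len, max_len)))

def random_url():
    """Assemble a fake but well-formed URL from random parts."""
    proto = random.choice(["https", "http", "ftp"])
    host = random_label(4, 26)
    domain = random_label(2, 26)
    tld = random_label(2, 3)
    path = uuid.uuid4().hex  # stand-in for a unique path component
    return f"{proto}://{host}.{domain}.{tld}/{path}"

print(random_url())
```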

Python’s random module has a ton of cool features. Output from the program shows that we end up with crazy-looking hosts:


% python builddb/generate.py

h = httplib2.Http(os.tmpnam())
########################################################
Creating 131901 Places
Creating about 191594 History Visits
Creating about 12779 Bookmarks
Creating 101 Keywords
Creating 2173 Input History Records
########################################################
131901
Place #1 created
https://rmxwunibhvqzgjfclasypedko.zjrlundpaocs.kc/00000120269538042dedec07007f000000010001
Place #2 created
http://hlbgtm.wjxbdquyraotliek.au/000001202695391f62a5444e007f000000010001
Place #3 created
http://zdlxfpavecirty.urjawdvzoxgqemcikl.fp/00000120269539d794891209007f000000010001
Place #4 created
http://viwzykb.ofwxjmvltr.oa/0000012026953ab4b233317e007f000000010001
Place #5 created
https://yphswltjfmrbqogcd.qvd.ozd/0000012026953b539bc78b95007f000000010001
Place #6 created
ftp://pncqvksgazieuhdlofwxrtbymj.oekt.rbk/0000012026953c1f28a069ce007f000000010001
Place #7 created
http://lsmqeaojpxibvgnukwztcryhfd.isryhudzoeqjxtcankfgm.sg/0000012026953ca74487966d007f000000010001

My favorite site of the lot is “yphswltjfmrbqogcd.qvd.ozd”:)

The generation script populates “Places”, History, Bookmarks, Favicons, Input History and Keywords. I still have a few more entity types to generate, but this is sufficient for the testing we need to do now.

The current patch is here: https://bug480340.bugzilla.mozilla.org/attachment.cgi?id=367263

The bug is here: https://bugzilla.mozilla.org/show_bug.cgi?id=480340

The basic lesson learned is that you can build an effective, one-off data collection/metrics tool quickly and easily. I am sure others at Mozilla need tools like this, so do not hesitate to ping me with questions.

It’s quite amazing how much traction we get out of IRC at work. It’s pretty much a rule to keep the conversation inside IRC. This makes a whole lot of sense, as we end up with all kinds of documentation in the chat logs. I am very accustomed to chatting for work conversation, debugging and whatnot. I love the decentralized model. This is above and beyond what I have ever experienced. So cool.

Between IRC, blogs and Bugzilla, we are all having a lot of online conversations. Email is important, but secondary. I likes. 99% of the communication is captured in a pretty meaningful way, unlike the typical corporate reliance on email, which is so full of spam and nonsense.

The cool thing about really getting your hands dirty deep inside of chrome is learning how the Mozilla developer workflow works. So cool. As an extension developer, I was so used to a workflow where I code, restart Firefox and test. When you work on Mozilla, you code and write tests, run make, run make check on the module or modules you are touching, start your build, and then test if need be…

So cool. I am having a ton of fun. At my last job I was leading a project, and now I am back in the position of a student, my favorite place to be :)