Posts Tagged ‘statistics’

datanommer: Making Fedora metrics more transparent

June 10th, 2010

I kind of surprised myself when I realized I hadn’t blogged about this yet. I talked about it with Max, I talked about it with folks in #fedora-infrastructure, and I’m giving a talk at SELF that circles around this very project.

Ever since it began collecting statistics about itself, the Fedora Project has been open and transparent about the numbers we get and how we get them.

There’s just one problem with that: a lot of the actual raw data isn’t publicly available.

Of course, we don’t want to go about publishing raw httpd access logs to public locations. We don’t want everybody to be able to see the IP addresses that visit fedoraproject.org. But we do want people to be able to come up with a number for themselves that answers questions like “how many distinct IP addresses visited fedoraproject.org between January 4 at 4:32 a.m. and February 2 at 6:28 p.m.?” without giving everybody access to our log servers.

Or, even if the data is publicly available, it’s difficult to get at because the application doesn’t provide any sort of API (Mailman, for example). Writing a screen scraper for Mailman is non-trivial.

What if there was a central API that held raw data about the everyday activity of the Fedora community?

I plan to write that. And it shall be called “datanommer.” It’ll use the TG2 (TurboGears 2) stack, at the request of Infrastructure, and, although it will be designed around Fedora’s existing infrastructure, it will be project-agnostic so that other free software projects can use it right out of the box.

Here’s a quick summary of how it’ll work.

  • Applications that already make log files will have those transferred to our log servers by normal means. Applications that don’t already make log files will either use an extension, module or the like to write a log file, or an external script will create a log file, which will then be transferred to the log servers.
  • A cron job will populate a database used for datanommer based on those log entries.
  • The TG2 front end of datanommer will provide a RESTful API to access the data in the database. Applications that provide data and what data they provide to datanommer will be automatically documented for maximum usability.
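The middle step above (a cron job turning log entries into database rows) could look roughly like the following. This is only a minimal sketch: the `hits` table, its columns, and the sample log lines are made up for illustration, and SQLite stands in for whatever database datanommer ends up using.

```python
import re
import sqlite3
from datetime import datetime

# Common Log Format prefix: host ident user [timestamp] "request" status size
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)


def parse_line(line):
    """Return (timestamp, host, path, status), or None if the line doesn't parse."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    when = datetime.strptime(match.group("when"), "%d/%b/%Y:%H:%M:%S %z")
    return (when.isoformat(), match.group("host"),
            match.group("path"), int(match.group("status")))


def load_log(conn, lines):
    """Insert every parseable log line; return how many rows were added."""
    # Hypothetical schema, not datanommer's actual one.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits "
        "(happened_at TEXT, host TEXT, path TEXT, status INTEGER)"
    )
    rows = [row for row in map(parse_line, lines) if row is not None]
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Made-up sample lines using documentation-only IP addresses.
SAMPLE = [
    '192.0.2.10 - - [04/Jan/2010:04:32:00 +0000] "GET /wiki/Main HTTP/1.1" 200 5120',
    '192.0.2.11 - - [04/Jan/2010:04:33:10 +0000] "GET /get-fedora HTTP/1.1" 200 2048',
    'not a log line',
]

conn = sqlite3.connect(":memory:")
added = load_log(conn, SAMPLE)
print(added, "rows loaded")
```

Unparseable lines are skipped rather than treated as fatal, so one odd entry doesn't stop the hourly run.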

At first glance, this may seem like a lot of hoops just to get some data. But here are some reasons we’re doing it this way, specifically:

  • Less load on the app servers. If we programmed datanommer to collect data from each application about once per hour, the app servers and databases would be under somewhat heavy load while that data is generated. Reading log files that the applications write anyway avoids that.
  • If datanommer is down for some reason, it doesn’t matter, because data entry is done directly to the database.
  • If the database is down for some reason, it doesn’t matter. The cron job will just wait another hour to populate the database.
  • If the log servers are down for some reason, it doesn’t matter. Logs are generated locally on each app server, much like httpd. The log servers will go through and pick up the logs when they get around to it.
  • If the applications are down for some reason, they won’t be generating any data anyway, so it doesn’t matter. :)

For the end-user, accessing the data will be extremely easy. Since the REST API will be driven by plain URLs and query parameters, you don’t have to be an expert to download data. It’ll be encoded in JSON so it’s easy to use in any language (especially Python, the lingua franca of Fedora Infrastructure).
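To give a feel for what that might look like from the client side, here is a minimal sketch. Nothing here exists yet: the base URL, the `site`/`start`/`end` parameters, and the shape of the JSON response are all assumptions made up for illustration.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; datanommer's real URLs are not decided yet.
BASE_URL = "https://datanommer.example.org/api/pageviews"


def build_query(start, end, site):
    """Build a GET URL from plain query parameters."""
    params = {"site": site, "start": start, "end": end}
    return BASE_URL + "?" + urlencode(params)


def distinct_visitors(payload):
    """Count distinct visitor tokens in a JSON response body."""
    records = json.loads(payload)["records"]
    return len({record["visitor"] for record in records})


url = build_query("2010-01-04T04:32", "2010-02-02T18:28", "fedoraproject.org")

# A made-up response body standing in for what the server might return.
SAMPLE_RESPONSE = (
    '{"records": [{"visitor": "a1"}, {"visitor": "b2"}, {"visitor": "a1"}]}'
)

print(url)
print(distinct_visitors(SAMPLE_RESPONSE))
```

In real use the URL would be fetched with something like `urllib.request.urlopen` and the body passed to `distinct_visitors`; the sample string just keeps the sketch runnable offline.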

Of course, your thoughts about this process are definitely wanted. You can comment on this blog post to leave your suggestions.

Edit: I forgot to include a bit about privacy — information that shouldn’t be publicly available, such as IP addresses or email addresses, will be stored in the database as UUIDs. Another table in the database will relate UUIDs to their original values for the purposes of allowing statistics to determine pageviews from distinct IP addresses, for example. Privacy is of top priority in this project and if we feel like we’re infringing on the privacy of our users and contributors too much, we will not report that information through this system.
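A rough sketch of that idea, assuming an in-memory dictionary in place of the second database table (the `Pseudonymizer` class and the sample addresses are invented for illustration):

```python
import uuid


class Pseudonymizer:
    """Map sensitive values (IPs, email addresses) to stable random UUIDs."""

    def __init__(self):
        # Stands in for the private table relating UUIDs to original values.
        self._by_value = {}

    def token_for(self, value):
        """Return the UUID for a value, creating one the first time it is seen."""
        if value not in self._by_value:
            self._by_value[value] = str(uuid.uuid4())
        return self._by_value[value]


# Made-up hits using documentation-only IP addresses.
hits = ["192.0.2.10", "192.0.2.11", "192.0.2.10"]

pseudo = Pseudonymizer()
public_rows = [pseudo.token_for(ip) for ip in hits]

# Distinct visitors can still be counted without exposing any address.
distinct = len(set(public_rows))
print(distinct)
```

Because the same address always maps to the same UUID, distinct-visitor counts survive, while the mapping back to real values stays private.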


FAD NA 2010: “Can you see…”

May 26th, 2010

One of the great advantages of having membership within the Fedora Project (including all the little subgroups like ambassadors) centralized in FAS is that you can write a simple script to get some meaningful numbers.

We were discussing ambassador mentoring at FAD NA 2010 and one of the many proposals tossed back and forth was to require that ambassadors are within the project for a period of time before they apply to be an ambassador. David Nalley asked the group: “How long do people wait now before they join the ambassadors group in FAS?” Three seconds later he turned to me and asked me to write a script to do that. Here it is.

This script downloads a bunch of group data from FAS (which takes a little while because it needs to grab cla_done), finds users who have signed the CLA (approved in cla_done), and who have applied to be an ambassador but have not yet been approved. It then determines the amount of time the user spent between signing the CLA and applying to be an ambassador in FAS (what we’ll call the “delta”). It prints two lines: the first is a sorted Python list of the deltas, converted to seconds; the second is the number of users the list describes (a count of the elements in the list).

(It should be noted that there is a cutoff for the usability of time-based data in FAS. For some reason or another, whether because FAS1 didn’t track times or because the upgrade to FAS2 overwrote them, timestamps for group joins and approvals are all horribly wrong before March 12, 2008 at 02:06 UTC. See line 11 in the script.)
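The core of the calculation can be sketched without FAS at all. This is not the linked script: the two dictionaries below stand in for the group data it downloads, the usernames and timestamps are invented, and dropping anything before the cutoff is an assumption about how the bad timestamps are handled.

```python
from datetime import datetime, timezone

# Timestamps before this are unreliable, per the note above.
CUTOFF = datetime(2008, 3, 12, 2, 6, tzinfo=timezone.utc)


def application_deltas(cla_signed, ambassador_applied):
    """Return sorted deltas in seconds between CLA signing and applying."""
    deltas = []
    for user, applied in ambassador_applied.items():
        signed = cla_signed.get(user)
        # Skip users without a CLA record or with pre-cutoff timestamps.
        if signed is None or signed < CUTOFF or applied < CUTOFF:
            continue
        deltas.append(int((applied - signed).total_seconds()))
    return sorted(deltas)


def utc(*args):
    """Shorthand for a timezone-aware UTC datetime."""
    return datetime(*args, tzinfo=timezone.utc)


# Hypothetical users and times, for illustration only.
cla_signed = {
    "alice": utc(2009, 1, 1, 12, 0, 0),
    "bob": utc(2009, 2, 1, 0, 0, 0),
    "carol": utc(2007, 6, 1, 0, 0, 0),   # before the cutoff, ignored
}
ambassador_applied = {
    "alice": utc(2009, 1, 1, 12, 1, 5),
    "bob": utc(2009, 2, 2, 0, 0, 0),
    "carol": utc(2009, 3, 1, 0, 0, 0),
    "dave": utc(2009, 4, 1, 0, 0, 0),    # never signed the CLA, ignored
}

deltas = application_deltas(cla_signed, ambassador_applied)
print(deltas)
print(len(deltas))
```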

As of the FAD, here’s the data it produced (with line breaks added):

[65, 71, 90, 100, 117, 157, 177, 359, 367, 390, 432, 455, 518, 1032, 4174, 4327,
10162, 18168, 21257, 66571, 120267, 122254, 230746, 451587, 904754, 1293886,
1378508, 2001388, 2619665, 3862083, 6272559, 10794330, 15915004, 19977760,
36867582, 39432762]
36

Some conclusions we can make based on this data:

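  • The mean delta was about 3,954,837 seconds (roughly 45.8 days), pulled up by a handful of very long waits. The median was 19,712.5 seconds, which is about 5.5 hours.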
  • 20 of the 36 users (55.6%) had a delta of less than a day (86400 seconds). 7 of the 36 users (19.4%) had a delta of less than 5 minutes (300 seconds).
  • The maximum delta was 456.4 days (about 15 months).
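If you want to check figures like these yourself, the list pastes straight into Python. A short sketch using only the standard library (the variable names are mine, not the script’s):

```python
from statistics import mean, median

# The 36 deltas (in seconds) printed by the script, as listed above.
deltas = [
    65, 71, 90, 100, 117, 157, 177, 359, 367, 390, 432, 455, 518, 1032,
    4174, 4327, 10162, 18168, 21257, 66571, 120267, 122254, 230746,
    451587, 904754, 1293886, 1378508, 2001388, 2619665, 3862083,
    6272559, 10794330, 15915004, 19977760, 36867582, 39432762,
]

DAY = 86400        # seconds in a day
FIVE_MINUTES = 300

under_day = sum(1 for d in deltas if d < DAY)
under_5min = sum(1 for d in deltas if d < FIVE_MINUTES)
max_days = round(max(deltas) / DAY, 1)

print("users:", len(deltas))
print("under a day:", under_day, round(100 * under_day / len(deltas), 1), "%")
print("under 5 minutes:", under_5min, round(100 * under_5min / len(deltas), 1), "%")
print("maximum, in days:", max_days)
print("mean, in seconds:", mean(deltas))
print("median, in seconds:", median(deltas))
```

With data this skewed, the mean and the median tell very different stories, so it’s worth printing both.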

If you look to the comment on line 19 of the script, it’s a simple one-line change to get data for those who already have become ambassadors as well. Here’s the data for that, as of now(ish), with line breaks added:

[-7046682, -2244969, -2169415, -2105694, -1210664, -946193, -171773, -132781,
-105235, -88070, -11491, -2193, -380, -70, -31, -30, -13, 18, 19, 22, 26, 26,
26, 33, 33, 33, 36, 39, 40, 41, 43, 46, 47, 47, 47, 57, 59, 60, 61, 62, 62, 66,
66, 66, 67, 68, 69, 71, 71, 75, 76, 76, 77, 80, 85, 90, 90, 90, 92, 93, 95, 96,
98, 104, 105, 106, 109, 109, 109, 110, 111, 118, 119, 119, 120, 120, 127, 128,
131, 134, 135, 137, 139, 143, 145, 145, 146, 150, 150, 152, 155, 156, 158, 159,
168, 169, 176, 183, 185, 189, 191, 194, 194, 196, 198, 199, 205, 210, 211, 214,
217, 222, 222, 225, 237, 240, 243, 245, 252, 256, 258, 262, 264, 270, 272, 272,
278, 283, 294, 294, 295, 296, 297, 304, 306, 319, 321, 321, 323, 328, 335, 343,
346, 353, 361, 374, 378, 400, 400, 402, 412, 421, 441, 450, 452, 452, 455, 478,
484, 491, 520, 531, 575, 589, 607, 607, 621, 648, 658, 663, 705, 720, 722, 724,
732, 733, 738, 749, 753, 814, 827, 832, 874, 880, 929, 950, 956, 1012, 1014,
1041, 1046, 1131, 1286, 1381, 1408, 1430, 1559, 1577, 1821, 1845, 1887, 1906,
1971, 2028, 2165, 2195, 2424, 2479, 2640, 2901, 2934, 3094, 3339, 3354, 3364,
3413, 3414, 3711, 4874, 5386, 5426, 5577, 6329, 7416, 8916, 11001, 18324, 18575,
19330, 19936, 21462, 24887, 27708, 28870, 31331, 37117, 37872, 43673, 45269,
45565, 48128, 49488, 63696, 66359, 68765, 69655, 69813, 70958, 73441, 75468,
76693, 78022, 80469, 81074, 83926, 84313, 85884, 94732, 97918, 109199, 132682,
153970, 159001, 159096, 166200, 167190, 172526, 203033, 209366, 232599, 254839,
298215, 335812, 338047, 346164, 347030, 350391, 373753, 390049, 402758, 419056,
419722, 426483, 473510, 516436, 573911, 602051, 677595, 692417, 760878, 763579,
765369, 856220, 857455, 988386, 988834, 1000077, 1100141, 1208640, 1209160,
1296560, 1298298, 1391236, 1399265, 1409442, 1462069, 1468372, 1475776, 1549503,
1551292, 1556641, 1570053, 1644704, 1724047, 1727078, 1736449, 1819393, 1852417,
1883617, 1908922, 1969031, 1989497, 2075824, 2122750, 2139385, 2145740, 2186876,
2267192, 2292659, 2410660, 2430179, 2503012, 2594221, 2644249, 2699353, 2711578,
2826634, 2905727, 2917899, 2926825, 2928264, 3087834, 3130616, 3133132, 3772561,
4058559, 4446452, 4477283, 4590461, 4666894, 4771861, 4809502, 4868847, 5005004,
5058314, 5092264, 5183777, 5196236, 5411273, 5593249, 5628497, 5873109, 5947922,
6105292, 6240295, 6368175, 6488855, 7137656, 7348233, 7412019, 7524910, 7695694,
7712467, 7743736, 7950337, 8184019, 8226472, 8898541, 9143874, 9157720, 9354098,
9481789, 9552013, 9850428, 10295579, 10468848, 11302343, 11365382, 11483738,
11680912, 12374970, 12556286, 12776962, 12916884, 14004298, 14098912, 14506093,
14567374, 14836520, 15074649, 15868294, 16877210, 16920294, 17261366, 17462813,
17654050, 18496770, 18578171, 19207671, 19240507, 20335751, 20650780, 21510299,
21576474, 22797578, 25967324, 26705809, 26819684, 27315401, 27475767, 27628951,
28697835, 29272369, 29484943, 30322585, 30675304, 31282206, 31359463, 35558509,
36867582, 37016239, 37389204, 40520264, 43289246, 45256091, 45268939, 49846083,
56418326]
438

The first thing I noticed was that there were negative numbers. (lolwut?) These probably date from before FAS had the ability to require that you were in cla_done before you joined ambassadors.

The main reason I’m posting about this is because I want to show that it’s really easy to pull group information from FAS and start messing with numbers. Take a look at pydoc fedora.client.fas2 and some other modules inside python-fedora. Looking at numbers can help you figure out what you can do within Fedora to help the project move along. (As for the proposal to require a certain amount of time as a contributor before becoming an ambassador, I’m not sure where that ended up. I think we determined it was unneeded, but I can’t quite remember.)


Repositioning myself within the Fedora Project

March 21st, 2010

After talking with a few people recently and doing some self-analysis, I feel like it’s time to make a major shift in what I do within the Fedora Project. My Fedora résumé so far has consisted mostly of wiki czaring,1 package maintenance and other odds-and-ends jobs others kindly ask me to do.

I’m presently concerned with the second in that list: increased stress and less free time due to school, combined with the speed of discussion on package maintenance and release engineering, make it a losing game. In the next few weeks, I’ll be checking all of my packages and determining which ones have dead or slow upstreams or bugs that I can’t resolve on my own. Those packages will likely be orphaned, and if nobody wants to care for them, so be it.

The two others? Wiki czaring is fine, but I need to improve on it a bit (see the footnote), and I always enjoy the random problems that I can help quickly solve for people. This being said, development on mw, supybot-fedora and other convenient software is (hopefully) Not Going Away™ any time soon.

With the pushing away of my first Fedora love, package maintenance, I’ve found something new to focus on. Through my internship with Red Hat last year, I discovered that there is a large deficit of good statistics about our community. There’s a large deficit of good statistics about most free software communities, according to some random Google keywords I just tried, apart from “this is how many times our product has been downloaded.” I really loved the opportunity to combine my self-proclaimed mad Python skillz with answering other people’s questions, such as:

  • How many contributors does Fedora really have? And according to these standards/filters?
  • How often is the wiki edited and when?
  • How many “things” has this random dude over here done? Do we consider that “active”?
  • How many vague statistically-related questions can we come up with on devel@l.fp.o or during a marketing meeting?

Some of these, obviously, have no answer. Yet.

When I finally graduate from high school, I’ll be pushing full swing into answering these sorts of things. Until then, you can help me make Fedora a better place by simply telling us what you want to see tallied up. I asked this about 9 months ago and I got a lot of responses — thank you. But with recent discussions about the future of Fedora and a lot of claims about our user and contributor bases not being backed up (not pointing fingers), I think there are even more questions that can be answered. Please add your statistically-inclined questions to [[Statistics 2.0]] and I’ll do my best in the near future to get them answered with statistics on our community.

I also love help. (Shout out to joshkayse who is taking the lead on making it simple to find a single contributor’s actions within Fedora, taking inspiration from Mel’s FAS scraper.)

Quick summary: Maintaining packages is a drag (for me) right now. I like taking questions and answering with numbers. I graduate soon. Ask questions.

1 While writing this I decided to Google for “fedora wiki czar”. What I found was a mysterious character who was appointed as such in a community touting full transparency. Mel brought this to my attention the other day: I really suck at providing transparency into the process of administering the wiki. It’s pretty much on a whim. It shouldn’t be this way.


Community statistics in Fedora and beyond — and where it’s going from here

September 10th, 2009

During my summer internship with Red Hat’s Community Architecture team, my main assignment was to build an automated platform (which eventually was built into Fedora Community) for generating and displaying statistics within our community.

Needless to say, it didn’t get done. :) But it did get a healthy start, and even though the last couple of months I haven’t been extremely active in Fedora, it’s still alive and well.

This week, I started working on a research paper for my independent study at my high school. This independent study just happens to be continuing work on the project that I started a couple of months ago. The paper will include mostly primary sources of what people have said on Stats 2.0’s discussion page on the wiki, but I would love to talk with people on IRC about what they think is important to track so we can analyze not only the growth of Fedora, but the growth of the community.

It doesn’t end with the one-semester independent study. I am presenting on this subject at UTOSC 2009. In this presentation I will discuss many of the variables of a free software community that can be tracked, and even provide example code and where to get started on automatically tracking them.

So, there’s the state of Stats 2.0. Would you like to speak with me on IRC sometime about what you think is important to be tracked?


Statistics 2.0: The beginnings

July 6th, 2009

I’ve been doing some work on getting a Statistics application in Fedora Community. It’s very weak as it stands — only shows you two wiki-related things right now — but now that I’ve kind of meandered around the code a bunch, I think I have a better idea of what I’m doing, and it shouldn’t be difficult to churn out code for other parts of Fedora’s stuff now.

Currently, we have a Grid widget and a Flot widget. Grids are used for displaying data in, well, a grid, and Flot widgets are used for nice, pretty charts. (The awesome thing about Flot is that it uses pure JavaScript, drawing right in the browser, to create charts. How about that?!)

I need to thank Luke Macken and J5 for all the help I’ve gotten from them so far. :)

So let’s go through how you can test this and see the magic unfold. (And potentially figure out how to write code for your own use cases!)

  1. Install Luke’s repo file for TurboGears 2. It’s not all in Fedora yet so this is necessary. You can find the repo files at http://lmacken.fedorapeople.org/rpms/tg2/.
  2. Install moksha.
    # yum install moksha
  3. Pull fedoracommunity.git.
    $ git clone git://git.fedorahosted.org/fedoracommunity.git

    (If you’ve got a FAS account and you’ve ever used Hosted before, it’s usually a good idea to use ssh://, IMHO. Makes it easier to push later if you get push access. The URL for that is ssh://git.fedorahosted.org/git/fedoracommunity.git)

  4. Create the stats branch locally, and pull from the remote stats branch.
    $ cd fedoracommunity/; git checkout -b stats; git pull origin stats
  5. Now, I don’t know if this is the proper way to do it, but it’s portable and it works. Before you can start up paster to serve the content, the egg metadata (the .egg-info directory) needs to be generated.
    $ python setup.py egg_info
  6. Then you can run paster:
    $ paster serve development.ini

    And yay!

The important files to note are fedoracommunity/connectors/wikiconnector.py, fedoracommunity/mokshaapps/statistics/widgets/wiki.py, and fedoracommunity/mokshaapps/statistics/templates/wiki_active_pages.mak. My next priority is to get stats for how FAS groups grow over time.

Happy hacking! :)


Gnumeric and Excel

August 10th, 2008

I just ran across a neat little PDF showing problems in Gnumeric that its developers fixed, while the same problems in Excel went unfixed by Microsoft. Most of these are statistics-related, and I do have a bit of background on them.

This is quite interesting, and in my opinion, proves the spirit of open source. ;)
