Thursday, August 18th, 2005

What’s the PageRank of your university and what’s yours and are they related?

Filed under: Academia/Research — Daniel Lemire @ 13:25

PageRank is Google’s well-known measure of how influential a web page is.

Name of the school PageRank
Stanford University 9
University of Toronto (I got my B.Sc. and M.Sc. there!) 9
University of Waterloo 8
École Polytechnique de Montréal (got my Ph.D. there) 8
Simon Fraser 8
UBC 8
University of Calgary 8
University of Alberta 8
Université de Montréal (I did a post-doc there) 8
Université de Sherbrooke (I taught a couple of courses there) 8
Dalhousie University 8
McGill University 8
Concordia University 8
Université Laval 8
University of New Brunswick (I am an adjunct prof. there) 7
Laurentian University 7
Université du Québec à Montréal (my employer!) 7
Acadia University (I was an assistant prof. there once) 7
Université de Moncton 7
Université du Québec (main site) 7
Université du Québec à Chicoutimi 7
Université du Québec en Outaouais 7
Université du Québec à Trois-Rivières 7

I can also compare the PageRank of some researchers’ home pages…

Name of the researcher PageRank
Jim Gray 7
Daniel Lemire (old URL) 6
Peter Turney 6
Harold Boley 6
Owen Kaser 5
Martin Brooks 5
Dan Kucerovsky 5
Will Fitzgerald 5
Serge Dubuc 5
Eamonn Keogh 5
Alberto Mendelzon 5
Guy Mineau 4
Yuhong Yan 4
Robert Godin (old page) 4
Philippe Gabrini 4
Mamadou Tadiou 4
Guy Tremblay 2

Conclusions so far:

  1. Larger English-speaking schools get a higher PageRank. In Canada, the largest school, the University of Toronto, had the top score (equal to Stanford’s). Somewhat smaller but very reputable schools like McGill don’t stand out.
  2. There seems to be no obvious correlation between the PageRank of a researcher’s home page and that of their university. This is somewhat surprising. Researchers at Laval University scored really badly, for example, despite a solid PageRank for their university.

Working too much is bad for your health

Filed under: — Daniel Lemire @ 12:30

These medical studies are often funny because they state the obvious and seem to have cost a bundle. That being said, proving the obvious can be hard sometimes. Ever tried to prove that a closed loop has an inside and an outside? Camille Jordan thought it was easy and he made the textbooks by proving it!

With this renewed interest in the obvious, consider what I found in a recent research report:

Those who routinely put in overtime or work a long day are thought to be at heightened risk for a variety of ailments, including high blood pressure, heart disease, depression, diabetes, chronic infections, general health complaints, and even death, write the researchers.

Don’t touch XML Schema

Filed under: — Daniel Lemire @ 8:01

A year or two ago, speaking against XML Schema meant speaking against what the W3C touted as the one thing that would replace DTDs.

Now the cat is out of the bag: XML Schema is a complete failure. Among other testimonies, I found the following quote on Cafe con Leche XML:

I wrote an XML Schema for SVG Full 1.1, and another for SVG Tiny 1.1. Doing so taught me a number of things:

  • 85% of XML Schema is thoroughly useless and without value;
  • the few useful features are weak and without honour;
  • creating a modularized XML Schema is easier than with DTDs, but nowhere near as simple as with RNG;
  • while a zillion useless features have been included in the spec, anything useful such as making attributes part of the content model has obviously been weeded out with great care, basically leaving one with DTDs supporting namespaces, a few cardinality bits, no entities, and loads of cruft;
  • tools like XML Spy that are supposed to help one write schemata will produce very obviously wrong instances, meanwhile the syntax of XML Schema was obviously produced by someone who grew up at the bottom of a deep well in the middle of a dark, wasteful moor where he was tortured daily by abusive giant squirrels and wishes to share his pain with the world;
  • the resulting schema is mostly useless anyway as there is no tool available that will process it correctly.

–Robin Berjon on the xml-dev mailing list, Sunday, 09 Jun 2005 11:59:45

This is very interesting for deeper reasons. It tells us that the W3C, which had been doing a reasonable job until now, has stumbled in a big way and, alongside the good things it still does, produces crap. It is no longer the reference as far as the web is concerned.

Wednesday, August 17th, 2005

How many emails are we sending?

Filed under: — Daniel Lemire @ 9:23

Earlier, I was interested in how many pages Google indexes over time. Today, I’m interested in how many emails are sent. After two hours of research, I still haven’t found a reliable source. However, a paper by Jaeyeon Jung and Emil Sit gives me some indication at least. It seems that between 2000 and 2004, for one particular mail server, the number of SMTP connections accepted daily went from around 25k to 320k, which suggests that email traffic doubles roughly every 13 months.
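For anyone who wants to redo the arithmetic, here is a minimal sketch of the doubling-time estimate. It assumes steady exponential growth and that the two measurements are exactly four years (48 months) apart. The function name is mine, not something from the paper.

```python
import math

def doubling_time_months(start_value, end_value, elapsed_months):
    """Months needed to double, assuming steady exponential growth."""
    doublings = math.log2(end_value / start_value)
    return elapsed_months / doublings

# Figures quoted above: about 25k daily connections in 2000, about 320k in 2004.
# Assumption: the two measurements are exactly 48 months apart.
months = doubling_time_months(25_000, 320_000, 48)
print(f"doubling time: {months:.1f} months")
```

A single mail server is a thin basis for a global estimate, so this only confirms the arithmetic, not the trend.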


Does anyone have more accurate figures?

ACM SAC 2006 Data Mining Track (September 3, 2005 / April 23-27, 2006)

Filed under: Passed CFP — Daniel Lemire @ 8:06

Just received the ACM SAC 2006 Data Mining Track CFP. ACM SAC 2006 will be held in Dijon, France. A popular city these days.

Recent interests in combining data mining technology with database management systems, and extending XML to support data mining offer significant promises for modern applications that require the support of declarative and ad hoc querying in an increasingly distributed and heterogeneous environment. These emerging applications pose new challenges and demand novel solutions. Researchers have realized that fusing techniques from machine learning, statistics, information theory into data mining will result in clear benefits and enhance the state of the art in data mining. The goal of this track, thus, is to encourage researchers and practitioners to address some of these challenges, help cross fertilization of ideas among different focus groups and provide a common forum for the exchange of ideas in an informal environment.

Voluntary academic simplicity

Filed under: Academia/Research — Daniel Lemire @ 7:45

Here’s a nice article stating that “Money Can Buy Happiness, Up to a Point.” The gist of it is that you are happy if you are richer than your peers. Here’s the consequence of this theory:

As incomes in the U.S. tend to rise over the course of our lifetimes, individuals may find themselves on a sort of treadmill, consuming more and more just to maintain a constant level of happiness, they write.

I certainly buy this theory. Looking at the gigantic houses being built around where I live, and considering that people have no more than 2 kids these days, there is clearly a “must have a larger house than my friends” syndrome.

This “rising standards for judging human beings” phenomenon can be observed everywhere. We see it in academia a lot where we expect a lot more from younger professors than we did 10 or 20 years ago, but it also applies to many other fields. However, people can’t continuously earn more, publish more, work more and know more. Human beings are scalable, but only up to a point. There comes a time when the treadmill will break your legs.

As for “how much money you must earn”, I think the solution is quite clearly to leave the treadmill now and choose voluntary simplicity. It is easy and all you risk is being happier.

Is there an academic voluntary simplicity? A way of life where you choose not to continuously publish more, go for larger grants, and have more students? There are people who choose this path, I know some, but nobody has ever attempted to organize and promote this choice. I think we ought to do it.

Tuesday, August 16th, 2005

Number of pages indexed by Google over time

Filed under: — Daniel Lemire @ 11:48

I could not find out how many pages Google indexed over time but I found the self-reported numbers of Google’s timeline. Here they are in a convenient table:

When Number of web pages indexed
May-June 2000 1 billion
November-December 2000 1.3 billion
July-August 2002 2.5 billion
November-December 2002 4 billion
January-February 2004 4.28 billion
November-December 2004 8 billion
August 2005 8.2 billion

It seems, roughly speaking, that the number of pages indexed doubles every 16 months or so, though the growth sometimes seems to come in sudden jumps. It is likely that the number of pages indexed depends on hardware, and Google probably purchases its hardware in large batches. Also, competition from other search engines is likely to be a factor.
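One way to put a number on the growth rate is to fit a straight line to log2 of the index size against time. The sketch below does this with an ordinary least-squares fit over the table rows. Pinning each date range to its last month, and counting months from January 2000, are my assumptions. The estimate is sensitive to those choices and to which rows are used, so treat it as a ballpark figure.

```python
import math

# (months since January 2000, billions of pages), taken from the table above.
# Assumption: each date range is pinned to its last month.
observations = [
    (6, 1.0),    # May-June 2000
    (12, 1.3),   # November-December 2000
    (32, 2.5),   # July-August 2002
    (36, 4.0),   # November-December 2002
    (50, 4.28),  # January-February 2004
    (60, 8.0),   # November-December 2004
    (68, 8.2),   # August 2005
]

def doubling_time(points):
    """Least-squares slope of log2(size) against time, inverted into months per doubling."""
    xs = [t for t, _ in points]
    ys = [math.log2(size) for _, size in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return 1.0 / slope

doubling_months = doubling_time(observations)
print(f"estimated doubling time: {doubling_months:.1f} months")
```

Passing only two rows, such as the first and the fourth, gives a noticeably different figure than the full fit. That is one reason to read any single doubling time loosely, given how uneven the reported jumps are.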

Update: I’m sorry. I was motivated partly in my quest by a post by Will which I read earlier today. I should have given him credit. Will also points out that the index size as reported by Google needs to be verified. I really like this link Will sent me about why the search problem is hard.
