When to use the geometric mean?

This is better documented elsewhere, but I could not find a quick reference on the web as to when you’d want to use the geometric mean instead of the arithmetic (usual) mean.

  • Suppose that I’m 30% richer than last year, but last year I was 20% richer than the year before. What is the average growth? If w is my wealth two years ago, my current wealth is 1.3 * 1.2 * w = 1.56 * w. If t is the average growth factor over the last two years, then my current wealth should also be t * t * w. Setting t = 1.25 (the arithmetic mean) is the wrong answer, since 1.25 * 1.25 = 1.5625. Choosing t = sqrt(1.3 * 1.2), the geometric mean, solves the problem.
  • Another case where the geometric mean makes sense is when you are stuck averaging numbers that are not comparable, like the time necessary to build a data cube versus the average query time. Indeed, if a and b are two numbers and a is much smaller than b, then (2a + b)/2 is about the same as (a + b)/2. One component of your system became twice as slow and yet you get nearly the same average performance? That’s wrong. Comparing sqrt(ab) with sqrt(2ab) makes much more sense: doubling either component scales the geometric mean by sqrt(2).
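Both cases above can be checked with a few lines of Python. This is only a sketch: the starting wealth w = 100 and the timings a = 1, b = 1000 are made-up values chosen for illustration, and geometric_mean is a helper written for this example.

```python
import math

def geometric_mean(values):
    """Geometric mean of positive numbers, computed through logarithms."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("need a non-empty list of positive numbers")
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Case 1: growth factors of 1.3 and 1.2 over two years.
w = 100.0                        # hypothetical wealth two years ago
actual = 1.3 * 1.2 * w           # what I really have now
t = geometric_mean([1.3, 1.2])   # average growth factor
with_geometric = t * t * w       # matches the actual wealth
with_arithmetic = 1.25 * 1.25 * w  # overshoots slightly

# Case 2: two measurements on very different scales (made-up values).
a, b = 1.0, 1000.0
arithmetic_ratio = ((2 * a + b) / 2) / ((a + b) / 2)  # barely above 1
geometric_ratio = math.sqrt(2 * a * b) / math.sqrt(a * b)  # sqrt(2)

print(actual, with_geometric, with_arithmetic)
print(arithmetic_ratio, geometric_ratio)
```

The arithmetic mean hardly notices that the small component doubled, whereas the geometric mean grows by the same factor no matter which component doubled.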

Why blog? What about Reed’s law?

Here’s one reason to blog: so that you can belong to a very large network.

The utility of large networks, particularly social networks, can scale exponentially with the size of the network. (Wikipedia entry.)

The 20th century blueprint for research is now mythical

Will sent me a link to this article in InformationWeek called Research Revolution (April 10th, 2006). It explains how the “web labs”, while still small, are changing the way we do research. Industry research would now be extremely fast paced and based on the vast amounts of data we have.

To be honest, after reading this article, I’m still searching for what the revolution might be. They seem to be hinting that previously, the research component was hidden away in the company and that it has now become more integrated. Well, duh!

Still, if anyone wishes to start such a web lab in Montreal, give me a call!

Now, in trying to gain an edge in the fast-paced Internet software market, Google, Microsoft, and Yahoo are taking a wholly new approach to research. They’re building labs focused on the problems and opportunities that have emerged with sleeker Web sites, the explosion of online video and photos, widespread broadband connections, and the soaring numbers of hours people spend online. Inventions can be tested on thousands of users at little cost, and adjusting an algorithm today can mean big gains in the effectiveness of a Web service tomorrow. The 20th century blueprint for research is “essentially mythical now,” says Alan Eustace, Google’s senior VP of engineering and research. “The model of research has changed.”

The end of SOA web services

The Web is the biggest success story in IT ever. Well, maybe Walmart would come close behind. What bugs me is that people still don’t understand why the web is a success. It is not because Tim Berners-Lee is a great scientist. He had the wrong big idea about the web, but as a great hacker, he implemented the basics very well (including HTTP) and that’s what took off.

Information technology and engineering are not unlike science. Most often, the simplest and most elegant theory that can explain the data wins. For some strange reason, researchers often tend to prefer complicated ideas, maybe because they are trained to formalize everything. In IT, the simplest and most elegant model that can satisfy our needs ought to win. Good engineers and IT experts know this intuitively. That’s what Tim gave us, nearly the simplest solution that will work, and that’s why his name will go in the history books.

Regarding web services, the answer is not SOA, but rather REST. The writing is on the wall and has been there for years. Why? SOA is simply not the simplest solution that will work.

SOA may have meant something once but it’s just vendor bullshit now. (…) The crucial point is that Web-like things should be simple and lightweight and easy to set up; so I think the Web part of Web Services is more important than the Services part. SOA isn’t the future, Web style is. (Source: Tim Bray)

ACM International Collegiate Programming Contest Final Results

The ACM International Collegiate Programming Contest results are out. The top ten schools are given below.

  1. Saratov State University (Russia)
  2. Jagiellonian University – Krakow (Poland)
  3. Altai State Technical University (Russia)
  4. University of Twente (The Netherlands)
  5. Shanghai Jiao Tong University (China)
  6. St. Petersburg State University (Russia)
  7. Warsaw University (Poland)
  8. Massachusetts Institute of Technology (USA)
  9. Moscow State University (Russia)
  10. Ufa State Technical University of Aviation (Russia)

Asian Universities are beginning to dominate

Asian Universities are beginning to dominate the Top Institutions portion of the Top Scholars and Institutions survey conducted and published by the Journal of Systems and Software each year in October. In the latest survey findings, published in October 2005, among the leading institutions of the world based on counting the number of software engineering research publications emerging from them, three of the top five institutions are Asian. Korea Advanced Institute of Science and Technology is number one, National Chiao Tung University of China is number two, and Seoul National University of Korea is number 4. The non-Asian institutions in the top five are Carnegie Mellon University (including its Software Engineering Institute), at number three; and Fraunhofer Institute for Experimental Software Engineering, at number five. What is striking about this particular fact is that as recently as three years ago, in earlier such survey findings, Asian schools were only marginally represented in the top 10, and the top institutions were clearly North American.

Source: Robert L. Glass, Practical programmer: Is the crouching tiger a threat?, Communications of the ACM, Volume 49, Number 3 (2006), Pages 19-20.

What are the computer language people waiting for?

The glorious time when people could design a new insightful computer language is gone.

Or is it? In our Data Warehousing and OLAP classes, we cover MDX and various APIs for OLAP. Arguably, MDX is the de facto standard OLAP language. But as far as languages go, it is just ugly. Microsoft chose to mimic SQL closely and yet extend it dramatically into a multidimensional setting with a large dose of abstraction. I’ve never designed computer languages, but I’ve used them, and just like a painter can recognize a bad brush even if he can’t design a brush, I just don’t like MDX.

But even if I’m wrong, you can’t hope to teach MDX to a busy decision maker even if he has sufficient programming experience:

I believe that OLAP using MDX with Mondrian requires expert language knowledge and it would be very difficult for a user, with only domain knowledge, to be able to issue correct queries. (Hazel Webb)

What is needed is a simpler, easier language. Something that someone who knows about control structures (loops and if clauses) and has a basic understanding of what a data cube is (drilling down, rolling up, slicing and so on) can quickly pick up and use, say within a day.

Would make a great Ph.D. thesis.

Slashdot: Why Is Data Mining Still A Frontier?

Slashdot asks “Why Is Data Mining Still A Frontier?” The article itself is not very exciting, but the comments are great. Here are some I like:

I would suggest that, in practice, the real difficulty is that the problems that need to really be solved for data mining to be as effective as some people seem to wish it was are, when you actually get down to it, issues of pure mathematics. Research in pure mathematics (and pure CS which is awfully similar really) is just hard. Pretending that this is a new and growing field is actually somewhat of a lie.

Available datasets are not themselves in anything like normal relational form, and so have potential internal inconsistencies. And that gets in the way before you even have the chance to try to form intelligent inferences based on relations between data sets, which of course are terribly inconsistent.

The ultimate problem, is that for most datasets, there are an infinite (at least), set of relations that can be induced from the data. This doesn’t even address the issue, that the choice of available data is a human task. However, going back to assuming we have all the data possible, you still need to have a specific performance task in mind.

To sum it up:

  • Data Mining requires hard and fancy Mathematics.
  • Data cleaning and integration is hard.
  • There are infinitely many ways to mine data and it is not obvious a priori what is useful.

I think Data Mining is a beautiful research topic. However, as the comments indicate, it is very hard and it requires a wide ranging expertise.

Kunal Anand: Some XML exam questions

Kunal has almost picked up my challenge on his blog: come up with deep homework questions having to do with XML.

  • Given at least 10 blog/link feeds, determine the top ten outbound URLs.
  • Parse an iTunes library file and capture all the unique artist/albums.
  • Given a user’s XML file from del.icio.us, determine the top 10 intersecting tags.
  • Scrape a dynamic list from a web site (i.e. the Google Zeitgeist) and serialize a well-formed Atom feed.

The last one seems like mostly hard labour probably requiring quite a bit of fiddling.

The other ones are all interesting because they are examples of aggregation, and that’s not trivial to do in XSLT/XPath. Naturally, Kunal suggests solving these problems using a nice scripting language like Python, but solving them in XSLT is much more fun because it is harder.
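For contrast, here is roughly what the scripting-language route might look like for the first question. It is a sketch under loud assumptions: the two inline RSS documents are made up and stand in for real feeds, each item’s link element is treated as the outbound URL (as in a link feed), and top_links is a name invented for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Two tiny made-up RSS documents standing in for real link feeds.
FEEDS = [
    """<rss version="2.0"><channel>
         <item><link>http://example.org/a</link></item>
         <item><link>http://example.org/b</link></item>
       </channel></rss>""",
    """<rss version="2.0"><channel>
         <item><link>http://example.org/a</link></item>
         <item><link>http://example.org/c</link></item>
         <item><link>http://example.org/a</link></item>
       </channel></rss>""",
]

def top_links(feed_documents, n=10):
    """Count item links across all feeds and return the n most frequent."""
    counts = Counter()
    for document in feed_documents:
        root = ET.fromstring(document)
        for item in root.iter("item"):
            url = item.findtext("link")
            if url:
                counts[url.strip()] += 1
    return counts.most_common(n)

top = top_links(FEEDS)
for url, count in top:
    print(count, url)
```

The counting takes a single Counter here, which is exactly the grouping step that takes real work in XSLT 1.0.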

CSWWS 2006: Call for Participation

The 2006 Canadian Semantic Web Working Symposium will be held June 6th at Laval University, Quebec, Canada. We invite you to attend. This event will be held in conjunction with Canadian AI 2006.

  • On-line registration is available. THERE WILL BE NO ON-SITE REGISTRATION.
  • Industry partner and sponsor: OntoText (ontotext.com)
  • One plenary talk by Professor Michael N. Huhns, Director of the Center of Information Technology at the University of South Carolina, USA.
  • Two Tutorials:
    • State of Affairs in Semantic Web Services
      Michael Stollberg
      Leopold-Franzens Universität Innsbruck, Austria
    • MDA Standards for Ontology Development
      Dragan Gasevic, Dragan Djuric, Vladan Devedzic
      Simon Fraser University, Canada
  • Proceedings in the Semantic Web and Beyond series of Springer-Verlag:
    • A Trust Model for Sharing Ratings of Information Providers on the Semantic Web
      Jie Zhang, Robin Cohen
    • Ontoligent Interactive Query Tool
      Christopher Baker, Xioa Su, Greg Butler, Volker Haarslev
    • A Rule-based Approach for Semantic Annotation Evolution in the CoSWEM System
      Phuc-Hiep Luong, Rose Dieng-Kunt
    • Incorporating multiple ontologies into IEEE learning object metadata standard
      Phaedra Mohammed, Permanand Mohan
    • Applying and Inferring Fuzzy Trust in Semantic Web Social Networks
      Mohsen Lesani, Saeed Bagheri
    • Integrating Ontologies by Means of Semantic Partitioning
      Gizem Olgu, Atilla Elçi
    • DatalogDL: Datalog Rules Parameterized by Description Logic
      Jing Mei, Harold Boley, Jie Li, Virendrakumar C. Bhavsar, Zuoquan Lin
    • Completion Rules for Uncertainty Reasoning with the Description Logic ALC
      Volker Haarslev, Hsueh-Ieng Pai, Nematollaah Shiri
    • Fulfilling the Needs of a Metadata Creator and Analyst – An Investigation of RDF Browsing and Visualization Tools
      Shah Khusro and A. Min Tjoa
    • A Semantic Web Mediation Architecture
      Michael Stollberg, Emilia Cimpian, Adrian Mocan, Dieter Fensel
    • Resolution-based Explanations for Reasoning in Description Logic ALC
      Xi Deng, Volker Haarslev, Shiri Nematollaah
    • A Distributed Agent System Upon Semantic Web Technologies to Provide Biological Data
      Farzad Kohantorabi, Gregory Butler, Christopher J.O. Baker
    • Toward the Identification and Elimination of Semantic Conflicts for Integration of Ontologies
      Yevgen Biletskiy, David Hirtle, Olga Vorochek

    Birds of a Feather and poster sessions are also scheduled alongside
    the parallel session.

You can purchase the proceedings from Amazon.
