Saturday, April 9th, 2005

The Geomblog: On knowledge for knowledge’s sake

Filed under: Academia/Research — Daniel Lemire @ 19:34

Suresh is complaining about the current for-profit trend in research:

The desire for knowledge for its own sake seems almost quaint in these days of interdisciplinary research, justifying one’s bottom line, monetizing one’s research, and so on and so forth.

I read him in the following way: he is afraid that governments will stop giving money for pure research and keep their most generous funding for industrial-oriented research.

I actually disagree in the following way: if you need a lot of cash to do your research, you’ve got to justify the use of the money. Justify it to whom? To the people who give you the money. This seems only fair. If you want to do research for its own sake, and you also want a lot of money, well, tough.

Many people do research using their own time and very little money. They don’t have 10 assistants, one super-computer and a large operating budget. Einstein had none of these things. If you need these things, if you need a large budget, then justify it… explain how it will make the country you live in richer or safer… if you can’t, and still need the money, then beg. Or use a friend’s gear. I don’t know. Don’t claim that the government must fund pure research. I don’t buy it.

No researcher has a God-given right to large funding. In fact, this very assumption is what is killing true research: we’ve come to measure research by how much funding is given. Hence, if no funding is given, no research is done.

I even go the other way: many funding agencies throw their money away without enough care. For example, the current trend to keep funding more and more graduate students on the basis that we are training “highly qualified personnel” is shameful. The government says to the public: we are investing money in training more graduate students because we need them in this new century… but the truth is that we really don’t need that many graduate students and that many of these students would have been better off doing something else with their lives. My previous post was about Tim Bray… Tim started out in a university project, but as far as I can tell he never was a graduate student. [Update: indeed, Tim doesn’t have a graduate degree.] This was no handicap to him, obviously.

Do I believe in “knowledge for knowledge’s sake”? Absolutely. Do I think that governments must fund “knowledge for knowledge’s sake” above and beyond the funding universities already receive as institutions of knowledge? No.

Where I live, most university professors get to spend half their time doing research. They usually have access to decent computers and related technological support. They even get funding to go to conferences. All of this is pretty standard. I claim that on this basis, they can easily do research for its own sake…

(Sorry Suresh if I misunderstood you.)

ACM Queue - A Conversation with Tim Bray

Filed under: Data Warehousing and OLAP — Daniel Lemire @ 11:30

This is brilliant! ACM Queue is publishing an interview with Tim Bray (of XML fame) done by Jim Gray (of data cube and database transactions fame). Tim now runs Web technologies for Sun Microsystems. Tim Bray basically says that RDF and Semantic Web are a no go but we knew that’s what he thought.

However, there are many cool quotes. Try to find the pattern in these:

My CEO, Tom Jenkins, agreed to turn me loose to work on it myself, and I spent six months basically doing nothing else and built the crawler and the interfaces. (…) I lost weeks and weeks and weeks of sleep, hacking and patching and kludging to keep this thing on the air under the pressure of the load.

Lark was the first XML processor, implemented in Java. I wrote it myself. I used it also as a vehicle to learn Java. It shipped in January 1997 and actually got used by a bunch of people. (…) So, I let Lark go. It was fun to write and I think it was helpful, but it hasn’t been maintained since 1998.

Some of the people working in syndication were extremely upset about XML’s strictness, saying, “Well, you know, people just can’t be expected to generate well-formed data.” And I said, “Yes they can.” I went looking around and found that there are some quite decent libraries capable of doing that for Java and Perl and Python, but there didn’t seem to be one for C.

So sitting on the beach in Australia I wrote this little library in C called Genx that generates XML efficiently and guarantees that it is well-formed and canonical.

See the pattern? Tim Bray is a hacker with a degree in mathematics and computer science. [Tim doesn’t have a graduate degree.] And he changed the world.

But his life was not always easy:

Microsoft really went insane. There was a major meltdown and a war, and I was temporarily fired as XML coeditor. There was an aggressive attempt to destroy my career over that.

(Note that the interviewer, Jim Gray, works for Microsoft!)

Friday, April 8th, 2005

The rise of social consciousness in cyberspace

Filed under: Science and Technology — Daniel Lemire @ 16:51

In an earlier post, I tried to predict what the next Gutenberg printing press or the next Web would be like. I predicted that ubiquitous massive storage would be the next big technological advance and that it would bring three new challenges: the need to bring data warehousing to the masses, the need to bring security to the masses, and the need to move all social software to the Wikipedia level and beyond.

Scott had this comment which is worth repeating here:

I think the third — the rise of social consciousness in cyberspace — is right on. Of course, it’s hard to be more specific. The one thing I’m fairly sure of is that the Next Big Thing will be familiar — it won’t be part of some alien new world. It will be a reflection of what people have been for a long time. What is important to people? Mainly, we communicate with each other. We communicate useful get-through-the-day facts and longer range planning. We gossip and small-talk to maintain or strengthen social relationships. And we produce and consume art to fulfill some deeply ingrained need to find resonance with other people. (Oh yes, and pornography, which is kind of in a class by itself.) So far, the big things in IT have all been direct reflections of those social needs: the Web, e-mail, instant messaging, cell phones, Napster/KaZaA, Skype, “social networking”, iTunes, video-on-demand, etc. I expect this web of communication to mature into something in which reputation and recommendation are pervasive — in a way that mirrors practices that we are already comfortable with, but with dramatically increased efficiency and/or accessibility. The open question for me is whether the increase in efficiency or accessibility will be sufficient to have an impact approaching that of Gutenberg’s press.

Thursday, April 7th, 2005

Yuhong’s IJCAI’05 Working Notes

Filed under: Science and Technology — Daniel Lemire @ 18:44

Yuhong published her Working Notes in preparation for IJCAI, where our paper was accepted.

She had the great idea of inviting people to comment on her review of related papers:

I read more papers according to IJCAI reviews. I noted down my summary on these papers. I hope some reviewers or the authors of the reference papers can read my blog, so that we can have more discussion.

I never saw anyone use a blog for this exact function before. Let’s see if it works!

Wednesday, April 6th, 2005

What constitutes research blogging?

Filed under: Academia/Research — Daniel Lemire @ 16:30

Mathemagenic discusses research blogging and she found, based on her experience, that research blogging covers the following tasks:
  • publishing / dissemination / announcements (of papers, presentations, events by me and others)
  • research process
    • reflections
    • emotions
  • event blogging
    • notes
    • reflections
    • event planning (including travel planning)
  • paper blogging (notes on papers I read)
  • asking for help (explicit)
  • “enculturation” into research (reflection/learning on research culture, practices, tricks of the trade, etc.)
  • articulation
    • articulation of personal experiences (relevant for PhD)
    • articulation of problems/questions (may be implicit call for help, but often just thinking aloud)
  • writing-related (this is the difficult one)
    • drafting/testing pieces that are supposed to go into a paper
    • giving space to pieces that do not fit into a paper
  • reflections on methodology

Using CVS branches

Filed under: — Daniel Lemire @ 8:39

I do a lot of crazy programming for a living and I use CVS extensively. It allows me to keep track of various versions, see what others have done and so on. It is a great tool.

Recently, I got it into my head to use “branches”. The idea is this: suppose I want to work on crazy code without breaking everyone else’s code. I then create a private (and maybe temporary) version of the code where I can break everything I want to break. Later, presumably, I can merge my changes back into “HEAD” (HEAD being the main “branch” everyone uses by default).

Branches are scarcely documented: you have the semi-official documentation and other pages written by average users.

Here’s what I understand. First, make sure you commit all your current changes (“cvs ci” will do it). Go into the subdirectory that you want to branch off (I assume here you don’t want to branch off the entire source base). Now, create a new branch like this:
cvs tag -b Branchname
You’ve now created a branch, but your current directory “remains” in the main branch. To switch, do:
cvs update -r Branchname
Hack as much as you want and commit your changes when you are done (“cvs ci”).

Now, suppose your branch is completed and you want to go back to the trunk. You must first take your current directory out of the branch, like this:
cvs update -A
(Question: would “cvs update -r HEAD” do it?)

Finally, bring the changes you did in the branch back to the trunk (j stands for “join”):
cvs update -j Branchname

That’s it. Now, I’ve got no idea if this is safe or if you can go back to the branch later easily.

(Disclaimer: I’ve got no idea if this is accurate or not, but it worked for me.)

Tuesday, April 5th, 2005

Attention.XML

Filed under: — Daniel Lemire @ 12:41

I just became aware of the Attention.XML specification. The goal of Attention.XML is:

  • How many sources of information must you keep up with?
  • Tired of clicking the same link from a dozen different blogs?
  • RSS readers collect updates, but with so many unread items, how do you know which to read first?
  • Attention.XML is designed to solve these problems and enable a whole new class of blog and feed related applications.

Technically, Attention.XML is about making available to others the posts and feeds you like and the ones you dislike.

Attention.XML is an XML file (specifically an XOXO file) that contains an outline of feeds/blogs, where each feed itself is an outline, and each post is also an outline under the feed. This hierarchical outline structure is then annotated with per-feed and per-post information which captures such information as, the last time the feed/post was accessed, the duration of time spent on the feed/post, recent times of feed/post access, user set (dis)approval of posts, etc.

The idea is then to use collaborative filtering to find out what you may like.
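
To make the collaborative filtering step concrete, here is a minimal sketch of a weighted Slope One style predictor in Python. It assumes that the (dis)approval information in an Attention.XML file has already been turned into numerical ratings per post, which is my assumption and not something the specification spells out. The reader names, post names and ratings below are all made up for illustration.

```python
def slope_one_predict(ratings, user, target):
    """Predict `user`'s rating of `target` with a weighted Slope One scheme.

    `ratings` maps each user to a dict of {item: rating}.
    Returns None when no other item gives any evidence about `target`.
    """
    numerator = 0.0
    denominator = 0
    for item, user_rating in ratings[user].items():
        if item == target:
            continue
        # Differences (target - item) over all users who rated both items.
        diffs = [r[target] - r[item] for r in ratings.values()
                 if target in r and item in r]
        if not diffs:
            continue
        avg_diff = sum(diffs) / len(diffs)
        # Weight each per-item prediction by the number of co-raters.
        numerator += (user_rating + avg_diff) * len(diffs)
        denominator += len(diffs)
    return numerator / denominator if denominator else None


# Hypothetical ratings, as if extracted from three readers' attention files.
ratings = {
    "alice": {"post_a": 5.0, "post_b": 3.0, "post_c": 2.0},
    "bob": {"post_a": 3.0, "post_b": 4.0},
    "carol": {"post_b": 2.0, "post_c": 5.0},
}

prediction = slope_one_predict(ratings, "bob", "post_c")
print(prediction)
```

The point of the sketch is that the prediction only needs average rating differences between pairs of posts, so whoever holds many attention files can compute it cheaply. A real system would precompute the differences instead of scanning all users for every prediction.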

This sounds like a great idea, except for what Dare Obasanjo points out:

The only cloud I see on the horizon is that if anyone figures out how to do this right, it is unlikely that it will be made available as an open pool of data. The ‘attention.xml’ for each user would be demographic data that would be worth its weight in gold to advertisers. If Bloglines could figure out my likes and dislikes right down to what blog posts I’d want to read, I find it hard to imagine why the Bloglines team would make that information available to anyone including the user. For comparison, it’s not like Amazon makes my ‘attention.xml’ for books and CDs available to myself or their competitors.

It seems to me that what we need is a legal solution. We need to make it so that companies using publicly available Attention.XML files must give back (à la GPL). For example, if you use my Attention.XML, then you need to make yours available. This way, companies like Bloglines would be forced either to use only internal data or to make their data sets available when requested to do so.

Indeed, Attention.XML is very different from RSS. With RSS, you provide content that you want to be used… everyone wants more readers, so RSS is a winner. But Attention.XML provides my preferences, and why would I share my preferences? What do I win? Why would a company share my preferences if not for financial gain?

(For further reading on collaborative filtering, see Slope One Predictors for Online Rating-Based Collaborative Filtering [SDM’05] and Scale And Translation Invariant Collaborative Filtering Systems [Journal of Information Retrieval, 2005].)