Wednesday, December 21st, 2005

Wget doesn’t eat XML

Filed under: — Daniel Lemire @ 13:55

I wanted to retrieve a local copy of my online XML course. I had instructed the technical staff to serve the XHTML files as application/xml; I believe this was to work around the limitations of Internet Explorer. In any case, I stumbled upon a wget bug! Wget won’t treat XHTML served with the MIME type application/xml as an XHTML file, and hence it won’t follow the links inside it.

A deeper limitation is that wget doesn’t know XML, so it will not follow references to stylesheets. Wget also doesn’t know about JavaScript.
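To make the stylesheet point concrete: an XML document usually names its stylesheet in an xml-stylesheet processing instruction, which an HTML-only link extractor never looks at. Here is a minimal Python sketch that finds those references with a real XML parser. The function name `stylesheet_hrefs` and the sample document are my own illustration, not part of wget or of the course files.

```python
import re
from xml.dom import minidom

def stylesheet_hrefs(xml_text):
    """Return the href values of all xml-stylesheet processing instructions."""
    doc = minidom.parseString(xml_text)
    hrefs = []
    # Stylesheet instructions sit at the top level, before the root element.
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            # The instruction's data looks like: href="style.xsl" type="text/xsl"
            match = re.search(r"""href\s*=\s*(["'])(.*?)\1""", node.data)
            if match:
                hrefs.append(match.group(2))
    return hrefs

# Hypothetical sample document, for illustration only.
sample = """<?xml version="1.0"?>
<?xml-stylesheet href="cours.xsl" type="text/xsl"?>
<html xmlns="http://www.w3.org/1999/xhtml"><body/></html>"""
print(stylesheet_hrefs(sample))
```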

This meant I had to write my own scripts to recover the course. First, a bash script:

wget -m -r -l inf -v -p http://www.teluq.uquebec.ca/inf6450/index-fr.htm
find -path "*.htm" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find -path "*.html" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find -path "*.xhtml" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find -path "*.xml" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find -path "*.xml" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p

Notice that the last line appears twice. Don’t do this type of scripting at home. Bad design!
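A cleaner design would keep fetching until no new URLs turn up, rather than repeating a line by hand. Here is a minimal Python sketch of that idea. The names `crawl` and `fetch_links` and the toy `site` dictionary are hypothetical, chosen for illustration; a real version would download each URL and extract its links inside `fetch_links`.

```python
from collections import deque

def crawl(start_url, fetch_links):
    """Visit start_url and everything reachable from it, each URL once.

    fetch_links(url) is a hypothetical callback that downloads url and
    returns the list of URLs it links to.
    """
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for link in fetch_links(url):
            if link not in seen:  # skip anything already scheduled
                seen.add(link)
                queue.append(link)
    return seen

# Toy link graph standing in for the real site.
site = {
    "index.htm": ["mod1.xml", "mod2.xml"],
    "mod1.xml": ["mod1.xsl", "index.htm"],
    "mod2.xml": ["mod2-notes.xml"],
}
print(sorted(crawl("index.htm", lambda url: site.get(url, []))))
```

The loop stops on its own once every discovered link has been visited, however deep the nesting goes.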

Next, I needed a Python script to extract the URLs (Perl or Ruby would also do):

#!/usr/bin/env python
import re, sys

for filename in sys.argv[1:]:
    with open(filename) as file:
        # print("from", filename)
        for line in file:
            # better hope that we don't have repeated spaces!
            # NOTE: the first pattern was garbled when this post was published;
            # '<a href="' is an assumed reconstruction, not the original.
            for m in re.findall(r'(?<=<a href=").*?(?=")', line) + \
                     re.findall(r"(?<=openwindow\(').*?(?=')", line) + \
                     re.findall(r"""(?<=stylesheet href=["']).*?(?=["'])""", line):
                # rebuild an absolute URL from the mirrored directory path
                print("http://" + re.search("www.*/", filename).group() + m)

This is a pretty awful hack, but it works!
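The regular expressions above break as soon as the markup uses extra spaces or a different quote style. Since the files are well-formed XHTML, an XML parser can do the extraction more robustly. Here is a minimal sketch using Python's standard library. The function name `extract_links`, the sample page, and the example.org base URL are placeholders of mine, and the sketch only covers `a` elements, not the `openwindow` calls or stylesheets.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

def extract_links(xml_text, base_url):
    """Return absolute URLs for every href on an XHTML <a> element."""
    root = ET.fromstring(xml_text)
    links = []
    for anchor in root.iter(XHTML_NS + "a"):
        href = anchor.get("href")
        if href:
            # Resolve relative links against the page's own URL.
            links.append(urljoin(base_url, href))
    return links

# Hypothetical page: note the repeated space and the single quotes.
page = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<p><a href="module1/index.xml">Module 1</a>
<a  href='../autre.xml'>Autre</a></p>
</body></html>"""
base = "http://www.example.org/cours/index.xml"
for link in extract_links(page, base):
    print(link)
```

Using `urljoin` also handles relative paths such as `../autre.xml`, which simple string concatenation would get wrong.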

Here is a project for the tech savvy among you: extend wget so that it can parse XML!

1 Comment »

  1. Hello,

    I think wget’s refusal to look for links inside the XML is more likely a result of it only parsing text/html documents than a problem with the parsing itself.
    You can ask wget to treat all possible MIME types as HTML with wget -F.

    with kind regards
    Michael

    Comment by Michael Barth — 6/5/2006 @ 5:03
