XML Parsing 2 - Programming Notes

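I was curious about XPath, so I tried parsing the RSS feed of Hatena Bookmark's popular entries. After some trial and error, I managed to fetch the titles and tags of the latest five entries and write them out to an HTML file. But the processing is somehow slow (6 to 7 seconds for this alone). At that speed it is unusable...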

require 'open-uri'
require 'rexml/document'
require 'kconv'

doc = nil
html = ""

open("http://b.hatena.ne.jp/hotentry?mode=rss") do |xml|
  doc = REXML::Document.new xml
end

# Fetch the latest 5 entries
5.times do |i|
  item = "//item[position()="+(i+1).to_s+"]"
  title= doc.elements[item+"/title"]
  link = doc.elements[item+"/link"]
  tags = REXML::XPath.match(doc, item+"/dc:subject")
  taxo = REXML::XPath.match(doc, item+"/taxo:topics/rdf:Bag/rdf:li")

  # Build the output HTML
  html+= "<div style='margin:15px'>"
  html+= "<a href='"+link.text.tosjis+"'>"+title.text.tosjis+"</a><br />"
  tags.each_with_index do |e,j|
    html+= "<a href='"+taxo[j].attributes['resource']+"'>"+e.text.tosjis+"</a> "
  end
  html+="</div>"
end

open("hateb.html", "w") do |fp|
  fp.puts html
end
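
For trying the same kind of extraction offline, here is a minimal sketch that runs against a small inline document instead of the live feed. It also looks up the item list once and then reads each field relative to the item element, rather than searching the whole document for every field. The sample XML, its element names (feed, tag), and the helper build_html are all made up for illustration; the real feed uses namespaces and more fields, and this is not the script above.

```ruby
require 'rexml/document'

# Hypothetical stand-in for the feed (made-up data, no namespaces).
FEED_XML = <<~XML
  <feed>
    <item><title>A</title><link>http://example.com/a</link><tag>ruby</tag><tag>xml</tag></item>
    <item><title>B</title><link>http://example.com/b</link><tag>xpath</tag></item>
    <item><title>C</title><link>http://example.com/c</link></item>
  </feed>
XML

# Builds one <div> line per item for the first `limit` items.
def build_html(doc, limit)
  # Search the document once for the items, then keep only the first few.
  items = REXML::XPath.match(doc, "/feed/item").first(limit)
  items.map { |item|
    title = item.elements["title"].text                  # child of this item
    link  = item.elements["link"].text                   # child of this item
    tags  = REXML::XPath.match(item, "tag").map(&:text)  # path relative to this item
    "<div><a href='#{link}'>#{title}</a> #{tags.join(' ')}</div>"
  }.join("\n")
end

puts build_html(REXML::Document.new(FEED_XML), 2)
```

Whether this layout is actually faster on the real feed is something to measure, not assume.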

Notes

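Get every element with a given tag name : "//tag"
Get the Nth element : "//tag[position()=N]" (N is counted among same-named siblings under each parent)
Get the first matching element : REXML::XPath.first(doc, "path")
Get all matching elements : REXML::XPath.match(doc, "path")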
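
The notes above can be tried out on a tiny document. This is only a sketch: the sample XML and the helper texts are invented for illustration, and the paths are written against that sample rather than the Hatena feed.

```ruby
require 'rexml/document'

# Hypothetical sample data, not the real feed.
SAMPLE_XML = <<~XML
  <feed>
    <item><title>first</title><tag>ruby</tag><tag>xml</tag></item>
    <item><title>second</title><tag>xpath</tag></item>
    <item><title>third</title></item>
  </feed>
XML

# Returns the text of every element matching the given XPath.
def texts(doc, path)
  REXML::XPath.match(doc, path).map(&:text)
end

doc = REXML::Document.new(SAMPLE_XML)

p texts(doc, "//title")                         # every <title> in the document
p texts(doc, "/feed/item[position()=2]/title")  # title of the 2nd <item>
p REXML::XPath.first(doc, "//tag").text         # only the first matching <tag>
p texts(doc, "/feed/item[position()=1]/tag")    # all <tag>s of the 1st <item>
```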

Addendum

# item = "//item[position()="+(i+1).to_s+"]"
  item = "/rdf:RDF/item[position()="+(i+1).to_s+"]"

Changing the XPath expression made it dramatically faster! (So // is the culprit, I guess...)
I added p Process.times() calls to measure, and got the results below.

Before the change
(start) utime=0.312, stime=0.218, cutime=0.0, cstime=0.0
(end) utime=7.359, stime=0.421, cutime=0.0, cstime=0.0
After the change
(start) utime=0.281, stime=0.281, cutime=0.0, cstime=0.0
(end) utime=0.937, stime=0.531, cutime=0.0, cstime=0.0

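Hmm, I did not expect it to differ this much: about 7.0 seconds of user CPU time before versus about 0.66 seconds after.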
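
A rough way to repeat this kind of comparison without hitting the network is sketched below. It times the two path styles with Process.times on a synthetic document. The helpers build_doc, with_utime, and first_five_titles, the feed root element, and the item count of 300 are all assumptions made for this sketch; the numbers it prints depend on the machine and on the REXML version, so no result is claimed here.

```ruby
require 'rexml/document'

# Builds a synthetic document with `count` <item> elements (made-up data).
def build_doc(count)
  items = (1..count).map { |n| "<item><title>t#{n}</title></item>" }.join
  REXML::Document.new("<feed>#{items}</feed>")
end

# Runs the block and returns [block result, user CPU seconds spent in it].
def with_utime
  before = Process.times.utime
  result = yield
  [result, Process.times.utime - before]
end

# Looks up the titles of the first five items, one XPath query per item,
# using the given path prefix for the item element.
def first_five_titles(doc, prefix)
  (1..5).map { |n| doc.elements["#{prefix}[position()=#{n}]/title"].text }
end

doc = build_doc(300)
anywhere, anywhere_sec = with_utime { first_five_titles(doc, "//item") }
absolute, absolute_sec = with_utime { first_five_titles(doc, "/feed/item") }

puts "//item     : #{anywhere_sec}s #{anywhere.inspect}"
puts "/feed/item : #{absolute_sec}s #{absolute.inspect}"
```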

References

REXML 2.4.2 XPath function reference with samples
REXML: XML processing with Ruby