fetus Diary
2007/12/24 (Mon) - Trying to do the extraction properly
<?php
// Return one associative array (attribute name => value) per <meta> tag in $html.
function GetMetaData($html) {
    // Matches a whole <meta ...> tag; group 1 captures the raw attribute list.
    $regex = '<(?:[mM][eE][tT][aA])((?:(?:[\x20\x09\x0d\x0a]+)(?:(?:[[:alpha:]][[:alnum:]\-\.:_]*)'.
        '(?:[\x20\x09\x0d\x0a]+)?=(?:[\x20\x09\x0d\x0a]+)?(?:(?:"[^"]*")|(?:\'[^\']*\')|(?:[['.
        ':alnum:]\-\.:_]+))))*)(?:[\x20\x09\x0d\x0a]+)?(?:/)?>';
    // Matches a single name=value pair; group 1 is the name, group 2 the (possibly quoted) value.
    $regex2 = '(?:[\x20\x09\x0d\x0a]+)(?:((?:[[:alpha:]][[:alnum:]\-\.:_]*))(?:[\x20\x09\x0d\x0a]+)'.
        '?=(?:[\x20\x09\x0d\x0a]+)?((?:(?:"[^"]*")|(?:\'[^\']*\')|(?:[[:alnum:]\-\.:_]+))))';
    $result = array();
    // "{$regex}" replaces the "${regex}" interpolation form, which is deprecated as of PHP 8.2.
    if(preg_match_all("!{$regex}!s", $html, $matches1, PREG_SET_ORDER)) {
        foreach($matches1 as $match1) {
            $tmp = array();
            if(preg_match_all("!{$regex2}!s", $match1[1], $matches2, PREG_SET_ORDER)) {
                foreach($matches2 as $match2) {
                    // Strip the surrounding quotes; the s flag lets values that contain newlines match too.
                    $tmp[strtolower($match2[1])] = preg_replace('/^(["\'])(.*?)\\1$/s', '\2', $match2[2]);
                }
            }
            $result[] = $tmp;
        }
    }
    return $result;
}

// Fetch the page over HTTP (requires allow_url_fopen) and dump its meta data.
if($fh = @fopen('http://fetus.k-hsu.net/document/webmaster/diary2/', 'r')) {
    $html = '';
    while(!feof($fh)) {
        $tmp = fread($fh, 1024);
        $html .= $tmp;
    }
    fclose($fh);
    $meta = GetMetaData($html);
    var_dump($meta);
}
?>
array(4) {
[0]=>
array(2) {
["http-equiv"]=>
string(12) "Content-Type"
["content"]=>
string(24) "text/html; charset=UTF-8"
}
[1]=>
array(2) {
["http-equiv"]=>
string(18) "Content-Style-Type"
["content"]=>
string(8) "text/css"
}
[2]=>
array(2) {
["http-equiv"]=>
string(19) "Content-Script-Type"
["content"]=>
string(15) "text/javascript"
}
[3]=>
array(2) {
["name"]=>
string(6) "Robots"
["content"]=>
string(23) "INDEX,FOLLOW,IMAGEINDEX"
}
}
It does seem to work, at least.
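
For comparison, here is a rough sketch of the same extraction done with PHP's DOM extension instead of regular expressions. This is not the method used above, just a way to cross-check the regex result. The function name getMetaDataDom and the sample HTML are made up for illustration, and the sketch assumes the dom extension is enabled.

```php
<?php
// Hypothetical helper (not part of the original script): collect the
// attributes of every <meta> element via the DOM extension.
function getMetaDataDom(string $html): array {
    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting them,
    // since real-world HTML is rarely strictly valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $attrs = [];
        foreach ($meta->attributes as $attr) {
            // Lower-case the names so the keys match GetMetaData()'s output.
            $attrs[strtolower($attr->nodeName)] = $attr->nodeValue;
        }
        $result[] = $attrs;
    }
    return $result;
}

// Sample input invented for this sketch (mixed case and quote styles).
$sample = '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    . "<META NAME='Robots' CONTENT='INDEX,FOLLOW'>"
    . '</head><body></body></html>';

$meta = getMetaDataDom($sample);
var_dump($meta);
```

The parser takes care of quoting, letter case, and whitespace itself, so a sketch like this should also cope with attribute values the regex version does not cover, such as unquoted values containing a slash.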
- 07/12/24 16:20