擁抱自由,使用 linux !

在Linux底下真是太幸福了,有好多東西可以玩。想從windows解脫就是現在!

Aug-1-07

Perl exampl for parsing html tag

posted by Thomas

Warning: array_keys() [function.array-keys]: The first argument should be an array in /home/thomas/public_html/wp/wp-content/plugins/wp-syntax/geshi/geshi.php on line 1827

Warning: Invalid argument supplied for foreach() in /home/thomas/public_html/wp/wp-content/plugins/wp-syntax/geshi/geshi.php on line 1827

Warning: Invalid argument supplied for foreach() in /home/thomas/public_html/wp/wp-content/plugins/wp-syntax/geshi/geshi.php on line 2180

Warning: Invalid argument supplied for foreach() in /home/thomas/public_html/wp/wp-content/plugins/wp-syntax/geshi/geshi.php on line 3025

Warning: implode() [function.implode]: Argument to implode must be an array. in /home/thomas/public_html/wp/wp-content/plugins/wp-syntax/geshi/geshi.php on line 3077

Warning: array_keys() [function.array-keys]: The first argument should be an array in /home/thomas/public_html/wp/wp-content/plugins/wp-syntax/geshi/geshi.php on line 3108

Warning: Invalid argument supplied for foreach() in /home/thomas/public_html/wp/wp-content/plugins/wp-syntax/geshi/geshi.php on line 3108

Warning: array_keys() [function.array-keys]: The first argument should be an array in /home/thomas/public_html/wp/wp-content/plugins/wp-syntax/geshi/geshi.php on line 3151

Warning: Invalid argument supplied for foreach() in /home/thomas/public_html/wp/wp-content/plugins/wp-syntax/geshi/geshi.php on line 3151

Warning: array_keys() [function.array-keys]: The first argument should be an array in /home/thomas/public_html/wp/wp-content/plugins/wp-syntax/geshi/geshi.php on line 3292

Warning: Invalid argument supplied for foreach() in /home/thomas/public_html/wp/wp-content/plugins/wp-syntax/geshi/geshi.php on line 3292

以前常用來解析 HTML (對稱式 tag 都可以) 或是 XML 的語法

   my @table = $html =~ /<table((?:(?!<table).)*)\/table>/igsx;

@ table : 宣告一個陣列用來儲存比對成功的每筆結果
$html : 一般HTML網頁用來進行 regular expression
/*/ : 比對 /pattern/ igsx 是一些參數
i : 忽略大小寫
g : 全部找,要不然找到一個就會停止繼續找
s : 忽略換行,多行也當作一行
x : 忽略空白

=~ /<table((?:(?!<table).)*)\/table>/igsx;
以 <table 為開始
到 /table> 截止
中間可以是任何字元旦不包含 <table

如此一來如果網頁有很多table 就分別儲存於 $table[0], $table[1], $table[2] ….
所以
XML 語法也可以這樣解析,不需要其他Module!
簡單的比對想法就是 開始到結束之間不包含開始
有時候可能是 開始到結束之間不包含結束
範例:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
############################################################################
#                                                              parsencl.pl   -  description                                                          #
#                                                               -------------------                                                           #
#                                        copyright            : (C) 2000 by Yu-Chin Tsai                                                      #
#                                                email            : tlinux.tsai@gmail.com                                                          #
############################################################################
############################################################################
#                                                                                                                                                                             #
#   This program is free software; you can redistribute it and#or modify                                       #
#   it under the terms of the GNU General Public License as published by                                       #
#   the Free Software Foundation; either version 2 of the License, or                                              #
#   (at your option) any later version.                                                                                                        #
#                                                                                                                                                                            #
#############################################################################
 
#!/bin/perl -s
use strict;
use Encode;
use LWP;
 
#save as file
open(REC,"&gt;ncl.htm");
 
for(my $i=0;$i&lt;100;$i++){
    my $CN=101;  #要查詢的索書號
    $CN=$CN+$i;
    $CN = sprintf "%03d",$CN;
    #取得國x圖x館資料
    my $browser = LWP::UserAgent-&gt;new;
    my $url = 'http://xxxx';  #不方便透露
    my $response = $browser-&gt;post( $url,
    [
    'CN'=&gt;$CN,            #索書號(結尾要加',')
    'BR'=&gt;'BS',           #典藏地
    'PAGELINE'=&gt;'50',     #每頁資料筆數
    'routine'=&gt;'holding'  #必備(最後一個不必加 ',' )
    ]
    );
    #以上請參考http://lib.ncl.edu.tw/web/cross.htm 自行新增檢索欄位
    warn $CN." ".$response-&gt;status_line;
 
    #轉big5 to UTF8
    my $html = encode("utf8", decode("big5", $response-&gt;content));
 
    #拿掉垃圾
    $html =~ s/&nbsp\;//ig;
    #print "Source html $html \n\n";
 
    #分析國圖資料html->cvs
    my @table = $html =~ /<table((?:(?!<table).)*)\/table>/igsx; #這邊是重點
    if($#table == 0){
        print REC "$CN no data\n";
        warn "no data\n$html\n\n";
    }
 
    #圖書資料在$table[2]
    my @tr = $table[2] =~ /<tr((?:(?!<tr>).)*)\/tr>/igsx;
    shift(@tr);
    foreach(@tr){
        $_ =~ s/<//ig;
        $_ =~ s/>//ig;
        $_ =~ s/align=left//ig;
        my @td = $_ =~ /td((?:(?!td).)*)\/td/igsx;
        $td[3] =~ s/\s+//ig;
        $td[4] =~ s/\s+//ig;
        $td[5] =~ s/\s+//ig;
        $td[6] =~ s/\s\s+//ig;
        warn "\"$CN\",\"$td[3]\",\"$td[4]\",\"$td[5]\",\"$td[6]\"\n";
        print REC "\"$CN\",\"$td[3]\",\"$td[4]\",\"$td[5]\",\"$td[6]\"\n";
    }
    sleep 5;
}
 
close REC;
Tags:
Jun-13-07

開發 Perl Module

posted by Thomas

將既有的 Perl 程式 開發轉換成 Perl Module 其實非常容易

而且對於以後自己或是分享給他人使用非常有幫助

建立空的 module

1
h2xs -n module_name

就會產生基本的 template

可以看到有這些

1
2
cd module_name
Changes  fallback/  lib/  Makefile.PL  MANIFEST  ppport.h  README  t/  module_name.xs

如果有特殊需求請修改 Makefile.PL

儘量編輯 Changes and README

主程式在 lib/module_name.pm因此重點在
module_name.pm 的架構

package module_name; module 的名子
our @EXPORT = qw(

);

提供可以被使用函數的名子
Preloaded methods 將函數撰寫於此
1; module 程式碼到此為止
__END__ compiler 編譯到此為止
=head1 說明開始
=cut 說明結束

範例:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
# 你的 module 的名子, 到 "1;" 為止之間都是程式碼
package module_name;
 
# 如果你的程式有需要 use, 一起加在這邊吧
use 5.008008;
use strict;
use warnings;
use Carp;
 
require Exporter;
use AutoLoader;
 
our @ISA = qw(Exporter);
 
# Items to export into callers namespace by default. Note: do not export
# names by default without a very good reason. Use EXPORT_OK instead.
# Do not simply export all your public functions/methods/constants.
 
# This allows declaration       use module_name ':all';
# If you do not need this, moving things directly into @EXPORT or @EXPORT_OK
# will save memory.
our %EXPORT_TAGS = ( 'all' =&gt; [ qw(
 
) ] );
 
our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } );
 
#在這邊輸入你要給別人用的 function name
our @EXPORT = qw(
  hello
);
 
# 版本資訊,make dist 會用到
our $VERSION = '0.01';
 
sub AUTOLOAD {
    # This AUTOLOAD is used to 'autoload' constants from the constant()
    # XS function.
 
    my $constname;
    our $AUTOLOAD;
    ($constname = $AUTOLOAD) =~ s/.*:://;
    croak "&amp;module_name::constant not defined" if $constname eq 'constant';
    my ($error, $val) = constant($constname);
    if ($error) { croak $error; }
    {
        no strict 'refs';
        # Fixed between 5.005_53 and 5.005_61
#XXX    if ($] &gt;= 5.00561) {
#XXX        *$AUTOLOAD = sub () { $val };
#XXX    }
#XXX    else {
            *$AUTOLOAD = sub { $val };
#XXX    }
    }
    goto &amp;$AUTOLOAD;
}
 
require XSLoader;
XSLoader::load('module_name', $VERSION);
 
# Preloaded methods go here.
 
# 這邊撰寫你的程式碼
 
sub hello{
    print "hello worldn";
}
 
# Autoload methods go after =cut, and are processed by the autosplit program.
 
# 1; 表示 module 程式 結束, 從最一開始的 "package name" 到此 " 1; " 之間就是你的主要程式碼
1;
__END__
# Below is stub documentation for your module. You'd better edit it!
 
# 開始說明 從 head 到 cut 都是說明, 依照其規則可使用許多tool將說明轉成純文字或是html, 後續會介紹
=head1 NAME
 
# 簡單說明用途
module_name - Perl extension for blah blah blah
 
=head1 SYNOPSIS
 
# 說明如何使用
  use module_name;
  module_name::hello();
  blah blah blah
 
=head1 DESCRIPTION
 
# 更多詳細說明
Stub documentation for module_name, created by h2xs. It looks like the
author of the extension was negligent enough to leave the stub
unedited.
 
Blah blah blah.
 
=head2 EXPORT
 
None by default.
 
=head1 SEE ALSO
 
# 參考資訊,可供參考其他相關的 module
Mention other useful documentation such as the documentation of
related modules or operating system documentation (such as man pages
in UNIX), or any relevant external documentation such as RFCs or
standards.
 
If you have a mailing list set up for your module, mention it here.
 
If you have a web site set up for your module, mention it here.
 
=head1 AUTHOR
 
# 作者資訊
Yu-Chin Tsai, E<lt>thomas@E<gt>
 
=head1 COPYRIGHT AND LICENSE
 
# 版權資訊
Copyright (C) 2007 by Yu-Chin Tsai
 
This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself, either Perl version 5.8.8 or,
at your option, any later version of Perl 5 you may have available.
 
# 結束說明 從 head 到 cut 都是說明
=cut
</gt></lt>

產生文件
pm 檔 可以使用 pod 轉出說明文件,相關工具有
pod2text, pod2man, pod2html, pod2latex, pod2usage.

# 產生純文字說明
$ pod2text module_name.pm
NAME
    module_name - Perl extension for blah blah blah
 
SYNOPSIS
      use module_name;
      blah blah blah
 
.....
 
# 產生 latex 格式說明文件
$ pod2latex module_name.pm
$ ls
module_name.pm  module_name.tex
 
# 輸出 usage (來自 )
$ pod2usage module_name.pm
Usage:
      use module_name;
      module_name::hello();
      blah blah blah

編譯與安裝

在目錄下依序執行
產生make file

$ perl Makefile.PL
Checking if your kit is complete...
Looks good
Writing Makefile for module_name

編譯

$ make
...
chmod 755 blib/arch/auto/module_name/module_name.so
cp module_name.bs blib/arch/auto/module_name/module_name.bs
chmod 644 blib/arch/auto/module_name/module_name.bs
Manifying blib/man3/module_name.3pm

測試

$ make test
PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/module_name....ok
All tests successful.
Files=1, Tests=1,  0 wallclock secs ( 0.04 cusr +  0.01 csys =  0.05 CPU)

如果通過測試則可以繼續安裝
如果需要特別寫 test 請參考下方參考資料
進行安裝

$ make install

發佈 module

$ make dist
......
module_name-0.01/META.yml
module_name-0.01/module_name.xs
module_name-0.01/MANIFEST
rm -rf module_name-0.01
gzip --best module_name-0.01.tar

參考資料
http://world.std.com/~swmcd/steven/perl/module_mechanics.html#TOC12
http://world.std.com/~swmcd/steven/perl/module_anatomy.html

Tags: