annotate lazywww/README @ 373:dd3d76f43999

update script for collecting ally information.
author "Rex Tsai <chihchun@kalug.linux.org.tw>"
date Tue, 14 Apr 2009 17:00:40 +0800
parents d26eea95c52d
d26eea95c52d new web fecther proposal
hychen@mluna
parents:
diff changeset
"""
[Note] the project is not available yet.

A web page fetching tool chain that has a jQuery-like selector and supports
chained operations.

Here is an example that shows the main idea: retrieve the content you want
from a div box in a web page, then post to another web page with a parameter
built from that content and retrieve the next wanted content there. Finally,
store the product.

    def func(s):
        # Build the POST parameters from the selected element's HTML.
        msg = s.html()
        return {'msg': msg}

    try:
        c("http://example.tw/").get().find("#id > div") \
            .build_param(func).post_to("http://example2.com") \
            .save_as('hellow.html')
    except Exception:
        # Errors are ignored in this minimal example.
        pass

A more complex example:

    try:
        c("http://example.tw/").retry(4, '5m').get() \
            .find("#id > div") \
            .build_param(func).post_to("http://example2.com") \
            .save_as('hellow.html') \
            .end().find("#id2 > img").download('pretty-%s.jpg') \
            .tar_and_zip("pretty_girl.tar.gz")
    except NotFound:
        print("the web page was not found.")
    except NoPermissionTosave:
        print("the files cannot be saved: incorrect permissions.")
    except Exception:
        print("unknown error.")
"""

Still in the design stage and validating the idea. Currently stuck on how to
connect the workflows together... orz

These notes are rather messy; please bear with me.

I originally set out to write a bot, but controlling web pages from Python
felt unintuitive; at least, extracting specific content from HTML is not as
simple as it is with jQuery. I also saw thinker describe an architecture for
fetching web pages on IRC, so while writing the bot I want to see whether I can
build a usable little tool (oops, I am drifting off topic again).

Fetching a web page resembles a factory production line. The flow is as follows:

    fetch page            find specific content       save
                          and process it
    workflow -----------> workflow --> product -----> workflow
             semiproduct


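One possible way to connect the workflows in the flow above, sketched under assumptions: every step is a plain callable that takes the current semiproduct and returns the next one, and a small wrapper threads the value through. `Chain`, `then`, `fetch`, `find_inner` and `save_as` are hypothetical names, not part of any existing code, and `fetch` returns canned HTML instead of doing a real HTTP request so the sketch runs offline.

```python
class Chain(object):
    """Hypothetical wrapper that holds the current semiproduct."""

    def __init__(self, semiproduct):
        self.semiproduct = semiproduct

    def then(self, workflow):
        # A workflow is any callable: semiproduct in, semiproduct (or product) out.
        return Chain(workflow(self.semiproduct))


def fetch(url):
    # Stand-in for a real HTTP GET; always returns the same canned page.
    return "<div id='id'><div>hello</div></div>"


def find_inner(html):
    # Stand-in for a selector step: take the text of the first bare <div>.
    start = html.index("<div>") + len("<div>")
    return html[start:html.index("</div>", start)]


saved = {}  # in-memory "storage" used instead of the file system


def save_as(name):
    # Returns a workflow that stores the content and hands back its name.
    def workflow(content):
        saved[name] = content
        return name
    return workflow


product = (Chain("http://example.tw/")
           .then(fetch)
           .then(find_inner)
           .then(save_as("h.html"))
           .semiproduct)
```

Because each step only sees the value handed to it, a new workflow can be added without touching the wrapper; whether this fits the intended design is an open question.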
Lazy WWW Proposal

0.1
    workflow architecture

    jQuery-style way to parse HTML more easily.

        http://phpimpact.wordpress.com/2008/08/07/php-simple-html-dom-parser-jquery-style/

    Simple Fetcher - get a web page

    basic process hook - process the content to build a middleware object / semiproduct

0.2
    output serialization - c('http://www.example.com').build_dict(lambda x:x).to_xml()

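As a rough illustration of the serialization step, the sketch below turns a flat dict (the kind of value `build_dict` might produce) into XML with the standard library. The `to_xml` function shown here, its `root_tag` parameter and the sample data are assumptions for illustration only; the proposal does not define the output format.

```python
import xml.etree.ElementTree as ET


def to_xml(data, root_tag="result"):
    """Serialize a flat dict to an XML string (hypothetical helper).

    Keys must be valid XML element names; values are converted with str().
    """
    root = ET.Element(root_tag)
    for key in sorted(data):  # sorted for a stable element order
        child = ET.SubElement(root, key)
        child.text = str(data[key])
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml({"msg": "hello", "title": "demo"})
```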
0.3

    Fetcher exception handling (retry)

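A minimal retry sketch, assuming that `retry(4, '5m')` in the example means "try up to 4 times, waiting 5 minutes between attempts". That reading, the unit suffixes, and the names `parse_delay`, `retry` and `flaky` are all assumptions. The sleep function is injectable so the example runs instantly.

```python
import time

UNITS = {"s": 1, "m": 60, "h": 3600}  # assumed suffixes: seconds, minutes, hours


def parse_delay(text):
    """Turn a string such as '5m' into a number of seconds."""
    return int(text[:-1]) * UNITS[text[-1]]


def retry(action, times, delay, sleep=time.sleep):
    """Call action() up to `times` times, sleeping between failed attempts."""
    wait = parse_delay(delay)
    for attempt in range(1, times + 1):
        try:
            return action()
        except IOError:
            if attempt == times:
                raise  # out of attempts: let the caller see the error
            sleep(wait)


calls = []


def flaky():
    # Fails twice, then succeeds, to exercise the retry loop.
    calls.append(1)
    if len(calls) < 3:
        raise IOError("temporary failure")
    return "page"


waits = []
result = retry(flaky, 4, "5m", sleep=waits.append)  # record waits instead of sleeping
```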
0.4
    Storager - save the product.

    tar / zip - c('http://www.kimo.com.tw').get().tar_and_gzip('hello.tgz')

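The archive step could be built on the standard `tarfile` module. The sketch below is only an assumption about what `tar_and_gzip` might do: it packs in-memory content (a dict of file name to bytes, a hypothetical input shape) into a gzip-compressed tar file written to a temporary directory.

```python
import io
import os
import tarfile
import tempfile


def tar_and_gzip(files, archive_path):
    """Write {name: bytes} into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for name, content in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(content)  # size must be set before adding
            archive.addfile(info, io.BytesIO(content))
    return archive_path


workdir = tempfile.mkdtemp()
path = tar_and_gzip({"hello.html": b"<p>hello</p>"},
                    os.path.join(workdir, "hello.tgz"))

# Read the archive back to check what was stored.
with tarfile.open(path, "r:gz") as archive:
    names = archive.getnames()
    body = archive.extractfile("hello.html").read()
```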
0.5
    Pipeline command operation support (the idea is from thinker)

        lzw getpage http://www.kimo.com.tw/faq.html , find "#id > div" , save_as hello.html

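To show how such a command line might be split into steps, here is a small parsing sketch using `shlex`. It assumes the comma separators are surrounded by spaces, as in the example above, and that the leading program name `lzw` has already been removed; `parse_pipeline` is a hypothetical name.

```python
import shlex


def parse_pipeline(command):
    """Split 'step args , step args' into a list of (name, [args]) tuples."""
    steps = []
    current = []
    for token in shlex.split(command):  # honours the quotes around selectors
        if token == ",":
            if current:
                steps.append((current[0], current[1:]))
            current = []
        else:
            current.append(token)
    if current:
        steps.append((current[0], current[1:]))
    return steps


steps = parse_pipeline(
    'getpage http://www.kimo.com.tw/faq.html , find "#id > div" , save_as hello.html'
)
```

Each tuple could then be looked up in a table of workflow functions and run in order, which is one way the command line and the chained Python API might share an implementation.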
0.6 proposal

    Dispatcher - manage the missions

References:

    Workflow: http://en.wikipedia.org/wiki/Getting_Things_Done
    thinker's code: http://master.branda.to/downloads/pywebtool/

c('http://www.kimo.com.tw').get() . find('#id div') . save_as('h.html') . tar('a.tar')
semiproduct --------------> workflow --------------------> workflow ----------------> workflow -----------> product ----------> workflow
                                          semiproduct                  semiproduct