comparison lazywww/README @ 61:d26eea95c52d

new web fetcher proposal
author hychen@mluna
date Tue, 21 Oct 2008 01:36:28 +0800
56:6e0d5e781949 61:d26eea95c52d
"""
[Note] The project is not available yet.

A web page fetching toolchain with a jQuery-like selector and support for method chaining.

Here is an example that shows the main idea: retrieve the content you want
from a div in a web page, build parameters from that content, post them to
another web page to retrieve the next piece of content, and finally store
the result.

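    def func(s):
        # Build the POST parameters from the selected element's HTML.
        msg = s.html()
        return {'msg': msg}

    try:
        # Fetch the page, select the div, post its content, save the reply.
        c("http://example.tw/").get().find("#id > div") \
            .build_param(func).post_to("http://example2.com") \
            .save_as('hellow.html')
    except Exception:
        # Errors are ignored in this short example.
        pass
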
A more complex example:

    try:
        c("http://example.tw/").retry(4, '5m').get() \
            .find("#id > div") \
            .build_param(func).post_to("http://example2.com") \
            .save_as('hellow.html') \
            .end().find("#id2 > img").download('pretty-%s.jpg') \
            .tar_and_zip("pretty_girl.tar.gz")
    except NotFound:
        print("The web page was not found.")
    except NoPermissionTosave:
        print("The files could not be saved because of incorrect permissions.")
    except Exception:
        print("Unknown error.")
"""
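
The chaining in the examples above, including `.end()`, could work by having each step return a new object that remembers the object it came from. The following is only a minimal sketch under that assumption: the `Chain` class is hypothetical, it works on an in-memory string rather than a real HTTP response, and its `find` is a plain substring match standing in for the proposed selector engine. It is written for Python 3.

```python
class Chain(object):
    """Hypothetical chain node: holds a value and a link to the previous node."""

    def __init__(self, value, parent=None):
        self.value = value
        self.parent = parent

    def find(self, needle):
        # Stand-in for a selector: keep only the lines that contain `needle`.
        lines = [line for line in self.value.splitlines() if needle in line]
        return Chain("\n".join(lines), parent=self)

    def build_param(self, func):
        # Let the caller turn the current node into a parameter dict.
        return Chain(func(self), parent=self)

    def html(self):
        return self.value

    def end(self):
        # Step back to the node this one was derived from, like jQuery's end().
        return self.parent if self.parent is not None else self


page = Chain("<div id='a'>hello</div>\n<img src='x.jpg'>")
params = page.find("<div").build_param(lambda s: {"msg": s.html()})
images = params.end().end().find("<img")
```

Because every step returns a fresh node, going back with `end()` never disturbs results that were already produced further down the chain.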
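
Still in the design stage, validating the idea. Currently stuck... stuck on how to connect the workflows together... orz

The notes here are rather messy; please bear with me.

I originally set out to write a bot, but controlling web pages from Python felt unintuitive?! At least, getting specific content out of HTML is not as simple as with jQuery.
Then I saw thinker mention an idea for a page-fetching architecture on IRC, so while writing the bot I want to see whether I can build a usable little tool (oops, I have wandered off topic again).
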
Fetching web pages is similar to a factory production line. The flow is as follows:

fetch page            find content                store
                      and process
workflow -----------> workflow --> product -----> workflow
         semiproduct


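One possible way to connect the workflows in the flow above is to treat each one as a function that takes the previous semiproduct and returns the next, so that a pipeline simply calls them in order. This is a sketch under that assumption, not the project's design: `run_pipeline`, `fetch`, `find_content` and `store` are hypothetical names, and `fetch` returns a canned string instead of touching the network.

```python
def run_pipeline(workflows, semiproduct=None):
    # Feed the output of each workflow into the next one.
    for workflow in workflows:
        semiproduct = workflow(semiproduct)
    return semiproduct


def fetch(_):
    # Stand-in for "fetch page": returns canned HTML.
    return "<html><div id='box'>content</div></html>"


def find_content(html):
    # Stand-in for "find content": cut out the text inside the first div.
    start = html.index(">", html.index("<div")) + 1
    return html[start:html.index("</div>")]


storage = {}


def store(product):
    # Stand-in for "store": keep the product in a dict instead of a file.
    storage["hello.html"] = product
    return product


result = run_pipeline([fetch, find_content, store])
```

With this shape, adding a step such as retrying or archiving only means inserting another function into the list.
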
Lazy WWW Proposal

0.1
  Workflow architecture

  jQuery-style HTML parsing, to make it easier.

  http://phpimpact.wordpress.com/2008/08/07/php-simple-html-dom-parser-jquery-style/

  Simple Fetcher - get a web page

  Basic process hook - process the content to build a middleware object / semiproduct

0.2
  Output serialization - c('http://www.example.com').build_dict(lambda x:x).to_xml()

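As an illustration of what the serialization step might do, here is a minimal sketch that assumes `build_dict` yields a flat dictionary whose keys are valid XML element names. The helper `dict_to_xml` and the `root` tag are hypothetical; the serialization itself uses the standard library's `xml.etree.ElementTree` (Python 3).

```python
import xml.etree.ElementTree as ET


def dict_to_xml(data, root_tag="root"):
    # One child element per key, in sorted key order so the output is stable.
    root = ET.Element(root_tag)
    for key in sorted(data):
        child = ET.SubElement(root, key)
        child.text = str(data[key])
    return ET.tostring(root, encoding="unicode")


xml_text = dict_to_xml({"title": "hello", "msg": "a < b"})
```

Special characters in the values are escaped by ElementTree, so the caller does not need to do it by hand.
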
0.3

  Fetcher exception handling (retry)

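The retry behaviour could look roughly like the sketch below. It rests on two assumptions that the notes do not settle: that the first argument of `retry(4, '5m')` is the total number of attempts, and that the second is an integer followed by `s`, `m` or `h`. The names `parse_interval`, `retry` and `flaky_fetch` are hypothetical, and the `sleep` parameter exists only so the example runs without waiting.

```python
import time


def parse_interval(text):
    # Assumed format: an integer followed by s, m or h ('5m' is 300 seconds).
    units = {"s": 1, "m": 60, "h": 3600}
    return int(text[:-1]) * units[text[-1]]


def retry(func, times, interval, sleep=time.sleep):
    # Call func up to `times` times, sleeping between failed attempts.
    delay = parse_interval(interval)
    for attempt in range(1, times + 1):
        try:
            return func()
        except IOError:
            if attempt == times:
                raise
            sleep(delay)


calls = []
waits = []


def flaky_fetch():
    # Fails twice, then succeeds, to exercise the retry loop.
    calls.append(1)
    if len(calls) < 3:
        raise IOError("temporary failure")
    return "<html>ok</html>"


page = retry(flaky_fetch, 4, "5m", sleep=waits.append)
```
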
0.4
  Storager - save the product.

  tar / zip - c('http://www.kimo.com.tw').get().tar_and_gzip('hello.tgz')

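A storage step like `tar_and_gzip` could be built on the standard library's `tarfile` module. The sketch below assumes the fetched content is already available as bytes and writes the archive to memory rather than to `hello.tgz`; the free function `tar_and_gzip` and its `files` argument are hypothetical and only borrow the method's name.

```python
import io
import tarfile


def tar_and_gzip(files):
    # files: dict mapping archive member names to bytes content.
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in sorted(files):
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            archive.addfile(info, io.BytesIO(files[name]))
    return buffer.getvalue()


archive_bytes = tar_and_gzip({"hello.html": b"<html>hi</html>"})

# Read the archive back to check the round trip.
with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
    names = archive.getnames()
    content = archive.extractfile("hello.html").read()
```
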
0.5
  Pipeline command operation support (the idea is from thinker)

  lzw getpage http://www.kimo.com.tw/faq.html , find "#id > div" , save_as hello.html

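The command line above could be turned into a list of steps before anything is executed. This is a rough sketch that assumes the commas are separated from their neighbours by whitespace, as in the example; `parse_pipeline` is a hypothetical helper, and the leading `lzw` program name is assumed to have been stripped already. Tokenizing relies on the standard library's `shlex.split`.

```python
import shlex


def parse_pipeline(command):
    # Split into shell-style tokens, then group them into steps at each ",".
    steps, current = [], []
    for token in shlex.split(command) + [","]:
        if token == ",":
            if current:
                steps.append((current[0], current[1:]))
            current = []
        else:
            current.append(token)
    return steps


steps = parse_pipeline(
    'getpage http://www.kimo.com.tw/faq.html , find "#id > div" , save_as hello.html'
)
```

Each step is a `(name, arguments)` pair, which maps directly onto a method call in the chained Python form.
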
0.6 proposal

  Dispatcher - manage the missions

References:

  Workflow: http://en.wikipedia.org/wiki/Getting_Things_Done
  Thinker's code: http://master.branda.to/downloads/pywebtool/

c('http://www.kimo.com.tw').get() . find('#id div') . save_as('h.html') . tar('a.tar')
semiproduct --------------> workflow --------------------> workflow ----------------> workflow -----------> product ----------> workflow
                                          semiproduct                  semiproduct