"""
[Note] This project is not available yet.

A web page fetching toolchain with a jQuery-like selector and support for method chaining.

Here is an example that shows the main idea: retrieve the content you want
from a div box in a web page, then post to another web page using parameters
built from that content, and retrieve the next piece of wanted content there.
Finally, store the product.

def func(s):
    msg = s.html()
    return {'msg': msg}

try:
    c("http://example.tw/").get().find("#id > div") \
        .build_param(func).post_to("http://example2.com") \
        .save_as('hellow.html')
except:
    pass

A more complex example:

try:
    c("http://example.tw/").retry(4, '5m').get() \
        .find("#id > div") \
        .build_param(func).post_to("http://example2.com") \
        .save_as('hellow.html') \
        .end().find("#id2 > img").download('pretty-%s.jpg') \
        .tar_and_zip("pretty_girl.tar.gz")
except NotFound:
    print("The web page was not found.")
except NoPermissionTosave:
    print("The files cannot be saved: incorrect permissions.")
except Exception:
    print("Unknown error.")
"""

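Still in the design stage, validating the idea. Currently stuck on how to connect the workflows together... orz

These notes are pretty messy; please bear with me.

I originally set out to write a bot, but controlling web pages from Python felt unintuitive; at least, extracting specific HTML content is not as simple as with jQuery.
I also saw thinker mention an idea for a page-fetching architecture on IRC, so while writing the bot I want to see whether I can build a usable little tool (oops, digressing again).
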
Fetching web pages works much like a factory production line. The flow is as follows:

fetch page              find specific content               store
                        and process it

workflow -------------> workflow ---------> product ------> workflow
          semiproduct

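One possible way to connect the stages above, as a minimal sketch: each chained call records a workflow, and the semiproduct is handed from one workflow to the next when the chain runs. The `Chain` class, its `then`/`run` methods, and the stand-in workflows are hypothetical names for illustration; they are not part of this project.

```python
class Chain:
    """Hypothetical chain: each workflow receives the previous semiproduct."""

    def __init__(self, source):
        self.source = source   # initial input, e.g. a URL
        self.steps = []        # workflows, in the order they were chained

    def then(self, workflow):
        # Record the workflow and return self so calls can be chained.
        self.steps.append(workflow)
        return self

    def run(self):
        semiproduct = self.source
        for workflow in self.steps:
            semiproduct = workflow(semiproduct)
        return semiproduct     # the final product


# Stand-in workflows that work on plain strings, so no network is needed.
def fetch(url):
    return "<div id='id'>hello from %s</div>" % url

def find_content(html):
    # Crude extraction of the text between the first '>' and the last '<'.
    return html.split(">", 1)[1].rsplit("<", 1)[0]

def process(text):
    return text.upper()

product = (Chain("http://example.tw/")
           .then(fetch).then(find_content).then(process).run())
print(product)
```

Deferring execution until `run()` keeps the chain itself a plain description of the work, which may make retries or logging easier to add around each step.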
Lazy WWW Proposal

0.1
    Workflow architecture

    A jQuery-style way to parse HTML more easily.

    http://phpimpact.wordpress.com/2008/08/07/php-simple-html-dom-parser-jquery-style/

    Simple Fetcher - get the web page

    Basic process hook - process the content to build a middleware object / semiproduct

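To make the selector idea a little more concrete, here is a small sketch that pulls the text out of the element with a given id, using only the standard library's `html.parser`. It is far from a full jQuery-style selector (no `>` combinator, no classes), and `IdTextFinder` and `find_text_by_id` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

class IdTextFinder(HTMLParser):
    """Collect the text inside the first element whose id matches.

    Limitation: unclosed void tags such as <br> inside the matched
    element would confuse the depth counter in this simple sketch.
    """

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.depth = 0       # > 0 while inside the matched element
        self.done = False    # True once the matched element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif not self.done and dict(attrs).get("id") == self.wanted_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def find_text_by_id(html, wanted_id):
    finder = IdTextFinder(wanted_id)
    finder.feed(html)
    return "".join(finder.chunks).strip()

page = ("<html><body><div id='box'><p>wanted</p> content</div>"
        "<div>other</div></body></html>")
print(find_text_by_id(page, "box"))
```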
0.2
    Output serialization - c('http://www.example.com').build_dict(lambda x: x).to_xml()

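As a rough illustration of what a `to_xml()` step might produce, the sketch below serializes a flat dict with the standard library's `xml.etree.ElementTree`. The `dict_to_xml` helper and the `result` root tag are assumptions made for this example; the proposal does not specify an output schema.

```python
import xml.etree.ElementTree as ET

def dict_to_xml(data, root_tag="result"):
    """Serialize a flat dict into a small XML document (as a string)."""
    root = ET.Element(root_tag)
    for key, value in data.items():
        child = ET.SubElement(root, key)   # keys must be valid XML tag names
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")

xml_text = dict_to_xml({"msg": "hello", "count": 2})
print(xml_text)
```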
0.3

    Fetcher exception handling (retry)

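A minimal sketch of how retrying could work: wrap the fetch call, catch a fetch-specific exception, and try again a limited number of times. `FetchError`, `fetch_with_retry`, and the `delay` parameter are hypothetical; in particular, reading the earlier `retry(4, '5m')` as "retry count and wait time" is an assumption, not something these notes define.

```python
import time

class FetchError(Exception):
    """Hypothetical error raised by a fetcher when a request fails."""

def fetch_with_retry(fetch, url, retries=4, delay=0.0):
    """Call fetch(url); on FetchError, retry up to `retries` more times."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except FetchError as error:
            last_error = error
            if attempt < retries:
                time.sleep(delay)   # wait before the next attempt
    raise last_error                # every attempt failed

# Stand-in fetcher that fails twice before succeeding (no network needed).
calls = []
def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise FetchError("temporary failure")
    return "<html>ok</html>"

page = fetch_with_retry(flaky_fetch, "http://example.tw/", retries=4)
print(page, len(calls))
```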
0.4
    Storager - save the product.

    tar / zip: c('http://www.kimo.com.tw').get().tar_and_gzip('hello.tgz')

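The archiving step could be built on the standard library's `tarfile` module. The sketch below is one possible shape, assuming the earlier steps have already saved files to disk; the standalone `tar_and_gzip` function and the temporary file it packs are illustrative only.

```python
import os
import tarfile
import tempfile

def tar_and_gzip(paths, archive_path):
    """Pack the given files into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in paths:
            # Store only the file name, not the full directory path.
            archive.add(path, arcname=os.path.basename(path))
    return archive_path

# Create a stand-in "saved page" in a temporary directory.
workdir = tempfile.mkdtemp()
saved = os.path.join(workdir, "hello.html")
with open(saved, "w") as handle:
    handle.write("<html>hello</html>")

archive_path = tar_and_gzip([saved], os.path.join(workdir, "hello.tgz"))
with tarfile.open(archive_path, "r:gz") as archive:
    names = archive.getnames()
print(names)
```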
0.5
    Pipeline command operation support (the idea is from thinker)

    lzw getpage http://www.kimo.com.tw/faq.html , find "#id > div" , save_as hello.html

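One way such a command line might be split into stages, as a sketch: tokenize with `shlex` so the quoted selector stays in one piece, then cut the token list at each standalone comma. `parse_pipeline` is a hypothetical helper, and the input below leaves out the leading `lzw` program name on the assumption that it would arrive separately as the program name.

```python
import shlex

def parse_pipeline(command_line):
    """Split 'a x , b "y z" , c w' into [(name, [args, ...]), ...].

    Commas must be surrounded by whitespace to count as separators.
    """
    stages, current = [], []
    for token in shlex.split(command_line):
        if token == ",":
            if current:
                stages.append((current[0], current[1:]))
            current = []
        else:
            current.append(token)
    if current:
        stages.append((current[0], current[1:]))
    return stages

stages = parse_pipeline(
    'getpage http://www.kimo.com.tw/faq.html , find "#id > div" , save_as hello.html'
)
for name, args in stages:
    print(name, args)
```

Each `(name, args)` pair could then be looked up in a table of workflow functions and run in order, mirroring the chained Python calls.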
0.6 proposal

    Dispatcher - manage the missions

References:

    Workflow: http://en.wikipedia.org/wiki/Getting_Things_Done
    Thinker's code: http://master.branda.to/downloads/pywebtool/

c('http://www.kimo.com.tw').get() . find('#id div') . save_as('h.html') . tar('a.tar')
semiproduct --------------> workflow --------------------> workflow ----------------> workflow-----------> product ----------> workflow
                                        semiproduct                  semiproduct