
diff lazywww/README @ 61:d26eea95c52d
new web fetcher proposal
author: hychen@mluna
date: Tue, 21 Oct 2008 01:36:28 +0800
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/lazywww/README	Tue Oct 21 01:36:28 2008 +0800
@@ -0,0 +1,96 @@
+"""
+    [Note] The project is not available yet.
+
+    A web page fetching tool chain with a jQuery-like selector and support for chained calls.
+    
+    Here is an example that shows the main idea: retrieve the content you want
+    from a div in one web page, then post to another web page with parameters
+    built from that content and retrieve the next piece of content you want.
+    Finally, store the result.
+    
+    def func(s):
+        msg = s.html()  # HTML of the selected element
+        return {'msg': msg}
+    
+    try:
+        c("http://example.tw/").get().find("#id > div") \
+            .build_param( func ).post_to("http://example2.com") \
+            .save_as('hellow.html')
+    except:
+        pass
+        
+    A more complex example:
+        
+    try:
+        c("http://example.tw/").retry(4, '5m').get() \
+            .find("#id > div") \
+            .build_param( func ).post_to("http://example2.com") \
+            .save_as('hellow.html') \
+            .end().find("#id2 > img").download('pretty-%s.jpg') \
+            .tar_and_zip("pretty_girl.tar.gz")
+    except NotFound:
+        print("the web page was not found.")
+    except NoPermissionTosave:
+        print("the files cannot be saved because of incorrect permissions.")
+    except Exception:
+        print("unknown error.")
+"""
+
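+Still in the design stage, validating the idea. Currently stuck on how to connect the workflows together... orz
+
+The notes here are rather messy; please bear with me.
+
+I originally set out to write a bot, but handling web pages in Python felt unintuitive?! At least, extracting specific HTML content is not as simple as with jQuery.
+I also saw thinker mention ideas for a page-fetching architecture on IRC, so while writing the bot I want to see whether I can build a usable little tool (oops, sidetracked again).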
+
+Fetching web pages resembles a factory production line. The flow is as follows:
+
+  fetch page             find content                store
+                         and process it
+
+  workflow ----------->  workflow --> product -----> workflow
+           semiproduct            
+
+
+Lazy WWW Proposal
+
+0.1
+	workflow architecture
+
+	jQuery-style selectors to make HTML parsing easier.
+	
+	http://phpimpact.wordpress.com/2008/08/07/php-simple-html-dom-parser-jquery-style/
+	
+	Simple Fetcher - get a web page
+
+	basic process hook - process the content to build an intermediate object / semiproduct
+
+0.2  
+	output serialization - c('http://www.example.com').build_dict(lambda x:x).to_xml()
+	
+0.3
+
+	Fetcher exception handling (retry)
+
+0.4 
+	Storage - save the product.
+	
+	 tar / zip c('http://www.kimo.com.tw').get().tar_and_gzip('hello.tgz')
+	
+0.5
+	Pipeline command support (the idea is from thinker)
+	    
+	 lzw getpage http://www.kimo.com.tw/faq.html , find "#id > div" , save_as hello.html
+
+0.6 proposal
+
+	Dispatcher - manage the tasks
+
+References:
+
+Workflow:       http://en.wikipedia.org/wiki/Getting_Things_Done 
+thinker's code: http://master.branda.to/downloads/pywebtool/
+
+c('http://www.kimo.com.tw').get()                       . find('#id div')           . save_as('h.html')         .    tar('a.tar')
+semiproduct --------------> workflow --------------------> workflow ----------------> workflow-----------> product ----------> workflow
+                                                      semiproduct                 semiproduct
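
The README says the author is still stuck on how to connect the workflows. One possible way to read the diagram is that each chained call takes the current semiproduct, runs one workflow on it, and hands the result to the next link. The sketch below illustrates that reading only. `Chain`, `pipe`, and `find_between` are hypothetical names, not part of lazywww, and the page is a literal string so nothing touches the network.

```python
class Chain:
    """Hypothetical stand-in for the proposed c(...) object."""

    def __init__(self, semiproduct, parent=None):
        self.semiproduct = semiproduct  # the value produced so far
        self._parent = parent           # previous link, used by end()

    def pipe(self, workflow):
        """Run one workflow on the semiproduct; return a new link with the result."""
        return Chain(workflow(self.semiproduct), parent=self)

    def end(self):
        """Step back to the previous semiproduct, in the spirit of jQuery's .end()."""
        if self._parent is None:
            raise ValueError("already at the start of the chain")
        return self._parent


def find_between(start, stop):
    """Toy 'find' workflow (assumption): return the text between two markers."""
    def workflow(text):
        begin = text.index(start) + len(start)
        return text[begin:text.index(stop, begin)]
    return workflow


page = "<div id='a'>hello</div><div id='b'>world</div>"
first = Chain(page).pipe(find_between("<div id='a'>", "</div>"))
second = first.end().pipe(find_between("<div id='b'>", "</div>"))
upper = second.pipe(str.upper)
```

Because every link keeps a reference to its parent, `.end()` can return to an earlier semiproduct and branch from it, which is what the more complex example in the README appears to rely on.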
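
The pipeline command proposed for 0.5 (`lzw getpage ... , find "#id > div" , save_as hello.html`) could be split into steps before dispatching them. The following is a minimal sketch, assuming steps are separated by a standalone `,` token and arguments follow shell quoting rules. `parse_pipeline` is a hypothetical helper, not an existing lazywww function.

```python
import shlex


def parse_pipeline(command):
    """Split a pipeline command into (step_name, arguments) pairs.

    Assumption: steps are separated by a comma surrounded by whitespace.
    """
    steps, current = [], []
    for token in shlex.split(command):
        if token == ",":
            if current:
                steps.append(current)
            current = []
        else:
            current.append(token)
    if current:
        steps.append(current)
    return [(step[0], step[1:]) for step in steps]


steps = parse_pipeline(
    'getpage http://www.kimo.com.tw/faq.html , find "#id > div" , save_as hello.html'
)
```

Each pair could then be mapped to the method of the same name on the chain object, so the command line and the Python API would share one set of workflows.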