diff lazywww/README @ 61:d26eea95c52d
new web fetcher proposal
author | hychen@mluna |
---|---|
date | Tue, 21 Oct 2008 01:36:28 +0800 |
parents | |
children |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/lazywww/README Tue Oct 21 01:36:28 2008 +0800
@@ -0,0 +1,96 @@
+"""
+ [Note] This project is not available yet.
+
+ A web page fetching toolchain with a jQuery-like selector and support for chained calls.
+
+ Here is an example that shows the main idea: retrieve the content you want
+ from a div box in one web page, then post to another web page and retrieve the next
+ wanted content there, using a param built from the content of the first retrieval.
+ Finally, store the product.
+
+ def func(s):
+     msg = s.html()
+     return {'msg': msg}
+
+ try:
+     c("http://example.tw/").get().find("#id > div") \
+         .build_param(func).post_to("http://example2.com") \
+         .save_as('hellow.html')
+ except:
+     pass
+
+ A more complex example:
+
+ try:
+     c("http://example.tw/").retry(4, '5m').get() \
+         .find("#id > div") \
+         .build_param(func).post_to("http://example2.com") \
+         .save_as('hellow.html') \
+         .end().find("#id2 > img").download('pretty-%s.jpg') \
+         .tar_and_zip("pretty_girl.tar.gz")
+ except NotFound:
+     print("the web page is not found.")
+ except NoPermissionTosave:
+     print("the files cannot be saved: incorrect permission.")
+ except Exception:
+     print("unknown error.")
+"""
+
+Still in the design stage, validating the idea. Currently stuck on how to connect the workflows together... orz
+
+The notes here are fairly messy; please bear with me.
+
+I originally set out to write a bot, but controlling web pages from Python felt unintuitive?! At least, extracting specific HTML content is not as simple as with jQuery.
+Then I saw thinker mention an idea for a page-fetching architecture on IRC, so while writing the bot I want to see whether I can build a usable little tool (oops, drifting off topic again)
+
+Fetching web pages resembles a factory production line. The flow is as follows:
+
+ fetch page            find specific content       save
+                       process
+
+ workflow -----------> workflow --> product -----> workflow
+           semiproduct
+
+
+Lazy WWW Proposal
+
+0.1
+ workflow architecture
+
+ A jQuery-style way to parse HTML more easily.
+
+   http://phpimpact.wordpress.com/2008/08/07/php-simple-html-dom-parser-jquery-style/
+
+ Simple Fetcher - get web page
+
+ basic process hook - process the content to build a middleware object / semiproduct
+
+0.2
+ output serialization - c('http://www.example.com').build_dict(lambda x:x).to_xml()
+
+0.3
+
+ Fetcher exception handling ( Retry )
+
+0.4
+ Storager - save the product.
+
+   tar / zip     c('http://www.kimo.com.tw').get().tar_and_gzip('hello.tgz')
+
+0.5
+ Pipeline command operation support. - ( the idea is from thinker )
+
+   lzw getpage http://www.kimo.com.tw/faq.html , find "#id > div" , save_as hello.html
+
+0.6 proposal
+
+ Dispatcher - manage the missions
+
+References:
+
+Workflow: http://en.wikipedia.org/wiki/Getting_Things_Done
+thinker's code: http://master.branda.to/downloads/pywebtool/
+
+c('http://www.kimo.com.tw').get() . find('#id div') . save_as('h.html') . tar('a.tar')
+semiproduct --------------> workflow --------------------> workflow ----------------> workflow-----------> product ----------> workflow
+                                      semiproduct                      semiproduct
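
The README says the open problem is how to connect the workflows together. One possible way to wire the chain is sketched here; it is not the project's code. Every call returns a stage object that carries the current semiproduct, and find() remembers the stage it came from so that end() can step back to it, in the spirit of jQuery's end(). The Workflow class, the pipe() hook, the injected fetcher callable and the dict used as a Storager are all assumptions made for illustration, and a regular expression stands in for the jQuery-like selector, which this sketch does not implement.

```python
import re


class Workflow:
    """One stage of the production line, holding the current semiproduct."""

    def __init__(self, semiproduct, fetcher, store, parent=None):
        self.semiproduct = semiproduct
        self.fetcher = fetcher  # hypothetical callable: url -> page text
        self.store = store      # plain dict standing in for the Storager
        self.parent = parent    # the stage that end() returns to

    def _stage(self, semiproduct, parent):
        # Each step yields a new stage, so earlier stages stay untouched.
        return Workflow(semiproduct, self.fetcher, self.store, parent)

    def get(self):
        """Fetch the page named by the current semiproduct (a URL)."""
        return self._stage(self.fetcher(self.semiproduct), self.parent)

    def find(self, pattern):
        """Narrow the semiproduct; a regex stands in for a real selector."""
        match = re.search(pattern, self.semiproduct, re.S)
        if match is None:
            raise LookupError("nothing matched %r" % pattern)
        return self._stage(match.group(1), parent=self)

    def pipe(self, func):
        """Process hook: turn the semiproduct into the next semiproduct."""
        return self._stage(func(self.semiproduct), self.parent)

    def save_as(self, name):
        """Store the product under a name, then let the chain continue."""
        self.store[name] = self.semiproduct
        return self

    def end(self):
        """Step back to the stage before the last find()."""
        if self.parent is None:
            raise LookupError("end() called without a matching find()")
        return self.parent


def c(url, fetcher, store):
    """Entry point named after the README's c(...); the extra arguments are assumptions."""
    return Workflow(url, fetcher, store)


# A canned page, so the sketch runs without network access.
PAGES = {"http://example.tw/": '<div id="a">hello</div><div id="b">world</div>'}
saved = {}

last = (
    c("http://example.tw/", PAGES.__getitem__, saved)
    .get()
    .find(r'<div id="a">(.*?)</div>')
    .pipe(str.upper)
    .save_as("a.txt")
    .end()
    .find(r'<div id="b">(.*?)</div>')
    .save_as("b.txt")
)
print(saved)
```

Because pipe() and save_as() keep the parent of the stage they were called on, a single end() after save_as() lands back on the fetched page, which is how the README's more complex example uses it. Retry, post_to and tar_and_gzip would be further methods of the same shape and are left out here.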