view lazywww/README @ 120:bdf025bc50ea
dump.yaml is tmp file, don't commit it.
author | "Rex Tsai <chihchun@kalug.linux.org.tw>" |
---|---|
date | Thu, 30 Oct 2008 15:10:07 +0800 |
parents | d26eea95c52d |
children |
""" [Note] the project is not available yet. A web page fetcing tool chain that has a JQuery-like selector and supports chain working. Here is an exmaple can show the the main idea, To restrive a content you want in a div box in a web page, and then post and restrive next wanted-content in the other web page with the param you just maked from the content in first restriving. finally, storage the production. def func(s): msg = s.html() return {'msg':msg} try: c("http://example.tw/").get().find("#id > div") \ .build_param( func ).post_to("http://example2.com") \ .save_as('hellow.html') except: pass more complex example try: c("http://example.tw/").retry(4, '5m').get() \ .find("#id > div"). \ .build_param( func ).post_to("http://example2.com") \ .save_as('hellow.html') \ .end().find("#id2 > img").download('pretty-%s.jpg'). \ tar_and_zip("pretty_girl.tar.gz") except NotFound: print "the web page is not found." except NoPermissionTosave: print "the files can not be save with incorrect permission." else: print "unknow error." """ 目前還在設計階段,驗證想法,目前卡關中… 卡在怎麼把workflow接在一起... orz 這邊的筆記滿亂的,請見諒。 本來是要寫bot的,但因為覺得python要控制網頁很不直覺?! 至少在取得html特定內容沒Jquery簡單, 又在IRC上看到thinker提到抓網頁架構想法,所以想嘗試在寫bot的過程中,看能不能時做出一個堪用的小工具 (誤, 又發散了 抓網頁的的動作與工廠生產線相似。 流程如下 取得網頁 找特定內容 儲存 加工 workflow -----------> workflow --> product -----> workflow semiproduct Lazy WWW Proposal 0.1 work flow 架構 Jquery-way to parse html easier. http://phpimpact.wordpress.com/2008/08/07/php-simple-html-dom-parser-jquery-style/ Simple Fetcher - get web page basic procces hook - process the content to build middleware object/ semiproduct 0.2 output serialize - c('http://www.example.com').build_dict(lambda x:x).to_xml() 0.3 Fetcher Exception hanldes ( Retry ) 0.4 Storager - save the production. tar / zip c('http://www.kimo.com.tw').get().tar_and_gzip('hello.tgz') 0.5 PipeLine Command operation supports. 
    - (the idea is from thinker)
      lzw getpage http://www.kimo.com.tw/faq.html , find "#id > div" , save_as hello.html
0.6 Proposal: Dispatcher - manage the missions

References:
    Workflow: http://en.wikipedia.org/wiki/Getting_Things_Done
    Thinker's code: http://master.branda.to/downloads/pywebtool/

c('http://www.kimo.com.tw').get() . find('#id div') . save_as('h.html') . tar('a.tar')
      semiproduct
--------------> workflow --------------------> workflow ----------------> workflow-----------> product ----------> workflow
                          semiproduct                    semiproduct
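
One possible way to connect the workflows, sketched under assumptions: every step wraps its semiproduct in a new object that remembers the previous step, so a jQuery-like end() can walk back, and a comma-separated command line in the style of the 0.5 item can be mapped onto the same steps. Nothing here is the project's actual API. The names Flow, pipe, STEPS, parse_pipeline, run_pipeline, and the text step are hypothetical, and PAGES is an in-memory stand-in for the network so the sketch runs offline.

```python
import re
import shlex


class Flow:
    """One station on the production line; it carries a semiproduct."""

    def __init__(self, value, parent=None):
        self.value = value    # the semiproduct held at this step
        self.parent = parent  # previous step, so end() can walk back

    def pipe(self, step, *args):
        """Feed the semiproduct to step and wrap the result in a new Flow."""
        return Flow(step(self.value, *args), parent=self)

    def end(self):
        """Return to the previous step, in the spirit of jQuery's end()."""
        if self.parent is None:
            raise ValueError("already at the first step")
        return self.parent


# Stand-ins (assumptions) so the sketch runs without a network
# or a real selector engine.
PAGES = {"http://example.tw/": "<div>hello</div>"}
SAVED = {}


def getpage(url):
    return PAGES[url]


def text(html):
    # Crude tag stripper; a real tool would use a proper HTML parser.
    return re.sub(r"<[^>]+>", "", html)


def save_as(value, name):
    SAVED[name] = value
    return value


STEPS = {"getpage": getpage, "text": text, "save_as": save_as}


def parse_pipeline(command):
    """Split 'step arg , step arg' into [(step_name, [args]), ...]."""
    stages = []
    for part in command.split(" , "):
        words = shlex.split(part)
        stages.append((words[0], words[1:]))
    return stages


def run_pipeline(command):
    """Run a command line; the first stage's argument seeds the chain."""
    stages = parse_pipeline(command)
    first_name, first_args = stages[0]
    flow = Flow(first_args[0]).pipe(STEPS[first_name])
    for name, args in stages[1:]:
        flow = flow.pipe(STEPS[name], *args)
    return flow


flow = run_pipeline("getpage http://example.tw/ , text , save_as hello.txt")
print(flow.value)              # hello
print(flow.end().end().value)  # <div>hello</div>
```

Because each step returns a new Flow rather than mutating the old one, earlier semiproducts stay available, which is what a call like .end().find(...) in the examples above would need.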