view lazywww/README @ 275:c1333052a4ed

fixed a typo
author "Rex Tsai <chihchun@kalug.linux.org.tw>"
date Tue, 02 Dec 2008 02:08:50 +0800
parents d26eea95c52d

"""
    [Note] The project is not available yet.

    A web page fetching toolchain with jQuery-like selectors and support for chained operations.
    
    Here is an example that shows the main idea: retrieve the content you want
    from a div box in one web page, build parameters from that content, post them to
    another web page to retrieve the next piece of wanted content,
    and finally store the product.
    
    def func(s):
        msg = s.html()
        return {'msg': msg}
    
    try:
        c("http://example.tw/").get().find("#id > div") \
            .build_param( func ).post_to("http://example2.com") \
            .save_as('hellow.html')
    except:
        pass
        
    A more complex example:
        
    try:
        c("http://example.tw/").retry(4, '5m').get() \
            .find("#id > div") \
            .build_param( func ).post_to("http://example2.com") \
            .save_as('hellow.html') \
            .end().find("#id2 > img").download('pretty-%s.jpg') \
            .tar_and_zip("pretty_girl.tar.gz")
    except NotFound:
        print("The web page was not found.")
    except NoPermissionTosave:
        print("The files cannot be saved because of incorrect permissions.")
    except Exception:
        print("Unknown error.")
"""
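
One way the chaining and .end() calls in the examples above could work is sketched below. Each step returns a new node that remembers its parent, so .end() can step back to the earlier selection. The names Chain and then, and the sample HTML string, are hypothetical and only for illustration; they are not part of Lazy WWW.

```python
class Chain(object):
    """Hypothetical chain node: a value plus a link to the previous node."""

    def __init__(self, value, parent=None):
        self.value = value
        self.parent = parent

    def then(self, func):
        # Apply func to the current value and return a NEW node,
        # so the previous state stays reachable through .parent.
        return Chain(func(self.value), parent=self)

    def end(self):
        # Step back to the previous node, similar to jQuery's .end().
        # The root node has no parent, so it returns itself.
        return self.parent if self.parent is not None else self


# Usage: narrow the page to its first part, then step back to the full page.
page = Chain("<div id='a'>hello</div><img src='x.jpg'>")
divs = page.then(lambda html: html.split("<img")[0])
back = divs.end()
```

Returning a new node on every call, instead of mutating one object, is what makes it possible to branch from an earlier point in the chain.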

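This is still in the design stage, validating ideas, and currently stuck... stuck on how to connect the workflows together... orz

The notes here are rather messy; please bear with me.

Originally I wanted to write a bot, but controlling web pages from Python felt unintuitive?! At least, extracting specific content from HTML is not as simple as it is with jQuery.
I also saw thinker mention ideas for a page-fetching architecture on IRC, so I want to try, while writing the bot, to build a usable little tool (oops, scope creep again).

Fetching web pages resembles a factory production line. The flow is as follows: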

  fetch page             find specific content       store
                         and process it

  workflow ----------->  workflow --> product -----> workflow
           semiproduct            
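
The production-line idea above can be sketched as a list of workflow functions, where each one takes the semiproduct of the previous step and returns the next. This is only a sketch under assumptions: run_workflows, fetch, find_box and store are hypothetical names, fetch returns canned HTML instead of using the network, and store returns a record instead of writing a file.

```python
def run_workflows(semiproduct, workflows):
    """Feed a semiproduct through each workflow in order."""
    for workflow in workflows:
        semiproduct = workflow(semiproduct)
    return semiproduct  # what the last workflow returns is the product


def fetch(url):
    # Stand-in for a real HTTP fetch: returns canned HTML, no network needed.
    return "<html><div id='box'>content of %s</div></html>" % url


def find_box(html):
    # Naive string search for one known div; not a real selector engine.
    opening = "<div id='box'>"
    start = html.index(opening) + len(opening)
    return html[start:html.index("</div>", start)]


def store(content):
    # Stand-in for saving to disk: describe what would be stored.
    return {"saved_as": "hello.html", "content": content}


product = run_workflows("http://example.tw/", [fetch, find_box, store])
```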


Lazy WWW Proposal

0.1
	Workflow architecture

	A jQuery-style way to parse HTML more easily.
	
	http://phpimpact.wordpress.com/2008/08/07/php-simple-html-dom-parser-jquery-style/
	
	Simple Fetcher - get web page

	Basic process hook - process the content to build a middleware object / semiproduct

0.2  
	Output serialization - c('http://www.example.com').build_dict(lambda x:x).to_xml()
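
A possible shape for the to_xml() step, assuming build_dict() yields a flat dict whose keys are valid XML tag names. The function name dict_to_xml and the root tag "result" are hypothetical. The sketch uses Python 3 and the standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET


def dict_to_xml(data, root_tag="result"):
    """Serialize a flat dict as one XML element per key."""
    root = ET.Element(root_tag)
    for key in sorted(data):  # sorted so the output order is stable
        child = ET.SubElement(root, key)
        child.text = str(data[key])
    # encoding="unicode" makes tostring return text instead of bytes.
    return ET.tostring(root, encoding="unicode")


xml_text = dict_to_xml({"msg": "hello", "title": "Lazy WWW"})
```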
	
0.3

	Fetcher exception handling (retry)
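
A minimal retry sketch, matching the retry(4, '5m') call used in the earlier example. Reading that call as "up to four attempts, five minutes apart" is an assumption, as are the unit suffixes (s, m, h), the choice of IOError as the retryable error, and the names parse_interval and retry as standalone functions.

```python
import time

# Assumed unit suffixes; the README only shows '5m'.
UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_interval(text):
    """Turn an interval such as '5m' into seconds."""
    return int(text[:-1]) * UNITS[text[-1]]


def retry(times, interval, action, sleep=time.sleep):
    """Call action() up to `times` times, sleeping between failed attempts."""
    if times < 1:
        raise ValueError("times must be at least 1")
    delay = parse_interval(interval)
    last_error = None
    for attempt in range(times):
        try:
            return action()
        except IOError as error:
            last_error = error
            if attempt < times - 1:  # no point sleeping after the last try
                sleep(delay)
    raise last_error
```

The sleep parameter is there so the waiting can be replaced in tests, for example with a list's append method to record the delays instead of actually pausing.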

0.4 
	Storager - save the product.
	
	 tar / gzip - c('http://www.kimo.com.tw').get().tar_and_gzip('hello.tgz')
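
The tar_and_gzip step could be built on Python's standard tarfile module, as in the sketch below. Taking a list of file paths as input is an assumption; the README does not say what the step receives. The usage part writes into a temporary directory so nothing else is touched.

```python
import os
import tarfile
import tempfile


def tar_and_gzip(paths, archive_name):
    """Pack the given files into a gzip-compressed tar archive."""
    with tarfile.open(archive_name, "w:gz") as archive:
        for path in paths:
            # Store only the file name, not the full directory path.
            archive.add(path, arcname=os.path.basename(path))
    return archive_name


# Usage: archive one small file inside a throwaway directory.
workdir = tempfile.mkdtemp()
page_path = os.path.join(workdir, "hello.html")
with open(page_path, "w") as handle:
    handle.write("<html>hello</html>")
archive_path = tar_and_gzip([page_path], os.path.join(workdir, "hello.tgz"))
with tarfile.open(archive_path, "r:gz") as archive:
    names = archive.getnames()
```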
	
0.5
	Pipeline command operation support - (the idea is from thinker)
	    
	 lzw getpage http://www.kimo.com.tw/faq.html , find "#id > div" , save_as hello.html
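
A sketch of how such a command line might be split into steps, assuming a standalone comma separates the steps and that the program name (lzw) has already been removed. The function name parse_pipeline is hypothetical. shlex.split from the standard library handles the quoted selector.

```python
import shlex


def parse_pipeline(command_line):
    """Split 'step args , step args' into a list of (name, args) tuples."""
    steps = []
    current = []
    for token in shlex.split(command_line):
        if token == ",":  # a bare comma ends the current step
            if current:
                steps.append((current[0], current[1:]))
            current = []
        else:
            current.append(token)
    if current:
        steps.append((current[0], current[1:]))
    return steps


steps = parse_pipeline(
    'getpage http://www.kimo.com.tw/faq.html , find "#id > div" , save_as hello.html'
)
```

Each (name, args) pair could then be mapped onto the matching chained method, such as find or save_as.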

0.6 proposal

	Dispatcher - manage the tasks

References:

Workflow:       http://en.wikipedia.org/wiki/Getting_Things_Done 
Thinker's code: http://master.branda.to/downloads/pywebtool/

c('http://www.kimo.com.tw').get()                       . find('#id div')           . save_as('h.html')         .    tar('a.tar')
semiproduct --------------> workflow --------------------> workflow ----------------> workflow-----------> product ----------> workflow
                                                      semiproduct                 semiproduct