
"""
    [Note] The project is not available yet.

    A web page fetching tool chain with a jQuery-like selector and support for chained operations.

    Here is an example that shows the main idea. Retrieve the content you want
    from a div in a web page, then post to another web page with parameters
    built from that content and retrieve the next piece of wanted content there.
    Finally, store the product.

    def func(s):
        # Build the POST parameters from the selected element's HTML.
        msg = s.html()
        return {'msg': msg}

    try:
        c("http://example.tw/").get().find("#id > div") \
            .build_param( func ).post_to("http://example2.com") \
            .save_as('hellow.html')
    except:
        pass

    A more complex example:

    try:
        c("http://example.tw/").retry(4, '5m').get() \
            .find("#id > div") \
            .build_param(func).post_to("http://example2.com") \
            .save_as('hellow.html') \
            .end().find("#id2 > img").download('pretty-%s.jpg') \
            .tar_and_zip("pretty_girl.tar.gz")
    except NotFound:
        print("the web page was not found.")
    except NoPermissionTosave:
        print("the files cannot be saved because of incorrect permissions.")
    except Exception:
        # Catch-all for anything else; a bare `else:` would run only on success.
        print("unknown error.")
"""
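
The notes do not say how `retry(4, '5m')` should behave. A minimal sketch of one possible reading, assuming `4` is the total number of attempts and `'5m'` is a five-minute pause between them. The names `parse_interval` and `with_retry`, and the `s`/`m`/`h` suffixes, are hypothetical and not part of the project:

```python
import time

# Assumed unit suffixes; the notes only ever show '5m'.
_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_interval(text):
    """Convert a string such as '5m' into seconds (assumed format)."""
    if len(text) < 2 or text[-1] not in _UNITS or not text[:-1].isdigit():
        raise ValueError("unsupported interval: %r" % (text,))
    return int(text[:-1]) * _UNITS[text[-1]]


def with_retry(action, times, interval, sleep=time.sleep):
    """Call action() up to `times` times, pausing between failed attempts."""
    delay = parse_interval(interval)
    for attempt in range(1, times + 1):
        try:
            return action()
        except Exception:
            if attempt == times:
                raise  # out of attempts: let the caller see the error
            sleep(delay)
```

Passing `sleep` as a parameter keeps the helper easy to exercise without actually waiting.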

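Still in the design stage and validating ideas. Currently stuck on how to connect the workflows together... orz

The notes here are rather messy; please bear with me.

I originally set out to write a bot, but controlling web pages from Python felt unintuitive. At least, extracting specific HTML content is not as easy as with jQuery.
I also saw thinker mention an idea for a web-fetching architecture on IRC, so while writing the bot I want to see whether I can build a usable little tool along the way (oops, off on a tangent again).

Fetching web pages resembles a factory production line. The flow is as follows: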

  fetch page            find specific content          store
                        and process it

  workflow ----------->  workflow --> product -----> workflow
           semiproduct
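
The production-line idea can be sketched in a few lines. This only illustrates the chaining pattern: the `Workflow` class and its `step` method are hypothetical, and plain functions with canned data stand in for the fetch, find, and store stages.

```python
class Workflow:
    """Holds a semiproduct and hands it to the next stage (hypothetical)."""

    def __init__(self, semiproduct):
        self.semiproduct = semiproduct

    def step(self, stage):
        # Each stage consumes the current semiproduct and returns a new one;
        # wrapping the result in a new Workflow keeps the chain going.
        return Workflow(stage(self.semiproduct))


def fetch(url):
    # Stand-in for a real HTTP request: returns canned HTML.
    return "<div id='box'>hello</div>"


def find_box(html):
    # Stand-in for a jQuery-like selector: pull out the div's text.
    start = html.index(">") + 1
    return html[start:html.index("<", start)]


store = {}


def save(text):
    # Stand-in for writing a file: keep the product in a dict.
    store["product"] = text
    return text


result = Workflow("http://example.tw/").step(fetch).step(find_box).step(save)
```

Because every `step` returns a new `Workflow`, stages can be appended in any order without the stages knowing about each other.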


Lazy WWW Proposal

0.1
	workflow architecture

	jQuery-style way to parse HTML more easily.

	http://phpimpact.wordpress.com/2008/08/07/php-simple-html-dom-parser-jquery-style/

	Simple Fetcher - get web page

	basic process hook - process the content to build a middleware object / semiproduct

0.2
	output serialization - c('http://www.example.com').build_dict(lambda x:x).to_xml()

0.3

	Fetcher exception handling (retry)

0.4
	Storager - save the product.

	 tar / gzip: c('http://www.kimo.com.tw').get().tar_and_gzip('hello.tgz')

0.5
	Pipeline command operation support (the idea is from thinker)

	 lzw getpage http://www.kimo.com.tw/faq.html , find "#id > div" , save_as hello.html

0.6 proposal

	Dispatcher - manage the tasks
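
The 0.5 pipeline command could be turned into a list of steps before being mapped onto the chained calls. A rough sketch, assuming steps are separated by a standalone `,` token as in the example line; the real `lzw` syntax is not specified, and `parse_pipeline` is a hypothetical helper:

```python
import shlex


def parse_pipeline(command):
    """Split an `lzw` command line into (name, args) steps.

    Assumes each step is separated by a standalone ',' token, as in the
    example from the notes; quoting follows POSIX shell rules via shlex.
    """
    tokens = shlex.split(command)
    if tokens and tokens[0] == "lzw":
        tokens = tokens[1:]  # drop the program name
    steps, current = [], []
    for token in tokens:
        if token == ",":
            if current:
                steps.append((current[0], current[1:]))
            current = []
        else:
            current.append(token)
    if current:
        steps.append((current[0], current[1:]))
    return steps


steps = parse_pipeline(
    'lzw getpage http://www.kimo.com.tw/faq.html , find "#id > div" , save_as hello.html'
)
```

Each `(name, args)` pair could then be looked up as a method on the chain object, which keeps the command line and the Python API in step with each other.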

References:

Workflow:       http://en.wikipedia.org/wiki/Getting_Things_Done
Thinker's code: http://master.branda.to/downloads/pywebtool/

c('http://www.kimo.com.tw').get()                       . find('#id div')           . save_as('h.html')         .    tar('a.tar')
semiproduct --------------> workflow --------------------> workflow ----------------> workflow-----------> product ----------> workflow
                                                      semiproduct                 semiproduct
author	"Rex Tsai <chihchun@kalug.linux.org.tw>"
date	Wed, 19 Nov 2008 12:23:30 +0800
parents	d26eea95c52d