view doc/v2_planning/coding_style.txt @ 1428:3823dbfff6cf

add parameter to randomize the valid and test data.
author Frederic Bastien <nouiz@nouiz.org>
date Tue, 08 Feb 2011 12:57:15 -0500
parents 7185d194bb7e
children
line wrap: on
line source

Discussion of Coding-Style
==========================

Participants
------------
- Dumitru
- Fred
- David
- Olivier D [leader]

Open for public debate
----------------------

    * File header:
        - Do we put the accents in 'Universite de Montreal'?
            OD: No (restricting code to ASCII characters is much safer)

        - Do we put the Mercurial version number in each file?
            OD: No (useless in my experience, if it's a release the version
                number can be provided in the README for instance, and in
                addition Mercurial IDs cannot be easily compared to figure
                out which of two versions is most recent)

    * OD: I like always doing the following when subclassing

.. code-block:: python

      a class A:
        class B(A):
            def __init__(self, b_arg_1, b_arg_2, **kw):
                super(B, self).__init__(**kw)
                ...

      The point here is that the constructor always allow for extra keyword
      arguments (except for the class at the very top of the hierarchy), which
      are automatically passed to the parent class.
      Pros:
        - You do not need to repeat the parent class arguments whenever you
          write a new subclass.
        - Whenever you add an argument to the parent class, all child classes
          can benefit from it without modifying their code.
      Cons:
        - One needs to look at the parent classes to see what these arguments
          are.
        - You cannot use a **kw argument in your constructor for your own
          selfish purpose (well, you can actually, but it would look a bit
          hackish).
        - I have no clue whether one could do this with multiple inheritance.
        - Pb if super class adds an argument that has same name as a child class.
      Question: Should we encourage this in Pylearn?
      JB: +0.5
      OD: Was discussed in lab meeting. The feeling was that the last Con was
          too dangerous. Note however that if we have some system that
          automatically generates proper doc (i.e. with all arguments, by
          asking the parent class as well), it could detect the situation
          mentioned in that last Con (and solve the first one as well).


Closed for public debate
------------------------

   * Imperative vs. third-person comments.
        # Return the sum of elements in x.  <-- imperative
        # Returns the sum of elements in x. <-- third-person
     OD: I am used to the imperative form and like it better only because it
         typically saves one letter (the 's') and is easier to conjugate.
     JB: What about being compatible with markup formats that have a :returns:
         tag?
     OD: That'd make sense. However, when I wrote the above I hadn't looked
         closely at PEP257 yet, and I just noticed the following official
         recommendation for one-line docstrings in it:

             The docstring is a phrase ending in a period. It prescribes the
             function or method's effect as a command ("Do this", "Return that"), not as a
             description; e.g. don't write "Returns the pathname ...".

         Anyone knows which style is most popular in the open-source
         community?
     OD: In lab meeting Yoshua ruled out: it is a waste of time to even
         discuss it. So we let everyone do it the way they like it best.

    * Avoid contractions in code comments (particularly in
      documentation): "We do not add blue to red because it does not look good"
      rather than "We don't add blue to red because it doesn't look good".
      OD: I mostly find it to be cleaner (been used to it while writing
          scientific articles too).
      JB: +1
      OD: Discussed in lab meeting, and agreed on.

   * Use imports for packages and modules only. I.e. avoid
        from foo import *
        from foo import Bar
     OD: Overall I agree with this. However we probably want to allow some
        exceptions, like:
            from itertools import imap, izip
        Also, some people may want to have shortcuts like
            from theano import tensor as T
        but I would prefer to forbid this. It is handy when trying stuff in
        the interactive interpreter, but in real code it can easily get messy
        when you want to copy / paste different pieces of code and they use
        different conventions. Typing tensor.* is a bit longer, but a lot more
        portable.
     JB: I thought that these are nice:
         - "from foo import Bar" 
         - "from foo import Bar, Blah"

        What's wrong with them?  They keep the code listing short and readable.
        I would discourage these forms when symbols 'Bar' and 'Blah' are
        ambiguous, in which case the parent module prefix serves to disambiguate
        them in the code.
        I agree that the "import A as B" form should be discouraged in general,
        because that's just confusing and makes code less grep-friendly.
     OD: I agree that "from foo import Bar, Blah" is sometimes convenient
        (typically when you re-use Bar / Blah many times in the same file),
        and would vote in favor of accepting it when it is appropriate.
        This guideline was taken from Google's coding recommendation:
            "from foo import * or from foo import Bar is very nasty and can
             lead to serious maintenance issues because it makes it hard to find
             module dependencies."
     OD: Decision was taken in committee's meeting to allow
            from foo import Bar, Blah
         when imported stuff is re-used multiple times in the same file, and
         there is no ambiguity.
     DWF: One exception I'd like to propose to the "import A as B" moratorium
          is that we adopt the "import numpy as np" standard that's used in
          NumPy and SciPy itself. For NumPy heavy code this really cuts down
	  on clutter, without significant impact on readability (IMHO).
     OD: We discussed it during a meeting. We agreed not to use the 'np'
         abbreviation at first. We may still revisit this point in the future
         if we find situations where it would really help.

   * Imports should usually be on separate lines.
     OD: I would add an exception, saying it is ok to group multiple imports
        from the standard library on a single line, e.g.
            import os, sys, time
        I just don't see much benefit in putting them on separate lines (for
        third-party imports I agree it is best to keep them separate, as it
        makes dependencies clearer, and diffs look better when someone adds /
        removes an import).  Does anyone see a good reason to keep standard
        library imports on different lines?
     JB: what does 'usually' mean here? The guideline seems vacuous.
     OD: Sorry my fault, I did not quote the whole guideline from PEP8. The
         'usually' was because of what followed:
            it's okay to say this though:
                from subprocess import Popen, PIPE
         (which btw contradicts Google's recommendation mentioned previously)
     OD: Decision was taken in committee's meeting to allow multiple imports
         on the same line for standard library modules (only).

    * The BDFL recommends inserting a blank line between the
      last paragraph in a multi-line docstring and its closing quotes, placing
      the closing quotes on a line by themselves. This way, Emacs'
      fill-paragraph command can be used on it.
      OD: I think it is ugly and I have not seen it used much. Any Emacs
          user believes it is a must?
      OD: Decision was taken in committee's meeting to drop this
          recommendation.

    * JB: How should we combine capitalization and underscores to name classes
          and functions related to an algorithm like 'SGD' or a model like 'RBM'
          whose common name is capitalized?  Case in point: How should I name a
          Hybrid Monte Carlo Sampler?  Should I use the common HMC abbreviation?
      OD: This one is answered by PEP8 (search HTTPServerError in it).
          You should use:
            RBMClassName
            rbm_function_name
          As far as using abbreviations is concerned:
            All identifiers in the Python standard library (...) SHOULD use
            English words wherever feasible (in many cases, abbreviations and
            technical terms are used which aren't English).
          so I guess HMC is ok when using Hybrid Monte Carlo is considered to
          make some names too long.


Note about warnings
-------------------

Fred: This is a refactored thing from James email of what we should put in message
that we send to the user:
1) Hint where in the code this log come from.
2) Hint how to hide this message? or we should this into documentation.
3) Tell explicitly if the user can ignore it and the consequence.

Existing Python coding style specifications and guidelines
----------------------------------------------------------

  * Must-read
    * Official Python coding style guide: http://www.python.org/dev/peps/pep-0008
    * Official docstring conventions: http://www.python.org/dev/peps/pep-0257
    * Google Python Style Guide: http://google-styleguide.googlecode.com/svn/trunk/pyguide.html
  * Interesting
    * Code Like a Pythonista: http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html
    * Numpy notes on conversion to Python 3: http://projects.scipy.org/numpy/browser/trunk/doc/Py3K.txt
  * Can skip
    * Python style for university class: http://www.cs.caltech.edu/courses/cs11/material/python/misc/python_style_guide.html
    * Mailman coding style: http://barry.warsaw.us/software/STYLEGUIDE.txt
    * Some company coding style: http://self.maluke.com/style
    * Chandler coding style: http://chandlerproject.org/Projects/ChandlerCodingStyleGuidelines
    * Outdated recommendations: http://lists.osafoundation.org/pipermail/dev/2003-March/000479.html
    * Mostly some beginners tips: http://learnpython.pbworks.com/PythonTricks
    * More beginners tips: http://eikke.com/how-not-to-write-python-code/
    * Cogent coding guidelines: http://jaynes.colorado.edu/PythonGuidelines.html
    * Djangoo coding guidelines: http://docs.djangoproject.com/en/dev/internals/contributing/#coding-style
    * Numpy documentation style guidelines: http://projects.scipy.org/numpy/wiki/CodingStyleGuidelines 
    * Some random guy guidelines (nothing special): http://www.voidspace.org.uk/python/articles/python_style_guide.shtml

We will probably want to take PEP-8 as starting point, and read what other
people think about it / how other coding guidelines differ from it.

OD: Things about PEP 8 I don't like (but it may be just me):

   * If necessary, you can add an extra pair of parentheses around an
     expression, but sometimes using a backslash looks better.
     --> I rarely find that backslash looks better. In most situations you can
     get rid of them. Typically I prefer:

..code-block:: python

            if (cond_1 and
                cond_2 and
                cond_3):

     to

..code-block:: python

            if cond_1 and \
               cond_2 and \
               cond_3:

   * You should use two spaces after a sentence-ending period.
     --> Looks weird to me.
     (DWF: This is an old convention from the typewriter era. It has more
     or less been wiped out by HTML's convention of ignoring extra 
     whitespace: see http://en.wikipedia.org/wiki/Sentence_spacing for
     more detail. I think it's okay to drop this convention in source code.)
     OD: Cool, thanks, I guess we can drop it then.

    * Missing in PEP 8:
        - How to indent multi-line statements? E.g. do we want
            x = my_func(a, b, c,
                d, e, f)
          or
            x = my_func(a, b, c,
                        d, e, f)
          or
            x = my_func(
                a, b, c, d, e, f)
          --> Probably depends on the specific situation, but we could have a
            few typical examples (and the same happens with multi-lines lists)
	  (Fred: I would do 2 or 3, but not 1. I find it more redable when the
	         indent is broken after a paranthesis then at any point.
      OD: After thinking about it, I agreee as well. My recommendation would
          be to go with 2 when it can fit on two lines, and 3 otherwise. Same
          with lists.

Documentation
-------------

How do we write docs?

Ideas (DE):

   * Most major Python projects suggest following PEP-257:
     http://www.python.org/dev/peps/pep-0257/, which contains conventions on
     writing docstrings (what they should contain, not the specific markup)
     for Python. These are very general conventions, however,.
   
   * Numpy, in particular, has a very nice page on how to document things if
     contributing: http://projects.scipy.org/numpy/wiki/CodingStyleGuidelines
     (it's mostly about documentation, not coding style, despite the page
     name). 
  
   * A pretty good example from numpy, with relevant comments:
     http://github.com/numpy/numpy/blob/master/doc/example.py 
  
   * A real-life example (record arrays) from numpy:
     http://github.com/numpy/numpy/blob/master/numpy/core/records.py
 
   * The recommendations are quite sane and common-sense, we should follow them.
 
   * numpy's way of doing things is a bit different from the way we currently
     document Theano: they don't use param/type/rtype, for instance, but nice
     readable section titles. I personally find their approach better-looking
     and they do have a sphinx extension that would allow us to have the same
     style
     (http://github.com/numpy/numpy/blob/master/doc/sphinxext/numpydoc.py).
     The disadvantage of taking this approach is that Theano and Pylearn will
     be documented slightly differently

   * Make sure that what we write is compatible with tools like sphinx's
     autodoc extension:
     http://sphinx.pocoo.org/ext/autodoc.html#module-sphinx.ext.autodoc (which
     we will most probably use to generate semi-automatic pretty docs)
  
   * Nice cheat-sheet for docutils:
     http://docutils.sourceforge.net/docs/user/rst/quickref.html
  
   * http://docs.python.org/release/2.5.2/lib/module-doctest.html -
     in-documentation unit-testing: we should perhaps encourage people to
     write such things where warranted (where there are interesting usage
     examples).  notetests can automatically run those, so no configuration
     overhead is necessary.

Compatibility with various Python versions
------------------------------------------

    * Which Python 2.x version do we want to support?

    * Is it reasonable to have coding guidelines that would make the code as
      compatible as possible with Python 3?

C coding style
--------------

We also need a c-style coding style.

Meeting 2010/09/09
------------------

   * Coding guidelines
     PEP 8 & Google should be a good basis to start with.
     Task: Highlight the most important points in them (OD).

   * Documentation
     Use RST with Sphinx.
     Task: Provide specific examples on how to document a class, method, and some
     specific classes like Op (DE). Modify the theano documentation to include that.
     OD: May want to check out
     http://projects.scipy.org/numpy/wiki/CodingStyleGuidelines

   * Python versions to be supported
     Support 2.4 (because some of the clusters are still running 2.4) and write
     code that can be converted to 3.x with 2to3 in a straightforward way.
     Task: Write to-do's and to-not-do's to avoid compatibility issues. (OD)

   * C coding style
     How to write C code (in particular for Numpy / Cuda), and how to mix C and
     Python.
     Task: See if there would be a sensible C code style to follow (maybe look how
     Numpy does it), and how projects that mix C and Python deal with it (e.g. use
     separate files, or be able to have mixed syntax highlighting?) (FB)

   * Program output
     Use the warning and logging modules. Avoid print as much as possible.
     Task: Look into these modules to define general guidelines e.g. to decide when
     to use warning instead of logging. (DWF)

   * Automatized code verification
     Use pychecker & friends to make sure everything is fine.
     Task: Look into the various options available (DE)
     Result: See sections 'Tools to help us out' and 'Automating and enforcing coding
     style'

   * Tests
     Force people to write tests. Automatic email reminder of code lines not
     covered by tests (see if we can get this from nosetests). Decorator to mark
     some classes / methods as not being tested yet, so as to be able to
     automatically warn the user when he is using untested stuff (and to remind
     ourselves we should add a test).
     Task: See feasibility. (OD)
     Result: See section 'Enforcing strict testing policy'.

   * VIM / Emacs plugins / config files
     To enforce good coding style automatically.
     Task: Look for existing options. (FB)
     (DWF: I have put some time into this for vim, I will send around my files)

Suggestion by PV
----------------

Have a sample code that showcases everything one should comply to.

Fred's suggestion to solve issue with hashlib not available in Python 2.4:
--------------------------------------------------------------------------

You can do as in theano.gof.cc:

    ..code::

	if sys.version_info[:2] >= (2,5):
	    import hashlib
    	    def hash_from_code(msg):
                return hashlib.md5(msg).hexdigest()
        else:
	    import md5
	    def hash_from_code(msg):
	        return md5.new(msg).hexdigest()


Mercurial commits
-----------------

   * How to write good commit messages?
    OD: Check Django's guidelines (link above)
   * Standardize the merge commit text (what is the message from fetch?)

During committee's meeting, Fred mentioned a bug with Assembla links for
multi-line commits.

OD: About what I mentioned in a meeting:
    - The size of the first line does not really matter (although keeping it
      short is better, it is not truncated when you do 'hg log')
    - The fetch commit message is like "Automated merge with
      https://theanoclone.googlecode.com/hg/". It is too long to ask people to
      use the same. I guess "Merge" would be the most logical message to use.

FB: The proposed guidelines
    * A one line summary. Try to keep it short, and provide the information
      that seems most useful to other developers: in particular the goal of
      a change is more useful than its description (which is always
      available through the changeset patch log). E.g. say "Improved stability
      of cost computation" rather than "Replaced log(exp(a) + exp(b)) by
      a * log(1 + exp(b -a)) in cost computation".
    * If needed a blank line followed by a more detailed summary
    * Make a commit for each logical modification
        * This makes reviews easier to do
        * This makes debugging easier as we can more easily pinpoint errors in commits with hg bisect
    * NEVER commit reformatting with functionality changes
    * HG RECORD/DIFF are your friend
        * hg record allows you to select which changes to a file should be
          committed
        * hg record / diff force you to review your code, never commit without
          running one of these two commands first
    * Stuff from django guide
        * Write detailed commit messages in the past tense, not present tense.
            * Good: "Fixed Unicode bug in RSS API."
            * Bad: "Fixes Unicode bug in RSS API."
            * Bad: "Fixing Unicode bug in RSS API."
        * Separate bug fixes from feature changes.
        * If fix a ticket, make the message start with "Fixed #abc"
            * Can make a system to change the ticket?
        * If reference a ticket, make the message start with "Refs #abc"
            * Can make a system to put a comment to the ticket?


Tools to help us out
---------------------

Dumi:

  * pylint: highly configurable and very popular tool, similar in spirit to lint
    for C. Can specify a config file, customize/disable warnings and errors, hook
    it to vim/emacs and include coding style convensions in the check too. A nice
    feature is that you can include a comment like "# pylint: disable-msg=C0103"
    into a file and disable a message locally. This is nice and dangerous at the
    same time. Another cool feature is incremental checking with caching of
    results, which also allows tracking of progress.

  * pyflakes: pylint alternative that is supposedly faster, but is I think more
    limited in the number of things it is good at: "PyFlakes will tell you when
    you have forgotten an import, mistyped a variable name, defined two functions
    with the same name, shadowed a variable from another scope, imported a module
    twice, or two different modules with the same name, and so on.". Most reviews
    found online praise the speed, but note that pylint is clearly superior in
    every other respect.

  * pychecker: it actually *imports* each module (not sure if pylint does this).
    It seems that pylint = pychecker + coding style and that pylint is more
    popular.

  * pep8: if all you care is about obeying PEP-8:
    http://pypi.python.org/pypi/pep8 (includes the actual PEP-8 snippets with the
    errors found, which is neat). Otherwise, pylint seems like a superset of this. 

  * http://www.doughellmann.com/articles/pythonmagazine/completely-different/2008-03-linters/index.html
  - article from 2008 comparing pylint, pychecker, and pyflakes. The conclusion
    is to use pylint, more or less.
 
I say we stick with pylint for now as it provides a great degree of flexibility
in a single mature package.

  * vim + pylint: http://www.vim.org/scripts/script.php?script_id=891
  * emcas + pylint: http://www.emacswiki.org/emacs/PythonProgrammingInEmacs#toc5

Automating and enforcing coding style
-------------------------------------

Ideally, we would like to have a uniform approach to this, where everyone tests
against the same tool(s) and uses the same list of disabled warnings etc.

Dumi: there are several ways of approaching this, independently of the tools used:

   * Create a precommit hook for mercurial, which runs the tool(s) of choice and
     generates warnings or aborts the commit process. This hook is a simple Python
     module (well, as simple as we want it to be), which we can include into
     everyone's hgrc, in the precommit.pylint variable, for instance. An example
     is http://github.com/jrburke/dvcs_jslint/blob/master/dvcs_jslint.js. The
     advantage of this approach is that the load is distributed and
     errors/warnings are caught client-side, before the commit.

   * Another client-side option is to have editor plugins for the various style
     checkers: vim and emacs can access pylint pretty easily for instance.

   * Instead of doing this client-side, one can do things server-side. On
     Assembla, this means using their Webhooks
     (http://www.assembla.com/spaces/demostuff/webhook_tool), since HTTP-based
     hooks that we would need to tie with our buildbot server (whichever server we
     choose that to be).

   * I (DE) prefer starting with the client-side approach, as it is easier to
     implement, has no single point of failure and is deployable fast. We could
     have a "batch" script that runs our lint tools in conjunction with hg
     annotate and sends hate-mail once a week to offenders who have somehow
     slipped things through the cracks. Also on the server-side we could run
     time-consuming checking (though how such checks would differ from tests is
     unclear).

Note that:

  * I haven't found anything ready-made online, so we need to write these
    hooks ourselves.
  * I think we should make it so that it is not possible to commit things if
    pylint reports an actual error.

Type checking
-------------

(Suggested by Francois Savard)

vu que vous êtes en train de vous occuper de l'aspect coding style, je
mentionne ceci, à faire ce que vous en voulez: j'aime bien éviter des
erreurs sur l'ordre de mes paramètres, sur les assumptions sur les
paramètres etc. en faisant des argument check. Ça remplace un peu le
static type checking des langages genre Java.

En Python y'a une façon élégante de définir ses propres typecheckers,
value checkers etc. et ensuite les passer en paramètre à un décorateur de
fonction:

http://code.activestate.com/recipes/454322-type-checking-decorator/

(Juste un exemple, vu que les checks peuvent être plus élaborés, inclure
des value checks (>0 etc.), être flexibles pour ne pas demander que ce
soit un type fixe mais plutôt que ça réponde à certaines contraintes (que
ça "ressemble" à un float, p. ex.). J'avais développé une lib pour faire
qqch du genre en Javascript).

Je ne sais pas si vous comptiez parler de ça, et si ça vaut la peine, mais
personnellement je préfère du code à des commentaires qui peuvent être out
of sync avec le contenu d'une méthode. Si vous croyez que ça vaut la peine,
vous pourriez p-e définir des type/value-checkers standards pour éviter que
tout le monde redéfinissent les siens à sa façon.

OD: This was discussed in committee's meeting. We agreed to provide ways to do
this, but not to enforce its usage.

Enforcing strict testing policy
-------------------------------

The `coverage` third-party module provides a way to gather code coverage
statistics in the test suite. `nosetests` has a plugin that can be activated
with the --with-coverage option to use this module.
It is possible to know which lines specifically lack coverage. However, we
will probably want to post-process this data to do more than a simple report
(which noone will care about). This could be done either by parsing nosetests'
coverage output, or modifying its coverage plugin, or writing our own version
of it. The main goal would be to identify who is responsible for writing lines
that are not currently covered (using 'hg annotate'), in order to send email
notifications.

We should aim at 100% code coverage in tests. This is realistic because
`coverage` offers ways to ignore coverage for lines we explicitely do not want
to cover (typically debug code, or AssertionError / NotImplementedError that
are not supposed to be triggered during normal usage).
We may need to do some advanced processing though to e.g. collect results from
multiple build bots, if for instance some bot is running tests without GPU
support, and another one is taking care of the GPU tests.

Code that should be tested but for which no test is currently written would
also require some decorator / helper function that would trigger a warning at
run-time (only once / execution). This could be enforced by adopting a
different policy about lack-of-coverage notification emails, depending on
whether or not the warning is present:
- if there is no warning, daily email notification (ADD A WARNING!!!)
- if there is a warning, weekly email notification (ADD A TEST!!!)

Meeting 2010/09/16
------------------

Tasks to be performed by tomorrow:
    * OD:
        * Write down summary of Python coding style recommendations
        * Start a file that showcases those guidelines
    * DWF:
        * Look into recommendations on how to document a class, method, ...
        * Write recommendations on when to use logging vs. warning
        * Make public some configuration files / plugins for vim
        * Come up with official common file header (license in particular)

Script usage documentation
--------------------------

OD: It would be nice to have some standardized way of parsing a script's
arguments and displaying the script usage doc to the user.

Recommendations for serialization
---------------------------------

We need to add coding style guidelines to make sure code is properly
serializable.

Meeting 2010/09/24
------------------

FB: Look into commit guidelines.
DE: Write guidelines on how to document a class / method (maybe also some
typical class like an Op)
DWF: Write guidelines on how to write serializable code
OD: Finish code sample that showcases all (or many) guidelines, look into
feasibility of passing arguments to super classes with **kw.

Meeting 2010/10/01
------------------
OD: Do your job!
DE: Send email to ask the lab whether we should go numpy style or theano style
for class / method documentation, and move guidelines to API file.
DWF: Finish guidelines for serializable code
FB: Write guidelines for commits

Doc format in header?
---------------------

OD: Should we add in each file's header something like:
    __docformat__ = "restructuredtext en"