view doc/v2_planning/API_coding_style.txt @ 1201:46527ae6db53

architecture: Clarified what I meant about saving the model
author Olivier Delalleau <delallea@iro>
date Mon, 20 Sep 2010 17:05:15 -0400
parents 5783948b3f8c
children d94d1675c7e6
line wrap: on
line source

=========================
 Coding Style Guidelines
=========================

Note: until the Pylearn documentation is properly compiled, you can view
the HTML version of this document `here
<http://www.iro.umontreal.ca/~delallea/tmp/coding_style/html/API_coding_style.html>`_.

Main Goals
==========

    * Code should be compatible with Python 2.4 and above (using 2to3 for
      conversion to Python 3.x). This may not be possible in the short term
      for Theano-dependent code.

    * Code should be easy to read, understand and update by developers and
      users.

    * Code should be well-documented and well-tested.

Python Coding Guidelines
========================

Official Guidelines
-------------------

Source Material
~~~~~~~~~~~~~~~

The four main documents describing our Python coding guidelines are:
    * `PEP 8 -- Style Guide for Python Code
      <http://www.python.org/dev/peps/pep-0008>`_
    * `Google Python Style Guide
      <http://google-styleguide.googlecode.com/svn/trunk/pyguide.html>`_
    * `PEP 257 -- Docstring Conventions
      <http://www.python.org/dev/peps/pep-0257>`_
    * `Numpy Docstring Standard
      <http://projects.scipy.org/numpy/wiki/CodingStyleGuidelines#docstring-standard>`_


However, there are a few points mentioned in those documents that we decided
to do differently:

    * Use only one space (not two) after a sentence-ending period in comments.

      .. code-block:: python

        # Good.
        # This is the first sentence. It is followed by a single blank space.
        # Bad.
        # This is the first sentence.  It is followed by two blank spaces.

    * You do not need to add an extra blank line before the closing quotes of
      a multi-line docstring. Also, we ask that the first line of a multi-line
      docstring should contain only the opening quotes.

      .. code-block:: python

        # Good.
        """
        This is a multi-line docstring.

        Which means it has more than one line.
        """

        # Bad.
        """This is a multi-line docstring.

        Which means it has more than one line.

        """

    * Standard library imports can (and should) be on the same line, to avoid
      wasting space on straighforward imports:

      .. code-block:: python

        # Good.
        import os, sys, time
        # Good when it does not fit on a single line.
        import std_lib_module_1, std_lib_module_2, std_lib_module_3
        import std_lib_module_4, std_lib_module_5, std_lib_module_6
        # Bad.
        import os
        import sys
        import time

    * Importing class / functions from a module is allowed when these are
      used multiple times, and no ambiguity is possible.

      .. code-block:: python

        # Good when Bar and Blah are used many times.
        from foo import Bar, Blah
        do_something_with(Bar(), Blah(), Bar(), Blah(), Bar(), Blah())
        # Good in most situations.
        import foo
        do_something_with(foo.Bar(), foo.Blah())
        # Bad.
        from foo import *
        from numpy import any   # Potential ambiguity with __builtin__.any

Excerpts
~~~~~~~~

We emphasize here a few important topics that are found in the official
guidelines:

    * Only use ASCII characters in code files.

    * Code indent must be done with four blank characters (no tabs).

    * Limit lines to 79 characters.

    * Naming conventions: ``ClassName``, ``TOP_LEVEL_CONSTANT``,
      ``everything_else``.

    * Comments should start with a capital letter (unless the first word is a
      code identifier) and end with a period (short inline comments may skip
      the period at the end).

    * Imports should be listed in alphabetical order. It makes it easier to
      verify that something is imported, and avoids duplicated imports.

    * Use absolute imports only. This is compatible across a wider range of
      Python versions, and avoids confusion about what is being
      imported.

    * Avoid renaming imported modules. This makes code more difficult to
      re-use, and is not grep-friendly.

      .. code-block:: python

        # Good.
        from theano import tensor
        # Bad.
        from theano import tensor as T

    * Avoid using lists if all you care about is iterating on something. Using
      lists:

         - uses more memory (and possibly more CPU if the code may break out of
           the iteration),
         - can lead to ugly code when converted to Python 3 with 2to3,
         - can have a different behavior if evaluating elements in the list has
           side effects (if you want these side effects, make it explicit by
           assigning the list to some variable before iterating on it).

      +------------------------+------------------------+
      | Iterative version      |    List version        |
      +========================+========================+
      | .. code-block:: python | .. code-block:: python |
      |                        |                        |
      |   my_dict.iterkeys     |   my_dict.keys         |
      |   my_dict.itervalues   |   my_dict.values       |
      |   my_dict.iteritems    |   my_dict.items        |
      +------------------------+------------------------+
      | .. code-block:: python | .. code-block:: python |
      |                        |                        |
      |   itertools.ifilter    |   filter               |
      |   itertools.imap       |   map                  |
      |   itertools.izip       |   zip                  |
      +------------------------+------------------------+
      | .. code-block:: python | .. code-block:: python |
      |                        |                        |
      |   xrange               |   range                |
      +------------------------+------------------------+

      Code example with ``map``:

      .. code-block:: python

        # Good.
        for f_x in imap(f, x):
            ...
        all_f_x = map(f, x)
        map(f, x)   # f has some side effect.
        # Bad.
        for element in map(f, x):
            ...
        imap(f, x)

    * Generally prefer list comprehensions to ``map`` / ``filter``, as the former are
      easier to read.

      .. code-block:: python

        # Good.
        non_comments = [line.strip() for line in my_file.readlines()
                                     if not line.startswith('#')]
        # Bad.
        non_comments = map(str.strip,
                           ifilter(lambda line: not line.startswith('#'),
                                   my_file.readlines()))
 
    * Use ``in`` on container objects instead of using class-specific methods:
      it is easier to read and may allow you to re-use your code with different
      container types.

      .. code-block:: python

        # Good.
        has_key = key in my_dict
        has_substring = substring in my_string
        # Bad.
        has_key = my_dict.has_key(key)
        has_substring = my_string.find(substring) >= 0

    * Do not use mutable arguments as default values. Instead, use a helper
      function (conditional expressions are forbidden at this point, see
      below).

      .. code-block:: python

        # Good.
        def f(array=None):
            array = pylearn.if_none(array, [])
            ...
        # Bad.
        def f(array=[]): # Dangerous if `array` is modified down the road.
            ...

    * Always raise an exception with ``raise MyException(args)`` where ``MyException``
      inherits from ``Exception``. This is required for compatibility across
      all versions of Python.

      .. code-block:: python

        # Good.
        raise NotImplementedError('The Pylearn team is too lazy.')
        # Bad.
        raise NotImplementedError, 'The Pylearn team is too lazy.'
        raise 'The Pylearn team is too lazy to implement this.'

    * Use a leading underscore '_' in names of internal attributes / methods,
      but avoid the double underscore '__' unless you know what you are
      doing.


Additional Recommendations
--------------------------

Things you should do even if they are not listed in official guidelines:

    * All Python code files should start like this:

      .. code-block:: python

        """Module docstring as the first line, as usual."""

        __authors__ = "Olivier Delalleau, Frederic Bastien, David Warde-Farley"
        __copyright__ = "(c) 2010, Universite de Montreal"
        __license__ = "3-clause BSD License"
        __contact__ = "Name Of Current Guardian of this file <email@address>"

    * Use ``//`` for integer division and ``/ float(...)`` if you want the
      floating point operation (for readability and compatibility across all
      versions of Python).

      .. code-block:: python

        # Good.
        n_samples_per_split = n_samples // n_splits
        mean_x = sum(x) / float(len(x))
        # Bad.
        n_samples_per_split = n_samples / n_splits
        mean_x = sum(x) / len(x)

    * Use either ``try ... except`` or ``try ... finally``, but do not mix
      ``except`` with ``finally`` (which is not supported in Python 2.4).
      You can however embed one into the other to mimic the ``try ... except ...
      finally`` behavior.

      .. code-block:: python
    
        # Good.
        try:
            try:
                something_that_may_fail()
            except SomeError:
                do_something_if_it_failed()
        finally:
            always_do_this_regardless_of_what_happened()
        # Bad.
        try:
            something_that_may_fail()
        except SomeError:
            do_something_if_it_failed()
        finally:
            always_do_this_regardless_of_what_happened()

    * No conditional expression (not supported in Python 2.4). These are
      expressions of the form ``x = y if condition else z``.

    * Do not use the ``all`` and ``any`` builtin functions (they are not supported
      in Python 2.4). Instead, import them from ``theano.gof.python25`` (or
      use ``numpy.all`` / ``numpy.any`` for array data).

    * Do not use the ``hashlib`` module (not supported in Python 2.4). We will
      probably provide a wrapper around it to be compatible with all Python
      versions.

    * Use ``numpy.inf`` and ``numpy.nan`` rather than
      ``float('inf')`` / ``float('nan')`` (should be slightly more efficient even
      if efficiency is typically not an issue here, the main goal being code
      consistency). Also, always use ``numpy.isinf`` / ``numpy.isnan`` to
      test infinite / NaN values. This is important because ``numpy.nan !=
      float('nan')``.

    * Avoid backslashes whenever possible. They make it more
      difficult to edit code, and they are ugly (as well as potentially
      dangerous if there are trailing white spaces).

      .. code-block:: python

        # Good.
        if (cond_1 and
            cond_2 and
            cond_3):

            # Note that we added a blank line above to avoid confusion between
            # conditions and the rest of the code (this would not have been
            # needed if they were at significantly different indentation levels).
            ... 
        # Bad.
        if cond_1 and \
           cond_2 and \
           cond_3:

            ...

    * When indenting multi-line statements like lists or function arguments,
      keep elements of the same level aligned with each other.
      The position of the first
      element (on the same line or a new line) should be chosen depending on
      what is easiest to read (sometimes both can be ok).

      .. code-block:: python

        # Good.
        for my_very_long_variable_name in [my_foo, my_bar, my_love,
                                           my_everything]:
            ...
        for my_very_long_variable_name in [
                my_foo, my_bar, my_love, my_everything]:
            ...
        # Good iff the list needs to be frequently updated or is easier to
        # understand when each element is on its own line.
        for my_very_long_variable_name in [
                my_foo,
                my_bar,
                my_love,
                my_everything,
                ]:
            ...
        # Good as long as it does not require more than two lines.
        for my_very_long_variable_name in [my_foo,
                                           my_bar]:
            ...
        # Bad.
        for my_very_long_variable_name in [my_foo, my_bar, my_love,
                my_everything]:
            ...
        for my_very_long_variable_name in [my_foo,
                                           my_bar,
                                           my_love,
                                           my_everything]:
            ...

    * Use the ``key`` argument instead of ``cmp`` when sorting (for Python 3
      compatibility).

      .. code-block:: python

        # Good.
        my_list.sort(key=abs)
        # Bad.
        my_list.sort(cmp=lambda x, y: cmp(abs(x), abs(y)))

    * Whenever you read / write binary files, specify it in the mode ('rb' for
      reading, 'wb' for writing). This is important for cross-platform and
      Python 3 compatibility (e.g. when pickling / unpickling objects).

      .. code-block:: python

        # Good.
        cPickle.dump(obj, open('my_obj.pkl', 'wb', protocol=-1))
        # Bad.
        cPickle.dump(obj, open('my_obj.pkl', 'w', protocol=-1))

    * Avoid tuple parameter unpacking as it can lead to very ugly code when
      converting to Python 3.

      .. code-block:: python

        # Good.
        def f(x, y_z):
            y, z = y_z
            ...
        # Bad.
        def f(x, (y, z)):
            ...

    * Only use ``cPickle``, not ``pickle`` (except for debugging purpose since
      error messages from ``pickle`` are sometimes easier to understand).

    * A script's only top-level code should be something like:

      .. code-block:: python

        if __name__ == '__main__':
            sys.exit(main())


The ``logging`` Module vs. the ``warning`` Module
=================================================

The ``logging`` Module
----------------------

A central logging facility for Python capable of logging messages of various
categories/urgency and choosing with some granularity which messages are
displayed/suppressed, as well as where they are displayed or written. This
includes an ``INFO`` level for innocuous status information, a ``WARNING`` level
for unexpected state that is still recoverable, ``DEBUG`` for detailed
information which is only really of interest when things are going wrong, etc.

In addition to the `library documentation`_, see this helpful tutorial,
`Python Logging 101`_.

.. _library documentation: http://docs.python.org/library/logging.html
.. _Python Logging 101: http://plumberjack.blogspot.com/2009/09/python-logging-101.html

The ``warning`` Module
----------------------

The ``warning`` module in the standard library and its main interface, the
``warn()`` function, allows the programmer to issue warnings in situations where
they wish to alert the user to some condition, but the situation is not
urgent enough to throw an exception. By default, a warning issued at a given
line of the code will only be displayed the first time that line is executed.
By default, warnings are written to ``sys.stderr`` but the ``warning`` module
contains flexible facilities for altering the defaults, redirecting, etc.

Which? When?
------------

It is our feeling that the ``logging`` module's ``WARNING`` level be used to log
warnings more meant for *internal*, *developer* consumption, to log situations
where something unexpected happened that may be indicative of a problem but
is several layers of abstraction below what a user of the library would
care about.

By contrast, the warning module should be used for warnings intended for user
consumption, e.g. alerting them that their version of Pylearn is older than
this plugin requires, so things may not work as expected, or that a given
function/class/method is slated for deprecation in a coming release (early
in the library's lifetime, ``DeprecationWarning`` will likely be the most common
case). The warning message issued through this facility should avoid
referring to Pylearn internals.

Code Sample
===========

The following code sample illustrates some of the coding guidelines one should
follow in Pylearn. This is still a work-in-progress.

.. code-block:: python

    #! /usr/env/bin python

    """Sample code. There may still be mistakes / missing elements."""

    __authors__ = "Olivier Delalleau"
    __copyright__ = "(c) 2010, Universite de Montreal"
    __license__ = "3-clause BSD License"
    __contact__ = "Olivier Delalleau <delallea@iro>"

    # Standard library imports are on a single line.
    import os, sys, time

    # Third-party imports come after standard library imports, and there is
    # only one import per line. Imports are sorted lexicographically.
    import numpy
    import scipy
    import theano
    # Put 'from' imports below.
    from numpy import argmax
    from theano import tensor
    
    # Application-specific imports come last.
    from pylearn import dataset
    from pylearn.optimization import minimize

    def print_files_in(directory):
        """Print the first line of each file in given directory."""
        # TODO To be continued...

    def main():
        if len(sys.argv) != 2:
            # Note: conventions on how to display script documentation and
            # parse arguments are still to-be-determined.
            print("""\
    Usage: %s <directory>
    Print first line of each file in given directory (in alphabetic order)."""
                  % os.path.basename(sys.argv[0]))
            return 1
        print_files_in(sys.argv[1])
        return 0

    # Top-level executable code should be minimal.
    if __name__ == '__main__':
        sys.exit(main())
    

Automatic Code Verification
===========================

Tools will be available to make it easier to automatically ensure that code
committed to Pylearn complies to above specifications. This work is not
finalized yet, but David started a `Wiki page`_ with helpful configuration
tips for Vim.

.. _Wiki page: http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Divers/VimPythonRecommendations

TODO
====

Things still missing from this document, being discussed in coding_style.txt:
    - Proper style for C code and Mercurial commits
    - Enforcing 100% test coverage of the code base
    - Providing ways to add type checking for function arguments
    - Conventions for script usage documentation and argument parsing
    - Conventions for class / method / function documentation
    - Guidelines for serialization-friendly code (hint: nested and lambda
      functions, as well as instance methods, cannot be serialized, and
      apparently there are some issues with decorators -- to be investigated).