Mercurial > pylearn
view doc/v2_planning/coding_style.txt @ 1139:9f0502f8c7a5
Example of the plugin architecture I had in mind
author | gdesjardins |
---|---|
date | Thu, 16 Sep 2010 13:27:17 -0400 |
parents | 9583e908c572 |
children | 7d2e65249bf9 |
line wrap: on
line source
Discussion of Coding-Style ========================== Participants ------------ - Dumitru - Fred - David - Olivier D [leader] Open for public debate ---------------------- * Use imports for packages and modules only. I.e. avoid from foo import * from foo import Bar OD: Overall I agree with this. However we probably want to allow some exceptions, like: from itertools import imap, izip Also, some people may want to have shortcuts like from theano import tensor as T but I would prefer to forbid this. It is handy when trying stuff in the interactive interpreter, but in real code it can easily get messy when you want to copy / paste different pieces of code and they use different conventions. Typing tensor.* is a bit longer, but a lot more portable. JB: I thought that these are nice: - "from foo import Bar" - "from foo import Bar, Blah" What's wrong with them? They keep the code listing short and readable. I would discourage these forms when symbols 'Bar' and 'Blah' are ambiguous, in which case the parent module prefix serves to disambiguate them in the code. I agree that the "import A as B" form should be discouraged in general, because that's just confusing and makes code less grep-friendly. OD: I agree that "from foo import Bar, Blah" is sometimes convenient (typically when you re-use Bar / Blah many times in the same file), and would vote in favor of accepting it when it is appropriate. This guideline was taken from Google's coding recommendation: "from foo import * or from foo import Bar is very nasty and can lead to serious maintenance issues because it makes it hard to find module dependencies." * Imports should usually be on separate lines. OD: I would add an exception, saying it is ok to group multiple imports from the standard library on a single line, e.g. import os, sys, time I just don't see much benefit in putting them on separate lines (for third-party imports I agree it is best to keep them separate, as it makes dependencies clearer, and diffs look better when someone adds / removes an import). Does anyone see a good reason to keep standard library imports on different lines? JB: what does 'usually' mean here? The guideline seems vacuous. OD: Sorry my fault, I did not quote the whole guideline from PEP8. The 'usually' was because of what followed: it's okay to say this though: from subprocess import Popen, PIPE (which btw contradicts Google's recommendation mentioned previously) * The BDFL recommends inserting a blank line between the last paragraph in a multi-line docstring and its closing quotes, placing the closing quotes on a line by themselves. This way, Emacs' fill-paragraph command can be used on it. OD: I think it is ugly and I have not seen it used much. Any Emacs user believes it is a must? * Avoid contractions in code comments (particularly in documentation): "We do not add blue to red because it does not look good" rather than "We don't add blue to red because it doesn't look good". OD: I mostly find it to be cleaner (been used to it while writing scientific articles too). JB: +1 * Imperative vs. third-person comments. # Return the sum of elements in x. <-- imperative # Returns the sum of elements in x. <-- third-person OD: I am used to the imperative form and like it better only because it typically saves one letter (the 's') and is easier to conjugate. JB: What about being compatible with markup formats that have a :returns: tag? OD: That'd make sense. However, when I wrote the above I hadn't looked closely at PEP257 yet, and I just noticed the following official recommendation for one-line docstrings in it: The docstring is a phrase ending in a period. It prescribes the function or method's effect as a command ("Do this", "Return that"), not as a description; e.g. don't write "Returns the pathname ...". Anyone knows which style is most popular in the open-source community? * OD: I like always doing the following when subclassing a class A: class B(A): def __init__(self, b_arg_1, b_arg_2, **kw): super(B, self).__init__(**kw) ... The point here is that the constructor always allow for extra keyword arguments (except for the class at the very top of the hierarchy), which are automatically passed to the parent class. Pros: - You do not need to repeat the parent class arguments whenever you write a new subclass. - Whenever you add an argument to the parent class, all child classes can benefit from it without modifying their code. Cons: - One needs to look at the parent classes to see what these arguments are. - You cannot use a **kw argument in your constructor for your own selfish purpose. - I have no clue whether one could do this with multiple inheritance. - More? Question: Should we encourage this in Pylearn? JB: +0.5 * JB: How should we combine capitalization and underscores to name classes and functions related to an algorithm like 'SGD' or a model like 'RBM' whose common name is capitalized? Case in point: How should I name a Hybrid Monte Carlo Sampler? Should I use the common HMC abbreviation? OD: This one is answered by PEP8 (search HTTPServerError in it). You should use: RBMClassName rbm_function_name As far as using abbreviations is concerned: All identifiers in the Python standard library (...) SHOULD use English words wherever feasible (in many cases, abbreviations and technical terms are used which aren't English). so I guess HMC is ok when using Hybrid Monte Carlo is considered to make some names too long. Note about warnings ------------------- Fred: This is a refactored thing from James email of what we should put in message that we send to the user: 1) Hint where in the code this log come from. 2) Hint how to hide this message? or we should this into documentation. 3) Tell explicitly if the user can ignore it and the consequence. Existing Python coding style specifications and guidelines ---------------------------------------------------------- * Must-read * Official Python coding style guide: http://www.python.org/dev/peps/pep-0008 * Official docstring conventions: http://www.python.org/dev/peps/pep-0257 * Google Python Style Guide: http://google-styleguide.googlecode.com/svn/trunk/pyguide.html * Interesting * Code Like a Pythonista: http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html * Numpy notes on conversion to Python 3: http://projects.scipy.org/numpy/browser/trunk/doc/Py3K.txt * Can skip * Python style for university class: http://www.cs.caltech.edu/courses/cs11/material/python/misc/python_style_guide.html * Mailman coding style: http://barry.warsaw.us/software/STYLEGUIDE.txt * Some company coding style: http://self.maluke.com/style * Chandler coding style: http://chandlerproject.org/Projects/ChandlerCodingStyleGuidelines * Outdated recommendations: http://lists.osafoundation.org/pipermail/dev/2003-March/000479.html * Mostly some beginners tips: http://learnpython.pbworks.com/PythonTricks * More beginners tips: http://eikke.com/how-not-to-write-python-code/ * Cogent coding guidelines: http://jaynes.colorado.edu/PythonGuidelines.html * Djangoo coding guidelines: http://docs.djangoproject.com/en/dev/internals/contributing/#coding-style * Numpy documentation style guidelines: http://projects.scipy.org/numpy/wiki/CodingStyleGuidelines * Some random guy guidelines (nothing special): http://www.voidspace.org.uk/python/articles/python_style_guide.shtml We will probably want to take PEP-8 as starting point, and read what other people think about it / how other coding guidelines differ from it. Dumi: we should also try to find tools that automate these processes: pylint, pyflakes, pychecker, pythontidy OD: Things about PEP 8 I don't like (but it may be just me): * If necessary, you can add an extra pair of parentheses around an expression, but sometimes using a backslash looks better. --> I rarely find that backslash looks better. In most situations you can get rid of them. Typically I prefer: if (cond_1 and cond_2 and cond_3): to if cond_1 and \ cond_2 and \ cond_3: * You should use two spaces after a sentence-ending period. --> Looks weird to me. (DWF: This is an old convention from the typewriter era. It has more or less been wiped out by HTML's convention of ignoring extra whitespace: see http://en.wikipedia.org/wiki/Sentence_spacing for more detail. I think it's okay to drop this convention in source code.) OD: Cool, thanks, I guess we can drop it then. * Missing in PEP 8: - How to indent multi-line statements? E.g. do we want x = my_func(a, b, c, d, e, f) or x = my_func(a, b, c, d, e, f) or x = my_func( a, b, c, d, e, f) --> Probably depends on the specific situation, but we could have a few typical examples (and the same happens with multi-lines lists) (Fred: I would do 2 or 3, but not 1. I find it more redable when the indent is broken after a paranthesis then at any point. OD: After thinking about it, I agreee as well. My recommendation would be to go with 2 when it can fit on two lines, and 3 otherwise. Same with lists. Documentation ------------- How do we write doc? Compatibility with various Python versions ------------------------------------------ * Which Python 2.x version do we want to support? * Is it reasonable to have coding guidelines that would make the code as compatible as possible with Python 3? C coding style -------------- We also need a c-style coding style. Meeting 2010/09/09 ------------------ * Coding guidelines PEP 8 & Google should be a good basis to start with. Task: Highlight the most important points in them (OD). * Documentation Use RST with Sphinx. Task: Provide specific examples on how to document a class, method, and some specific classes like Op (DE). Modify the theano documentation to include that. OD: May want to check out http://projects.scipy.org/numpy/wiki/CodingStyleGuidelines * Python versions to be supported Support 2.4 (because some of the clusters are still running 2.4) and write code that can be converted to 3.x with 2to3 in a straightforward way. Task: Write to-do's and to-not-do's to avoid compatibility issues. (OD) * C coding style How to write C code (in particular for Numpy / Cuda), and how to mix C and Python. Task: See if there would be a sensible C code style to follow (maybe look how Numpy does it), and how projects that mix C and Python deal with it (e.g. use separate files, or be able to have mixed syntax highlighting?) (FB) * Program output Use the warning and logging modules. Avoid print as much as possible. Task: Look into these modules to define general guidelines e.g. to decide when to use warning instead of logging. (DWF) * Automatized code verification Use pychecker & friends to make sure everything is fine. Task: Look into the various options available (DE) * Tests Force people to write tests. Automatic email reminder of code lines not covered by tests (see if we can get this from nosetests). Decorator to mark some classes / methods as not being tested yet, so as to be able to automatically warn the user when he is using untested stuff (and to remind ourselves we should add a test). Task: See feasibility. (OD) Result: See section 'Enforcing strict testing policy'. * VIM / Emacs plugins / config files To enforce good coding style automatically. Task: Look for existing options. (FB) (DWF: I have put some time into this for vim, I will send around my files) Suggestion by PV ---------------- Have a sample code that showcases everything one should comply to. Some coding guidelines (work-in-progress from OD) ------------------------------------------------- * Avoid using lists if all you care about is iterating on something. Using lists: - uses more memory (and possibly more CPU if the code may break out of the iteration) - can lead to ugly code when converted to Python 3 with 2to3 - can have a different behavior if evaluating elements in the list has side effects (if you want these side effects, make it explicit by assigning the list to some variable before iterating on it) Iterative version List version my_dict.iterkeys() my_dict.keys() my_dict.itervalues() my_dict.values() my_dict.iteritems() my_dict.items() itertools.imap map itertools.ifilter filter itertools.izip zip xrange range * Use `in` on container objects instead of using class-specific methods. It is easier to read and may allow you to use your code with different container types. Yes No --- -- key in my_dict my_dict.has_key(key) sub_string in my_string my_string.find(sub_string) >= 0 * Generally prefer list comprehensions to map / filter, as the former are easier to read. Yes: non_comments = [line.strip() for line in my_file.readlines() if not line.startswith('#')] No: non_comments = map(str.strip, filter(lambda line: not line.startswith('#'), my_file.readlines())) * Use the `key` argument instead of `cmp` when sorting (for Python 3 compatibility). Yes: my_list.sort(key=abs) No: my_list.sort(cmp=lambda x, y: cmp(abs(x), abs(y))) * Use // for integer division (for readability and Python 3 compatibility). Yes: n_samples_per_split = n_samples // n_splits No: n_samples_per_split = n_samples / n_splits * Only use ASCII characters in code files. * Code indent must be done with four blank characters (not with tabs). * Limit lines to 79 characters. * Comments should start with a capital letter (unless the first word is a code identifier) and end with a period (very short inline comments may ignore this rule). * Whenever you read / write binary files, specify it in the mode ('rb' for reading, 'wb' for writing). This is important for cross-platform and Python 3 compatibility (e.g. when pickling / unpickling objects). * Avoid tuple parameter unpacking to avoid very ugly code when converting to Python 3. Yes: def f(x, y_z): y, z = y_z No: def f(x, (y, z)) * Only use cPickle, not pickle. * Always raise exception with raise MyException(args) where MyException inherits from Exception. * Imports should be listed in alphabetical order. It makes it easier to verify that something is imported, and avoids duplicated imports. * Use absolute imports only. This is compatible across a wider range of Python versions, and avoids confusion about what is being imported. * Use a leading underscore '_' for internal attributes / methods, but avoid the double underscore '__' unless you know what you are doing. * A script's only top-level code should be something like: if __name__ == '__main__': sys.exit(main()) * No conditional expression (not supported in Python 2.4). These are expressions of the form x = y if condition else z * Use either "try ... except" or "try ... finally", but do not mix "except" with "finally" (which is not supported in Python 2.4). You can make a try... except inside a try... finally if you need both. * Do not use the `all` and `any` builtin functions (they are not supported in Python 2.4). You can use numpy.{all,any} instead of import theano.gof.python25 that define all and any. OD: I think we should have something like pylearn.compat.{all,any}. numpy.{all,any} are meant to be used on arrays only. * Do not use the `hashlib` module (not supported in Python 2.4). You can do as in theano.gof.cc: ..code:: if sys.version_info[:2] >= (2,5): import hashlib def hash_from_code(msg): return hashlib.md5(msg).hexdigest() else: import md5 def hash_from_code(msg): return md5.new(msg).hexdigest() OD: Yep, we could probably come up with such a wrapper in a pylearn.compat module. Mercurial commits ----------------- * How to write good commit messages? OD: Check Django's guidelines (link above) * Standardize the merge commit text (what is the message from fetch?) Type checking ------------- (Suggested by Francois Savard) vu que vous êtes en train de vous occuper de l'aspect coding style, je mentionne ceci, à faire ce que vous en voulez: j'aime bien éviter des erreurs sur l'ordre de mes paramètres, sur les assumptions sur les paramètres etc. en faisant des argument check. Ça remplace un peu le static type checking des langages genre Java. En Python y'a une façon élégante de définir ses propres typecheckers, value checkers etc. et ensuite les passer en paramètre à un décorateur de fonction: http://code.activestate.com/recipes/454322-type-checking-decorator/ (Juste un exemple, vu que les checks peuvent être plus élaborés, inclure des value checks (>0 etc.), être flexibles pour ne pas demander que ce soit un type fixe mais plutôt que ça réponde à certaines contraintes (que ça "ressemble" à un float, p. ex.). J'avais développé une lib pour faire qqch du genre en Javascript). Je ne sais pas si vous comptiez parler de ça, et si ça vaut la peine, mais personnellement je préfère du code à des commentaires qui peuvent être out of sync avec le contenu d'une méthode. Si vous croyez que ça vaut la peine, vous pourriez p-e définir des type/value-checkers standards pour éviter que tout le monde redéfinissent les siens à sa façon. Consistent inf / nan -------------------- OD: Use numpy.inf and numpy.nan rather than float('inf') / float('nan')? (should be slightly more efficient even if efficiency usually doesn't matter here - the main goal would be for everyone to use the same inf / nan to make the code consistent). Enforcing strict testing policy ------------------------------- The `coverage` third-party module provides a way to gather code coverage statistics in the test suite. `nosetests` has a plugin that can be activated with the --with-coverage option to use this module. It is possible to know which lines specifically lack coverage. However, we will probably want to post-process this data to do more than a simple report (which noone will care about). This could be done either by parsing nosetests' coverage output, or modifying its coverage plugin, or writing our own version of it. The main goal would be to identify who is responsible for writing lines that are not currently covered (using 'hg annotate'), in order to send email notifications. We should aim at 100% code coverage in tests. This is realistic because `coverage` offers ways to ignore coverage for lines we explicitely do not want to cover (typically debug code, or AssertionError / NotImplementedError that are not supposed to be triggered during normal usage). We may need to do some advanced processing though to e.g. collect results from multiple build bots, if for instance some bot is running tests without GPU support, and another one is taking care of the GPU tests. Code that should be tested but for which no test is currently written would also require some decorator / helper function that would trigger a warning at run-time (only once / execution). This could be enforced by adopting a different policy about lack-of-coverage notification emails, depending on whether or not the warning is present: - if there is no warning, daily email notification (ADD A WARNING!!!) - if there is a warning, weekly email notification (ADD A TEST!!!)