Python 2 Assumes Different Source Code Encodings

November 30, 2023

I noticed that, without a source code encoding declaration, the Python 2 interpreter assumes the source code is encoded in ASCII for scripts and standard input: $ python test.py # w

Solution 1:

The -c and -m switches ultimately run the code supplied with the exec statement or the compile() function, both of which take Latin-1 source code:

The first expression should evaluate to either a Unicode string, a Latin-1 encoded string, an open file object, a code object, or a tuple.

This is not documented; it's an implementation detail that may or may not be considered a bug.
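To make the effect of a Latin-1 assumption concrete, here is a minimal sketch that decodes the same two bytes both ways. It does not go through the interpreter's own source-decoding path; it only mimics, with plain decode() calls, what happens to a non-ASCII character typed in a UTF-8 terminal. The variable names are illustrative.

```python
# The two bytes a UTF-8 terminal sends for U+00E9 (e with acute accent).
utf8_bytes = b"\xc3\xa9"

# Decoded as UTF-8, the bytes form a single character.
as_utf8 = utf8_bytes.decode("utf-8")

# Decoded as Latin-1, every byte maps to its own code point, so the
# result is two characters (U+00C3 and U+00A9) and decoding never fails.
as_latin1 = utf8_bytes.decode("latin-1")

print("%d character(s) as UTF-8, %d as Latin-1" % (len(as_utf8), len(as_latin1)))
```

Because Latin-1 assigns a character to every byte value, this kind of misreading passes silently instead of raising an error, which is why it is easy to miss.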

I don't think it is something that is worth fixing, however, and Latin-1 is a superset of ASCII, so little is lost. How code from -c and -m is handled has been cleaned up in Python 3 and is much more consistent there: code passed in with -c is decoded using the current locale, and modules loaded with the -m switch default to UTF-8, as usual.
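As a rough illustration of the Python 3 side, the sketch below passes byte strings to compile(), which in Python 3 applies the usual source-file rules: UTF-8 by default, or whatever a PEP 263 coding declaration names. It does not reproduce the locale-based decoding of -c, and the helper name run_source is made up for this example.

```python
# Python 3 only: compile() accepts bytes and applies the PEP 263 rules.
def run_source(source_bytes):
    """Compile and execute source bytes, then return the value bound to 'text'."""
    namespace = {}
    code = compile(source_bytes, "<demo>", "exec")
    exec(code, namespace)
    return namespace["text"]

# No declaration: the bytes 0xC3 0xA9 are read as UTF-8, giving U+00E9.
default_utf8 = run_source(b"text = '\xc3\xa9'")

# Explicit Latin-1 declaration: the single byte 0xE9 also gives U+00E9.
declared_latin1 = run_source(b"# -*- coding: latin-1 -*-\ntext = '\xe9'")

print(default_utf8 == declared_latin1)
```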

If you want to know the exact implementations used, start at the Py_Main() function in Modules/main.c, which handles both -c and -m as:

if (command) {
    sts = PyRun_SimpleStringFlags(command, &cf) != 0;
    free(command);
} else if (module) {
    sts = RunModule(module, 1);
    free(module);
}

-c is executed through the PyRun_SimpleStringFlags() function, which in turn calls PyRun_StringFlags(). When you use exec, a bytestring object is passed to PyRun_StringFlags() too, and the source code is then assumed to contain Latin-1-encoded bytes.
-m uses the RunModule() function to pass the module name to the private function _run_module_as_main() in the runpy module. That function uses pkgutil.get_loader() to load the module metadata and fetches the module code object with the loader.get_code() function on the PEP 302 loader. If no cached bytecode is available, the code object is produced by using the compile() function with the mode set to exec.
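The -m path can be approximated from Python code with the public runpy.run_module() function. This is only a sketch: it uses the public API rather than the private _run_module_as_main(), and the module name encoding_demo_mod and the helper run_demo_module are invented for the example. The throwaway module carries an explicit UTF-8 declaration so that the result does not depend on any default encoding.

```python
import os
import runpy
import shutil
import sys
import tempfile

def run_demo_module():
    """Write a throwaway module to a temp directory and run it as __main__."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "encoding_demo_mod.py")
    with open(path, "wb") as handle:
        # Raw UTF-8 bytes inside a unicode literal, with a coding declaration.
        handle.write(b"# -*- coding: utf-8 -*-\nGREETING = u'caf\xc3\xa9'\n")
    sys.path.insert(0, tmpdir)
    try:
        # run_module() locates the module, obtains its code object and
        # executes it, returning the resulting module globals.
        return runpy.run_module("encoding_demo_mod", run_name="__main__")
    finally:
        sys.path.remove(tmpdir)
        shutil.rmtree(tmpdir, ignore_errors=True)

namespace = run_demo_module()
print(repr(namespace["GREETING"]))
```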
