Commit 0dbacfe

Restructure re.sub docs, clarify aspects of repl notation
- `flags` are only relevant when `pattern` is a string (followup to #119960).
- Extended "beans and spam" example to demonstrate both string & re.compile flags usage, `\1` templating, and moved it close to start.
- Discuss all how-we-match parameters before what-we-do-with-matches. TODO: Is important info close enough to start?
- Explain callback before backslash notation because it's shorter but also to promote it. IMHO, people fear it as a "last-resort escape hatch" while it's actually *simpler* than backslashes.
- Consolidated `repl` notation from two far-away paragraphs to one place.
  - Starting from `\1` and `\g` which are the whole purpose of dealing with backslashes!
  - Briefly mention `\octal` wart, 99 limit and `\g<100>` avoiding them.
  - Draw attention to `\\` for getting a literal backslash.
  - Clarify that *most* escapes are supported but `\x\u\U\N` aren't.
  - Move "Unknown escapes of ASCII letters" *after* listing all the known ones.
- Added a note promoting raw string notation for `repl` too.
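
A minimal sketch of the first point, for reviewers who want to check the behaviour: *flags* apply to a string pattern, a compiled pattern carries the flags given to `re.compile`, and combining a compiled pattern with non-zero *flags* raises `ValueError`. The sample text is the one from the old docs example; the variable names are illustrative only.

```python
import re

text = 'Baked Beans And Spam'

# String pattern: flags are passed to re.sub() directly.
from_string = re.sub(r'\sAND\s', ' & ', text, flags=re.IGNORECASE)

# Compiled pattern: flags were fixed when re.compile() was called.
compiled = re.compile(r'\sAND\s', flags=re.IGNORECASE)
from_compiled = re.sub(compiled, ' & ', text)

# Non-zero flags together with a compiled pattern are rejected.
try:
    re.sub(compiled, ' & ', text, flags=re.IGNORECASE)
    flags_rejected = False
except ValueError:
    flags_rejected = True

print(from_string, from_compiled, flags_rejected)
```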
1 parent 0a39730 commit 0dbacfe

1 file changed

Lines changed: 56 additions & 41 deletions

Doc/library/re.rst

@@ -1080,34 +1080,19 @@ Functions
10801080

10811081
Return the string obtained by replacing the leftmost non-overlapping occurrences
10821082
of *pattern* in *string* by the replacement *repl*. If the pattern isn't found,
1083-
*string* is returned unchanged. *repl* can be a string or a function; if it is
1084-
a string, any backslash escapes in it are processed. That is, ``\n`` is
1085-
converted to a single newline character, ``\r`` is converted to a carriage return, and
1086-
so forth. Unknown escapes of ASCII letters are reserved for future use and
1087-
treated as errors. Other unknown escapes such as ``\&`` are left alone.
1088-
Backreferences, such
1089-
as ``\6``, are replaced with the substring matched by group 6 in the pattern.
1090-
For example::
1091-
1092-
>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
1093-
... r'static PyObject*\npy_\1(void)\n{',
1094-
... 'def myfunc():')
1095-
'static PyObject*\npy_myfunc(void)\n{'
1096-
1097-
If *repl* is a function, it is called for every non-overlapping occurrence of
1098-
*pattern*. The function takes a single :class:`~re.Match` argument, and returns
1099-
the replacement string. For example::
1083+
*string* is returned unchanged.
1084+
The pattern may be a string or a :class:`~re.Pattern`.
1085+
A string pattern's behaviour may be modified by specifying a *flags* value,
1086+
which can be any of the `flags`_ variables, combined using bitwise OR
1087+
(the ``|`` operator).
11001088

1101-
>>> def dashrepl(matchobj):
1102-
... if matchobj.group(0) == '-': return ' '
1103-
... else: return '-'
1104-
...
1105-
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
1106-
'pro--gram files'
1107-
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
1108-
'Baked Beans & Spam'
1089+
>>> re.sub(r'(and)', r'*\1*', 'Contraband Andalusian Beans AND Spam',
1090+
... flags=re.IGNORECASE)
1091+
'Contrab*and* *And*alusian Beans *AND* Spam'
11091092

1110-
The pattern may be a string or a :class:`~re.Pattern`.
1093+
>>> pattern = re.compile(r'(and)', flags=re.IGNORECASE)
1094+
>>> re.sub(pattern, r'*\1*', 'Contraband Andalusian Beans AND Spam')
1095+
'Contrab*and* *And*alusian Beans *AND* Spam'
11111096

11121097
The optional argument *count* is the maximum number of pattern occurrences to be
11131098
replaced; *count* must be a non-negative integer. If omitted or zero, all
@@ -1118,21 +1103,51 @@ Functions
11181103
As a result, ``sub('x*', '-', 'abxd')`` returns ``'-a-b--d-'``
11191104
instead of ``'-a-b-d-'``.
11201105

1121-
.. index:: single: \g; in regular expressions
1122-
1123-
In string-type *repl* arguments, in addition to the character escapes and
1124-
backreferences described above,
1125-
``\g<name>`` will use the substring matched by the group named ``name``, as
1126-
defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
1127-
group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
1128-
in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
1129-
reference to group 20, not a reference to group 2 followed by the literal
1130-
character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
1131-
substring matched by the RE.
1132-
1133-
The expression's behaviour can be modified by specifying a *flags* value.
1134-
Values can be any of the `flags`_ variables, combined using bitwise OR
1135-
(the ``|`` operator).
1106+
*repl* can be a string template or a function:
1107+
1108+
* If it is callable, it is called for every non-overlapping occurrence of
1109+
*pattern*. The function takes a single :class:`~re.Match` argument, and
1110+
returns the replacement string. For example::
1111+
1112+
>>> def dashrepl(matchobj):
1113+
... if matchobj.group(0) == '-': return ' '
1114+
... else: return '-'
1115+
...
1116+
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
1117+
'pro--gram files'
1118+
1119+
* If *repl* is a string, it's processed as a template based on backslash escapes:
1120+
1121+
.. index:: single: \g; in regular expressions
1122+
1123+
- ``\1`` .. ``\99`` are replaced by the substring matched by corresponding
1124+
``(...)`` groups in the pattern.
1125+
- However, other ``\numbers`` are interpreted as *octal* character literals.
1126+
- ``\g<name>`` is replaced by the substring matched by the named ``(?P<name>...)``
1127+
group.
1128+
- ``\g<number>`` is another way to refer to numbered groups.
1129+
``\g<2>0`` inserts group 2 followed by the literal character ``'0'``,
1130+
whereas ``\20`` can only express a reference to group 20. ``\g<100>`` etc.
1131+
can refer to groups higher than 99, and the backreference ``\g<0>``
1132+
substitutes in the entire substring matched by the RE.
1133+
- ``\\`` is converted to a single backslash.
1134+
- Basic escapes ``\n\r\t\v\f\a\b`` work as they do in Python string literals.
1135+
That is, ``\n`` is converted to a single newline character, and so forth.
1136+
- Unknown escapes of ASCII letters are reserved for future use and
1137+
treated as errors. This includes ``\x..``, ``\u...``, ``\U...`` and
1138+
``\N{...}`` which are not presently supported.
1139+
- Other unknown escapes such as ``\&`` are left alone.
1140+
1141+
For example::
1142+
1143+
>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
1144+
... r'static PyObject*\npy_\1(void)\n{',
1145+
... 'def myfunc():')
1146+
'static PyObject*\npy_myfunc(void)\n{'
1147+
1148+
(Note the use of raw string notation for *repl* as well. Otherwise you'd have
1149+
to write ``'\\1'`` for Python to parse it into ``\1`` to be replaced by
1150+
``myfunc`` at substitution time...)
11361151

11371152
.. versionchanged:: 3.1
11381153
Added the optional flags argument.
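
For reviewers, the template rules listed in the new *repl* bullets can be checked with a short script. This is only a sketch: the patterns, inputs, and variable names are made up for illustration and are not part of the patch.

```python
import re

# \g<1>0: group 1 followed by a literal '0' (r'\10' would mean group 10).
numbered = re.sub(r'(\d)', r'\g<1>0', 'a1b2')

# \g<name>: refers to a (?P<name>...) group.
named = re.sub(r'(?P<word>\w+)', r'<\g<word>>', 'hi there')

# \g<0>: the entire match.
whole = re.sub(r'\d+', r'[\g<0>]', 'a12b3')

# \\ in the template yields one literal backslash.
backslash = re.sub(r'/', r'\\', 'a/b')

# Three octal digits are read as a character code, not a group number.
octal = re.sub(r'x', r'\101', 'x')

# A callable sidesteps template escapes altogether.
via_callable = re.sub(r'(\d)', lambda m: m.group(1) + '0', 'a1b2')

# Unknown ASCII-letter escapes such as \x are errors...
try:
    re.sub(r'a', r'\x41', 'a')
    x_rejected = False
except re.error:
    x_rejected = True

# ...while other unknown escapes such as \& are left alone.
ampersand = re.sub(r'a', r'\&', 'a')

print(numbered, named, whole, backslash, octal, via_callable, x_rejected, ampersand)
```

The callable variant gives the same result as ``\g<1>0`` without any backslash notation, which is the argument the commit message makes for explaining callbacks first.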
