This is a flexible class for comparing pairs of sequences of any
type, so long as the sequence elements are hashable. The basic
algorithm predates, and is a little fancier than, an algorithm
published in the late 1980s by Ratcliff and Obershelp under the
hyperbolic name "gestalt pattern matching." The idea is to find
the longest contiguous matching subsequence that contains no
"junk" elements (the Ratcliff and Obershelp algorithm doesn't
address junk). The same idea is then applied recursively to the
pieces of the sequences to the left and to the right of the matching
subsequence. This does not yield minimal edit sequences, but does
tend to yield matches that "look right" to people.
Timing: The basic Ratcliff-Obershelp algorithm is cubic
time in the worst case and quadratic time in the expected case.
SequenceMatcher is quadratic time for the worst case and has
expected-case behavior dependent in a complicated way on how many
elements the sequences have in common; best case time is linear.
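As a rough illustration of the matching idea described above, the sketch below asks SequenceMatcher for the longest matching block with and without a junk filter, then lists the blocks found by the recursive left/right search. The sample strings are arbitrary, and the code assumes a current Python interpreter; tuple() is applied to the results only so they compare cleanly across versions.

```python
from difflib import SequenceMatcher

a = " abcd"
b = "abcd abcd"

# No junk filter: the leading blank is allowed to take part in the match,
# so the longest block is " abcd" (a[0:5] matched against b[4:9]).
plain = SequenceMatcher(None, a, b)
longest_plain = tuple(plain.find_longest_match(0, len(a), 0, len(b)))

# Blanks declared junk: the match may not contain a blank,
# so the longest block shrinks to "abcd" (a[1:5] matched against b[0:4]).
no_blanks = SequenceMatcher(lambda ch: ch == " ", a, b)
longest_no_junk = tuple(no_blanks.find_longest_match(0, len(a), 0, len(b)))

# The recursive step: matching blocks to the left and right of the
# longest match. The final entry is a zero-length sentinel.
recursive = SequenceMatcher(None, "abxcd", "abcd")
blocks = [tuple(m) for m in recursive.get_matching_blocks()]

print(longest_plain)    # (0, 4, 5)
print(longest_no_junk)  # (1, 0, 4)
print(blocks)           # [(0, 0, 2), (3, 2, 2), (5, 4, 0)]
```

Each triple reads (start in a, start in b, length), so (3, 2, 2) says that a[3:5] equals b[2:4].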
class Differ
This is a class for comparing sequences of lines of text, and
producing human-readable differences or deltas. Differ uses
SequenceMatcher both to compare sequences of lines, and to
compare sequences of characters within similar (near-matching)
lines.
Each line of a Differ delta begins with a two-letter code:
Code    Meaning
'- '    line unique to sequence 1
'+ '    line unique to sequence 2
'  '    line common to both sequences
'? '    line not present in either input sequence
Lines beginning with '? ' attempt to guide the eye to
intraline differences, and were not present in either input
sequence. These lines can be confusing if the sequences contain tab
characters.
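To make the two-letter codes concrete, here is a small sketch that runs Differ.compare() on two short line lists and groups the delta lines by their code. The input lines are made up for illustration, and the code assumes a current Python interpreter.

```python
from difflib import Differ

before = ['one\n', 'two\n', 'three\n', 'four\n']
after = ['ore\n', 'tree\n', 'emu\n', 'four\n']

# compare() yields delta lines; each starts with a two-letter code.
delta = list(Differ().compare(before, after))

# Group the remainder of each line under its code.
by_code = {}
for line in delta:
    by_code.setdefault(line[:2], []).append(line[2:])

for code in sorted(by_code):
    print(repr(code), len(by_code[code]))
```

The '? ' entries are the guide lines; they hold markers such as ^ and - positioned under the characters that differ, not text from either input.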
context_diff(a, b[, fromfile[, tofile[, fromfiledate[, tofiledate[, n[, lineterm]]]]]])
Compare a and b (lists of strings); return a
delta (a generator generating the delta lines) in context diff
format.
Context diffs are a compact way of showing just the lines that have
changed plus a few lines of context. The changes are shown in a
before/after style. The number of context lines is set by n
which defaults to three.
By default, the diff control lines (those with *** or ---)
are created with a trailing newline. This is helpful so that inputs created
from file.readlines() result in diffs that are suitable for use
with file.writelines() since both the inputs and outputs have
trailing newlines.
For inputs that do not have trailing newlines, set the lineterm
argument to "" so that the output will be uniformly newline free.
The context diff format normally has a header for filenames and
modification times. Any or all of these may be specified using strings for
fromfile, tofile, fromfiledate, and tofiledate.
The modification times are normally expressed in the format returned by
time.ctime(). If not specified, the strings default to blanks.
Tools/scripts/diff.py is a command-line front-end for this
function.
New in version 2.3.
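The following sketch shows the two usage patterns described above: inputs that keep their trailing newlines, and newline-free inputs combined with lineterm="". The file names before.py and after.py are placeholders chosen for the example, and the code assumes a current Python interpreter.

```python
import sys
from difflib import context_diff

old = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
new = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']

# Inputs with trailing newlines: the delta can be passed to writelines().
with_newlines = list(context_diff(old, new,
                                  fromfile='before.py', tofile='after.py'))
sys.stdout.writelines(with_newlines)

# Inputs without trailing newlines: lineterm='' keeps the control
# lines newline free as well, so the output is uniform.
old_bare = [line.rstrip('\n') for line in old]
new_bare = [line.rstrip('\n') for line in new]
stripped = list(context_diff(old_bare, new_bare,
                             fromfile='before.py', tofile='after.py',
                             lineterm=''))
```

In the delta, changed lines are marked with "! " in both the before and the after block, while unchanged context lines are indented by two spaces.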
get_close_matches(word, possibilities[, n[, cutoff]])
Return a list of the best "good enough" matches. word is a
sequence for which close matches are desired (typically a string),
and possibilities is a list of sequences against which to
match word (typically a list of strings).
Optional argument n (default 3) is the maximum number
of close matches to return; n must be greater than 0.
Optional argument cutoff (default 0.6) is a float in
the range [0, 1]. Possibilities that don't score at least that
similar to word are ignored.
The best (no more than n) matches among the possibilities are
returned in a list, sorted by similarity score, most similar first.
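A short sketch of how n and cutoff shape the result; the word list is arbitrary sample data, and the code assumes a current Python interpreter.

```python
from difflib import get_close_matches

words = ['ape', 'apple', 'peach', 'puppy']

# Defaults: at most 3 matches scoring at least 0.6, best first.
best = get_close_matches('appel', words)

# n=1 keeps only the single best match.
top_one = get_close_matches('appel', words, n=1)

# A stricter cutoff rejects everything here: 'apple' scores 0.8.
strict = get_close_matches('appel', words, cutoff=0.9)

print(best)     # ['apple', 'ape']
print(top_one)  # ['apple']
print(strict)   # []
```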
ndiff(a, b[, linejunk[, charjunk]])
Compare a and b (lists of strings); return a
Differ-style delta (a generator generating the delta lines).
Optional keyword parameters linejunk and charjunk are
for filter functions (or None):
linejunk: A function that accepts a single string
argument, and returns true if the string is junk, or false if not.
The default is None, starting with Python 2.3. Before then,
the default was the module-level function
IS_LINE_JUNK(), which filters out lines without visible
characters, except for at most one pound character ("#").
As of Python 2.3, the underlying SequenceMatcher class
does a dynamic analysis of which lines are so frequent as to
constitute noise, and this usually works better than the pre-2.3
default.
charjunk: A function that accepts a character (a string of
length 1), and returns true if the character is junk, or false if not.
The default is module-level function IS_CHARACTER_JUNK(),
which filters out whitespace characters (a blank or tab; note: bad
idea to include newline in this!).
Tools/scripts/ndiff.py is a command-line front-end to this
function.
>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
...              'ore\ntree\nemu\n'.splitlines(1))
>>> print ''.join(diff),
- one
?  ^
+ ore
?  ^
- two
- three
?  -
+ tree
+ emu
restore(sequence, which)
Return one of the two sequences that generated a delta.
Given a sequence produced by Differ.compare() or
ndiff(), extract lines originating from file 1 or 2
(parameter which), stripping off line prefixes.
Example:
>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
... 'ore\ntree\nemu\n'.splitlines(1))
>>> diff = list(diff) # materialize the generated delta into a list
>>> print ''.join(restore(diff, 1)),
one
two
three
>>> print ''.join(restore(diff, 2)),
ore
tree
emu
unified_diff(a, b[, fromfile[, tofile[, fromfiledate[, tofiledate[, n[, lineterm]]]]]])
Compare a and b (lists of strings); return a
delta (a generator generating the delta lines) in unified diff
format.
Unified diffs are a compact way of showing just the lines that have
changed plus a few lines of context. The changes are shown in an
inline style (instead of separate before/after blocks). The number
of context lines is set by n which defaults to three.
By default, the diff control lines (those with ---, +++,
or @@) are created with a trailing newline. This is helpful so
that inputs created from file.readlines() result in diffs
that are suitable for use with file.writelines() since both
the inputs and outputs have trailing newlines.
For inputs that do not have trailing newlines, set the lineterm
argument to "" so that the output will be uniformly newline free.
The unified diff format normally has a header for filenames and
modification times. Any or all of these may be specified using strings for
fromfile, tofile, fromfiledate, and tofiledate.
The modification times are normally expressed in the format returned by
time.ctime(). If not specified, the strings default to blanks.
Tools/scripts/diff.py is a command-line front-end for this
function.
New in version 2.3.
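For comparison with the before/after layout of a context diff, this sketch produces a unified diff and then shrinks the context with n=0. The file names before.py and after.py are placeholders, and the code assumes a current Python interpreter.

```python
import sys
from difflib import unified_diff

old = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
new = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']

# Default context (n=3): the unchanged line 'guido' is included.
delta = list(unified_diff(old, new,
                          fromfile='before.py', tofile='after.py'))
sys.stdout.writelines(delta)

# n=0 drops all context lines and keeps only the changes.
no_context = list(unified_diff(old, new,
                               fromfile='before.py', tofile='after.py',
                               n=0))
```

Removed lines carry a "-" prefix, added lines a "+" prefix, and context lines a single space, all within one hunk introduced by an @@ line.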
IS_LINE_JUNK(line)
Return true for ignorable lines. The line line is ignorable
if line is blank or contains a single "#",
otherwise it is not ignorable. Used as a default for parameter
linejunk in ndiff() before Python 2.3.
IS_CHARACTER_JUNK(ch)
Return true for ignorable characters. The character ch is
ignorable if ch is a space or tab, otherwise it is not
ignorable. Used as a default for parameter charjunk in
ndiff().
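The two predicates can be called directly to see what they treat as junk, and passed to ndiff() explicitly to request the pre-2.3 line filtering. The sample inputs below are arbitrary, and the code assumes a current Python interpreter.

```python
from difflib import IS_LINE_JUNK, IS_CHARACTER_JUNK, ndiff, restore

# Lines: blank lines and lines holding a lone "#" are junk.
line_results = [IS_LINE_JUNK('\n'),
                IS_LINE_JUNK('  #   \n'),
                IS_LINE_JUNK('hello\n')]

# Characters: only a space or a tab is junk; newline is not.
char_results = [IS_CHARACTER_JUNK(' '),
                IS_CHARACTER_JUNK('\t'),
                IS_CHARACTER_JUNK('\n'),
                IS_CHARACTER_JUNK('x')]

print(line_results)  # [True, True, False]
print(char_results)  # [True, True, False, False]

# Passing both filters to ndiff() explicitly. Junk filtering affects
# how lines are matched up, not which lines appear in the delta.
a = ['one\n', '\n', 'two\n']
b = ['one\n', '#\n', 'too\n']
delta = list(ndiff(a, b, IS_LINE_JUNK, IS_CHARACTER_JUNK))
```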