antropy.lziv_complexity

antropy.lziv_complexity(sequence, normalize=False)

Lempel-Ziv (LZ) complexity of a sequence.

Added in version 0.1.1.

Parameters:
sequence : str or array

A sequence of characters, e.g. '1001111011000010', [0, 1, 0, 1, 1], or 'Hello World!'.

normalize : bool

If True, returns the normalized LZ (see Notes).

Returns:
lz : int or float

LZ complexity, which corresponds to the number of different substrings encountered as the stream is viewed from the beginning to the end. If normalize=False, the output is an integer (counts), otherwise the output is a float.

Notes

LZ complexity is defined as the number of different substrings encountered as the sequence is viewed from beginning to the end.
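
The parsing rule can be illustrated with a deliberately slow, pure-Python sketch. It is not the library's implementation: the name `lz76_sketch` and the helper `occurs` are hypothetical, and the sketch assumes the exhaustive-history parsing that the substring comments in the Examples suggest (each new substring is extended until it no longer appears in the text seen so far).

```python
def lz76_sketch(sequence):
    """Count substrings by exhaustive-history parsing (illustrative only)."""
    s = tuple(sequence)  # works for strings and for lists of symbols
    n = len(s)

    def occurs(sub, hist):
        # True if `sub` appears as a contiguous run inside `hist`.
        m = len(sub)
        return any(hist[k:k + m] == sub for k in range(len(hist) - m + 1))

    count, i = 0, 0
    while i < n:
        length = 1
        # Grow the candidate while it can still be copied from the history,
        # where the history is everything before the candidate's last symbol.
        while i + length <= n and occurs(s[i:i + length], s[:i + length - 1]):
            length += 1
        count += 1   # one more distinct substring
        i += length  # continue parsing right after it
    return count


print(lz76_sketch("1001111011000010"))  # 1 / 0 / 01 / 1110 / 1100 / 0010
print(lz76_sketch([1, 0, 1, 0, 1, 0, 1, 0, 1, 0]))  # 1 / 0 / 10101010
```

The nested search makes this sketch roughly cubic in the sequence length, so it is meant for checking small cases by hand rather than for real signals.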

Although the raw LZ is an important complexity indicator, it is heavily influenced by sequence length (longer sequences yield higher LZ). Zhang and colleagues (2009) therefore proposed the normalized LZ, which is defined by

\[\text{LZn} = \frac{\text{LZ}}{(n / \log_b{n})}\]

where \(n\) is the length of the sequence and \(b\) the number of unique characters in the sequence.
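
As a worked check of the formula, the following sketch applies the normalization to a raw count. The helper name `normalize_lz` is hypothetical and only mirrors the equation above; it is not part of the library.

```python
import math


def normalize_lz(lz, n, b):
    """Apply LZn = LZ / (n / log_b(n)) to a raw LZ count.

    lz: raw LZ count, n: sequence length, b: number of unique symbols.
    Assumes b >= 2, since log base 1 is undefined.
    """
    return lz / (n / math.log(n, b))


# "1001111011000010": LZ = 6, n = 16, b = 2, so log_2(16) = 4 and 6 / (16 / 4) = 1.5
print(normalize_lz(6, 16, 2))

# 26 distinct letters: LZ = 26, n = 26, b = 26, so log_26(26) = 1 and 26 / 26 = 1
print(normalize_lz(26, 26, 26))
```

Both cases agree, up to floating-point rounding, with the normalized values shown in the Examples.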

Warning

Float and integer arrays are cast to uint32 before processing (values are truncated, not discretized into bins). For continuous-valued signals, binarize the sequence first, e.g.:

(x >= np.median(x)).astype(int)
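
To see why this matters, the sketch below compares truncation with median binarization on a small made-up signal. It assumes the cast behaves like NumPy's `astype(np.uint32)`; the array values are illustrative only.

```python
import numpy as np

# A made-up continuous signal with all values in [0, 1).
x = np.array([0.12, 0.85, 0.40, 0.93, 0.07, 0.66])

# Truncation to unsigned integers collapses every value to the same symbol,
# so the sequence no longer carries any of the signal's structure.
truncated = x.astype(np.uint32)
print(truncated.tolist())

# Median binarization keeps the above/below-median pattern instead.
binary = (x >= np.median(x)).astype(int)
print(binary.tolist())
```

The binarized array can then be passed to lziv_complexity in place of the raw signal.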

Examples

>>> from antropy import lziv_complexity
>>> # Substrings = 1 / 0 / 01 / 1110 / 1100 / 0010
>>> s = "1001111011000010"
>>> lziv_complexity(s)
6

Using a list of integers or booleans instead of a string

>>> # 1 / 0 / 10
>>> lziv_complexity([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
3

With normalization

>>> lziv_complexity(s, normalize=True)
1.5

This function also works with characters and words

>>> s = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
>>> lziv_complexity(s), lziv_complexity(s, normalize=True)
(26, 1.0)
>>> s = "HELLO WORLD! HELLO WORLD! HELLO WORLD! HELLO WORLD!"
>>> lziv_complexity(s), lziv_complexity(s, normalize=True)
(11, 0.38596001132145313)