antropy.lziv_complexity

antropy.lziv_complexity(sequence, normalize=False)

Lempel-Ziv (LZ) complexity of a sequence.

Added in version 0.1.1.

Parameters:

sequence (str or array): A sequence of characters, e.g. '1001111011000010', [0, 1, 0, 1, 1], or 'Hello World!'.
normalize (bool): If True, return the normalized LZ complexity (see Notes).

Returns:

lz (int or float): LZ complexity, which corresponds to the number of different substrings encountered as the sequence is scanned from beginning to end. If normalize=False, the output is an integer count; otherwise it is a float.

Notes

LZ complexity is defined as the number of different substrings encountered as the sequence is scanned from beginning to end.
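The parsing behind this definition can be sketched in pure Python. The helper below, `lz76_phrases`, is a hypothetical name introduced for illustration and is not part of antropy; it is a slow, readable sketch of the Lempel-Ziv (1976) parsing and makes no claim to mirror the library's internal implementation. Each new phrase is the shortest substring, starting at the current position, that has not yet occurred in the text preceding its last character.

```python
def lz76_phrases(sequence):
    """Split `sequence` into Lempel-Ziv (1976) phrases (illustrative helper, not antropy API)."""
    items = list(sequence)
    # Encode every distinct symbol as a single character so that Python's
    # substring operator `in` can be used for the search, whatever the input type.
    codes = {sym: chr(i) for i, sym in enumerate(dict.fromkeys(items))}
    s = "".join(codes[sym] for sym in items)
    n = len(s)

    phrases = []
    start = 0
    while start < n:
        length = 1
        # Extend the candidate while it already occurs in the text that
        # precedes its last character (overlapping occurrences are allowed).
        while start + length <= n and s[start:start + length] in s[:start + length - 1]:
            length += 1
        stop = min(start + length, n)  # the final phrase may end with the sequence
        phrases.append(tuple(items[start:stop]))
        start = stop
    return phrases


phrases = lz76_phrases("1001111011000010")
print(["".join(p) for p in phrases])  # ['1', '0', '01', '1110', '1100', '0010']
print(len(phrases))                   # 6
```

The number of phrases is the raw LZ complexity. This sketch re-scans the prefix for every candidate, so it is only suitable for short sequences.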

Although the raw LZ is an important complexity indicator, it is heavily influenced by sequence length (longer sequences yield higher LZ values). Zhang and colleagues (2009) therefore proposed the normalized LZ, which is defined by

\[\text{LZn} = \frac{\text{LZ}}{(n / \log_b{n})}\]

where \(n\) is the length of the sequence and \(b\) the number of unique characters in the sequence.
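As a worked illustration of the formula, the snippet below applies the normalization to a raw LZ count using only the standard library. The function name `normalized_lz` is hypothetical and chosen for this sketch; the guard against fewer than two distinct symbols is an assumption added here because the base-1 logarithm is undefined, and may differ from how the library handles that case.

```python
import math


def normalized_lz(lz, sequence):
    """Normalize a raw LZ count as LZ / (n / log_b(n)) (illustrative helper, not antropy API)."""
    n = len(sequence)       # length of the sequence
    b = len(set(sequence))  # number of unique characters
    if b < 2:
        # Assumption for this sketch: log base 1 is undefined, so refuse the input.
        raise ValueError("at least two distinct symbols are required")
    return lz / (n / math.log(n, b))


# n = 16, b = 2, so log_2(16) = 4 and LZn = 6 / (16 / 4), approximately 1.5
print(normalized_lz(6, "1001111011000010"))
```

For the 26-letter alphabet string, n = b = 26 gives log_b(n) = 1, so a raw count of 26 normalizes to approximately 1.0.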

Warning

Float and integer arrays are cast to uint32 before processing (values are truncated, not discretized into bins). For continuous-valued signals, binarize the sequence first, e.g.:

(x >= np.median(x)).astype(int)
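To make the difference concrete, the following sketch contrasts a plain cast with median binarization on a small, made-up signal. The array values are invented for illustration, and `astype(np.uint32)` is used here only to mimic the cast described in the warning.

```python
import numpy as np

# Hypothetical continuous-valued signal (non-negative values, chosen for illustration).
x = np.array([0.2, 0.9, 0.4, 1.7, 0.1, 2.3, 0.6, 1.1])

# A plain cast drops the fractional part, so most of the variation is lost.
truncated = x.astype(np.uint32)

# Median binarization: 1 where the sample is at or above the median, else 0.
binarized = (x >= np.median(x)).astype(int)

print(truncated.tolist())  # [0, 0, 0, 1, 0, 2, 0, 1]
print(binarized.tolist())  # [0, 1, 0, 1, 0, 1, 0, 1]
```

The binarized array is the kind of input that can then be passed to lziv_complexity.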

References

Lempel, A., & Ziv, J. (1976). On the Complexity of Finite Sequences. IEEE Transactions on Information Theory, 22(1), 75–81. https://doi.org/10.1109/TIT.1976.1055501
Zhang, Y., Hao, J., Zhou, C., & Chang, K. (2009). Normalized Lempel-Ziv complexity and its application in bio-sequence analysis. Journal of Mathematical Chemistry, 46(4), 1203–1212. https://doi.org/10.1007/s10910-008-9512-2
https://en.wikipedia.org/wiki/Lempel-Ziv_complexity
https://github.com/Naereen/Lempel-Ziv_Complexity

Examples

>>> from antropy import lziv_complexity
>>> # Substrings = 1 / 0 / 01 / 1110 / 1100 / 0010
>>> s = "1001111011000010"
>>> lziv_complexity(s)
6

Using a list of integers or booleans instead of a string

>>> # 1 / 0 / 10
>>> lziv_complexity([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
3

With normalization

>>> lziv_complexity(s, normalize=True)
1.5

This function also works with characters and words

>>> s = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
>>> lziv_complexity(s), lziv_complexity(s, normalize=True)
(26, 1.0)

>>> s = "HELLO WORLD! HELLO WORLD! HELLO WORLD! HELLO WORLD!"
>>> lziv_complexity(s), lziv_complexity(s, normalize=True)
(11, 0.38596001132145313)