
class yasa.EpochByEpochAgreement(ref_hyps, obs_hyps)

Evaluate agreement between two hypnograms or two collections of hypnograms.

Evaluation includes averaged agreement scores, one-vs-rest agreement scores, agreement scores summarized across all sleep and summarized by sleep stage, and various plotting options to visualize the two hypnograms simultaneously. See examples for more detail.

New in version 0.7.0.

ref_hyps : iterable of yasa.Hypnogram

A collection of reference hypnograms (i.e., those considered ground-truth).

Each yasa.Hypnogram in ref_hyps must have the same scorer.

If a dict, the keys are used to generate unique sleep session IDs. If any other iterable (e.g., list or tuple), unique sleep session IDs are generated automatically.

obs_hyps : iterable of yasa.Hypnogram

A collection of observed hypnograms (i.e., those to be evaluated).

Each yasa.Hypnogram in obs_hyps must have the same scorer, and this scorer must differ from the scorer of the hypnograms in ref_hyps.

If a dict, the keys must match those of ref_hyps.

Important: It is assumed that the order of hypnograms is the same in ref_hyps and obs_hyps. For example, the third hypnogram in ref_hyps and the third hypnogram in obs_hyps must come from the same sleep session and differ only in their scorer.

See also: For comparing just two hypnograms, use yasa.Hypnogram.evaluate().
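Because sessions are paired across the two collections, a quick check of the session IDs can catch misaligned inputs before the agreement object is built. The sketch below is a hypothetical helper, not part of yasa: the name `check_session_keys` and the placeholder string values (standing in for yasa.Hypnogram objects) are assumptions for illustration only.

```python
def check_session_keys(ref_hyps, obs_hyps):
    """Return the shared session IDs, or raise if the two dicts are misaligned.

    Hypothetical helper (not part of yasa): it only inspects the dict keys.
    """
    ref_keys, obs_keys = list(ref_hyps), list(obs_hyps)
    if set(ref_keys) != set(obs_keys):
        unmatched = sorted(set(ref_keys) ^ set(obs_keys))
        raise ValueError(f"Session IDs present in only one collection: {unmatched}")
    return ref_keys


# Placeholder values stand in for yasa.Hypnogram objects.
ref = {"night1": "ref hypnogram 1", "night2": "ref hypnogram 2"}
obs = {"night2": "obs hypnogram 2", "night1": "obs hypnogram 1"}
session_ids = check_session_keys(ref, obs)
print(session_ids)  # ['night1', 'night2']
```

The check compares keys only; it cannot confirm that two hypnograms stored under the same key really come from the same night.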


Many steps here are influenced by guidelines proposed in Menghini et al., 2021 [Menghini2021]. See https://sri-human-sleep.github.io/sleep-trackers-performance/AnalyticalPipeline_v1.0.0.html



[Menghini2021] Menghini, L., Cellini, N., Goldstone, A., Baker, F. C., & de Zambotti, M. (2021). A standardized framework for testing the performance of sleep-tracking technology: step-by-step guidelines and open-source code. SLEEP, 44(2), zsaa170. https://doi.org/10.1093/sleep/zsaa170


>>> import yasa
>>> ref_hyps = [yasa.simulate_hypnogram(tib=600, scorer="Human", seed=i) for i in range(10)]
>>> obs_hyps = [h.simulate_similar(scorer="YASA", seed=i) for i, h in enumerate(ref_hyps)]
>>> ebe = yasa.EpochByEpochAgreement(ref_hyps, obs_hyps)
>>> agr = ebe.get_agreement()
>>> agr.head(5).round(2)
          accuracy  balanced_acc  kappa   mcc  precision  recall     f1
1             0.31          0.26   0.07  0.07       0.31    0.31   0.31
2             0.33          0.33   0.14  0.14       0.35    0.33   0.34
3             0.35          0.24   0.06  0.06       0.35    0.35   0.35
4             0.22          0.21   0.01  0.01       0.21    0.22   0.21
5             0.21          0.17  -0.06 -0.06       0.20    0.21   0.21
>>> ebe.get_agreement_bystage().head(12).round(3)
                fbeta  precision  recall  support
stage sleep_id
WAKE  1         0.391      0.371   0.413    189.0
      2         0.299      0.276   0.326    184.0
      3         0.234      0.204   0.275    255.0
      4         0.268      0.285   0.252    321.0
      5         0.228      0.230   0.227    181.0
      6         0.407      0.384   0.433    284.0
      7         0.362      0.296   0.467    287.0
      8         0.298      0.519   0.209    263.0
      9         0.210      0.191   0.233    313.0
      10        0.369      0.420   0.329    362.0
N1    1         0.185      0.185   0.185    124.0
      2         0.121      0.131   0.112    160.0
>>> ebe.get_confusion_matrix(sleep_id=1)
YASA   WAKE  N1   N2  N3  REM
WAKE     78  24   50   3   34
N1       23  23   43  15   20
N2       60  58  183  43  139
N3       30  10   50   5   32
REM      19   9  121  50   78
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(figsize=(6, 3), constrained_layout=True)
>>> ebe.plot_hypnograms(sleep_id=10)
>>> fig, ax = plt.subplots(figsize=(6, 3))
>>> ebe.plot_hypnograms(
...     sleep_id=8, ax=ax, obs_kwargs={"color": "red", "lw": 2, "ls": "dotted"}
... )
>>> plt.tight_layout()
>>> session = 8
>>> fig, ax = plt.subplots(figsize=(6.5, 2.5), constrained_layout=True)
>>> style_a = dict(alpha=1, lw=2.5, ls="solid", color="gainsboro", label="Michel")
>>> style_b = dict(alpha=1, lw=2.5, ls="solid", color="cornflowerblue", label="Jouvet")
>>> legend_style = dict(
...     title="Scorer", frameon=False, ncol=2, loc="lower center", bbox_to_anchor=(0.5, 0.9)
... )
>>> ax = ebe.plot_hypnograms(
...     sleep_id=session, ref_kwargs=style_a, obs_kwargs=style_b, legend=legend_style, ax=ax
... )
>>> acc = ebe.get_agreement().multiply(100).at[session, "accuracy"]
>>> ax.text(
...     0.01, 1, f"Accuracy = {acc:.0f}%", ha="left", va="bottom", transform=ax.transAxes
... )

When comparing only 2 hypnograms, use the evaluate() method:

>>> hypno_a = yasa.simulate_hypnogram(tib=90, scorer="RaterA", seed=8)
>>> hypno_b = hypno_a.simulate_similar(scorer="RaterB", seed=9)
>>> ebe = hypno_a.evaluate(hypno_b)
>>> ebe.get_confusion_matrix()
RaterB  WAKE  N1  N2  N3
WAKE      71   2  20   8
N1         1   0   9   0
N2        12   4  25   0
N3        24   0   1   3
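To make the link between a confusion matrix and the averaged agreement scores concrete, the sketch below recomputes accuracy and Cohen's kappa by hand from the matrix printed above. It uses the standard textbook definitions and plain Python; it is not yasa's implementation, and the variable names are chosen for illustration.

```python
# Confusion matrix copied from the evaluate() example above:
# rows are RaterA (reference) stages, columns are RaterB (observed) stages.
stages = ["WAKE", "N1", "N2", "N3"]
cm = [
    [71, 2, 20, 8],
    [1, 0, 9, 0],
    [12, 4, 25, 0],
    [24, 0, 1, 3],
]

n_epochs = sum(sum(row) for row in cm)
row_totals = [sum(row) for row in cm]
col_totals = [sum(col) for col in zip(*cm)]

# Observed agreement (accuracy): proportion of epochs on the diagonal.
accuracy = sum(cm[i][i] for i in range(len(stages))) / n_epochs

# Chance agreement expected from the marginal stage frequencies.
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n_epochs**2

# Cohen's kappa: agreement beyond chance, scaled by the maximum possible.
kappa = (accuracy - expected) / (1 - expected)

print(f"accuracy = {accuracy:.2f}, kappa = {kappa:.2f}")
# accuracy = 0.55, kappa = 0.23
```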
__init__(ref_hyps, obs_hyps)

get_agreement([sample_weight, scorers])

Return a pandas.DataFrame of weighted (i.e., averaged) agreement scores.


get_agreement_bystage([beta])

Return a pandas.DataFrame of unweighted (i.e., one-vs-rest) agreement scores.

get_confusion_matrix([sleep_id, agg_func])

Return a ref_hyp/obs_hyp confusion matrix from either a single session or all sessions concatenated together.


get_sleep_stats()

Return a pandas.DataFrame of sleep statistics for each hypnogram derived from both reference and observed scorers.

multi_scorer(df, scorers)

Compute multiple agreement scores from a 2-column dataframe (an optional 3rd column may contain sample weights).

plot_hypnograms([sleep_id, legend, ax, ...])

Plot the two hypnograms of one session overlapping on the same axis.


summary([by_stage])

Return group-level agreement scores.

data

A pandas.DataFrame including all hypnograms.

n_sleeps

The number of unique sleep sessions.

obs_scorer

The name of the observed scorer.

ref_scorer

The name of the reference scorer.

get_agreement(sample_weight=None, scorers=None)

Return a pandas.DataFrame of weighted (i.e., averaged) agreement scores.


self : EpochByEvaluation

An EpochByEvaluation instance.

sample_weight : None or pandas.Series

Sample weights passed to underlying sklearn.metrics functions where possible. If a pandas.Series, its index must exactly match that of data.

scorers : None, list, or dictionary

The scorers to be used for evaluating agreement. If None (default), default scorers are used. If a list, the list must contain strings that represent metrics from the sklearn metrics module (e.g., accuracy, precision). If more customization is desired, a dictionary can be passed with scorer names (str) as keys and custom functions as values. The custom functions should take 3 positional arguments (true values, predicted values, and sample weights).


A DataFrame with agreement metrics as columns and sessions as rows.
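A custom scorer of the kind described for the scorers parameter can be sketched as follows. The function name `percent_agreement` and the toy labels are hypothetical; the only thing taken from the description above is the three-positional-argument form (true values, predicted values, sample weights). The commented-out call at the end assumes an existing EpochByEpochAgreement instance named `ebe`.

```python
def percent_agreement(y_true, y_pred, sample_weight=None):
    """Weighted percentage of epochs on which the two scorers agree.

    Follows the three-positional-argument form described above:
    true values, predicted values, and sample weights.
    """
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    agree = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return 100 * agree / sum(sample_weight)


# Quick check on toy stage labels (not real data).
ref = ["WAKE", "N1", "N2", "N2", "REM"]
obs = ["WAKE", "N2", "N2", "N2", "WAKE"]
score = percent_agreement(ref, obs, None)
print(score)  # 60.0

# Assumed usage with an existing EpochByEpochAgreement instance `ebe`:
# ebe.get_agreement(scorers={"pct_agree": percent_agreement})
```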


get_agreement_bystage(beta=1.0)

Return a pandas.DataFrame of unweighted (i.e., one-vs-rest) agreement scores.


self : EpochByEvaluation

An EpochByEvaluation instance.


beta : float

See sklearn.metrics.precision_recall_fscore_support().


A DataFrame with agreement metrics as columns and a MultiIndex with session and sleep stage as rows.
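As a worked illustration of one-vs-rest scoring, the sketch below recomputes the WAKE row for sleep_id=1 from the confusion matrix shown in the class-level example. It uses the standard definitions of precision, recall, and F1 (F-beta with beta = 1) in plain Python; the helper name `one_vs_rest` is hypothetical and this is not yasa's own code.

```python
# Confusion matrix for sleep_id=1 from the class-level example:
# rows are the reference scorer, columns are the observed scorer.
stages = ["WAKE", "N1", "N2", "N3", "REM"]
cm = [
    [78, 24, 50, 3, 34],
    [23, 23, 43, 15, 20],
    [60, 58, 183, 43, 139],
    [30, 10, 50, 5, 32],
    [19, 9, 121, 50, 78],
]


def one_vs_rest(cm, idx):
    """Precision, recall, F1, and support for one stage against all others."""
    true_pos = cm[idx][idx]
    support = sum(cm[idx])  # epochs the reference scored as this stage
    predicted = sum(row[idx] for row in cm)  # epochs the observed scorer gave this stage
    precision = true_pos / predicted
    recall = true_pos / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support


precision, recall, f1, support = one_vs_rest(cm, stages.index("WAKE"))
print(round(f1, 3), round(precision, 3), round(recall, 3), support)
# 0.391 0.371 0.413 189
```

These values match the first WAKE row of the get_agreement_bystage() output shown in the class-level example.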

get_confusion_matrix(sleep_id=None, agg_func=None, **kwargs)

Return a ref_hyp/obs_hyp confusion matrix from either a single session or all sessions concatenated together.


self : yasa.EpochByEpochAgreement

A yasa.EpochByEpochAgreement instance.

sleep_id : None or a valid sleep ID

If None (default), cross-tabulation is derived from the entire group dataset. If a valid sleep ID, cross-tabulation is derived using only the reference and observed scored hypnograms from that sleep session.

agg_func : None or str

If None (default), group results are returned as a DataFrame containing the results of every individual session. Otherwise, group results are aggregated across sessions, with agg_func passed as the func parameter of pandas.DataFrame.groupby.agg(). For example, set agg_func="sum" to get a single confusion matrix across all epochs that does not take session into account.

**kwargs : key, value pairs

Additional keyword arguments are passed to sklearn.metrics.confusion_matrix().


A confusion matrix with stages from the reference scorer as indices and stages from the observed scorer as columns.


>>> import yasa
>>> ref_hyps = [yasa.simulate_hypnogram(tib=90, scorer="Rater1", seed=i) for i in range(3)]
>>> obs_hyps = [h.simulate_similar(scorer="Rater2", seed=i) for i, h in enumerate(ref_hyps)]
>>> ebe = yasa.EpochByEpochAgreement(ref_hyps, obs_hyps)
>>> ebe.get_confusion_matrix(sleep_id=2)
Rater2  WAKE  N1  N2  N3  REM
WAKE       1   2  23   0    0
N1         0   9  13   0    0
N2         0   6  71   0    0
N3         0  13  42   0    0
REM        0   0   0   0    0
>>> ebe.get_confusion_matrix()
Rater2           WAKE  N1  N2  N3  REM
sleep_id Rater1
1        WAKE      30   0   3   0   35
         N1         3   2   7   0    0
         N2        21  12   7   0    4
         N3         0   0   0   0    0
         REM        2   8  29   0   17
2        WAKE       1   2  23   0    0
         N1         0   9  13   0    0
         N2         0   6  71   0    0
         N3         0  13  42   0    0
         REM        0   0   0   0    0
3        WAKE      16   0   7  19   19
         N1         0   7   2   0    5
         N2         0  10  12   7    5
         N3         0   0  16  11    0
         REM        0  15  11  18    0
>>> ebe.get_confusion_matrix(agg_func="sum")
Rater2  WAKE  N1  N2  N3  REM
WAKE      47   2  33  19   54
N1         3  18  22   0    5
N2        21  28  90   7    9
N3         0  13  58  11    0
REM        2  23  40  18   17
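The effect of agg_func="sum" can be reproduced with plain pandas. The sketch below rebuilds the three per-session matrices printed above, stacks them under a (sleep_id, Rater1) index, and sums within each reference stage. It illustrates the aggregation only and is not yasa's internal code; the variable names are assumptions.

```python
import pandas as pd

stages = ["WAKE", "N1", "N2", "N3", "REM"]
# Per-session matrices copied from the output of ebe.get_confusion_matrix() above.
per_session = {
    1: [[30, 0, 3, 0, 35], [3, 2, 7, 0, 0], [21, 12, 7, 0, 4], [0, 0, 0, 0, 0], [2, 8, 29, 0, 17]],
    2: [[1, 2, 23, 0, 0], [0, 9, 13, 0, 0], [0, 6, 71, 0, 0], [0, 13, 42, 0, 0], [0, 0, 0, 0, 0]],
    3: [[16, 0, 7, 19, 19], [0, 7, 2, 0, 5], [0, 10, 12, 7, 5], [0, 0, 16, 11, 0], [0, 15, 11, 18, 0]],
}

# Stack the sessions into one frame indexed by (sleep_id, reference stage).
frames = {sid: pd.DataFrame(m, index=stages, columns=stages) for sid, m in per_session.items()}
stacked = pd.concat(frames)
stacked.index.names = ["sleep_id", "Rater1"]

# Sum across sessions within each reference stage; sort=False keeps the stage order.
summed = stacked.groupby(level="Rater1", sort=False).agg("sum")
print(summed)
```

The printed counts equal those returned by ebe.get_confusion_matrix(agg_func="sum") above.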

get_sleep_stats()

Return a pandas.DataFrame of sleep statistics for each hypnogram derived from both reference and observed scorers.


self : yasa.EpochByEpochAgreement

A yasa.EpochByEpochAgreement instance.


A DataFrame with sleep statistics as columns and two rows for each individual (one for the reference scorer and another for the observed scorer).

static multi_scorer(df, scorers)

Compute multiple agreement scores from a 2-column dataframe (an optional 3rd column may contain sample weights).

This function offers convenience when calculating multiple agreement scores using pandas.DataFrame.groupby.apply(). Scikit-learn doesn’t include a function that returns multiple scores, and the GroupBy implementation of apply in pandas does not accept multiple functions.


df : pandas.DataFrame

A DataFrame with 2 columns and length of n_samples. The first column contains reference values and the second column contains observed values. If a third column is present, it must contain sample weights to be passed to underlying sklearn.metrics functions as sample_weight where applicable.

scorers : dictionary

The scorers to be used for evaluating agreement. A dictionary with scorer names (str) as keys and functions as values.


A dictionary with scorer names (str) as keys and scores (float) as values.
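The idea of computing several scores per session in one pass can be sketched with a small stand-in. Everything below is hypothetical: `multi_score`, `accuracy`, and `wake_recall` are illustrative functions written for this example (hand-rolled rather than taken from sklearn.metrics), and the sketch loops over groups instead of calling groupby.apply so that it behaves the same across pandas versions.

```python
import pandas as pd


def accuracy(y_true, y_pred, sample_weight=None):
    """Weighted proportion of matching labels (stand-in for a sklearn metric)."""
    weights = pd.Series(1.0, index=y_true.index) if sample_weight is None else sample_weight
    return float(((y_true == y_pred) * weights).sum() / weights.sum())


def wake_recall(y_true, y_pred, sample_weight=None):
    """Unweighted recall for the WAKE stage only (ignores sample_weight)."""
    is_wake = y_true == "WAKE"
    return float((y_pred[is_wake] == "WAKE").mean())


def multi_score(df, scorers):
    """Apply every scorer to the first two columns of df and collect the results."""
    y_true, y_pred = df.iloc[:, 0], df.iloc[:, 1]
    weights = df.iloc[:, 2] if df.shape[1] > 2 else None
    return {name: func(y_true, y_pred, weights) for name, func in scorers.items()}


df = pd.DataFrame({
    "sleep_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "ref": ["WAKE", "WAKE", "N2", "N2", "WAKE", "N2", "N2", "REM"],
    "obs": ["WAKE", "N2", "N2", "N2", "WAKE", "N2", "REM", "REM"],
})
scorers = {"accuracy": accuracy, "wake_recall": wake_recall}

# One dict of scores per session, expanded into one column per metric.
per_session = {
    sid: multi_score(group[["ref", "obs"]], scorers) for sid, group in df.groupby("sleep_id")
}
scores = pd.DataFrame.from_dict(per_session, orient="index")
print(scores)
```

Returning a dictionary keeps the scorer names attached to their values, which makes it straightforward to expand the per-session results into one column per metric.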

plot_hypnograms(sleep_id=None, legend=True, ax=None, ref_kwargs={}, obs_kwargs={})

Plot the two hypnograms of one session overlapping on the same axis.


self : yasa.EpochByEpochAgreement

A yasa.EpochByEpochAgreement instance.

sleep_id : a valid sleep ID or None

The sleep session to plot. If multiple sessions are included in the EpochByEpochAgreement instance, a sleep_id must be provided. If only one session is present, None (default) will plot the two hypnograms of the only session.

legend : bool or dict

If True (default) or a dictionary, a legend is added. If a dictionary, all key/value pairs are passed as keyword arguments to the matplotlib.pyplot.legend() call.

ax : matplotlib.axes.Axes or None

Axis on which to draw the plot, optional.

ref_kwargs : dict

Keyword arguments passed to yasa.plot_hypnogram() when plotting the reference hypnogram.

obs_kwargs : dict

Keyword arguments passed to yasa.plot_hypnogram() when plotting the observed hypnogram.


Matplotlib Axes


>>> from yasa import simulate_hypnogram
>>> hyp = simulate_hypnogram(scorer="Anthony", seed=19)
>>> ax = hyp.evaluate(hyp.simulate_similar(scorer="Alan", seed=68)).plot_hypnograms()
summary(by_stage=False, **kwargs)

Return group-level agreement scores.

Default aggregated measures are


self : EpochByEpochAgreement

An EpochByEpochAgreement instance.

by_stage : bool

If False (default), the summary will include agreement scores derived from average-based metrics. If True, the returned summary DataFrame will include agreement scores for each sleep stage, derived from one-vs-rest metrics.

**kwargs : key, value pairs

Additional keyword arguments are passed to pandas.DataFrame.groupby.agg(). This can be used to customize the descriptive statistics returned.


A pandas.DataFrame summarizing agreement scores across the entire dataset with descriptive statistics.

>>> ebe = yasa.EpochByEpochAgreement(...)
>>> agreement = ebe.get_agreement()
>>> ebe.summary()

This will give a DataFrame where each row is an agreement metric and each column is a descriptive statistic (e.g., mean, standard deviation). To control the descriptive statistics included as columns:

>>> ebe.summary(func=["count", "mean", "sem"])
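The shape of such a summary can be illustrated with plain pandas. The sketch below takes two columns from the first five rows of the get_agreement() output shown in the class-level example, aggregates them with the same statistic names, and transposes so that metrics are rows. It is an analogue built on DataFrame.agg(), not yasa's own implementation, and the resulting numbers cover only those five rows.

```python
import pandas as pd

# Two metrics from the first five rows of the get_agreement() output shown earlier.
agreement = pd.DataFrame(
    {
        "accuracy": [0.31, 0.33, 0.35, 0.22, 0.21],
        "kappa": [0.07, 0.14, 0.06, 0.01, -0.06],
    },
    index=[1, 2, 3, 4, 5],
)

# Aggregate each metric, then transpose so metrics are rows and statistics are columns.
summary = agreement.agg(["count", "mean", "sem"]).T
print(summary.round(3))
```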
property data

A pandas.DataFrame including all hypnograms.

property n_sleeps

The number of unique sleep sessions.

property obs_scorer

The name of the observed scorer.

property ref_scorer

The name of the reference scorer.