yasa.compare_detection(indices_detection, indices_groundtruth, max_distance=0)

Determine correctness of detected events against ground-truth events.


Parameters

indices_detection : array_like

Indices of the detected events. For example, this could be the indices of the start of the spindles, or of the negative peak of the slow-waves. The indices must be in samples, not in seconds.


indices_groundtruth : array_like

Indices of the ground-truth events, in samples.

max_distance : int, optional

Maximum distance between indices, in samples, to consider as the same event (default = 0). For example, if the sampling frequency of the data is 100 Hz, using max_distance=100 will search for a matching event 1 second before or after the current event.


Returns

A dictionary with the comparison results:

  • tp: True positives, i.e. actual events detected as events.

  • fp: False positives, i.e. non-events detected as events.

  • fn: False negatives, i.e. actual events not detected as events.

  • precision: Precision score, aka positive predictive value (see Notes).

  • recall: Recall score, aka sensitivity (see Notes).

  • f1: F1-score (see Notes).


Notes

  • The precision score is calculated as TP / (TP + FP).

  • The recall score is calculated as TP / (TP + FN).

  • The F1-score is calculated as TP / (TP + 0.5 * (FP + FN)).

This function is inspired by the sleepecg.compare_heartbeats function.
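The three formulas above can be sketched as a small helper that derives the scores from raw event counts. This is our own illustration, not part of yasa's API; the counts used below are taken from the first example in this page.

```python
def detection_scores(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0
    recall = tp / (tp + fn) if tp + fn else 0
    f1 = tp / (tp + 0.5 * (fp + fn)) if tp + fp + fn else 0
    return precision, recall, f1

# Counts from the first example below: 5 TP, 2 FP, 4 FN
print(detection_scores(5, 2, 4))  # (0.7142857142857143, 0.5555555555555556, 0.625)
```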


Examples

A simple example. Here, detected refers to the indices (in the data) of the detected events. These could be, for example, the index of the onset of each detected spindle. grndtrth refers to the ground-truth (e.g. human-annotated) events.

>>> from yasa import compare_detection
>>> detected = [5, 12, 20, 34, 41, 57, 63]
>>> grndtrth = [5, 12, 18, 26, 34, 41, 55, 63, 68]
>>> compare_detection(detected, grndtrth)
{'tp': array([ 5, 12, 34, 41, 63]),
 'fp': array([20, 57]),
 'fn': array([18, 26, 55, 68]),
 'precision': 0.7142857142857143,
 'recall': 0.5555555555555556,
 'f1': 0.625}

There are 5 true positives, 2 false positives and 4 false negatives. This gives a precision score of 0.71 (= 5 / (5 + 2)), a recall score of 0.56 (= 5 / (5 + 4)) and an F1-score of 0.625. The F1-score is the harmonic mean of precision and recall, and should be the preferred metric when comparing the performance of a detection against a ground-truth.

Order matters! If we set detected as the ground-truth, FP and FN are swapped, and so are precision and recall. The TP and F1-score remain the same, however. Therefore, when comparing two detections (rather than a detection against a ground-truth), the F1-score is the preferred metric because it is independent of the order of the inputs.

>>> compare_detection(grndtrth, detected)
{'tp': array([ 5, 12, 34, 41, 63]),
 'fp': array([18, 26, 55, 68]),
 'fn': array([20, 57]),
 'precision': 0.5555555555555556,
 'recall': 0.7142857142857143,
 'f1': 0.625}
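The order-independence of the F1-score can be checked directly: it is the harmonic mean of precision and recall, which is symmetric in its two arguments. A quick sketch (the `f1_from_pr` helper is ours, not part of yasa):

```python
def f1_from_pr(precision, recall):
    """F1-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall from the example above (5 TP, 2 FP, 4 FN)
p, r = 5 / 7, 5 / 9
print(round(f1_from_pr(p, r), 6))  # 0.625
print(round(f1_from_pr(r, p), 6))  # 0.625, unchanged after swapping
```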

There might be some events that are very close to each other, and we would like to count them as true positive even though they do not occur exactly at the same index. This is possible with the max_distance argument, which defines the lookaround window (in samples) for each event.

>>> compare_detection(detected, grndtrth, max_distance=2)
{'tp': array([ 5, 12, 20, 34, 41, 57, 63]),
 'fp': array([], dtype=int64),
 'fn': array([26, 68]),
 'precision': 1.0,
 'recall': 0.7777777777777778,
 'f1': 0.875}
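The effect of max_distance can be sketched with a greedy matcher: each detected event claims the closest still-unmatched ground-truth event within the lookaround window. This is our own simplified sketch, not yasa's exact implementation:

```python
def match_events(detected, groundtruth, max_distance=0):
    """Greedy sketch: pair each detected index with the nearest unused
    ground-truth index within +/- max_distance samples."""
    unused = set(groundtruth)
    tp, fp = [], []
    for d in detected:
        # Ground-truth events still unmatched and within the window
        candidates = [g for g in unused if abs(g - d) <= max_distance]
        if candidates:
            g = min(candidates, key=lambda x: abs(x - d))
            unused.discard(g)
            tp.append(d)
        else:
            fp.append(d)
    fn = sorted(unused)  # ground-truth events never matched
    return tp, fp, fn

detected = [5, 12, 20, 34, 41, 57, 63]
grndtrth = [5, 12, 18, 26, 34, 41, 55, 63, 68]
print(match_events(detected, grndtrth, max_distance=2))
# ([5, 12, 20, 34, 41, 57, 63], [], [26, 68])
```

With max_distance=2, the detections at 20 and 57 are paired with the ground-truth events at 18 and 55, reproducing the tp/fp/fn split shown in the example above.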

Finally, if detected is empty, all performance metrics are set to zero, and a copy of the ground-truth array is returned as false negatives.

>>> compare_detection([], grndtrth)
{'tp': array([], dtype=int64),
 'fp': array([], dtype=int64),
 'fn': array([ 5, 12, 18, 26, 34, 41, 55, 63, 68]),
 'precision': 0,
 'recall': 0,
 'f1': 0}