Kaplan-Meier Estimator

Introduction

The Kaplan-Meier estimator is a non-parametric estimator of the survival function that can be used in the presence of censoring and truncation. Graphically, it is a decreasing step function whose points of discontinuity correspond to the times at which an event occurs, in this case the exit of an individual. The Kaplan-Meier estimator is based on the estimation of instantaneous survival rates, which historically justifies the use of rate tables. When the sample is sufficiently large, the estimator approximates the true survival function of the population.
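As a rough illustration of the step shape described above, the following Python sketch computes the estimate in its usual product-limit form: at each time where at least one event is observed, the running estimate is multiplied by 1 - d/n, where d is the number of events at that time and n the number of individuals still at risk. The function name `kaplan_meier`, its arguments and the toy sample are hypothetical, and the sketch assumes right-censoring only (no truncation).

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t_i, S_i)] for each distinct event time t_i, in increasing order.

    durations: observed durations (event time or censoring time).
    observed:  True if the event was observed, False if right-censored.
    S_i is the estimated probability of surviving beyond t_i.
    """
    events = Counter(t for t, seen in zip(durations, observed) if seen)
    exits = Counter(durations)  # events and censorings both leave the risk set

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d > 0:
            # The curve only jumps at times where an event is observed.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical toy sample: six durations, two of them censored.
durations = [2, 3, 3, 5, 7, 8]
observed = [True, True, False, True, False, True]
for t, s in kaplan_meier(durations, observed):
    print(t, round(s, 4))
```

Censored durations never produce a jump by themselves; they only shrink the risk set, which makes the later jumps larger.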

In survival analysis, the density (with respect to the Lebesgue measure, or to the counting measure in the discrete case) is often not the best way to understand the distribution of a variable. A duration is the time elapsed before an event occurs. In our case, the duration is the length of the work stoppage, i.e. the difference between its end and its beginning. The hazard rate expresses the risk that this event occurs in the immediate future, given that it has not occurred before.

Construction

Recall that the distribution of a random variable T is uniquely determined by its distribution function F(t) = P(T ≤ t). The survival function of T is defined by S(t) = P(T ≥ t). If T is continuous, S(t) = 1 - F(t). Like F, the survival function S uniquely determines the distribution of T. The conditional survival function of T is S(t|x) = P(T ≥ t | T ≥ x). For t ≥ x and S(x) > 0, the event {T ≥ t} is contained in {T ≥ x}, so S(t|x) = S(t)/S(x).
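The identity S(t|x) = S(t)/S(x) can be checked numerically on a small sample. The sketch below is only an illustration under simplifying assumptions: the sample is fully observed (no censoring), and the function names and data are hypothetical.

```python
def survival(sample, t):
    """Empirical S(t) = P(T >= t) for a fully observed (uncensored) sample."""
    return sum(1 for v in sample if v >= t) / len(sample)


def conditional_survival(sample, t, x):
    """Empirical S(t | x) = P(T >= t | T >= x), computed directly for t >= x."""
    survivors = [v for v in sample if v >= x]
    return sum(1 for v in survivors if v >= t) / len(survivors)


sample = [1, 2, 2, 4, 5, 7, 7, 9]  # hypothetical durations
t, x = 5, 2

direct = conditional_survival(sample, t, x)        # 4 of the 7 survivors at x
ratio = survival(sample, t) / survival(sample, x)  # (4/8) / (7/8)
print(direct, ratio)
```

Both quantities equal 4/7 here, up to floating-point rounding.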

In the case of lifespans, suppose an individual is x years old. We want to define a quantity that tells us whether this individual is highly "at risk" at the moment, or whether the risk of dying in the near future is low. The definition of the hazard rate differs slightly depending on whether T is discrete or continuous.
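For reference, the two cases are usually written as follows. This is the standard textbook form of the hazard rate, expressed with the convention S(t) = P(T ≥ t) used above and assuming S(t) > 0. The symbols h (hazard rate) and f (density of T) do not appear in the text and are introduced here for illustration.

```latex
% Discrete case: probability that the event occurs at t, given survival up to t
h(t) = P(T = t \mid T \ge t) = \frac{P(T = t)}{S(t)}

% Continuous case, with density f: limiting event rate just after t
h(t) = \lim_{\mathrm{d}t \to 0^{+}}
       \frac{P(t \le T < t + \mathrm{d}t \mid T \ge t)}{\mathrm{d}t}
     = \frac{f(t)}{S(t)}
     = -\frac{\mathrm{d}}{\mathrm{d}t} \ln S(t)
```

In the discrete case h(t) is a probability and lies between 0 and 1, whereas in the continuous case it is a rate and can exceed 1.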

Properties

The Kaplan-Meier estimator can be seen as an estimate of the survival function that gives greater weight to long stoppages, since these are more likely to be censored. These weights tend to explode for the largest uncensored observations, which leads to instability and a large approximation error in the tail of the distribution.
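One way to make this weighting concrete is to read the weight of each observation as the size of the jump of the Kaplan-Meier curve at that observation: censored observations get weight 0, and the mass they would have carried is redistributed to the later uncensored ones. The Python sketch below computes these jumps. The function name `kaplan_meier_weights` and the toy sample are hypothetical, and the sketch assumes right-censoring only.

```python
def kaplan_meier_weights(durations, observed):
    """Return [(t, w)] sorted by time, where w is the jump of the
    Kaplan-Meier curve at t (0 for censored observations)."""
    # Sort by time; at tied times, place events before censorings.
    data = sorted(zip(durations, observed), key=lambda p: (p[0], not p[1]))
    n = len(data)
    survival = 1.0
    weights = []
    for i, (t, seen) in enumerate(data):
        at_risk = n - i  # observations not yet removed from the risk set
        w = survival / at_risk if seen else 0.0
        survival -= w
        weights.append((t, w))
    return weights


# Hypothetical toy sample: without censoring, every weight would be 1/6.
durations = [2, 3, 4, 5, 7, 8]
observed = [True, True, False, True, False, True]
for t, w in kaplan_meier_weights(durations, observed):
    print(t, round(w, 4))
```

On this sample the first two uncensored observations keep the weight 1/6, while the uncensored observations at 5 and 8 receive 2/9 and 4/9: the largest uncensored observation alone carries almost half of the total mass, which is the instability in the tail described above.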