Reservoir sampling is a family of randomized algorithms for randomly choosing K samples from a list of S items, where S is either a very large or unknown number.
A very elegant algorithm titled Algorithm R by Alan G. Waterman, is summarized as follows:
Fill up the 'reservoir' with the first k items from S
For every item S[j] where j > K
Choose an integer r between 0 and j
If r < K then replace element r in the reservoir with S[j]
One very interesting
discussion with a friend led me to this problem. In this post we are going to prove that at all times, the probability of each processed item to be in the reservoir is the same for all items. In other words, after each iteration and when the algorithms terminates, all processed items will have the same probability of being included.
To initialize the reservoir, the first K items are included. At this point of time, each item is included by a probability of 1.
Later on, to process item number A, where A > K, we include it by a probability of K / A. (1)
For any of the K items in the reservoir, it will remain in the reservoir after item A is processed only in 2 cases:
1) Item A does not get included. This has a probability of ( A - K ) / A = 1 - K / A
2) Item A gets included but it does not replace the item in question, i.e. it can replace any of the other ( K - 1 ) items in the reservoir. This has a probability of ( K / A ) * [ ( K - 1 ) / K ] = ( K - 1 ) / A
So each of the K items remains in the reservoir by a probability of 1 - K / A + ( K - 1 ) / A = 1 - 1 / A
This is the same as the probability of A not replacing our item. Since the probability of A replacing any one item is 1 / A. The complement is directly found as 1 - 1 / A
Now, let P( A, B ) be the probability of item A still being included in the reservoir after item B is processed, where B >= A. Without loss of generality, we may assume that B is the next item to be processed.
For item A to remain in the reservoir after item B is processed, item A must have been included up to the point where B is to be processed. Equivalently, item A was still included after item (B - 1) was processed. In addition, for A to remain in the reservoir B must not replace it
From the discussion above, it follows that:
P( A, B ) = [ 1 - 1 / B ] * P( A, B - 1 )
Which can be expanded as:
P( A, B ) = [ 1 - 1 / B ] * [ 1 - 1 / ( B - 1 ) ] * ... * [ 1 - 1 / ( A + 2 ) ] * [ 1 - 1 / ( A + 1 ) ] * p( A, A )
But from (1):
P( A, A ) = K / A
As a result:
P( A, B ) = [ 1 - 1 / B ] * [ 1 - 1 / ( B - 1 ) ] * ... * [ 1 - 1 / ( A + 2 ) ] * [ 1 - 1 / ( A + 1 ) ] * K / A
But,
[ 1 - 1 / ( A + 1 ) ] = A / ( A + 1 )
Therefore:
P( A, B ) = [ 1 - 1 / B ] * [ 1 - 1 / ( B - 1 ) ] * ... * [ 1 - 1 / ( A + 2 ) ] * [ A / ( A + 1 ) ] * K / A
P( A, B ) = [ 1 - 1 / B ] * [ 1 - 1 / ( B - 1 ) ] * ... * [ 1 - 1 / ( A + 2 ) ] * K / ( A + 1 )
Again:
P( A, B ) = [ 1 - 1 / B ] * [ 1 - 1 / ( B - 1 ) ] * ... * K / ( A + 2 )
We can see that this recurrence reduces to:
P( A, B ) = K / B
This is the same as the probability of item B being included. In addition, P( A, B ) is not a function of A, it only depends on B, which is the number of elements processed so far.
This means that for the next item to be processed it will have the same probability of being included as all the items that were processed before it. By maintaining this property, when the algorithm terminates, all items will have the same probability of being included in the reservoir, regardless of the size of the list.