
Coupon collector's problem

Naming is hard! A more accurate, albeit unforgivingly long, title would probably be "Expressing the expected number of unfair die rolls until each face is rolled at least a specific number of times as an integral through Poissonization"¹.

Here's the setup:

Not the setup

Actually before the setup let me quickly explain why this problem and approach are special (why you should read this post). This is a scenario where you solve a problem using a technique that normally introduces approximation errors, but doesn't for this specific case.

Errr, tbf this doesn't sound too exciting, but anyways, here's the actual setup:

Setup

We first define our parameters:

- $k \in \mathbb{N}$: the number of faces of our (possibly unfair) die;
- $p_1, \ldots, p_k \in (0, 1)$ with $\sum_{i=1}^{k} p_i = 1$: the probability of rolling each face;
- $c_1, \ldots, c_k \in \mathbb{N}$: the number of times each face must be rolled before we stop.

We also define additional variables:

- $C_n^i$: the number of times face $i$ has been rolled within the first $n$ rolls;
- $N = \inf\left\{ n \in \mathbb{N} : \bigcap_{i=1}^{k} \left\{ C_n^i \geq c_i \right\} \right\}$: our (discrete) stopping time, i.e. the total number of rolls.
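(For a concrete instance: a fair six-sided die where each face must be rolled at least twice corresponds to $k = 6$, $p_i = 1/6$, and $c_i = 2$; this is exactly the example we will compute in the Backstory section below.)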

Here's the conclusion we want to eventually prove:

Conclusion

$$E[N] = \int_0^\infty \left( 1 - \prod_{i=1}^{k} \left[ 1 - e^{-t p_i} \sum_{j=0}^{c_i - 1} \frac{(t p_i)^j}{j!} \right] \right) dt$$

Feel free to attempt to derive this yourself! A hint would be the magic word Poissonization.

Poissonization

At this point, there is sort of an implicit time measure embedded in our counting processes $(C_n^i)$; one might be inclined to model this as rolls coming in every second. However, it would be helpful to not couple the discrete rolls and the continuous time this way: the counting processes work perfectly fine if we anchor them only to the rolls.

Instead, let's specify an explicit time measure for the rolls: let their interarrival times be independent Exponential(1)'s. Suddenly, we have a Poisson process, which is incredibly magical and affords us the following nice properties (sanity-checked by the simulation sketch right after this list):

- By Poisson splitting/thinning, each face's counting process $(C_t^i)$ is itself a Poisson process with rate $p_i$; in particular, $C_t^i \sim \mathrm{Poisson}(t p_i)$ for each fixed $t$.
- Moreover, these $k$ per-face processes are mutually independent, even though the underlying roll outcomes compete with one another.
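Here's a minimal simulation sketch of the splitting property (my addition, not part of the derivation; the unfair 3-faced die below is made up, and numpy is assumed):

import numpy as np

# Simulate Poissonized rolls of a made-up unfair 3-faced die up to time t,
# and check that each face's count looks Poisson(t * p_i): for a Poisson
# distribution, the mean and the variance should both be ~t * p_i.
rng = np.random.default_rng(0)
ps = np.array([0.5, 0.3, 0.2])
t, trials = 10.0, 20_000

counts = np.zeros((trials, len(ps)))
for s in range(trials):
    clock, faces = 0.0, []
    while True:
        clock += rng.exponential(1.0)  # Exponential(1) interarrival time
        if clock > t:
            break
        faces.append(rng.choice(len(ps), p=ps))
    counts[s] = np.bincount(np.array(faces, dtype=int), minlength=len(ps))

print(counts.mean(axis=0))  # ~t * ps = [5.0, 3.0, 2.0]
print(counts.var(axis=0))   # also ~t * ps, consistent with Poisson(t * p_i)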

Given this, let us additionally denote $T = \inf\left\{ t \in \mathbb{R}_{\geq 0} : \bigcap_{i=1}^{k} \left\{ C_t^i \geq c_i \right\} \right\}$, the continuous counterpart to our discrete stopping time $N$ above. We now solve for $E[T]$ by layer cake:

$$E[T] = \int_0^\infty P\{T > t\} \, dt = \int_0^\infty \left( 1 - P\{T \leq t\} \right) dt$$

Note that $\{T \leq t\}$ is precisely the conjunction of the events $\left\{ C_t^i \geq c_i \right\}$, which are independent by the splitting property above. We have

$$E[T] = \int_0^\infty \left( 1 - P\left\{ \bigcap_{i=1}^{k} \left\{ C_t^i \geq c_i \right\} \right\} \right) dt = \int_0^\infty \left( 1 - \prod_{i=1}^{k} P\left\{ C_t^i \geq c_i \right\} \right) dt$$

Uhh, seems like those are just some Poisson tails?

$$E[T] = \int_0^\infty \left( 1 - \prod_{i=1}^{k} \left( 1 - P\left\{ C_t^i < c_i \right\} \right) \right) dt = \int_0^\infty \left( 1 - \prod_{i=1}^{k} \left[ 1 - e^{-t p_i} \sum_{j=0}^{c_i - 1} \frac{(t p_i)^j}{j!} \right] \right) dt$$

At this point, we have successfully expressed the expectation of the continuous, Poissonized stopping time as an integral. For most problems involving Poissonization, such as those investigating tails and limiting behaviors, this is where we stop; we pat ourselves on the back, say something like "the original, fixed number of trials is large enough, such that the Poissonized counterpart provides a sufficiently accurate estimate to the desired expectation", and go on with our lives. This is, in fact, also what ChatGPT did (highlight mine).

De-Poissonization: what ChatGPT missed

It turns out that for this specific problem (and likely other similar ones involving expected number of multinoulli trials), the original and the Poissonized expectations E[N] and E[T] are exactly equal. When I hinted ChatGPT towards this, it essentially threw away our Poissonized results and tried to brute-force/direct-solve the un-Poissonized version.

We are smarter than this (for now, at least, maybe?)! Between the discrete rolls and the continuous time, there is a direction of conditionality that feels easier to explore; namely, $E[N \mid T]$ seems hard to solve, but what about $E[T \mid N]$? Given that we reached our stopping condition in $N$ rolls, we, uhhh, don't really know anything about the interarrival times? ($N$ depends only on which faces come up, not on when they arrive.) The conditional expectation should just be that of a sum of $N$ iid Exponential(1)'s. Thus, we have

$$E[T \mid N] = N$$
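(Spelling the step out: writing $X_1, X_2, \ldots$ for the iid Exponential(1) interarrival times, $N$ is a function of the face sequence alone and hence independent of the $X_j$'s, so

$$E[T \mid N = n] = E\left[ \sum_{j=1}^{n} X_j \,\middle|\, N = n \right] = \sum_{j=1}^{n} E[X_j] = n$$

for every $n$.)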

Slapping an expectation on both sides and applying the tower rule, we have

$$E[N] = E\left[ E[T \mid N] \right] = E[T]$$
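As a quick empirical check of this equality (my addition; the unfair die and per-face quotas below are made up, and numpy is assumed), we can track both stopping times on the same simulated sample paths:

import numpy as np

# Monte Carlo check that E[N] = E[T]: roll a made-up unfair 3-faced die
# until every face i has come up at least cs[i] times, tracking both the
# number of rolls N and the Poissonized elapsed time T.
rng = np.random.default_rng(0)
ps = np.array([0.5, 0.3, 0.2])
cs = np.array([1, 2, 3])
trials = 20_000

total_n, total_t = 0.0, 0.0
for _ in range(trials):
    counts = np.zeros(len(ps), dtype=int)
    n, t = 0, 0.0
    while (counts < cs).any():
        t += rng.exponential(1.0)  # Exponential(1) interarrival time
        counts[rng.choice(len(ps), p=ps)] += 1
        n += 1
    total_n += n
    total_t += t

print(total_n / trials, total_t / trials)  # the two sample means agree (up to noise)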

We get the benefit of Poissonization for free.

$$E[N] = \int_0^\infty \left( 1 - \prod_{i=1}^{k} \left[ 1 - e^{-t p_i} \sum_{j=0}^{c_i - 1} \frac{(t p_i)^j}{j!} \right] \right) dt$$
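And as a final sanity check (again my addition; scipy is assumed to be available), the integral can be evaluated numerically. For the fair-die double collector from the Backstory below ($k = 6$, $p_i = 1/6$, $c_i = 2$), it matches the Markov chain answer:

import math

from scipy.integrate import quad

# Numerically evaluate the integral above for k = 6, p_i = 1/6, c_i = 2.
k = 6
ps = [1 / 6] * k
cs = [2] * k

def survival(t):
    # P{T > t} = 1 - prod_i [1 - e^{-t p_i} * sum_{j < c_i} (t p_i)^j / j!]
    prod = 1.0
    for p, c in zip(ps, cs):
        tail = math.exp(-t * p) * sum((t * p) ** j / math.factorial(j) for j in range(c))
        prod *= 1.0 - tail
    return 1.0 - prod

ev, _ = quad(survival, 0.0, math.inf)
print(ev)  # ~24.1339, matching the Markov chain script in the Backstory below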

Backstory

I have known the classic result for a symmetric single coupon collector's problem since my sophomore year probability class. Essentially, for $k \in \mathbb{N}$ coupons, each equally likely to be drawn, the expected number of draws is, by linearity and summing expectations of Geometrics,

$$\frac{k}{k} + \frac{k}{k - 1} + \cdots + \frac{k}{1} = k \sum_{i=1}^{k} \frac{1}{i}$$
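(For a fair six-sided die, $k = 6$, this gives $6 \left( 1 + \frac{1}{2} + \cdots + \frac{1}{6} \right) = 14.7$ expected rolls.)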

At some point while prepping for quantitative trading recruiting, I came across and bookmarked this old paper on the symmetric double coupon collector's problem; the authors essentially derive the same integral without using Poissonization, and further investigate the asymptotic behavior of the expectation.

Fast forward to summer 2025: a member of a RedNote-based puzzle-solving group chat asked the same symmetric double coupon collector's problem in the context of a fair six-sided die. Markov-chain-pilled and Python-maxxing, I whipped up the following script; since the probabilities are symmetric, it suffices to track the number of faces not yet visited and the number visited exactly once.

import functools

@functools.cache
def calc_ev(num_zeros, num_ones):
    assert num_zeros >= 0 and num_ones >= 0
    assert num_zeros + num_ones <= 6

    num_at_least_twos = 6 - num_zeros - num_ones

    if num_at_least_twos == 6:
        return 0

    # ev(c0, c1) = 1
    #   + ev(c0 - 1, c1 + 1) * c0/6
    #   + ev(c0, c1 - 1) * c1/6
    #   + ev(c0, c1) * c2/6

    ev = 1

    if num_zeros > 0:
        ev += calc_ev(num_zeros - 1, num_ones + 1) * (num_zeros / 6)

    if num_ones > 0:
        ev += calc_ev(num_zeros, num_ones - 1) * (num_ones / 6)

    if num_at_least_twos > 0:
        ev /= 1 - num_at_least_twos / 6

    return ev

print(calc_ev(6, 0)) # 24.13386919753085

Another member then linked this math StackExchange post, in which the word Poissonization is mentioned and a similar integral is given, although without much explanation of how it is derived. I then used this online textbook to learn about Poissonization (and its usual use for approximations), and figured out the final connection between the two expectations on my own.

References

  1. Uhh, so as I was writing this part (literally the second sentence of this post), I thought "hey, I should totally throw this (the long title, after changing 'Expressing' to 'Express') at ChatGPT and see how it does"; it turns out ChatGPT basically one-shot'ed pretty much the entire post (which makes me sad!), except for the realization that the expectations are equal before and after Poissonization (which makes me a bit happier!) (although, to be fair, conditional on you seeing this post, I probably wasn't too too sad to the point where I decided not to post about this at all, idk) (man, I love using parentheses).