
Coupon collector's problem

Naming is hard! A more accurate, albeit unforgivingly long, title would probably be "Expressing the expected number of unfair die rolls until each face is rolled at least a specific number of times as an integral through Poissonization"¹.

Here's the setup:

Not the setup

Actually before the setup let me quickly explain why this problem and approach are special (why you should read this post). This is a scenario where you solve a problem using a technique that normally introduces approximation errors, but doesn't for this specific case.

Errr, tbf this doesn't sound too exciting, but anyways, here's the actual setup:

Setup

We first define our parameters:

- $k \in \mathbb{N}$: the number of faces of our (possibly unfair) die;
- $p_1, \ldots, p_k \in (0, 1)$ with $\sum_{i=1}^{k} p_i = 1$: the probability of rolling each face;
- $c_1, \ldots, c_k \in \mathbb{N}$: the number of times each face must be rolled before we stop.

We also define additional variables:

- $C_n^i$: the number of times face $i$ has been rolled within the first $n$ rolls;
- $N = \inf\left\{ n \in \mathbb{N} : \bigcap_{i=1}^{k} \left\{ C_n^i \geq c_i \right\} \right\}$: our (discrete) stopping time, i.e. the total number of rolls.
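(For a concrete instance: a fair six-sided die where each face must be rolled at least twice corresponds to $k = 6$, $p_i = 1/6$, and $c_i = 2$; this is exactly the example we will compute in the Backstory section below.)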

Here's the conclusion we want to eventually prove:

Conclusion

$$E[N] = \int_0^\infty \left( 1 - \prod_{i=1}^{k} \left[ 1 - e^{-t p_i} \sum_{j=0}^{c_i - 1} \frac{(t p_i)^j}{j!} \right] \right) dt$$

Feel free to attempt to derive this yourself! A hint would be the magic word Poissonization.

Poissonization

At this point, there is sort of an implicit time measure embedded in our counting processes $(C_n^i)$; one might be inclined to model this as rolls coming in every second. However, it would be helpful to not couple the discrete rolls and the continuous time this way: the counting processes work perfectly fine if we anchor them only to the rolls.

Instead, let's specify an explicit time measure for the rolls: let their interarrival times be independent Exponential(1)'s. Suddenly, we have a Poisson process, which is incredibly magical and affords us the following nice properties (sanity-checked by the simulation sketch right after this list):

- By Poisson splitting/thinning, each face's counting process $(C_t^i)$ is itself a Poisson process with rate $p_i$; in particular, $C_t^i \sim \mathrm{Poisson}(t p_i)$ for each fixed $t$.
- Moreover, these $k$ per-face processes are mutually independent, even though the underlying roll outcomes compete with one another.
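Here's a minimal simulation sketch of the splitting property (my addition, not part of the derivation; the unfair 3-faced die below is made up, and numpy is assumed):

import numpy as np

# Simulate Poissonized rolls of a made-up unfair 3-faced die up to time t,
# and check that each face's count looks Poisson(t * p_i): for a Poisson
# distribution, the mean and the variance should both be ~t * p_i.
rng = np.random.default_rng(0)
ps = np.array([0.5, 0.3, 0.2])
t, trials = 10.0, 20_000

counts = np.zeros((trials, len(ps)))
for s in range(trials):
    clock, faces = 0.0, []
    while True:
        clock += rng.exponential(1.0)  # Exponential(1) interarrival time
        if clock > t:
            break
        faces.append(rng.choice(len(ps), p=ps))
    counts[s] = np.bincount(np.array(faces, dtype=int), minlength=len(ps))

print(counts.mean(axis=0))  # ~t * ps = [5.0, 3.0, 2.0]
print(counts.var(axis=0))   # also ~t * ps, consistent with Poisson(t * p_i)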

Given this, let us additionally denote $T = \inf\left\{ t \in \mathbb{R}_{\geq 0} : \bigcap_{i=1}^{k} \left\{ C_t^i \geq c_i \right\} \right\}$, the continuous counterpart to our discrete stopping time $N$ above. We now solve for $E[T]$ by layer cake:

$$E[T] = \int_0^\infty P\{T > t\} \, dt = \int_0^\infty \left( 1 - P\{T \leq t\} \right) dt$$

Note that $\{T \leq t\}$ is precisely the conjunction of the events $\left\{ C_t^i \geq c_i \right\}$, which are independent by the splitting property above. We have

$$E[T] = \int_0^\infty \left( 1 - P\left\{ \bigcap_{i=1}^{k} \left\{ C_t^i \geq c_i \right\} \right\} \right) dt = \int_0^\infty \left( 1 - \prod_{i=1}^{k} P\left\{ C_t^i \geq c_i \right\} \right) dt$$

Uhh, seems like those are just some Poisson tails?

$$E[T] = \int_0^\infty \left( 1 - \prod_{i=1}^{k} \left( 1 - P\left\{ C_t^i < c_i \right\} \right) \right) dt = \int_0^\infty \left( 1 - \prod_{i=1}^{k} \left[ 1 - e^{-t p_i} \sum_{j=0}^{c_i - 1} \frac{(t p_i)^j}{j!} \right] \right) dt$$

At this point, we have successfully expressed the expectation of the continuous, Poissonized stopping time as an integral. For most problems involving Poissonization, such as those investigating tails and limiting behaviors, this is where we stop; we pat ourselves on the back, say something like "the original, fixed number of trials is large enough, such that the Poissonized counterpart provides a sufficiently accurate estimate to the desired expectation", and go on with our lives. This is, in fact, also what ChatGPT did (highlight mine).

De-Poissonization: what ChatGPT missed

It turns out that for this specific problem (and likely other similar ones involving expected number of multinoulli trials), the original and the Poissonized expectations E[N] and E[T] are exactly equal. When I hinted ChatGPT towards this, it essentially threw away our Poissonized results and tried to brute-force/direct-solve the un-Poissonized version.

We are smarter than this (for now, at least, maybe?)! Between the discrete rolls and the continuous time, there is a direction of conditionality that feels easier to explore; namely, $E[N \mid T]$ seems hard to solve, but what about $E[T \mid N]$? Given that we reached our stopping condition in $N$ rolls, we, uhhh, don't really know anything about the interarrival times? ($N$ depends only on which faces come up, not on when they arrive.) The conditional expectation should just be that of a sum of $N$ iid Exponential(1)'s. Thus, we have

$$E[T \mid N] = N$$
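(Spelling the step out: writing $X_1, X_2, \ldots$ for the iid Exponential(1) interarrival times, $N$ is a function of the face sequence alone and hence independent of the $X_j$'s, so

$$E[T \mid N = n] = E\left[ \sum_{j=1}^{n} X_j \,\middle|\, N = n \right] = \sum_{j=1}^{n} E[X_j] = n$$

for every $n$.)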

Slapping an expectation on both sides and applying the tower rule, we have

$$E[N] = E\left[ E[T \mid N] \right] = E[T]$$
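As a quick empirical check of this equality (my addition; the unfair die and per-face quotas below are made up, and numpy is assumed), we can track both stopping times on the same simulated sample paths:

import numpy as np

# Monte Carlo check that E[N] = E[T]: roll a made-up unfair 3-faced die
# until every face i has come up at least cs[i] times, tracking both the
# number of rolls N and the Poissonized elapsed time T.
rng = np.random.default_rng(0)
ps = np.array([0.5, 0.3, 0.2])
cs = np.array([1, 2, 3])
trials = 20_000

total_n, total_t = 0.0, 0.0
for _ in range(trials):
    counts = np.zeros(len(ps), dtype=int)
    n, t = 0, 0.0
    while (counts < cs).any():
        t += rng.exponential(1.0)  # Exponential(1) interarrival time
        counts[rng.choice(len(ps), p=ps)] += 1
        n += 1
    total_n += n
    total_t += t

print(total_n / trials, total_t / trials)  # the two sample means agree (up to noise)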

We get the benefit of Poissonization for free.

$$E[N] = \int_0^\infty \left( 1 - \prod_{i=1}^{k} \left[ 1 - e^{-t p_i} \sum_{j=0}^{c_i - 1} \frac{(t p_i)^j}{j!} \right] \right) dt$$
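And as a final sanity check (again my addition; scipy is assumed to be available), the integral can be evaluated numerically. For the fair-die double collector from the Backstory below ($k = 6$, $p_i = 1/6$, $c_i = 2$), it matches the Markov chain answer:

import math

from scipy.integrate import quad

# Numerically evaluate the integral above for k = 6, p_i = 1/6, c_i = 2.
k = 6
ps = [1 / 6] * k
cs = [2] * k

def survival(t):
    # P{T > t} = 1 - prod_i [1 - e^{-t p_i} * sum_{j < c_i} (t p_i)^j / j!]
    prod = 1.0
    for p, c in zip(ps, cs):
        tail = math.exp(-t * p) * sum((t * p) ** j / math.factorial(j) for j in range(c))
        prod *= 1.0 - tail
    return 1.0 - prod

ev, _ = quad(survival, 0.0, math.inf)
print(ev)  # ~24.1339, matching the Markov chain script in the Backstory below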

Backstory

I have known the classic result for a symmetric single coupon collector's problem since my sophomore year probability class. Essentially, for $k \in \mathbb{N}$ coupons, each equally likely to be drawn, the expected number of draws is, by linearity and summing expectations of Geometrics,

$$\frac{k}{k} + \frac{k}{k - 1} + \cdots + \frac{k}{1} = k \sum_{i=1}^{k} \frac{1}{i}$$
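(For a fair six-sided die, $k = 6$, this gives $6 \left( 1 + \frac{1}{2} + \cdots + \frac{1}{6} \right) = 14.7$ expected rolls.)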

At some point while prepping for quantitative trading recruiting, I came across and bookmarked this old paper on the symmetric double coupon collector's problem; the authors essentially derive the same integral without using Poissonization, and further investigate the asymptotic behavior of the expectation.

Fast forward to summer 2025: a member of a RedNote-based puzzle-solving group chat asked the same symmetric double coupon collector's problem in the context of a fair six-sided die. Markov-chain-pilled and Python-maxxing, I whipped up the following script; since the probabilities are symmetric, it suffices to track the number of faces not yet visited and the number visited exactly once.

import functools

@functools.cache
def calc_ev(num_zeros, num_ones):
    assert num_zeros >= 0 and num_ones >= 0
    assert num_zeros + num_ones <= 6

    num_at_least_twos = 6 - num_zeros - num_ones

    if num_at_least_twos == 6:
        return 0

    # ev(c0, c1) = 1
    #   + ev(c0 - 1, c1 + 1) * c0/6
    #   + ev(c0, c1 - 1) * c1/6
    #   + ev(c0, c1) * c2/6

    ev = 1

    if num_zeros > 0:
        ev += calc_ev(num_zeros - 1, num_ones + 1) * (num_zeros / 6)

    if num_ones > 0:
        ev += calc_ev(num_zeros, num_ones - 1) * (num_ones / 6)

    if num_at_least_twos > 0:
        ev /= 1 - num_at_least_twos / 6

    return ev

print(calc_ev(6, 0)) # 24.13386919753085

Another member then linked this math StackExchange post, in which the word Poissonization is mentioned and a similar integral is given, although without much explanation of how it is derived. I then used this online textbook to learn about Poissonization (and its usual use for approximations), and figured out the final connection between the two expectations on my own.

References

  1. Uhh, so as I was writing this part (literally the second sentence of this post), I thought "hey, I should totally throw this (the long title, after changing 'Expressing' to 'Express') at ChatGPT and see how it does"; it turns out ChatGPT basically one-shot'ed pretty much the entire post (which makes me sad!), except for the realization that the expectations are equal before and after Poissonization (which makes me a bit happier!) (although, to be fair, conditional on you seeing this post, I probably wasn't too too sad to the point where I decided not to post about this at all, idk) (man, I love using parentheses).