Coupon collector's problem
Naming is hard! A more accurate, albeit unforgivingly long, title would probably be *Expressing the expected number of unfair die rolls until each face is rolled at least a specific number of times as an integral through Poissonization*.¹
Here's the setup:
Not the setup
Actually, before the setup, let me quickly explain why this problem and approach are special (i.e., why you should read this post). This is a scenario where you solve a problem using a technique that normally introduces approximation errors, but doesn't for this specific case.
Errr, tbf this doesn't sound too exciting, but anyways, here's the actual setup:
Setup
We first define our parameters.
- Let $n$ denote the number of faces on the unfair die.
- $\forall i \in \{1, \dots, n\}$, let $p_i$ denote the probability that a roll yields face $i$.
- $\forall i \in \{1, \dots, n\}$, let $k_i$ denote the minimum number of times we need to see face $i$ before stopping.
We also define additional variables.
- $\forall i \in \{1, \dots, n\}$, let $C_i$ denote the counting process for face $i$. For example, $C_3(5)$ denotes the number of times we have rolled a three after five rolls.
- Let $T = \min\{m : C_i(m) \ge k_i \ \forall i\}$ denote the stopping time at which point we have rolled each face a sufficient number of times.
Here's the conclusion we want to eventually prove:
Conclusion

$$\mathbb{E}[T] = \int_0^\infty \left( 1 - \prod_{i=1}^{n} \left( 1 - e^{-p_i t} \sum_{j=0}^{k_i - 1} \frac{(p_i t)^j}{j!} \right) \right) dt$$
Feel free to attempt to derive this yourself! A hint would be the magic word Poissonization.
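(Before diving in, here's a quick numerical sanity check of the claimed integral — my own addition, assuming scipy is available — for a fair six-sided die with $k_i = 2$ for every face; the reference value comes from the Markov-chain script in the backstory section below.)

```python
# Numerically evaluate the claimed integral for a fair six-sided die with
# k_i = 2 for every face (the "double" coupon collector from the backstory).
import math
from scipy.integrate import quad

n, p, k = 6, 1 / 6, 2

def survival(t):
    # P[tau > t] = 1 - prod_i P[Poisson(p_i * t) >= k_i]
    tail = 1 - math.exp(-p * t) * sum((p * t) ** j / math.factorial(j) for j in range(k))
    return 1 - tail ** n

ev, _ = quad(survival, 0, math.inf)
print(ev)  # ~24.13386919753085, matching the Markov-chain script below
```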
Poissonization
At this point, there is sort of an implicit time measure embedded in our counting processes $C_i$; one might be inclined to model this as rolls coming in every second. However, it would be helpful to not couple the discrete rolls and the continuous time this way: the counting processes work perfectly fine if we anchor them only to the rolls.
Instead, let's specify an explicit time measure for the rolls: let their interarrival times be independent $\mathrm{Exp}(1)$'s. Suddenly, we have a Poisson process, which is incredibly magical and affords us the following nice properties:
- $\forall t \ge 0$, let $N(t)$ denote the number of die rolls occurring at or before time $t$.
  - We have $N(t) \sim \mathrm{Poisson}(t)$.
- $\forall i \in \{1, \dots, n\}$ and $\forall t \ge 0$, let $N_i(t)$ denote the number of die rolls yielding face $i$ at or before time $t$.
  - We have $N_i(t) \sim \mathrm{Poisson}(p_i t)$.
- Somewhat counterintuitively, for a given $t$, the $N_i(t)$ are independent across different $i$. See Theorem 12.14, Poisson splitting here.
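(To make the splitting property less mysterious, here's a minimal simulation sketch — my own, with numpy and an arbitrary unfair three-sided die as assumptions — showing that the per-face counts behave like independent Poissons:)

```python
# Simulation sketch of Poisson splitting: draw N(t) ~ Poisson(t) rolls of a
# biased die, and check that the per-face counts look like independent
# Poisson(p_i * t)'s. The die and the numpy usage are my own choices.
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])  # an arbitrary unfair three-sided die
t, trials = 10.0, 100_000

counts = np.zeros((trials, len(p)), dtype=int)
for trial in range(trials):
    num_rolls = rng.poisson(t)  # N(t) ~ Poisson(t)
    faces = rng.choice(len(p), size=num_rolls, p=p)
    counts[trial] = np.bincount(faces, minlength=len(p))

print(counts.mean(axis=0))  # ~ p * t = [5.0, 3.0, 2.0]
print(counts.var(axis=0))   # also ~ p * t; a Poisson's variance equals its mean
print(np.corrcoef(counts[:, 0], counts[:, 1])[0, 1])  # ~0, consistent with independence
```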
Given this, let us additionally denote $\tau = \inf\{t \ge 0 : N_i(t) \ge k_i \ \forall i\}$, the continuous counterpart to our discrete stopping time $T$ above. We now solve $\mathbb{E}[\tau]$ by layer cake:

$$\mathbb{E}[\tau] = \int_0^\infty \Pr[\tau > t] \, dt$$
Note that $\tau$ is a stopping time marking the conjunction of some (independent) statements $\{N_i(t) \ge k_i\}$. We have

$$\Pr[\tau > t] = 1 - \Pr[\tau \le t] = 1 - \prod_{i=1}^{n} \Pr[N_i(t) \ge k_i]$$
Uhh, seems like those are just some Poisson tails? Substituting $\Pr[N_i(t) \ge k_i] = 1 - e^{-p_i t} \sum_{j=0}^{k_i - 1} \frac{(p_i t)^j}{j!}$, we get

$$\mathbb{E}[\tau] = \int_0^\infty \left( 1 - \prod_{i=1}^{n} \left( 1 - e^{-p_i t} \sum_{j=0}^{k_i - 1} \frac{(p_i t)^j}{j!} \right) \right) dt$$
At this point, we have successfully expressed the expectation of the continuous, Poissonized stopping time as an integral. For most problems involving Poissonization, such as those investigating tails and limiting behaviors, this is where we stop; we pat ourselves on the back, say something like "the original, fixed number of trials is large enough, such that the Poissonized counterpart provides a sufficiently accurate estimate to the desired expectation", and go on with our lives. This is, in fact, also what ChatGPT did (highlight mine).
- This is a generalization of the coupon collector problem with unequal probabilities and required multiplicities.
- The integral converges because the probability approaches 1 as $t \to \infty$.
- This Poissonized expression is often used in analysis and can sometimes be inverted to get bounds or approximations for the original (non-Poisson) problem.
De-Poissonization: what ChatGPT missed
It turns out that for this specific problem (and likely other similar ones involving expected number of multinoulli trials), the original and the Poissonized expectations $\mathbb{E}[T]$ and $\mathbb{E}[\tau]$ are exactly equal. When I hinted ChatGPT towards this, it essentially threw away our Poissonized results and tried to brute-force/direct-solve the un-Poissonized version.
We are smarter than this (for now, at least, maybe?)! Between the discrete rolls and the continuous time, there is a direction of conditionality that feels easier to explore; namely, $\mathbb{E}[T \mid \tau]$ seems hard to solve, but what about $\mathbb{E}[\tau \mid T]$? Given that we reached our stopping condition in $T$ rolls, we, uhhh, don't really know anything about the interarrival times? (Which faces came up is independent of when the rolls arrived.) The conditional expectation should just be that of $T$ iid Exponentials. Thus, we have

$$\mathbb{E}[\tau \mid T] = T \cdot \mathbb{E}[\mathrm{Exp}(1)] = T$$
Slapping an expectation on both sides and applying the tower rule, we have

$$\mathbb{E}[\tau] = \mathbb{E}[\mathbb{E}[\tau \mid T]] = \mathbb{E}[T]$$
We get the benefit of Poissonization for free.
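(If you want to see this equality with your own eyes, here's a quick Monte Carlo sketch — my own, numpy assumed — estimating both expectations for the fair-die, $k_i = 2$ case:)

```python
# Monte Carlo sketch: estimate E[T] (number of rolls) and E[tau] (Poissonized
# time) for a fair six-sided die with k_i = 2 for every face.
import numpy as np

rng = np.random.default_rng(0)
n, k, trials = 6, 2, 20_000

t_samples, tau_samples = [], []
for _ in range(trials):
    counts = np.zeros(n, dtype=int)
    rolls, elapsed = 0, 0.0
    while counts.min() < k:
        counts[rng.integers(n)] += 1     # one discrete roll
        rolls += 1
        elapsed += rng.exponential(1.0)  # Exp(1) interarrival time
    t_samples.append(rolls)
    tau_samples.append(elapsed)

print(np.mean(t_samples), np.mean(tau_samples))  # both ~24.1
```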
Backstory
I have known the classic result for a symmetric single coupon collector's problem since my sophomore year probability class. Essentially, for $n$ coupons each equally likely to be drawn, the expected number of draws is, by linearity and summing expectations of Geometrics,

$$\mathbb{E}[T] = \sum_{i=1}^{n} \frac{n}{i} = n H_n.$$
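(For a concrete number — my own snippet — the fair six-sided, $k_i = 1$ case comes out to:)

```python
# Classic single-collector expectation n * H_n for a fair six-sided die.
n = 6
print(n * sum(1 / i for i in range(1, n + 1)))  # 14.7
```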
Somewhere along the way, while prepping for quantitative trading recruiting, I came across and bookmarked this old paper on the symmetric double coupon collector's problem; the authors essentially derive the same integral without using Poissonization, and further investigate the asymptotic behavior of the expectation.
Fast forward to summer 2025, when a member of a RedNote-based puzzle-solving group chat asked the same symmetric double coupon collector's problem in the context of a fair six-sided die. Markov-chain-pilled and Python-maxxing, I whipped up the following script; since the probabilities are symmetric, it suffices to keep track of the number of faces not yet visited and the number visited only once.
```python
import functools

@functools.cache
def calc_ev(num_zeros, num_ones):
    assert num_zeros >= 0 and num_ones >= 0
    assert num_zeros + num_ones <= 6
    num_at_least_twos = 6 - num_zeros - num_ones
    if num_at_least_twos == 6:
        return 0
    # ev(c0, c1) = 1
    #            + ev(c0 - 1, c1 + 1) * c0/6
    #            + ev(c0, c1 - 1) * c1/6
    #            + ev(c0, c1) * c2/6
    ev = 1
    if num_zeros > 0:
        ev += calc_ev(num_zeros - 1, num_ones + 1) * (num_zeros / 6)
    if num_ones > 0:
        ev += calc_ev(num_zeros, num_ones - 1) * (num_ones / 6)
    if num_at_least_twos > 0:
        ev /= 1 - num_at_least_twos / 6
    return ev

print(calc_ev(6, 0))  # 24.13386919753085
```
Another member then linked this math StackExchange post, in which the word Poissonization is mentioned and a similar integral is given, although without much explanation of how it is derived. I then used this online textbook to learn about Poissonization (and its usual usage for approximations), and figured out the final connection between the two expectations on my own.
References
- textbooks
- papers
- Wikipedia
- Mathematics StackExchange posts
- blog posts
Uhh, so as I was writing this part (literally the second sentence of this post), I thought "hey, I should totally throw this (the long title, after changing 'Expressing' to 'Express') at ChatGPT and see how it does"; it turns out ChatGPT basically one-shotted pretty much the entire post (which makes me sad!), except for the realization that the expectations are equal before and after Poissonization (which makes me a bit happier!) (although, to be fair, conditional on you seeing this post, I probably wasn't too too sad to the point where I decided not to post about this at all, idk) (man, I love using parentheses).↩