This last Sunday (9/21/2025), many of you experienced slow pages, spinning loaders, or brief timeouts. The root cause was our database connections getting “jammed” during peak traffic. We continue to tune our systems and optimize the website so more work is done in memory and on our servers’ CPUs rather than repeatedly hitting the database.
For historical perspective, we faced the same challenge in 2019, when a similar jump in users took us through week 3 to tighten down the database. We also have many new users from OfficeFootballPool (purchased by Splash), and I know they went through similar pain during either the 2019 or 2020 season, when their site was down for an entire afternoon.
Think of our database like the stadium gates on game day. When too many people try to go through the same gate at once, the line backs up. During picks and results updates, many parts of the site were asking the database for the same information at the same time. In database terms this is called a “thundering herd”: hundreds or thousands of players all hitting the same page (like the pick sheet or standings) at once.
Before week 1, certain pages were making unnecessary queries (like TeamStats on picks pages or the Manager Player Accounts page). Those could jam the system badly enough that we had to restart services. We fixed those, but the week 1 Sunday slowdown revealed a few more hidden “hot spots.” After patching those, week 2 went more smoothly, with only short jams during predictable rushes (just before 11:00am MST and again during early afternoon standings checks).
Week 3 saw an even bigger wave—lots of players submitting picks at the last minute. That exposed another issue: our NFL schedule data wasn’t being cached correctly. Instead of serving the schedule once from memory, the system was asking the database for the full schedule for every single player request. That turned into another thundering herd.
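For the technically curious, the fix looks roughly like the sketch below. This is only a minimal Python illustration with assumed names (load_schedule_from_db, a five-minute refresh window), not our actual code: the schedule is loaded from the database once, kept in server memory, and reused for every player request until it goes stale.

```python
import threading
import time

# Illustrative sketch only: load_schedule_from_db and the 5-minute TTL are
# assumptions for this example, not our production code.

_schedule_cache = None          # the shared, in-memory copy of the schedule
_schedule_loaded_at = 0.0       # when it was last refreshed
_cache_lock = threading.Lock()  # so only one request refreshes at a time
CACHE_TTL_SECONDS = 300         # refresh at most every 5 minutes

def load_schedule_from_db():
    """Placeholder for the expensive full-schedule database query."""
    return ["... schedule rows from the database ..."]

def get_nfl_schedule():
    """Serve the schedule from memory; hit the database only when stale."""
    global _schedule_cache, _schedule_loaded_at
    now = time.time()
    if _schedule_cache is not None and now - _schedule_loaded_at < CACHE_TTL_SECONDS:
        return _schedule_cache  # fast path: no database work at all
    with _cache_lock:
        # Re-check inside the lock so concurrent requests don't all reload.
        if _schedule_cache is None or now - _schedule_loaded_at >= CACHE_TTL_SECONDS:
            _schedule_cache = load_schedule_from_db()
            _schedule_loaded_at = time.time()
    return _schedule_cache
```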
The first graph below shows the spikes when a thundering herd is happening (the small spikes are code uploads and can be ignored). Note that CPU utilization is only roughly 40% during the busy Sundays, so we have more than enough headroom. The problem is that once the herd phenomenon begins, visitors can only be served the pages and data already available in server memory, and much of the server's capacity is taken up by connection retry logic. When the thundering herd isn't happening (players are not all doing the same thing at once), server usage remains elevated (~40%) but there are no outages or slowdowns.
Fixing the NFL schedule caching bug is a big "win": it moves work off database connections and onto the server CPU. Here is a summary of the fixes now in place:
- NFL schedule caching bug – The schedule is now properly cached in memory, so all players share the same copy instead of hammering the database. This is especially important for the hot spot just before 11:00am, when people are scrambling to make picks.
- Pick Confirmation Email was set to Immediate – This causes an extra database write on every pick. We are moving it back to Medium priority, which means the emails will be sent via a background task (sketched below, after this list). This was set to Immediate AFTER week 2 (which had no outages), so it may have been a larger contributor to the problem than the NFL schedule caching issue above.
- Indexes added – We optimized more queries with indexes (like putting tabs in a playbook) so the database finds what it needs much faster (see the index example below).
- Serialized expensive operations – During peak moments, we now prevent multiple requests from all trying to do the same heavy work at once (see the sketch below).
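Here is a rough sketch of the pick-confirmation email change. It is only an illustration under assumed names (save_pick, send_confirmation_email, a simple in-process queue), not our actual email system: the web request just queues the email, and a background worker sends it outside the busy request path.

```python
import queue
import threading

# Illustrative sketch only: save_pick, send_confirmation_email, and the
# in-process queue are stand-ins, not our production email pipeline.

email_queue = queue.Queue()

def send_confirmation_email(job):
    """Placeholder for the actual email send (and its database write)."""
    ...

def email_worker():
    """Background task: drain the queue outside the web request path."""
    while True:
        job = email_queue.get()
        send_confirmation_email(job)
        email_queue.task_done()

threading.Thread(target=email_worker, daemon=True).start()

def save_pick(player_id, pick):
    # ... write the pick itself to the database ...
    # Instead of sending the confirmation immediately (an extra database
    # write during the busiest moment), hand it to the background worker.
    email_queue.put({"player_id": player_id, "pick": pick})
```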
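The index example below shows the general idea. It uses SQLite and made-up table and column names (picks, pool_id, player_id) purely for illustration; our real schema and database are different.

```python
import sqlite3

# Made-up schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE picks (pool_id INTEGER, player_id INTEGER, team TEXT)")

# Without an index, finding one player's picks means scanning the whole table.
# With a composite index, the database jumps straight to the matching rows,
# like flipping to a tab in a playbook.
conn.execute("CREATE INDEX idx_picks_pool_player ON picks (pool_id, player_id)")

# The query planner confirms it will use the index instead of a full scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT team FROM picks WHERE pool_id = ? AND player_id = ?",
    (1, 42),
).fetchall()
print(plan)  # the plan mentions idx_picks_pool_player
```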
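Finally, here is a sketch of the "serialize the heavy work" change, sometimes called single-flight. Again, this is a simplified illustration with an assumed function name (compute_standings): the first request does the expensive work, and everyone else briefly waits and reuses the result instead of piling onto the database.

```python
import threading

# Illustrative sketch only: compute_standings stands in for any expensive
# operation that hundreds of requests might trigger at the same moment.

_standings_lock = threading.Lock()
_standings_result = None  # in practice this would also expire periodically

def compute_standings():
    """Placeholder for the expensive standings query/calculation."""
    return ["... standings rows ..."]

def get_standings():
    global _standings_result
    if _standings_result is not None:
        return _standings_result  # already computed: serve from memory
    with _standings_lock:
        # Only one request at a time gets past this point; everyone else
        # waits briefly, then reuses the result instead of repeating the
        # heavy work and piling onto the database.
        if _standings_result is None:
            _standings_result = compute_standings()
    return _standings_result
```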
Can We Do More?
The latest fix is significant. Could it be hiding the next weak query pattern that leads to another thundering herd? The odds are lower now, but we can't guarantee we have snuffed them all out. Couldn't we just do heavy load testing? That is not an easy task for our small team of two developers, and truth be told, we had so many new users from RYP and OFP that we decided to work on the new features they wanted (such as the enhanced Roster page) and forgo load testing. There was also no guarantee that load testing would have been set up well enough to expose all the problems we've encountered.
Also, this isn't a problem you can simply throw money at. Instead, you need sufficient headroom and careful tuning of how the database is used. And unlike many busy web applications, sports pools face a unique challenge: TWO sharp bursts of traffic to the same hot spots, one at a critical juncture (the time leading up to the start of the first game at 11:00am), the other when players want to check standings after the early games end (2:15 to 2:45pm). We size our servers specifically to handle these spikes, as well as the heavier Sunday traffic overall. In fact, as the graph below shows, we're currently running with extra headroom by design. What might feel like overkill most of the week is exactly what should keep the site stable during those peak bursts, especially as we continue to find more database connection efficiencies.
CPU Utilization - Week 3 (Monday through Sunday)
Thank You
We appreciate everyone’s patience while we’ve tuned things up.