Most of Your Players Have Already Muted You
There's a number that should probably be painted on the wall of every game audio studio in the world. Fifty-two to sixty percent.
That's how many mobile game sessions are played with the sound completely off. More than half. The majority of your players will never hear a single note you wrote.
We spent a long time researching how audio actually works in mobile games, not the theoretical version you read about in game design textbooks, but the version that ships in the top-grossing games on the App Store. What we found changed how we thought about almost everything.
I want to walk through what we learned, because most of it surprised us, and some of it is the kind of thing that only becomes obvious after you've already made the wrong decision.
The muting thing first, because it's important.
Seventy percent of the people who mute do it because they're in a social situation. On a bus. In a waiting room. At their desk pretending to work. Fifteen percent have their own music playing. Ten percent are annoyed by the game's audio. The rest is miscellaneous.
The instinct when you hear this is: "Well, then why bother with audio at all?"
That's the wrong takeaway. The right takeaway is that the first ten to thirty seconds of your audio are the only seconds that matter for most people. That's the window before they decide to mute. In those seconds, your game either sounds like it was made by professionals or it doesn't. The decision sticks. Players who muted on day one almost never unmute on day thirty.
So the question isn't "how much music should we make?" It's "how good do those first thirty seconds need to be?" And the answer is: disproportionately good. Spend more time on the first thing players hear than on anything else in your audio budget.
Now, about that audio budget. Here's where most indie developers get the physics wrong.
The common fear is that audio eats RAM. It barely does. Background music and ambient tracks are typically streamed from disk through small buffers of 128 to 256 kilobytes, so the music is never "in memory" in any meaningful sense. It trickles through a buffer the way water trickles through a straw. Only sound effects and UI clicks get fully loaded into RAM, and those are tiny: five to fifty kilobytes each.
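To make the straw metaphor concrete, here's the back-of-the-envelope math (the buffer size and bitrate are illustrative figures, not constants from any particular engine):

```python
# How many seconds of compressed audio fit in a streaming buffer?
def buffer_seconds(buffer_bytes: int, bitrate_bps: int) -> float:
    """Seconds of audio a buffer holds at a given compressed bitrate."""
    return buffer_bytes * 8 / bitrate_bps

# A 256 KB buffer at 128 kbps holds about 16 seconds of music, so even
# an hour-long soundtrack never occupies more RAM than this at once.
print(round(buffer_seconds(256 * 1024, 128_000), 1))  # -> 16.4
```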
The constraint that actually matters is your app's download size. iOS has a 200 megabyte threshold for cellular downloads. Go above it and you lose fifteen to thirty percent of impulse downloads. People on the train see your game, tap "Get," see the Wi-Fi warning, and move on with their lives. Google Play has a similar wall at 150 megabytes for the base module.
What this means in practice: for a casual game, your total compressed audio budget is five to thirty megabytes. For a mid-core game with multiple regions and combat music, you get thirty to a hundred. That sounds like a lot until you start counting tracks.
A single minute of background music at reasonable quality (OGG Vorbis, 128 kilobits per second, stereo) is about one megabyte. A thirty-second loop is half that. Sound effects are almost free — a button click is five kilobytes, a coin collect is maybe fifteen. But a full mid-core RPG needs a menu theme, four to six exploration tracks, two or three combat tracks, a boss theme, stingers for victory and defeat, a bunch of UI sounds, a set of combat sound effects, ambient beds for each region, and maybe some character barks. Add it all up and you're at twenty-five megabytes before you've even thought about voice acting.
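The counting is worth doing explicitly. Here's a sketch of that tally for a hypothetical mid-core RPG; the track list and durations are invented for illustration, but the per-minute figure follows from the bitrate:

```python
# Rough audio-budget tally for a hypothetical mid-core RPG.
# OGG Vorbis at 128 kbps stereo is about 0.96 MB per minute of music.
MB_PER_MINUTE = 128_000 / 8 / 1_000_000 * 60  # bits -> bytes -> MB -> per minute

music_minutes = {
    "menu theme": 1.0,
    "exploration (5 tracks)": 5 * 2.0,
    "combat (3 tracks)": 3 * 1.5,
    "boss theme": 2.0,
    "ambient beds (4 regions)": 4 * 1.5,
}
music_mb = sum(music_minutes.values()) * MB_PER_MINUTE

# SFX, UI clicks, stingers: small individually, but they add up.
sfx_mb = 200 * 0.015  # roughly 200 assets at ~15 KB each

total_mb = music_mb + sfx_mb
print(round(total_mb, 1))  # -> 25.6
```

Twenty-five megabytes, and that's before a single line of voice acting.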
The trick that saves you thirty to forty percent on all of this: make everything mono except your background music and stingers. Mobile phone speakers have almost no stereo separation. Nobody can tell. Spatial audio gets applied at runtime anyway. Mono files are half the size. It's free savings.
Here's where the format wars come in, and I'll keep this brief because format wars are tedious.
There are really only two formats that matter for mobile games: OGG Vorbis and WAV. Everything else is a compromise.
OGG wins for anything that loops. The reason is technical but important: when you compress audio into MP3, the encoder adds about 1,152 samples of padding at the start and end of the file. That's roughly 26 milliseconds of silence. Most game audio engines ignore this padding completely. So when your carefully crafted 30-second loop wraps around, there's a tiny pop. A hiccup. A seam. It's subtle, but once you hear it, you can't unhear it, and it makes your game feel cheap.
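The 26-millisecond figure falls straight out of the sample math:

```python
# MP3 encoder padding: roughly 1152 samples of silence baked into the file.
PADDING_SAMPLES = 1152
SAMPLE_RATE = 44_100  # CD-quality sample rate, in Hz

# Duration of the silent gap the loop seam inherits, in milliseconds.
gap_ms = PADDING_SAMPLES / SAMPLE_RATE * 1000
print(round(gap_ms, 1))  # -> 26.1
```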
OGG Vorbis has zero encoder padding. It stores something called granule position natively, which means loop points are sample-accurate without any workarounds. AAC technically can loop cleanly, but it requires a special metadata field called iTunSMPB, and support for it varies wildly across game engines and platforms. MP3 should never be used for anything that loops. I'll say it again: never use MP3 for looping audio.
WAV wins for anything that needs to play instantly — button taps, coin collects, impact sounds. It's uncompressed, so there's zero decode time. On iOS, you can get trigger-to-sound in under ten milliseconds with a preloaded WAV file.
Speaking of latency: this is the hidden gap between platforms that nobody warns you about. iOS delivers audio in five to fifteen milliseconds. An Android flagship with the Oboe library gets twenty to forty. A budget Android phone from 2022? Fifty to two hundred milliseconds. That's a fifth of a second between tapping a button and hearing the click. It feels broken. The only real mitigation is to pre-decode your sound effects into memory buffers and output at 48 kHz on Android to avoid a resampling penalty. If you're making a rhythm game, you'll also need a visual offset calibration screen, because no amount of software engineering can fix two hundred milliseconds of hardware latency.
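Here's a minimal sketch of how that calibration screen can work; all names and numbers are hypothetical, not from any shipping game. The player taps along with a metronome, the median tap-to-beat delay becomes the offset, and hit judgment subtracts it:

```python
# Toy rhythm-game latency calibration (all names and values hypothetical).
from statistics import median

def measure_offset_ms(tap_times, beat_times):
    """Median tap-to-beat delay, in ms, over a calibration round."""
    return median(t - b for t, b in zip(tap_times, beat_times))

def judge_hit(tap_ms, target_ms, offset_ms, window_ms=80):
    """A tap counts if, after removing device latency, it lands in the window."""
    return abs((tap_ms - offset_ms) - target_ms) <= window_ms

# A budget phone adds ~150 ms to everything; calibration cancels it out.
offset = measure_offset_ms([1150, 2152, 3148], [1000, 2000, 3000])
print(judge_hit(tap_ms=4150, target_ms=4000, offset_ms=offset))  # -> True
```

Using the median rather than the mean makes the measurement robust against one or two badly mistimed calibration taps.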
Let me talk about repetition, because this is the part where most game developers silently ruin their own work.
There's a principle I've started calling the 3x rule. If a player hears the same track more than three times in a single session, they start getting tired of it. Not consciously — they don't think "this music is repetitive." They just feel slightly more inclined to quit. After twenty to thirty minutes of the same loop, session abandonment rates climb measurably.
This makes intuitive sense once you think about it. A thirty-second loop plays twice per minute. In a ten-minute casual session, that's twenty repetitions. Even a very good piece of music becomes wallpaper after the fifth play and irritating after the fifteenth.
The minimum viable music library depends on how long your players play:
For a hyper-casual game where sessions last under two minutes, one thirty-second background track and a couple of stingers is fine. That's about 570 kilobytes total. For a casual game with five-minute sessions, you need two or three tracks that rotate, plus combo stingers and a small ambient bed — around four megabytes. For a mid-core game where people play for fifteen minutes or more, you need a proper state-based music system with different tracks for different game states, and you're looking at eight to twelve unique tracks plus a full stinger set.
The smart way to handle this is with loops that don't sound like loops, and there's a beautiful technique I want to describe.
Take three ambient loops. Make them different prime-number durations: seventeen seconds, twenty-three seconds, and thirty-one seconds. Layer them on top of each other. Because the durations share no common factors, the combination won't exactly repeat until their least common multiple: 17 × 23 × 31 = 12,121 seconds. That's more than three hours.

More than three hours of non-repeating ambient texture from three short loops.
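You can check the repeat period directly:

```python
from math import lcm

# Three loops with prime (hence pairwise-coprime) durations, in seconds.
durations = [17, 23, 31]
period = lcm(*durations)

print(period)                   # -> 12121 seconds
print(round(period / 3600, 2))  # -> 3.37 hours before the layered texture repeats
```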
This is essentially how Alto's Odyssey builds its soundscape, as we'll see in a moment: short phrases in a shared key and tempo, randomly combined like musical Lego, so the player hears an ever-shifting texture from a tiny audio footprint.
I want to mention one more thing about how the best mobile games handle audio, because it illustrates how far this can go.
Genshin Impact has one of the most sophisticated adaptive music systems ever put on a phone. Every region in the game has exploration music and combat music, each built from separate stems that share a key and tempo. When you're wandering a forest, you hear gentle strings and a solo flute. The moment an enemy spots you, the game plays a one-to-two-second orchestral riser stinger — a dramatic swell that functions like a musical exclamation point — and then crossfades into full combat music, beat-synced so the transition lands on a downbeat. When the fight ends, the process reverses. The combat drums fade, the strings soften, and you're back in the forest.
All of this happens through the Wwise middleware, which is a complex piece of software. But the underlying concept is simple. Two patterns cover almost everything: vertical layering (fading stems in and out on top of a base layer) and horizontal re-sequencing (switching between interchangeable segments that all share a key and tempo). You don't need Wwise. You need stems that were generated with the same musical parameters, and a state machine that knows when to crossfade.
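A toy version of that state machine, using the vertical-layering pattern: per-stem target volumes per game state, approached a step at a time so transitions crossfade instead of snapping. Stem names, mixes, and the fade rate are all illustrative; real engine glue (synced looping playback, beat alignment) is omitted:

```python
# Toy adaptive-music state machine: vertical layering via per-stem volumes.
# Assumes all stems share a key and tempo and loop in sync.

STEM_MIX = {  # target volume per stem, per game state
    "explore": {"strings": 1.0, "flute": 1.0, "drums": 0.0, "brass": 0.0},
    "combat":  {"strings": 0.6, "flute": 0.0, "drums": 1.0, "brass": 1.0},
}

def crossfade_step(current, state, rate=0.1):
    """Move each stem's volume one step toward the target mix for `state`."""
    target = STEM_MIX[state]
    return {
        stem: vol + max(-rate, min(rate, target[stem] - vol))
        for stem, vol in current.items()
    }

mix = {"strings": 1.0, "flute": 1.0, "drums": 0.0, "brass": 0.0}
for _ in range(10):  # combat starts; fade over ten update ticks
    mix = crossfade_step(mix, "combat")
print({k: round(v, 1) for k, v in mix.items()})
```

Calling `crossfade_step` once per audio-engine tick gives a smooth fade; beat-syncing it (only switching state on a downbeat) is what makes transitions like Genshin's land musically.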
Alto's Odyssey does something quite elegant. All the musical phrases are composed in the same key and tempo, and the game randomly selects from a pool. Any combination sounds musical. The phrases layer. They drift. Three-to-five-second crossfades prevent any jarring moments. Environmental sounds — wind, sand — mix dynamically based on the player's speed. The total audio footprint is small, but the sonic experience is enormous.
Monument Valley takes the opposite approach: no traditional looping music at all. Instead, sounds are triggered by the player's interaction with geometry — rotating platforms, moving blocks, sliding paths. The "music" emerges from gameplay. The loudness target is extraordinarily low, around -20 LUFS. The game is designed to be played almost silently, like a music box you hold in your hands.
These games are proof that audio design isn't about how much music you have. It's about how intelligently you use what you have.
One last thing. There's a detail about volume sliders that took me an embarrassingly long time to understand, and I think it encapsulates a lot of what makes game audio hard.
Human hearing is logarithmic. A sound you perceive as twice as loud doesn't have twice the amplitude; it has roughly ten times the power, which is about three times the amplitude. This means a linear volume slider (the default, the obvious implementation, the one every tutorial uses) feels completely wrong. Moving the slider from zero to ten percent is barely audible. Moving it from ninety to a hundred percent is a dramatic jump. The perceptual midpoint of the slider is nowhere near the physical midpoint.
The fix is a logarithmic curve. Map the slider's zero-to-one range onto a -60 to 0 decibel range using a power function. It's four lines of code and it makes your volume controls feel like every professional app the player has ever used.
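Here's one common way to write those few lines; the -60 dB floor is a convention, not a standard, and the function name is mine:

```python
def slider_to_gain(x: float, floor_db: float = -60.0) -> float:
    """Map a 0..1 slider position to a linear gain with a perceptual curve."""
    if x <= 0.0:
        return 0.0                # true silence at the bottom of the slider
    db = floor_db * (1.0 - x)     # x = 0 -> -60 dB, x = 1 -> 0 dB
    return 10.0 ** (db / 20.0)    # decibels to linear amplitude

print(round(slider_to_gain(1.0), 3))  # -> 1.0 (full volume)
print(round(slider_to_gain(0.5), 3))  # -> 0.032 (-30 dB at the slider's midpoint)
```

The half-way slider position now sits at -30 dB of gain, which lands much closer to "half as loud" perceptually than the 0.5 gain a linear slider would give.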
But almost nobody does it. Because it's not the kind of thing you notice when it's right. You only notice when it's wrong, and by then you've attributed the "cheapness" to something else entirely.
That's game audio in a nutshell, really. When it's done well, nobody thinks about it. They think the game is fun. They think the game is polished. They think the game feels good. The music and the sound effects and the stingers and the volume curves and the crossfades and the ducking and the frequency cleanup — all of it invisible, all of it load-bearing.
More than half your players will mute you. The ones who don't will never consciously appreciate what you've done. And the work is worth doing anyway, because the alternative — silence, or bad audio, or a loop that pops every thirty seconds — is the kind of thing that makes people close your app and never come back without ever quite knowing why.