How to Time Lyrics to Music (Manual vs Auto Sync)
A lyric video where the text is even slightly off from the audio feels broken. Words appearing a half-second early or late create an uncanny valley effect -- close enough to be noticeable, far enough off to be distracting.
Tight timing is what separates a lyric video that people watch on mute, unmute, and immediately save from one they scroll past. The text needs to hit when the vocal hits. Not before. Not after. Right on it.
There are two ways to get there: manual timing and automatic sync. Both work. The right choice depends on your situation.
Automatic Sync (AI Transcription)
Epitrite's AI transcription doesn't just convert audio to text -- it timestamps every line and every word. Upload your track, click Transcribe, and you get lyrics that are already synced to the audio.
How It Works
The AI listens to your audio and identifies:
- When each line starts and ends -- The broad strokes of timing
- When each word is spoken -- Word-level precision for animations like stomp words and scatter words
- Section boundaries -- Where verses, choruses, and bridges begin
The result is a fully timed lyrics file that drops into the editor ready to preview.
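To make "fully timed" concrete, here is a minimal sketch of what line-level, word-level, and section data could look like together. The class and field names (TimedWord, TimedLine, section) are assumptions for illustration, not Epitrite's actual file format.

```python
from dataclasses import dataclass


@dataclass
class TimedWord:
    text: str
    start: float  # seconds from the start of the track
    end: float


@dataclass
class TimedLine:
    text: str
    start: float           # when the line's vocal begins
    end: float             # when the line's vocal ends
    section: str           # hypothetical label, e.g. "verse" or "chorus"
    words: list[TimedWord]


def words_inside_line(line: TimedLine) -> bool:
    """Check that every word's timing falls within its line's timing."""
    return all(line.start <= w.start and w.end <= line.end for w in line.words)


chorus_line = TimedLine(
    text="Hold on tight",
    start=12.40,
    end=14.20,
    section="chorus",
    words=[
        TimedWord("Hold", 12.40, 12.70),
        TimedWord("on", 12.70, 13.05),
        TimedWord("tight", 13.10, 14.20),
    ],
)

print(chorus_line.section, chorus_line.start, chorus_line.end, chorus_line.text)
print("words fit inside line:", words_inside_line(chorus_line))
```

Keeping word timings nested inside their line means one record can drive either a line-level or a word-level animation.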
Accuracy Expectations
| Scenario | Line Timing Accuracy | Word Timing Accuracy |
|----------|---------------------|---------------------|
| Clear vocals, sparse mix | 95%+ | 90%+ |
| Standard pop/R&B production | 90-95% | 85-90% |
| Dense mix, effects on vocals | 80-90% | 75-85% |
| Fast rap, overlapping vocals | 75-85% | 70-80% |
| Heavy distortion, screaming | 60-75% | 55-70% |
Even at the lower end, automatic sync saves massive time. Fixing a few misaligned words is way faster than timing everything from scratch.
When Auto Sync Is Enough
For most musicians posting TikTok and Instagram content, auto sync accuracy is perfectly fine. Viewers watching a 30-second vertical video on their phone aren't scrutinizing timing to the millisecond. If the text appears roughly when the word is sung, it works.
Auto sync is enough when:
- You're making short-form content (under 60 seconds)
- The audio has clear, upfront vocals
- You're posting across multiple platforms and need volume over perfection
- The song is mid-tempo or slower
When to Go Manual
Auto sync needs manual cleanup when:
- You're making a full-length YouTube lyric video where timing precision matters more
- The track has complex vocal layering or effects
- The track has fast-delivery sections where the AI missed word boundaries
- You want frame-perfect sync for professional distribution
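One way to read "frame-perfect" is that every timestamp lands exactly on a video frame boundary. The sketch below shows that rounding, assuming a constant frame rate; the function names and the 30 fps default are illustrative, not part of Epitrite.

```python
def snap_to_frame(seconds: float, fps: float = 30.0) -> float:
    """Round a timestamp to the nearest frame boundary at a constant frame rate."""
    frame_index = round(seconds * fps)
    return frame_index / fps


def snap_error_ms(seconds: float, fps: float = 30.0) -> float:
    """How far (in milliseconds) snapping moves the timestamp."""
    return abs(snap_to_frame(seconds, fps) - seconds) * 1000


# A vocal attack at 1.02 s lands between frames 30 and 31 at 30 fps.
print(snap_to_frame(1.02, fps=30.0))
print(round(snap_error_ms(1.02, fps=30.0), 2), "ms moved")
```

Snapping never moves a timestamp by more than half a frame, which is about 17 ms at 30 fps.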
Manual Timing
Manual timing means setting the start and end time for each line (or word) yourself. It's more work, but you get complete control over exactly when each piece of text appears and disappears.
The Manual Workflow in Epitrite
- Paste or transcribe your lyrics -- Get the text into the editor
- Play the audio -- Epitrite's timeline shows the audio waveform
- Set start times -- For each line, click or tap when the vocal begins. The editor snaps to that timestamp.
- Set end times -- Mark when each line should disappear. Usually when the next line's vocal begins, or at the end of a held note.
- Fine-tune -- Scrub through the timeline and adjust any lines that feel off by a few milliseconds
Tips for Accurate Manual Timing
Watch the waveform. Vocal attacks (the start of a word) create visible spikes in the audio waveform. Align your text start times to those spikes rather than guessing by ear.
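The spikes you look for by eye can also be found with a simple loudness check. This is a minimal sketch, assuming mono samples in the range -1.0 to 1.0 and a mostly isolated vocal; the function name, the 10 ms window, and the 0.1 threshold are illustrative choices, and a dense full mix would need a more robust onset detector.

```python
import math


def find_attacks(samples, sample_rate, threshold=0.1, window_ms=10):
    """Return times (in seconds) where loudness rises from below to above the threshold."""
    size = max(1, int(sample_rate * window_ms / 1000))
    attacks = []
    was_loud = False
    for offset in range(0, len(samples), size):
        window = samples[offset:offset + size]
        # Root-mean-square level of this short window.
        rms = math.sqrt(sum(s * s for s in window) / len(window))
        is_loud = rms >= threshold
        if is_loud and not was_loud:
            attacks.append(offset / sample_rate)
        was_loud = is_loud
    return attacks


# Synthetic test signal: silence, a tone, silence, a tone (0.5 s each).
rate = 1000
silence = [0.0] * 500
tone = [0.8 * math.sin(2 * math.pi * 100 * n / rate) for n in range(500)]
signal = silence + tone + silence + tone

print(find_attacks(signal, rate))  # two attacks, at 0.5 s and 1.5 s
```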
Use wired headphones. Bluetooth speakers and earbuds add latency that can throw off your timing. Wired headphones give you the most accurate audio-to-visual alignment.
Time on the attack, not the downbeat. Musical instinct makes you want to time lyrics to the beat. But vocals don't always land exactly on the beat -- they can be slightly ahead or behind the grid. Time to the actual vocal, not where you think the vocal should be.
Leave breathing room. Don't make text appear the instant a word starts. A 50-100ms early entry gives the viewer's brain time to process the text before the audio confirms it. It feels tighter even though it's technically early.
Handle held notes carefully. If a singer holds a word for 3 seconds, the text should stay visible for those 3 seconds. Don't cut it short. A word disappearing while it's still being sung looks like an error.
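The early-entry and held-note tips can be combined into one small calculation. The sketch below assumes you already know when the vocal starts and ends; display_window and its parameters are hypothetical names, and the 75 ms default is just a midpoint of the 50-100ms range above.

```python
def display_window(vocal_start, vocal_end, lead_ms=75, earliest=0.0):
    """Return (show_at, hide_at) in seconds for one line or word.

    The text enters lead_ms before the vocal, but never before `earliest`
    (for example, the moment the previous line ends). It stays visible
    until the vocal actually finishes, so held notes are not cut short.
    """
    show_at = max(vocal_start - lead_ms / 1000, earliest)
    show_at = min(show_at, vocal_start)  # never enter later than the vocal
    hide_at = max(vocal_end, vocal_start)
    return show_at, hide_at


# A word held for three seconds stays on screen for all three.
print(display_window(10.0, 13.0))

# A line that starts almost immediately cannot enter before time zero.
print(display_window(0.03, 1.0))
```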
The Hybrid Approach
The smartest workflow for most situations: start with auto sync, then manually fix the sections that need it.
- AI transcription for the base timing (saves 15-20 minutes)
- Full playback review -- Watch the video start to finish, noting any lines that feel off
- Manual adjustment on problem sections only (usually 5-15% of the lyrics)
- Final pass at 0.5x speed to catch subtle misalignments
This hybrid approach gives you 95%+ timing accuracy in about 5-10 minutes of manual work, compared to 20-30 minutes for fully manual timing.
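A quick sanity check can shortlist lines worth a closer look before the playback review. This is a rough sketch, not a substitute for watching the video; the function name, the reasons it reports, and the 6 words-per-second limit are all assumptions, and fast rap will legitimately exceed that limit.

```python
def flag_for_review(lines, max_words_per_second=6.0):
    """Flag suspicious timings.

    `lines` is a list of (text, start, end) tuples sorted by start time.
    Returns a list of (line_index, reason) pairs.
    """
    flagged = []
    for i, (text, start, end) in enumerate(lines):
        duration = end - start
        if duration <= 0:
            flagged.append((i, "empty or negative duration"))
        elif len(text.split()) / duration > max_words_per_second:
            flagged.append((i, "implausibly fast delivery"))
        elif i > 0 and start < lines[i - 1][2]:
            flagged.append((i, "overlaps previous line"))
    return flagged


auto_synced = [
    ("Hold on tight", 1.0, 2.5),
    ("we are running out of time tonight", 2.4, 3.0),
    ("oh", 5.0, 5.0),
]

for index, reason in flag_for_review(auto_synced):
    print(index, reason)
```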
Word-Level vs Line-Level Timing
Line-level timing means each entire line appears at once and disappears when the next line starts. This is the traditional approach and works well for most genres.
Word-level timing means individual words appear as they're spoken. This is newer and more visually dynamic, but requires more precise timing data.
When to Use Word-Level
- Rap and hip-hop (word-by-word delivery is central to the genre)
- Any track where you're using stomp words, scatter words, or depth tilt animations
- Songs with rhythmically complex vocal patterns
- When you want maximum visual energy
When Line-Level Is Fine
- Ballads and slow songs where lines flow as complete thoughts
- Clean, minimal aesthetics where word-by-word feels too busy
- Quick content where timing precision isn't the priority
Epitrite's AI transcription generates both line-level and word-level timestamps. You get both options from a single transcription.
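Both display modes can be derived from the same word timestamps. The sketch below assumes each line is a list of (word, start, end) tuples, which is an illustrative layout rather than Epitrite's format. For simplicity, the line-level version ends each line at its last word instead of holding it until the next line starts.

```python
def line_level_events(lines):
    """One (start, end, text) event per line: the whole line appears at once."""
    return [
        (words[0][1], words[-1][2], " ".join(w[0] for w in words))
        for words in lines
    ]


def word_level_events(lines):
    """One (start, end, text) event per word: the line builds up word by word."""
    events = []
    for words in lines:
        line_end = words[-1][2]
        for i, (_, start, _) in enumerate(words):
            # Each state lasts until the next word starts, or the line ends.
            until = words[i + 1][1] if i + 1 < len(words) else line_end
            shown = " ".join(w[0] for w in words[: i + 1])
            events.append((start, until, shown))
    return events


lyrics = [[("Hold", 12.4, 12.7), ("on", 12.7, 13.05), ("tight", 13.1, 14.2)]]

print(line_level_events(lyrics))
print(word_level_events(lyrics))
```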
Common Timing Mistakes
Starting text too late. If text appears after the word is already half-spoken, the viewer's brain has to work backward to match text to audio. Slightly early is better than slightly late.
Ending text too early. Cutting a line before the last word finishes being sung makes the lyric look truncated. The text should persist at least until the vocal phrase ends.
Ignoring instrumental sections. During an instrumental break, the screen should either be empty, show a section label ("Instrumental"), or display a visual element. A frozen last lyric line during a 16-bar instrumental break looks like a glitch.
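Instrumental breaks show up as long gaps between one line's end and the next line's start. The sketch below finds them so the screen can be cleared or labeled; the function name and the 4-second minimum gap are assumptions to tune per song.

```python
def find_instrumental_gaps(lines, min_gap=4.0):
    """Return (gap_start, gap_end) pairs for silences between lines.

    `lines` is a list of (start, end) tuples sorted by start time.
    """
    gaps = []
    for (_, prev_end), (next_start, _) in zip(lines, lines[1:]):
        if next_start - prev_end >= min_gap:
            gaps.append((prev_end, next_start))
    return gaps


timed_lines = [(0.0, 3.0), (3.5, 6.0), (30.0, 33.0)]

for gap_start, gap_end in find_instrumental_gaps(timed_lines):
    print(f"Show 'Instrumental' from {gap_start:.1f}s to {gap_end:.1f}s")
```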
Uniform timing for variable delivery. Not every line in a song has the same duration. A quick spoken line and a long held note need different text display times. Don't set a blanket "2 seconds per line" and call it done.
Get Your Timing Right
Upload your track to Epitrite at epitrite.com and try auto sync first. If the timing is close enough, you're done. If it needs cleanup, the timeline editor lets you fine-tune to the millisecond.
