WhatsApp AI Bot with Voice Message Support
How to build a WhatsApp AI bot that can handle voice messages: speech-to-text and voice responses. A complete tutorial!
Voice messages = more personal!
Many customers prefer sending a voice note to typing. An AI bot that can handle voice input and respond with voice delivers a much more natural experience.
Why Voice Support?
📊 VOICE MESSAGE FACTS:
- 70% of WhatsApp users have sent a voice note
- Speaking is faster than typing
- More personal and expressive
- Important for accessibility
- Usage keeps trending upward
USE CASES:
- Customers who don't want to type long messages
- Complaints (voice is more expressive)
- Elderly users / those less familiar with technology
- Multitasking (e.g. while driving)

Architecture
🏗️ VOICE MESSAGE FLOW:
INCOMING VOICE:
┌─────────────┐
│ Voice Note │
│ (.ogg/.mp3) │
└──────┬──────┘
│
▼
┌─────────────┐
│ Download │
│ Media │
└──────┬──────┘
│
▼
┌─────────────┐
│ Speech to │
│ Text │
└──────┬──────┘
│
▼
┌─────────────┐
│ AI Process │
│ (GPT) │
└──────┬──────┘
│
▼
┌─────────────┐
│ Reply │
│ Text/Voice │
└─────────────┘

Implementation
1. Download Voice Message:
javascript
// Official API
const axios = require('axios');

async function downloadVoiceMessage(mediaId) {
  // Get media URL
  const urlResponse = await axios.get(
    `https://graph.facebook.com/v17.0/${mediaId}`,
    { headers: { 'Authorization': `Bearer ${ACCESS_TOKEN}` } }
  );
  const mediaUrl = urlResponse.data.url;

  // Download media
  const mediaResponse = await axios.get(mediaUrl, {
    headers: { 'Authorization': `Bearer ${ACCESS_TOKEN}` },
    responseType: 'arraybuffer'
  });
  return Buffer.from(mediaResponse.data);
}
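With the official Cloud API, the media ID first has to be pulled out of the webhook payload before `downloadVoiceMessage` can run. A minimal sketch; the field path follows the Cloud API webhook format, and `getVoiceMediaId` is a helper name of my own:

```javascript
// Extract the voice-note media ID from a Cloud API webhook body.
// Returns null when the payload is not an audio/voice message.
function getVoiceMediaId(webhookBody) {
  const message = webhookBody?.entry?.[0]?.changes?.[0]?.value?.messages?.[0];
  if (!message || message.type !== 'audio') return null;
  return message.audio.id;
}
```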
// Unofficial (Baileys)
const { downloadMediaMessage } = require('@whiskeysockets/baileys');

async function downloadVoiceBaileys(msg) {
  const buffer = await downloadMediaMessage(
    msg,
    'buffer',
    {},
    { reuploadRequest: sock.updateMediaMessage }
  );
  return buffer;
}

2. Speech-to-Text with OpenAI Whisper:
javascript
const OpenAI = require('openai');
const fs = require('fs');

const openai = new OpenAI();

async function transcribeAudio(audioBuffer, filename = 'audio.ogg') {
  // Save buffer to a temp file (unique name to avoid collisions)
  const tempPath = `/tmp/${Date.now()}_${filename}`;
  fs.writeFileSync(tempPath, audioBuffer);

  try {
    // Transcribe with Whisper
    const transcription = await openai.audio.transcriptions.create({
      file: fs.createReadStream(tempPath),
      model: 'whisper-1',
      language: 'id', // Indonesian
      response_format: 'text'
    });
    return transcription;
  } finally {
    // Cleanup even if the API call throws
    fs.unlinkSync(tempPath);
  }
}
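Whisper rejects uploads over 25 MB, so a cheap size guard before calling the API avoids a wasted round trip. `MAX_WHISPER_BYTES` and `isTranscribable` are names of my own, not part of the OpenAI SDK:

```javascript
// Whisper's documented upload limit is 25 MB.
const MAX_WHISPER_BYTES = 25 * 1024 * 1024;

// True when the buffer is non-empty and within Whisper's size limit.
function isTranscribable(audioBuffer) {
  return Buffer.isBuffer(audioBuffer)
    && audioBuffer.length > 0
    && audioBuffer.length <= MAX_WHISPER_BYTES;
}
```

Check it before `transcribeAudio`, and fall back to the "too long" error response when it fails.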
// With timestamps (for long audio)
async function transcribeWithTimestamps(audioBuffer) {
  const tempPath = `/tmp/audio_${Date.now()}.ogg`;
  fs.writeFileSync(tempPath, audioBuffer);

  const transcription = await openai.audio.transcriptions.create({
    file: fs.createReadStream(tempPath),
    model: 'whisper-1',
    language: 'id',
    response_format: 'verbose_json',
    timestamp_granularities: ['segment']
  });

  fs.unlinkSync(tempPath);
  return transcription;
}

3. Alternative: Google Speech-to-Text:
javascript
const speech = require('@google-cloud/speech');
const client = new speech.SpeechClient();

async function transcribeWithGoogle(audioBuffer) {
  const audio = {
    content: audioBuffer.toString('base64')
  };
  const config = {
    encoding: 'OGG_OPUS',
    sampleRateHertz: 16000,
    languageCode: 'id-ID',
    alternativeLanguageCodes: ['en-US'], // Fallback
    enableAutomaticPunctuation: true
  };

  const [response] = await client.recognize({ audio, config });
  const transcription = response.results
    .map(result => result.alternatives[0].transcript)
    .join('\n');
  return transcription;
}

Full Handler
javascript
async function handleVoiceMessage(userId, message) {
  // 1. Get voice message info
  const audioInfo = message.audioMessage || message.audio;
  const mediaId = audioInfo.id;

  // 2. Send "processing" indicator
  await sendTypingIndicator(userId);

  // 3. Download audio
  const audioBuffer = await downloadVoiceMessage(mediaId);

  // 4. Transcribe
  let transcription;
  try {
    transcription = await transcribeAudio(audioBuffer);
    console.log('Transcription:', transcription);
  } catch (error) {
    console.error('Transcription failed:', error);
    return await sendMessage(userId, {
      text: 'Maaf kak, voice note-nya kurang jelas. Bisa ketik atau kirim ulang? 🙏'
    });
  }

  // 5. Check if transcription is empty or too short
  if (!transcription || transcription.trim().length < 3) {
    return await sendMessage(userId, {
      text: 'Maaf kak, aku tidak bisa mendengar dengan jelas. Bisa diulang atau ketik saja? 😊'
    });
  }

  // 6. Process with AI (same as text)
  const aiResponse = await processWithAI(userId, transcription);

  // 7. Send response (text or voice)
  const userPreference = await getUserVoicePreference(userId);
  if (userPreference === 'voice') {
    await sendVoiceResponse(userId, aiResponse);
  } else {
    await sendMessage(userId, { text: aiResponse });
  }

  // 8. Log for analytics
  await logVoiceInteraction(userId, {
    transcription,
    response: aiResponse,
    duration: audioInfo.seconds
  });
}

Text-to-Speech Response
OpenAI TTS:
javascript
async function generateVoiceResponse(text) {
  const mp3 = await openai.audio.speech.create({
    model: 'tts-1',
    voice: 'nova', // Options: alloy, echo, fable, onyx, nova, shimmer
    input: text,
    response_format: 'opus' // Better for WhatsApp
  });
  const buffer = Buffer.from(await mp3.arrayBuffer());
  return buffer;
}
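Since TTS is billed per character, caching the audio for recurring phrases (greetings, opening hours) cuts real costs. A minimal in-memory sketch; pass the `generateVoiceResponse` above as the `generate` argument:

```javascript
// Cache synthesized audio by exact response text, so repeated
// phrases are only sent to the TTS API once.
const ttsCache = new Map();

async function generateVoiceCached(text, generate) {
  if (ttsCache.has(text)) return ttsCache.get(text);
  const buffer = await generate(text);
  ttsCache.set(text, buffer);
  return buffer;
}
```

Call it as `generateVoiceCached(aiResponse, generateVoiceResponse)`. For production you would bound the cache (e.g. an LRU) so it cannot grow without limit.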
async function sendVoiceResponse(userId, text) {
  // Generate voice
  const audioBuffer = await generateVoiceResponse(text);

  // Save to temp file
  const tempPath = `/tmp/response_${Date.now()}.opus`;
  fs.writeFileSync(tempPath, audioBuffer);

  // Send voice message
  await sock.sendMessage(userId, {
    audio: { url: tempPath },
    mimetype: 'audio/ogg; codecs=opus',
    ptt: true // Push-to-talk (voice note style)
  });

  // Cleanup
  fs.unlinkSync(tempPath);
}

Google TTS:
javascript
const textToSpeech = require('@google-cloud/text-to-speech');
const ttsClient = new textToSpeech.TextToSpeechClient();

async function generateVoiceGoogle(text) {
  const request = {
    input: { text },
    voice: {
      languageCode: 'id-ID',
      name: 'id-ID-Wavenet-A', // Female Indonesian voice
      ssmlGender: 'FEMALE'
    },
    audioConfig: {
      audioEncoding: 'OGG_OPUS',
      speakingRate: 1.0,
      pitch: 0
    }
  };
  const [response] = await ttsClient.synthesizeSpeech(request);
  return response.audioContent;
}

Voice Preference Detection
javascript
// Auto-detect if user prefers voice
async function detectVoicePreference(userId) {
  const recentMessages = await db.messages
    .find({ userId })
    .sort({ timestamp: -1 })
    .limit(10)
    .toArray();

  const voiceCount = recentMessages.filter(m =>
    m.type === 'audio' || m.type === 'ptt'
  ).length;

  // If > 50% are voice, user probably prefers voice
  return voiceCount > 5 ? 'voice' : 'text';
}
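When you ask users directly (as `askVoicePreference` below does), the free-form reply still has to be mapped to a preference. A tolerant parsing sketch; `parsePreferenceReply` is a helper name of my own:

```javascript
// Map a free-form reply to the voice/text preference prompt
// ("1" = voice note, "2" = text). Returns null when unclear.
function parsePreferenceReply(reply) {
  const normalized = reply.trim().toLowerCase();
  if (normalized.startsWith('1') || normalized.includes('voice')) return 'voice';
  if (normalized.startsWith('2') || normalized.includes('text')) return 'text';
  return null;
}
```

On `null`, fall back to the auto-detected preference instead of re-asking.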
// Or ask user
async function askVoicePreference(userId) {
  await sendMessage(userId, {
    text: `Kak, mau aku balas pakai voice note juga atau text aja?
1️⃣ Voice note
2️⃣ Text aja
Reply angkanya ya!`
  });
}

Handle Long Voice Messages
javascript
async function handleLongVoice(audioBuffer, durationSeconds) {
  // Whisper has a 25MB / ~25 minute limit
  // For longer audio, split or summarize
  if (durationSeconds > 120) { // > 2 minutes
    // Transcribe with timestamps
    const result = await transcribeWithTimestamps(audioBuffer);

    // Summarize if very long
    const summary = await summarizeLongTranscription(result.text);
    return {
      fullText: result.text,
      summary,
      segments: result.segments
    };
  }

  return {
    fullText: await transcribeAudio(audioBuffer),
    summary: null
  };
}
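For handover to a human agent, the `verbose_json` segments can be rendered as a time-stamped transcript. A small formatting sketch (each Whisper segment carries `start`, `end`, and `text`):

```javascript
// Format Whisper verbose_json segments as "[mm:ss] text" lines.
function formatSegments(segments) {
  const stamp = (s) => {
    const m = Math.floor(s / 60);
    const sec = Math.floor(s % 60);
    return `${String(m).padStart(2, '0')}:${String(sec).padStart(2, '0')}`;
  };
  return segments
    .map(seg => `[${stamp(seg.start)}] ${seg.text.trim()}`)
    .join('\n');
}
```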
async function summarizeLongTranscription(text) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{
      role: 'user',
      content: `Summarize this customer voice message in 2-3 sentences.
Identify the main request or question.

Transcription: "${text}"`
    }]
  });
  return response.choices[0].message.content;
}

Error Handling
javascript
const voiceErrorResponses = {
  transcriptionFailed: 'Maaf kak, voice note-nya kurang jelas 🙏\nBisa ketik atau kirim ulang?',
  tooShort: 'Voice note-nya terlalu pendek kak.\nBisa diperjelas lagi?',
  tooLong: 'Wah voice note-nya panjang banget kak! 😅\nBisa disingkat atau ketik poin pentingnya aja?',
  noiseDetected: 'Sepertinya ada noise di voice note-nya kak.\nBisa cari tempat lebih tenang atau ketik aja?',
  formatNotSupported: 'Format audio-nya tidak didukung kak.\nCoba kirim sebagai voice note WhatsApp biasa ya!'
};

async function handleVoiceError(userId, errorType) {
  const response = voiceErrorResponses[errorType] ||
    'Maaf ada kendala dengan voice note-nya. Bisa ketik aja kak? 🙏';
  await sendMessage(userId, { text: response });
}

Pricing Considerations
💰 COST BREAKDOWN:
WHISPER (Speech-to-Text):
- $0.006 per minute
- 1000 voice messages (avg 30s) = $3
OPENAI TTS (Text-to-Speech):
- $0.015 per 1K characters
- 1000 responses (avg 200 chars) = $3
GOOGLE SPEECH-TO-TEXT:
- $0.006 per 15 seconds, billed in 15-second increments
- About $0.024 per minute, so pricier per minute than Whisper
COST-SAVING TIPS:
- Only generate TTS for users who prefer voice
- Cache common responses
- Limit voice note duration
- Use cheaper models for simple transcription

Best Practices
DO ✅
- Support voice as input
- Give users a choice of response format
- Handle errors gracefully
- Log transcriptions for improvement
- Keep an eye on costs
- Use a natural voice tone

DON'T ❌
- Force voice responses
- Ignore transcription errors
- Skip the text fallback
- Allow unlimited voice duration
- Skip noise handling
- Use a robotic voice

FAQ
Does Whisper support Indonesian?
Yes, very well. Whisper was trained on many languages, including Indonesian.
Are voice responses mandatory?
No. Some users prefer text. Give them a choice, or detect their preference.
Conclusion
Voice support = better accessibility!
| Text Only | Voice + Text |
|---|---|
| Limited input | Flexible input |
| Type everything | Speak naturally |
| Less personal | More personal |