WhatsApp AI Bot with Voice Messages

How to build a WhatsApp AI bot that can handle voice messages: speech-to-text and voice responses. A complete tutorial!


Voice messages = more personal!

Many customers prefer sending a voice note over typing. An AI bot that can handle voice input and respond with voice delivers a much more natural experience.


Why Voice Support?

📊 VOICE MESSAGE FACTS:

- 70% of WhatsApp users have sent a voice note
- Speaking is faster than typing
- More personal and expressive
- Important for accessibility
- Usage keeps trending upward

USE CASES:
- Customers who don't want to type long messages
- Complaints (voice is more expressive)
- Elderly or less tech-savvy users
- Multitasking (while driving, etc.)

Architecture

🏗️ VOICE MESSAGE FLOW:

INCOMING VOICE:
┌─────────────┐
│ Voice Note  │
│ (.ogg/.mp3) │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Download   │
│   Media     │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Speech to  │
│    Text     │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  AI Process │
│   (GPT)     │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Reply     │
│ Text/Voice  │
└─────────────┘
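The incoming flow above starts with a dispatch step: detect whether a message is a voice note and route it to the voice pipeline or the regular text pipeline. A minimal sketch, assuming Baileys-style message objects; `handleVoiceMessage`, `handleTextMessage`, and `sendMessage` are handlers assumed to exist elsewhere in the bot:

```javascript
// Inspect a Baileys-style message object and classify it.
function getMessageType(msg) {
    if (msg.message?.audioMessage) return 'voice'; // voice note or audio file
    if (msg.message?.conversation || msg.message?.extendedTextMessage) return 'text';
    return 'unsupported';
}

// Route each incoming message to the matching handler.
async function dispatchMessage(userId, msg) {
    switch (getMessageType(msg)) {
        case 'voice': return handleVoiceMessage(userId, msg.message);
        case 'text':  return handleTextMessage(userId, msg.message);
        default:      return sendMessage(userId, { text: 'Maaf, format pesan ini belum didukung 🙏' });
    }
}
```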

Implementation

1. Download the Voice Message:

javascript

// Official API
async function downloadVoiceMessage(mediaId) {
    // Get media URL
    const urlResponse = await axios.get(
        `https://graph.facebook.com/v17.0/${mediaId}`,
        { headers: { 'Authorization': `Bearer ${ACCESS_TOKEN}` } }
    );
    
    const mediaUrl = urlResponse.data.url;
    
    // Download media
    const mediaResponse = await axios.get(mediaUrl, {
        headers: { 'Authorization': `Bearer ${ACCESS_TOKEN}` },
        responseType: 'arraybuffer'
    });
    
    return Buffer.from(mediaResponse.data);
}

// Unofficial (Baileys)
async function downloadVoiceBaileys(msg) {
    const buffer = await downloadMediaMessage(
        msg,
        'buffer',
        {},
        { reuploadRequest: sock.updateMediaMessage }
    );
    return buffer;
}

2. Speech-to-Text with OpenAI Whisper:

javascript

const OpenAI = require('openai');
const fs = require('fs');

const openai = new OpenAI();

async function transcribeAudio(audioBuffer, filename = 'audio.ogg') {
    // Save buffer to temp file
    const tempPath = `/tmp/${filename}`;
    fs.writeFileSync(tempPath, audioBuffer);
    
    // Transcribe with Whisper
    const transcription = await openai.audio.transcriptions.create({
        file: fs.createReadStream(tempPath),
        model: 'whisper-1',
        language: 'id', // Indonesian
        response_format: 'text'
    });
    
    // Cleanup
    fs.unlinkSync(tempPath);
    
    return transcription;
}

// With timestamps (for long audio)
async function transcribeWithTimestamps(audioBuffer) {
    const tempPath = `/tmp/audio_${Date.now()}.ogg`;
    fs.writeFileSync(tempPath, audioBuffer);
    
    const transcription = await openai.audio.transcriptions.create({
        file: fs.createReadStream(tempPath),
        model: 'whisper-1',
        language: 'id',
        response_format: 'verbose_json',
        timestamp_granularities: ['segment']
    });
    
    fs.unlinkSync(tempPath);
    
    return transcription;
}

3. Alternative: Google Speech-to-Text:

javascript

const speech = require('@google-cloud/speech');

const client = new speech.SpeechClient();

async function transcribeWithGoogle(audioBuffer) {
    const audio = {
        content: audioBuffer.toString('base64')
    };
    
    const config = {
        encoding: 'OGG_OPUS',
        sampleRateHertz: 16000,
        languageCode: 'id-ID',
        alternativeLanguageCodes: ['en-US'], // Fallback
        enableAutomaticPunctuation: true
    };
    
    const [response] = await client.recognize({ audio, config });
    
    const transcription = response.results
        .map(result => result.alternatives[0].transcript)
        .join('\n');
    
    return transcription;
}
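With two STT providers available, a simple resilience pattern is to try Whisper first and fall back to Google on failure. A sketch, assuming the `transcribeAudio` and `transcribeWithGoogle` functions defined above:

```javascript
// Try Whisper first; if it throws, fall back to Google Speech-to-Text.
async function transcribeWithFallback(audioBuffer) {
    try {
        return await transcribeAudio(audioBuffer);
    } catch (err) {
        console.warn('Whisper failed, falling back to Google STT:', err.message);
        return transcribeWithGoogle(audioBuffer);
    }
}
```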

Full Handler

javascript

async function handleVoiceMessage(userId, message) {
    // 1. Get voice message info
    const audioInfo = message.audioMessage || message.audio;
    const mediaId = audioInfo.id;
    
    // 2. Send "processing" indicator
    await sendTypingIndicator(userId);
    
    // 3. Download audio
    const audioBuffer = await downloadVoiceMessage(mediaId);
    
    // 4. Transcribe
    let transcription;
    try {
        transcription = await transcribeAudio(audioBuffer);
        console.log('Transcription:', transcription);
    } catch (error) {
        console.error('Transcription failed:', error);
        return await sendMessage(userId, {
            text: 'Maaf kak, voice note-nya kurang jelas. Bisa ketik atau kirim ulang? 🙏'
        });
    }
    
    // 5. Check if transcription is empty or too short
    if (!transcription || transcription.trim().length < 3) {
        return await sendMessage(userId, {
            text: 'Maaf kak, aku tidak bisa mendengar dengan jelas. Bisa diulang atau ketik saja? 😊'
        });
    }
    
    // 6. Process with AI (same as text)
    const aiResponse = await processWithAI(userId, transcription);
    
    // 7. Send response (text or voice)
    const userPreference = await getUserVoicePreference(userId);
    
    if (userPreference === 'voice') {
        await sendVoiceResponse(userId, aiResponse);
    } else {
        await sendMessage(userId, { text: aiResponse });
    }
    
    // 8. Log for analytics
    await logVoiceInteraction(userId, {
        transcription,
        response: aiResponse,
        duration: audioInfo.seconds
    });
}

Text-to-Speech Response

OpenAI TTS:

javascript

async function generateVoiceResponse(text) {
    const mp3 = await openai.audio.speech.create({
        model: 'tts-1',
        voice: 'nova', // Options: alloy, echo, fable, onyx, nova, shimmer
        input: text,
        response_format: 'opus' // Better for WhatsApp
    });
    
    const buffer = Buffer.from(await mp3.arrayBuffer());
    return buffer;
}

async function sendVoiceResponse(userId, text) {
    // Generate voice
    const audioBuffer = await generateVoiceResponse(text);
    
    // Save to temp file
    const tempPath = `/tmp/response_${Date.now()}.opus`;
    fs.writeFileSync(tempPath, audioBuffer);
    
    // Send voice message
    await sock.sendMessage(userId, {
        audio: { url: tempPath },
        mimetype: 'audio/ogg; codecs=opus',
        ptt: true // Push-to-talk (voice note style)
    });
    
    // Cleanup
    fs.unlinkSync(tempPath);
}

Google TTS:

javascript

const textToSpeech = require('@google-cloud/text-to-speech');

const ttsClient = new textToSpeech.TextToSpeechClient();

async function generateVoiceGoogle(text) {
    const request = {
        input: { text },
        voice: {
            languageCode: 'id-ID',
            name: 'id-ID-Wavenet-A', // Female Indonesian voice
            ssmlGender: 'FEMALE'
        },
        audioConfig: {
            audioEncoding: 'OGG_OPUS',
            speakingRate: 1.0,
            pitch: 0
        }
    };
    
    const [response] = await ttsClient.synthesizeSpeech(request);
    return response.audioContent;
}

Voice Preference Detection

javascript

// Auto-detect if user prefers voice
async function detectVoicePreference(userId) {
    const recentMessages = await db.messages
        .find({ userId })
        .sort({ timestamp: -1 })
        .limit(10)
        .toArray();
    
    const voiceCount = recentMessages.filter(m => 
        m.type === 'audio' || m.type === 'ptt'
    ).length;
    
    // If > 50% are voice, user probably prefers voice
    return voiceCount > 5 ? 'voice' : 'text';
}

// Or ask user
async function askVoicePreference(userId) {
    await sendMessage(userId, {
        text: `Kak, mau aku balas pakai voice note juga atau text aja?

1️⃣ Voice note
2️⃣ Text aja

Reply angkanya ya!`
    });
}
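The reply to that 1️⃣/2️⃣ prompt still has to be parsed and stored. A hypothetical handler; `setUserVoicePreference` and `sendMessage` are assumed helpers:

```javascript
// Map a free-text reply ("1", "2", "voice", "text aja") to a preference.
function parsePreferenceReply(text) {
    const t = (text || '').trim();
    if (t.startsWith('1') || /voice/i.test(t)) return 'voice';
    if (t.startsWith('2') || /te?xt/i.test(t)) return 'text';
    return null; // unrecognized: keep the current default
}

async function handlePreferenceReply(userId, text) {
    const pref = parsePreferenceReply(text);
    if (!pref) return;
    await setUserVoicePreference(userId, pref); // assumed persistence helper
    await sendMessage(userId, {
        text: pref === 'voice'
            ? 'Oke, aku balas pakai voice note ya! 🎙️'
            : 'Siap, aku balas pakai text ya! ✍️'
    });
}
```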

Handle Long Voice Messages

javascript

async function handleLongVoice(audioBuffer, durationSeconds) {
    // The Whisper API caps uploads at 25 MB (roughly 25 minutes of voice-note audio)
    // For longer audio, split or summarize
    
    if (durationSeconds > 120) { // > 2 minutes
        // Transcribe with timestamps
        const result = await transcribeWithTimestamps(audioBuffer);
        
        // Summarize if very long
        const summary = await summarizeLongTranscription(result.text);
        
        return {
            fullText: result.text,
            summary,
            segments: result.segments
        };
    }
    
    return {
        fullText: await transcribeAudio(audioBuffer),
        summary: null
    };
}

async function summarizeLongTranscription(text) {
    const response = await openai.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [{
            role: 'user',
            content: `Summarize this customer voice message in 2-3 sentences.
            Identify the main request or question.
            
            Transcription: "${text}"`
        }]
    });
    
    return response.choices[0].message.content;
}

Error Handling

javascript

const voiceErrorResponses = {
    transcriptionFailed: 'Maaf kak, voice note-nya kurang jelas 🙏\nBisa ketik atau kirim ulang?',
    
    tooShort: 'Voice note-nya terlalu pendek kak.\nBisa diperjelas lagi?',
    
    tooLong: 'Wah voice note-nya panjang banget kak! 😅\nBisa disingkat atau ketik poin pentingnya aja?',
    
    noiseDetected: 'Sepertinya ada noise di voice note-nya kak.\nBisa cari tempat lebih tenang atau ketik aja?',
    
    formatNotSupported: 'Format audio-nya tidak didukung kak.\nCoba kirim sebagai voice note WhatsApp biasa ya!'
};

async function handleVoiceError(userId, errorType) {
    const response = voiceErrorResponses[errorType] || 
        'Maaf ada kendala dengan voice note-nya. Bisa ketik aja kak? 🙏';
    
    await sendMessage(userId, { text: response });
}
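Some of these error types can be detected from metadata alone, before paying for a transcription call. A sketch; the thresholds are assumptions, tune them to your use case:

```javascript
// Classify obvious problems from the voice note's duration metadata.
// Returns a key from voiceErrorResponses, or null if it looks fine.
function classifyVoiceNote(audioInfo) {
    const seconds = audioInfo.seconds || 0;
    if (seconds < 1) return 'tooShort';
    if (seconds > 300) return 'tooLong'; // longer than 5 minutes
    return null; // proceed to transcription
}
```

Call it at the start of `handleVoiceMessage`: `const errorType = classifyVoiceNote(audioInfo); if (errorType) return handleVoiceError(userId, errorType);`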

Pricing Considerations

💰 COST BREAKDOWN:

WHISPER (Speech-to-Text):
- $0.006 per minute
- 1000 voice messages (avg 30s) = $3

OPENAI TTS (Text-to-Speech):
- $0.015 per 1K characters
- 1000 responses (avg 200 chars) = $3

GOOGLE SPEECH:
- $0.006 per 15 seconds
- Similar pricing

COST-SAVING TIPS:
- Generate TTS only for users who prefer voice
- Cache common responses
- Limit voice note duration
- Use cheaper models for simple transcription

Best Practices

DO ✅

- Support voice as input
- Give users a choice of response format
- Handle errors gracefully
- Log interactions for improvement
- Keep costs in mind
- Use a natural voice tone

DON'T ❌

- Force voice responses
- Ignore transcription errors
- Skip the text fallback
- Allow unlimited voice duration
- Skip noise handling
- Use a robotic voice

FAQ

Does Whisper support Indonesian?

Yes, very well. Whisper was trained on many languages, including Indonesian.

Are voice responses mandatory?

No. Some users prefer text. Offer a choice or detect their preference.


Conclusion

Voice Support = Better accessibility!

Text Only vs Voice + Text:
- Limited input → Flexible input
- Type everything → Speak naturally
- Less personal → More personal

Setup Voice Bot →

