Speech To Text

Introduction

The Xdroid Speech To Text API enables digital transformation in contact centers through voice and text solutions based on artificial intelligence and machine learning.

API specification

Test the API on SwaggerHub

Base URL

https://api-prd.kpn.com/data/kpn/voiceanalytics/

Conceptual model

Documentation page Xdroid Speech To Text conceptual model

Definitions

Codec

A codec is a device or computer program which encodes or decodes a digital data stream or signal. Codec is short for coder-decoder.

G.711

G.711 is a narrowband audio codec that defines two main companding (compressing/expanding) algorithms: the μ-law algorithm and the A-law algorithm.

ISO

The International Organization for Standardization (ISO) is an international standard-setting body.

MP3

MP3 is a coding format for digital audio.

MP4

MPEG-4 Part 14 or MP4 is a digital multimedia container format.

OGG

Ogg is a free, open multimedia container format.

Opus

Opus is an audio coding format using lossy compression.

PCI

Payment Card Industry (PCI) compliance is mandated by credit card companies to help ensure the security of credit card transactions in the payments industry.

PCM

Pulse Code Modulation (PCM) is an audio format. PCM is both uncompressed and lossless.

WAV

Waveform Audio File Format (WAV) is an audio file format standard developed by IBM and Microsoft.

API workflow

Documentation page Xdroid Speech To Text workflow

Requirements

Accepted container formats:
- .wav
- .mp3 / .mp4
- .opus / .ogg

Preferred audio properties for the best transcription quality:
- Bitrate: 64 kbit/s per channel (stereo recording is supported).
- Sample rate: 8 kHz / 16 kHz.
- Uncompressed / lossless telephony codecs (PCM Linear, G.711 A-law/μ-law).
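
A pre-upload check against the requirements above could be sketched as follows. The function name `check_audio` and its return shape are illustrative and not part of the API. The sketch only inspects the file extension and, for WAV files, the sample rate through Python's standard `wave` module, which reads PCM-encoded WAV data; G.711-encoded WAV files would need a different reader.

```python
import wave
from pathlib import Path

# Extensions taken from the accepted container formats listed above.
ACCEPTED_EXTENSIONS = {".wav", ".mp3", ".mp4", ".opus", ".ogg"}
# Preferred sample rates in Hz (8 kHz / 16 kHz).
PREFERRED_SAMPLE_RATES = {8000, 16000}


def check_audio(filename, wav_stream=None):
    """Return a list of warnings; an empty list means no issue was found.

    `wav_stream` is an optional binary file object holding PCM WAV data.
    """
    warnings = []
    extension = Path(filename).suffix.lower()
    if extension not in ACCEPTED_EXTENSIONS:
        warnings.append(f"unsupported container: {extension or '(none)'}")
    if extension == ".wav" and wav_stream is not None:
        with wave.open(wav_stream, "rb") as wav:
            rate = wav.getframerate()
        if rate not in PREFERRED_SAMPLE_RATES:
            warnings.append(f"sample rate {rate} Hz is not 8 kHz or 16 kHz")
    return warnings
```

A sample rate outside the preferred values is reported as a warning rather than an error, because the text above describes these values as preferred, not mandatory.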

Features and constraints

Features

The API provides speech-to-text transcriptions based on search word volume.
The voice analytics system additionally provides emotion analysis, keyword detection, and semantic capabilities, along with full quality evaluation for call centers.

Constraints

An audio file must not be larger than 150 MB.
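
The size constraint can be checked on the client before uploading. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that the 150 MB figure means 150 × 1024 × 1024 bytes. Because the body parameter table for the submit endpoint states a limit of 100 Mbyte per file, the limit is kept as a parameter rather than hard-coded.

```python
import os

# Assumption: the 150 MB constraint is read as 150 * 1024 * 1024 bytes.
DEFAULT_LIMIT_BYTES = 150 * 1024 * 1024


def within_size_limit(size_bytes, limit_bytes=DEFAULT_LIMIT_BYTES):
    """Return True when a non-empty file of `size_bytes` does not exceed the limit."""
    return 0 < size_bytes <= limit_bytes


def file_within_size_limit(path, limit_bytes=DEFAULT_LIMIT_BYTES):
    """Check a file on disk using the size reported by the file system."""
    return within_size_limit(os.path.getsize(path), limit_bytes)
```

For example, `within_size_limit(size, limit_bytes=100 * 1024 * 1024)` applies the stricter of the two documented values.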

Getting started

Make sure you've read Getting Started for more info on how to register your application and start trying out our APIs.

Authentication

The API follows the KPN Store API authentication standard to secure the API. It includes the use of OAuth 2.0 client_id and client_secret to receive an access token.

Go to the Authentication tab on top of this page to find out how to:

Authenticate to an API using cURL.
Authenticate to an API on SwaggerHub.
Import OpenAPI Specifications (OAS), also called Swagger files, into Postman.

How to...

Submit audio files for analysis

This endpoint lets you submit audio files to start a new voice analytics job.

Recommended formats:

A WAV container (PCM Linear 16 bit, G.711 μ-law/A-law) is recommended.
MP3 and Opus recordings are also supported, but depending on the compression level, they may reduce transcription accuracy.

The supported content type is multipart/form-data.

Request

POST /job

Upload the audio file and send the language config in the body.


cURL request example

curl -X POST "https://api-prd.kpn.com/data/kpn/voiceanalytics/job" -H "accept: application/json" -H "Authorization: Bearer *****************" -H "Content-Type: multipart/form-data" -F 'config={"language":"en","recording_start":""}' -F "audio_file=@speech_orig.wav;type=audio/wav"
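
The same request can be sketched in Python using only the standard library. This is a sketch under assumptions, not a reference client: the function names `build_multipart` and `submit_job` are hypothetical, the access token is assumed to have been obtained as described under Authentication, and the multipart body is assembled by hand because cURL generates the boundary automatically while `urllib` does not.

```python
import json
import urllib.request
import uuid

BASE_URL = "https://api-prd.kpn.com/data/kpn/voiceanalytics"


def build_multipart(config, filename, audio_bytes, audio_type="audio/wav", boundary=None):
    """Encode the `config` and `audio_file` form fields as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="config"\r\n'
        "\r\n"
        f"{json.dumps(config)}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="audio_file"; filename="{filename}"\r\n'
        f"Content-Type: {audio_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + audio_bytes + tail, f"multipart/form-data; boundary={boundary}"


def submit_job(access_token, config, filename, audio_bytes):
    """POST the audio to /job and return the decoded JSON response."""
    body, content_type = build_multipart(config, filename, audio_bytes)
    request = urllib.request.Request(
        f"{BASE_URL}/job",
        data=body,
        method="POST",
        headers={
            "accept": "application/json",
            "Authorization": f"Bearer {access_token}",
            "Content-Type": content_type,
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping the body encoding in its own function makes it possible to inspect the exact bytes that would be sent without calling the API.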

Body parameter	Type	Description
`audio_file=@`	multipart/form-data	Audio file. File size limit: 100 MB per file. Example: `/Audios/0036550e-720f-1239-0b99-eecf4973.wav`.
`config=`	object	JSON object containing the `language` and `recording_start` parameters. Example: `{"language":"en","recording_start":"" }`
`language`	string	Required. ISO language code. Supported language codes: `en` (Global English), `es` (Global Spanish), `nl` (Dutch), `fr` (French). Example: `"language":"en"`
`recording_start`	integer	Optional. Date and time when the recording starts. Format: `YmdHis`. Example: `"recording_start":"20201216081228"`.
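
The `config` value can be built programmatically, as in the sketch below. The helper name `build_config` is hypothetical. `YmdHis` is read here as year, month, day, hour, minute, and second without separators, which matches the example `20201216081228`. Although the table lists the type as integer, the documented examples pass `recording_start` as a quoted string, so the sketch does the same; that choice is an assumption.

```python
import json
from datetime import datetime

# Language codes listed in the parameter table above.
SUPPORTED_LANGUAGES = {"en", "es", "nl", "fr"}


def build_config(language, recording_start=None):
    """Return the JSON string for the `config` form field.

    `recording_start` is an optional datetime, rendered as YmdHis.
    An empty string is sent when no start time is given, as in the cURL example.
    """
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    start = recording_start.strftime("%Y%m%d%H%M%S") if recording_start else ""
    return json.dumps({"language": language, "recording_start": start})
```

Calling `build_config("nl", datetime(2020, 12, 16, 8, 12, 28))` yields a `recording_start` of `20201216081228`.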

Response

The response returns the unique job_id. Save it to retrieve the transcription in later requests.


Response example

{
  "job_id": 12
}

Retrieve transcription

This endpoint retrieves the JSON transcript of a finished transcription job. Send the unique job_id as a path parameter.

To avoid a throttle penalty, do not check the status more often than once every 10 seconds.
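
A polling loop that respects the 10-second minimum could be sketched as follows. The function name `wait_for_transcript`, the `max_attempts` cap, and the injected `fetch_job` callable are illustrative assumptions; `fetch_job` stands for whatever code performs the GET request and returns the decoded JSON response. The sketch only knows the statuses documented for this endpoint (queued, processing, analyzed) and treats anything other than analyzed as "not ready yet".

```python
import time

MIN_POLL_INTERVAL_SECONDS = 10  # documented minimum interval between status checks


def wait_for_transcript(fetch_job, interval=MIN_POLL_INTERVAL_SECONDS,
                        max_attempts=60, sleep=time.sleep):
    """Poll until the job status is "analyzed" and return that response.

    `fetch_job` is any callable returning the decoded JSON response.
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    if interval < MIN_POLL_INTERVAL_SECONDS:
        raise ValueError("poll interval must be at least 10 seconds")
    for attempt in range(max_attempts):
        response = fetch_job()
        if response["job"]["status"] == "analyzed":
            return response
        if attempt < max_attempts - 1:
            sleep(interval)  # wait before the next status check
    raise TimeoutError("job was not analyzed within the allowed attempts")
```

Rejecting intervals below 10 seconds in the helper itself keeps a misconfigured caller from triggering the throttle penalty.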

Request

GET /job/transcript/$job_id


cURL request

curl -X GET "https://api-prd.kpn.com/data/kpn/voiceanalytics/job/13" -H  "accept: application/json" -H  "Authorization: Bearer *****************"

Response

Expected flow of statuses is queued > processing > analyzed.

Assume a real-time equivalent (RTE) of 1: processing a recording takes approximately as long as the recording itself, so a one-minute conversation takes about one minute to process.
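
The RTE rule can be turned into a rough wait estimate before polling starts. The sketch below is an illustration with hypothetical helper names; it reads the duration of a PCM WAV stream with Python's standard `wave` module and multiplies it by the RTE factor. Actual processing time may differ, since the text above only gives an approximation.

```python
import wave


def wav_duration_seconds(wav_stream):
    """Return the duration of a PCM WAV stream in seconds."""
    with wave.open(wav_stream, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def estimated_processing_seconds(duration_seconds, rte=1.0):
    """Estimate processing time as recording length times the RTE factor."""
    return duration_seconds * rte
```

With an RTE of 1, a 60-second recording gives an estimate of 60 seconds, which can serve as the initial delay before the first status check.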


Response - Processing status

{
  "job": {
    "job_id": 12,
    "created_at": "2020-12-16 15:26:39",
    "audio_file": "xdroid-voiceanalytics-sample_20201216152639.wav",
    "status": "processing" 
  },
  "results": []
}

Once the job status reaches analyzed, the response contains the analytics results in the results block, formatted as a JSON array.


Response - Analyzed status with results

{
  "job": {
    "job_id": 12,
    "created_at": "2020-12-16 15:26:39",
    "audio_file": "xdroid-voiceanalytics-sample_20201216152639.wav",
    "status": "analyzed"
  },
  "results": [
    {
      "data_type": "TRANSCRIPT",
      "data_channel": 1,
      "data_value": "welcome",
      "data_detect_start": 570,
      "data_detect_end": 1020,
      "data_length": 450,
      "data_probability": 1
    },
    {
      "data_type": "TRANSCRIPT",
      "data_channel": 1,
      "data_value": "to",
      "data_detect_start": 1020,
      "data_detect_end": 1140,
      "data_length": 120,
      "data_probability": 1
    },
    {
      "data_type": "TRANSCRIPT",
      "data_channel": 1,
      "data_value": "voice",
      "data_detect_start": 1140,
      "data_detect_end": 1440,
      "data_length": 300,
      "data_probability": 1
    },
    {
      "data_type": "TRANSCRIPT",
      "data_channel": 1,
      "data_value": "analytics",
      "data_detect_start": 1440,
      "data_detect_end": 2050,
      "data_length": 610,
      "data_probability": 1
    }
  ]
}

Parameter	Description
`data_type`	Type of data. `TRANSCRIPT`: word-level transcription, where `data_value` contains the word and `data_probability` is the internal confidence level.
`data_channel`	Detected channel in stereo: 1 = first channel, 2 = second channel. Example: `1`
`data_value`	Transcribed word. Example: `welcome`
`data_detect_start`	Start time in milliseconds. Example: `8820`
`data_detect_end`	End time in milliseconds. Example: `9070`
`data_length`	Length of block in milliseconds. Example: `250`
`data_probability`	Probability of result. Example: `0.83`
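
The word-level entries can be joined back into readable text per channel. The following sketch is illustrative: the function name `transcript_by_channel` and the `min_probability` threshold are assumptions, and only the `TRANSCRIPT` data type documented above is handled; entries of any other type are skipped.

```python
from collections import defaultdict


def transcript_by_channel(results, min_probability=0.0):
    """Join TRANSCRIPT words per channel, ordered by start time.

    Returns a dict mapping the channel number to the transcribed text.
    """
    words = defaultdict(list)
    for item in results:
        if item["data_type"] != "TRANSCRIPT":
            continue  # only word-level transcription entries are used
        if item["data_probability"] < min_probability:
            continue  # drop low-confidence words when a threshold is set
        words[item["data_channel"]].append(
            (item["data_detect_start"], item["data_value"])
        )
    return {
        channel: " ".join(word for _, word in sorted(entries))
        for channel, entries in words.items()
    }
```

Applied to the `results` array of the analyzed response example, this gives `welcome to voice analytics` for channel 1.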
