Spitch Lingware Test Node

Quickstart integration documentation and basic usage examples.

Basic information about this node

Introduction

The following sections provide a brief introduction to the Spitch integration test node, as well as a number of quickstart examples. This node is intended for technical integration tests only, not for accuracy evaluations. The node supports two main modes of operation: asynchronous and synchronous.

Asynchronous mode

The asynchronous usage pattern typically involves two steps:

  1. Submit a file+job request to the Speech-To-Text (STT) queue. This returns a job ID.
  2. Poll the queue manager for progress using the job ID.
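The two-step pattern above can be sketched as a small polling loop. This is an illustrative sketch only: the `submit` and `poll` callables stand in for the actual HTTP calls against the /stt/async and /progress endpoints, and their names and signatures are assumptions, not part of the API.

```python
import time

def submit_and_poll(submit, poll, interval=2.0, timeout=60.0):
    """Submit an STT job, then poll until it reaches a terminal state.

    `submit` and `poll` are injected callables standing in for the HTTP
    transport; this keeps the sketch self-contained and testable.
    """
    job = submit()                 # e.g. {"status": "job_added", "job_id": "..."}
    job_id = job["job_id"]
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        progress = poll(job_id)    # e.g. {"status": "started", ...}
        if progress["status"] in ("finished", "failed"):
            return progress
        time.sleep(interval)       # partial results may appear while "started"
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")

# Demonstration with a stubbed transport:
_states = iter(["queued", "started", "finished"])
result = submit_and_poll(
    submit=lambda: {"status": "job_added", "job_id": "demo"},
    poll=lambda job_id: {"status": next(_states)},
    interval=0.0,
)
print(result["status"])  # finished
```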

The asynchronous mode internally performs silence/VAD-based segmentation, then submits the individual segments to the STT queue for recognition. The results are reassembled, along with segment timestamps, in the final response. For extra-long jobs, polling the job ID will return partial results as they become available. In asynchronous mode the manager will also make a best effort to transcode arbitrary input codecs into a format appropriate for the backend.

The asynchronous mode attempts to manage the STT load internally and is especially appropriate for long utterances. Note, however, that job results are ephemeral: in the test node they are stored by default for only twenty-four hours, after which all information related to the job, including associated files, requests, and results, is purged from the local database.

Synchronous mode

The synchronous mode is essentially 'headless' access to the STT backends. It is not appropriate for long utterances, and it performs no resource management whatsoever. The headless backend only understands 8kHz, mono, 16-bit, signed-linear PCM data, optionally with a WAV header. The advantage of synchronous mode is reduced latency. The synchronous mode currently also returns supplementary information which is not propagated to asynchronous responses, namely word timestamps and N-best results. Support for these features in asynchronous mode is planned for a future release.
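Since the headless backend accepts only one audio format, it can be worth checking a file before sending it. The following sketch uses Python's standard-library wave module to verify the 8kHz/mono/16-bit constraint; it is a convenience check on the client side, not part of the node's API.

```python
import io
import wave

REQUIRED = {"rate": 8000, "channels": 1, "sampwidth": 2}  # 8kHz, mono, 16-bit

def pcm_format_ok(wav_bytes: bytes) -> bool:
    """Return True if the WAV payload matches what the headless
    backend expects: 8kHz, mono, 16-bit signed-linear PCM."""
    with wave.open(io.BytesIO(wav_bytes)) as w:
        return (w.getframerate() == REQUIRED["rate"]
                and w.getnchannels() == REQUIRED["channels"]
                and w.getsampwidth() == REQUIRED["sampwidth"])

# Build a tiny in-memory WAV to demonstrate:
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(b"\x00\x00" * 80)  # 10 ms of silence
print(pcm_format_ok(buf.getvalue()))  # True
```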

Node Data

Note: TTL values are in seconds and memory values in megabytes.

Available Lingware Applications

Note: the port information is only needed for accessing a backend in raw/headless mode. This access mode performs no reformatting, and no management of simultaneous requests. UAYOR.

Quickstart Examples

Example Audio File

Queue an asynchronous ASR request

All parameters in the basic queueing request are required. Short descriptions of these are provided in the following table.

  1. Audio-Content-Type: audio/x-pcm; rate=8000. Fixed value.
  2. Authorization: a valid Bearer token, which will have been provided separately.
  3. Transfer-Encoding: chunked. The transfer encoding must be set to chunked mode.
  4. X-Type: the MIME type of the input; prefer 'wav'. The server will make a best effort to convert anything it receives to 'wav', so a low-quality mp3 file will be accepted; however, for best results the test audio should match that used to train the reference models, and in general higher quality is better.
  5. X-Host: the name of a valid, provisioned ASR application currently available on the target node, e.g. call-center.en-US.8k.
$ wget https://s3-eu-west-1.amazonaws.com/spitch-example-data/call-0001.wav
$ curl -k \
    -H "Audio-Content-Type: audio/x-pcm; rate=8000" \
    -H "Authorization: Bearer ${ACCESS_TOKEN}" \
    -H "Transfer-Encoding: chunked"         \
    -H "X-Type: wav" \
    -H "X-Host: call-center.en-US.8k" \
    --data-binary @call-0001.wav \
    https://testnode-3hb7ta.spitch.ch/stt/async
{
  "status": "job_added", 
  "job_id": "710fe59d-5584-4a46-aa43-50ecae9f34a7", 
  "response": "ASR/Segmentation job added to queue."
}
Note: the server will attempt to automatically segment longer audio files based on identified silence regions, so the final result may be composed of multiple segments. Although the silence/VAD level is reported, it cannot currently be tuned by the user. Partial results will also be made available as segments complete.
Note: the server will not cache the provided recordings. Uploads are discarded by default immediately after the decoding process terminates. This may be reconfigured upon request.

Query the result from an asynchronous request

Note: make sure you use the job ID you received in the response to your own STT request! Do not copy-paste the IDs below!

$ curl -k -H "Authorization: Bearer ${ACCESS_TOKEN}" \
    "https://testnode-3hb7ta.spitch.ch/progress?id=710fe59d-5584-4a46-aa43-50ecae9f34a7" \
    | python -m json.tool
{
    "response": {
        "annotationDatum": {
            "segment-0_00": {
                "app": "call-center.en-US.8k",
                "endtime": "2017-02-28_14:28:30.340615UTC",
                "index": 0,
                "job_id": "0957a868-0f1a-41ad-8fd9-cbec25175ff4",
                "region": {
                    "end": 5.34,
                    "start": 0.0
                },
                "srate": 8000,
                "starttime": "2017-02-28_14:28:27.340831UTC",
                "transcription": "hi this is joe calling from aaa dispatch how are you",
                "vad_level": 0
            }
        },
        "endtime": "2017-02-28_14:28:31.343090UTC",
        "filename": "20170228142826985277-zorrSPBe.wav",
        "segments": 1,
        "starttime": "2017-02-28_14:28:27.329332UTC",
        "version": "0.01a"
    },
    "status": "finished"
}
In addition to the "finished" status, the queueing server may return the values "queued", "started", and "failed".
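Once a job reports "finished", the per-segment transcriptions can be joined into a single transcript. The helper below is an illustrative sketch based on the response shape shown above; the function name is an assumption, not part of the API.

```python
def full_transcript(progress: dict) -> str:
    """Join per-segment transcriptions from a finished /progress
    response, ordered by segment start time."""
    if progress.get("status") != "finished":
        raise ValueError(f"job not finished: {progress.get('status')}")
    segments = progress["response"]["annotationDatum"].values()
    ordered = sorted(segments, key=lambda s: s["region"]["start"])
    return " ".join(s["transcription"] for s in ordered)

# Demonstration with a trimmed-down copy of the response above:
demo = {
    "status": "finished",
    "response": {"annotationDatum": {
        "segment-0_00": {
            "region": {"start": 0.0, "end": 5.34},
            "transcription": "hi this is joe calling from aaa dispatch how are you",
        },
    }},
}
print(full_transcript(demo))  # hi this is joe calling from aaa dispatch how are you
```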

Queue an asynchronous ASR request with customized decoding parameters

Each backend application typically specifies optimal parameters which have been tuned to balance speed and accuracy for a particular application target domain.

It is, however, possible to send custom parameters to the backend with each request in order to tune the search. The available optional parameters are described below. They must be specified as custom HTTP headers in the request.

  1. X-Beam: integer between 1 and 14, inclusive. Indicates how 'hard' to search. Recommend 11-13 in most cases. Accuracy typically plateaus, but the real-time factor does not.
  2. X-Acoustic-Scale: float between 0.0 and 1.0, inclusive. Balances confidence in the acoustic model versus the language model. Recommend 0.07 in most cases. Smaller values put more emphasis on the language model.
  3. X-Band: integer between 2000 and 15000, inclusive. Indicates the desired parallel breadth of the search. Recommend 7500 in most cases. Its impact is similar to the beam's.
  4. X-Insertion-Penalty: float between -10.0 and 10.0, inclusive. Indicates how 'easy' it should be to insert new words. Recommend 0.0 in most cases. For noisy data a small positive value up to 1.0 can sometimes help reduce erroneous insertions. Negative values typically slow the search.
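A client can validate these ranges before issuing a request. The sketch below builds the custom header set from keyword arguments; the helper and its argument names are illustrative conveniences, only the header names and ranges come from the table above.

```python
# Allowed ranges from the table above; keys are the custom HTTP headers.
RANGES = {
    "X-Beam": (int, 1, 14),
    "X-Acoustic-Scale": (float, 0.0, 1.0),
    "X-Band": (int, 2000, 15000),
    "X-Insertion-Penalty": (float, -10.0, 10.0),
}

def decoding_headers(**params) -> dict:
    """Turn keyword args like beam=9 into validated X-* header strings."""
    headers = {}
    for name, value in params.items():
        header = "X-" + "-".join(w.capitalize() for w in name.split("_"))
        typ, lo, hi = RANGES[header]
        value = typ(value)
        if not lo <= value <= hi:
            raise ValueError(f"{header}={value} outside [{lo}, {hi}]")
        headers[header] = str(value)
    return headers

print(decoding_headers(beam=9, acoustic_scale=0.07, band=6500, insertion_penalty=0.5))
```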
$ curl -k \
    -H "Audio-Content-Type: audio/x-pcm; rate=8000" \
    -H "Authorization: Bearer ${ACCESS_TOKEN}" \
    -H "Transfer-Encoding: chunked"         \
    -H "X-Type: wav" \
    -H "X-Host: call-center.en-US.8k" \
    -H "X-Band: 6500" \
    -H "X-Acoustic-Scale: 0.07" \
    -H "X-Beam: 9"         \
    -H "X-Insertion-Penalty: 0.5" \
    --data-binary @call-0001.wav \
    https://testnode-3hb7ta.spitch.ch/stt/async
{
    "job_id": "d00437b2-ff0f-48d2-bf0d-ed098d140d97",
    "response": "ASR/Segmentation job added to queue.",
    "status": "job_added"
}

Make a synchronous ASR request with customized decoding parameters

The synchronous API is typically not exposed; on this test node it is exposed for experimentation purposes, but should be used with caution. Requests are not queued, and no resource management is performed. Long-audio requests are discouraged and may cause the system to crash. There is no authorization or rate limiting. The audio file MUST be provided in the correct format and sample rate (i.e. 8kHz, mono, 16-bit PCM). Results are provided in the raw backend XML format. UAYOR!

$ curl \
      -H "Content-Type: audio/x-pcm; rate=8000" \
      -H "Transfer-Encoding: chunked" \
      -H "Host: call-center.en-US.8k" \
      -H "X-Band: 6500" \
      -H "X-Acoustic-Scale: 0.07" \
      -H "X-Beam: 9" \
      -H "X-Insertion-Penalty: 0.5" \
      --data-binary @call-0001.wav \
      testnode-3hb7ta.spitch.ch:10032 \
      | xmllint --format -
<?xml version="1.0" encoding="UTF-8"?>
<result>
  <interpretation grammar="session:request1@form-level.store" confidence="184.280731">
    <input mode="speech">hi this is joe calling from aaa dispatch how are you</input>
    <instance>
      <dummy confidence="1.0"/>
      <SWI_literal>hi this is joe calling from aaa dispatch how are you</SWI_literal>
      <SWI_spoken>hi this is joe calling from aaa dispatch how are you</SWI_spoken>
      <SWI_meaning/>
      <SWI_ssmMeanings>
        <dummy/>
      </SWI_ssmMeanings>
      <SWI_ssmConfidences>
        <dummy/>
      </SWI_ssmConfidences>
...
...
    </instance>
  </interpretation>
</result>
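The raw XML can be parsed with any standard XML library. The sketch below pulls out the recognized text and the interpretation confidence using Python's xml.etree; it is based on the response shape shown above, and the helper name is an assumption.

```python
import xml.etree.ElementTree as ET

def parse_sync_result(xml_text: str):
    """Extract the recognized literal and the interpretation
    confidence from the raw backend XML result."""
    root = ET.fromstring(xml_text)
    interp = root.find("interpretation")
    literal = interp.findtext(".//SWI_literal")
    return literal, float(interp.get("confidence"))

# Demonstration with a trimmed-down copy of the result above:
sample = """<result>
  <interpretation grammar="session:request1@form-level.store" confidence="184.280731">
    <input mode="speech">hi this is joe calling from aaa dispatch how are you</input>
    <instance>
      <SWI_literal>hi this is joe calling from aaa dispatch how are you</SWI_literal>
    </instance>
  </interpretation>
</result>"""
text, conf = parse_sync_result(sample)
print(text)  # hi this is joe calling from aaa dispatch how are you
```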

Configure FreeSWITCH UniMRCP client

This test node is provisioned with a UniMRCP server configured to use the same call center English ASR models employed in the preceding examples. When used with FreeSWITCH, this configuration file can be stored in freeswitch/conf/mrcp_profiles/spitch-test-node-mrcp.xml or similar. FreeSWITCH must be reloaded for changes to take effect, and the UniMRCP plugin for FreeSWITCH is required. Similar configurations should be possible with any other MRCPv2-compliant PBX.

<include>
  <!-- Spitch MRCPv2 -->
  <profile name="spitch-mrcp-v2" version="2">
    <!-- External IP address of your FreeSWITCH instance -->
    <param name="client-ext-ip" value="EXTERNAL-IP"/>
    <!-- Internal IP address of your FreeSWITCH instance -->
    <param name="client-ip" value="INTERNAL-IP"/>
    <param name="client-port" value="5093"/>

    <!-- External IP address of the Spitch Test Node -->
    <param name="server-ip" value="testnode-3hb7ta.spitch.ch"/>

    <!-- SIP port for the Spitch Test Node -->
    <param name="server-port" value="10061"/>
    <param name="sip-transport" value="tcp"/>

    <!-- External rtp address of your FreeSWITCH instance -->
    <param name="rtp-ext-ip" value="EXTERNAL-IP"/>
    <!-- Internal rtp address of your FreeSWITCH instance -->
    <param name="rtp-ip" value="INTERNAL-IP"/>

    <!-- Port range for rtp media communications with Spitch Test Node -->
    <param name="rtp-port-min" value="5000"/>
    <param name="rtp-port-max" value="6000"/>
    <!-- enable/disable rtcp support -->
    <param name="rtcp" value="0"/>
    <!-- rtcp bye policies (rtcp must be enabled first)
          0 - disable rtcp bye
          1 - send rtcp bye at the end of session
          2 - send rtcp bye also at the end of each talkspurt (input)
    -->
    <param name="rtcp-bye" value="0"/>
    <!-- rtcp transmission interval in msec (set 0 to disable) -->
    <param name="rtcp-tx-interval" value="5000"/>
    <!-- period (timeout) to check for new rtcp messages in msec (set 0 to disable) -->
    <param name="rtcp-rx-resolution" value="1000"/>
    <!--param name="playout-delay" value="50"/-->
    <!--param name="max-playout-delay" value="200"/-->
    <!--param name="ptime" value="20"/-->
    <param name="codecs" value="PCMA"/>

    <!-- Add any default MRCP params for SPEAK requests here -->
    <synthparams>
    </synthparams>

    <!-- Add any default MRCP params for RECOGNIZE requests here -->
    <recogparams>
      <!--param name="start-input-timers" value="false"/-->
      <param name="save_waveform" value="true"/>
    </recogparams>
  </profile>
</include>
Example Audio File for Verification Enrollment

Voice Verification: Enroll, Test, Update, Delete

The Spitch VeryFi Voice Verification system requires approximately 60 seconds of speech from a new speaker in order to create a voice print. The quality and robustness of the print will typically be improved by using multiple recordings from a variety of environments and across several days. The present example uses a single utterance for simplicity. The examples illustrate how to enroll, test, update, and delete a voiceprint.

  1. Content-Type: audio/x-pcm; rate=8000. Fixed value.
  2. Enroll: true|false|update|delete. The command for the Spitch VeryFi server: 'true' enrolls a new print, 'false' tests an existing print, 'update' updates an existing print, and 'delete' deletes an existing voice print.
  3. Transfer-Encoding: chunked. The transfer encoding must be set to chunked mode.
  4. Gender: male|female. Currently only the 'male' value is supported; models are merged.
  5. Speaker-Id: an ASCII string naming the target voice print. A new print will be created under this name if one does not already exist.
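All four VeryFi operations use the same header set and differ only in the Enroll value, so a client can share one helper across them. The sketch below is illustrative; the `command` names are a convenience mapping onto the Enroll values from the table above, not part of the API.

```python
# Maps friendly command names onto the documented Enroll header values.
COMMANDS = {"enroll": "true", "test": "false", "update": "update", "delete": "delete"}

def verify_headers(command: str, speaker_id: str, gender: str = "male") -> dict:
    """Build the header set for a VeryFi request."""
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command}")
    return {
        "Content-Type": "audio/x-pcm; rate=8000",
        "Transfer-Encoding": "chunked",
        "Enroll": COMMANDS[command],
        "Gender": gender,
        "Speaker-Id": speaker_id,
    }

print(verify_headers("enroll", "joe")["Enroll"])  # true
```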
Enroll a new voiceprint:
$ wget https://s3-eu-west-1.amazonaws.com/spitch-example-data/call-0002.wav
$ curl -X POST \
    -H "Content-Type: audio/x-pcm; rate=8000" \
    -H "Transfer-Encoding: chunked" \
    -H "Enroll: true" \
    -H "Gender: male" \
    -H "Speaker-Id: joe" \
    --data-binary @call-0002.wav \
    testnode-3hb7ta.spitch.ch:10042
<?xml version="1.0"?>
<result>
  Total seconds for speaker: 128.78
  Enrolled with success!
</result>

Perform a speaker verification test:

Note that we use the first, shorter file for verification testing. Evaluation works most reliably for utterances of 10s or longer. The backend returns a score as a float between 0.0 and 1.0, inclusive.
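A client will typically parse the score out of the response and compare it against an accept threshold. The sketch below does this with the standard-library XML parser; note that the 0.75 threshold is an assumption for illustration only, since the backend returns just the raw score and says nothing about where to draw the accept/reject line.

```python
import xml.etree.ElementTree as ET

ACCEPT_THRESHOLD = 0.75  # assumed value for illustration; tune per application

def accept_speaker(xml_text: str, threshold: float = ACCEPT_THRESHOLD) -> bool:
    """Parse the 0.0-1.0 score from a verification response such as
    '<result> 0.7580 </result>' and compare it to the threshold."""
    score = float(ET.fromstring(xml_text).text.strip())
    return score >= threshold

print(accept_speaker("<result>\n  0.7580\n</result>"))  # True
```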

$ curl -X POST \
    -H "Content-Type: audio/x-pcm; rate=8000" \
    -H "Transfer-Encoding: chunked" \
    -H "Enroll: false" \
    -H "Gender: male" \
    -H "Speaker-Id: joe" \
    --data-binary @call-0001.wav \
    testnode-3hb7ta.spitch.ch:10042
<?xml version="1.0"?>
<result>
  0.7580
</result>

Update a voiceprint:

It is possible to update a partial or complete voiceprint. Note the change in the evaluation score after we update the base print with the testing call.

# We apply the update twice in this example only because the file is short
# and we want to see its impact on the subsequent test score.
$ for n in 1 2; do curl -X POST \
    -H "Content-Type: audio/x-pcm; rate=8000" \
    -H "Transfer-Encoding: chunked" \
    -H "Enroll: update" \
    -H "Gender: male" \
    -H "Speaker-Id: joe" \
    --data-binary @call-0001.wav \
    testnode-3hb7ta.spitch.ch:10042; \
done
<?xml version="1.0"?>
<result>
  Total seconds for speaker: 136.74
  Voiceprint updated with success!
</result>

Updating the voiceprint with new, matched/representative data should typically improve robustness:

$ curl -X POST \
    -H "Content-Type: audio/x-pcm; rate=8000" \
    -H "Transfer-Encoding: chunked" \
    -H "Enroll: false" \
    -H "Gender: male" \
    -H "Speaker-Id: joe" \
    --data-binary @call-0001.wav \
    testnode-3hb7ta.spitch.ch:10042
<?xml version="1.0"?>
<result>
  0.8280
</result>

Delete a voiceprint:
$ curl -X POST \
    -H "Content-Type: audio/x-pcm; rate=8000" \
    -H "Transfer-Encoding: chunked" \
    -H "Enroll: delete" \
    -H "Gender: male" \
    -H "Speaker-Id: joe" \
    testnode-3hb7ta.spitch.ch:10042
Voiceprint deleted