
Technology: Text-to-Speech

The underlying Microsoft SAPI 5 (Speech API) technology converts written text to spoken words.  It is now being used to produce voice-over narration for PowerPoint slides.  The resulting videos are used for instructional purposes.

This site documents the current state of this effort.

Introduction

Text-to-Speech (TTS) is the conversion of written text to spoken words.

TTS technology has matured from very robot-like speech to vocalizations that now approach the quality of a human voice.

In education, the application is to replace human voice talent with synthesized voices.  This allows learning modules that have narration to be produced and edited efficiently.  Some of the advantages of this approach include:
  • The narration can go through several rounds of editing to make sure that the intended information is accurate and appropriately conveyed.
  • Anyone can update the audio information, even long after the original production.
  • The text can be translated and the narration produced in languages other than the original.

Examples


TTS Engines

The software that does the conversion from a text file to a sound file is the TTS Engine.

To be used in the production of narrated modules, the TTS Engine must have an option to save its output to a sound file.

The TTS Engines which are being tested and which appear to meet the minimum quality qualifications include:

Note that license restrictions prevent the use of these TTS Engines for materials that will be used in public or posted on the Internet.  It may be that the voices, not the engines, are licensed.  As a result, the use of these TTS Engines is currently limited to testing the technology.

Licenses are available for some of the TTS Engines so that the voice products can be used for educational applications.  It is recommended that you contact the vendors for more specific license information and pricing.




TTS Voices

A wide variety of voices is available for use with TTS Engines.  Voices differ in characteristics (male vs. female) and nationality (American vs. British English), and many languages have their own voices.

The voices which are currently being used include:

IVONA

Several of the IVONA voices show great promise.  These work in IVONA Reader and TextAloud with similar results.
  • Brian (male, British)
  • Jennifer (female, American)



SAPI 5 Enhancements

The TTS Engines implement the underlying Microsoft SAPI (Speech API) 5 technology, which allows control information (i.e., markup tags) to be inserted into the text to improve the narration.

The markup controls which have been used in the modules are shown in the examples below.

Silence

The silence control lets you place pauses in the text.  This is useful because the break between paragraphs is a fixed interval; the silence control lets you extend it.

<silence msec="2000"/> 

provides a 2-second pause.
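When a narration script is assembled by a program, the pause tag can be generated rather than typed by hand. The following is a minimal Python sketch; the function name `silence` and the sample paragraphs are illustrative assumptions, not part of any TTS Engine.

```python
def silence(seconds: float) -> str:
    """Return a silence tag for a pause of the given length (hypothetical helper)."""
    if seconds < 0:
        raise ValueError("pause length must not be negative")
    # The msec attribute is expressed in whole milliseconds.
    msec = round(seconds * 1000)
    return f'<silence msec="{msec}"/>'


# Sample narration text (made up for illustration).
paragraphs = ["Welcome to the module.", "Let us begin with the first topic."]

# Put a 2-second pause between the paragraphs.
script = f" {silence(2)} ".join(paragraphs)
print(script)
# Welcome to the module. <silence msec="2000"/> Let us begin with the first topic.
```

Taking the length in seconds and converting it in one place keeps the pauses in a script consistent and easy to adjust.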

Emphasis

This control code changes how strongly a word or phrase is spoken.

Do as I <emphasis level="strong">say</emphasis>, not as I do.

The options for level are strong, moderate, none and reduced.
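Because only four level values are accepted, a small helper can catch a mistyped level before the text reaches the TTS Engine. This is a sketch in Python; the name `emphasis` and the default level are assumptions made for illustration.

```python
from xml.sax.saxutils import escape

# The four levels listed above.
EMPHASIS_LEVELS = ("strong", "moderate", "none", "reduced")


def emphasis(text: str, level: str = "moderate") -> str:
    """Wrap text in an emphasis tag (hypothetical helper)."""
    if level not in EMPHASIS_LEVELS:
        raise ValueError(f"unknown emphasis level: {level!r}")
    # escape() protects characters such as & and < in the spoken text.
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


print(f"Do as I {emphasis('say', 'strong')}, not as I do.")
# Do as I <emphasis level="strong">say</emphasis>, not as I do.
```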

Say As

This control code tells the TTS Engine how to pronounce special cases.  For example:

Today's date is <say-as type="date:ymd">2005/01/01</say-as>

Use a <say-as type="acronym">GPS</say-as> to determine the location.

The following list, taken from the http://developer.voicegenie.com website, shows some of the types which are available.

  • acronym - contained text is pronounced as individual characters.
  • date, date:dmy, date:mdy, date:ymd, date:ym, date:my, date:md, date:y, date:m, date:d - contained text is a date, with presence/order of year, month, and day optionally specified.
  • duration, duration:hms, duration:hm, duration:ms, duration:h, duration:m, duration:s - contained text is a time duration, with presence/order of hours, minutes, and seconds optionally specified.
  • measure - contained text is a measurement.
  • name - contained text is a proper name (i.e., person, company, etc.)
  • net - contained text is an internet handle. The net type can include one of the following formats: email, uri.
  • number, number:ordinal, number:digits - contained text is an integer, fraction, floating point, Roman numeral, or other textual format that can be interpreted and spoken as a number, with format optionally specified.
  • time, time:hms, time:hm, time:h - contained text is a time of day, with presence/order of hours, minutes, and seconds optionally specified. 
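The type names in the list above can be checked by a program before a script is sent to the TTS Engine. The sketch below, in Python, validates only the part of the type before the colon (for example `date` in `date:ymd`); the helper name `say_as` is an assumption, and whether a given engine accepts every format is not verified here.

```python
from xml.sax.saxutils import escape

# Base type names taken from the list above.
SAY_AS_BASE_TYPES = {
    "acronym", "date", "duration", "measure",
    "name", "net", "number", "time",
}


def say_as(text: str, type_: str) -> str:
    """Wrap text in a say-as tag (hypothetical helper)."""
    # Only the base type is checked; the optional format after the
    # colon (e.g. "ymd") is passed through unchanged.
    base = type_.split(":", 1)[0]
    if base not in SAY_AS_BASE_TYPES:
        raise ValueError(f"unknown say-as type: {type_!r}")
    return f'<say-as type="{type_}">{escape(text)}</say-as>'


print("Today's date is " + say_as("2005/01/01", "date:ymd"))
# Today's date is <say-as type="date:ymd">2005/01/01</say-as>
```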

Sub

The words in the alias attribute are spoken in place of the text enclosed by the sub tags.

A <sub alias="global positioning system receiver">GPS</sub> should be used.
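If the same abbreviations recur throughout a script, the substitutions can be applied from a lookup table. The following Python sketch assumes a hypothetical helper `sub` and a made-up alias table; it is one possible approach, not a feature of the TTS Engines.

```python
from xml.sax.saxutils import escape, quoteattr


def sub(text: str, alias: str) -> str:
    """Wrap text in a sub tag so that alias is spoken instead (hypothetical helper)."""
    # quoteattr() adds the surrounding quotes and escapes the attribute value.
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


# Illustrative alias table; the entries are assumptions.
ALIASES = {"GPS": "global positioning system receiver"}

sentence = "A GPS should be used."
for abbreviation, spoken in ALIASES.items():
    sentence = sentence.replace(abbreviation, sub(abbreviation, spoken))

print(sentence)
# A <sub alias="global positioning system receiver">GPS</sub> should be used.
```

A plain string replace is used here for brevity; it would also match the abbreviation inside longer words, so a real script may need a stricter match.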