Speech Synthesizer

Ever thought it could be fun to make your robot speak but don't want to pay for all those expensive chips?

Well now you can.

I will show you how to do it all in software - with a few extra bits of hardware for the loudspeaker amplifier.

 

We need to break this project down into three fundamental areas:

1 - The speech software

2 - To use the same processor, or not, as the rest of your code

3 - The different outputs and any electronics required.

 

Let's go...

 

1 - The speech software

Download the attached file....

 

 

The software currently requires about 12k of flash (program) memory and 500 bytes of sram.

This means that it is too big to load into an ATMega8 as per the standard $50 robot. But 'fear not' - see later for several deployment options.

 

The speech software is broken down into several files.

 

core.c - This contains all the code to generate speech either from English text or from a list of phonemes. Don't worry if you don't understand what this means yet - just think of it as the code that does the hard work! You should never need to modify this code. Those 'hackers' amongst us will find this the most 'interesting' as this is the file that 'does speech'. If you define a variable called _LOG_ then the system will output some information about the 'sounds' it is creating. On an AVR platform this is sent out via the UART and on a Windows system it is sent out via 'stdout'. This is mainly meant for me when debugging since the time spent outputting all this info will make the speech unintelligible.

 

english.h - This contains two vocabularies. One to help convert English text into phonemes, and another to convert phonemes into the list of their primitive sounds. This file is mainly made up of constant string definitions but looks a bit of a mess - this is due to how you need to write stuff on microcontrollers for the fixed strings to be stored in program memory. A new file, say called 'french.h', could be created to help produce a 'French speech synthesizer' without changing the rest of the code. I'll worry about doing that later - my French ain't great!! You should never need to modify this file - but, if you do, then ask me about the syntax of the dictionaries as they have some 'clever bits'.

 

buffer.*, uart.*, rprintf.* - These files are just copied from AVRlib and are used to access the UART for user intervention.

 

speech.c - This is where your main program is executed and is purposely kept small. On an AVR platform it will just wait to receive some text (via the UART) ending in a carriage return and/or line feed. This text input is echoed back over the UART. Once a carriage return / line feed has been received then it will attempt to speak the preceding text. On a non-AVR platform it will create a file to save the details of the speech synthesis. If this is a new 'sentence' then it will create a file in the newly created 'ref' folder. But if the 'ref' folder already contains a file for this sentence then it will create a new file in the 'test' folder and then compare its contents with that in the 'ref' folder. This is basically a test suite - where I can make changes to the code and verify that the software is still trying to do the same thing (ie I haven't messed it up!!).

 

 

 


The source code for the TextToSpeech synthesizer is the exclusive copyright of Clive Webster (Webbot) whose contact details are registered on this site. The source code is made available to individual hobbyists for their own use. These users are allowed to freely modify the source code for their own use but are forbidden from publishing any versions of the software to anyone else in any form including, but not limited to:

 

1. Human readable source files via any means,

2. Compiled code via Programmed integrated circuits or any other means,

 

without the express permission of the original author. Attempts to publish this code for individual or commercial profit will be treated as a violation of these terms.


 

2 - Same processor or not?

The speech synthesizer requires around 12k flash memory, and 500 bytes of SRAM memory.

 

This is bigger than an ATMega8 - what do I do?

 

If your main controller is small (ATMega8 etc) and cannot cope with all of the extra demands specified above then you will have no option but to put the speech synthesizer onto another 'board'. As mentioned later, this can be a cut-down $50 robot board with an ATMega168. All you need to do is to create a link to it via the UART so that you can tell it what to say via the serial link.
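As a rough idea of what the main board's side of that serial link could look like, here is a minimal sketch. It assumes the speech board is running the stock UART loop and speaks a line once it receives a carriage return. The names speech_send_line, uart_tx_fn and fake_uart_tx are hypothetical and are not part of the download; fake_uart_tx only records the bytes so you can try the code on a PC, and you would swap in your own UART transmit routine. Remember that the speech board echoes every character back, so your main board should be ready to read or discard the echo.

```c
#include <stddef.h>

/* Stand-in for the main board's real UART transmit routine (hypothetical).
   It just records the bytes so the sketch can be tried on a PC. */
static char sent[64];
static size_t sent_len;

static void fake_uart_tx(char c)
{
    if (sent_len < sizeof sent) {
        sent[sent_len++] = c;
    }
}

/* Pointer to whatever function transmits one byte on your UART. */
typedef void (*uart_tx_fn)(char c);

/* Send one line of text to the speech board, ending with a carriage
   return so that the board knows the sentence is complete.
   Returns the number of bytes sent, including the carriage return. */
size_t speech_send_line(uart_tx_fn tx, const char *text)
{
    size_t n = 0;
    while (*text != '\0') {
        tx(*text++);
        n++;
    }
    tx('\r');
    return n + 1;
}
```

Calling speech_send_line(fake_uart_tx, "hello") queues the five letters followed by a carriage return.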

 

If you are using an ATMega32, say, then you may have enough space to link the code in alongside your own code, in which case you can call the functions, mentioned later, directly. e.g.

say("I am running on the same board.");

 

However, my personal recommendation would be to have everything as a separate board, for the following reasons:-

1. The speech software may get bigger and so may suddenly not fit on your combined processor

2. As a separate board you can just plug it into the UART of a main board for debugging and remove it later

3. Debugging the speech board is made easier as you can plug it into 'Hyperterminal'/etc as a stand alone unit

4. Perhaps most importantly - a separate board can concentrate on doing the speech without your main board getting in the way of all of the important timings required to make the speech as good as possible. ie if it's all on the same board then the interrupts you use to drive your own program may cause the speech to 'slur'. So let 'one board speak' whilst the main board does 'robot stuff'.

 

 

 

3 - The different outputs

Using a digital device such as a processor makes it quite difficult to generate audio signals in an analogue domain. Digital devices talk about '1's and '0's whereas audio devices talk about a scalar volume as an analogue voltage. So how do we convert from one world to another?

 

At the end-of-the-day the speech program will end up trying to produce a sound wave from a volume in the range 0 to 15. Where 0=quiet and 15=full volume. Quite how it makes this 0 to 15 value available, or audible, to the outside world is one of the most difficult aspects.

 

The easy answer is that we need a DAC (digital to analogue converter). But our processor doesn't have one!

So we will have to build one!

 

There are many, many ways to do this:-

1. One wire PWM

Look back at FM Radios. They used a constant (high frequency) carrier as the transport. The audio was then imposed by making small changes to the frequency. At the receiver end, you could filter out the carrier frequency, which was much higher, to be left with just the 'audio' frequency. So if we use PWM with a very high frequency we can encode the audio on top (by changing the duty cycle) and then filter out the PWM base frequency at the receiver end to return to the audio information. Why all this encode/decode stuff? Well, it allows us to transmit a signal over a single wire that can be turned back into an audio signal. This is the most complex option, but it only needs one I/O pin from our controller and can still try to achieve the 16 volume levels to make the speech as clear as possible.

2. One wire Digital.

This is by far the easiest to understand and the cheapest to implement. Given that the volume can be in the range 0 to 15 then we can turn it into a digital signal by saying: if input is 0...7 then output=0, else if input is 8...15 then output=1. So we have changed the volume from 0...15 to 0...1. Of course, the output signal is somewhat of a distortion (or simplification) of the input signal, ie signal in = 0...15, but signal out = 0...1. So something has been lost - and it is the nuances of the sounds. But we can turn the resultant 0 or 1 into 'speaker off' or 'speaker on' commands very easily to make our sound. So we only need one I/O pin from our controller and our resultant circuitry is simple, if less effective.

3. 4 wire Digital

The previous example is 'digital' and so can only say 'sound on' or 'sound off' - ie it 'shouts' or 'is quiet'. But since the speech core can generate sound envelope volumes from 0 to 15 then how do we implement them? This option gives us 2 to the power of 4 (ie 16) possible values. If we have 4 output pins available on our controller then we can output all of the values from 0 to 15. How we convert this back to the analogue world is done by the magic of R/2R ladders. See http://en.wikipedia.org/wiki/Resistor_Ladder. This allows us to continue to use the spectrum of volumes but requires 4 I/O pins to do so.

 

The additional hardware required for each of these options is given in the following sections:-

 

 

 

3.1 - One wire PWM

This mode uses a single PWM output pin (default is port B2). The PWM is set up to oscillate at an inaudible frequency. The volume levels 0 to 15 are then used to change the duty cycle between 0 percent and 50 percent.
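To make the volume-to-duty-cycle idea concrete, here is a minimal sketch of the mapping. It assumes an 8-bit PWM where a compare value equal to 'top' would mean 100 percent duty, and it assumes the mapping is linear. The function name volume_to_compare and the 'top' parameter are hypothetical, and the real code in the download may set its registers up differently.

```c
#include <stdint.h>

/* Map a volume of 0..15 onto a PWM compare value so that
   volume 0 gives 0% duty and volume 15 gives 50% duty.
   'top' is the compare value that would give 100% duty (assumed). */
uint8_t volume_to_compare(uint8_t volume, uint8_t top)
{
    if (volume > 15) {
        volume = 15;                 /* clamp anything out of range */
    }
    /* top / 2 is the 50% point; scale linearly below that */
    return (uint8_t)(((uint16_t)volume * (top / 2)) / 15);
}
```

With top set to 255, full volume gives a compare value of 127, which is just under half of the full range.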

 

The electronics to decode this signal are fairly complex. First we use a low pass filter to filter out the high pitch carrier frequency. We then use an audio amplifier to amplify the resultant signal. I have included a potentiometer/trimmer as a volume control. If you set the volume too high then you may well get all sorts of squeal!!! You could use a breadboard to find the best setting of the trimmer and then replace it with fixed resistors.
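If you want to experiment with the low pass filter, the standard formula for a single resistor-capacitor stage is cutoff = 1 / (2 x pi x R x C). The sketch below just evaluates that formula. It assumes a simple one-stage RC filter, which may not be what the schematic uses, and the function name rc_cutoff_hz is hypothetical.

```c
/* Cutoff frequency, in Hz, of a single-stage RC low pass filter.
   r_ohms is the resistor value and c_farads the capacitor value. */
double rc_cutoff_hz(double r_ohms, double c_farads)
{
    const double pi = 3.14159265358979323846;
    return 1.0 / (2.0 * pi * r_ohms * c_farads);
}
```

As a purely illustrative example (these values are not taken from the schematic), a 4.7k resistor with a 10nF capacitor gives a cutoff of roughly 3.4kHz. The cutoff needs to sit well below the PWM frequency so that the carrier is removed and the speech is left behind.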

 

To activate this mode you need to edit global.h and make sure that the line:-

    #define _AVR_PWM_

is not commented out. You cannot change the port without digging around in the code a bit, which requires you to understand how to change the AVR registers for the different PWM modes.

 

The attached Eagle schematic shows you how to create the circuit:-

 

 

3.2 - One wire digital

This mode uses a single output pin to drive the speaker. To activate this mode edit the 'global.h' file and make sure the line

    #define _AVR_BINARY_

is not commented out.

 

You can select which IO port and pin to use in global.h by changing these lines which currently use PORTB pin 4 :-

    #ifdef _AVR_BINARY_
        #define SOUND_PORT PORTB
        #define SOUND_DDR DDRB
        #define SOUND_BIT BV(PB4)
    #endif
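To show how definitions like these might be used, here is a minimal sketch that sets the pin up as an output and then drives it from the 0 to 15 volume using the 'half way' rule described earlier. The functions sound_init and sound_write are hypothetical names, not the ones in core.c. So that you can try it on a PC, the registers are replaced by plain variables; on a real AVR they come from <avr/io.h> and BV comes from AVRlib.

```c
#include <stdint.h>

/* Stand-ins for the AVR registers so the sketch compiles on a PC.
   On the real board these come from <avr/io.h> and AVRlib. */
static uint8_t PORTB, DDRB;
#define PB4 4
#define BV(bit) (1 << (bit))

#define SOUND_PORT PORTB
#define SOUND_DDR  DDRB
#define SOUND_BIT  BV(PB4)

/* Make the sound pin an output. */
void sound_init(void)
{
    SOUND_DDR |= SOUND_BIT;
}

/* Volumes 0..7 switch the speaker off, 8..15 switch it on. */
void sound_write(uint8_t volume)
{
    if (volume >= 8) {
        SOUND_PORT |= SOUND_BIT;
    } else {
        SOUND_PORT &= (uint8_t)~SOUND_BIT;
    }
}
```

Only the one bit is touched each time, so the other pins on the port are left free for the rest of your robot.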

 

This mode only requires the following additional hardware:-

1 x 10k ohm resistor

1 x 47 ohm resistor

1 x 2N2222 or similar NPN transistor

1 x 8 ohm 0.1 watt speaker

 

The following schematic shows how to assemble the electronics. Note that we use 0v, +5v, and the I/O port pin so you can use a standard 3 way cable (like the ones you use to connect to servos or sensors).

 

3.3 - 4 wire digital

This mode allows you to use 4 contiguous output pins from the same port to output the 16 different volume levels.

 

To activate this mode edit global.h and make sure that the line:-

    #define _AVR_QUAD_

is not commented out.

 

You can change the port and pins that are used by editing the following lines in global.h

 

    #ifdef _AVR_QUAD_
        #define QUAD_PORT PORTC
        #define QUAD_DDR  DDRC
        #define QUAD_BIT  PC0
        #define QUAD_MASK (BV(QUAD_BIT) | BV(QUAD_BIT+1) | BV(QUAD_BIT+2) | BV(QUAD_BIT+3))
    #endif

The above values will use Port C pins 0, 1, 2 and 3.
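Here is a minimal sketch of how a 0 to 15 volume could be written to those four pins without disturbing the other four pins on the same port. The functions quad_init and quad_write are hypothetical names, not the ones in core.c, and the registers are plain variables so that the code can be tried on a PC. On a real AVR they come from <avr/io.h> and BV comes from AVRlib.

```c
#include <stdint.h>

/* Stand-ins for the AVR registers so the sketch compiles on a PC. */
static uint8_t PORTC, DDRC;
#define PC0 0
#define BV(bit) (1 << (bit))

#define QUAD_PORT PORTC
#define QUAD_DDR  DDRC
#define QUAD_BIT  PC0
#define QUAD_MASK (BV(QUAD_BIT) | BV(QUAD_BIT+1) | BV(QUAD_BIT+2) | BV(QUAD_BIT+3))

/* Make the four sound pins outputs. */
void quad_init(void)
{
    QUAD_DDR |= QUAD_MASK;
}

/* Put the 4 bit volume onto the four pins, leaving the rest of the
   port exactly as it was. */
void quad_write(uint8_t volume)
{
    uint8_t bits = (uint8_t)((volume & 0x0F) << QUAD_BIT);
    QUAD_PORT = (uint8_t)((QUAD_PORT & ~QUAD_MASK) | bits);
}
```

Because the volume is shifted by QUAD_BIT, changing QUAD_BIT to, say, PC2 would move the output to pins 2, 3, 4 and 5 without any other changes to this sketch.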

 

This mode may be preferable to PWM if you are using the PWM ports already, say for controlling motors, but it does need 4 I/O pins. The electronics of the audio driver board are also fairly complex. First of all we take the outputs from the pins and feed them into a resistor ladder, which acts like a digital-to-analogue converter, and generates a signal between 0v and 5v in 16 steps. Since the 'ladder' requires lots of resistors with one value, and lots more with twice that value, I have opted (in my schematic) to use several SIL Resistor arrays of 10k resistors. These things are quite cool in this scenario. A single package has 8 pins in a single line (hence SIL). Each adjacent pair of pins contains one 10k resistor. So an 8 pin package has 4 resistors. Note how I put some in parallel to generate 5k resistors in order to satisfy the requirement of the ladder. These devices are normally rated at ±2% which is better than your average resistor.
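If you want to check the ladder with a multimeter, an ideal unloaded 4 bit R/2R ladder produces a voltage of supply x value / 16. The sketch below works that out in millivolts. The function name ladder_millivolts is hypothetical, and real readings will differ a little because of resistor tolerances and whatever the ladder is connected to.

```c
#include <stdint.h>

/* Ideal output of a 4 bit R/2R ladder, in millivolts.
   code is the 0..15 value on the pins, vref_mv the supply in mV. */
uint16_t ladder_millivolts(uint8_t code, uint16_t vref_mv)
{
    return (uint16_t)(((uint32_t)(code & 0x0F) * vref_mv) / 16);
}
```

With a 5v supply each step is about 312mV, so a value of 8 gives 2500mV and the top value of 15 gives about 4687mV, which is one step short of the full 5v.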

 

The output of the ladder (ie the analogue signal) is then fed into a unity gain operational amplifier just to give it some extra oomph. Note how the op-amp has power supplied via an RC network to try to protect it from power supply noise.

 

The output of the op amp is used to drive a transistor that drives the speaker.

 

Here's the schematic:-

 

4 - Providing input

 

The previous section has talked about the different modes and their associated hardware setup. But how do you make the thing make a noise?

 

Assuming that you have created the synthesizer as a separate board then:-

 

  1. Once the software is started up it goes into a main loop that accepts characters from the UART
  2. You can therefore attach it to either another micro controller board or, via a MAX232 chip, to the serial port of your pc
  3. Any characters you type are echoed back. Once you press Enter the sentence is spoken. Any error messages are also sent back to Hyperterminal. Why would I get errors? The software works 'a line at a time' - so the line of text may be too long for it to cope. Alternatively it may contain characters that it doesn't recognise (like a backspace or a euro currency symbol).
  4. If the first character on a line is a '*' symbol then it assumes that the rest of the line is made of 'phonemes' rather than English text. See the section about the valid phonemes.
  5. If the first character on a line is a "!" symbol and is followed by a single character between A and Z then the input is treated as a command to change the default voice pitch. ie "!A" will change to the highest pitch voice and "!Z" will change to the lowest. These settings are remembered until the next reboot. The default value is "!G"
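The rules above for the first character of a line can be sketched as follows. This is only an illustration of the protocol as described, not the code from speech.c. The names line_kind and classify_line are hypothetical, and I have assumed that the letters A to Z map onto the pitch values 0 to 25 and that a malformed "!" line is reported as an error.

```c
/* What a received line of text should be treated as. */
typedef enum {
    LINE_ENGLISH,   /* ordinary English text                 */
    LINE_PHONEMES,  /* '*' followed by phonemes              */
    LINE_PITCH,     /* '!' followed by one letter A..Z       */
    LINE_ERROR      /* '!' followed by anything else         */
} line_kind;

/* Look at the start of a line and decide what it is.
   For a pitch command, *pitch is set to 0 for 'A' up to 25 for 'Z'
   (an assumed mapping); otherwise *pitch is left alone. */
line_kind classify_line(const char *line, int *pitch)
{
    if (line[0] == '*') {
        return LINE_PHONEMES;
    }
    if (line[0] == '!') {
        if (line[1] >= 'A' && line[1] <= 'Z' && line[2] == '\0') {
            *pitch = line[1] - 'A';
            return LINE_PITCH;
        }
        return LINE_ERROR;
    }
    return LINE_ENGLISH;
}
```

Under that assumed mapping the default voice "!G" comes out as a pitch value of 6. For a phoneme line, the phonemes themselves start at line + 1, just after the '*'.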

If you have linked the code into your own code then there is no main loop. Instead you can call the following methods:-

  1. say("An English phrase");
  2. speak("Some phonemes");
  3. setPitch(val); where val is 0 to 25, to change the default voice pitch

Note that, by default, the UART is programmed to use 19200 baud, 8 data, no parity, and 1 stop bit. The baud rate can be changed in 'core.c' in the method called 'init'.
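If you do change the baud rate it is worth checking that your clock speed can actually produce it. In the normal (not double speed) asynchronous mode the AVR works out the baud rate as clock / (16 x (UBRR + 1)). The sketch below finds the nearest register value and the baud rate you really get. The function names are hypothetical and the code in 'init' may calculate it differently.

```c
#include <stdint.h>

/* Nearest UBRR register value for the requested baud rate
   (normal speed asynchronous mode). */
uint16_t ubrr_for_baud(uint32_t f_cpu, uint32_t baud)
{
    return (uint16_t)((f_cpu + baud * 8UL) / (baud * 16UL) - 1UL);
}

/* The baud rate that a given UBRR value really produces. */
uint32_t actual_baud(uint32_t f_cpu, uint16_t ubrr)
{
    return f_cpu / (16UL * ((uint32_t)ubrr + 1UL));
}
```

For an 8MHz clock and 19200 baud this gives a register value of 25 and a real rate of about 19230 baud, which is less than 0.2 percent out and close enough for reliable communication.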

 

 

 

5 - The $50 robot

So you've built the $50 robot board - so how do you add the ability for speech?

 

First the downside - the program is just 'too big' to fit inside an ATMega8

 

Then the upside - you can replace the ATMega8 in your $50 Robot with an ATMega168. They are plug in compatible - ie you can plug in the ATMega168 instead of the ATMega8 without making any other changes. The difference is that the ATMega168 can hold 16k of program (Speech needs about 12k) whereas the ATMega8 can only hold 8k of program.

 

Since the program still eats up most of an ATMega168 I would recommend building a dedicated board. This can use the standard $50 design (with the ATMega168). However, you can save on all those header pins as the only ones you need are Port D0 and D1 (which are the UART receive and transmit). These are the first two header pins at the top left of your microcontroller. You will then need either 1 or 4 I/O pins depending on the mode in which it is driven. The easiest, and cheapest, solution is to use my 1 wire digital mode. This means you could use any one of the other pin headers.

 

 

 

6 - Compiling the program

I have provided a 'makefile' to allow you to compile the program. From a command prompt: change to the directory where you have installed the source and type 'make'.

 

You can also use AVRStudio but you need to make sure that you go into 'Project, Configuration Options', select the 'Custom Options' and add the following options:

-D_AVR_

-DF_CPU=8000000

 

The first option says you are building for an AVR platform and the second option gives your processor clock speed (in this case 8MHz)
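In case you have not met '-D' options before: each one behaves as if a #define had been written at the top of every source file, and the code can then test for it. The sketch below shows the general idea only. The helper names build_platform and cpu_mhz are hypothetical and are not in the download, and the fallback value for F_CPU is just there so that the sketch compiles when no option is given.

```c
/* Fall back to 8MHz if the compiler was not given -DF_CPU=... */
#ifndef F_CPU
#define F_CPU 8000000UL
#endif

/* Reports which platform the code was built for. */
const char *build_platform(void)
{
#ifdef _AVR_
    return "AVR";   /* compiled with -D_AVR_ */
#else
    return "PC";    /* compiled without it   */
#endif
}

/* The clock speed, in MHz, that the code was built for. */
unsigned long cpu_mhz(void)
{
    return F_CPU / 1000000UL;
}
```

If F_CPU does not match the clock your processor is really running at then anything based on timing, such as the baud rate, will be wrong.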

 

You can also use another compiler, say Microsoft Visual C++, to compile the program. If you do this then it won't make any sounds when run - but will just log stuff out to a file. I use this for debugging.

7 - Phoneme appendix

If you are programmatically calling the speak method, or your text starts with a '*' character, then the following text is made up of phonemes with optional pitch numbers. If you are just 'playing' then it is very likely that you may get an error message - because you have used a phoneme that is not in the following list.

 

The valid phonemes are:-

AY as in 'pale'

AE as in 'black'

AA as in 'car'

AI as in 'fair'

EE as in 'meet'

EH as in 'get'

ER as in 'perk'

IY as in 'site'

IX as in 'sit'

IH as in 'sit!!'

OW as in 'coat'

O as in 'cot'

UX as in 'coot'

OY as in 'voice'

AW as in 'now'

AO as in 'door'

OH as in 'won'

UW as in 'you'

/U as in 'put'

UH as in 'wood'

AH as in 'up'

B as in 'bat'

D as in 'dab'

F as in 'fat'

G as in 'gap'

/H as in 'hat'

J as in 'jab'

K as in 'cat'

L as in 'lag'

M as in 'mat'

N as in 'nap'

P as in 'pat'

R as in 'rat'

S as in 'sat'

T as in 'tap'

V as in 'vat'

W as in 'wag'

Y as in 'yap'

Z as in 'zap'

CH as in 'chair'

DH as in 'this'

SH as in 'share'

TH as in 'thick'

ZH as in 'azure'

CT as in 'fact'

DR as in 'dragon'

DUX as in 'duke'

NX as in 'sing'

TR as in 'track'

 

When using phonemes you can optionally append a number from 1 to 8 to change the pitch of the phoneme.

So to say 'Welcome everyone' using phonemes we could use 'WEH4LKAHM EH3VREEWON'
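To show how a string like that breaks down, here is a minimal sketch that splits phoneme text into individual phonemes and their optional pitch numbers. It is not the parser from core.c. It assumes that the longest matching phoneme name wins (so 'DUX' is chosen over 'D' followed by 'UX') and that spaces simply separate words. The names phoneme_token and parse_phonemes are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Phoneme names copied from the list above. */
static const char *const PHONEMES[] = {
    "AY", "AE", "AA", "AI", "EE", "EH", "ER", "IY", "IX", "IH",
    "OW", "O", "UX", "OY", "AW", "AO", "OH", "UW", "/U", "UH", "AH",
    "B", "D", "F", "G", "/H", "J", "K", "L", "M", "N", "P", "R", "S",
    "T", "V", "W", "Y", "Z", "CH", "DH", "SH", "TH", "ZH", "CT",
    "DR", "DUX", "NX", "TR"
};
#define PHONEME_COUNT (sizeof PHONEMES / sizeof PHONEMES[0])

typedef struct {
    const char *name;   /* points at an entry in PHONEMES         */
    int pitch;          /* 1..8, or 0 if no pitch number was given */
} phoneme_token;

/* Split 'text' into phonemes. Returns how many were found, or -1 if
   an unknown phoneme is met or there is no room left in 'out'. */
int parse_phonemes(const char *text, phoneme_token *out, int max)
{
    int count = 0;
    while (*text != '\0') {
        if (*text == ' ') {          /* skip the gaps between words */
            text++;
            continue;
        }
        const char *best = NULL;
        size_t best_len = 0;
        for (size_t i = 0; i < PHONEME_COUNT; i++) {
            size_t len = strlen(PHONEMES[i]);
            if (len > best_len && strncmp(text, PHONEMES[i], len) == 0) {
                best = PHONEMES[i];  /* longest match so far */
                best_len = len;
            }
        }
        if (best == NULL || count >= max) {
            return -1;
        }
        text += best_len;
        out[count].name = best;
        out[count].pitch = 0;
        if (*text >= '1' && *text <= '8') {
            out[count].pitch = *text++ - '0';
        }
        count++;
    }
    return count;
}
```

Run on 'WEH4LKAHM EH3VREEWON' this finds 13 phonemes: W, EH (pitch 4), L, K, AH, M and then EH (pitch 3), V, R, EE, W, O, N.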