Help to convert a blank ATmega328 into a speech synthesizer chip for under $5
March 8, 2014
Some time ago, when I was looking at the cost of the SpeakJet and other hardware speech systems, I remembered that back in the 80's there was a program called S.A.M. (Software Automatic Mouth) that ran on my Atari XL. It was also available for the C64, Apple and possibly some other systems.
The thing is that my ATARI computer ran at 2.5MHz and only had 16K of memory so it made me wonder if S.A.M. could run on an Arduino. I did some research and found an interesting explanation of how S.A.M. worked using such a small amount of memory. Below is an excerpt from the instruction manual downloaded from here: http://www.retrobits.net/atari/sam.shtml.
How it works:
WHAT AM I HEARING?
In recent years, many new speech synthesizers have appeared in the marketplace. The techniques they use vary widely depending on the intended application. Most synthesizers found in consumer products, such as talking televisions or microwave ovens, use a "speech compression" technique of one sort or another. These techniques require a person to speak the needed words or entire sentences. The speech waveform is then "compressed" using a mathematical algorithm and, as a result, can then be stored in a memory chip without taking up a lot of room. The synthesizer's job is to then take this compressed speech information and expand it back into the original waveform. Some of these systems work quite well, retaining the speaker's intonation and sometimes even his or her identity. The processes used in such synthesizers differ greatly from those used in unlimited vocabulary synthesizers like S.A.M.
Let's follow the evolution of an unlimited vocabulary speech synthesizer. First, we must define the task. Simply, we want to create a system that will synthesize any English utterance. One way to begin would be to record every possible utterance on tape and just play back the right one whenever we need it. This would take up more tape or computer memory than could ever exist, so this method is obviously not too practical.
The next method might be to record all the English words and play them back in a specific order to create sentences. This is certainly practical. It would take up a large amount of memory, but it would work. However, we have lost something in this process. The words now sound disjointed because we have "spliced" the sentence together. Also, the stress or inflection pattern of the sentence is either wrong or non-existent. If we wanted an accurate stress pattern, we would need to record every word in a number of different styles, at different pitches, etc.
Such a system needs too much memory. So, let's break things down even further and try to store as little as possible in memory. Instead of storing sentences or words or even syllables, we could store phonemes. Phonemes are the atoms of spoken language, the individual speech sounds. It turns out that English has a little over forty of them. Wow -- this takes up practically no memory at all! We could specify the phonemes in the order we need to create words and sentences and really have ourselves a system. So, we go and record the phonemes and play them back to say the sentence, "I am a computer." Why can we barely understand it? It seems we have broken things down a bit too far. When we chop the words down to this level and then try to reassemble them, everything that blends one sound into another is lost and the results are nothing less than horrible.
But all is not lost. Our efforts are not wasted because we have the acoustic phonetician to come to our rescue. These people deal in the study of speech sounds and they can tell us just how to repair our phoneme-based system. First, instead of recording the actual speech waveform, we only store the frequency spectrums. By doing this, we save memory and pick up other advantages. Second, we learn that we need to store some data about timing. These are numbers pertaining to the duration of each phoneme under different circumstances, and also some data on transition times so we can know how to blend a phoneme into its neighbors. Third, we devise a system of rules to deal with all this data and, much to our amazement, our computer is babbling in no time.
The advantages in synthesizing speech in this way are tremendous. We use very little memory for all the data and the rules to use that data, and we also gain the ability to specify inflection, timing, and intonation. This is because we have not stored actual speech sounds, only their spectrums. (You can think of this as a printer needing only four colors of ink to reproduce all the colors in a picture.)
Now, in actuality, we do not store all the spectrums, but only those that are targets. Each phoneme has associated with it a target spectrum which can be specified with very little data. The target may be thought of as a "frozen" speech sound, the sound you would be making if your mouth was frozen exactly in the middle of pronouncing the phoneme. The timing rules tell the synthesizer how to move from target to target in a manner that imitates the timing of a human talker.
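To make the idea of moving from target to target concrete, here is a minimal sketch in C. It assumes each target can be reduced to three formant frequencies and that a plain linear blend is good enough. The type target_t, the function names and any example frequencies are made up for illustration and are not taken from S.A.M. or from Sebastian's code.

```c
#include <stdint.h>

/* Hypothetical "frozen" target for one phoneme: three formant
   frequencies in Hz. The real S.A.M. tables may look different. */
typedef struct {
    uint16_t f1, f2, f3;
} target_t;

/* Linear blend from a to b. step runs from 0 (all a) to steps (all b). */
static uint16_t lerp_u16(uint16_t a, uint16_t b, uint8_t step, uint8_t steps)
{
    int32_t diff;

    if (steps == 0) {
        return b;                       /* no transition time: jump to b */
    }
    diff = (int32_t)b - (int32_t)a;     /* may be negative */
    return (uint16_t)((int32_t)a + diff * step / steps);
}

/* Blend every formant of target a towards target b. */
target_t blend_targets(target_t a, target_t b, uint8_t step, uint8_t steps)
{
    target_t out;

    out.f1 = lerp_u16(a.f1, b.f1, step, steps);
    out.f2 = lerp_u16(a.f2, b.f2, step, steps);
    out.f3 = lerp_u16(a.f3, b.f3, step, steps);
    return out;
}
```

A timing rule would then choose steps for each pair of neighbouring phonemes, and the synthesizer would call blend_targets() once per frame during the transition.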
S.A.M. is this type of synthesizer implemented entirely in software. It has the tables of phoneme spectra and timing, together with the rules for using this data to blend the sounds together into any English utterance we may have in mind. We have traded some quality from the method using all the recorded words, but what we have gained is versatility, practicality, and the ability to do it all in real time, with very little memory usage, on an inexpensive microcomputer.
As it turns out, Sebastian Macke has been doing some work on S.A.M. and has some C code which he says will compile with AVR-GCC and, without the reciter, comes to approximately 28K. The output is a stream of simple 4-bit amplitudes at 22050Hz.
Some people might say that at 28K it is not much use, as their Arduino only has about 28K free after allowing for the bootloader. But if we can squeeze this into 32K along with a library for the I2C interface or serial port, then we can burn it directly onto an ATmega328 chip without the bootloader, and you end up with a speech synthesizer chip for less than $5! It is then quite easy to add a small speaker or amplifier.
Like all of us, Sebastian has very limited time to spend on this project, and he does not have an Arduino, nor has he used one before. So if we want a purely software speech synthesizer for Arduino at a fraction of the price of those expensive chipsets, then please help me with this project.
Please forgive my use of the word "we"; I mean "we as LMR members". I will help if I can, but I am a terrible programmer and needed a lot of help just to make a simple library for the Micro Magician, so I am in over my head here. I am hoping to find more experienced programmers here on LMR who can take Sebastian's C code and tweak it to better accommodate the Arduino.
What we need!
We need programmers who can modify Sebastian's code so that the 22050Hz, 4-bit amplitude output drives a single pin on the ATmega328. Ideally this would be a library so it can also be used on Arduino Megas, which have more memory.
Once we have S.A.M. running on an ATmega328 as a standalone speech synthesizer that we can connect to a speaker and either our serial port or I2C bus, we need to make the reciter into a separate library or function that our code can send text strings to.
The reciter then converts the text to the phonetic spelling needed by S.A.M. to pronounce the words correctly and outputs it to either the I2C bus or serial port. I could write Arduino code to do this (and I will if required), but since Sebastian already has the reciter in C, it is probably much easier to tweak it for use with the Arduino.
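One possible way to drive a single pin is PWM, and the arithmetic for that can be sketched as below. This assumes a 16 MHz clock (typical for an Arduino, but not stated anywhere above) and an 8-bit PWM duty register. The function names are hypothetical and nothing here comes from Sebastian's code.

```c
#include <stdint.h>

/* Stretch a 4-bit amplitude (0..15) over the full 8-bit duty range
   (0..255). Multiplying by 17 maps 0 -> 0 and 15 -> 255 exactly. */
uint8_t sample_to_duty(uint8_t sample4)
{
    return (uint8_t)((sample4 & 0x0F) * 17);
}

/* Number of CPU clock ticks between two samples, rounded to nearest.
   cpu_hz and rate_hz are parameters so nothing is hard-wired. */
uint32_t ticks_per_sample(uint32_t cpu_hz, uint32_t rate_hz)
{
    return (cpu_hz + rate_hz / 2) / rate_hz;
}
```

With these assumptions, ticks_per_sample(16000000, 22050) gives 726, so a timer interrupt firing every 726 clock ticks would load a new duty value at roughly the right rate.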
I emailed Sebastian yesterday and let him know I posted this link. Like most of us he does not have a lot of time, but he was kind enough to try to reduce the code size. He was then able to get it to compile with AVR-GCC to 13.4K!
This is a much more useful size for our Arduinos to work with, but the code still needs some adaptation to work with the Arduino. The Output() function in render.c needs to be modified, and perhaps the input will need to be changed to suit different needs. As BDK6 mentioned, he has also done some work previously, so maybe the new version I'm attaching can help. Please see the attached sam.zip file.
Currently I am trying to understand how S.A.M. works so that later I might be able to write some code from scratch and do some sort of tutorial. Sebastian was good enough to give me a smaller breakdown of the code which I've attached as tinySAM.C (it's a text file).
I have spent the day trying to convert tinySAM into an Arduino sketch called SAMarduino. The biggest problem is I still don't completely understand how the code works. I've attached the sketch but it's not working yet. I can only tell you it compiles to a size of 8.5K (including I2C and Serial libraries) and currently uses 1.6K of SRAM.
The output from the white noise generator is at full amplitude (128 +/- 127) but the actual speech is at about 128 +/- 5. I suspect there is a bug in tinySAM.
To work with only 2K of SRAM (assuming I understood and modified the code correctly), the input string cannot be more than 40 characters, which should be fine for most robotic applications. As most Arduinos use the ATmega328 with only 2K of SRAM, SAM really needs to be installed on a separate Arduino. For this reason, SAMarduino will accept a string from either the serial port or the I2C bus.
In SAMarduino I have tried to get SAM to output the waveform as it is generated, at 20 kHz, as an 8-bit output that needs a DAC (or a handful of resistors) to produce an analog output.
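The 40 character limit can be enforced with a small fixed buffer that is filled one character at a time, whichever bus the text arrives on. This is only a sketch of the idea. The names sam_line_t, sam_line_feed and sam_line_reset are hypothetical and are not part of SAMarduino.

```c
#include <stdbool.h>
#include <stddef.h>

#define SAM_MAX_INPUT 40   /* limit quoted above; adjust to taste */

/* Fixed-size line buffer: 40 characters plus the terminating '\0'. */
typedef struct {
    char   text[SAM_MAX_INPUT + 1];
    size_t len;
} sam_line_t;

/* Empty the buffer so a new line can be collected. */
void sam_line_reset(sam_line_t *line)
{
    line->len = 0;
    line->text[0] = '\0';
}

/* Feed one received character (from serial or I2C).
   Returns true when a complete, non-empty line is ready in text. */
bool sam_line_feed(sam_line_t *line, char c)
{
    if (c == '\n' || c == '\r') {
        line->text[line->len] = '\0';
        return line->len > 0;
    }
    if (line->len < SAM_MAX_INPUT) {
        line->text[line->len++] = c;   /* extra characters are dropped */
    }
    return false;
}
```

The caller would pass line.text to the synthesizer when sam_line_feed() returns true and then call sam_line_reset() before collecting the next string.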
Bits 0-3 are port B outputs 0-3 (D8-D11).
Bits 4-7 are port D outputs 4-7 (D4-D7).
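The bit split above can be written as two small masking functions. This is a sketch that can be checked on a PC. The function names are made up, and only the register names PORTB, PORTD, DDRB and DDRD in the comments are real AVR names.

```c
#include <stdint.h>

/* New PORTB value: sample bits 0-3 go to PB0-PB3 (D8-D11),
   the upper four bits of the port are left as they were. */
uint8_t portb_with_sample(uint8_t portb_now, uint8_t sample)
{
    return (uint8_t)((portb_now & 0xF0) | (sample & 0x0F));
}

/* New PORTD value: sample bits 4-7 go to PD4-PD7 (D4-D7),
   the lower four bits (including the serial pins) are preserved. */
uint8_t portd_with_sample(uint8_t portd_now, uint8_t sample)
{
    return (uint8_t)((portd_now & 0x0F) | (sample & 0xF0));
}

/* On the ATmega328 this would be used roughly as:
 *     DDRB |= 0x0F;  DDRD |= 0xF0;          // once, make the pins outputs
 *     PORTB = portb_with_sample(PORTB, s);  // per sample
 *     PORTD = portd_with_sample(PORTD, s);
 */
```

Because the two ports are written one after the other, the two halves of the byte change a fraction of a microsecond apart, which may add a small glitch at the DAC output.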
As a DAC can be made with resistors in an R-2R resistor ladder this can be made on a breadboard or prototype PCB and should not be a big problem.
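For reference, an ideal, unloaded 8-bit R-2R ladder produces a voltage proportional to the byte written to it. The sketch below shows that relationship in integer math. It assumes perfect resistors and pins that swing fully between 0V and the supply, and the function name is hypothetical.

```c
#include <stdint.h>

/* Ideal unloaded output of an 8-bit R-2R ladder, in millivolts:
   Vout = Vref * code / 256. vref_mv is the pin high level in mV. */
uint16_t r2r_millivolts(uint8_t code, uint16_t vref_mv)
{
    return (uint16_t)(((uint32_t)code * vref_mv) / 256u);
}
```

With a 5V supply (an assumption), code 128 gives 2500 mV and code 255 gives 4980 mV, one step short of the supply rail.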
I set up an Arduino Nano on a breadboard and rigged up an R-2R ladder made entirely from 75Ω resistors. This drives a small 32Ω speaker which has a 22µF capacitor in series. The speaker is hot glued into a small plastic container which acts as a speaker box.
Based on the advice from BDK6, I increased the amplitude scale factor, but I have not had time to convert the floating point math to integers or create a sine table. As a result the sound was something like deep rumbling with short bursts of static.
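A rough estimate of what this output stage does to the signal can be worked out by hand. It assumes an ideal R-2R ladder (whose output resistance equals R, here 75Ω), ignores the output resistance of the Arduino pins, and treats the speaker as a plain 32Ω resistor. None of this was measured on the actual breadboard.

```latex
f_c = \frac{1}{2\pi\,(R_{\text{ladder}} + R_{\text{spk}})\,C}
    = \frac{1}{2\pi \cdot (75\,\Omega + 32\,\Omega) \cdot 22\,\mu\text{F}}
    \approx 68\ \text{Hz}
\qquad
\frac{V_{\text{spk}}}{V_{\text{ladder}}} \approx \frac{32}{75 + 32} \approx 0.30
```

Under those assumptions the capacitor only removes the DC offset and very low bass, while the speaker sees roughly 30% of the unloaded voltage swing, so a small amplifier between the ladder and the speaker might help.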
Hopefully it will work better when I clean up the math.
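As an illustration of what an integer-only version might look like, here is a minimal sketch of a sine table with a fixed-point phase accumulator and an integer amplitude scale. It is not taken from tinySAM. The table size, the function names and the amplitude range of 0..16 are all assumptions chosen to keep the example short.

```c
#include <stdint.h>

/* First quarter of a sine wave: round(127 * sin(2*pi*i/32)), i = 0..8.
   The other three quarters follow from symmetry. */
static const int8_t quarter_sine[9] = { 0, 25, 49, 71, 90, 106, 117, 125, 127 };

/* Sine of a 5-bit phase (0..31 is one full cycle), range -127..127. */
int8_t sine32(uint8_t phase)
{
    phase &= 31;
    if (phase <= 8)  return quarter_sine[phase];
    if (phase <= 16) return quarter_sine[16 - phase];
    if (phase <= 24) return (int8_t)-quarter_sine[phase - 16];
    return (int8_t)-quarter_sine[32 - phase];
}

/* Phase increment per sample for a 16-bit phase accumulator.
   freq_hz must be lower than sample_rate_hz. */
uint16_t phase_step(uint16_t freq_hz, uint16_t sample_rate_hz)
{
    return (uint16_t)(((uint32_t)freq_hz << 16) / sample_rate_hz);
}

/* Produce one 8-bit sample centred on 128 and advance the phase.
   amp is 0..16, in sixteenths of full scale. */
uint8_t next_sample(uint16_t *phase, uint16_t step, uint8_t amp)
{
    /* The top 5 bits of the accumulator select the table entry. */
    int16_t s = (int16_t)(sine32((uint8_t)(*phase >> 11)) * amp / 16);

    *phase = (uint16_t)(*phase + step);
    return (uint8_t)(128 + s);
}
```

At full amplitude (amp = 16) the output swings over 128 +/- 127, the same range as the white noise generator, and the whole thing uses only integer multiplies, shifts and a nine-byte table.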