
Design and Implementation of a Robot Voice Control System Based on DSP and FPGA

2026-04-06
The robot's auditory system recognizes and interprets human speech, then outputs corresponding action commands to control head and arm movements. Traditional robot auditory systems are generally PC-controlled: a computer serves as the robot's information-processing core and controls the robot through interface circuits. Although such systems offer strong processing power, a relatively complete voice library, and easy updates and feature expansion, they are bulky, which hinders miniaturization and operation under complex conditions, and they suffer from high power consumption and cost. This design uses the cost-effective digital signal processing chip TMS320VC5509 as the voice recognition processor. Its fast processing speed enables the robot to independently complete complex voice-signal processing and action-command control while offline. Implementing the timing-control and logic circuits in an FPGA reduces the area they occupy on the PCB [1], making the voice-processing part of the robot's "brain" small and low-power. Developing a robot system that is compact, low-power, and fast, and that can complete voice recognition and execute action commands within a specific domain, therefore has great practical significance.

1 System Hardware Overall Design

The hardware's function is to acquire voice commands and drive the stepper motors, providing a development and debugging platform for the system software. As shown in Figure 1, the hardware consists of several parts: voice signal acquisition and playback, DSP-based voice recognition, FPGA action-command control, the stepper motors and their drivers, external flash memory for the DSP, JTAG-port emulation debugging, and keyboard control.
The workflow is as follows: the microphone converts the human voice into an analog signal, which the TLV320AIC23 audio codec quantizes into a digital signal and feeds to the DSP. After recognition, the DSP outputs action commands. From these commands the FPGA generates the correct forward/reverse signals and accurately counted pulses, supplying the stepper-motor driver chips with the drive signals that control motor rotation. External flash memory stores the system program and the voice library and boot-loads the system at power-up. The JTAG port is used for in-circuit emulation with a PC, and the keyboard is used for parameter adjustment and function switching.

2 Design of the Speech Recognition System

2.1 Characteristics of the Speech Signal

The frequency content of speech is concentrated between 300 Hz and 3400 Hz, so by the sampling theorem a sampling rate of 8 kHz is selected. One characteristic of speech is its short-time behavior: over one short interval it may look like random noise, over another like a periodic signal, or like a mixture of both. Its characteristics change with time, and only within a sufficiently short interval does the signal exhibit stable, consistent characteristics; this short interval is generally taken as 5 to 50 ms. Speech processing should therefore be short-time based [2]. The system sets the frame length to 20 ms and the frame shift to 10 ms, so each frame contains 160 samples of 16 bits each.

2.2 Voice Signal Acquisition and Playback

The voice acquisition and playback chip is TI's TLV320AIC23B, which integrates the analog-to-digital converter (ADC) and digital-to-analog converter (DAC) on chip. The chip is configured for an 8 kHz sampling rate, mono analog input, and two-channel output.
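As a concrete illustration of the framing parameters above, here is a minimal sketch (assuming samples arrive as a plain Python list; `split_frames` is a hypothetical helper, not from the source) of splitting an 8 kHz signal into 20 ms frames with a 10 ms shift:

```python
FRAME_LEN = 160   # 20 ms at 8 kHz
FRAME_SHIFT = 80  # 10 ms at 8 kHz

def split_frames(samples):
    """Split a sample sequence into overlapping 160-sample frames,
    advancing 80 samples (10 ms) per frame; a trailing partial
    frame is dropped."""
    frames = []
    start = 0
    while start + FRAME_LEN <= len(samples):
        frames.append(samples[start:start + FRAME_LEN])
        start += FRAME_SHIFT
    return frames
```

With a 50% overlap, every sample except those at the very edges contributes to two frames, which is what makes the 10 ms frame shift useful for tracking the time-varying statistics described above.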
The TLV320AIC23 is programmable: the DSP writes the device's control registers through the control interface, which supports both SPI and I2C. The circuit connection between the TLV320AIC23B and the TMS320VC5509 is shown in Figure 2. The DSP uses its I2C port to configure the TLV320AIC23's registers. With MODE = 0 the chip selects the I2C interface; the DSP operates in master-transmit mode and initializes the 11 registers at addresses 0000000 to 0001111 over I2C. In I2C mode each control word is written as three 8-bit bytes. The TLV320AIC23 uses a 7-bit register address and 9 bits of data, so the most significant data bit occupies the last bit of the second byte. The McBSP serial port connects to the TLV320AIC23 through six pins: CLKX, CLKR, FSX, FSR, DR, and DX. Data is exchanged with the peripheral through the DR and DX pins, while clock and frame synchronization are carried by CLKX, CLKR, FSX, and FSR. The McBSP is set to DSP mode with the receiver and transmitter synchronized. Serial transfers are initiated by the TLV320AIC23's frame-synchronization signals LRCIN and LRCOUT, and the word length for transmit and receive is set to 32 bits (16 bits for the left channel and 16 bits for the right) in single-frame mode.

2.3 Design of the Speech Recognition Program Module

To let the robot recognize commands from any speaker, the system adopts speaker-independent isolated-word recognition. Speaker-independent recognition means the speech models are trained with speakers of different ages, genders, and accents, so that a new speaker's speech can be recognized without additional training [2]. The recognizer is divided into several parts: pre-emphasis and windowing, endpoint detection, feature extraction, pattern matching against the speech database, and training.
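The packing of the 7-bit register address and 9-bit data described above can be sketched as follows. `aic23_control_bytes` is a hypothetical helper name; the two returned bytes are the payload that follows the I2C device address byte on the bus:

```python
def aic23_control_bytes(reg_addr, data):
    """Pack a TLV320AIC23 control word (7-bit register address +
    9 data bits) into the two payload bytes of the I2C write:
    the data's most significant bit occupies the last bit of the
    first payload byte (the second of the three bytes on the bus)."""
    assert 0 <= reg_addr < 128 and 0 <= data < 512
    byte1 = (reg_addr << 1) | (data >> 8)  # 7-bit address + data bit 8
    byte2 = data & 0xFF                    # data bits 7..0
    return byte1, byte2
```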
2.3.1 Pre-emphasis and Windowing of the Speech Signal

Pre-emphasis mainly removes the influence of glottal excitation and lip/nasal radiation. The pre-emphasis digital filter is H(z) = 1 − kz⁻¹, where k is the pre-emphasis coefficient, close to 1; this system takes k = 0.95. Pre-emphasizing the speech sequence X(n) yields the sequence x(n):

x(n) = X(n) − kX(n−1)  (1)

A finite-length Hamming window is slid along the speech sequence to extract frames 20 ms long with a 10 ms shift; the Hamming window effectively reduces the loss of signal features.

2.3.2 Endpoint Detection

Endpoint detection finds the beginning and end of each word when words are separated by sufficient silence. It generally uses the short-time energy distribution:

E = Σ x²(n), n = 0, 1, …, N−1  (2)

where x(n) is the Hamming-windowed frame of the speech sequence; the frame length is 160 samples, so N = 160. For silence E is very small, while for voiced speech E rises rapidly to a certain value, which distinguishes the start and end points of words.

2.3.3 Feature Vector Extraction

Feature extraction distills effective information from the speech signal for further analysis and processing. Commonly used feature parameters include linear predictive cepstral coefficients (LPCC) and Mel-frequency cepstral coefficients (MFCC); this system extracts MFCC feature vectors. The MFCC parameters are based on human auditory characteristics and exploit the critical-band effect of human hearing [3]. The speech signal is processed with Mel-cepstrum analysis to obtain a sequence of Mel-cepstral coefficient vectors representing the spectrum of the input speech. Several bandpass filters with triangular or sinusoidal characteristics are placed across the speech spectrum.
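The pre-emphasis, Hamming-window, and short-time energy steps can be sketched as follows (a minimal reference version; passing the first sample of the pre-emphasized sequence through unchanged is an assumption the source does not specify):

```python
import math

K = 0.95  # pre-emphasis coefficient k from the text
N = 160   # frame length in samples (20 ms at 8 kHz)

def pre_emphasize(X):
    """x(n) = X(n) - k*X(n-1); the first sample is passed through."""
    return [X[0]] + [X[n] - K * X[n - 1] for n in range(1, len(X))]

def hamming(n=N):
    """Standard Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))
            for i in range(n)]

def short_time_energy(frame):
    """E = sum of x(n)^2 over one windowed frame."""
    return sum(v * v for v in frame)
```

A simple endpoint detector then compares `short_time_energy` of each frame against a silence threshold and marks the first and last frames that exceed it.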
The speech energy spectrum is then passed through the filter bank, the output of each filter is computed, the logarithm is taken, and a discrete cosine transform (DCT) yields the MFCC coefficients. The transform can be simplified to:

C(i) = Σ lg F(k) · cos(iπ(k − 0.5)/M), k = 1, 2, …, M  (3)

where i is the index of the triangular filter (P, the number of coefficients, is chosen as 16 in this system), F(k) is the output of each filter, and M is the data length.

2.3.4 Pattern Matching and Training of the Speech Signal

Model training builds templates from the feature vectors; pattern matching compares the current feature vector with the templates in the speech library to obtain the result. Both use the Hidden Markov Model (HMM), a doubly stochastic process that models the statistical characteristics of a signal probabilistically. Because the HMM describes the non-stationarity and variability of speech well, it is widely used [4]. HMMs involve three basic algorithms: the Viterbi algorithm, the forward-backward algorithm, and the Baum-Welch algorithm. This design uses the Viterbi algorithm for state decoding, matching the feature vectors of the captured speech against the models in the speech library, and the Baum-Welch algorithm for training. Since the model's observations are assumed independent between frames, the Baum-Welch algorithm can be used to train the HMM.

2.4 DSP Development of the Speech Recognition Program

The DSP development environment is CCS 3.1 with DSP/BIOS. The recognition and training programs are packaged as modules, defined as separate functions that are called from the main program.
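The log-and-DCT step can be sketched as below. This is the commonly used DCT form over M filter outputs; the source's exact simplified formula is not reproduced, so treat the cosine argument as an assumption:

```python
import math

P = 16  # number of MFCC coefficients, as chosen in the text

def mfcc(F):
    """DCT of the log filterbank outputs F (one value per triangular
    filter): C(i) = sum_k log F(k) * cos(pi*i*(k-0.5)/M), i = 1..P,
    where M = len(F). Assumes all filter outputs are positive."""
    M = len(F)
    return [sum(math.log(F[k]) * math.cos(math.pi * i * (k + 0.5) / M)
                for k in range(M))
            for i in range(1, P + 1)]
```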
The speech recognizer function is defined as int Recognizer(int Micin); the recognition-result output function is int Result(void); the speech trainer function is int Train(int Tmode, int Audiod); and the action-command input function is int Keyin(int Action) [5]. The recognizer transforms the current speech input into a feature vector, matches it against the templates in the speech database, and outputs the result. The speech-response output function plays the speech response corresponding to the recognition result. Training converts speech commands input by multiple people of different ages, genders, and accents into templates for the training database. To guard against sample errors, each person's command is trained twice, and the two inputs are pattern-matched using Euclidean distance; a pair is added to the sample set only if their similarity reaches 95%. The speech-response input function assigns each template in the speech database a corresponding speech output, so the robot can respond verbally. In normal operation the system executes the recognition subroutine; during training it executes the training function, obtains the database templates, and returns when training is complete. The program flowchart is shown in Figure 3.

3 Design of the Robot's Motion Control System

3.1 FPGA Logic Design

The system controls the robot's head movements by voice. The head has two degrees of freedom, up/down and left/right, and therefore requires two stepper motors. After the DSP completes voice recognition, it outputs the corresponding action command; after the action is executed, the DSP issues a reset command and the head returns to its initial state. The FPGA's role is to provide the DSP interface logic, implement the RAM blocks that store the DSP's instructions, and generate the stepper-motor drive pulses that control each motor's rotation direction and angle.
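The double-entry training check can be sketched as follows. The mapping from Euclidean distance to a similarity in [0, 1] is a hypothetical choice, since the source states only the 95% threshold:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def accept_training_pair(v1, v2, threshold=0.95):
    """Accept a command for the sample set only when the two training
    utterances are sufficiently similar; similarity = 1/(1 + d) is an
    assumed mapping, not taken from the source."""
    return 1.0 / (1.0 + euclidean(v1, v2)) >= threshold
```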
The FPGA is the action-command control unit. It is implemented on a FLEX10KE chip, receives data from the DSP, and controls the two stepper motors in parallel. The FPGA's internal logic is shown in Figure 4. It contains two motor pulse generators that produce each motor's operating pulses and forward/reverse control. A0~A7 are the DSP data input ports, WR is the data-write strobe, P1 and P2 are the pulse outputs to the two stepper-motor driver chips, L1 and L2 are the forward/reverse control outputs, and ENABLE is the enable signal. RAM1 and RAM2 are the instruction registers for the two stepper motors; from them each motor pulse generator emits the corresponding number of square-wave pulses. The DSP outputs 9-bit instructions on data lines D0~D8, where D8 selects the RAM (RAM1 when D8 = 1, RAM2 when D8 = 0) and D0~D7 carry the target motor angle (120° of vertical travel and 1° horizontal steps, with an initial position of 60°); the range of D0~D7 is 00000000~11111000, with an initial value of 00111100. The FPGA acts as the step-pulse generator, setting motor speed through the clock-cycle configuration and determining forward or reverse rotation from the commanded coordinate relative to the initial value. The action-instruction flow is shown in Figure 5. R1 is the DSP instruction register and R2 is the current-coordinate register; the motor's rotation direction and angle are determined by the difference between the DSP's commanded coordinate and the FPGA's current coordinate. The advantage of this scheme is that a change in the input instruction can end the current action and start the new one immediately. After an instruction completes, the system is reset and the stepper motor returns to its initial state.

3.2 FPGA Logic Simulation

The FPGA was developed on the MAX+PLUS II platform, and the logic above was described in VHDL.
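The 9-bit instruction word and the direction-and-angle computation from the coordinate difference can be sketched as follows (register and signal names follow the text; the forward/reverse polarity is an assumption):

```python
def encode_command(ram_select, angle_code):
    """Pack a DSP instruction: D8 selects the RAM block (1 selects
    RAM1, 0 selects the other block) and D0~D7 carry the target
    angle code."""
    assert ram_select in (0, 1) and 0 <= angle_code <= 0xFF
    return (ram_select << 8) | angle_code

def decode_command(word):
    """Split a 9-bit instruction back into (RAM select, angle code)."""
    return (word >> 8) & 1, word & 0xFF

def step_plan(r1_target, r2_current):
    """Rotation direction and step count from the difference between
    the DSP's commanded coordinate (R1) and the FPGA's current
    coordinate (R2); direction 1 = forward is an assumed polarity."""
    diff = r1_target - r2_current
    return (1 if diff >= 0 else 0), abs(diff)
```

For example, commanding the initial angle code 00111100 (decimal 60) with RAM1 selected produces the word 1 00111100; moving from coordinate 60 to 90 yields 30 forward steps.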
Debugging was performed through the JTAG interface. The FLEX10KE chip outputs the correct forward/reverse signals and pulse waveforms according to the DSP's instructions.

3.3 Stepper Motor Driver Design

The FPGA controls the stepper-motor driver chips through its P1, L1, P2, and L2 outputs. The driver is Toshiba's TA8435H, a single-chip sinusoidal microstepping driver for two-phase stepper motors. The circuit connection between the FPGA and the TA8435H is shown in Figure 6. The FLEX10KE and TMS320VC5509 operate at 3.3 V, while the TA8435H uses a 5 V logic supply and a 25 V motor supply, so TLP521 optocouplers isolate the voltage levels on the two sides of each pin connection. CLK1 is the clock input, CW/CCW selects forward or reverse rotation, and A, A̅, B, B̅ are the outputs to the two-phase stepper motor.

4 Conclusion

The system fully exploits the DSP's high processing speed and expandable off-chip storage, achieving high speed, real-time operation, a high recognition rate, and support for a large voice library. Using an FPGA simplifies the system circuitry: a single FLEX10KE chip handles the timing control of both stepper motors. Although it lags PC-based systems in processing speed and voice-library capacity, an embedded system based on a DSP and an FPGA undoubtedly has broad prospects for robot miniaturization, low power consumption, and the implementation of specific functions.