Development of Real‐World Three‐Dimensional Sound Localization System Using the Binaural Model


Studies in System Science (SSS) Volume 2, 2014 www.as‐se.org/sss

Tadashi SHO*1, Takashi IMAMURA2, Tetsuo MIYAKE3
Department of Mechanical Engineering, Toyohashi University of Technology, 1‐1 Hibarigaoka, Tenpaku‐cho, Toyohashi 441‐8580, Japan

*1 zhang@is.me.tut.ac.jp; 2 ima@nigata.ca.jp; 3 miyake@is.me.tut.ac.jp

Abstract

Humans can estimate the direction of a sound source using the time difference between the two ears and the way the sound is altered by the shape of the head and ears. However, it is very difficult to create such an estimation system using a computer and two microphones. In this study, we propose a novel sound source localization method in which the peak-hold method is used to decrease environmental noise and reflected sound. As a result, an average correct estimation rate of 93.2% is obtained in determining the location of the object sounds.

Keywords

Sound Source Direction; Interaural Time Difference; Microphone; Binaural Model

Introduction

Recently, various robots have made their way into many settings, such as industry and the home. What is essential for such robots is a system that recognizes their surroundings, and one important function for doing so is a sense of hearing. Generally, humans can estimate the direction of a sound source using the time difference between the two ears and the difference in how the sound is heard. The purpose of this research is to realize this capability technologically. Such systems could be applied to assist the hearing-impaired and to let humanoid robots estimate the sound source direction. Research using arrays of three or more microphones[1-4] and research based on the sound wave pressure information obtained from many microphones[5] have been performed to estimate the sound source direction. However, a sound source direction estimation system that uses only two microphones has many advantages when applied to the hearing-impaired and to humanoid robots: the structure of the system is simple, miniaturization is easy, the amount of signal processing is small, and so on[6, 7]. The authors have developed a sound source localization system using a dummy head with two ears[8, 9]. The results showed that more than 93.5% of the sound directions could be estimated correctly when using 11 kinds of sound sources. The problem with the previous research was that the results were obtained in an idealized environment, in which there were no reflected sounds and the background noise was low. In this study, in order to overcome this problem, we propose a novel sound source localization method in which the peak-hold method[10] is used to decrease environmental noise and reflected sound. Furthermore, in order to confirm the effectiveness of the proposed method, sound localization experiments were carried out and encouraging results were obtained.

Principles of Sound Localization by Using the Binaural Model

Binaural Model

Figure 1 shows the binaural model and the definition of the sound source direction angles, where (a) denotes the binaural model and (b) the horizontal angle model. In the binaural model shown in Fig. 1 (a), a cone that passes through the sound source is drawn, and the horizontal angle α is defined as the angle from the front (median plane) along the horizontal plane to the cone. The vertical angle β is defined as the angle ∠SO′H formed by S (the sound source), O′ (the center of the conic base) and H (the point where the cone intersects the horizontal plane).


FIG. 1 BINAURAL MODEL AND DEFINITION OF SOUND SOURCE DIRECTION, WHERE θ DENOTES THE AZIMUTHAL ANGLE, φ THE ELEVATION ANGLE, α THE HORIZONTAL ANGLE AND β THE VERTICAL ANGLE.

Usually, the direction can be represented by the azimuthal angle θ and the elevation φ. However, it is very difficult to calculate them directly. The relation between (θ, φ) and (α, β) follows from Fig. 1 (a) and the following equation:

θ = tan⁻¹( sin α / (cos α cos β) ),   φ = sin⁻¹( cos α sin β ).   (1)
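Where a numerical check helps, the following is a minimal sketch of the (α, β) → (θ, φ) conversion. The exact published form of Eq. (1) could not be recovered from the scan, so this follows the standard cone geometry of Fig. 1 (a) under that assumption, and all function names are hypothetical.

```python
import numpy as np

def cone_to_polar(alpha, beta):
    """Convert horizontal/vertical angles (alpha, beta) to azimuth and
    elevation (theta, phi) as in Eq. (1). Angles are in radians; the
    unit direction vector is x = cos(a)cos(b), y = sin(a),
    z = cos(a)sin(b), with the y-axis along the interaural axis."""
    theta = np.arctan2(np.sin(alpha), np.cos(alpha) * np.cos(beta))
    phi = np.arcsin(np.cos(alpha) * np.sin(beta))
    return theta, phi

# Sanity check: a source in the horizontal plane (beta = 0) gives
# theta = alpha and phi = 0.
theta, phi = cone_to_polar(np.deg2rad(30.0), 0.0)
assert np.isclose(np.rad2deg(theta), 30.0) and np.isclose(phi, 0.0)
```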

Therefore, in this study, the sound source direction is represented by the horizontal angle α and the vertical angle β instead of the azimuthal angle θ and the elevation φ.

Model of the Horizontal Angle α

Based on the horizontal angle model shown in Fig. 1 (b), we first consider the relation between the interaural time difference (ITD) and the horizontal angle α when the sound source lies in the horizontal plane (θ = α). For simplicity, we assume that a person's head is a sphere of diameter d [cm]. As the sound wave spreads out, the sound enters the left ear directly, travels on to point P, and then arrives at the right ear. Therefore, when the distance L from the center of the head to the sound source is sufficiently larger than the ear-to-ear distance d (d/L << 1), the relation between the horizontal angle α and the ITD is:

ITD = (d / 2c)(α + sin α),   (2)

where c is the sound velocity.

However, although the horizontal angle α can theoretically be obtained from the ITD, Eq. (2) does not perfectly match the relation between the real horizontal angle α and the ITD. This is because the head is assumed to be a sphere, whereas the spread of an actual sound wave is complicated by the unevenness of the head and the influence of the pinna. The ITD could be found theoretically from the difference in propagation distance if the surface shape of the head were modeled precisely. In this study, however, the horizontal angle α is estimated by the following equation, which is obtained experimentally and expresses the relation between α and the ITD:

α = a · ITD + b,   (3)

where a and b are constant coefficients obtained experimentally.

Estimating the Vertical Angle β

Unlike the horizontal angle, the vertical angle β cannot be obtained geometrically in the binaural model. However, the way sound is heard differs between the two ears, because the interference and diffraction of the sound, caused by the unevenness of the head surface, change with the direction of the sound source. Therefore, it is possible to estimate the vertical angle β if this difference is known in advance. The authors have confirmed that the power spectrum ratio between both ears (Ears Spectrum Ratio, ESR) is effective as a feature value to represent this knowledge of how sound is heard [9]. By using an ESR database that holds this knowledge, the vertical angle β can be estimated from the correlation between the ESR database and the ESR of the sound source.
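As a rough illustration, the following is a minimal sketch of how such an ESR feature could be computed. The paper does not give the exact formula, so a simple left/right power-spectrum ratio in dB is assumed, and all names are hypothetical.

```python
import numpy as np

def ears_spectrum_ratio(s_left, s_right, eps=1e-12):
    """Hypothetical ESR: ratio of left/right power spectra in dB.

    s_left, s_right: equal-length signals from the two ear microphones.
    The paper's exact definition is not reproduced here; this assumes
    a plain power-spectrum ratio as the feature."""
    p_left = np.abs(np.fft.rfft(s_left)) ** 2    # left-ear power spectrum
    p_right = np.abs(np.fft.rfft(s_right)) ** 2  # right-ear power spectrum
    return 10.0 * np.log10((p_left + eps) / (p_right + eps))
```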

FIG. 2 FLOWCHART OF CALCULATING THE HORIZONTAL ANGLE α AND THE VERTICAL ANGLE β.

Method of 3D Sound Source Localization in the Real Environment

Flowchart of 3D Sound Source Localization

The real world differs from an ideal environment in that environmental noises, such as background noise, reverberation and reflected sounds, cannot be ignored. The conventional method, however, does not consider the presence of such environmental sound. In this case, a large error can occur in the calculation of the ITD, which has a significant influence on the estimation result of (α, β). In this study, in order to overcome these problems, the new 3D sound source localization method shown in Fig. 2 is proposed, where (a) shows the flowchart for estimating the horizontal angle α, and (b) shows the flowchart for estimating the vertical angle β.

Estimating the Horizontal Angle α by Peak-hold Processing

In this section, we describe the details of the horizontal angle α estimation shown in Fig. 2 (a). The short-time Fourier transform (STFT) is first applied to the sound signals s_L(t), s_R(t), measured from the left and right microphones, respectively, and the frequency components S_L(ω, t), S_R(ω, t) are obtained at each time t. A Hamming window function is used when performing the STFT. The frame length of the window function must be set so that the first reflected sound and the direct sound are not mixed in any one frame[10]. In this study, therefore, it is set to 32 samples, considering the sampling frequency of the data and the acoustic environment described later. The shift amount of the window function is set to one sample, in order to improve the estimation accuracy of the horizontal angle α and the time resolution of the ITD, although it is generally set to 1/2 of the frame length. Through trial and error, the analysis length of the data to which the STFT is applied is set to 1024 samples, corresponding to previous research[9].
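The following is a minimal sketch of this STFT front end (Hamming window, 32-sample frames, one-sample hop). The function name and the commented usage are assumptions for illustration.

```python
import numpy as np

def stft_onesample_hop(signal, frame_len=32):
    """STFT with a Hamming window and a hop of one sample, as described
    above. Returns an array of shape (num_frames, frame_len // 2 + 1)
    holding the complex spectrum S(omega, t), one row per time t."""
    window = np.hamming(frame_len)
    num_frames = len(signal) - frame_len + 1
    frames = np.stack([signal[t:t + frame_len] * window
                       for t in range(num_frames)])
    return np.fft.rfft(frames, axis=1)

# Hypothetical usage with a 1024-sample analysis segment, per the text:
# S_L = stft_onesample_hop(s_left[:1024])
# S_R = stft_onesample_hop(s_right[:1024])
```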


Then the absolute values of the S_L(ω, t), S_R(ω, t) obtained by the STFT are calculated, and they are arranged as time-series data |X_L(ω, t)|, |X_R(ω, t)| for each frequency ω. Peak-hold processing is then performed on the time-series data |X_L(ω, t)|, |X_R(ω, t)|, following the flowchart shown in Fig. 3 [10]. In the description of the peak-hold processing below, the subscripts R and L are omitted, because the same processing is performed for the left and right data.

FIG. 3 FLOWCHART OF PEAK‐HOLD PROCESSING.

First, the initial peak-hold value is set as PH(ω) = 0, and the data |X(ω, t)| at time t is compared with PH(ω). If the condition |X(ω, t)| ≥ PH(ω) is satisfied, a new peak-hold value and the peak-hold processing result |X(ω, t)|_p are obtained by the following equations:

|X(ω, t)|_p = |X(ω, t)|,   (4)

PH(ω) = |X(ω, t)|.   (5)

If the condition |X(ω, t)| ≥ PH(ω) is not satisfied, the new peak-hold value and peak-hold processing result |X(ω, t)|_p are instead obtained by:

|X(ω, t)|_p = PH(ω),   (6)

PH(ω) = γ · PH(ω),   (7)

where the coefficient γ (0 < γ < 1) is a decrement coefficient of the peak-hold value. By performing the above processing from the beginning to the end of the data, the peak-hold processing result is obtained. The same processing is performed for the left and right sound signals, respectively. After the logarithm of the time-series data |X(ω, t)|_p obtained by the peak-hold processing is taken, the time difference is computed according to the following equation:

|X(ω, t)|_d = log |X(ω, t)|_p − log |X(ω, t − 1)|_p.   (8)
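Below is a minimal sketch of the per-frequency peak-hold processing of Eqs. (4)-(8). The value of the decrement coefficient is an assumption; the paper does not state it.

```python
import numpy as np

def peak_hold(x_abs, gamma=0.99):
    """Peak-hold processing of Eqs. (4)-(7) for one frequency bin.

    x_abs: 1-D array |X(omega, t)| over time for a single frequency.
    gamma: decrement coefficient of the peak-hold value (0 < gamma < 1);
           0.99 is an assumed value, not taken from the paper."""
    x = np.asarray(x_abs, dtype=float)
    ph = 0.0                       # initial peak-hold value PH(omega) = 0
    out = np.empty(len(x))
    for t, xt in enumerate(x):
        if xt >= ph:               # Eqs. (4)-(5): track a new peak
            out[t] = xt
            ph = xt
        else:                      # Eqs. (6)-(7): hold and decay the peak
            out[t] = ph
            ph *= gamma
    return out

def log_time_difference(x_p, eps=1e-12):
    """Eq. (8): first difference of the log peak-hold series."""
    logs = np.log(np.asarray(x_p, dtype=float) + eps)
    return logs[1:] - logs[:-1]
```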

Next, the normalized correlation coefficient is computed by the following equation, using the time-differenced data of the left and right ears:

C_R,L(ω, τ) = ∫ |X_R(ω, t)|_d |X_L(ω, t − τ)|_d dt / √( ∫ |X_R(ω, t)|_d² dt · ∫ |X_L(ω, t)|_d² dt ),   (9)

where τ denotes the interaural time difference of the sound arriving at both ears. By applying the modified Thompson tau technique after processing Eq. (9), the contribution ratios of the normalized correlation for every frequency ω are evaluated based on the interaural time difference τ at the maximum normalized correlation value. If the interaural time difference at a frequency is judged to be an abnormal value, all the normalized correlation values at that frequency ω are replaced with zero (the weight is set to 0). Next, the sum of the normalized correlation values C_R,L(ω, τ) over all frequencies ω is calculated using Eq. (10), and the interaural time difference τ that maximizes C̄_R,L(τ) is determined as the ITD:

C̄_R,L(τ) = Σ_ω C_R,L(ω, τ).   (10)

Finally, the horizontal angle α is calculated from the ITD using Eq. (3), which is determined in advance, as in the conventional method.

Estimation of the Vertical Angle β

Estimation of the vertical angle β comprises two parts: β estimation and ESR database creation. Creation of the ESR database is explained first. To create the ESR database, sound-collecting experiments (described in the next section) are carried out more than once at each sound source position. The Fourier transform is applied to the measured data and the ESR is calculated from the power spectra. By applying the modified Thompson tau technique to each ESR, outliers are excluded, and the ESR that is closest to the true value is stored in the ESR database. The estimation of the vertical angle β follows the process shown in Fig. 2 (b) and presupposes the horizontal angle α estimated above. The procedure first selects from the ESR database the group of ESR data containing the candidate vertical angles β, based on the estimated horizontal angle α. Next, it calculates the degree of similarity between the ESR of the target sound and each ESR in the selected group of candidates β. Finally, it chooses the candidate vertical angle β with the highest degree of similarity and sets it as the vertical angle β.
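The following is a minimal sketch of the database lookup and similarity matching just described. It assumes the database is a dict keyed by (α, β) on the same discrete angle grid as the estimate, and it uses normalized correlation as the similarity measure, since the paper does not specify one; all names are hypothetical.

```python
import numpy as np

def estimate_beta(target_esr, esr_database, alpha_hat):
    """Pick the candidate vertical angle beta whose stored ESR is most
    similar to the target ESR, given the estimated horizontal angle.

    esr_database: dict mapping (alpha, beta) -> ESR vector (assumed
    layout, with angles on the same discrete grid as alpha_hat).
    alpha_hat: horizontal angle already estimated via Eq. (3)."""
    def similarity(a, b):
        # Normalized correlation as an assumed similarity measure.
        a = a - a.mean()
        b = b - b.mean()
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    candidates = {beta: esr for (alpha, beta), esr in esr_database.items()
                  if alpha == alpha_hat}
    return max(candidates,
               key=lambda beta: similarity(target_esr, candidates[beta]))
```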

Experimental Apparatus

In this study, a laboratory room of height 2.7 [m] × width 3.0 [m] × depth 7.0 [m] was used. It has an average background noise of 45 [dB] and a reverberation time of 0.3 [sec], and is a typical, realistic acoustic environment. The experimental apparatus is shown in Fig. 4. As shown in the figure, the sound signal from each direction can be measured by using the dummy head. The dummy head is a model having the shape of a man's head and pinnae, and a microphone is built into each ear. The elevation φ is set by moving the speaker along the rail, and the azimuth θ is set by rotating the rotor platform that supports the dummy head. The dummy head is placed on a stand when the sound source is below the horizontal plane (β = 180 degrees or more); in this way, the pseudo-state of hearing sound from below can be produced. The sound is output from the speaker, and the signal measured by the dummy head is input into a sound device. The sound data acquired with the sound device are sent to a PC via a USB cable, where the direction estimation analysis is performed.

FIG. 4 EXPERIMENTAL APPARATUS.


TABLE 1 RESULT OF SOUND LOCALIZATION

                   Correct rate of α   Correct rate of β   All correct rate
Previous method         82.0%               89.0%               73.0%
Proposed method         94.6%               98.6%               93.0%

FIG. 5 CORRECT RATE OF THE HORIZONTAL ANGLE α.

Results and Discussion

Table 1 shows the results of sound source localization, together with the results obtained by the conventional method for comparison. The table shows that the proposed method has a higher correct rate than the conventional method [9] for both the horizontal angle α and the vertical angle β. Fig. 5 shows an example of the results for the horizontal angle α, where (a) shows the result obtained by the conventional method and (b) the result obtained by the proposed method. As shown in the figure, with the conventional method the correct rate decreases as the horizontal angle increases, whereas with the proposed method the correct rate improves from 40 [deg] onward. In particular, the proposed method achieves a correct rate of 100% at the horizontal angle α = 90 [deg], where the conventional method achieves 0%. This means that the proposed method allows estimation of angles that could not be estimated conventionally.

Conclusions

In this study, we proposed a novel sound source localization method for real acoustic environments, in which the peak-hold method is used to decrease environmental noise and reflected sound. In order to confirm the effectiveness of the proposed method, sound localization experiments were carried out and a correct rate of 93.2% was obtained.

REFERENCES

[1] K. Sasaki and K. Hirata, “3D‐Localization of a Stationary Random Acoustic Source in Near‐Field by Using 3 Point‐Detectors”, Transactions of the Society of Instrument and Control Engineers, Vol.34, No.10, 1329‐1337, 1998.

[2] M. Xia and R. Du, “New Method of Effective Array for 2‐D Direction‐of‐Arrival Estimation”, International Journal of Innovative Computing, Information and Control, Vol.2, No.6, 1391‐1397, 2006.

[3] H. Kowaka, Y. Umut and M. Kominami, “3‐D Sound Source Localization Using Conformal Array”, Technical Report of IEICE, EA2006‐91, 13‐18, 2006 (in Japanese with English abstract).

[4] K. Haddad and J. Hald, “3D Localization of Acoustic Sources with a Spherical Array”, Journal of the Acoustical Society of America, 1585‐1590, 2008.

[5] K. Takashima, “Omni‐directional Sound Source Analysis System”, Journal of the Japan Society of Mechanical Engineers, Vol.107, No.1033, 964‐965, 2004 (in Japanese with English abstract).

[6] H. Nakashima and T. Mukai, “A Learning System for Estimating the 3‐Dimensional Sound Source Position Using Binaural Hearing”, Proceedings of the 5th SICE System Integration Annual Conference (SI 2004), 1102‐1103, 2004.

[7] H. Nakashima, et al., “Improvement of Sound Source Localization Ability Using Inter‐aural Level Difference Information”, IEICE Transactions, D‐II, Vol.J87, No.3, 919‐922, 2004.

[8] S. Horihata, T. Katayama, T. Miyake and Z. Zhang, “3‐D Sound Localization Using Binaural Model”, Transactions of the Japan Society of Mechanical Engineers, Series C, Vol.72, No.723, 3567‐3575, 2006 (in Japanese with English abstract).

[9] Z. Zhang, Kazuaki I., S. Horihata, T. Miyake and T. Imamura, “Estimation of Sound Source Direction Using a Binaural Model”, International Journal of Innovative Computing, Information and Control, Vol.3, No.3, 551‐564, 2007.

[10] T. Suzuki and Y. Kaneda, “Improving the Robustness of Multiple Signal Classification (MUSIC) Method to Reflected Sounds by Sub‐band Peak‐hold Processing”, Acoust. Sci. & Tech., Vol.30, No.5, 387‐389, 2009.


