I have played around with Praat a bit this semester and have previously published two articles about my adventures in the land of phonetics, one about basic vowel space and one about monophthongs in different languages. I originally intended to write several articles, gradually building up to a guide for how to identify Mandarin syllables in Praat, but since I ran out of time, I’m jumping ahead in the series and publishing this article now. If you want to try any of this yourself, you can download Praat here.
Identifying Mandarin syllables in Praat
To learn more about Chinese phonetics, I have been playing a little game with myself. I have a large number (1000+) of Chinese syllables recorded by a female speaker. I load one of the syllables into Praat at random without looking at which one it is. The goal is to figure out which syllable it is only by looking at it.
This is quite possible, although 100% accuracy is probably not achievable because some sounds are too hard to tell apart. I haven’t kept a detailed record of my score, but I think I get it completely right slightly more than 50% of the time, and when I’m wrong, I’m usually only slightly off, such as mistaking “tán” for “pán” or similar.
The goal of writing this guide is primarily to help myself understand what I’m doing. It’s of course possible that someone else finds it useful, but probably not very many. This guide is basically a long discussion of what I do when I (rather successfully) identify Mandarin syllables just by looking at the spectrogram and waveform.
If you have suggestions for how to improve the guide or references that can help me improve accuracy, let me know! Also note that I’m no expert, so please report any errors you find. I have taken a few courses in Chinese phonetics, but that’s about it. The rest I’ve learnt on my own, mostly in Chinese, so I might sometimes use inaccurate vocabulary in English, but it should be okay. Let’s get started!
Table of contents
- Step 1 – Tone
- Step 2 – Syllable structure
- Step 3 – Identify sounds
- Spectrogram challenge
- References and further reading
Step 1 – Tone
I usually start with the tone because it’s the easiest part. Basic knowledge of the contours of Chinese tones should be enough for almost all cases. Exact F0 values (pitch) seldom need to be considered because the contour is almost always enough. The only potential trap for beginners is failing to recognise that both T2 and T3 fall before they rise. The main difference is that the turning point comes later and is lower for T3. Compare:
For the sake of completeness, here are typical cases of T1 and T4 as well:
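The contour rules described in this step can be sketched in a few lines of Python. This is only a rough illustration, not something Praat does for you: the function name `classify_tone` and both thresholds (`level_tolerance` and `late_turn`) are my own assumptions and would need calibrating per speaker. It only uses the timing of the turning point to separate T2 from T3, not its height.

```python
def classify_tone(f0, level_tolerance=0.08, late_turn=0.4):
    """Guess the tone (1-4) from evenly spaced F0 samples in Hz.

    Both thresholds are illustrative assumptions, not measured values.
    """
    if len(f0) < 3:
        raise ValueError("need at least three F0 samples")
    lowest, highest = min(f0), max(f0)
    pitch_range = highest - lowest
    # A very small pitch range relative to the peak means a level contour (T1).
    if pitch_range / highest <= level_tolerance:
        return 1
    # If the contour ends close to its lowest point, it simply falls (T4).
    if f0[-1] - lowest <= 0.2 * pitch_range:
        return 4
    # Otherwise it falls and then rises: locate the turning point (0.0-1.0).
    turn = f0.index(lowest) / (len(f0) - 1)
    # An early turning point suggests T2, a late one T3.
    return 2 if turn < late_turn else 3


# Invented contours, only to show how the function is called.
print(classify_tone([220, 222, 221, 219, 220]))  # level
print(classify_tone([190, 185, 195, 220, 250]))  # early turn, then rise
print(classify_tone([190, 170, 150, 140, 180]))  # late turn, then rise
print(classify_tone([260, 240, 200, 160, 130]))  # falling
```

The F0 values would come from whatever pitch track you read off in Praat; the numbers above are made up.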
Step 2 – Syllable structure
When trying to determine which syllable we’re dealing with, it’s useful to first get a general idea of what kind of syllable structure we’re looking at. The following section isn’t meant to determine exactly what the parts are, but rather to pinpoint the number of sounds and the general syllable structure. This narrows things down considerably, since Mandarin has only slightly more than 400 syllables (we have already dealt with tone in step one) and the structure is very rigid: a full syllable is CGVN (Consonant, Glide, Vowel, Nasal), where all parts are optional except the main vowel. It should of course be noted that most of the possible combinations don’t exist or don’t exist for certain tones.
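To make the CGVN template concrete, here is a small Python sketch that lists every shape the template allows when only the main vowel is obligatory. The names `SLOTS`, `OPTIONAL` and `syllable_shapes` are made up for this illustration, and the sketch says nothing about which actual syllables exist.

```python
from itertools import product

SLOTS = ("C", "G", "V", "N")      # Consonant, Glide, Vowel, Nasal
OPTIONAL = {"C", "G", "N"}        # everything except the main vowel


def syllable_shapes():
    """Return all slot combinations allowed by the CGVN template."""
    shapes = []
    for present in product([True, False], repeat=len(SLOTS)):
        # Skip any combination that leaves out an obligatory slot.
        if all(flag or slot in OPTIONAL for slot, flag in zip(SLOTS, present)):
            shapes.append("".join(s for s, flag in zip(SLOTS, present) if flag))
    return shapes


print(syllable_shapes())
```

That gives eight shapes, from the full CGVN (as in Pinyin “guang”) down to a bare V (as in Pinyin “a”), so the first job in step 2 is really just deciding which of these eight you are looking at.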
Initial consonant: If voiced, e.g. [m n l], it looks like a vowel, but is generally weaker. Stops are usually visible through their releases and fricatives are easy to spot because of the noise-like turbulence. Affricates are combinations of stops and fricatives.
Glides and vowels: There is some controversy in phonology about whether G belongs with the preceding consonant or the following vowel, or whether it fills a slot of its own, but for our purposes, it’s probably best to consider it a vowel in addition to the main vowel.
Final consonants: Final consonants in Mandarin can only be [n ŋ ɻ]. If there seems to be something significant going on after the vowel has ended, it’s one of these finals. None of the syllables I’ve been playing with contain any [ɻ] finals (known as Erhua), so these won’t be covered in detail in this guide.
Step 3 – Identify sounds
Use this flowchart to figure out what to do next. Coloured steps in the flowchart have detailed discussions below. If you have problems with other steps, you probably need to read basic definitions of the relevant speech sounds; please refer to the relevant entries on Wikipedia.
Identifying vowels from a single sample can be tricky, but it’s still fairly easy to get the right idea by comparing F1 and F2 values. It also helps a lot to be familiar with the syllable structure in Mandarin, because some monophthongs or diphthongs simply don’t occur in certain environments and can therefore be excluded.
For instance, if you think the syllable ends with a nasal, you don’t need to worry about the subtle differences between [i] and [y] because if there’s only one vowel sound, it has to be [i] because [y] can’t be followed directly by [n] or [ŋ]. Similarly, if you can identify one of the sibilants [ɕ ʂ s] accurately, you don’t need to differentiate the allophones of /ɨ/ because these are in complementary distribution.
So if you can’t identify the vowels exactly, narrow it down to a range of possible answers based on the general syllable structure. You will probably be able to guess which vowel it is later once you know more about the preceding and following sounds. However, it should be mentioned that vowels are usually the easiest to guess, so you probably want to gain as much information as possible in this step so you have fewer possibilities later.
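The kind of narrowing down described above is essentially a lookup. As a minimal sketch in Python, here is the case of the vowel written “i” in Pinyin, whose quality is decided entirely by the initial. The function name and the descriptive labels are my own choices for illustration; the grouping of initials follows standard descriptions of Mandarin.

```python
# Pinyin initials grouped by what the vowel written "i" sounds like after them.
DENTAL_SIBILANTS = {"z", "c", "s"}
RETROFLEXES = {"zh", "ch", "sh", "r"}
NEVER_BEFORE_I = {"g", "k", "h", "f"}   # no such syllables in standard Mandarin


def vowel_written_i(initial):
    """Return a label for the vowel spelt "i" after the given Pinyin initial."""
    if initial in DENTAL_SIBILANTS:
        return "dental apical vowel"
    if initial in RETROFLEXES:
        return "retroflex apical vowel"
    if initial in NEVER_BEFORE_I:
        return None
    return "[i]"


print(vowel_written_i("s"))   # as in Pinyin "si"
print(vowel_written_i("sh"))  # as in Pinyin "shi"
print(vowel_written_i("x"))   # as in Pinyin "xi"
```

In other words, once the sibilant is known, the vowel comes for free, and the other way around.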
There are three finals: [n ŋ ɻ]. All of these influence the preceding vowel to different extents (a lot in the case of [ɻ]), so identifying the final involves looking at the preceding vowel as well, not just the final itself. In the case of [ɻ], there are some general signs (such as a drop in F3, which will approach F2), but more detailed knowledge of how Erhua influences the preceding vowel(s) is probably necessary (see 朱川, 2013, or the article about Erhua on Wikipedia). In general, the spectrum should start approaching that of [ɻ] during the pronunciation of the vowel.
For [n ŋ] the situation is similar in that there are two things happening. First, they influence the quality of the preceding vowel, and, second, the final itself is different. The easiest part to spot is that F2 and F3 are higher for [n] compared with [ŋ]. Let’s look at the spectrograms for [an] and [ɑŋ]:
Identifying these finals only by looking at the finals themselves is hard, but [n] seems more likely to have F2 cancelled out (see Identifying voiced consonants below). This is far from completely reliable, but it is a clue.
Fricatives all have noise-like turbulence and can be told apart by looking at the energy of the turbulence at different frequency ranges. In Mandarin, there are six fricatives [f ɕ ʂ s x ʐ]. Let’s first deal with some of the easier ones.
- [ʐ] can be easily identified because it’s voiced (see Identifying voiced consonants below). Remember to combine the information about the fricative with the following vowel since many of the fricatives are in complementary distribution.
- [f] is a non-sibilant and generally a lot weaker than the other fricatives (including [x]) and shouldn’t be too hard to identify. The energy also increases quite uniformly with frequency (see picture below).
- [x] has less evenly distributed energy (several discernible concentrations at different frequency levels). Compare the pictures of “heng” and “feng” below:
Now let’s have a look at the three remaining fricatives [ɕ ʂ s]. The first thing you need to do when identifying fricatives is to make sure you’re displaying the whole spectrogram (Praat is by default set to show 0-5000, which is not enough; set the upper limit to at least 10000, possibly even 15000).
If you don’t know anything about the speaker, it will be difficult, because all of these things vary between individuals, but if you see a few samples, you can still calibrate your guesses. The easiest way to deal with [ɕ] is to look at the following vowel (which is usually relatively easy to identify). Since [ɕ] is in complementary distribution with [ʂ s], we will only look at how to tell the latter two apart here.
In general, the main difference between the retroflex fricative [ʂ] and its non-retroflex friend [s] is that the energy of [ʂ] starts at a much, much lower frequency; see the spectrograms of “sa” and “sha” below. The exact frequency ranges might be different depending on the environment, so [ʂa] might not be identical to [tʂa], but the general trend is still there (and the difference is usually very large).
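If you want to put a number on where the friction “starts”, one possible approach is to take a spectrum of the fricative as (frequency, level) pairs and find the lowest frequency that comes within some margin of the strongest point. The Python sketch below does just that. The function name, the 20 dB margin and the sample values are all assumptions of mine, invented for illustration, and not measurements from my recordings.

```python
def frication_lower_edge(spectrum, drop_db=20.0):
    """Return the lowest frequency whose level is within drop_db of the peak.

    spectrum is a list of (frequency_hz, level_db) pairs.
    """
    points = sorted(spectrum)
    if not points:
        raise ValueError("empty spectrum")
    peak = max(level for _, level in points)
    for freq, level in points:
        if level >= peak - drop_db:
            return freq


# Invented values: energy concentrated high up versus starting much lower.
s_like = [(1000, 5), (2000, 8), (3000, 10), (4000, 20),
          (5000, 35), (6000, 45), (7000, 50), (8000, 48)]
sh_like = [(1000, 10), (2000, 32), (3000, 45), (4000, 50),
           (5000, 47), (6000, 40)]

print(frication_lower_edge(s_like))
print(frication_lower_edge(sh_like))
```

The absolute numbers mean little on their own; the point is to compare a few known samples from the same speaker and see which side of the gap an unknown fricative falls on.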
Identifying stops is by far the hardest part, and I don’t think it’s even theoretically possible to reach very high accuracy. The reason is that the stops are too brief to identify properly and aren’t in complementary distribution, so looking at the following vowel seldom helps. The only clue is often formant transitions.
According to locus theory, all consonants have a target frequency for each formant, even though this might be influenced by adjacent sounds. This means that the transition of the formants (F2 and F3) can help us identify the plosives themselves. This picture is taken from Kevin Russel’s phonetics site (University of Manitoba).
In general, we can see a pattern that looks as follows:
- Bilabial locus frequency: Low F2, low F3
- Alveolar locus frequency: Mid F2, high F3
- Velar locus frequency: High F2, mid F3
Read more here, here and here. This is all very good in theory, but I find it very hard to actually use this to determine the plosive in question. Sometimes the transitions are hard to see or they simply don’t fit the patterns described above.
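The three locus patterns listed above amount to a tiny lookup table. Here it is as a Python sketch, with the caveat already mentioned: real transitions often don’t fit neatly, so anything that doesn’t match comes back as “unclear”. The names `LOCUS_PATTERNS` and `place_from_locus` are made up for this example, and you still have to judge “low”, “mid” and “high” by eye.

```python
# (F2 locus, F3 locus) -> (place of articulation, candidate Pinyin stops)
LOCUS_PATTERNS = {
    ("low", "low"): ("bilabial", ["b", "p"]),
    ("mid", "high"): ("alveolar", ["d", "t"]),
    ("high", "mid"): ("velar", ["g", "k"]),
}


def place_from_locus(f2, f3):
    """Map rough F2/F3 locus judgements to a place of articulation."""
    return LOCUS_PATTERNS.get((f2, f3), ("unclear", []))


print(place_from_locus("low", "low"))
print(place_from_locus("mid", "mid"))
```

The lookup only narrows things down to a pair of stops; telling the two members of each pair apart is a question of aspiration.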
Identifying aspiration is usually not very difficult, but can be somewhat complicated by affricates (which look a little bit like aspirated stops) and aspirated affricates such as [t͡ɕʰ t͡sʰ ʈ͡ʂʰ]. Let’s start with the main difference between the non-aspirated stops [p t k] and their aspirated counterparts.
The main difference is in the interval between the stop release and the onset of voicing (voice onset time, VOT). Non-aspirated stops have a very short VOT, usually 10-35 ms, whereas aspirated stops have a much longer VOT, usually 70-100 ms (Chao & Chen, 2008). Let’s look at the [t tʰ] pair as an example:
The next problem is to separate affricates from aspirated stops. This is relatively easy if we know what fricatives look like (and we do, see Identifying fricatives above). The aspirated part looks very much like breathing out sharply [h], which is the frictionless version of Pinyin “h”. The following spectrogram is such a (relatively) frictionless [h] in “ha”:
As we know from our comparisons of fricatives, they don’t have such a uniform frequency distribution, so if we compare the pair [tʰ t͡sʰ], it should be relatively easy to see both the friction and the aspiration, although the two certainly overlap to a certain extent:
Finally, we need to look at aspirated versus non-aspirated affricates, e.g. Pinyin “z” [t͡s] and “c” [t͡sʰ]. As expected, we see that the fricative part similar to [s] is there for both affricates, but that the aspirated [h] part is missing for [t͡s], which therefore has a substantially shorter VOT:
If you can’t see the fricative, you probably need to adjust the spectrogram settings. The above diagram stops at 6000 Hz, which isn’t really enough to analyse fricatives, see Identifying fricatives above.
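The decisions in this section can be summarised in a short Python sketch. The first function applies the VOT ranges quoted above for stops; the 50 ms boundary is my own assumption, picked simply because it falls in the gap between 35 and 70 ms. The second combines the two things you look for in the spectrogram, friction and aspiration. Both function names are hypothetical.

```python
def is_aspirated_stop(vot_ms, boundary_ms=50.0):
    """Guess aspiration for a plain stop from its VOT in milliseconds.

    boundary_ms is an assumed cut-off between the 10-35 ms and
    70-100 ms ranges; it is not taken from the cited study.
    """
    return vot_ms >= boundary_ms


def classify_onset(has_frication, has_aspiration):
    """Name the onset type from what is visible after the release."""
    kind = "affricate" if has_frication else "stop"
    prefix = "aspirated" if has_aspiration else "unaspirated"
    return f"{prefix} {kind}"


print(is_aspirated_stop(20))          # within the unaspirated range
print(is_aspirated_stop(85))          # within the aspirated range
print(classify_onset(False, True))    # e.g. [tʰ]
print(classify_onset(True, False))    # e.g. [t͡s]
print(classify_onset(True, True))     # e.g. [t͡sʰ]
```

I have kept the VOT cut-off separate from the affricates on purpose, since the quoted ranges are for stops and the friction itself adds to the time before voicing starts.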
Identifying voiced consonants
This is one of the trickier parts. There are four voiced (initial) consonants [m n l ʐ]. First, [ʐ] is a fricative and should be quite easy to identify. If you look at the picture below, you can clearly see the fricative turbulence and the voicing:
The remaining three are much, much harder and are often indistinguishable just by looking at the spectrogram because they have similar F1 and F2. I have found no way of reliably telling them apart this way, but there are clues in the waveform.
Let’s start with [l], which has a glottal perturbation (creak) in each cycle, which is fairly easy to spot (the “craggy” looking bits, compare this with the waveforms of [m n] below):
I have found no reliable way of separating [m n], but F2 seems more likely to be cancelled out by anti-resonance in [n] compared to [m].
Formant transitions for [m] are similar to those for [b p], while those for [n] are similar to [d t s z], but this can be very hard to see. Read more about this here.
I’d be really surprised if anyone actually reads this far, but if you do and think this is interesting and/or fun, feel free to have a go at the following spectrograms. Which Mandarin syllables do they represent? Post a comment with your answers!
It’s been both entertaining and educational to write this guide. There’s obviously more to spectrogram analysis than I have written here. My goal was simply to use what I have learnt in the past year or so to see what I could do with Mandarin syllables (which are a lot easier to analyse than, say, English or Swedish). This article probably contains some errors, so if you find anything that looks weird, let me know! If you want more challenges, you can head over to Robert Hagiwara’s Monthly Mystery Spectrogram page. It hasn’t been updated for a long time, but it still contains a lot of useful information!
References and further reading
Here is a list of books, articles and websites that I’ve found useful. I also want to thank professors 朱川 and 曾金金 whose courses in phonetics I have attended. It’s so much easier to learn these things in collaborative discussions in class compared with on one’s own!
鄭靜宜. (2011). 語音聲學：說話聲音的科學. 心理出版社.
王理嘉、林燾. (2013). 語音學教程. 五南出版社.
曾金金. (2008). 華語語音資料庫及數位學習應用. 新學林出版社.
朱川. (2013). 外國學生漢語語音學習對策(增訂本). 新學林出版社.
Boersma, P., & Weenink, D. (2005). Praat: doing phonetics by computer (Version 4.3.01).
Chao, K. Y., & Chen, L. M. (2008). A cross-linguistic study of voice onset time in stop consonant productions. Computational Linguistics and Chinese Language Processing, 13(2), 215-232.
Duanmu, San. (2007). The phonology of standard Chinese. Oxford University Press.
Macquarie University. (2008). Speech Acoustics Topics.
Wikipedia. Mandarin Phonology.
Wikipedia. Acoustic phonetics (and related topics).