Introduction to IPhOD
A growing body of computational psycholinguistic evidence indicates that we segment (Saffran et al., 1996),
respond (Vitevitch et al., 1999), produce (Vitevitch et al., 2004), and remember (Majerus et al., 2004) speech in ways that are
affected by phonological frequency information. However, speech research is restricted by the limited number of pronunciation
collections and utilities, and the phonemic frequency distributions that are available often derive their estimates narrowly (for instance, from
small word collections) or do not take stress variation into account where it may be useful. While some collections,
notably CELEX, address some of these concerns, they use British stress and pronunciation, which is suboptimal for speech research
with subjects who speak American English. The Phonotactic Probability
Calculator (Vitevitch & Luce, 2004) uses American English pronunciations to compute position-specific phonotactic probabilities,
but provides no density estimates, frequency weighting options, or stress considerations. Despite growing interest in phonotactic
information, such estimates remain unavailable or difficult to derive when testing novel hypotheses.
The current version (1.3) of the Irvine Phonotactic Online Dictionary (IPhOD) is a collection of phonotactic
estimates calculated across a broad sample to enable precise selection of verbal stimuli for speech research and applications in
cognitive science, computational linguistics, and natural language processing. IPhOD contains phonotactic and density estimates,
American English transcriptions of 2-17 phonemes, and word frequencies for 33,432 words and 815,066 pseudowords. Pseudowords are
defined here as word-like transcriptions consisting entirely of phoneme-pairs from real English words. Pseudowords like these are
used in computational psycholinguistics to study non-semantic language processes, since they have little meaning or association but
are consistent enough with a language to sound like typical words. The collection, along with searches by variable and value range,
is freely available online in the Tools section at http://lcbr.ss.uci.edu/ or
http://www.iphod.com.
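The pseudoword definition above can be illustrated with a minimal sketch. It assumes a transcription is a list of CMU-style phoneme symbols and checks that a candidate is not itself a real word and that every adjacent phoneme pair occurs in at least one real word. The toy lexicon and the function names are hypothetical and are not part of IPhOD; the actual generation procedure may apply further constraints.

```python
def biphones(phonemes):
    """Return the adjacent phoneme pairs of a transcription."""
    return list(zip(phonemes, phonemes[1:]))


def attested_biphones(lexicon):
    """Collect every phoneme pair that occurs in at least one real word."""
    seen = set()
    for phonemes in lexicon.values():
        seen.update(biphones(phonemes))
    return seen


def is_pseudoword(candidate, lexicon):
    """True if candidate is not a real word and all of its pairs are attested."""
    real_words = {tuple(p) for p in lexicon.values()}
    if tuple(candidate) in real_words or len(candidate) < 2:
        return False
    attested = attested_biphones(lexicon)
    return all(pair in attested for pair in biphones(candidate))


# Toy lexicon (hypothetical), using CMU-style symbols with stress digits.
TOY_LEXICON = {
    "cat": ["K", "AE1", "T"],
    "tap": ["T", "AE1", "P"],
    "pack": ["P", "AE1", "K"],
}

print(is_pseudoword(["K", "AE1", "P"], TOY_LEXICON))  # pairs come from "cat" and "tap"
print(is_pseudoword(["P", "K", "AE1"], TOY_LEXICON))  # the pair (P, K) is unattested
```

In this toy example the first candidate qualifies because both of its pairs appear in real words, while the second does not.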
Each IPhOD entry contains an American English phonetic transcription from the Carnegie Mellon University (CMU) Pronouncing Dictionary
(Weide, 1994) and Kucera-Francis written word frequencies (1967) from the MRC Psycholinguistic Dictionary (Wilson, 1988). Neighborhood
density and word-averaged phoneme-sequence probabilities were extrapolated from those data using the same formulas for words and
pseudowords, so that entries of either type can be chosen using identical criteria. IPhOD estimates are calculated broadly: phonotactic
probability and neighborhood density are computed over the entire word set, similar to Vitevitch and Luce (1999).
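As a rough illustration of neighborhood density, the sketch below counts the words in a lexicon that differ from a target by exactly one phoneme substitution, insertion, or deletion. That one-phoneme definition is a common convention and is assumed here; the text above does not spell out the exact formula IPhOD uses. The function names, toy lexicon, and toy frequencies are hypothetical. Passing a dictionary of word weights gives a frequency-weighted density, and passing log frequencies would give a log-weighted one.

```python
def strip_stress(phonemes):
    """Drop CMU stress digits (0, 1, 2) so that vowel stress is ignored."""
    return tuple(p.rstrip("012") for p in phonemes)


def is_neighbor(a, b):
    """True if tuples a and b differ by one substitution, insertion, or deletion."""
    if a == b:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    short, longer = (a, b) if len(a) < len(b) else (b, a)
    # Deleting one phoneme from the longer form must yield the shorter form.
    return any(longer[:i] + longer[i + 1:] == short for i in range(len(longer)))


def density(target, lexicon, weights=None, ignore_stress=False):
    """Sum of neighbor weights; each neighbor counts 1.0 when weights is None."""
    normalize = strip_stress if ignore_stress else tuple
    t = normalize(target)
    total = 0.0
    for word, phonemes in lexicon.items():
        if is_neighbor(t, normalize(phonemes)):
            total += 1.0 if weights is None else weights.get(word, 0.0)
    return total


# Toy data (hypothetical): transcriptions and made-up frequencies.
TOY_LEXICON = {
    "cat": ["K", "AE1", "T"],
    "bat": ["B", "AE1", "T"],
    "at": ["AE1", "T"],
    "cast": ["K", "AE1", "S", "T"],
    "cut": ["K", "AH1", "T"],
    "dog": ["D", "AO1", "G"],
}
TOY_FREQS = {"bat": 2.0, "at": 10.0}

target = ["K", "AE2", "T"]  # hypothetical form with secondary stress
print(density(target, TOY_LEXICON))                      # stressed vowels kept distinct
print(density(target, TOY_LEXICON, ignore_stress=True))  # vowel stress ignored
```

With stress kept distinct, a stress difference on a vowel counts as a substitution, so the stressed and unstressed densities for the same target can differ.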
Estimates contained in IPhOD, by column:
Column | Heading | Description |
1 | Word | Orthographic form of word, or altered word that generated pseudoword |
2 | NPHON | Number of phonemes |
3 | NSYL | Number of syllables |
4 ... 20 | PH01...17 | CMU Pronouncing Dictionary phonetic transcription, one phoneme per column (vowel stress digits: 1 primary, 2 secondary, 0 unstressed) |
21 | strDENS | Stressed phonological neighborhood density; distinct stressed-vowels |
22 | strFDEN | strDENS weighted with Kucera-Francis frequency of neighbors |
23 | strLDEN | strDENS weighted with Kucera-Francis log frequency of neighbors |
24 | unsDENS | Unstressed phonological neighborhood density; vowel-stress ignored |
25 | unsFDEN | unsDENS weighted with Kucera-Francis frequency of neighbors |
26 | unsLDEN | unsDENS weighted with Kucera-Francis log frequency of neighbors |
27 | strBPAV | Stressed biphoneme probability average; distinct stressed-vowels |
28 | strFBPAV | strBPAV weighted with Kucera-Francis word frequency |
29 | strLBPAV | strBPAV weighted with log Kucera-Francis word frequency |
30 | unsBPAV | Unstressed biphoneme probability average; vowel-stress ignored |
31 | unsFBPAV | unsBPAV weighted with Kucera-Francis word frequency |
32 | unsLBPAV | unsBPAV weighted with log Kucera-Francis word frequency |
33 | strTPAV | Stressed triphoneme probability average; distinct stressed-vowels |
34 | strFTPAV | strTPAV weighted with Kucera-Francis frequency |
35 | strLTPAV | strTPAV weighted with log Kucera-Francis frequency |
36 | unsTPAV | Unstressed triphoneme probability average; vowel-stress ignored |
37 | unsFTPAV | unsTPAV weighted with Kucera-Francis frequency |
38 | unsLTPAV | unsTPAV weighted with log Kucera-Francis frequency |
39 | strPOSPAV | Stressed positional probability average; distinct stressed-vowels |
40 | strFPOSPAV | strPOSPAV weighted with Kucera-Francis frequency |
41 | strLPOSPAV | strPOSPAV weighted with log Kucera-Francis frequency |
42 | unsPOSPAV | Unstressed positional probability average; vowel-stress ignored |
43 | unsFPOSPAV | unsPOSPAV weighted with Kucera-Francis frequency |
44 | unsLPOSPAV | unsPOSPAV weighted with log Kucera-Francis frequency |
45 | KFFREQ | Kucera-Francis Written Word Frequency for real words |
46 | LOGFRQ | log Kucera-Francis Written Word Frequency for real words |
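The relationship between the unweighted, frequency-weighted, and log-weighted biphoneme averages in the table can be sketched as follows. This is one plausible reading, not IPhOD's documented formula. It assumes that a biphoneme's probability is its weighted share of all biphoneme tokens in the word set, that a word's average is the mean over its own biphonemes, and that the log weight is log10(frequency + 1). The function names, toy lexicon, and frequencies are hypothetical.

```python
import math
from collections import Counter


def biphones(phonemes, ignore_stress=False):
    """Adjacent phoneme pairs, optionally with CMU stress digits removed."""
    if ignore_stress:
        phonemes = [p.rstrip("012") for p in phonemes]
    return list(zip(phonemes, phonemes[1:]))


def biphone_probabilities(lexicon, weight=lambda word: 1.0, ignore_stress=False):
    """Weighted share of each biphoneme among all biphoneme tokens."""
    counts = Counter()
    for word, phonemes in lexicon.items():
        for pair in biphones(phonemes, ignore_stress):
            counts[pair] += weight(word)
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}


def biphone_average(phonemes, probs, ignore_stress=False):
    """Mean biphoneme probability of one transcription (0.0 if it has no pairs)."""
    pairs = biphones(phonemes, ignore_stress)
    if not pairs:
        return 0.0
    return sum(probs.get(pair, 0.0) for pair in pairs) / len(pairs)


# Toy data (hypothetical): transcriptions and made-up word frequencies.
TOY_LEXICON = {
    "cat": ["K", "AE1", "T"],
    "cap": ["K", "AE1", "P"],
    "tap": ["T", "AE1", "P"],
}
TOY_FREQS = {"cat": 3, "cap": 1, "tap": 1}

plain = biphone_probabilities(TOY_LEXICON)
by_freq = biphone_probabilities(TOY_LEXICON, weight=lambda w: TOY_FREQS[w])
by_log = biphone_probabilities(TOY_LEXICON, weight=lambda w: math.log10(TOY_FREQS[w] + 1))

word = ["K", "AE1", "T"]
print(biphone_average(word, plain))    # analogous to strBPAV
print(biphone_average(word, by_freq))  # analogous to strFBPAV
print(biphone_average(word, by_log))   # analogous to strLBPAV
```

Passing `ignore_stress=True` to both functions gives the analogue of the uns columns. The same pattern would extend to triphonemes by pairing three adjacent phonemes instead of two.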
References
Griffin, Zenzi M. and Bock, Kathryn. 1998. Constraint, Word Frequency, and the Relationship between Lexical Processing Levels in Spoken Word Production. Journal of Memory and Language, 38(3): 313-338.
Hulme, C.; Roodenrys, S.; Schweickert, R.; Brown, G.D.A.; Martin, S.; and Stuart, G. 1997. Word-frequency effects on short-term memory tasks. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23(5): 1217–1232.
Kucera, Henry and Francis, W. Nelson. 1967. Computational analysis of present-day American English. Providence, Brown University Press.
Majerus, Steve; Van der Linden, Martial; Mulder, Ludivine; Meulemans, Thierry; and Peters, Frederic. 2004. Verbal short-term memory reflects the sublexical organization of the phonological language network. Journal of Memory and Language, 51(2):297-306.
Saffran, Jenny R.; Newport, Elissa L.; and Aslin, Richard N. 1996. Word Segmentation: The Role of Distributional Cues. Journal of Memory and Language, 35(4):606–621.
Vitevitch, Michael S.; Armbrüster, Jonna; and Chu, Shinying. 2004. Sublexical and Lexical Representations in Speech Production: Effects of Phonotactic Probability and Onset Density. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30(2):514-529.
Vitevitch, Michael S.; Luce, Paul A.; Pisoni, David B.; and Auer, Edward T. 1999. Phonotactics, Neighborhood Activation, and Lexical Access for Spoken Words. Brain and Language, 68(1):306-311.
Vitevitch, Michael S. and Luce, Paul A. 2004. A Web-based interface to calculate phonotactic probability for words and nonwords in English. Behavior Research Methods, Instruments, & Computers 36 (3): 481–487.
Weide, Robert L. 1994. CMU Pronouncing Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
Wilson, M. 1988. The MRC Psycholinguistic Database: Machine Readable Dictionary, Version 2. Behavioural Research Methods, Instruments and Computers, 20(1): 6-11.