Last Webpage Update: December 1, 2009
News: IPHOD.com was hacked, but has been repaired and secured.
Sorry for the disruption! The long-awaited IPhOD version 2.0 is now
available for download. Version 1.4 can also be downloaded, or searched
online.
The Irvine Phonotactic Online Dictionary (IPhOD)
IPhOD is a large collection of English words and pseudowords developed at UC Irvine
for research on phonological processes in speech perception and production. It may be used to select
items for experiments, according to sublexical and lexical phonological measures: for example, how
many words sound like "cat", or how often the "KA" or "AT" sound sequences occur in English. IPhOD
is freely available online to search or download, so other researchers can use it in their studies.
There is also a blog that provides a forum for feedback,
questions, and suggestions, or you can email Kenny Vaden.
Citing IPhOD: Please cite your use of IPhOD in the following way (either as version 1.4 or 2.0):
Vaden, K.I., Hickok, G.S., & Halpin, H.R. (2009). Irvine Phonotactic Online Dictionary, Version 2.0*. [Data file].
Available from http://www.iphod.com.
Introduction
A growing body of evidence demonstrates that we segment (Saffran et al., 1996), respond to (Vitevitch et al., 1999),
produce (Vitevitch et al., 2004), and remember (Majerus et al., 2004) speech in ways that are affected by different kinds of
phonological frequency information.
However, this speech research is restricted by the limited number of pronunciation collections or utilities, which often derive
their estimates narrowly (for instance, using small word collections) or present limited measurement choices. While some collections,
notably CELEX, address some of these concerns, they use British stress and pronunciation, which is suboptimal for speech research
with subjects who speak American English. The Phonotactic Probability
Calculator (Vitevitch & Luce, 2004) uses American English pronunciations to compute position-specific phonotactic probabilities,
but provides no density estimates or frequency weighting options. Despite growing interest in phonotactic information, it remains
unavailable or difficult to derive for novel hypotheses with contemporary tools.
The current version (2.0) of the Irvine Phonotactic Online Dictionary (IPhOD) is a collection of phonotactic
estimates calculated across a broad sample to enable precise verbal stimuli selection for speech research and application in
cognitive science, computational linguistics, and natural language processing. IPhOD contains phonotactic and density estimates,
American English transcriptions of 1-28 phonemes, and word frequencies for 54,030 word and 814,840 pseudoword entries. Pseudowords are
defined here as word-like transcriptions consisting entirely of phoneme-pairs from real English words. Pseudowords like these are
used in computational psycholinguistics to study non-semantic language processes, since they have little meaning or association but
are consistent enough with a language to sound like typical words. The collection is freely available for download.
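The pseudoword definition above can be illustrated with a minimal Python sketch. It assumes a toy lexicon of three words with simplified transcriptions; the function names and the data are made up for illustration and are not part of IPhOD.

```python
def biphones(phonemes):
    """Return the adjacent phoneme pairs in a transcription."""
    return list(zip(phonemes, phonemes[1:]))


def attested_biphones(lexicon):
    """Collect every phoneme pair that occurs in at least one real word."""
    pairs = set()
    for phonemes in lexicon.values():
        pairs.update(biphones(phonemes))
    return pairs


def is_pseudoword(phonemes, lexicon):
    """True if the transcription is not a real word but every one of its
    phoneme pairs is attested in the lexicon."""
    known = attested_biphones(lexicon)
    real_words = {tuple(p) for p in lexicon.values()}
    candidate = tuple(phonemes)
    if candidate in real_words:
        return False
    return all(pair in known for pair in biphones(candidate))


# Toy lexicon (hypothetical, simplified transcriptions without stress).
toy_lexicon = {
    "cat": ["K", "AE", "T"],
    "tab": ["T", "AE", "B"],
    "back": ["B", "AE", "K"],
}

print(is_pseudoword(["K", "AE", "B"], toy_lexicon))  # True: K-AE and AE-B both occur
print(is_pseudoword(["K", "AE", "T"], toy_lexicon))  # False: "cat" is a real word
print(is_pseudoword(["T", "K", "AE"], toy_lexicon))  # False: T-K never occurs
```

The sketch treats a single-phoneme candidate as trivially valid because it has no phoneme pairs; a real implementation would need to decide how to handle that case.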
Each IPhOD entry contains an American English phonetic transcription from the Carnegie-Mellon Pronouncing Dictionary
(Weide, 1994), and written word frequencies from the SUBTLEXus database (Brysbaert & New, 2009). Neighborhood density and word-averaged
phoneme-sequence probabilities were calculated from those data using the same formulas for words and
pseudowords, so that entries of either type can be chosen using identical criteria. Phonotactic probability and neighborhood
density are calculated over the entire word set, after the approach of Vitevitch and Luce (1999).
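As a rough illustration of neighborhood density, the sketch below counts the words in a toy lexicon that differ from a target by exactly one phoneme substitution, addition, or deletion. That one-phoneme rule is a common definition assumed here for illustration; the exact IPhOD formulas are not shown on this page, and the lexicon and function names are hypothetical.

```python
def is_neighbor(a, b):
    """True if transcription b differs from a by exactly one phoneme
    substitution, addition, or deletion (assumed definition)."""
    if a == b:
        return False
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # Same length: exactly one position may differ (substitution).
        return sum(x != y for x, y in zip(a, b)) == 1
    # Lengths differ by one: deleting one phoneme from the longer
    # transcription must yield the shorter one.
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def neighborhood_density(target, lexicon):
    """Count the lexicon entries that are neighbors of the target."""
    target = tuple(target)
    return sum(is_neighbor(target, tuple(p)) for p in lexicon.values())


# Toy lexicon (hypothetical, simplified transcriptions without stress).
toy_lexicon = {
    "cat": ["K", "AE", "T"],
    "bat": ["B", "AE", "T"],
    "cut": ["K", "AH", "T"],
    "cap": ["K", "AE", "P"],
    "at": ["AE", "T"],
    "cast": ["K", "AE", "S", "T"],
    "dog": ["D", "AO", "G"],
}

print(neighborhood_density(["K", "AE", "T"], toy_lexicon))  # 5
```

Because the same function accepts any transcription, it applies equally to words and pseudowords, which mirrors the identical-criteria design described above.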
Phonotactic probabilities refer to the relative frequency of the sound sequences that are present in a
given word. The phonotactic measures in IPhOD extend the definitions from Vitevitch and Luce (1999), elaborated upon below. The
database was calculated with an additional consideration for the impact of differentiating stressed forms of vowels to constrain
phonetic probabilities, as opposed to ignoring stress-differentiation in speech.
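The effect of differentiating stressed vowels can be sketched as follows. This is a simplified, unweighted biphone estimate (the count of each phoneme pair divided by the total number of pairs), not the IPhOD formula, and the transcriptions are illustrative CMU-style examples in which a digit on a vowel marks its stress level.

```python
from collections import Counter


def strip_stress(phonemes):
    """Remove trailing stress digits (0, 1, 2) from vowel symbols."""
    return [p.rstrip("012") for p in phonemes]


def biphone_probabilities(lexicon, use_stress=True):
    """Relative frequency of each phoneme pair across the lexicon."""
    counts = Counter()
    for phonemes in lexicon.values():
        if not use_stress:
            phonemes = strip_stress(phonemes)
        counts.update(zip(phonemes, phonemes[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def word_average(phonemes, probs, use_stress=True):
    """Mean biphone probability over the pairs in one transcription."""
    if not use_stress:
        phonemes = strip_stress(phonemes)
    pairs = list(zip(phonemes, phonemes[1:]))
    return sum(probs.get(pair, 0.0) for pair in pairs) / len(pairs)


# Toy lexicon (hypothetical); the two "permit" entries differ only in stress.
toy_lexicon = {
    "cat": ["K", "AE1", "T"],
    "bat": ["B", "AE1", "T"],
    "permit_noun": ["P", "ER1", "M", "IH2", "T"],
    "permit_verb": ["P", "ER0", "M", "IH1", "T"],
}

with_stress = biphone_probabilities(toy_lexicon, use_stress=True)
without_stress = biphone_probabilities(toy_lexicon, use_stress=False)

# Keeping stress splits pairs such as P-ER into P-ER1 and P-ER0, so the
# same counts are spread over more distinct sequences.
print(len(with_stress), len(without_stress))
print(word_average(["K", "AE1", "T"], with_stress))
```

In this toy example, keeping stress yields more distinct phoneme pairs than ignoring it, which is the sense in which stress differentiation constrains the probability estimates.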
Accessing IPhOD: Downloads, Searches, Calculator
The IPhOD can be used in one of several ways. First, it may be useful to find words or pseudowords with values
in a specific range. For example, what pseudowords have between 20 and 25 phonological neighbors?
A second approach is to determine what the values are for specified words or pseudowords, for
example: what are the word frequencies for cat, dog, tree, car? We have developed several ways to access the information
in IPhOD, depending on your goal.
The IPhOD database can be downloaded in its entirety (text files) from the
download page. These files can be
opened using most available spreadsheet programs, or custom Perl scripts. A second option is to
search the database online, by
entering value ranges or word lists to obtain results. Finally, there is an
online calculator that produces phonotactic and
density values for lists of phonemic transcriptions that are entered by the user. An advantage of the latter two approaches is
that you can specify which output fields to include in results, and leave out columns that are not of interest. The online
calculator is helpful for words or pseudowords that are not included in the IPhOD database.
*Please note that the online search and calculator are based on version 1.4 of IPhOD, which used Kucera-Francis
(1967) word frequencies as the basis for frequency-weighted calculations. Version 2.0 instead used SUBTLEXus (Brysbaert & New, 2009).
For more information on the differences between the earlier and current version of IPhOD, click here.
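For the downloaded text files, a range query like the one described above (pseudowords with between 20 and 25 phonological neighbors) could be scripted along these lines. The sketch assumes a tab-delimited file with a header row; the column names "Word" and "Density" and the sample rows are made up for illustration and should be replaced with the actual field names and data from the download.

```python
import csv
import io

# Made-up sample standing in for a downloaded file (not real IPhOD values).
SAMPLE = "Word\tDensity\nblick\t22\nfrolp\t3\nzat\t25\nmib\t19\n"


def rows_in_range(handle, column, low, high, delimiter="\t"):
    """Return the rows whose value in `column` lies in [low, high]."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [row for row in reader if low <= float(row[column]) <= high]


matches = rows_in_range(io.StringIO(SAMPLE), "Density", 20, 25)
print([row["Word"] for row in matches])  # ['blick', 'zat']
```

To run this against a real file, pass an open file handle instead of the `io.StringIO` sample, for example `open(path, newline="")`, where `path` points to the downloaded text file.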
Credits
The database was developed by Kenny Vaden, advised by Greg Hickok (Cognitive Sciences, UC Irvine).
We gratefully acknowledge the help of Jean-Claude Falmagne (Department of Cognitive Sciences, UC Irvine) in developing formal equations
(not shown), Harry Halpin (Informatics, University of Edinburgh) for the original XML markup and online search functions, and Kai Okada
(Cognitive Sciences, UC Irvine) for her extensive consultation and clarifications. Error-checking was assisted by undergraduate
research assistants Yasmine Omidvar (Fall 2003 - Spring 2004) and Corica Rodgers (Summer 2004). Thank you.
References
Brysbaert, M. and New, B. 2009. Moving beyond Kucera and Francis: a critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41: 977-990.
Griffin, Zenzi M. and Bock, Kathryn. 1998. Constraint, Word Frequency, and the Relationship between Lexical Processing Levels in Spoken Word Production. Journal of Memory and Language, 38(3): 313-338.
Hulme, C.; Roodenrys, S.; Schweickert, R.; Brown, G.D.A.; Martin, S.; and Stuart, G. 1997. Word-frequency effects on short-term memory tasks. Journal of Experimental Psychology - Learning, Memory and Cognition, 23(5): 1217-1232.
Kucera, Henry and Francis, W. Nelson. 1967. Computational analysis of present-day American English. Providence, Brown University Press.
Majerus, Steve; Van der Linden, Martial; Mulder, Ludivine; Meulemans, Thierry; and Peters, Frederic 2004. Verbal short-term memory reflects the sublexical organization of the phonological language network. Journal of Memory and Language, 51(2):297-306.
Saffran, Jenny R.; Newport, Elissa L.; and Aslin, Richard N. 1996. Word Segmentation: The Role of Distributional Cues. Journal of Memory and Language, 35(4):606-621.
Vitevitch, Michael S.; Armbruster, Jonna; and Chu, Shinying. 2004. Sublexical and Lexical Representations in Speech Production: Effects of Phonotactic Probability and Onset Density. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30(2):514-529.
Vitevitch, Michael S.; Luce, Paul A.; Pisoni, David B.; and Auer, Edward T. 1999. Phonotactics, Neighborhood Activation, and Lexical Access for Spoken Words. Brain and Language, 68(1):306-311.
Vitevitch, Michael S. and Luce, Paul A. 2004. A Web-based interface to calculate phonotactic probability for words and nonwords in English. Behavior Research Methods, Instruments, & Computers 36 (3): 481-487.
Weide, Robert L. 1994. CMU Pronouncing Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
Wilson, M. 1988. MRC Psycholinguistic Database: Machine Readable Dictionary, Version 2. Behavior Research Methods, Instruments, & Computers, 20(1), 6-11. http://www.psy.uwa.edu.au/mrcdatabase/uwa_mrc.htm
Publications Citing IPhOD:
Creel, S.C., Aslin, R.N., Tanenhaus, M.K. (2008). Heeding the voice of experience: the role of talker variation in lexical access. Cognition, 106, 633-664.
Desai, R., Binder, J.R., Medler, D.A., Conant, L.L., Seidenberg, M.S. (2006). Activation of sensory-motor areas by sentences. Fifth International Conference on Development and Learning, 2006.
Sabri, M., Binder, J.R., Desai, R., Medler, D.A., Leitl, M.D., Liebenthal, E. (2007). Attentional and linguistic interactions in speech perception. NeuroImage, 39(3), 1444-1456.
Vaden, K., Hickok, G. (2009). Sublexical and lexical processing in temporal and frontal lobes during word recognition. Society for Neuroscience, Chicago, IL.