Last Webpage Update: December 1, 2009

News: IPHOD.com was hacked, but has been repaired and secured. Sorry for the disruption! The long-awaited IPhOD version 2.0 is now available for download. Version 1.4 can also be downloaded, or searched online.

The Irvine Phonotactic Online Dictionary (IPhOD)

IPhOD is a large collection of English words and pseudowords developed at UC Irvine for research on phonological processes in speech perception and production. It may be used to select items for experiments according to sublexical and lexical phonological measures: for example, how many words sound like "cat", or how often the "KA" or "AT" sound sequences occur in English. IPhOD is freely available online to search or download, so other researchers can use it in their studies. There is also a blog that provides a forum for feedback, questions, and suggestions, or you can contact Kenny Vaden by email.

Citing IPhOD: Please cite your use of IPhOD in the following way (either as version 1.4 or 2.0):

Vaden, K.I., Hickok, G.S., & Halpin, H.R. (2009). Irvine Phonotactic Online Dictionary, Version 2.0*. [Data file]. Available from http://www.iphod.com.

Introduction

A growing body of evidence demonstrates that we segment (Saffran et al., 1996), respond to (Vitevitch et al., 1999), produce (Vitevitch et al., 2004), and remember (Majerus et al., 2004) speech in ways that are affected by different kinds of phonological frequency information. However, this speech research is restricted by the limited number of pronunciation collections and utilities, which often derive their estimates narrowly (for instance, from small word collections) or offer limited measurement choices. While some collections, notably CELEX, address some of these concerns, they use British stress and pronunciation, which is suboptimal for speech research with subjects trained in American English. The Phonotactic Probability Calculator (Vitevitch & Luce, 2004) uses American English pronunciations to compute position-specific phonotactic probabilities, but provides no density estimates or frequency weighting options. Despite growing interest in phonotactic information, it remains unavailable or difficult to derive for novel hypotheses with contemporary tools.

The current version (2.0) of the Irvine Phonotactic Online Dictionary (IPhOD) is a collection of phonotactic estimates calculated across a broad sample to enable precise verbal stimuli selection for speech research and application in cognitive science, computational linguistics, and natural language processing. IPhOD contains phonotactic and density estimates, American English transcriptions of 1-28 phonemes, and word frequencies for 54030 word and 814840 pseudoword entries. Pseudowords are defined here as word-like transcriptions consisting entirely of phoneme-pairs from real English words. Pseudowords like these are used in computational psycholinguistics to study non-semantic language processes, since they have little meaning or association but are consistent enough with a language to sound like typical words. The collection is freely available for download.
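The pseudoword definition above can be illustrated with a minimal sketch. It is not the procedure used to build IPhOD: the function names and the three-word toy lexicon are hypothetical, and the check simply tests whether every adjacent phoneme pair in a transcription is attested in at least one real word.

```python
def biphones(phonemes):
    """Return the adjacent phoneme pairs (biphones) in a transcription."""
    return list(zip(phonemes, phonemes[1:]))


def is_candidate_pseudoword(phonemes, lexicon):
    """True if the transcription is not a listed word and every one of its
    phoneme pairs occurs in at least one word of the lexicon."""
    words = {tuple(word) for word in lexicon}
    if tuple(phonemes) in words:
        return False  # already a real word, so not a pseudoword
    attested = {pair for word in words for pair in biphones(word)}
    return all(pair in attested for pair in biphones(phonemes))


# Toy lexicon (illustrative only): "cat", "bad", "kid"
toy_lexicon = [["K", "AE", "T"], ["B", "AE", "D"], ["K", "IH", "D"]]

print(is_candidate_pseudoword(["K", "AE", "D"], toy_lexicon))  # True: K-AE and AE-D are attested
print(is_candidate_pseudoword(["T", "AE", "K"], toy_lexicon))  # False: T-AE is not attested
```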

Each IPhOD entry contains an American English phonetic transcription from the Carnegie-Mellon Pronouncing Dictionary (Weide, 1994), and written word frequencies from the SUBTLEXus database (Brysbaert & New, 2009). Neighborhood density and word-averaged phoneme-sequence probabilities were derived from those data using the same formulas for words and pseudowords, so that entries of either type could be chosen using identical criteria. Phonotactic probability and neighborhood density are calculated broadly, over the entire word set, following the approach of Vitevitch and Luce (1999).
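As a rough illustration of neighborhood density, the sketch below counts the words in a lexicon that differ from a target by exactly one phoneme. The one-phoneme substitution, addition, or deletion rule is a common convention assumed here, not a definition taken from this page, and the function names and toy lexicon are hypothetical. The same function is applied to a word and to a pseudoword, mirroring the point that both entry types use identical formulas.

```python
def is_neighbor(a, b):
    """True if b differs from a by one phoneme substitution, addition, or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # Same length: exactly one position may differ (substitution).
        return sum(x != y for x, y in zip(a, b)) == 1
    shorter, longer = (a, b) if len(a) < len(b) else (b, a)
    # Lengths differ by one: removing one phoneme from the longer must give the shorter.
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


def neighborhood_density(target, lexicon):
    """Unweighted count of lexicon words that are neighbors of the target."""
    return sum(is_neighbor(target, word) for word in lexicon)


# Toy lexicon (illustrative only): cat, bat, cut, at, cats, dog
toy_lexicon = [
    ("K", "AE", "T"), ("B", "AE", "T"), ("K", "AH", "T"),
    ("AE", "T"), ("K", "AE", "T", "S"), ("D", "AO", "G"),
]

print(neighborhood_density(("K", "AE", "T"), toy_lexicon))  # 4: bat, cut, at, cats
print(neighborhood_density(("G", "AE", "T"), toy_lexicon))  # 3: cat, bat, at
```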

Phonotactic probabilities refer to the relative frequencies of the sound sequences that are present in a given word. The phonotactic measures in IPhOD extend definitions from Vitevitch and Luce (1999), elaborated upon below. The database was also calculated with stressed forms of vowels differentiated, which constrains the phonotactic probabilities, rather than ignoring stress differences in speech.
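The effect of stress differentiation can be sketched with a simplified, unweighted biphone probability. The formula here (occurrences of a phoneme pair divided by all phoneme-pair tokens, then averaged over the word) is an assumption for illustration and is not the IPhOD equation; the function names are hypothetical. Vowels carry CMU-style stress digits (0, 1, 2), and the stress marks in the toy lexicon are illustrative.

```python
from collections import Counter


def strip_stress(phoneme):
    """Drop a trailing CMU-style stress digit, e.g. 'ER1' -> 'ER'."""
    return phoneme.rstrip("012")


def biphone_probs(lexicon, use_stress=True):
    """Relative frequency of each adjacent phoneme pair across the lexicon."""
    counts = Counter()
    for word in lexicon:
        phones = word if use_stress else [strip_stress(p) for p in word]
        counts.update(zip(phones, phones[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def mean_biphone_prob(word, probs, use_stress=True):
    """Average biphone probability over one transcription."""
    phones = word if use_stress else [strip_stress(p) for p in word]
    pairs = list(zip(phones, phones[1:]))
    if not pairs:
        return 0.0
    return sum(probs.get(pair, 0.0) for pair in pairs) / len(pairs)


# Toy lexicon: two stress patterns of "permit" (noun-like and verb-like).
toy_lexicon = [
    ["P", "ER1", "M", "IH2", "T"],
    ["P", "ER0", "M", "IH1", "T"],
]

with_stress = biphone_probs(toy_lexicon, use_stress=True)
without_stress = biphone_probs(toy_lexicon, use_stress=False)

print(mean_biphone_prob(toy_lexicon[0], with_stress, use_stress=True))      # 0.125
print(mean_biphone_prob(toy_lexicon[0], without_stress, use_stress=False))  # 0.25
```

With stress kept, the two pronunciations share no phoneme pairs, so each pair is rarer; with stress ignored, the pairs collapse together and each becomes twice as probable.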

Accessing IPhOD: Downloads, Searches, Calculator

The IPhOD can be used in one of several ways. First, it may be useful to find words or pseudowords with values in a specific range. For example, what pseudowords have between 20 and 25 phonological neighbors? A second approach is to determine what the values are for specified words or pseudowords, for example: what are the word frequencies for cat, dog, tree, car? We have developed several ways to access the information in IPhOD, depending on your goal.

The IPhOD database can be downloaded in its entirety (text files) from the download page. These files can be opened using most available spreadsheet programs, or processed with custom Perl scripts. A second option is to search the database online, by entering value ranges or word lists to obtain results. Finally, there is an online calculator that produces phonotactic and density values for lists of phonemic transcriptions that are entered by the user. An advantage of the latter two approaches is that you can specify which output fields to include in results, and leave out columns that are not of interest. The online calculator is helpful for words or pseudowords that are not included in the IPhOD database.
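For readers who prefer scripting to a spreadsheet, a range query over a downloaded text file might look like the sketch below. The tab delimiter and the column names "Word" and "Density" are assumptions for illustration; check the header of the actual download before adapting it. An in-memory sample stands in for the file so that the example runs on its own.

```python
import csv
import io


def rows_in_range(handle, column, low, high, delimiter="\t"):
    """Return the rows whose numeric value in `column` lies in [low, high]."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [row for row in reader if low <= float(row[column]) <= high]


# Hypothetical sample with made-up entries and values, not real IPhOD data.
sample = io.StringIO("Word\tDensity\nkat\t22\nblif\t3\nmab\t25\n")

# Which entries have between 20 and 25 phonological neighbors?
hits = rows_in_range(sample, "Density", 20, 25)
print([row["Word"] for row in hits])  # ['kat', 'mab']
```

To run it against a real file, replace the sample with an open file handle, for example `open(path, newline="")`, where `path` points to the downloaded text file.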

*Please note that the online search and calculator are based on version 1.4 of IPhOD, which used Kucera-Francis (1967) word frequencies as the basis for frequency-weighted calculations. Version 2.0 instead used SUBTLEXus (Brysbaert & New, 2009). For more information on the differences between the earlier and current version of IPhOD, click here.

Credits

The database was developed by Kenny Vaden, advised by Greg Hickok (Cognitive Sciences, UC Irvine). We gratefully acknowledge the help of Jean-Claude Falmagne (Department of Cognitive Sciences, UC Irvine) in developing formal equations (not shown), Harry Halpin (Informatics, University of Edinburgh) for the original XML markup and online search functions, and Kai Okada (Cognitive Sciences, UC Irvine) for her extensive consultation and clarifications. Error-checking was assisted by undergraduate research assistants Yasmine Omidvar (Fall 2003 - Spring 2004) and Corica Rodgers (Summer 2004). Thank you.

References

Brysbaert, M. and New, B. 2009. Moving beyond Kucera and Francis: a critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41: 977-990.

Griffin, Zenzi M. and Bock, Kathryn. 1998. Constraint, Word Frequency, and the Relationship between Lexical Processing Levels in Spoken Word Production. Journal of Memory and Language, 38(3): 313-338.

Hulme, C.; Roodenrys, S.; Schweickert, R.; Brown, G.D.A.; Martin, S.; and Stuart, G. 1997. Word-frequency effects on short-term memory tasks. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23(5): 1217-1232.

Kucera, Henry and Francis, W. Nelson. 1967. Computational analysis of present-day American English. Providence, Brown University Press.

Majerus, Steve; Van der Linden, Martial; Mulder, Ludivine; Meulemans, Thierry; and Peters, Frederic 2004. Verbal short-term memory reflects the sublexical organization of the phonological language network. Journal of Memory and Language, 51(2):297-306.

Saffran, Jenny R.; Newport, Elissa L.; and Aslin, Richard N. 1996. Word Segmentation: The Role of Distributional Cues. Journal of Memory and Language, 35(4):606-621.

Vitevitch, Michael S.; Armbruster, Jonna; and Chu, Shinying. 2004. Sublexical and Lexical Representations in Speech Production: Effects of Phonotactic Probability and Onset Density. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30(2):514-529.

Vitevitch, Michael S.; Luce, Paul A.; Pisoni, David B.; and Auer, Edward T. 1999. Phonotactics, Neighborhood Activation, and Lexical Access for Spoken Words. Brain and Language, 68(1):306-311.

Vitevitch, Michael S. and Luce, Paul A. 2004. A Web-based interface to calculate phonotactic probability for words and nonwords in English. Behavior Research Methods, Instruments, & Computers 36 (3): 481-487.

Weide, Robert L. 1994. CMU Pronouncing Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.

Wilson, M. 1988. MRC Psycholinguistic Database: Machine Readable Dictionary, Version 2. Behavior Research Methods, Instruments, & Computers, 20(1): 6-11. http://www.psy.uwa.edu.au/mrcdatabase/uwa_mrc.htm

Publications Citing IPhOD:

Creel, S.C., Aslin, R.N., Tanenhaus, M.K. (2008). Heeding the voice of experience: the role of talker variation in lexical access. Cognition, 106, 633-664.

Desai, R., Binder, J.R., Medler, D.A., Conant, L.L., Seidenberg, M.S. (2006). Activation of sensory-motor areas by sentences. Fifth International Conference on Development and Learning, 2006.

Sabri, M., Binder, J.R., Desai, R., Medler, D.A., Leitl, M.D., Liebenthal, E. (2007). Attentional and linguistic interactions in speech perception. NeuroImage, 39(3), 1444-1456.

Vaden, K., Hickok, G. (2009). Sublexical and lexical processing in temporal and frontal lobes during word recognition. Society for Neuroscience, Chicago, IL.