Welcome to the Irvine Phonotactic Online Dictionary

The Irvine Phonotactic Online Dictionary (IPhOD) is a publicly available collection of 33,432 words and 815,066 pseudowords with corresponding Kucera-Francis frequencies (1967), CMU Pronouncing Dictionary transcriptions (Weide, 1994), phonological neighborhood density, positional probabilities, and second- and third-order phoneme-sequence probabilities. Below, we describe the motivation for the database, how the estimates are computed, and suggestions for use in computational psycholinguistics.

IPhOD was developed in the Laboratory for Cognitive Brain Research in the Department of Cognitive Sciences at UC Irvine by Kenny Vaden. Database conversion and the web interface are by Harry Halpin, in the School of Informatics at the University of Edinburgh. We thank Jean-Claude Falmagne (Dept. of Cognitive Sciences, UCI) for helping compose the formal probability equations.

March 17, 2004 Notes:
The webpage and database are in development; please report bugs in the interface to Harry Halpin.

Introduction to IPhOD

A growing body of computational psycholinguistic evidence indicates that we segment (Saffran et al., 1996), respond to (Vitevitch et al., 1999), produce (Vitevitch et al., 2004), and remember (Majerus et al., 2004) speech in ways that are affected by phonological frequency information. However, speech research is restricted by the limited number of pronunciation collections and utilities. The phonemic frequency distributions that are available often derive their estimates narrowly (for instance, from small word collections) or do not take stress variations into consideration where they may be useful. While some collections, notably CELEX, address some of these concerns, they use British stress and pronunciation, which is suboptimal for speech research with subjects who speak American English. The Phonotactic Probability Calculator (Vitevitch & Luce, 2004) uses American English pronunciations to compute position-specific phonotactic probabilities, but provides no density estimates, frequency weighting options, or stress considerations. Despite growing interest in phonotactic information, it remains unavailable or difficult to derive for novel hypotheses.

The current version (1.3) of the Irvine Phonotactic Online Dictionary (IPhOD) is a collection of phonotactic estimates calculated across a broad sample to enable precise selection of verbal stimuli for speech research and for applications in cognitive science, computational linguistics, and natural language processing. IPhOD contains phonotactic and density estimates, American English transcriptions of 2-17 phonemes, and word frequencies for 33,432 words and 815,066 pseudowords. Pseudowords are defined here as word-like transcriptions consisting entirely of phoneme pairs from real English words. Pseudowords like these are used in computational psycholinguistics to study non-semantic language processes, since they have little meaning or association but are consistent enough with a language to sound like typical words. The collection, along with searches by variable and range specification, is freely available online in the Tools section at http://lcbr.ss.uci.edu/ or http://www.iphod.com.
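The pseudoword definition above can be illustrated with a minimal Python sketch. It is not the procedure used to build IPhOD; it only checks the stated property that every adjacent phoneme pair in a candidate also occurs in some real word. The function names and the three-word toy lexicon are hypothetical, and stress digits are omitted for brevity.

```python
def phoneme_pairs(transcription):
    """Return the set of adjacent phoneme pairs in one transcription."""
    return set(zip(transcription, transcription[1:]))


def is_pseudoword_candidate(transcription, lexicon):
    """True if the transcription is not itself a real word and every
    adjacent phoneme pair in it occurs in at least one real word."""
    attested = set()
    for word in lexicon:
        attested |= phoneme_pairs(word)
    real_words = {tuple(word) for word in lexicon}
    is_real_word = tuple(transcription) in real_words
    return (not is_real_word) and phoneme_pairs(transcription) <= attested


# Hypothetical toy lexicon (ARPAbet-style symbols, stress digits omitted).
toy_lexicon = [
    ["K", "AE", "T"],  # cat
    ["B", "AE", "G"],  # bag
    ["T", "AE", "B"],  # tab
]

print(is_pseudoword_candidate(["K", "AE", "G"], toy_lexicon))  # True: K-AE and AE-G are attested
print(is_pseudoword_candidate(["G", "AE", "K"], toy_lexicon))  # False: G-AE is unattested
print(is_pseudoword_candidate(["K", "AE", "T"], toy_lexicon))  # False: already a real word
```

A real check would draw the attested pairs from the full word set rather than a toy list.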

Each IPhOD entry contains an American English phonetic transcription from the Carnegie Mellon Pronouncing Dictionary (Weide, 1994) and Kucera-Francis written word frequencies (1967) from the MRC Psycholinguistic Database (Wilson, 1988). Neighborhood density and word-averaged phoneme-sequence probabilities were derived from those data using the same formulas for words and pseudowords, so that entries of either type can be chosen using identical criteria. Phonotactic probability and neighborhood density are calculated broadly, over the entire word set, similar to Vitevitch and Luce (1999).
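As a rough illustration of how stressed and unstressed density estimates can differ, the sketch below counts phonological neighbors of a target transcription. The IPhOD formulas are not given in this section, so several things here are assumptions: a neighbor is taken to be any entry that differs by exactly one phoneme substitution, addition, or deletion; the frequency-weighted value is taken to be the plain sum of neighbor frequencies; and the lexicon entries and their frequencies are invented for the example.

```python
import re


def strip_stress(transcription):
    """Drop CMU stress digits (0, 1, 2) from phonemes, e.g. AE1 -> AE."""
    return tuple(re.sub(r"\d", "", phoneme) for phoneme in transcription)


def is_neighbor(a, b):
    """True if a and b differ by exactly one substitution, addition,
    or deletion (an assumed neighbor definition)."""
    if a == b:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def density(target, lexicon, stressed=True):
    """Return (neighbor count, summed neighbor frequency) for target.
    With stressed=False, vowel stress is ignored before comparing."""
    norm = tuple if stressed else strip_stress
    key = norm(target)
    freqs = [freq for trans, freq in lexicon if is_neighbor(key, norm(trans))]
    return len(freqs), sum(freqs)


# Hypothetical entries: (transcription, invented frequency).
toy_lexicon = [
    (("B", "AE1", "T"), 5),       # differs from the target in the onset only
    (("K", "AE1", "B"), 3),       # differs in the final consonant only
    (("K", "AE0", "P"), 1),       # differs in stress and in the final consonant
    (("K", "AE1", "T", "S"), 2),  # one added phoneme
]
target = ("K", "AE1", "T")

print(density(target, toy_lexicon, stressed=True))   # (3, 10)
print(density(target, toy_lexicon, stressed=False))  # (4, 11)
```

Under these assumptions the third entry counts as a neighbor only when stress is ignored, because with distinct stressed vowels it differs from the target in two positions.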


Estimates contained in IPhOD, by column:

Column     Heading       Description
1          Word          Orthographic form of word, or altered word that generated pseudoword
2          NPHON         Number of phonemes
3          NSYL          Number of syllables
4 ... 20   PH01...17     CMU Pronouncing Dictionary phonetic transcription (1, 2, 0 stress)
21         strDENS       Stressed phonological neighborhood density; distinct stressed-vowels
22         strFDEN       strDENS weighted with Kucera-Francis frequency of neighbors
23         strLDEN       strDENS weighted with Kucera-Francis log frequency of neighbors
24         unsDENS       Unstressed phonological neighborhood density; vowel-stress ignored
25         unsFDEN       unsDENS weighted with Kucera-Francis frequency of neighbors
26         unsLDEN       unsDENS weighted with Kucera-Francis log frequency of neighbors
27         strBPAV       Stressed biphoneme probability average; distinct stressed-vowels
28         strFBPAV      strBPAV weighted with Kucera-Francis word frequency
29         strLBPAV      strBPAV weighted with log Kucera-Francis word frequency
30         unsBPAV       Unstressed biphoneme probability average; vowel-stress ignored
31         unsFBPAV      unsBPAV weighted with Kucera-Francis word frequency
32         unsLBPAV      unsBPAV weighted with log Kucera-Francis word frequency
33         strTPAV       Stressed triphoneme probability average; distinct stressed-vowels
34         strFTPAV      strTPAV weighted with Kucera-Francis frequency
35         strLTPAV      strTPAV weighted with log Kucera-Francis frequency
36         unsTPAV       Unstressed triphoneme probability average; vowel-stress ignored
37         unsFTPAV      unsTPAV weighted with Kucera-Francis frequency
38         unsLTPAV      unsTPAV weighted with log Kucera-Francis frequency
39         strPOSPAV     Stressed positional probability average; distinct stressed-vowels
40         strFPOSPAV    strPOSPAV weighted with Kucera-Francis frequency
41         strLPOSPAV    strPOSPAV weighted with log Kucera-Francis frequency
42         unsPOSPAV     Unstressed positional probability average; vowel-stress ignored
43         unsFPOSPAV    unsPOSPAV weighted with Kucera-Francis frequency
44         unsLPOSPAV    unsPOSPAV weighted with log Kucera-Francis frequency
45         KFFREQ        Kucera-Francis Written Word Frequency for real words
46         LOGFRQ        log Kucera-Francis Written Word Frequency for real words
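
The column layout listed above, and the kind of range-based selection it supports, can be sketched in Python as follows. This is only an illustration: the zero-padded names PH01 through PH17 are an assumed reading of the "PH01...17" heading, the `select` helper is hypothetical rather than part of the IPhOD interface, and the three sample rows contain invented placeholder values, not database entries.

```python
def iphod_headings():
    """Build the 46 column headings in the order given in the column list."""
    cols = ["Word", "NPHON", "NSYL"]
    cols += [f"PH{i:02d}" for i in range(1, 18)]  # assumed names PH01 ... PH17
    for stress in ("str", "uns"):
        cols += [stress + name for name in ("DENS", "FDEN", "LDEN")]
    for measure in ("BPAV", "TPAV", "POSPAV"):
        for stress in ("str", "uns"):
            # "" = unweighted, "F" = frequency-weighted, "L" = log-frequency-weighted
            cols += [stress + weight + measure for weight in ("", "F", "L")]
    cols += ["KFFREQ", "LOGFRQ"]
    return cols


def select(rows, **ranges):
    """Keep rows whose named columns fall inside inclusive (low, high) ranges."""
    return [
        row for row in rows
        if all(low <= row[col] <= high for col, (low, high) in ranges.items())
    ]


headings = iphod_headings()
print(len(headings))   # 46
print(headings[20])    # strDENS (column 21, counting from 1)

# Invented placeholder rows, reduced to the columns used in the query.
rows = [
    {"Word": "item_a", "NPHON": 3, "unsDENS": 12},
    {"Word": "item_b", "NPHON": 5, "unsDENS": 2},
    {"Word": "item_c", "NPHON": 9, "unsDENS": 0},
]
chosen = select(rows, NPHON=(3, 5), unsDENS=(1, 20))
print([row["Word"] for row in chosen])  # ['item_a', 'item_b']
```

Because words and pseudowords share the same columns, the same range query applies to entries of either type.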

References

Griffin, Zenzi M. and Bock, Kathryn. 1998. Constraint, Word Frequency, and the Relationship between Lexical Processing Levels in Spoken Word Production. Journal of Memory and Language, 38(3): 313-338.

Hulme, C.; Roodenrys, S.; Schweickert, R.; Brown, G.D.A.; Martin, S.; and Stuart, G. 1997. Word-frequency effects on short-term memory tasks. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23(5): 1217–1232.

Kucera, Henry and Francis, W. Nelson. 1967. Computational analysis of present-day American English. Providence, Brown University Press.

Majerus, Steve; Van der Linden, Martial; Mulder, Ludivine; Meulemans, Thierry; and Peters, Frederic. 2004. Verbal short-term memory reflects the sublexical organization of the phonological language network. Journal of Memory and Language, 51(2): 297-306.

Saffran, Jenny R.; Newport, Elissa L.; and Aslin, Richard N. 1996. Word Segmentation: The Role of Distributional Cues. Journal of Memory and Language, 35(4):606–621.

Vitevitch, Michael S.; Armbrüster, Jonna; and Chu, Shinying. 2004. Sublexical and Lexical Representations in Speech Production: Effects of Phonotactic Probability and Onset Density. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30(2): 514-529.

Vitevitch, Michael S.; Luce, Paul A.; Pisoni, David B.; and Auer, Edward T. 1999. Phonotactics, Neighborhood Activation, and Lexical Access for Spoken Words. Brain and Language, 68(1):306-311.

Vitevitch, Michael S. and Luce, Paul A. 2004. A Web-based interface to calculate phonotactic probability for words and nonwords in English. Behavior Research Methods, Instruments, & Computers, 36(3): 481–487.

Weide, Robert L. 1994. CMU Pronouncing Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.

Wilson, M. 1988. The MRC Psycholinguistic Database: Machine Readable Dictionary, Version 2. Behavior Research Methods, Instruments, & Computers, 20(1): 6-11.