Last Webpage Update: July 2, 2009
Notes: The database is available here for download, in tab-delimited format. The online search is temporarily disabled. Please refer all questions to Kenny Vaden. (Use 'reload' to update website contents.)

Welcome to the Irvine Phonotactic Online Dictionary

This is a research tool that was developed at UC Irvine in 2003 for word and pseudoword selection, to control or manipulate sublexical or lexical phonological aspects of stimuli. The database is publicly available online for download, so other researchers may use it in their studies.

The Irvine Phonotactic Online Dictionary (IPhOD) is a collection of 33,432 words and 815,066 pseudowords with Kucera-Francis word frequencies (1967), CMU Pronouncing Dictionary transcriptions (Weide, 1994), and several values that we derived: phonological neighborhood density, positional probabilities, and second- and third-order phoneme-sequence probabilities. Below, we describe the motivation for the database, how the estimates were computed, and suggestions for their use in psycholinguistic experiments.

This database is no longer being developed, although we may post additional information on this webpage from time to time. There are also PERL scripts below for searching through the massive textfiles, which may be more useful than Excel or other spreadsheet programs. Please send questions, comments, criticisms, or praise of the database to Kenny Vaden.

Citing IPhOD. If you use the database in a thesis or published research, please cite it in the following way:

Vaden, K.I., Hickok, G.S., & Halpin, H.R. (2005). Irvine Phonotactic Online Dictionary. [Data file].
Available from http://www.iphod.com.

If you use the database in work that is published in a peer-reviewed journal, conference paper, or thesis, please send me your citation data. This helps justify future tool developments, and helps researchers understand what kind of research this database is used for. The list is maintained at the bottom of this page, below my bibliography.

Introduction

A growing body of computational psycholinguistic evidence indicates that we segment (Saffran et al., 1996), respond to (Vitevitch et al., 1999), produce (Vitevitch et al., 2004), and remember (Majerus et al., 2004) speech in ways that are affected by phonological frequency information. However, speech research is restricted by the limited number of pronunciation collections or utilities, and the available phonemic frequency distributions often derive their estimates narrowly (for instance, using small word collections), or do not take stress variations into consideration where they may be useful. While some collections, notably CELEX, address some of these concerns, they use British stress and pronunciation, which is suboptimal for speech research with American English speakers. The Phonotactic Probability Calculator (Vitevitch & Luce, 2004) uses American English pronunciations to compute position-specific phonotactic probabilities, but provides no density estimates, frequency weighting options, or stress considerations. Despite growing interest in phonotactic information, it remains unavailable or difficult to derive for novel hypotheses.

The current version (1.3) of the Irvine Phonotactic Online Dictionary (IPhOD) is a collection of phonotactic estimates calculated across a broad sample to enable precise verbal stimuli selection for speech research and application in cognitive science, computational linguistics, and natural language processing. IPhOD contains phonotactic and density estimates, American English transcriptions of 2-17 phonemes, and word frequencies for 33,432 words and 815,066 pseudowords. Pseudowords are defined here as word-like transcriptions consisting entirely of phoneme-pairs from real English words. Pseudowords like these are used in computational psycholinguistics to study non-semantic language processes, since they have little meaning or association but are consistent enough with a language to sound like typical words. The collection is freely available online below.
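The pseudoword definition above can be sketched in a few lines of Python. This is only an illustration of the stated rule (every adjacent phoneme pair must occur in some real word); it is not the code that generated IPhOD, and the toy lexicon and function names are assumptions made for the example.

```python
def biphones(phonemes):
    """Return the list of adjacent phoneme pairs in a transcription."""
    return list(zip(phonemes, phonemes[1:]))


def attested_biphones(lexicon):
    """Collect every phoneme pair that occurs in at least one real word."""
    pairs = set()
    for phonemes in lexicon.values():
        pairs.update(biphones(phonemes))
    return pairs


def is_pseudoword(candidate, lexicon):
    """True if candidate is not a real word and all of its pairs are attested."""
    if candidate in lexicon.values():
        return False
    legal = attested_biphones(lexicon)
    return all(pair in legal for pair in biphones(candidate))


# Toy lexicon in CMU-style notation (a handful of entries, not IPhOD data).
TOY_LEXICON = {
    "cat": ["K", "AE1", "T"],
    "bat": ["B", "AE1", "T"],
    "cab": ["K", "AE1", "B"],
    "tab": ["T", "AE1", "B"],
}

if __name__ == "__main__":
    # B-AE1 occurs in "bat" and AE1-B occurs in "cab", so this passes.
    print(is_pseudoword(["B", "AE1", "B"], TOY_LEXICON))
    # AE1-K never occurs in the toy lexicon, so this fails.
    print(is_pseudoword(["T", "AE1", "K"], TOY_LEXICON))
```

With a full pronouncing dictionary in place of the toy lexicon, the same check would reject any candidate that contains a phoneme pair never heard in a real English word.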

Each IPhOD entry contains an American English phonetic transcription from the Carnegie-Mellon Pronouncing Dictionary (Weide, 1994), and Kucera-Francis written word frequencies (1967) from the MRC Psycholinguistic Dictionary (Wilson, 1988). Neighborhood density and word-averaged phoneme-sequence probabilities were extrapolated from those data using the same formulas for words and pseudowords, so that entries of either type could be chosen using identical criteria. The phonotactic probability and neighborhood density estimates are calculated broadly, over the entire word set, similar to Vitevitch and Luce (1999).
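As a rough illustration of neighborhood density, the Python sketch below counts the words that differ from a target by one phoneme. It assumes the commonly used definition of a neighbor (one substitution, addition, or deletion), which this page does not spell out, so it may not match the IPhOD formula exactly. The `ignore_stress` switch mimics the stressed versus unstressed variants by removing the CMU stress digits from vowels; the toy lexicon, including its last entry, is hypothetical.

```python
import re


def strip_stress(phonemes):
    """Drop CMU stress digits, so 'AE1' and 'AE0' both become 'AE'."""
    return [re.sub(r"\d", "", p) for p in phonemes]


def one_edit_apart(a, b):
    """True if a and b differ by exactly one substitution, addition, or deletion."""
    if a == b:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    shorter, longer = (a, b) if len(a) < len(b) else (b, a)
    # Deleting one phoneme from the longer form must give the shorter form.
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


def density(target, lexicon, ignore_stress=False):
    """Count lexicon entries that are one phoneme away from the target."""
    prepare = strip_stress if ignore_stress else list
    prepared_target = prepare(target)
    return sum(one_edit_apart(prepared_target, prepare(word)) for word in lexicon)


# Toy lexicon; the final entry is a made-up form with an unstressed vowel.
TOY_LEXICON = [
    ["B", "AE1", "T"],
    ["K", "AE1", "B"],
    ["AE1", "T"],
    ["K", "AE1", "T", "S"],
    ["D", "AO1", "G"],
    ["K", "AE0", "P"],
]

if __name__ == "__main__":
    cat = ["K", "AE1", "T"]
    print(density(cat, TOY_LEXICON))
    print(density(cat, TOY_LEXICON, ignore_stress=True))
```

The made-up last entry is a neighbor only when stress is ignored, because with stress digits kept it differs from the target in two phonemes (AE1 versus AE0, and T versus P).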

Phonotactic probabilities refer to the likelihood that a given sequence of sounds occurs within a word. The phonotactic measures in IPhOD extend the definitions from Vitevitch and Luce (1999), as elaborated below. Each measure was calculated twice: once treating the stressed forms of a vowel as distinct phonemes, and once ignoring stress differences.
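To make the idea of a word-averaged phoneme-sequence probability concrete, here is a minimal Python sketch. It assumes a simple unweighted estimate (each sequence's count divided by the total number of sequences in the lexicon, then averaged over the sequences in a word); the formal IPhOD equations are not shown on this page, so treat this as an approximation of the concept only. The three-word lexicon and all function names are assumptions.

```python
import re
from collections import Counter


def strip_stress(phonemes):
    """Drop CMU stress digits, so 'AE1' and 'AE0' both become 'AE'."""
    return [re.sub(r"\d", "", p) for p in phonemes]


def ngrams(phonemes, n):
    """Return all runs of n adjacent phonemes as tuples."""
    return list(zip(*(phonemes[i:] for i in range(n))))


def ngram_probabilities(lexicon, n=2, ignore_stress=False):
    """Estimate the probability of each n-phoneme sequence across the lexicon."""
    counts = Counter()
    for phonemes in lexicon:
        seq = strip_stress(phonemes) if ignore_stress else phonemes
        counts.update(ngrams(seq, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def average_probability(phonemes, probs, n=2, ignore_stress=False):
    """Average the probabilities of the n-phoneme sequences in one item."""
    seq = strip_stress(phonemes) if ignore_stress else phonemes
    grams = ngrams(seq, n)
    if not grams:
        return 0.0
    return sum(probs.get(gram, 0.0) for gram in grams) / len(grams)


# Toy lexicon; the last entry is a made-up form with an unstressed vowel.
TOY_LEXICON = [
    ["K", "AE1", "T"],
    ["B", "AE1", "T"],
    ["K", "AE0", "T"],
]

if __name__ == "__main__":
    cat = ["K", "AE1", "T"]
    stressed = ngram_probabilities(TOY_LEXICON, n=2)
    unstressed = ngram_probabilities(TOY_LEXICON, n=2, ignore_stress=True)
    print(average_probability(cat, stressed))
    print(average_probability(cat, unstressed, ignore_stress=True))
```

Setting `n=2` gives a biphoneme average and `n=3` a triphoneme average. Ignoring stress pools the counts for AE1 and AE0, which is why the two averages differ for the same item.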

The current instantiation of IPhOD, Version 1.3, was developed and error-checked during Winter 2005. There are other frequency-based or information-theoretic measures that might be added to the next version. Suggestions relevant to the current contents of IPhOD are welcome!

Currently, IPhOD can be searched by downloading the tab-delimited textfiles and searching them manually, with spreadsheet software, or with PERL scripts. Using Excel (or freeware spinoffs) will work reasonably well in most cases, for sorting words or pseudowords, and producing scatterplots or other graphs to display the distributions of values for your stimuli.

As of February 2009, the SQL versions and web-based search interface at IPhOD.com (and iphod.com itself) are no longer functional. I apologize for any inconvenience this may cause. They may return at a future date.

Download IPhOD

The IPhOD is freely available for research purposes. Here are links for the archive files:
1. ZIP archive containing textfiles for all IPHOD WORDS (updated Feb 8, 2005).
2. ZIP archive containing textfiles for all IPHOD PSEUDOWORDS (updated Feb 8, 2005).
3. Release Notes, February 8, 2005 (included in the ZIP archive).
4. CMU Pronunciation Key, with (my) IPA Glyphs PDF (updated Feb 20, 2009).

Please contact me if you'd like to ask about the database, or give me suggestions. Note that IPhOD is an academic work, and as such if this program is used in an academic work, such as a journal or conference proceedings, we would like to be explicitly cited (reference to the IPhOD, datafile available from iphod.com) and to be notified so that we may know who is using this data. Please contact us for further details if there are any questions. IPhOD is free software, copyrighted by Kenny Vaden and distributed under the GPL.

Organization

The database is now divided into two major categories: Real Words, and Pseudowords; there is a single archive that can be downloaded for each. The archive for Real Words contains a single tab-delimited textfile that lists all IPhOD words and their values, row by row. Since there are so many pseudowords (815,066), these were organized into 16 textfiles, all included in the Pseudoword archive. Each pseudoword textfile is organized by the number of phonemes, so file #2 contains two-phoneme long pseudowords, file #3 contains three-phoneme long pseudowords, etc., up to file #17.

Nearly Identical File Structure ... with TWO key differences.

The database contains identically organized columns in the word and pseudoword textfiles, with TWO exceptions. First, in the Word collection, the last two columns show Kucera-Francis frequencies, while pseudowords have neither value. The second difference is that the first column of the pseudoword file shows the *word that was changed to produce the pseudoword*. The pseudoword files seem confusing at first, since many people read the "word" column entry and overlook the differing transcription in the phoneme columns, which is really the pseudoword, as it is pronounced. Each pseudoword was generated by changing one phoneme from a real word, so it helps to see what that word was when you're going to try to pronounce it correctly. For example, "Fox" might show up as the pseudoword "word" entry - but reading the transcription columns tells you "F AH Z", so it is pronounced "Foz".

Summary of Columns and Contents

Each file of the database contains columns 1-44, and the word portion of the database also contains Kucera-Francis word frequencies in columns 45 and 46. The values can be used for selecting stimuli, or to look up values for specified words.

All Values by Column Number and Title:

Column    Heading      Description
1         Word         Orthographic form of word, or altered word that generated pseudoword
2         NPHON        Number of phonemes
3         NSYL         Number of syllables
4 ... 20  PH01...17    CMU Pronunciation Dictionary phonetic transcription (1, 2, 0 stress)
21        strDENS      Stressed phonological neighborhood density; distinct stressed-vowels
22        strFDEN      strDENS weighted with Kucera-Francis frequency of neighbors
23        strLDEN      strDENS weighted with Kucera-Francis log frequency of neighbors
24        unsDENS      Unstressed phonological neighborhood density; vowel-stress ignored
25        unsFDEN      unsDENS weighted with Kucera-Francis frequency of neighbors
26        unsLDEN      unsDENS weighted with Kucera-Francis log frequency of neighbors
27        strBPAV      Stressed biphoneme probability average; distinct stressed-vowels
28        strFBPAV     strBPAV weighted with Kucera-Francis word frequency
29        strLBPAV     strBPAV weighted with log Kucera-Francis word frequency
30        unsBPAV      Unstressed biphoneme probability average; vowel-stress ignored
31        unsFBPAV     unsBPAV weighted with Kucera-Francis word frequency
32        unsLBPAV     unsBPAV weighted with log Kucera-Francis word frequency
33        strTPAV      Stressed triphoneme probability average; distinct stressed-vowels
34        strFTPAV     strTPAV weighted with Kucera-Francis frequency
35        strLTPAV     strTPAV weighted with log Kucera-Francis frequency
36        unsTPAV      Unstressed triphoneme probability average; vowel-stress ignored
37        unsFTPAV     unsTPAV weighted with Kucera-Francis frequency
38        unsLTPAV     unsTPAV weighted with log Kucera-Francis frequency
39        strPOSPAV    Stressed positional probability average; distinct stressed-vowels
40        strFPOSPAV   strPOSPAV weighted with Kucera-Francis frequency
41        strLPOSPAV   strPOSPAV weighted with log Kucera-Francis frequency
42        unsPOSPAV    Unstressed positional probability; vowel-stress ignored
43        unsFPOSPAV   unsPOSPAV weighted with Kucera-Francis frequency
44        unsLPOSPAV   unsPOSPAV weighted with log Kucera-Francis frequency
45        KFFREQ       Kucera-Francis Written Word Frequency for real words
46        LOGFRQ       log Kucera-Francis Written Word Frequency for real words
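
For readers who would rather script than use a spreadsheet, the following Python sketch shows one way to read a tab-delimited IPhOD file and rebuild an item's pronunciation from its phoneme columns. It assumes that the header row spells the phoneme columns PH01 through PH17 and that unused phoneme columns are left blank; check both against the downloaded files. The sample row is built in memory from the "Fox" / "F AH Z" pseudoword example on this page, with only a few of the columns included.

```python
import csv
import io

# Assumed header labels for the 17 phoneme columns (PH01 ... PH17).
PHONEME_COLUMNS = [f"PH{i:02d}" for i in range(1, 18)]


def read_iphod(handle):
    """Yield one dict per data row of a tab-delimited file with a header row."""
    yield from csv.DictReader(handle, delimiter="\t")


def transcription(row):
    """Join the non-empty phoneme columns into a space-separated string."""
    return " ".join(row[col] for col in PHONEME_COLUMNS if row.get(col))


# In-memory stand-in for a pseudoword file, reduced to columns 1-20.
HEADER = ["Word", "NPHON", "NSYL"] + PHONEME_COLUMNS
ROW = ["Fox", "3", "1", "F", "AH", "Z"] + [""] * 14
SAMPLE = "\t".join(HEADER) + "\n" + "\t".join(ROW) + "\n"

if __name__ == "__main__":
    for entry in read_iphod(io.StringIO(SAMPLE)):
        # The Word column holds the source word; the phoneme columns hold
        # the pseudoword as it is actually pronounced.
        print(entry["Word"], "->", transcription(entry))
```

To read a real file, pass an open file handle in place of the `io.StringIO` object, for example `open(path, newline="")` with whatever path the unpacked textfile has.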

Using PERL to search IPhOD (Files will be added soon, around 2/28/2009)

In many cases, using PERL scripts may allow you to search more elegantly and powerfully than using a spreadsheet program like Excel, but at a cost of programming time - decide wisely. If you have some programming background, you can modify these scripts to create new search functions better suited to your research questions. For example, I modified this script to search for CVC items only, or CVC words which share the CV-onset. I am interested in seeing your code if you modify mine to improve it. The instructions below are based on my development OS (Windows), and assume you have downloaded and unpacked the word or pseudoword contents (above).

1. If you do not have PERL on your PC yet, then install it. Active State PERL (Windows, free). I wrote these scripts on a Windows machine, but some slight changes allow them to run beautifully in Linux, which includes PERL by default in most cases. Mac OS may also include PERL without any installation, but I don't know.

2. Download the PERL search script and search textfile (archived): IPhod_Search.ZIP
.... ZIP archive containing search script and query file (archive updated Mar 17, 2009).

3. Unzip contents of IPhod_Search.ZIP into the directory containing either word OR pseudoword textfiles.

4. Edit SEARCH_VALS.TXT using a text editor or spreadsheet program. Column #1 shows a value label that corresponds to the header row of the word or pseudoword files. Column #2 gives the minimum allowed value, and Column #3 is the maximum. If you do not specify a value (blank field), then that variable is ignored when filtering the results.

5. Execute the IPHOD_SEARCH.PL script. Using the DOS prompt or command window, navigate to the directory containing all the search files, including iphod_search.pl and the files you are searching, then type "iphod_search.pl". The search output (Output.txt) should contain only words or pseudowords within the value ranges specified in step 4, above. The layout of search_vals.txt and the command line, using a real example, are shown in the figure below.
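
The range filtering described in steps 4 and 5 can also be sketched in Python. This is not a port of iphod_search.pl. It assumes that the query file has three tab-separated fields per line (label, minimum, maximum) with no header row, that bounds are inclusive, and that a blank minimum or maximum leaves that side unbounded. The function names and sample values are made up for illustration.

```python
import csv
import io


def read_ranges(handle):
    """Parse 'label<TAB>min<TAB>max' lines into {label: (min, max)}."""
    ranges = {}
    for fields in csv.reader(handle, delimiter="\t"):
        fields += [""] * (3 - len(fields))  # pad short or empty lines
        label, low, high = (field.strip() for field in fields[:3])
        if not label or (not low and not high):
            continue  # nothing specified, so this variable is ignored
        ranges[label] = (
            float(low) if low else None,
            float(high) if high else None,
        )
    return ranges


def in_range(row, ranges):
    """True if every constrained column of row lies within its bounds."""
    for label, (low, high) in ranges.items():
        value = float(row[label])
        if low is not None and value < low:
            return False
        if high is not None and value > high:
            return False
    return True


def search(rows, ranges):
    """Keep only the rows that satisfy every range."""
    return [row for row in rows if in_range(row, ranges)]


# Made-up query and rows: three-phoneme items with unsDENS of at least 10.
QUERY = "NPHON\t3\t3\nunsDENS\t10\t\nNSYL\t\t\n"
ROWS = [
    {"Word": "a", "NPHON": "3", "NSYL": "1", "unsDENS": "12"},
    {"Word": "b", "NPHON": "4", "NSYL": "1", "unsDENS": "20"},
    {"Word": "c", "NPHON": "3", "NSYL": "1", "unsDENS": "5"},
]

if __name__ == "__main__":
    ranges = read_ranges(io.StringIO(QUERY))
    for hit in search(ROWS, ranges):
        print(hit["Word"])
```

The rows can come from any reader that returns one dictionary per line keyed by the header labels, such as `csv.DictReader` with `delimiter="\t"`.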

Using PERL to Calculate New Values

If a word or pseudoword isn't in the IPhOD, there is another PERL script I wrote to calculate new density values and phonotactic probabilities the same way that they were originally done, for a list of items in CMU transcription format. This is advanced IPhOD usage only - so contact me with a list, or to obtain those additional PERL files.

Credits

The database was developed by UCI PhD Psychology LCBR graduate student, Kenny Vaden, advised by Greg Hickok. We gratefully acknowledge the help of Jean-Claude Falmagne (Department of Cognitive Sciences, UCI) in developing formal equations (not shown). XML markup and search GUI (no longer available) by Harry Halpin, PhD Informatics graduate student at University of Edinburgh. Kai Okado, another LCBR graduate student, was consulted for clarifications. Error-checking was assisted by undergraduate research assistants: Yasmine Omidvar (Fall 2003 - Spring 2004) and Corica Rodgers (Summer 2004).

References

Griffin, Zenzi M. and Bock, Kathryn. 1998. Constraint, Word Frequency, and the Relationship between Lexical Processing Levels in Spoken Word Production. Journal of Memory and Language, 38(3): 313-338.

Hulme, C.; Roodenrys, S.; Schweickert, R.; Brown, G.D.A.; Martin, S.; and Stuart, G. 1997. Word-frequency effects on short-term memory tasks. Journal of Experimental Psychology - Learning, Memory and Cognition, 23(5): 1217–1232.

Kucera, Henry and Francis, W. Nelson. 1967. Computational analysis of present-day American English. Providence, Brown University Press.

Majerus, Steve; Van der Linden, Martial; Mulder, Ludivine; Meulemans, Thierry; and Peters, Frederic. 2004. Verbal short-term memory reflects the sublexical organization of the phonological language network. Journal of Memory and Language, 51(2):297-306.

Saffran, Jenny R.; Newport, Elissa L.; and Aslin, Richard N. 1996. Word Segmentation: The Role of Distributional Cues. Journal of Memory and Language, 35(4):606–621.

Vitevitch, Michael S.; Armbrüster, Jonna; and Chu, Shinying. 2004. Sublexical and Lexical Representations in Speech Production: Effects of Phonotactic Probability and Onset Density. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30(2):514-529.

Vitevitch, Michael S.; Luce, Paul A.; Pisoni, David B.; and Auer, Edward T. 1999. Phonotactics, Neighborhood Activation, and Lexical Access for Spoken Words. Brain and Language, 68(1):306-311.

Vitevitch, Michael S. and Luce, Paul A. 2004. A Web-based interface to calculate phonotactic probability for words and nonwords in English. Behavior Research Methods, Instruments, & Computers 36 (3): 481–487.

Weide, Robert L. 1994. CMU Pronouncing Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.

Wilson, M. 1988. MRC Psycholinguistic Database: Machine Readable Dictionary, Version 2. Behavioural Research Methods, Instruments & Computers, 20(1), 6-11. http://www.psy.uwa.edu.au/mrcdatabase/uwa_mrc.htm

Published Studies Using IPhOD:

Creel, S.C., Aslin, R.N., Tanenhaus, M.K. (2008). Heeding the voice of experience: the role of talker variation in lexical access. Cognition, 106, 633-664.

Sabri, M., Binder, J.R., Desai, R., Medler, D.A., Leitl, M.D., Liebenthal, E. (2007). Attentional and linguistic interactions in speech perception. NeuroImage, 39(3), 1444-1456.

Desai, R. Binder, J.R., Medler, D.A., Conant, L.L., Seidenberg, M.S. (2006). Activation of sensory-motor areas by sentences. Fifth International Conference on Development and Learning, 2006.