Last Webpage Update: July 2, 2009
Notes: Database is available here for download, in tab-delimited format. The online search is temporarily disabled.
Please refer all questions to Kenny Vaden.
Welcome to the Irvine Phonotactic Online Dictionary
This is a research tool that was developed at UC Irvine in 2003 for selecting words and pseudowords while controlling or
manipulating sublexical or lexical phonological aspects of the stimuli. The database is publicly available online for download, so
other researchers may use it in their studies.
The Irvine Phonotactic Online Dictionary (IPhOD) is a collection of 33,432 words and 815,066 pseudowords with
Kucera-Francis word frequencies (1967), CMU Pronouncing Dictionary transcriptions (Weide, 1994), and several values that we derived:
phonological neighborhood density, positional probabilities, and second- and third-order phoneme-sequence probabilities. Below, we
describe the motivation for the database, the computation of the estimates, and suggestions for their use in psycholinguistic experiments.
This database is no longer being developed, although we may post additional information on this webpage from time
to time. PERL scripts are also provided below for searching through the very large textfiles, as an alternative to Excel or
other spreadsheet programs. Please send questions, comments, criticisms, or praise of the database to Kenny Vaden.
Citing IPhOD. If you use the database in a thesis or in published research, please cite it in the following way:
Vaden, K.I., Hickok, G.S., & Halpin, H.R. (2005). Irvine Phonotactic Online Dictionary. [Data file].
Available from http://www.iphod.com.
If you use the database in a peer-reviewed journal article, conference paper, or thesis, please send me your
citation data. This helps justify future tool development, and helps researchers understand what kinds of research this database is used for.
The list is maintained at the bottom of this page, below the references.
Introduction
A growing body of computational psycholinguistic evidence indicates that we segment (Saffran et al., 1996),
respond to (Vitevitch et al., 1999), produce (Vitevitch et al., 2004), and remember (Majerus et al., 2004) speech in ways that are
affected by phonological frequency information. However, speech research is restricted by the limited number of pronunciation
collections or utilities, and the available phonemic frequency distributions often derive their estimates narrowly (for instance, using
small word collections), or do not take stress variations into consideration where they may be useful. While some collections,
notably CELEX, address some of these concerns, they use British stress and pronunciation, which is suboptimal for speech research
with subjects who speak American English. The Phonotactic Probability
Calculator (Vitevitch & Luce, 2004) uses American English pronunciations to compute position-specific phonotactic probabilities,
but provides no density estimates, frequency weighting options, or stress considerations. Despite growing interest in phonotactic
information, it remains unavailable or difficult to derive for novel hypotheses.
The current version (1.3) of the Irvine Phonotactic Online Dictionary (IPhOD) is a collection of phonotactic
estimates calculated across a broad sample to enable precise verbal stimuli selection for speech research and application in
cognitive science, computational linguistics, and natural language processing. IPhOD contains phonotactic and density estimates,
American English transcriptions of 2-17 phonemes, and word frequencies for 33,432 words and 815,066 pseudowords. Pseudowords are
defined here as word-like transcriptions consisting entirely of phoneme-pairs from real English words. Pseudowords like these are
used in computational psycholinguistics to study non-semantic language processes, since they have little meaning or association but
are consistent enough with a language to sound like typical words. The collection is freely available online below.
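The pseudoword definition above can be illustrated with a minimal Python sketch. It is not the procedure used to build IPhOD: the toy lexicon, the function names, and the choice to generate candidates by single-phoneme substitution are assumptions made for illustration, and stress digits are left out for brevity. A candidate is kept only if it is not itself a word and every adjacent phoneme pair in it also occurs in some real word.

```python
def biphones(phonemes):
    """Return the adjacent phoneme pairs in a transcription."""
    return list(zip(phonemes, phonemes[1:]))


def one_phoneme_pseudowords(word, lexicon, inventory):
    """Substitute one phoneme at a time; keep non-words built only from attested pairs."""
    attested = {pair for entry in lexicon for pair in biphones(entry)}
    real_words = set(lexicon)
    results = set()
    for i in range(len(word)):
        for phoneme in inventory:
            if phoneme == word[i]:
                continue
            candidate = word[:i] + (phoneme,) + word[i + 1:]
            if candidate in real_words:
                continue  # a real word is not a pseudoword
            if all(pair in attested for pair in biphones(candidate)):
                results.add(candidate)
    return sorted(results)


# Hypothetical toy lexicon (fun, buzz, fuss), not taken from the IPhOD files.
TOY_LEXICON = [("F", "AH", "N"), ("B", "AH", "Z"), ("F", "AH", "S")]
INVENTORY = sorted({p for entry in TOY_LEXICON for p in entry})

pseudowords = one_phoneme_pseudowords(("F", "AH", "N"), TOY_LEXICON, INVENTORY)
for item in pseudowords:
    print(" ".join(item))
```

With this toy lexicon, "F AH Z" is accepted because both of its phoneme pairs occur in real entries, while "F AH S" is rejected because it is already a word.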
Each IPhOD entry contains an American English phonetic transcription from the Carnegie-Mellon Pronouncing Dictionary
(Weide, 1994), and Kucera-Francis written word frequencies (1967) from the MRC Psycholinguistic Database (Wilson, 1988). Neighborhood
density and word-averaged phoneme-sequence probabilities were derived from those data using the same formulas for words and
pseudowords, so that entries of either type could be chosen using identical criteria. Phonotactic probability and neighborhood density
in IPhOD are calculated broadly, over the entire word set, similar to Vitevitch and Luce (1999).
Phonotactic probabilities refer to the likelihood that the sounds present in a given word occur together in sequence.
The phonotactic measures in IPhOD extend the definitions from Vitevitch and Luce (1999), as elaborated below. The
database was also calculated with stress in mind: vowels with different stress can be treated as distinct when constraining the
probabilities, as opposed to ignoring stress differences in speech.
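To make the stressed versus unstressed distinction concrete, here is a small Python sketch of a neighborhood density count. The formal IPhOD equations are not shown on this page, so the sketch assumes the commonly used definition of a phonological neighbor (one phoneme substituted, added, or deleted), and the transcriptions are toy examples rather than dictionary lookups. In the stressed variant, vowels with different CMU stress digits count as different phonemes; in the unstressed variant the digits are stripped first.

```python
def strip_stress(phonemes):
    """Remove the CMU stress digits (0, 1, 2) that follow vowels."""
    return tuple(p.rstrip("012") for p in phonemes)


def is_neighbor(a, b):
    """True if a and b differ by one phoneme substitution, addition, or deletion."""
    if a == b:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


def density(target, lexicon, stressed=True):
    """Count the distinct lexicon entries that are neighbors of target."""
    target = tuple(target)
    entries = {tuple(w) for w in lexicon}
    if not stressed:
        target = strip_stress(target)
        entries = {strip_stress(w) for w in entries}
    return sum(is_neighbor(target, w) for w in entries)


# Toy transcriptions for illustration only (hypothetical, not IPhOD entries).
TOY_LEXICON = [
    ("K", "AE1", "T"),
    ("B", "AE1", "T", "S"),
    ("B", "AE2", "D"),
]
TARGET = ("B", "AE1", "T")

stressed_density = density(TARGET, TOY_LEXICON, stressed=True)
unstressed_density = density(TARGET, TOY_LEXICON, stressed=False)
print(stressed_density, unstressed_density)
```

In this toy case the third entry only becomes a neighbor once stress is ignored, which is the kind of difference the stressed and unstressed estimates are meant to capture.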
The current instantiation of IPhOD, Version 1.3, was developed and error-checked during Winter 2005.
There are other frequency-based or information-theoretic measures that might be added to the next version.
Suggestions relevant to the current contents of IPhOD are welcome!
Currently, IPhOD is distributed as tab-delimited textfiles that can be downloaded and searched manually,
with spreadsheet software, or with PERL scripts. Excel (or freeware alternatives)
will work reasonably well in most cases for sorting words or pseudowords, and for producing scatterplots or other graphs
to display the distributions of values for your stimuli.
As of February 2009, the SQL versions and web-based search interface at IPhOD.com (and iphod.com itself) are no longer
functional. I apologize for any inconvenience this may cause. They may return at a future date.
Download IPhOD
The IPhOD is freely available for research purposes. Here are links for the archive files:
1. ZIP archive containing textfiles for all IPHOD WORDS (updated Feb 8, 2005).
2. ZIP archive containing textfiles for all IPHOD PSEUDOWORDS (updated Feb 8, 2005).
3. Release Notes, February 8, 2005 (included in the ZIP archive).
4. CMU Pronunciation Key, with (my) IPA Glyphs PDF (updated Feb 20, 2009).
Please contact me if you'd like to ask about the database or offer suggestions. Note that IPhOD is
an academic work; if it is used in an academic work, such as a journal article or conference proceedings, we would like
to be explicitly cited (reference to the IPhOD, datafile available from iphod.com) and to be notified, so that we know who is
using these data. Please contact us for further details if there are any questions. IPhOD is free software, copyrighted by Kenny
Vaden and distributed under the GPL.
Organization
The database is now divided into two major categories: Real Words, and Pseudowords; there is a single archive that
can be downloaded for each. The archive for Real Words contains a single tab-delimited textfile that lists all IPhOD words and
their values, row by row. Since there are so many pseudowords (815,066), these were organized into 16 textfiles, all included in the
Pseudoword archive. Each pseudoword textfile is organized by the number of phonemes, so file #2 contains two-phoneme long pseudowords,
file #3 contains three-phoneme long pseudowords, etc., up to file #17.
Nearly Identical File Structure ... with TWO key differences.
The database contains identically organized columns in the word and pseudoword textfiles, with TWO exceptions. First, in the Word
collection, the last two columns show Kucera-Francis frequencies, while pseudowords have neither value. The second difference is that
the first column of the pseudoword file shows the *word that was changed to produce the pseudoword*. The pseudoword files can seem confusing at first,
since many people read the "word" column entry and overlook the transcription columns, which hold the actual pseudoword as it is pronounced.
Each pseudoword was generated by changing one phoneme from a real word, so it helps to see what that word was when you try to pronounce
the pseudoword correctly. For example, "Fox" might show up as the pseudoword "word" entry, but reading the transcription columns tells you "F AH Z", so it is
pronounced "Foz".
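If you read the pseudoword files with a script, the pronunciation can be rebuilt from the transcription columns instead of relying on the first column. The Python sketch below assumes that each file has a header row whose labels match the headings used on this page (Word, NPHON, NSYL, PH01 through PH17) and that unused phoneme columns are blank; the single data row is a made-up stand-in for a real file.

```python
import csv
import io

# Assumed header labels for the 17 transcription columns: PH01 ... PH17.
PH_COLUMNS = [f"PH{i:02d}" for i in range(1, 18)]


def transcription(row):
    """Join the non-empty phoneme columns of one row into a single string."""
    phonemes = [(row.get(col) or "").strip() for col in PH_COLUMNS]
    return " ".join(p for p in phonemes if p)


def read_entries(handle):
    """Return (source word, pseudoword transcription) pairs from a tab-delimited file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row["Word"], transcription(row)) for row in reader]


# In-memory stand-in for a pseudoword file (hypothetical contents).
header = ["Word", "NPHON", "NSYL"] + PH_COLUMNS
sample_row = ["FOX", "3", "1", "F", "AH", "Z"] + [""] * 14
sample = "\t".join(header) + "\n" + "\t".join(sample_row) + "\n"

entries = read_entries(io.StringIO(sample))
for source_word, pseudoword in entries:
    print(f"{source_word} -> {pseudoword}")
```

To use this on a downloaded file, pass an open file handle (for example `open(path, newline="")`) to `read_entries` in place of the in-memory sample.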
Summary of Columns and Contents
Each file of the database contains columns 1-44, and the portion of the database for words also contains Kucera-Francis word
frequencies in columns 45 and 46. The values can be used for selecting stimuli, or for looking up values for specific words.
All Values by Column Number and Title:
Column | Heading | Description |
1 | Word | Orthographic form of word, or altered word that generated pseudoword |
2 | NPHON | Number of phonemes |
3 | NSYL | Number of syllables |
4 ... 20 | PH01...17 | CMU Pronouncing Dictionary phonetic transcription (1, 2, 0 stress) |
21 | strDENS | Stressed phonological neighborhood density; distinct stressed-vowels |
22 | strFDEN | strDENS weighted with Kucera-Francis frequency of neighbors |
23 | strLDEN | strDENS weighted with Kucera-Francis log frequency of neighbors |
24 | unsDENS | Unstressed phonological neighborhood density; vowel-stress ignored |
25 | unsFDEN | unsDENS weighted with Kucera-Francis frequency of neighbors |
26 | unsLDEN | unsDENS weighted with Kucera-Francis log frequency of neighbors |
27 | strBPAV | Stressed biphoneme probability average; distinct stressed-vowels |
28 | strFBPAV | strBPAV weighted with Kucera-Francis word frequency |
29 | strLBPAV | strBPAV weighted with log Kucera-Francis word frequency |
30 | unsBPAV | Unstressed biphoneme probability average; vowel-stress ignored |
31 | unsFBPAV | unsBPAV weighted with Kucera-Francis word frequency |
32 | unsLBPAV | unsBPAV weighted with log Kucera-Francis word frequency |
33 | strTPAV | Stressed triphoneme probability average; distinct stressed-vowels |
34 | strFTPAV | strTPAV weighted with Kucera-Francis frequency |
35 | strLTPAV | strTPAV weighted with log Kucera-Francis frequency |
36 | unsTPAV | Unstressed triphoneme probability average; vowel-stress ignored |
37 | unsFTPAV | unsTPAV weighted with Kucera-Francis frequency |
38 | unsLTPAV | unsTPAV weighted with log Kucera-Francis frequency |
39 | strPOSPAV | Stressed positional probability average; distinct stressed-vowels |
40 | strFPOSPAV | strPOSPAV weighted with Kucera-Francis frequency |
41 | strLPOSPAV | strPOSPAV weighted with log Kucera-Francis frequency |
42 | unsPOSPAV | Unstressed positional probability average; vowel-stress ignored |
43 | unsFPOSPAV | unsPOSPAV weighted with Kucera-Francis frequency |
44 | unsLPOSPAV | unsPOSPAV weighted with log Kucera-Francis frequency |
45 | KFFREQ | Kucera-Francis Written Word Frequency for real words |
46 | LOGFRQ | log Kucera-Francis Written Word Frequency for real words |
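The column layout is regular enough to generate in code, which can be handy for looking up a column number by heading. The Python sketch below rebuilds the 46 headings from the table above; it assumes the transcription columns are labelled PH01 through PH17, and the function and dictionary names are only illustrative.

```python
def iphod_headings(include_frequency=True):
    """Rebuild the column headings in table order (44 columns, or 46 for words)."""
    names = ["Word", "NPHON", "NSYL"]
    names += [f"PH{i:02d}" for i in range(1, 18)]  # columns 4-20
    # Density columns use their own suffixes: DENS, FDEN, LDEN.
    for stress in ("str", "uns"):
        names += [stress + suffix for suffix in ("DENS", "FDEN", "LDEN")]
    # Probability averages: plain, frequency-weighted (F), log-frequency-weighted (L).
    for measure in ("BPAV", "TPAV", "POSPAV"):
        for stress in ("str", "uns"):
            names += [stress + weight + measure for weight in ("", "F", "L")]
    if include_frequency:
        names += ["KFFREQ", "LOGFRQ"]  # real words only
    return names


# Map each heading to its 1-based column number, as in the table.
COLUMN_NUMBER = {name: i for i, name in enumerate(iphod_headings(), start=1)}

for heading in ("strDENS", "unsBPAV", "strPOSPAV", "KFFREQ"):
    print(heading, COLUMN_NUMBER[heading])
```

Calling `iphod_headings(include_frequency=False)` gives the 44 headings shared with the pseudoword files.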
Using PERL to search IPhOD (Files will be added soon, around 2/28/2009)
In many cases, PERL scripts let you search more elegantly and powerfully than a spreadsheet program
like Excel, but at the cost of programming time, so decide wisely. If you have some programming background, you can modify these scripts
to create new search functions better suited to your research questions. For example, I modified this script to search for CVC items
only, or for CVC words that share the CV-onset. I am interested in seeing your code if you modify mine to improve it. The instructions below
are based on my development OS (Windows), and assume you have downloaded and unpacked the word or pseudoword contents
(above).
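As an illustration of the kind of custom search mentioned above, here is a short Python sketch (not the PERL script itself) that labels each phoneme of a CMU transcription as a consonant or vowel and checks whether two items share their CV-onset. The vowel set is the standard CMU vowel inventory; the function names are hypothetical.

```python
# Vowel symbols of the CMU phoneme set; stress digits (0, 1, 2) follow vowels.
CMU_VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}


def cv_pattern(phonemes):
    """Return a string such as 'CVC' describing the consonant/vowel structure."""
    return "".join(
        "V" if p.rstrip("012") in CMU_VOWELS else "C" for p in phonemes
    )


def is_cvc(phonemes):
    """True for items with exactly a consonant-vowel-consonant structure."""
    return cv_pattern(phonemes) == "CVC"


def shares_cv_onset(a, b):
    """True if two CVC items begin with the same consonant and vowel."""
    return is_cvc(a) and is_cvc(b) and list(a[:2]) == list(b[:2])


cat = ["K", "AE1", "T"]
cab = ["K", "AE1", "B"]
strap = ["S", "T", "R", "AE1", "P"]

print(cv_pattern(cat), cv_pattern(strap))
print(shares_cv_onset(cat, cab), shares_cv_onset(cat, strap))
```

Because the comparison keeps the stress digit, two items share an onset here only if the vowel stress also matches; strip the digits first if stress should be ignored.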
1. If you do not have PERL on your PC yet, then install it. Active State PERL (Windows, free).
I wrote these scripts on a Windows machine, but some slight changes allow them to run beautifully in Linux, which includes PERL by default in most cases.
Mac OS may also include PERL without any installation, but I have not confirmed this.
2. Download the PERL search script and search textfile (archived): IPhod_Search.ZIP
.... ZIP archive containing search script and query file (archive updated Mar 17, 2009).
3. Unzip contents of IPhod_Search.ZIP into the directory containing either word OR pseudoword textfiles.
4. Edit SEARCH_VALS.TXT using a text editor or spreadsheet program. Column #1 shows a value label that corresponds
to the header row of the word or pseudoword files. Column #2 gives the minimum allowed value, and Column #3 is the maximum.
If you do not specify a value (blank field), then that variable is ignored when filtering the results.
5. Execute the IPHOD_SEARCH.PL script. Using the DOS prompt or command window, navigate to the directory containing all
the search files, including iphod_search.pl and the files you are searching, then type "iphod_search.pl". The search output (Output.txt) should
contain only words or pseudowords within the value ranges specified in step 4, above. The layout of search_vals.txt and the command line,
using a real example, are shown in the figure below.
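For readers who prefer not to use PERL, the same minimum/maximum filtering can be sketched in Python. This is not a port of IPHOD_SEARCH.PL. It assumes the query file is tab-delimited with a label, a minimum, and a maximum per line and no header row, that a label with both bounds blank is ignored, and that a single blank bound leaves that side open. The sample values are made up.

```python
import csv
import io


def load_ranges(handle):
    """Parse label / minimum / maximum lines; a blank bound becomes None."""
    ranges = {}
    for fields in csv.reader(handle, delimiter="\t"):
        fields = [f.strip() for f in fields] + ["", "", ""]
        label, low, high = fields[:3]
        if not label or (not low and not high):
            continue  # no bounds given: this variable is ignored
        ranges[label] = (
            float(low) if low else None,
            float(high) if high else None,
        )
    return ranges


def in_range(row, ranges):
    """True if every constrained column of row lies within its bounds."""
    for label, (low, high) in ranges.items():
        try:
            value = float(row[label])
        except (KeyError, TypeError, ValueError):
            return False  # missing or non-numeric value cannot match
        if low is not None and value < low:
            return False
        if high is not None and value > high:
            return False
    return True


def search(data_handle, ranges):
    """Return the rows of a tab-delimited file (with header) that pass all ranges."""
    reader = csv.DictReader(data_handle, delimiter="\t")
    return [row for row in reader if in_range(row, ranges)]


# Made-up query and data, standing in for SEARCH_VALS.TXT and a word file.
SEARCH_VALS = "NPHON\t3\t3\nunsDENS\t10\t\nstrBPAV\t\t\n"
DATA = (
    "Word\tNPHON\tunsDENS\tstrBPAV\n"
    "CAT\t3\t35\t0.004\n"
    "ZIG\t3\t4\t0.002\n"
    "STRAP\t5\t12\t0.003\n"
)

ranges = load_ranges(io.StringIO(SEARCH_VALS))
matches = search(io.StringIO(DATA), ranges)
print([row["Word"] for row in matches])
```

Replace the in-memory strings with open file handles to run the filter on the downloaded textfiles.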
Using PERL to Calculate New Values
If a word or pseudoword isn't in the IPhOD, there is another PERL script I wrote to calculate new density values and
phonotactic probabilities the same way that they were originally done, for a list of items in CMU transcription format. This is advanced
IPhOD usage only, so contact me with a list or to obtain those additional PERL files.
Credits
The database was developed by UCI PhD Psychology LCBR graduate student Kenny Vaden, advised by Greg Hickok.
We gratefully acknowledge the help of Jean-Claude Falmagne (Department of Cognitive Sciences, UCI) in developing formal equations
(not shown). XML markup and search GUI (no longer available) were by Harry Halpin, PhD Informatics graduate student at the University of
Edinburgh. Kai Okado, another LCBR graduate student, was consulted for clarifications. Error-checking was assisted by undergraduate
research assistants Yasmine Omidvar (Fall 2003 - Spring 2004) and Corica Rodgers (Summer 2004).
References
Griffin, Zenzi M. and Bock, Kathryn. 1998. Constraint, Word Frequency, and the Relationship between Lexical Processing Levels in Spoken Word Production. Journal of Memory and Language, 38(3): 313-338.
Hulme, C.; Roodenrys, S.; Schweickert, R.; Brown, G.D.A.; Martin, S.; and Stuart, G. 1997. Word-frequency effects on short-term memory tasks. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23(5): 1217–1232.
Kucera, Henry and Francis, W. Nelson. 1967. Computational analysis of present-day American English. Providence, Brown University Press.
Majerus, Steve; Van der Linden, Martial; Mulder, Ludivine; Meulemans, Thierry; and Peters, Frederic. 2004. Verbal short-term memory reflects the sublexical organization of the phonological language network. Journal of Memory and Language, 51(2):297-306.
Saffran, Jenny R.; Newport, Elissa L.; and Aslin, Richard N. 1996. Word Segmentation: The Role of Distributional Cues. Journal of Memory and Language, 35(4):606–621.
Vitevitch, Michael S.; Armbrüster, Jonna; and Chu, Shinying. 2004. Sublexical and Lexical Representations in Speech Production: Effects of Phonotactic Probability and Onset Density. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30(2):514-529.
Vitevitch, Michael S.; Luce, Paul A.; Pisoni, David B.; and Auer, Edward T. 1999. Phonotactics, Neighborhood Activation, and Lexical Access for Spoken Words. Brain and Language, 68(1):306-311.
Vitevitch, Michael S. and Luce, Paul A. 2004. A Web-based interface to calculate phonotactic probability for words and nonwords in English. Behavior Research Methods, Instruments, & Computers 36 (3): 481–487.
Weide, Robert L. 1994. CMU Pronouncing Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
Wilson, M. 1988. MRC Psycholinguistic Database: Machine Readable Dictionary, Version 2. Behavioural Research Methods, Instruments & Computers, 20(1), 6-11. http://www.psy.uwa.edu.au/mrcdatabase/uwa_mrc.htm
Published Studies Using IPhOD:
Creel, S.C., Aslin, R.N., Tanenhaus, M.K. (2008). Heeding the voice of experience: the role of talker variation in lexical access. Cognition, 106, 633-664.
Sabri, M., Binder, J.R., Desai, R., Medler, D.A., Leitl, M.D., Liebenthal, E. (2007). Attentional and linguistic interactions in speech perception. NeuroImage, 39(3), 1444-1456.
Desai, R., Binder, J.R., Medler, D.A., Conant, L.L., Seidenberg, M.S. (2006). Activation of sensory-motor areas by sentences. Fifth International Conference on Development and Learning, 2006.