
Details on IPhOD: organization and measures

The IPhOD measures of phonotactic probability and neighborhood density were calculated over a large word set, following the approach of Vitevitch and Luce (1999). Phonotactic probability refers to the likelihood that a sequence of sounds present in a given word occurs together. Phonological neighborhood density is the number of words that share all but one phoneme with a particular word or pseudoword. Positional probability refers to the average likelihood of each phoneme occurring in its position within a word. These counts and probabilities were also weighted by frequency and log frequency to reflect how often the underlying words occur in natural language.
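As a rough illustration of the density definition above, here is a minimal Python sketch. It is not the IPhOD code: the toy lexicon and its frequencies are made up, only single-phoneme substitutions are treated as neighbors (taking "share all but one phoneme" literally), and summing neighbor frequencies with a +1 inside the log is an assumed weighting scheme.

```python
import math

def is_neighbor(a, b):
    """True when two phoneme tuples have equal length and differ in exactly one position."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1

def density(target, lexicon):
    """Return (raw, frequency-weighted, log-weighted) density for target.

    lexicon maps a phoneme tuple to a word frequency (hypothetical values here).
    """
    raw, freq_w, log_w = 0, 0.0, 0.0
    for phones, freq in lexicon.items():
        if is_neighbor(target, phones):
            raw += 1                        # unweighted: one count per neighbor
            freq_w += freq                  # weighted by the neighbor's frequency
            log_w += math.log10(freq + 1)   # +1 is an assumption, to avoid log(0)
    return raw, freq_w, log_w

# Toy lexicon with invented frequencies, for illustration only.
toy = {
    ("K", "AE", "T"): 50,
    ("B", "AE", "T"): 20,
    ("K", "AH", "T"): 10,
    ("K", "AE", "P"): 5,
    ("D", "AO", "G"): 40,
}

raw, fw, lw = density(("K", "AE", "T"), toy)
print(raw, fw)  # 3 35.0
```

The target itself is never counted, since it differs from itself in zero positions. The same function works for pseudowords, because the target does not need to be in the lexicon.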

The IPhOD measures extend the definitions of Vitevitch and Luce (1999) by performing each calculation twice: once distinguishing vowels by syllable stress, and once ignoring stress. In the stressed calculations, otherwise identical vowel sounds are treated as distinct phonemes depending on whether they carry primary, secondary, or no stress. The so-called unstressed calculations collapse each vowel sound into a single phoneme category. This may allow hypotheses related to syllable stress to be tested, since stress information is available in the transcriptions of the Carnegie-Mellon Pronouncing Dictionary (Weide, 1994).
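The difference between the two phoneme inventories can be sketched in a few lines of Python. This assumes CMU-style vowel labels ending in a stress digit (0, 1, or 2) and period-separated phonemes; the function name is hypothetical and this is not taken from the IPhOD source.

```python
import re

def strip_stress(transcription):
    """Collapse stressed vowels: 'AH0', 'AH1', and 'AH2' all become 'AH'."""
    phones = transcription.split(".")
    return ".".join(re.sub(r"[012]$", "", p) for p in phones)

stressed = "AH0.B.AH1.V"             # "above" in CMU-style notation
unstressed = strip_stress(stressed)  # "AH.B.AH.V"

# Stressed calculations see AH0 and AH1 as different phonemes;
# unstressed calculations see a single AH.
print(len(set(stressed.split("."))))    # 4
print(len(set(unstressed.split("."))))  # 3
```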

Version 2.0 versus 1.4

IPhOD version 2.0 contains 54,030 word and 814,840 pseudoword entries, with phonotactic and density estimates, American English transcriptions of 1-28 phonemes, and word frequencies. Each entry contains an American English phonetic transcription from the Carnegie-Mellon Pronouncing Dictionary (Weide, 1994), and written word frequencies from the SUBTLEXus database (Brysbaert & New, 2009). Neighborhood density and word-averaged phoneme-sequence probabilities were derived from those data using the same formulas for words and pseudowords, so that entries of either type can be chosen using identical criteria.

IPhOD version 1.4 contains transcriptions of 1-17 phonemes, and word frequencies, for 33,432 words and 814,840 pseudowords. Each entry contains an American English phonetic transcription from the Carnegie-Mellon Pronouncing Dictionary (Weide, 1994), and Kucera-Francis (1967) written word frequencies from the MRC Psycholinguistic Dictionary (Wilson, 1988).

Additional Information about Version 2.0

Different ways of saying the same thing? IPhOD version 2.0 introduced homophones and homographs to the database. This addition required special steps to avoid double-counting pronunciations or double-weighting with written word frequencies. For each measure in the database, homophones were counted separately in weighted counts, because words that are pronounced identically but spelled differently have different written frequencies in SUBTLEXus. However, homophone entries were counted only once in raw counts, since their pronunciations are indistinguishable. Homographs were handled in the opposite way: weighted counts used only one entry, since there was no way of assigning a written word frequency to each of several pronunciations of the same orthographic item. Meanwhile, each pronunciation of a homographic entry was counted separately in the raw count. Previous versions of IPhOD had a single pronunciation for each spelling, and did not treat homographs differently from other words.
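The counting rule for homophones and homographs can be sketched as follows. This is a simplified Python illustration, not the IPhOD implementation: the entries and frequencies are invented, and the function simply tallies a set of items (for example, the neighbors of some word) once by pronunciation and once by spelling.

```python
def raw_and_weighted(entries):
    """entries: iterable of (spelling, pronunciation, frequency) tuples.

    Raw count: one per distinct pronunciation, so homophones collapse
    while each homograph pronunciation is counted.
    Weighted count: one frequency per distinct spelling, so homophones
    each contribute while a homograph contributes only once.
    """
    pronunciations = set()
    freq_by_spelling = {}
    for spelling, pron, freq in entries:
        pronunciations.add(pron)
        freq_by_spelling.setdefault(spelling, freq)  # keep first entry per spelling
    return len(pronunciations), sum(freq_by_spelling.values())

# Invented frequencies, for illustration only.
entries = [
    ("pair", "P.EH.R", 30),  # homophones: same pronunciation, different spellings
    ("pear", "P.EH.R", 5),
    ("lead", "L.IY.D", 40),  # homographs: same spelling, different pronunciations
    ("lead", "L.EH.D", 40),
]

raw, weighted = raw_and_weighted(entries)
print(raw, weighted)  # 3 75
```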

Version 2.0: All Values by Column Number and Title:

| Column # | Column Name | Description |
| --- | --- | --- |
| 1 | Indx | Index number for word or pseudoword collections. |
| 2 | Word | Orthographic form of word, or altered word that generated pseudoword |
| 3 | UnTrn | CMU Pronouncing Dictionary transcription. Phoneme glyphs separated by period marks. Unstressed; contains no syllable stress information. |
| 4 | StTrn | Stressed transcription; 0, 1, 2 indicate an unstressed, primary-stressed, or secondary-stressed syllable. |
| 5 | NSYL | Number of syllables |
| 6 | NPHON | Number of phonemes |
| 7, 8, 9, 10 | unsDENS, unsFDEN, unsLDEN, unsCDEN | Unstressed phonological neighborhood density.* |
| 11, 12, 13, 14 | strDENS, strFDEN, strLDEN, strCDEN | Stressed phonological neighborhood density.* |
| 15, 16, 17, 18 | unsBPAV, unsFBPAV, unsLBPAV, unsCBPAV | Unstressed, word-average biphoneme probability (relative frequencies for ordered phoneme pairs).* |
| 19, 20, 21, 22 | strBPAV, strFBPAV, strLBPAV, strCBPAV | Stressed, word-average biphoneme probability.* |
| 23, 24, 25, 26 | unsTPAV, unsFTPAV, unsLTPAV, unsCTPAV | Unstressed, word-average triphoneme probability (relative frequencies for ordered phoneme triplets).* |
| 27, 28, 29, 30 | strTPAV, strFTPAV, strLTPAV, strCTPAV | Stressed, word-average triphoneme probability.* |
| 31, 32, 33, 34 | unsPOSPAV, unsFPOSPAV, unsLPOSPAV, unsCPOSPAV | Unstressed, word-average positional probability (frequency of each phoneme occurring in a specific position, e.g. first, second, etc.).* |
| 35, 36, 37, 38 | strPOSPAV, strFPOSPAV, strLPOSPAV, strCPOSPAV | Stressed, word-average positional probability.* |
| 39, 40, 41, 42 | unsLCPOSPAV, unsFLCPOSPAV, unsLLCPOSPAV, unsCLCPOSPAV | Unstressed, length-constrained word-average positional probability. Similar to positional probability, but counts phonemes in the specific position only among words that contain the same number of phonemes.* |
| 43, 44, 45, 46 | strLCPOSPAV, strFLCPOSPAV, strLLCPOSPAV, strCLCPOSPAV | Stressed, length-constrained word-average positional probability.* |
| 47 | SFreq | SUBTLEXus word frequency.** |
| 48 | SCDcnt | SUBTLEXus CD count, another measure of word frequency.** |

*Note: all measures above that are listed in groups of four were calculated either as unweighted counts or weighted with different frequency measures. Each quad is ordered: unweighted, SUBTLEXus weighted, log (base 10) SUBTLEXus weighted, Context Count weighted (SUBTLEXus), respectively.

**Note: SUBTLEX word frequency columns (47,48) are only available for words (not pseudowords).
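The naming scheme in the column listing is regular enough to generate. The Python sketch below rebuilds the 48 column names from the prefixes (uns, str), the weighting letters (none, F, L, C), and the measure stems, which can be handy for labeling columns after loading a file. Whether the distributed files carry a header row with these names is not stated here, so treat that as an assumption; the helper names are hypothetical.

```python
# Weighting letters in quad order: unweighted, SUBTLEXus frequency,
# log10 SUBTLEXus frequency, SUBTLEXus CD count.
WEIGHTS = ["", "F", "L", "C"]

def quad(prefix, stem):
    """Return the four column names for one measure, e.g. unsBPAV ... unsCBPAV."""
    if stem == "DENS":
        # Density is the one irregular stem: DENS, then FDEN, LDEN, CDEN.
        return [prefix + "DENS"] + [prefix + w + "DEN" for w in WEIGHTS[1:]]
    return [prefix + w + stem for w in WEIGHTS]

def v2_columns():
    """Column names for version 2.0, in column order 1-48."""
    cols = ["Indx", "Word", "UnTrn", "StTrn", "NSYL", "NPHON"]
    for stem in ["DENS", "BPAV", "TPAV", "POSPAV", "LCPOSPAV"]:
        for prefix in ["uns", "str"]:
            cols += quad(prefix, stem)
    cols += ["SFreq", "SCDcnt"]  # words only; absent for pseudowords
    return cols

cols = v2_columns()
print(len(cols))  # 48
print(cols[6:10])  # ['unsDENS', 'unsFDEN', 'unsLDEN', 'unsCDEN']
```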


Additional Information about Version 1.4

Words & Pseudowords: nearly identical file structure ... with 2 key differences. The word and pseudoword text files contain identically organized columns, with two exceptions. First, in the word collection the last two columns show Kucera-Francis frequencies, while pseudowords have neither value. Second, the first column of the pseudoword file shows the *word that was changed to produce the pseudoword*. The pseudoword files can seem confusing at first, since many people read the "word" column entry and overlook the transcription, which differs from it and is the pseudoword as it is actually pronounced. Each pseudoword was generated by changing one phoneme of a real word, so it helps to see what that word was when you try to pronounce the pseudoword correctly. For example, "Fox" might show up as the "word" entry for a pseudoword, but the transcription columns read "F AH Z", so it is pronounced "Foz".
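To make the "Fox" example concrete, here is a small Python sketch that reads the pronunciation of a version 1.4 pseudoword row from its transcription columns rather than from the "word" column. The row is built by hand from the example above; representing unused phoneme slots as empty strings is an assumption about the file layout, and the function name is hypothetical.

```python
def pronunciation(row):
    """Join the phoneme columns of a v1.4 row (columns 4-20) into one string.

    row[0] is the source word, row[1] is NPHON, row[2] is NSYL,
    and row[3:20] holds up to 17 phonemes.
    """
    n_phon = int(row[1])
    return " ".join(row[3:3 + n_phon])

# Hand-built row: source word "Fox", 3 phonemes, 1 syllable, 14 unused slots.
row = ["Fox", "3", "1", "F", "AH", "Z"] + [""] * 14

print(row[0])              # Fox  (the word that was altered, not the pseudoword)
print(pronunciation(row))  # F AH Z
```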

Version 1.4: Summary of Columns and Contents

Each file of the database contains columns 1-44, and all word entries also contain Kucera-Francis word frequencies (columns 45, 46). IPhOD values can be used to find items, or to quantify properties of a given word list (number of neighbors, etc.).

Version 1.4: All Values by Column Number and Title:

| Column | Heading | Description |
| --- | --- | --- |
| 1 | Word | Orthographic form of word, or altered word that generated pseudoword |
| 2 | NPHON | Number of phonemes |
| 3 | NSYL | Number of syllables |
| 4 ... 20 | PH01...17 | CMU Pronouncing Dictionary phonetic transcription (1, 2, 0 stress) |
| 21 | strDENS | Stressed phonological neighborhood density; distinct stressed-vowels |
| 22 | strFDEN | strDENS weighted with Kucera-Francis frequency of neighbors |
| 23 | strLDEN | strDENS weighted with Kucera-Francis log frequency of neighbors |
| 24 | unsDENS | Unstressed phonological neighborhood density; vowel-stress ignored |
| 25 | unsFDEN | unsDENS weighted with Kucera-Francis frequency of neighbors |
| 26 | unsLDEN | unsDENS weighted with Kucera-Francis log frequency of neighbors |
| 27 | strBPAV | Stressed biphoneme probability average; distinct stressed-vowels |
| 28 | strFBPAV | strBPAV weighted with Kucera-Francis word frequency |
| 29 | strLBPAV | strBPAV weighted with log Kucera-Francis word frequency |
| 30 | unsBPAV | Unstressed biphoneme probability average; vowel-stress ignored |
| 31 | unsFBPAV | unsBPAV weighted with Kucera-Francis word frequency |
| 32 | unsLBPAV | unsBPAV weighted with log Kucera-Francis word frequency |
| 33 | strTPAV | Stressed triphoneme probability average; distinct stressed-vowels |
| 34 | strFTPAV | strTPAV weighted with Kucera-Francis frequency |
| 35 | strLTPAV | strTPAV weighted with log Kucera-Francis frequency |
| 36 | unsTPAV | Unstressed triphoneme probability average; vowel-stress ignored |
| 37 | unsFTPAV | unsTPAV weighted with Kucera-Francis frequency |
| 38 | unsLTPAV | unsTPAV weighted with log Kucera-Francis frequency |
| 39 | strPOSPAV | Stressed positional probability average; distinct stressed-vowels |
| 40 | strFPOSPAV | strPOSPAV weighted with Kucera-Francis frequency |
| 41 | strLPOSPAV | strPOSPAV weighted with log Kucera-Francis frequency |
| 42 | unsPOSPAV | Unstressed positional probability; vowel-stress ignored |
| 43 | unsFPOSPAV | unsPOSPAV weighted with Kucera-Francis frequency |
| 44 | unsLPOSPAV | unsPOSPAV weighted with log Kucera-Francis frequency |
| 45 | KFFREQ | Kucera-Francis Written Word Frequency for real words |
| 46 | LOGFRQ | log Kucera-Francis Written Word Frequency for real words |