Data on word structure were acquired from 18 languages: [Indo-European] (1) English, (2) German, (3) Spanish, (4) Bengali, (5) Bosnian; [Altaic] (6) Turkish; [American] (7) Inuktitut, (8) Taino, (9) Yucatec Maya; [African] (10) Lango, (11) Somali, (12) Wolof, (13) Zulu, (14) Haya; [Austronesian] (15) Fijian, (16) Malagasy; [Dravidian] (17) Tamil; [East Asian] (18) Japanese. For each language we acquired a sample of common words (an average of 937 words per language, standard error 134); our analysis was confined to words having three or fewer non-sonorants (an average of 775 words per language, standard error 103). In many cases, data were obtained from transliterated dictionaries, and the phonological interpretation of the transliteration (for which we cared only about whether phonemes were plosives, fricatives or sonorants) was obtained from a variety of sources (some included in Table 1).
Each word in the sample was coded by converting each plosive to a ‘b’, each fricative to an ‘s’, and each run of adjacent sonorants to a single ‘a’. Sonorants included vowels as well as sonorant consonants (like y, w, l, r, m, n, and ng). Words beginning with a vowel typically begin with a glottal consonant, which was treated as a plosive; such words were therefore coded as starting with a ‘b’ before the ‘a’ of the vowel. Affricates (like “ch” and “j”) were coded as ‘bs’. Table 2 shows the counts for each structure type within each sampled language. For words beginning with a sonorant, only those having two or fewer non-sonorants were included; this is because, as discussed in the main text, these sonorant-start words are predicted to be cases where a ring was initiated with an inaudible hit.
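The coding rules above can be illustrated with a minimal Python sketch. It is not the procedure actually used for the sample, which relied on per-language phonological sources. The phoneme inventories below are small hypothetical sets chosen for illustration, words are assumed to arrive already segmented into phoneme symbols, and the names `code_word` and `is_included` are introduced here only for the example.

```python
from itertools import groupby

# Hypothetical toy inventories (assumptions, not the per-language sources).
PLOSIVES = {"p", "b", "t", "d", "k", "g"}
FRICATIVES = {"f", "v", "s", "z", "sh", "th", "h"}
AFFRICATES = {"ch", "j"}
VOWELS = {"a", "e", "i", "o", "u"}
SONORANT_CONSONANTS = {"y", "w", "l", "r", "m", "n", "ng"}


def code_word(phonemes):
    """Map a list of phoneme symbols to a structure type over 'b', 's', 'a'."""
    symbols = []
    for i, p in enumerate(phonemes):
        if p in VOWELS:
            if i == 0:
                symbols.append("b")  # implied glottal onset, treated as a plosive
            symbols.append("a")
        elif p in SONORANT_CONSONANTS:
            symbols.append("a")
        elif p in PLOSIVES:
            symbols.append("b")
        elif p in FRICATIVES:
            symbols.append("s")
        elif p in AFFRICATES:
            symbols.extend("bs")  # affricate = plosive followed by fricative
        else:
            raise ValueError(f"unclassified phoneme: {p!r}")
    # Collapse each run of adjacent sonorants into a single 'a';
    # runs of 'b' or 's' are kept at full length.
    parts = []
    for sym, run in groupby(symbols):
        parts.append("a" if sym == "a" else sym * len(list(run)))
    return "".join(parts)


def is_included(code):
    """Apply the inclusion rule: at most three non-sonorants, or at most
    two when the coded word begins with a sonorant."""
    non_sonorants = sum(ch != "a" for ch in code)
    limit = 2 if code.startswith("a") else 3
    return non_sonorants <= limit


print(code_word(["k", "a", "t"]))   # bab
print(code_word(["a", "p", "l"]))   # baba (vowel-initial, so a leading 'b')
print(code_word(["ch", "i", "n"]))  # bsa
```

Note that a word beginning with a sonorant consonant (for example m) is coded with a leading ‘a’, whereas a vowel-initial word receives a leading ‘b’, so only the former falls under the stricter two-non-sonorant limit in this sketch.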
As a test of the methodology for determining structure type from words, a naïve observer was asked to code the 863 words with three or fewer non-sonorants in our German sample; when the observer’s frequency counts for the structure types were plotted against those coded by the first author, the best-fit equation on a log-log plot was y = 0.95x^0.92, or nearly the identity (y = x), with R^2 = 0.88.
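For readers who want to reproduce this kind of inter-coder check, the following is a minimal sketch of a power-law fit y = a·x^b obtained by ordinary least squares on log-transformed counts. It assumes base-10 logarithms and that structure types with a zero count from either coder are dropped beforehand; neither detail is stated above. The function name `fit_power_law` is hypothetical, and the numbers in the usage lines are synthetic, not the German data.

```python
import math


def fit_power_law(xs, ys):
    """Fit log10(y) = log10(a) + b*log10(x) by least squares.

    xs: counts per structure type from the reference coder (all > 0)
    ys: counts per structure type from the second coder (all > 0)
    Returns (a, b, r2), where r2 is computed in log-log space.
    """
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    syy = sum((v - my) ** 2 for v in ly)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx                 # slope on the log-log plot (the exponent)
    a = 10 ** (my - b * mx)       # intercept, mapped back from log space
    r2 = sxy ** 2 / (sxx * syy)   # squared Pearson correlation of the logs
    return a, b, r2


# Synthetic check: points lying exactly on y = 2 * x**0.9.
xs = [1, 10, 100, 1000]
ys = [2 * x ** 0.9 for x in xs]
a, b, r2 = fit_power_law(xs, ys)
print(round(a, 6), round(b, 6), round(r2, 6))
```

A fit with a and b both close to 1 corresponds to the near-identity relation reported above.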
Our hypothesis is that physical events among macroscopic solid objects principally drive the competencies of our auditory system; coders were therefore trained to measure sequences of hits and slides in the physical events found in videos. To avoid any potential auditory bias to hear speech-like patterns among natural event sounds, measurements were made visually (i.e., with the video’s audio muted). Measurements were made from several categories of video, each chosen because of the likelihood of finding “typical” kinds of solid-object physical events. The categories are listed below, each followed by links to the videos (and their lengths).
Cooking (23 minutes)
http://www.youtube.com/watch?v=Enytl9Epfcs&feature=related (9:50)
Assembly instructions (17 minutes)
Children playing with toys (7 minutes)
http://www.youtube.com/watch?v=BSbV4U62Mg0&feature=related (1:45)
Acrobatics (8 minutes)
http://www.youtube.com/watch?v=KXpbCQ6kIVQ&feature=related (1:59)
Family gatherings (11 minutes)
These amount to 67 minutes of video in total. The average (across the three viewers) total number of events with three or fewer physical interactions (i.e., hits or slides) among these videos was 504.7. The pairwise correlations between the relative frequency distributions for the three viewers were R^2 = 0.51, R^2 = 0.63, and R^2 = 0.48. These three coders also measured from the same videos a second time, this time with the sound present; the average distribution for vision only was highly correlated with the average distribution for audition-and-vision (R^2 = 0.857). Also, as part of the training for coding, a “ground truth” auditory file was created by the first author with sample physical event types; the two coders measured the distribution of event types from it via audition only, and had correlations of R^2 = 0.63 and R^2 = 0.64 with the ground-truth source.
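The pairwise agreement figures above could be computed along the following lines. This is a sketch under stated assumptions: each coder's tallies are converted to relative frequencies over the union of event types, and agreement is the squared Pearson correlation of those frequencies. The text does not say whether the correlations were taken on raw or log-transformed frequencies, so the raw version is assumed here. The function names and the example tallies are hypothetical.

```python
from itertools import combinations


def relative_frequencies(counts, types):
    """Convert a coder's event-type counts to relative frequencies."""
    total = sum(counts.get(t, 0) for t in types)
    return [counts.get(t, 0) / total for t in types]


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


def pairwise_r2(coder_counts):
    """Return {(coder_a, coder_b): R^2} for every pair of coders."""
    # Use the union of event types so that a type missed by one coder counts as 0.
    types = sorted(set().union(*[c.keys() for c in coder_counts.values()]))
    freqs = {name: relative_frequencies(c, types)
             for name, c in coder_counts.items()}
    return {(a, b): r_squared(freqs[a], freqs[b])
            for a, b in combinations(sorted(freqs), 2)}


# Made-up tallies for three coders (illustration only, not the study data).
tallies = {
    "coder1": {"b": 10, "bs": 5, "s": 1},
    "coder2": {"b": 20, "bs": 10, "s": 2},
    "coder3": {"b": 1, "bs": 5, "s": 10},
}
for pair, r2 in pairwise_r2(tallies).items():
    print(pair, round(r2, 3))
```

Three coders yield three pairs, which matches the three correlation values reported for the viewers.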