Now all you have to do is code it up, and worry about all the exceptions. The exceptions remind me of a joke by Emo Philips.
Most states do not end in the letter "a." The only ones that do are Alabama, Georgia, Florida, Louisiana, Oklahoma, Arizona, California, Nevada, Alaska, Montana, Nebraska, South Dakota, North Dakota, Minnesota, Iowa, Indiana, Pennsylvania, North Carolina, South Carolina, West Virginia, east Virginia, and Missouri.
The rhythm can be foretold by looking at where the vowels are, right? So "rhythm" has ... err... two syllables, because it's split by the Y which counts as a vowel, whereas "foretold" obviously has three syllables, centred around the three vowels. Or is that centered?
> seems to do what you want. Except... "The rhythm of life" contains two
> syllables. Half a syllable per word.
> Good luck. This is a hard problem.
Maybe from a linguistic point of view, it is hard. But algorithmically, it seems somewhat easy: English has about 1,000,000 words (with very inclusive counting) and the number of syllables in each of them is known. So just do a table look-up. This approach also has the advantage of being applicable to any language (and it will be easier for most of them, since English has a huge vocabulary).
It's a finite problem and in fact smaller than, say, the problem of finding phone numbers based on name and address. The interesting part would be to use frequency information about words to make the look-up fast; or to find a good data structure to reduce memory consumption.
Of course, there is the issue of words being added to the language. However, a rule-based algorithm should not be expected to cope with new words either: its rules are designed only to deal with the known words.
>> seems to do what you want. Except... "The rhythm of life" contains two
>> syllables. Half a syllable per word.
>> Good luck. This is a hard problem.
> Maybe from a linguistic point of view, it is hard. But algorithmically, it
> seems somewhat easy: English has about 1,000,000 words (with very inclusive
> counting) and the number of syllables in each of them is known. So just do a
> table look-up. This algorithm also has the advantage of being applicable to
> any language (and it will be easier as English has a huge vocabulary).
> It's a finite problem and in fact smaller than, say, the problem of finding
> phone numbers based on name and address. The interesting part would be to
> use frequency information about words to make the look-up fast; or to find a
> good data structure to reduce memory consumption.
How about a hash-map for both of those?
Actually, with only 1 million words, the entire data structure can easily fit in memory on even the cheapest of today's desktop/server machines (mobile/embedded are a different story), which makes look-up extremely fast.
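A minimal sketch of what such a hash-map look-up might look like, assuming C++17 (for std::optional). The names SyllableTable, makeSampleTable, lookupSyllables and countSentence are made up for illustration, and the four-word table merely stands in for a real word list, which would have to come from a dictionary source:

```cpp
#include <cassert>
#include <cctype>
#include <optional>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical table type: lower-case word -> syllable count.
using SyllableTable = std::unordered_map<std::string, int>;

// Tiny hand-filled stand-in for the full word list.
SyllableTable makeSampleTable()
{
    return {{"the", 1}, {"rhythm", 2}, {"of", 1}, {"life", 1}};
}

// Lower-cases an ASCII word so that "The" and "the" share one entry.
std::string toLower(std::string s)
{
    for (char& c : s) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return s;
}

// Returns the stored count, or std::nullopt if the word is not in the table.
std::optional<int> lookupSyllables(const SyllableTable& table, const std::string& word)
{
    auto it = table.find(toLower(word));
    if (it == table.end()) {
        return std::nullopt;
    }
    return it->second;
}

// Sums the counts of whitespace-separated words.
// Gives up (std::nullopt) as soon as one word is unknown.
std::optional<int> countSentence(const SyllableTable& table, const std::string& sentence)
{
    std::istringstream in(sentence);
    std::string word;
    int total = 0;
    while (in >> word) {
        std::optional<int> n = lookupSyllables(table, word);
        if (!n) {
            return std::nullopt;
        }
        assert(*n > 0);  // every table entry is expected to be positive
        total += *n;
    }
    return total;
}
```

With this table, countSentence(makeSampleTable(), "The rhythm of life") yields 5. Punctuation handling and the policy for unknown words are deliberately left out.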
"Daniel T." <danie...@earthlink.net> wrote: > The exceptions remind me of a joke by Emo Phillips.
> Most states do not end in the letter "a." The only ones that do are
> Alabama, Georgia, Florida, Louisiana, Oklahoma, Arizona, California,
> Nevada, Alaska, Montana, Nebraska, South Dakota, North Dakota,
> Minnesota, Iowa, Indiana, Pennsylvania, North Carolina, South
> Carolina, West Virginia, east Virginia, and Missouri.
On Mar 18, 9:19 pm, Andy Champ <no....@nospam.invalid> wrote:
> Daniel T. wrote:
> > m...@privacy.net wrote:
> >> I'm pulling my hair out trying to figure out code for
> >> parsing and counting syllables in simple English
> >> sentences.
> >> Can someone throw the dog a bone on where to start?
> > Google is your friend:
> > http://english.glendale.cc.ca.us/phonics.rules.html
> <snip>
> Pay special attention to rule 1.
> The rhythm can be foretold by looking at where the vowels are,
> right? So "rhythm" has ... err... two syllables, because it's
> split by the Y which counts as a vowel,
The y is the only possible vowel, so rhythm can't have more than one syllable. Except that as I hear it (and according to dictionaries), it has two: in this case, the m is syllabic and forms a syllable of its own.
> whereas "foretold" obviously has three syllables, centred
> around the three vowels.
> Or is that centered?
Rule 7 and the second point under 1 in the Basic Syllable Rules do imply that silent e's don't count:-). (Of course, they don't give any hint as to how a program is to determine whether an e is silent or not.)
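For what it's worth, the naive reading of those rules (count runs of vowels, y included, and drop a word-final e) is easy to write down, which makes its failures easy to see. A rough sketch, assuming lower-case ASCII input; the function names are my own and nothing here comes from the linked page:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Treats y as a vowel, as the quoted rule of thumb does.
bool isVowel(char c)
{
    return std::string("aeiouy").find(c) != std::string::npos;
}

// Counts maximal runs of consecutive vowels in a lower-case ASCII word.
int countVowelGroups(const std::string& word)
{
    int groups = 0;
    bool inGroup = false;
    for (char c : word) {
        const bool vowel = isVowel(c);
        if (vowel && !inGroup) {
            ++groups;
        }
        inGroup = vowel;
    }
    return groups;
}

// Naive estimate: one syllable per vowel group, except that a word-final
// 'e' preceded by a consonant is assumed to be silent.
int naiveSyllables(const std::string& word)
{
    assert(!word.empty());  // an empty word has no meaningful count
    int n = countVowelGroups(word);
    const std::size_t len = word.size();
    if (n > 1 && len >= 2 && word[len - 1] == 'e' && !isVowel(word[len - 2])) {
        --n;
    }
    return n;
}
```

This gets "life" right (1), but it returns 1 for "rhythm" and 3 for "foretold", where the dictionaries give 2 for both. The silent e in the middle of "foretold" is exactly the case a position-based rule cannot see.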
> This web site
> http://www.wordcalc.com/
> seems to do what you want. Except... "The rhythm of life"
> contains two syllables. Half a syllable per word.
> Good luck. This is a hard problem.
To put it mildly. Compare "cooper" with the beginning of "cooperation".
And that's without internationalization: the rules in French or in German will be distinctly different from those in English.
For starters, you'll probably want to see http://tug.org/docs/liang/. To my knowledge, no one has done better since (and it works for all, or at least most, languages with a simple replacement of machine-generated tables).
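To give a flavour of the pattern-table idea, here is a rough sketch of how Liang-style patterns are applied to a word. Digits in a pattern mark inter-letter positions, the highest digit proposed for each position wins, and an odd result permits a break. The handful of patterns in samplePatterns() is hard-coded purely for illustration (a real table has thousands of machine-generated entries), all names are my own, and the sketch ignores the minimum-fragment limits a real hyphenator applies near the ends of a word:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct Pattern {
    std::string letters;      // pattern text with the digits removed
    std::vector<int> levels;  // levels[i] sits just before letters[i]
};

// Splits a pattern such as "hy3ph" into its letters and its position levels.
Pattern parsePattern(const std::string& text)
{
    Pattern p;
    p.levels.push_back(0);
    for (char c : text) {
        if (std::isdigit(static_cast<unsigned char>(c))) {
            p.levels.back() = c - '0';
        } else {
            p.letters.push_back(c);
            p.levels.push_back(0);
        }
    }
    assert(p.levels.size() == p.letters.size() + 1);
    return p;
}

// Illustrative patterns only; not a usable hyphenation table.
std::vector<std::string> samplePatterns()
{
    return {"hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"};
}

// Inserts '-' at every inter-letter position whose winning level is odd.
std::string hyphenate(const std::string& word, const std::vector<std::string>& patterns)
{
    const std::string padded = "." + word + ".";  // '.' marks the word edges
    std::vector<int> levels(padded.size() + 1, 0);  // levels[i] sits before padded[i]
    for (const std::string& text : patterns) {
        const Pattern p = parsePattern(text);
        for (std::size_t pos = padded.find(p.letters); pos != std::string::npos;
             pos = padded.find(p.letters, pos + 1)) {
            for (std::size_t k = 0; k < p.levels.size(); ++k) {
                levels[pos + k] = std::max(levels[pos + k], p.levels[k]);
            }
        }
    }
    std::string out;
    for (std::size_t i = 0; i < word.size(); ++i) {
        // The gap before word[i] is the gap before padded[i + 1].
        if (i > 0 && levels[i + 1] % 2 == 1) {
            out += '-';
        }
        out += word[i];
    }
    return out;
}
```

With these patterns, hyphenate("hyphenation", samplePatterns()) gives "hy-phen-ation". Note that this is three fragments for a four-syllable word: permitted hyphenation points are not the same thing as syllable boundaries, so this is a starting point rather than a syllable counter.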
Daniel Pitts wrote:
> How about a hash-map for both of those.
> Actually, with only 1 million words, the entirety of the data structure
> can easily fit in memory on even the cheapest of today's desktop/server
> machines (mobile/embedded are a different story). Making look up
> extremely fast.
Thinking about it, even a lookup won't fix the problem. There are a few words where the spelling is the same and the pronunciation is different. So you'll have to perform a linguistic analysis.
e.g.:
moped (as in sulked) 1 syllable; (as in small motorcycle) 2 syllables.
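One way a table could at least represent this is to store several readings per spelling, and flag the words where the counts disagree. A minimal sketch; Reading, HomographTable and the helper names are hypothetical, and choosing between the readings would still need the linguistic analysis mentioned above:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// One pronunciation of a spelling, with a note on which sense it belongs to.
struct Reading {
    std::string gloss;
    int syllables;
};

// Hypothetical table type: spelling -> every known reading of it.
using HomographTable = std::unordered_map<std::string, std::vector<Reading>>;

// Tiny hand-filled example table.
HomographTable makeHomographTable()
{
    return {
        {"moped", {Reading{"past tense of mope", 1}, Reading{"small motorcycle", 2}}},
        {"life", {Reading{"noun", 1}}},
    };
}

// True when the spelling alone does not determine the syllable count.
// Unknown words are reported as not ambiguous.
bool isAmbiguous(const HomographTable& table, const std::string& word)
{
    auto it = table.find(word);
    if (it == table.end()) {
        return false;
    }
    const std::vector<Reading>& readings = it->second;
    assert(!readings.empty());  // every entry is expected to hold a reading
    for (const Reading& r : readings) {
        if (r.syllables != readings.front().syllables) {
            return true;
        }
    }
    return false;
}
```

Here isAmbiguous(makeHomographTable(), "moped") is true and the same call for "life" is false, so a caller can tell when a plain look-up is not enough.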
Andy Champ wrote:
> Daniel Pitts wrote:
>> How about a hash-map for both of those.
>> Actually, with only 1 million words, the entirety of the data
>> structure can easily fit in memory on even the cheapest of today's
>> desktop/server machines (mobile/embedded are a different story).
>> Making look up extremely fast.
> Thinking about it even a lookup won't fix the problem. There are a few
> words where the spelling is the same, and the pronunciation is
> different. So you'll have to perform a linguistic analysis.
> eg:
> moped (as in sulked) 1 syllable; (as in small motorcycle) 2 syllables.
> I think we've lost the C++ content now!
See footnote 274 ([lib.streambuf.virt.get]) in the current standard.