Words, sub-words and compound words!
Not everything I write here is going to be about CS or linguistics however! So don't be discouraged from reading in the future even if you find this painfully boring!
In the last year I was lucky enough to be able to work on the project WordMine2, a linguistics project which is trying to gather Orthographic (letter sizes and differences), Phonological (phonemes or sound properties) and Semantic (meanings) information about words. Currently managed by Dr. Lori Buchanan of the University of Windsor I re-designed the UI for searching this database of about 64 thousand words and now I'm looking at more data mining.
One of the benefits of doing this work in Dr. Buchanan's lab is that I am able to sit in on meetings she has with her thesis/graduate students. This allows me to be able to see what kind of data and access to the data they require for their studies. Recently a few students have been looking into compound words and that's what sparked my interest in this topic.
So my current self-assigned task is to take our word list and try and find all of the compound words within it. I divided this into two separate operations in the code, the first being to find all of the words in the list that were parts of other words (this will make more sense later) and then take those words that i found and see if they make up the whole word.
Find Sub Words
So from another analysis project I did I already had the code to import the words and place them into a unique "word" class that stored other information about them as it was calculated. The brute force way of finding the sub words is relatively simple. have two sentinel values and just go through the array once per word. Given an array of n length, this is bad as far as code complexity is concerned, especially with 64k words.
for i : 0 .. n
for j : 1 .. n
if(w[i].contains(w[j]))
w[i].addStem(j)
Knowing I was going to take a crack at this problem I was discussing it with another student and good friend of mine Dan, he suggested sorting the array by length. That way you'd only have to check forward in the array for example cat.contains(apple) cat can't even fit 5 letters so we shouldn't bother checking all of the letters against each other! Dan saved me a good chunk of calculations as my code becomes:
sort_by_length(w)
for i : 1 .. n
for j : 1 .. i - 1
if(w[i].contains(w[j]))
w[i].addStem(j)
But I'm still comparing a ton of words together that are of the same length and these aren't going to be a match because the word list contains no duplicates, so lets skip words that are the same size
sort_by_length(w)
for i : 0 .. n
for j : 0 .. i - 1
if(w[j].length > w[i].length && w[i].contains(w[j]))
w[i].addStem(j)
That works! Runs on my desktop with the full word list in 1min 14sec 32ms
the next step is checking each word that has "stems" and seeing if they can be re-arranged to create the word again. This may however produce some false positives for words like:
- buttonhole -- "but" "ton" "hole"
- catalog -- "cat" "a" "log"?
- champ'ion'ship
- cot'ton'mouth (& cot'ton'tail & cot'ton'wood)
- counter'at'tack
- cub'by'hole
as pointed out in the blog Laws of Scilence
Comments
Daurade (not verified)
Mon, 12/12/2011 - 06:24
Permalink
Thanks for mentioning LoS.
Thanks for mentioning LoS. It occurs to me that these false positives are something to eliminate in work such as yours, but could be interesting for someone looking for word play, puns, etc., the find the hidden words inside the words.... :)
fake watches (not verified)
Wed, 02/08/2012 - 02:20
Permalink
welcome here
fake watches sale
Replica Rolex Submariner
Cartier Ballon Bleu
Replica Cartier Tank
Replica Cartier Santos
Replica Rolex Daytona
Add new comment