The Problem
Given a list of words, use each word as a node. An edge exists between words if the levenshtein distance between them is 1. Find paths between words and count how many words are connected to a root word.
The "Solution"
grep "^[[:lower:]][[:lower:]]\+$" /usr/share/dict/words | sort -R | head -n 10000 > small_list.txt
Doing the full word list takes quite some time, but things are fairly fast with just 10000 words. The words file has a lot of proper nouns, and words with punctuation, so I filter all of those out. I also filter out one letter words, as they are only one distance away from a blank character.
CREATE TABLE base_words
(
word text NOT NULL,
CONSTRAINT "Primary Key" PRIMARY KEY (word)
);
COPY base_words FROM '.../small_list.txt';
Pretty basic here, just stuffing the words into a simple table.
CREATE TABLE edges (a text, b text);
CREATE index b_idx ON edges (b);
CREATE index a_idx ON edges (a);
INSERT INTO edges ( SELECT a.word, b.word FROM base_words a INNER JOIN base_words b ON a.word > b.word AND levenshtein(a.word, b.word) = 1 );
Here we create a list of edges based off of the levenshtein distance. You will need to install contrib/fuzzystrmatch.sql onto your db if you get the error "HINT: No function matches the given name and argument types. You might need to add explicit type casts"
CREATE VIEW all_edges (a,b) AS SELECT a,b FROM edges UNION ALL SELECT b,a FROM edges;
The second method is much cleaner if we have both directions of the edges.
CREATE VIEW brute AS (
WITH RECURSIVE w(a, b) AS
(
SELECT 'week', ''
UNION
SELECT CASE WHEN ( wi.a < wi.b) THEN wi.a ELSE wi.b END,
CASE WHEN ( wi.a < wi.b) THEN wi.b ELSE wi.a END
FROM w
JOIN edges wi
ON w.a = wi.a OR w.b = wi.a
OR w.a = wi.b OR w.b = wi.b
)
SELECT w.a, w.b FROM w
);
Here we recursively walk through the tree, starting with the word 'week', and store all of the links that we come across.
SELECT COUNT(*) FROM (SELECT a FROM brute UNION SELECT b FROM brute ) AS s3;
With my words, I have 440 connected words.
While cool, it would be nice to be able to see the path through the graph
CREATE VIEW path AS (
WITH RECURSIVE w(a, b, path) AS
(
SELECT '', 'week', ARRAY['week'] AS path
UNION ALL
SELECT wi.a , wi.b, w.path || wi.b
FROM w
JOIN all_edges wi
ON w.b = wi.a
AND wi.b <> ALL (w.path[2:array_upper(w.path,1)])
)
SELECT w.a, w.b, path FROM w
);
This isn't really the efficient way to do things. It covers every possible route away from 'week', and lists them. It won't get caught in a loop, but it will list out every possible path between each word that is connected to 'week'
No comments:
Post a Comment