Thursday, September 29, 2011

Graphs in Postgres

A long time ago, I came across a youtube video describe some of the more interesting features in PostgreSQL. Since I am stuck using MySQL for most of my paid projects, I waited until a side project grabbed my interest enough to devote a small bit of time to it.

The Problem


Given a list of words, use each word as a node. An edge exists between words if the levenshtein distance between them is 1. Find paths between words and count how many words are connected to a root word.

The "Solution"


grep "^[[:lower:]][[:lower:]]\+$" /usr/share/dict/words | sort -R | head -n 10000 > small_list.txt

Doing the full word list takes quite some time, but things are fairly fast with just 10000 words. The words file has a lot of proper nouns, and words with punctuation, so I filter all of those out. I also filter out one letter words, as they are only one distance away from a blank character.

CREATE TABLE base_words
(
word text NOT NULL,
CONSTRAINT "Primary Key" PRIMARY KEY (word)
);
COPY base_words FROM '.../small_list.txt';


Pretty basic here, just stuffing the words into a simple table.

CREATE TABLE edges (a text, b text);
CREATE index b_idx ON edges (b);
CREATE index a_idx ON edges (a);
INSERT INTO edges ( SELECT a.word, b.word FROM base_words a INNER JOIN base_words b ON a.word > b.word AND levenshtein(a.word, b.word) = 1 );


Here we create a list of edges based off of the levenshtein distance. You will need to install contrib/fuzzystrmatch.sql onto your db if you get the error "HINT: No function matches the given name and argument types. You might need to add explicit type casts"

CREATE VIEW all_edges (a,b) AS SELECT a,b FROM edges UNION ALL SELECT b,a FROM edges;

The second method is much cleaner if we have both directions of the edges.

CREATE VIEW brute AS (
WITH RECURSIVE w(a, b) AS
(
SELECT 'week', ''
UNION
SELECT CASE WHEN ( wi.a < wi.b) THEN wi.a ELSE wi.b END,
CASE WHEN ( wi.a < wi.b) THEN wi.b ELSE wi.a END
FROM w
JOIN edges wi
ON w.a = wi.a OR w.b = wi.a
OR w.a = wi.b OR w.b = wi.b
)
SELECT w.a, w.b FROM w
);


Here we recursively walk through the tree, starting with the word 'week', and store all of the links that we come across.

SELECT COUNT(*) FROM (SELECT a FROM brute UNION SELECT b FROM brute ) AS s3;

With my words, I have 440 connected words.

While cool, it would be nice to be able to see the path through the graph

CREATE VIEW path AS (
WITH RECURSIVE w(a, b, path) AS
(
SELECT '', 'week', ARRAY['week'] AS path
UNION ALL
SELECT wi.a , wi.b, w.path || wi.b
FROM w
JOIN all_edges wi
ON w.b = wi.a
AND wi.b <> ALL (w.path[2:array_upper(w.path,1)])
)
SELECT w.a, w.b, path FROM w
);


This isn't really the efficient way to do things. It covers every possible route away from 'week', and lists them. It won't get caught in a loop, but it will list out every possible path between each word that is connected to 'week'