Random things with SQL: Graphs in Postgres

A long time ago, I came across a youtube video describe some of the more interesting features in PostgreSQL. Since I am stuck using MySQL for most of my paid projects, I waited until a side project grabbed my interest enough to devote a small bit of time to it.

The Problem

Given a list of words, use each word as a node. An edge exists between words if the levenshtein distance between them is 1. Find paths between words and count how many words are connected to a root word.

The "Solution"

grep "^[[:lower:]][[:lower:]]\+$" /usr/share/dict/words | sort -R | head -n 10000 > small_list.txt

Doing the full word list takes quite some time, but things are fairly fast with just 10000 words. The words file has a lot of proper nouns, and words with punctuation, so I filter all of those out. I also filter out one letter words, as they are only one distance away from a blank character.

CREATE TABLE base_words ( word text NOT NULL, CONSTRAINT "Primary Key" PRIMARY KEY (word) ); COPY base_words FROM '.../small_list.txt';

Pretty basic here, just stuffing the words into a simple table.

CREATE TABLE edges (a text, b text); CREATE index b_idx ON edges (b); CREATE index a_idx ON edges (a); INSERT INTO edges ( SELECT a.word, b.word FROM base_words a INNER JOIN base_words b ON a.word > b.word AND levenshtein(a.word, b.word) = 1 );

Here we create a list of edges based off of the levenshtein distance. You will need to install contrib/fuzzystrmatch.sql onto your db if you get the error "HINT: No function matches the given name and argument types. You might need to add explicit type casts"

CREATE VIEW all_edges (a,b) AS SELECT a,b FROM edges UNION ALL SELECT b,a FROM edges;

The second method is much cleaner if we have both directions of the edges.

CREATE VIEW brute AS ( WITH RECURSIVE w(a, b) AS ( SELECT 'week', '' UNION SELECT CASE WHEN ( wi.a < wi.b) THEN wi.a ELSE wi.b END, CASE WHEN ( wi.a < wi.b) THEN wi.b ELSE wi.a END FROM w JOIN edges wi ON w.a = wi.a OR w.b = wi.a OR w.a = wi.b OR w.b = wi.b ) SELECT w.a, w.b FROM w );

Here we recursively walk through the tree, starting with the word 'week', and store all of the links that we come across.

SELECT COUNT(*) FROM (SELECT a FROM brute UNION SELECT b FROM brute ) AS s3;

With my words, I have 440 connected words.

While cool, it would be nice to be able to see the path through the graph

CREATE VIEW path AS ( WITH RECURSIVE w(a, b, path) AS ( SELECT '', 'week', ARRAY['week'] AS path UNION ALL SELECT wi.a , wi.b, w.path || wi.b FROM w JOIN all_edges wi ON w.b = wi.a AND wi.b <> ALL (w.path[2:array_upper(w.path,1)]) ) SELECT w.a, w.b, path FROM w );

This isn't really the efficient way to do things. It covers every possible route away from 'week', and lists them. It won't get caught in a loop, but it will list out every possible path between each word that is connected to 'week'

Random things with SQL

Thursday, September 29, 2011

Graphs in Postgres

The Problem

The "Solution"

No comments:

Post a Comment

Blog Archive

Followers

About Me