Thursday, September 29, 2011

Graphs in Postgres

A long time ago, I came across a youtube video describe some of the more interesting features in PostgreSQL. Since I am stuck using MySQL for most of my paid projects, I waited until a side project grabbed my interest enough to devote a small bit of time to it.

The Problem


Given a list of words, use each word as a node. An edge exists between words if the levenshtein distance between them is 1. Find paths between words and count how many words are connected to a root word.

The "Solution"


grep "^[[:lower:]][[:lower:]]\+$" /usr/share/dict/words | sort -R | head -n 10000 > small_list.txt

Doing the full word list takes quite some time, but things are fairly fast with just 10000 words. The words file has a lot of proper nouns, and words with punctuation, so I filter all of those out. I also filter out one letter words, as they are only one distance away from a blank character.

CREATE TABLE base_words
(
word text NOT NULL,
CONSTRAINT "Primary Key" PRIMARY KEY (word)
);
COPY base_words FROM '.../small_list.txt';


Pretty basic here, just stuffing the words into a simple table.

CREATE TABLE edges (a text, b text);
CREATE index b_idx ON edges (b);
CREATE index a_idx ON edges (a);
INSERT INTO edges ( SELECT a.word, b.word FROM base_words a INNER JOIN base_words b ON a.word > b.word AND levenshtein(a.word, b.word) = 1 );


Here we create a list of edges based off of the levenshtein distance. You will need to install contrib/fuzzystrmatch.sql onto your db if you get the error "HINT: No function matches the given name and argument types. You might need to add explicit type casts"

CREATE VIEW all_edges (a,b) AS SELECT a,b FROM edges UNION ALL SELECT b,a FROM edges;

The second method is much cleaner if we have both directions of the edges.

CREATE VIEW brute AS (
WITH RECURSIVE w(a, b) AS
(
SELECT 'week', ''
UNION
SELECT CASE WHEN ( wi.a < wi.b) THEN wi.a ELSE wi.b END,
CASE WHEN ( wi.a < wi.b) THEN wi.b ELSE wi.a END
FROM w
JOIN edges wi
ON w.a = wi.a OR w.b = wi.a
OR w.a = wi.b OR w.b = wi.b
)
SELECT w.a, w.b FROM w
);


Here we recursively walk through the tree, starting with the word 'week', and store all of the links that we come across.

SELECT COUNT(*) FROM (SELECT a FROM brute UNION SELECT b FROM brute ) AS s3;

With my words, I have 440 connected words.

While cool, it would be nice to be able to see the path through the graph

CREATE VIEW path AS (
WITH RECURSIVE w(a, b, path) AS
(
SELECT '', 'week', ARRAY['week'] AS path
UNION ALL
SELECT wi.a , wi.b, w.path || wi.b
FROM w
JOIN all_edges wi
ON w.b = wi.a
AND wi.b <> ALL (w.path[2:array_upper(w.path,1)])
)
SELECT w.a, w.b, path FROM w
);


This isn't really the efficient way to do things. It covers every possible route away from 'week', and lists them. It won't get caught in a loop, but it will list out every possible path between each word that is connected to 'week'

Wednesday, July 8, 2009

Closure Trees

The code here is PHP and MySQL. It should be trivial to translate into other languages and databases.

I needed to store a tree in a database (mysql) the other day, and I stumbled across this: (It starts on slide 68)



The basic idea is to store paths to all ancestors and all descendants in a separate table. This makes a lot of actions very fast at the cost of space. Many of the basic modifications of the tree are covered in the slides, but some of them took me some thought to work out.

To represent the tree of (1,2(3,4(5))), we'll do:


INSERT INTO closure (ancestor, descendant) VALUES (1,1);
INSERT INTO closure (ancestor, descendant) VALUES (2,2);
INSERT INTO closure (ancestor, descendant) VALUES (1,2);
INSERT INTO closure (ancestor, descendant) VALUES (3,3);
INSERT INTO closure (ancestor, descendant) VALUES (2,3);
INSERT INTO closure (ancestor, descendant) VALUES (1,3);
INSERT INTO closure (ancestor, descendant) VALUES (4,4);
INSERT INTO closure (ancestor, descendant) VALUES (2,4);
INSERT INTO closure (ancestor, descendant) VALUES (1,4);
INSERT INTO closure (ancestor, descendant) VALUES (5,5);
INSERT INTO closure (ancestor, descendant) VALUES (4,5);
INSERT INTO closure (ancestor, descendant) VALUES (2,5);
INSERT INTO closure (ancestor, descendant) VALUES (1,5);


Building the Tree in Memory

If we want to build the tree in memory, first we grab a list of all of each node's ancestors:


SELECT
descendant AS id,
GROUP_CONCAT(ancestor) as ancestors
FROM
closure
GROUP BY (descendant);

+----+-----------+
| id | ancestors |
+----+-----------+
| 1 | 1 |
| 2 | 2,1 |
| 3 | 3,1,2 |
| 4 | 4,1,2 |
| 5 | 5,4,1,2 |
+----+-----------+


When we bring it into our language of choice, the first thing to do it sort the list by number of ancestors.
Top level nodes will only have themselves as an ancestor, and any node with two ancestors is dependant solely on a node with one ancestor. After we have sorted our ancestor list, we either create the first node with the first item in the list, or we create an empty node to start as $tree. I used an empty node, as I have many top level nodes. If you have only one node without ancestors other than itself, use it for the first node.

To traverse the tree, the next node will always be the intersection between the list of ancestors, and the available ancestors.

If there is only one ancestor left, it is to be a new child of whatever node you end up on.


function add_node($ancestors, &$tree) {
if (count($ancestors) == 1) {
$tree->add_child($ancestors);
return;
}
$next_node = array_intersect($ancestors, $tree->get_children_ids());
$this->add_node(
array_diff($ancestors, $next_node) ,
$tree->get_child_by_id(array_pop($next_node))
);
}

Moving a Segment

Moving a segment of the tree is probably the most painful task. It is divided into two parts.
1) Remove all the links from the node and its descendants to all of its ancestors
2) Insert all the links from the knode and its descendants to the new ancestors

So, if we want to move node 4 under node 3, first we:
delete these descendants:
SELECT descendant FROM closure WHERE ancestor=4;
from these ancestors:
SELECT ancestor FROM closure WHERE descendant=4 AND ancestor != 4;

With some help from FurnaceBoy on #MySQL, the query is:
DELETE q FROM closure
q NATURAL JOIN (
SELECT a.ancestor, d.descendant
FROM closure a
CROSS JOIN closure d
WHERE a.descendant=4 AND a.ancestor != 4 AND d.ancestor=4
) t;


Now, to insert the new links.

First we get the ancestors of the new node (3)
SELECT ancestor FROM closure WHERE descendant=3;
Then we get the descendants of the old node (4)
SELECT descendant FROM closure WHERE ancestor=4;
Then we cross join them.

SELECT a.ancestor, d.descendant
FROM closure a
CROSS JOIN closure d
WHERE a.descendant=2 AND d.ancestor=1;



INSERT INTO closure (ancestor, descendant)
SELECT a.ancestor, d.descendant
FROM closure a CROSS JOIN closure d
WHERE a.descendant=3 AND d.ancestor=4;