Over the last 3 months (it's now late July, 2011) I've been re-writing Graph::Easy, for various reasons:
The last point only really became obvious when I examined the original code. I couldn't follow it! But more on that later.
It must be admitted there's a lot of code in there which I have not tried to include in my re-write, Graph::Easy::Marpa, particularly to do with the built-in formatter.
Graph::Easy is on CPAN, but also has its own extensive manual at Graph::Easy manual. This manual is actually for the little language invented by the original author to make it extremely simple to specify a graph's elements.
Here's how to reproduce the graph on that manual page, the syntax being for Graph::Easy::Marpa, which is often just the same as the syntax for Graph::Easy.
Call this file x.raw:
node {shape: box; style: filled}
[Teamwork] {fillcolor: yellow}
-> {label: is the key to}
[Victory] {fillcolor: red}
Here, node {...} overrides some default attributes for the nodes, '[]' delimits node names, and specific attributes follow the nodes and the edge '->'.
And the resultant graph:
Graph::Easy::Marpa ships with gem.pl, so let's use that to create the graph:
perl scripts/gem.pl -f svg -i x.raw -o $DR/x.svg -rankdir LR
Here, -f is for the output format, -i is for the input file, -o is for the output file, and -rankdir means output must be left-to-right, overriding the default of top-to-bottom. $DR is my web server's doc root.
The default rendering engine is Graph::Easy::Marpa::Renderer::GraphViz2 (note the upper case 'v').
The script 'graph-easy', which ships with Graph::Easy, won't process this input file, because it does not accept all the options of the underlying package Graphviz (note the lower case 'v'), or it renames them for some reason.
Graph::Easy::Marpa uses exactly the same names as the options for AT&T's Graphviz, even though you can write your own rendering engine to plug into Graph::Easy::Marpa.
If fact, Graphviz is actually a suite of programs, but the little language AT&T defined as their input format is called 'DOT', and the program you're most likely to run is called 'dot'.
So, hereafter, 'dot' will mean both the 'DOT' language and any of the programs in that suite. This is the convention I've adopted for both GraphViz2 and Graph::Easy::Marpa.
After the shock of reading the source of some Graph::Easy modules, I decided to re-write it from scratch. That just means I abandoned the plan of in situ patches to Graph::Easy.
Now the question becomes: How do I tackle this problem?
I decided to focus on the output phase, rather than processing the little language for the graph definition. The reason for this choice was that I was already using the Perl module GraphViz (note the capital 'V') to render database schema, and the fact that one of the output formats supported by Graph::Easy was GraphViz told me I should look into this first.
In fact, Graph::Easy can output a 'dot' file which can be input directly to Graphviz, with the latter being the rendering engine underlying GraphViz.
Léon Brocard, the author of GraphViz, gave me co-maint of his module when I offered to update it, to ensure it supported the latest options offered by Graphviz.
However, inspection of GraphViz revealed it hard-coded the set of options it accepted for passing on to 'dot'. So I decided to re-write it, too, from scratch.
This process of recursively re-writing a module's pre-requisites is called 'yak shaving', a phrase whose emtymology I'm not sure I want to know.
Firstly, I wrote 4 little scripts to download parts of the docs for Graphviz, and to extract the precise lists of options. These are now included in the source code of GraphViz2, under the __DATA__ token. They are read in using the beautiful simplicity of Data::Section::Simple.
The rest of the re-write of GraphViz2 was labourious but not difficult. Sample output can be seen here.
The name GraphViz2 rather than GraphViz was chosen so people installing GraphViz2 would not find their pre-existing scripts which used GraphViz suddenly broken.
OK, so now I had a means of communicating from the unfinished Graph::Easy::Marpa::Parser to AT&T's Graphviz, via Graph::Easy::Marpa::Renderer::GraphViz2.
Also, this meant I could, by design, test all of these separately:
Since I wasn't parsing (really, lexing) the graph definition language from Graph::Easy, I needed some form of input to Graph::Easy::Marpa::Parser. For that, I simply invented my own little language, and decided that to start with I'd use a CVS file to talk from the non-existant lexer to the parser. In other words, I was going to rig the input to the parser (and, later, the lexer) by hand-writing the samples, until I got both the lexer and the parser working.
I am aware that in practice, a lexer is often rolled in with a parser, but the complexity of each here, and the step-wise development of each, made thinking and working with 2 separate modules the natural way to do it. Besides, I'd never used Marpa before, and didn't want the complexity of learning that to cloud the issue.
So the plan is (making it look like a graph :-):
x.raw (graph definition language file) -> lexer -> x.cooked (intermediary language, discussed next) -> parser -> rendering engine (GraphViz2) -> Graphviz (AT&T)
So if we take the graph definition language sample from above (x.raw):
node {shape: box; style: filled}
[Teamwork] {fillcolor: yellow}
-> {label: is the key to}
[Victory] {fillcolor: red}
and feed it thru the yet-to-be-written lexer, we get this input to the parser (x:cooked):
"key","value"
class , 'node'
left_brace , {
class_attribute_name_id , 'style'
colon , :
attribute_value_id , 'filled'
semi_colon , ;
right_brace , }
left_brace , {
class_attribute_name_id , 'shape'
colon , :
attribute_value_id , 'box'
semi_colon , ;
right_brace , }
left_bracket , [
node_id , 'Teamwork'
right_bracket , ]
left_brace , {
attribute_name_id , 'fillcolor'
colon , :
attribute_value_id , 'yellow'
semi_colon , ;
right_brace , }
edge_id , '->'
left_brace , {
attribute_name_id , 'label'
colon , :
attribute_value_id , 'is the key to'
semi_colon , ;
right_brace , }
left_bracket , [
node_id , 'Victory'
right_bracket , ]
left_brace , {
attribute_name_id , 'fillcolor'
colon , :
attribute_value_id , 'red'
semi_colon , ;
right_brace , }
The raw version is 117 bytes, and the cooked version is 1177 bytes, but we don't care.
And no, there is absolutely no need for the leading spaces before the commas. But, I say, we don't care.
What we care about is the structure, and in particular the language determining that structure, x.cooked, and even more, the 'grammar' describing that language.
To repeat, writing the parser requires fabricating x.cooked by hand.
Given that the language in x.cooked is so regular, we can guarantee that it's possible to define a grammar which in turn precisely defines which statements are valid within that language, and which aren't.
Enter Marpa.
You can see a small, early, part of the Marpa grammar graphed with GraphViz2 here.
It looks complex because it is. Nevertheless, don't despair. It's a learning exercise for everyone who ventures into the field of parsing.
You might also notice something strange: Although our grammar is written as though x.raw is the input, in reality x.cooked is what is fed into the parser. I didn't find this a problem, but some people might. Of course, x.raw and x.cooked are just 2 representations of the same information.
After many trial runs, and with the help of Jeffrey Kegler, the author of Marpa, I did get the grammar completed.
This means that in the process outlined above:
x.raw (graph definition language) -> lexer -> x.cooked (intermediary language, discussed next) -> parser -> rendering engine (GraphViz2) -> Graphviz (AT&T)
we've dealt with the last 4 lines, by, roughly, working backwards.
Here, we process the input stream a character at a time, and react according to the current context and the new character. For that we need a Discrete Finite Automaton (DFA), often called a Finite State Machine (FSM).
There are several candidates on CPAN, and they're discussed in "See Also" in Set::FA, so I won't repeat any of that here.
The best candidate, IMHO, is Set::FA, so I emailed the author and offered to take over maintenance of it, i.e. to give it some TLC (Tender Loving Care). As an aside: TLC is actually the name of a Dry Cleaning business near where I live.
Specifically, I chose Set::FA::Element, since Set::FA is more to do with sets of DFAs. Presumably the original idea was that you'd fabricate a set of DFAs, and feed the same input into each of them to see how they responded.
Anyway, the problem now is to define the State Transition Table (STT) which must be supplied to the constructor of Set::FA::Element. This takes weeks, since I had to examine the manual for Graph::Easy, and samples, to determine what exactly was permissible within Graph::Easy's little language. To make things worse, several examples I cut-and-pasted from the manual simply did not work. Sigh.
As you can image, along the way I had to confront the problem we all face when re-writing a module: Do I write it in such a way as to preserve exactly the same behaviour as the original code? You'll recognize this as the 'backward-compatibility dilemma'.
I soon decided, no, I wouldn't do that. So, how do I justify such a decision? By balancing the constraints, after giving a suitable weighting to each:
Let's take them one at a time.
I have mixed feelings on this issue. For example, choosing a standard (protocol) such as CGI, TCP, etc, give us constraints, but simultaneouly allow us to treat the chosen thing as a type of building block. This means we become dependent on it by relying on it. With famous standards this is an easy decision. With a single piece of software, less so.
For Graph::Easy's little language, I wanted to preserve as much as was reasonable, since the more of it that is supported, the less pain users will feel in switching to my code.
However, I did not want to develop code under some sort of compulsion that that little language was inviolable. This is actually my long-standing policy: If I feel that some problems can be rectified, then a time such as this is when I must take the opportunity to fix them.
Clearly, what items exactly are chosen as not-to-be-supported is often subjective. Hence the next point.
One thing that worried me about Graph::Easy was that it accepted edge labels embedded with the label itself, as distinct from (and as well as) being specified in the list of attributes at the end of the edge, as is done with nodes etc. This makes it difficult to parse, but not impossible. This latter fact is what swung me into not supporting in situ labels, at least not yet.
If the demand is there, the STT can be extended to switch into label - i.e. attribute - mode within an edge, based on the convention that the 1st non-blank character after the label is recognized as a resumption of the 'name' of the edge. With the STT approach, this is simply (I hope) a matter of tweaking the definition within the STT of the edge and attribute characteristics.
So, given that that syntax is supportable one day, I decided that the benefit of getting the basic software published sooner overrode the considerations of the time it would take to accept that style, and the extra, pre-existing, graph definition files which will not now be acceptable without a bit of editing.
The other appeal of the STT is in the dual nature of STTs - they are amazingly flexible and yet very rigid in what they recognize. For a control freak, the rigidity would be paramount, but for me the rigidity simply makes my life easier, much easier. Any quirky constructs in the original syntax can just be ruled out with an STT.
There is still the disappointment of potential users, but it does mean the resultant software, and the specification of what exactly is acceptable syntax, is much simpler than it would be otherwise.
This ties in with the comments just above about the dual nature of the beast. In opting for a mechanism which has rigidity built in to it, I had to accept that the benefits with accrue could not just be for me. The DFA-based solution had to provide benefits to two other groups of users: People who maintain the code, and those who run the code.
Here again the dual nature is evident. The flexibility means the time saved from not handling quirky syntax can be spent more profitably in the code as a whole, for the benefit (mostly) of the user, while keeping the code clean for the maintainer. And, simultanelously, the constraints on the user due to the rigidity mean some effort in data cleaning is forced on to the user.
Even though Graph::Easy is sophisticated, it's still limited in what it can accept, and hence what it can pass on to Graphviz, so this re-write is also an opportunity to ensure the new code accepts all options available under the most recent version of Graphviz, which is currently V 2.26.3 (July, 2011).
The user will never see the representation of the graph in RAM, although there is provision for producing a report of the way the items in the original graph are known to the code.
So, I'm now free to use any representation I like. And in this case I chose an array, because the linear nature of the graph, since it's written and parsed in a linear text format, matches the linear nature of arrays.
It's just a matter of saying: The elements of the array will each hold a representation of an element of the input stream, with each array element being a hashref, so the values of the input items can be stored along with some text to identify what the lexer and parser recognise those input items as being.
The structure is detailed in "FAQ" in Graph::Easy::Marpa::Lexer, but in brief, here's a node and its attribute:
{
count => $n,
name => 'Name',
type => 'node',
value => '',
}
{
count => $n,
name => 'color',
type => 'attribute',
value => 'red',
}
Note: Some values of $n are used internally, but the final array has all items renumbered from 1 up.
These would be 2 sequential array elements, using the convention that attributes always immediately follow the edge, node, etc, to which they belong.
This also applies to global things of various kinds, whose attributes can be applied to the graph itself (directed, undirected), and to nodes, say, which could have the effect of changing the default shape or whatever of all nodes.
Naturally an attribute attached to an edge or node must override the default or explicit global specification of the items' attributes.
The result is very straight-forward, and can be interpreted by any rendering engine without imposing arbitrary constraints on what that engine can do, precisely because all available and necessary information is provided in an agnostic way.
I've only provided one default rendering engine for handling the final output, via GraphViz2. Anyone pining for text or XML output will have to write their own, unless I can be persuaded to do it.
For text output, you might consider trying Layout::Manager. I've never used it, but there are not going to be many choices on MetaCPAN.
And, as mentioned, various syntactic constructs available under Graph::Easy are not supported by Graph::Easy::Marpa.
I've said enough. I'll just finish with a list of the modules in the package:
Calls the lexer and the parser.
Calls the DFA.
Lexes the input stream.
Uses Marpa to process the output of the lexer.
The default rendering engine.
For internal use only.
Demo output (from test code, so not designed as a showcase of capabilities):
http://savage.net.au/Perl-modules/html/graph.easy.marpa/index.html.
http://savage.net.au/Perl-modules/html/graphviz2/index.html.