Olmar : Process C++ Programs in Ocaml
Olmar
connects Elsa, the Elkhound-based C/C++ parser and typechecker, with Ocaml. More precisely,
the Olmar extension can
translate Elsa's internal abstract syntax tree
into a value of an Ocaml variant type. This value can then be
further processed with a pure Ocaml program. I prefer to have
standalone Ocaml programs. Therefore I let Elsa
marshal the abstract syntax tree as an Ocaml value to disk.
However, it is also possible to link the Ocaml code into the Elsa
executable.
Distribution
In principle Olmar is a patch
for the astgen tool and for Elsa. In the future Olmar will
hopefully get integrated into the Elsa/Elkhound distribution. At
the moment Olmar is based on
the latest Elsa distribution 2005.08.22b.
For
simplicity I only distribute a complete
smbase/Ast/Elkhound/Elsa/Olmar system now. If
you want pure Elsa, please download it from the Elsa website.
Download / Compile / Use
System requirements
- C++ compiler
- Flex
- perl5
- (I believe some Yacc variant, like bison)
- and of course Ocaml
See also Elsa's requirements (under the point Download) and Elsa's
success/failure matrix.
(It appears to run on a 64-bit system. However, there are quite a
few warnings about casts between pointers and integers. I guess it
is pure luck that it passes the regression tests.)
Download Elsa+Olmar
Choose from the following alternatives:
Configure
configure -no-dash-O2
Leave out the -no-dash-O2 option if you want to
compile the C++ code with -O2. You can use the
environment variables CC and CXX to set the
C and C++ compiler, respectively.
The whole thing consists of five packages/subdirectories: smbase,
ast, elkhound, elsa, and asttools. The configure script and the
makefile of the base directory simply start the appropriate
action in each subdirectory. configure --help in the
base directory will therefore give you the help text of all the
configure scripts in the subdirectories.
Compile
make
This creates the C++ parser elsa/ccparse, the AST
Graph utility asttools/ast_graph, and the Olmar example
application asttools/count-ast.
Try it
- preprocess the C++ sources (Elsa does not include a preprocessor):
g++ -E -o crc.ii smbase/crc.cpp
- run elsa on it and marshal the abstract syntax tree into
crc.oast:
elsa/ccparse -oc crc.oast crc.ii
- use AST Graph to generate a dot graph description
asttools/ast_graph -o crc.dot crc.oast
- View the graph, using one of the following lines
zgrviewer crc.dot
dotty crc.dot
dot -Tps crc.dot -o crc.ps; gv crc.ps
- or generate a png from it (dot -Tpng crc.dot -o crc.png)
New features in the elsa parser ccparse
In ccparse the option -oc <file> will
activate Olmar and set the
filename to write the abstract syntax tree to. Alternatively one
can use the tracing option marshalToOcaml (add
-tr marshalToOcaml). Then ccparse will derive the
filename for the abstract syntax tree itself
(input-file.oast).
Olmar's contribution: ast_graph, visualizing C++ syntax trees
At the moment the asttools subdirectory in the
distribution contains only one useful tool:
Ast graph.
Ast graph
generates the abstract syntax tree in the dot
language. One can then use the tools from the graphviz package to
visualize the syntax tree.
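To give an idea of what "generating the tree in the dot language" amounts to, here is a minimal sketch. It assumes a toy tree type with integer ids and string labels; the type, the function names and the labels are made up for illustration and are not ast_graph's actual code.

```ocaml
(* Toy tree: a node id, a label, and the children.
   Hypothetical; the real Olmar node types are much richer. *)
type tree = Node of int * string * tree list

(* Emit one dot node statement per tree node and one edge per child link.
   Labels are assumed not to contain double quotes. *)
let rec emit_dot buf (Node (id, label, children)) =
  Printf.bprintf buf "  n%d [label=\"%s\"];\n" id label;
  List.iter
    (fun (Node (cid, _, _) as child) ->
      Printf.bprintf buf "  n%d -> n%d;\n" id cid;
      emit_dot buf child)
    children

(* Wrap the node and edge statements into a complete digraph. *)
let to_dot tree =
  let buf = Buffer.create 256 in
  Buffer.add_string buf "digraph ast {\n";
  emit_dot buf tree;
  Buffer.add_string buf "}\n";
  Buffer.contents buf

let example = Node (1, "root", [ Node (2, "left", []); Node (3, "right", []) ])

let () = print_string (to_dot example)
```

The printed text can be saved to a file and passed to dot or one of the viewers like any other dot description.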
Usage
- preprocess the source code: g++ -E -o file.ii file.cc
- run elsa on it: ccparse -oc file.oast file.ii
(add -tr c_lang for C files)
- generate dot: ast_graph file.oast -o file.dot
- See below for viewing options
Normal C++ files tend to have abstract syntax trees
with 10,000 to 1,000,000 nodes. Including iostream
alone gives more than 150,000 nodes. Most of the graphics
software I tried fails on the sheer size of these graphs. To
visualize the tree I have found the following possibilities:
- zgrviewer
- zgrviewer can display dot files directly (relying on a dot
background job). It has nice zooming and scrolling functions.
It's a pity that Java already runs out of memory on graphs with
10,000 nodes.
- convert to postscript (dot -Tps) and use gv
- Works for huge graphs. Scrolling in gv is ok, zooming is
relatively poor. gv seems to allocate a pixmap in the X server.
For huge graphs one therefore has to limit the bounding box using
ast_graph's -size option (or by putting a
suitable size attribute into the dot file).
- convert to xfig (dot -Tfig) and use xfig
- (xfig worked great for me until last week, then some Debian
etch update broke it.) xfig is the fastest of the alternatives.
Zooming is relatively good in xfig,
scrolling a bit poor. The display is cluttered with all sorts of
handles (because xfig assumes you want to change the graph).
Technology
The goal of Olmar is to make
the abstract syntax tree of a C or C++ program available as an
Ocaml variant type, such that one can use pattern matching to
process C and C++ programs.
Elsa can output its internal abstract syntax tree in XML, or
(mainly for debugging purposes) in plain ASCII. In principle
one could read the XML into Ocaml, for instance with PXP. PXP reads XML into an Ocaml object hierarchy.
As far as I know, there is, however, no simple way to translate
XML into an Ocaml variant type. With PXP one could either write a
pull parser or a visitor on the Ocaml object tree. Both
approaches are a kind of high-level XML parsing that requires some
form of typechecking of the XML and a lot of error-handling code. I did not
want to write this kind of XML typechecking code. Therefore Olmar uses a
completely different approach.
Olmar simply
adds a method toOcaml to each class in Elsa's abstract
syntax tree. This method traverses the syntax tree, thereby
reconstructing it in Ocaml. At the end the Ocaml value is
marshaled into a file. Elsa is linked with some Ocaml code, the
Ocaml runtime and some C++ glue code. (In reality the whole story
is slightly more complicated, because Elsa's AST can be circular
and because C++ pointers might be NULL. Anyway ...)
Elsa's internal abstract syntax tree falls into two parts. About 35
different node types (about 150 classes) describe the C++ syntax.
Elsa's type checker adds some more types of nodes to describe C++
types in a syntax-independent way. A node type might be split
into several subtypes (very similar to Ocaml variants). The node
type for C++ expressions, for instance, is modelled with 36
classes, one for each kind of expression. In Ocaml such node
types are of course modelled with a variant type. Elsa's abstract
syntax tree also contains unstructured node types (i.e., without
subtypes). In Ocaml those nodes are represented as a tuple or a
record.
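The mapping described above can be pictured with a small sketch. All type and constructor names below are hypothetical and far smaller than the generated types in elsa/cc_ast_gen_type.ml; the point is only that a node type with subclasses becomes a variant, a node type without subclasses becomes a record, and both are then processed with pattern matching.

```ocaml
(* Hypothetical miniature of a structured node type:
   one constructor per C++ subclass. *)
type expression =
  | IntLit of int
  | Variable of string
  | Binary of expression * string * expression

(* Hypothetical unstructured node type: a single class becomes a record. *)
type declaration = { name : string; init : expression option }

(* Pattern matching over the variant: count variable occurrences. *)
let rec count_variables = function
  | IntLit _ -> 0
  | Variable _ -> 1
  | Binary (left, _, right) -> count_variables left + count_variables right

(* A NULL initializer on the C++ side is pictured here as None. *)
let variables_in_decl decl =
  match decl.init with
  | None -> 0
  | Some e -> count_variables e

(* Corresponds to something like: x = a + b * 2 *)
let decl =
  { name = "x";
    init = Some (Binary (Variable "a", "+",
                         Binary (Variable "b", "*", IntLit 2))) }
```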
I wanted to keep Olmar mostly
independent from the encoding of variant constructors in Ocaml.
Therefore, I register an Ocaml callback function for each variant
constructor and each tuple type. The C++ code calls these
callbacks in order to construct Ocaml values (instead of
allocating memory itself and filling it). Only list and option
values are created directly in C++. For now I prefer this
hopefully less error-prone approach over more efficient code.
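The Ocaml side of such a callback scheme could look roughly like the following sketch. Callback.register is the standard library function for making an Ocaml value available to C or C++ under a name; the toy type and the names create_IntLit and create_Negate are assumptions for illustration and are not the names astgen generates.

```ocaml
(* Hypothetical miniature variant type. *)
type expression =
  | IntLit of int
  | Negate of expression

(* One constructor function per variant constructor. The C++ side never
   needs to know how Ocaml lays out these constructors in memory. *)
let create_IntLit (i : int) : expression = IntLit i
let create_Negate (e : expression) : expression = Negate e

(* Make the constructor functions available to foreign code by name. *)
let register_callbacks () =
  Callback.register "create_IntLit" create_IntLit;
  Callback.register "create_Negate" create_Negate

let () = register_callbacks ()
```

With the standard Ocaml C interface, foreign code would typically look such a closure up with caml_named_value and invoke it with caml_callback. Whether Olmar's glue code does exactly this is not stated here.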
The code for the 35 syntax node types is generated automatically
from an ast description file. Therefore, to add the
toOcaml method to these syntax classes one only needs
to patch astgen. With Olmar astgen
additionally generates an Ocaml type definition and Ocaml code
for the above-mentioned callback functions. Finally astgen also
generates the toOcaml method in C++.
The syntax tree nodes for Elsa's typechecker are, unfortunately,
not generated from ast descriptions. I had to write all the
necessary Ocaml and C++ code myself. In the end this turned out
to be much more work than improving astgen...
Using Olmar
You can use Olmar in two ways:
- write a standalone Ocaml program that unmarshals the abstract
syntax tree from the disk
- link additional modules into the Elsa parser ccparse and
arrange for calls from ccparse.
In asttools/count-ast.ml you find a
very simple Olmar example
application.
Abstract syntax tree type definition
The type definition is in the following files
- elsa/cc_ast_gen_type.ml
- contains the type definition of all ast nodes
- elsa/cc_ml_types.ml
- flag types used in the syntax nodes
- elsa/ml_ctype.ml
- flag types used in type nodes
- elsa/ast_annotation.mli
- ast annotations (see below)
The whole abstract syntax tree that is marshaled from ccparse has type
annotated translationUnit_type = annotated * (annotated topForm_type list)
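A standalone program reads such a value back with the Marshal module of the standard library. The sketch below is only a round trip on a toy type whose shape imitates the pair above. The type names are hypothetical, and it assumes the file contains nothing but the marshaled value, which may not hold for the .oast files written by ccparse (see asttools/count-ast.ml for the real thing).

```ocaml
(* Hypothetical stand-in for the real tree type: an annotation paired
   with a list of top-level forms, each carrying an annotation. *)
type 'a top_form = Decl of 'a * string | Func of 'a * string
type 'a translation_unit = 'a * 'a top_form list

(* Marshal the tree into a file. The default flags preserve sharing. *)
let write_tree (file : string) (tree : int translation_unit) : unit =
  let oc = open_out_bin file in
  Marshal.to_channel oc tree [];
  close_out oc

(* Unmarshal the tree. The type annotation is essential: Marshal cannot
   check that the file really contains a value of this type. *)
let read_tree (file : string) : int translation_unit =
  let ic = open_in_bin file in
  let tree = (Marshal.from_channel ic : int translation_unit) in
  close_in ic;
  tree

(* Write a small tree to a temporary file and read it back. *)
let round_trip () : bool =
  let file = Filename.temp_file "olmar_sketch" ".oast" in
  let tree = (0, [ Decl (1, "x"); Func (2, "main") ]) in
  write_tree file tree;
  let tree' = read_tree file in
  Sys.remove file;
  tree' = tree
```

Because unmarshaling is not type safe, the reading program has to be compiled against exactly the type definition that was used for writing.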
Ast annotations
The whole abstract syntax tree is polymorphic in one type
parameter, which is a placeholder for user defined annotations.
Every node of the syntax tree carries a (unique) slot of this
annotation type. Annotations are meant for client use. One can
easily define a new annotation type and use it to store client
data in it.
ccparse generates the abstract syntax tree with annotations of
type annotated (see
elsa/ast_annotation.mli). Every node is guaranteed to
contain a unique annotation value. The annotation carries a
(positive) integer (accessible via
id_annotation) that uniquely identifies the syntax
tree node of this annotation. In addition the annotation contains
the address of the C++ object from which the Ocaml value was
generated.
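The idea of a tree that is polymorphic in its annotation can be sketched as follows. Everything here is a hypothetical miniature: toy_annotation stands in for the real annotated type (with the C++ address pictured as a plain int), and client_annotation is an example of a user-defined annotation.

```ocaml
(* Hypothetical stand-in for the annotation produced by ccparse:
   a unique positive id plus the address of the C++ object. *)
type toy_annotation = { id : int; addr : int }

(* Every node carries one slot of the annotation type 'a. *)
type 'a expr =
  | Lit of 'a * int
  | Add of 'a * 'a expr * 'a expr

(* Access the annotation, which is present in every variant. *)
let annotation = function
  | Lit (a, _) | Add (a, _, _) -> a

(* Rebuild the tree with a different annotation type. *)
let rec map_annotation f = function
  | Lit (a, i) -> Lit (f a, i)
  | Add (a, l, r) -> Add (f a, map_annotation f l, map_annotation f r)

(* Client-defined annotation: keep the id, add a mutable counter. *)
type client_annotation = { node_id : int; mutable visits : int }

let tree =
  Add ({ id = 1; addr = 0 },
       Lit ({ id = 2; addr = 0 }, 4),
       Lit ({ id = 3; addr = 0 }, 5))

let client_tree =
  map_annotation (fun a -> { node_id = a.id; visits = 0 }) tree
```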
Complications
The abstract syntax tree is circular. A naive iteration over the
tree will therefore in general not terminate. Currently there are
three fields that might make the tree circular:
- var_type in type variable
- funcDefn in type variable
- self_type in compound_info
All these fields have an option ref type, which hopefully serves
as a hint. An iteration over the abstract syntax tree will
terminate if you do not recurse into these three fields. However,
there might be some tree nodes that are only reachable via one of
these fields.
The Olmar example count-ast.ml shows how to use
annotations and dense sets to traverse all nodes in a syntax
tree.
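The general technique is to remember the ids of the nodes already seen and to stop when a node is met a second time. The following sketch shows it on a hypothetical node type with one option ref field. It uses a Hashtbl as the visited set instead of the dense sets from the distribution, and it is not the code of count-ast.ml.

```ocaml
(* Hypothetical miniature: a node with a unique id, ordinary children,
   and an option ref field that can close a cycle. *)
type node = {
  id : int;
  children : node list;
  back_link : node option ref;
}

(* Count all reachable nodes, following the option ref field too.
   The visited table guarantees termination on circular trees. *)
let count_nodes (root : node) : int =
  let visited = Hashtbl.create 64 in
  let rec visit n =
    if not (Hashtbl.mem visited n.id) then begin
      Hashtbl.add visited n.id ();
      List.iter visit n.children;
      match !(n.back_link) with
      | None -> ()
      | Some target -> visit target
    end
  in
  visit root;
  Hashtbl.length visited

(* root -> leaf -> root is a cycle; hidden is reachable only through
   the option ref field of root. *)
let example =
  let leaf_link = ref None in
  let hidden = { id = 3; children = []; back_link = ref None } in
  let leaf = { id = 2; children = []; back_link = leaf_link } in
  let root = { id = 1; children = [ leaf ]; back_link = ref (Some hidden) } in
  leaf_link := Some root;
  root
```

Note that structural equality and polymorphic comparison must not be applied to such circular values, since they may not terminate; comparing the ids is safe.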
Utilities
- Dense sets of positive integers
- The interface is a subset of the module Set.S of Ocaml's
standard library (of course with type elt = int). Internally
it uses an array of strings as bitmap.
- Syntax tree utilities
- Contains functions to access fields that are present in each
variant of a given node type. For instance for annotations and
source locations.
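A bitmap-based set of non-negative integers can be sketched as below. This is a simplified assumption, not the module from the distribution: it uses a single growable Bytes buffer instead of an array of strings, and add mutates the set in place instead of returning a new set as in Set.S.

```ocaml
(* One bit per integer: bit (x land 7) of byte (x lsr 3). *)
type t = { mutable bits : Bytes.t }

let create () : t = { bits = Bytes.make 16 '\000' }

(* Grow the bitmap so that byte_index is a valid position. *)
let ensure (s : t) (byte_index : int) : unit =
  let len = Bytes.length s.bits in
  if byte_index >= len then begin
    let new_len = max (byte_index + 1) (2 * len) in
    let bigger = Bytes.make new_len '\000' in
    Bytes.blit s.bits 0 bigger 0 len;
    s.bits <- bigger
  end

(* Argument order follows Set.S, but the set is updated in place. *)
let add (x : int) (s : t) : unit =
  if x < 0 then invalid_arg "add: negative element";
  let byte_index = x lsr 3 and mask = 1 lsl (x land 7) in
  ensure s byte_index;
  let old = Char.code (Bytes.get s.bits byte_index) in
  Bytes.set s.bits byte_index (Char.chr (old lor mask))

(* Elements beyond the current bitmap size are simply absent. *)
let mem (x : int) (s : t) : bool =
  x >= 0
  && (let byte_index = x lsr 3 in
      byte_index < Bytes.length s.bits
      && Char.code (Bytes.get s.bits byte_index) land (1 lsl (x land 7)) <> 0)
```

Since the node ids in the annotations are unique positive integers, a set like this needs only one bit per node, which matters for trees with hundreds of thousands of nodes.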
Problems/Questions/Suggestions
Feel free to contact me at tews@cs.ru.nl with anything that
is Olmar or
Elsa related.
Known problems
- missing type links
- In Ocaml one currently cannot access the type information of an
identifier or an expression, although this seems possible in C++.
- missing node types
- Some syntax node classes do not seem to occur in any of Elsa's regression
test programs. They are BaseClassSubobj and DependentQType. The
toOcaml method of these classes currently just
contains an assert(false). I am grateful for any
example program that triggers these assertions.
last changed on
7 Sep 2006
by Hendrik