Olmar : Process C++ Programs in Ocaml
Olmar
connects Elsa, the Elkhound based C/C++ parser and type checker, with Ocaml. More precisely,
the Olmar extension can
translate Elsa's internal abstract syntax tree
into a value of an Ocaml variant type. This value can then be
further processed with a pure Ocaml program. I prefer to have
stand alone Ocaml programs. Therefore I let Elsa
marshal the abstract syntax tree as an Ocaml value to disk.
However, it is also possible to link the Ocaml code into the Elsa
executable.
News
- 30 Oct 2006: Version 0.2 released, changes:
-
- Serialize all xml annotated fields to Ocaml. Trees are much
bigger now, for instance every expression contains a field with
its (elsa derived) type.
Distribution
In principle Olmar is a patch
for the astgen tool and for Elsa. In the future Olmar will
hopefully get integrated into the Elsa/Elkhound distribution. At
the moment Olmar is based on
the latest Elsa distribution 2005.08.22b. After this release work
will start to adopt olmar to the latest Elsa versions available
from Oink.
For
simplicity I only distribute a complete
smbase/Ast/Elkhound/Elsa/Olmar system now. If
you want to have pure Elsa, please download it from the Elsa
website or (for something newer) from the Oink-Stack
repository.
Download / Compile / Use
System requirements
- C++ compiler
- Flex
- perl5
- (I believe some Yacc variant, like bison)
- and of course Ocaml
See also Elsa's requirements (under point Download) and Elsa's
success/failure matrix
(It appears to run on a 64bit system. However, there are quite a
few warning about casts between pointer and integer. I guess it
is pure luck that it passes the regression tests.)
Download Elsa+Olmar
choose from the following alternative:
Configure
configure -no-dash-O2
Leave out the -no-dash-O2 option if you want to
compile the C++ code with -O2. You can use the
environment variables CC and CXX to set the
C and C++ compiler, respectively.
The whole thing consists of five packages/subdirectories: smbase,
ast, elkhound, elsa, and asttools. The configure script and the
makefile of the base directory simply start the appropriate
action in each subdirectory. configure --help in the
base directory will therefore give you the help text of all the
configure scripts in the subdirectories.
Compile
make
this will create the C++ parser elsa/ccparse, the AST
Graph utility asttools/ast_graph, and the Olmar example
application asttool/count-ast.
Try it
- preprocess the C++ sources (elsa does not include a preprocessor):
g++ -E -o crc.ii smbase/crc.cpp
- run elsa on it and marshal the abstract syntax tree into
crc.oast:
elsa/ccparse -oc crc.oast crc.ii
- use AST Graph to generate a dot graph description
asttools/ast_graph -o crc.dot crc.oast
- View the graph, using one of the following lines
zgrviewer crc.dot
dotty crc.dot
dot -Tps crc.dot -o crc.ps; gv crc.ps
- or generate a png from it (dot -Tpng):
New features in the elsa parser ccparse
In ccparse the option -oc <file> will
activate Olmar and set the
filename to write the abstract syntax tree to. Alternatively one
can use the tracing option marshalToOcaml (add
-tr marshalToOcaml). Then ccparse will derive the
filename for the abstract syntax tree itself
(input-file.oast).
Olmar's contribution: ast_graph, visualising C++ syntax trees
At the moment the asttools subdirectory in the
distribution contains only one useful tool:
Ast graph.
Ast graph
generates the abstract syntax tree in the dot
language. One can then use the tools from the graphviz package to
visualise the syntax tree.
Usage
- preprocess the source code: g++ -E -o file.ii file.cc
- run elsa on it: ccparse -oc file.oast file.ii
(add -tr c_lang for C files)
- generate dot: ast_graph file.oast -o file.dot
- See below for viewing options
Normal C++ files tend to have abstract syntax trees
with 10.000 to 1.000.000 nodes. Including iostream
alone gives almost 250.000 nodes. Most of the graphics
software I tried fails on the sheer size of these graphs. To
visualise the tree I have found the following possibilities:
- zgrviewer
- zgrviewer can display dot files directly (relying on a dot
background job). It has nice zooming and scrolling functions.
It's a pity that Java runs out of memory on graphs with 10.000
nodes already.
- convert to postscript (dot -Tps) and use gv
- Works for huge graphs. Scrolling in gv is ok, zooming
relatively poor. gv seems to allocate a pixmap in the X server.
For huge graphs one has therefore to limit the bounding box using
ast_graph's -size option (or putting a
suitable size attribute into the dot file).
- convert to xfig (dot -Tfig) and use xfig
- xfig is the fastest of the alternatives.
Zooming is relatively good in xfig,
scrolling a bit poor. The display is cluttered with all sorts of
handles (because xfig assumes you want to change the graph).
(xfig worked great for me until some debian etch update pleased me
the folowing bugs #373005, #387296, #387298)
Technology
The goal of Olmar is to make
the abstract syntax tree of a C or C++ program available as an
Ocaml variant type, such that one can use pattern matching to
process C and C++ programs.
Elsa can output its internal abstract syntax tree in XML, or
(mainly for debugging purposes) in plain ASCII. In principle
one could read the XML into Ocaml, for instance with PXP. PXP reads XML into an Ocaml object hierarchy.
As far as I know, there is, however, no simple way to translate
XML into an Ocaml variant type. With PXP one could either write a
pull parser or a visitor on the Ocaml object tree. Both
approaches are a kind of high-level XML parsing that require some
form of type checking the XML and a lot of error code. I did not
want to write this kind of XML type-checking code. Therefore Olmar uses a
completely different approach.
Olmar simply
adds a method toOcaml to each class in Elsa's abstract
syntax tree. This method traverses the syntax tree, thereby
reconstructing it in Ocaml. At the end the Ocaml value is
marshaled into a file. Elsa is linked with some Ocaml code, the
Ocaml runtime and some C++ glue code. (In reality the whole story
is slightly more complicated, because Elsa's abstract syntax tree can be circular
and because C++ pointers might be NULL. Anyway ...)
Elsa internal abstract syntax tree falls into two parts. About 35
different node types (about 150 classes) describe the C++ syntax.
Elsa's type checker adds some more types of nodes to describe C++
types in a syntax independent way. A node type might be split
into several subtypes (very similar to Ocaml variants). The node
type for C++ expressions, for instance, is modelled with 36
classes, for each kind of expression one. In Ocaml such node
types are of course modelled with a variant type. Elsa's abstract
syntax tree contains also unstructured node types (i.e., without
subtypes). In Ocaml those nodes are represented as a tuple or a
record.
I wanted to keep Olmar mostly
independent from the encoding of variant constructors in Ocaml.
Therefore, I register an Ocaml call-back function for each variant
constructor and each tuple type. The C++ code calls these
call-backs in order to construct Ocaml values (instead of
allocating memory itself and filling it). Only list and option
values are created directly in C++. For now I prefer this
hopefully less error prone variant over more efficient code.
The code for the 35 syntax node types is generated automatically
from an ast description file. Therefore, to add the
toOcaml method to these syntax classes one only needs
to patch astgen. With Olmar astgen
additionally generates an Ocaml type definition and Ocaml code
for the above mentioned call-back functions. Finally astgen also
generates the toOcaml method in C++.
The syntax tree nodes for Elsa's type checker are, unfortunately,
not generated from ast descriptions. I had to write all the
necessary Ocaml and C++ code myself. In the end this turned out
to be much more work than improving astgen...
Using Olmar
You can use Olmar in two ways:
- write a stand alone Ocaml program that unmarshals the abstract
syntax tree from the disk
- link additional modules into the Elsa parser ccparse and
arrange for calls from ccparse.
In asttools/count-ast.ml you find a
very simple Olmar example
application.
Abstract syntax tree type definition
The type definition is in the following files
- elsa/cc_ast_gen_type.ml
- contains the type definition of all ast nodes
- elsa/cc_ml_types.ml
- flag types used in the syntax nodes
- elsa/ml_ctype.ml
- flag types used in type nodes
- elsa/ast_annotation.mli
- ast annotations (see below)
The whole abstract syntax tree that is marshaled from ccparse has type
annotated translationUnit_type = annotated * (annotated topForm_type list)
Ast annotations
The whole abstract syntax tree is polymorphic in one type
parameter, which is a placeholder for user defined annotations.
Every node of the syntax tree carries a (unique) slot of this
annotation type. Annotations are meant for client use. One can
easily define a new annotation type and use it to store client
data in it.
ccparse generates the abstract syntax tree with annotations of
type annotated (see
elsa/ast_annotation.mli). Every node is guaranteed to
contain a unique annotation value. The annotation carries an
(positive) integer (accessible via
id_annotation) that uniquely identifies the syntax
tree node of this annotation. These integers are dense, that is
all integers between 1 and the maximal one are used in the
abstract syntax tree.
In addition to this unique integer the annotation contains
the address of the C++ object from which the Ocaml value was
generated (shifted to the right to fit into an Ocaml int).
Complications
The abstract syntax tree is circular. A naive iteration over the
tree will therefore in general not terminate. Currently there are
three fields that might make the tree circular:
- var_type in type variable
- value in type variable
- funcDefn in type variable
- self_type in compound_info
All these fields have a (hopefully hinting) option ref type. An
iteration over the abstract syntax tree will terminate, if you do
not recurse into these three fields. However, there might be some
tree nodes only reachable via one of these fields.
The Olmar example count-ast.ml shows how to use
annotations and dense sets to traverse all nodes in a syntax
tree.
Utilities
- Dense sets of positive integers
- The interface is a subset of the module Set.S of Ocamls
standard library (of course with with type elt = int). Internally
it uses an array of strings as bitmap.
- Syntax tree utilities
- Contains functions to access fields that are present in each
variant of a given node type. For instance for annotations and
source locations.
Problems/Questions/Suggestions
Feel free to contact me at tews@cs.ru.nl with anything that
is Olmar or
Elsa related.
Known problems
- does not compile on 64-bit platforms
- The elsa code contains some pointer <--> unsigned
int casts and is thus not 64-bit clean. If you really want
64 bit try this patch (ignore
the error on ChangeLog, insert the line from chunk #2
sm_config.pm manually, fix smbase/Makefile to pass
-DARCH_64_BIT to smbase/mysig.cc and configure with -64bit).
- inefficiency on large graphs
- When generating the Ocaml tree every node in the C++ tree
contains a pointer to the corresponding Ocaml node. These
pointers are scanned for every minor collection, which results in
a big performance hit for the default minor heap size of Ocaml
(as distributed Olmar increases it
from 32K to 8MB). See also the related discussion on the Ocaml list.
- missing node types
- The syntax node class BaseClassSubobj seems not to appear in
all of Elsa's regression test programs. The toOcaml
method of this classes currently just contains an
assert(false). I am grateful for any example program
that triggers these assertions.
last changed on
30 Oct 2006
by Hendrik