Olmar : Process C++ Programs in Ocaml

Olmar connects Elsa, the Elkhound based C/C++ parser and typechecker, with Ocaml. More precisely, the Olmar extension can translate Elsa's internal abstract syntax tree into a value of an Ocaml variant type. This value can then be further processed with a pure Ocaml program. I prefer to have standalone Ocaml programs. Therefore I let Elsa marshal the abstract syntax tree as an Ocaml value to disk. However, it is also possible to link the Ocaml code into the Elsa executable.

Distribution

In principle Olmar is a patch for the astgen tool and for Elsa. In the future Olmar will hopefully get integrated into the Elsa/Elkhound distribution. At the moment Olmar is based on the latest Elsa distribution 2005.08.22b.

For simplicity I currently distribute only a complete smbase/Ast/Elkhound/Elsa/Olmar system. If you want pure Elsa, please download it from the Elsa website.

Download / Compile / Use

System requirements

See also Elsa's requirements (under point Download) and Elsa's success/failure matrix.

(It appears to run on a 64-bit system. However, there are quite a few warnings about casts between pointer and integer. I guess it is pure luck that it passes the regression tests.)

Download Elsa+Olmar

Choose from the following alternatives:

Configure

configure -no-dash-O2
Leave out the -no-dash-O2 option if you want to compile the C++ code with -O2. You can use the environment variables CC and CXX to set the C and C++ compiler, respectively.

The whole thing consists of five packages/subdirectories: smbase, ast, elkhound, elsa, and asttools. The configure script and the makefile of the base directory simply start the appropriate action in each subdirectory. configure --help in the base directory will therefore give you the help text of all the configure scripts in the subdirectories.

Compile

make
This will create the C++ parser elsa/ccparse, the AST Graph utility asttools/ast_graph, and the Olmar example application asttools/count-ast.

Try it

New features in the elsa parser ccparse

In ccparse the option -oc <file> will activate Olmar and set the filename to write the abstract syntax tree to. Alternatively one can use the tracing option marshalToOcaml (add -tr marshalToOcaml). Then ccparse will derive the filename for the abstract syntax tree itself (input-file.oast).

Olmar's contribution: ast_graph, visualizing C++ syntax trees

At the moment the asttools subdirectory in the distribution contains only one useful tool: Ast graph. Ast graph generates the abstract syntax tree in the dot language. One can then use the tools from the graphviz package to visualize the syntax tree.
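As a rough illustration of what emitting a tree in the dot language involves, here is a minimal sketch. The tree type, the node labels, and the function names are invented for this example; this is not ast_graph's actual code, which works on Olmar's real syntax tree types.

```ocaml
(* Hypothetical mini tree: id, label, children. *)
type tree = Node of int * string * tree list

(* One line for the node itself, one edge line per child, then the
   lines of all subtrees. Labels are assumed to contain no quotes. *)
let rec dot_lines (Node (id, label, children)) =
  let node = Printf.sprintf "  n%d [label=\"%s\"];" id label in
  let edges =
    List.map
      (fun (Node (cid, _, _)) -> Printf.sprintf "  n%d -> n%d;" id cid)
      children
  in
  (node :: edges) @ List.concat (List.map dot_lines children)

(* Wrap the lines into a complete dot digraph. *)
let to_dot tree =
  String.concat "\n" (("digraph ast {" :: dot_lines tree) @ ["}"])

let example =
  Node (1, "TranslationUnit", [Node (2, "TF_func", [Node (3, "Function", [])])])

let () = print_endline (to_dot example)
```

The printed text can be saved to a file and passed to dot from the graphviz package.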

Usage

Normal C++ files tend to have abstract syntax trees with 10,000 to 1,000,000 nodes. Including iostream alone gives more than 150,000 nodes. Most of the graphics software I tried fails on the sheer size of these graphs. To visualize the tree I have found the following possibilities:
zgrviewer
zgrviewer can display dot files directly (relying on a dot background job). It has nice zooming and scrolling functions. It's a pity that Java already runs out of memory on graphs with 10,000 nodes.
convert to postscript (dot -Tps) and use gv
Works for huge graphs. Scrolling in gv is ok, zooming is relatively poor. gv seems to allocate a pixmap in the X server. For huge graphs one therefore has to limit the bounding box using ast_graph's -size option (or by putting a suitable size attribute into the dot file).
convert to xfig (dot -Tfig) and use xfig
(xfig worked great for me until last week, then some Debian etch update broke it.) xfig is the fastest of the alternatives. Zooming is relatively good in xfig, scrolling a bit poor. The display is cluttered with all sorts of handles (because xfig assumes you want to change the graph).

Technology

The goal of Olmar is to make the abstract syntax tree of a C or C++ program available as an Ocaml variant type, such that one can use pattern matching to process C and C++ programs.

Elsa can output its internal abstract syntax tree in XML, or (mainly for debugging purposes) in plain ASCII. In principle one could read the XML into Ocaml, for instance with PXP. PXP reads XML into an Ocaml object hierarchy. As far as I know, there is, however, no simple way to translate XML into an Ocaml variant type. With PXP one could either write a pull parser or a visitor on the Ocaml object tree. Both approaches amount to a kind of high-level XML parsing that requires some form of typechecking of the XML and a lot of error-handling code. I did not want to write this kind of XML typechecking code. Therefore Olmar uses a completely different approach.

Olmar simply adds a method toOcaml to each class in Elsa's abstract syntax tree. This method traverses the syntax tree, thereby reconstructing it in Ocaml. At the end the Ocaml value is marshaled into a file. Elsa is linked with some Ocaml code, the Ocaml runtime and some C++ glue code. (In reality the whole story is slightly more complicated, because Elsa's ast can be circular and because C++ pointers might be NULL. Anyway ...)
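The marshaling step relies on Ocaml's standard Marshal module. The following is a minimal sketch of writing a value to a file and reading it back. The type toy_ast is a stand-in invented for this example; a real client would use the types from elsa/cc_ast_gen_type.ml and a file written by ccparse, whose exact layout is not described here.

```ocaml
(* Stand-in for Olmar's real ast type; purely hypothetical. *)
type toy_ast = Leaf of int | Branch of toy_ast list

(* Write a value to a file with the standard Marshal module. *)
let write_value (file : string) (v : toy_ast) : unit =
  let oc = open_out_bin file in
  Marshal.to_channel oc v [];
  close_out oc

(* Read it back. Marshal is not type safe: the type annotation on the
   result must match what was written, the compiler cannot check it. *)
let read_value (file : string) : toy_ast =
  let ic = open_in_bin file in
  let v : toy_ast = Marshal.from_channel ic in
  close_in ic;
  v

let () =
  let file = Filename.temp_file "toy" ".oast" in
  write_value file (Branch [Leaf 1; Branch [Leaf 2]]);
  (match read_value file with
   | Branch (Leaf n :: _) -> Printf.printf "first leaf: %d\n" n
   | _ -> print_endline "unexpected shape");
  Sys.remove file
```

Because unmarshaling is unchecked, the reading program has to be compiled against exactly the type definition that the writer used.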

Elsa's internal abstract syntax tree falls into two parts. About 35 different node types (about 150 classes) describe the C++ syntax. Elsa's type checker adds some more types of nodes to describe C++ types in a syntax independent way. A node type might be split into several subtypes (very similar to Ocaml variants). The node type for C++ expressions, for instance, is modelled with 36 classes, one for each kind of expression. In Ocaml such node types are of course modelled with a variant type. Elsa's abstract syntax tree also contains unstructured node types (i.e., without subtypes). In Ocaml those nodes are represented as a tuple or a record.
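To make the mapping concrete, here is a small sketch of a node type with subtypes (a variant) and an unstructured node type (a record), together with a function that processes them by pattern matching. The constructors and fields are heavily simplified and invented for this example; the real definitions are generated into elsa/cc_ast_gen_type.ml and look different.

```ocaml
(* Simplified, hypothetical miniature of a node type with subtypes. *)
type 'a expression =
  | E_intLit of 'a * int
  | E_variable of 'a * string
  | E_binary of 'a * 'a expression * string * 'a expression

(* Simplified, hypothetical unstructured node type. *)
type 'a declaration = { annot : 'a; name : string; init : 'a expression option }

(* Count the variable occurrences in an expression by pattern matching. *)
let rec count_vars = function
  | E_intLit _ -> 0
  | E_variable _ -> 1
  | E_binary (_, l, _, r) -> count_vars l + count_vars r

let vars_in_decl d =
  match d.init with
  | None -> 0
  | Some e -> count_vars e

let () =
  let e = E_binary ((), E_variable ((), "x"), "+", E_intLit ((), 1)) in
  Printf.printf "%d\n" (vars_in_decl { annot = (); name = "y"; init = Some e })
```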

I wanted to keep Olmar mostly independent from the encoding of variant constructors in Ocaml. Therefore, I register an Ocaml callback function for each variant constructor and each tuple type. The C++ code calls these callbacks in order to construct Ocaml values (instead of allocating memory itself and filling it). Only list and option values are created directly in C++. For now I prefer this hopefully less error-prone variant over more efficient code.
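The Ocaml side of such a callback scheme can be sketched with the standard Callback module. The node type and the registered names below are made up for illustration and are not the names that astgen generates.

```ocaml
(* Hypothetical node type; Olmar's real types are generated by astgen. *)
type expr = E_int of int | E_add of expr * expr

(* One constructor function per variant constructor. *)
let create_E_int (i : int) : expr = E_int i
let create_E_add (l : expr) (r : expr) : expr = E_add (l, r)

(* Callback.register makes an Ocaml value available to C code under a
   name. The names used here are invented for this example. *)
let () =
  Callback.register "create_E_int" create_E_int;
  Callback.register "create_E_add" create_E_add
```

On the C or C++ side, a registered closure can be looked up with caml_named_value and applied with caml_callback or caml_callback2. The glue code then never needs to know how E_add is laid out in memory.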

The code for the 35 syntax node types is generated automatically from an ast description file. Therefore, to add the toOcaml method to these syntax classes one only needs to patch astgen. With Olmar astgen additionally generates an Ocaml type definition and Ocaml code for the abovementioned callback functions. Finally astgen also generates the toOcaml method in C++.

The syntax tree nodes for Elsa's typechecker are, unfortunately, not generated from ast descriptions. I had to write all the necessary Ocaml and C++ code myself. In the end this turned out to be much more work than improving astgen...

Using Olmar

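You can use Olmar in two ways: as a standalone Ocaml program that reads the abstract syntax tree marshaled to disk by ccparse, or by linking your Ocaml code into the Elsa executable. In asttools/count-ast.ml you find a very simple Olmar example application.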

Abstract syntax tree type definition

The type definition is in the following files
elsa/cc_ast_gen_type.ml
contains the type definition of all ast nodes
elsa/cc_ml_types.ml
flag types used in the syntax nodes
elsa/ml_ctype.ml
flag types used in type nodes
elsa/ast_annotation.mli
ast annotations (see below)
The whole abstract syntax tree that is marshaled from ccparse has type
annotated translationUnit_type = annotated * (annotated topForm_type list)

Ast annotations

The whole abstract syntax tree is polymorphic in one type parameter, which is a placeholder for user defined annotations. Every node of the syntax tree carries a (unique) slot of this annotation type. Annotations are meant for client use. One can easily define a new annotation type and use it to store client data in it.

ccparse generates the abstract syntax tree with annotations of type annotated (see elsa/ast_annotation.mli). Every node is guaranteed to contain a unique annotation value. The annotation carries a (positive) integer (accessible via id_annotation) that uniquely identifies the syntax tree node of this annotation. In addition the annotation contains the address of the C++ object from which the Ocaml value was generated.
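The idea of a tree that is polymorphic in its annotation type can be sketched as follows. The record fields, the node type, and map_annot are assumptions made up for this example; only the names annotated and id_annotation come from the text above, and their real definitions in elsa/ast_annotation.mli may differ.

```ocaml
(* Hypothetical annotation: a unique id plus the C++ object address. *)
type annotated = { id : int; cpp_address : int }

let id_annotation (a : annotated) : int = a.id

(* Hypothetical node type, polymorphic in the annotation. *)
type 'a node =
  | Leaf of 'a
  | Inner of 'a * 'a node list

(* Replace every annotation with client data computed from it. *)
let rec map_annot (f : 'a -> 'b) : 'a node -> 'b node = function
  | Leaf a -> Leaf (f a)
  | Inner (a, children) -> Inner (f a, List.map (map_annot f) children)

let () =
  let tree =
    Inner ({ id = 1; cpp_address = 0x1000 },
           [Leaf { id = 2; cpp_address = 0x1010 }]) in
  (* Client annotation: the node id paired with a mutable visit counter. *)
  let tree' = map_annot (fun a -> (id_annotation a, ref 0)) tree in
  match tree' with
  | Inner ((id, _), _) -> Printf.printf "root id %d\n" id
  | Leaf _ -> ()
```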

Complications

The abstract syntax tree is circular. A naive iteration over the tree will therefore in general not terminate. Currently there are three fields that might make the tree circular. All these fields have a (hopefully hinting) option ref type. An iteration over the abstract syntax tree will terminate if you do not recurse into these three fields. However, there might be some tree nodes only reachable via one of these fields.

The Olmar example count-ast.ml shows how to use annotations and dense sets to traverse all nodes in a syntax tree.
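The general technique, independent of count-ast.ml's actual code, can be sketched as below. The node type is invented for this example, and the standard library's Set stands in for Olmar's dense sets.

```ocaml
(* Hypothetical node type: [back] is an option ref field that can
   close a cycle. *)
type node = { id : int; children : node list; back : node option ref }

module IntSet = Set.Make (Int)

(* Visit every reachable node exactly once, also following [back].
   The set of already seen ids guarantees termination. *)
let count_nodes (root : node) : int =
  let seen = ref IntSet.empty in
  let rec visit n =
    if not (IntSet.mem n.id !seen) then begin
      seen := IntSet.add n.id !seen;
      List.iter visit n.children;
      match !(n.back) with
      | Some m -> visit m
      | None -> ()
    end
  in
  visit root;
  IntSet.cardinal !seen

let () =
  let child = { id = 2; children = []; back = ref None } in
  let root = { id = 1; children = [child]; back = ref None } in
  child.back := Some root;          (* the tree is now circular *)
  Printf.printf "%d nodes\n" (count_nodes root)
```

Because the unique id from the annotation is the key of the visited set, nodes that are only reachable through a circular field are still counted, and no node is counted twice.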

Utilities

Dense sets of positive integers
The interface is a subset of the module Set.S of Ocaml's standard library (of course with the constraint with type elt = int). Internally it uses an array of strings as bitmap.
Syntax tree utilities
Contains functions to access fields that are present in each variant of a given node type. For instance for annotations and source locations.
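A bitmap-based dense set can be sketched as follows. This is not Olmar's implementation: Bytes stands in for the mutable strings of older Ocaml versions, the chunk size and all names are invented, and add mutates the set in place, which deviates from the functional add of Set.S.

```ocaml
let chunk_bits = 8 * 1024          (* bits per chunk, arbitrary choice *)

(* The set is an array of lazily allocated bitmap chunks. *)
type t = { mutable chunks : Bytes.t option array }

let create () = { chunks = Array.make 16 None }

(* Grow the chunk array until index [c] is valid. *)
let ensure s c =
  if c >= Array.length s.chunks then begin
    let n = max (c + 1) (2 * Array.length s.chunks) in
    let a = Array.make n None in
    Array.blit s.chunks 0 a 0 (Array.length s.chunks);
    s.chunks <- a
  end

(* Set the bit for [x], allocating its chunk on first use. *)
let add x s =
  if x < 0 then invalid_arg "add: negative element";
  let c = x / chunk_bits and bit = x mod chunk_bits in
  ensure s c;
  let b =
    match s.chunks.(c) with
    | Some b -> b
    | None ->
      let b = Bytes.make (chunk_bits / 8) '\000' in
      s.chunks.(c) <- Some b;
      b
  in
  let byte = Char.code (Bytes.get b (bit / 8)) in
  Bytes.set b (bit / 8) (Char.chr (byte lor (1 lsl (bit mod 8))))

(* Test the bit for [x]; unallocated chunks contain no elements. *)
let mem x s =
  if x < 0 then false
  else
    let c = x / chunk_bits and bit = x mod chunk_bits in
    if c >= Array.length s.chunks then false
    else
      match s.chunks.(c) with
      | None -> false
      | Some b ->
        Char.code (Bytes.get b (bit / 8)) land (1 lsl (bit mod 8)) <> 0
```

Such a representation pays off when the elements are the consecutive node ids of a syntax tree, because membership tests take constant time and each element costs one bit.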

Problems/Questions/Suggestions

Feel free to contact me at tews@cs.ru.nl with anything that is Olmar or Elsa related.

Known problems

missing type links
In Ocaml one can currently not access the type information of an identifier or an expression, although this seems possible in C++.
missing node types
Some syntax node classes do not seem to appear in any of Elsa's regression test programs. They are BaseClassSubobj and DependentQType. The toOcaml method of these classes currently just contains an assert(false). I am grateful for any example program that triggers these assertions.

last changed on 7 Sep 2006 by Hendrik