ast of hello-world-border.c Olmar : Process C++ Programs in Ocaml

Olmar connects Elsa, the Elkhound based C/C++ parser and type checker, with Ocaml. More precisely, the Olmar extension can translate Elsa's internal abstract syntax tree into a value of an Ocaml variant type. This value can then be further processed with a pure Ocaml program. I prefer to have stand alone Ocaml programs. Therefore I let Elsa marshal the abstract syntax tree as an Ocaml value to disk. However, it is also possible to link the Ocaml code into the Elsa executable.

News

30 Oct 2006: Version 0.2 released, changes:

Distribution

In principle Olmar is a patch for the astgen tool and for Elsa. In the future Olmar will hopefully get integrated into the Elsa/Elkhound distribution. At the moment Olmar is based on the latest Elsa distribution 2005.08.22b. After this release work will start to adopt olmar to the latest Elsa versions available from Oink.

For simplicity I only distribute a complete smbase/Ast/Elkhound/Elsa/Olmar system now. If you want to have pure Elsa, please download it from the Elsa website or (for something newer) from the Oink-Stack repository.

Download / Compile / Use

System requirements

See also Elsa's requirements (under point Download) and Elsa's success/failure matrix

(It appears to run on a 64bit system. However, there are quite a few warning about casts between pointer and integer. I guess it is pure luck that it passes the regression tests.)

Download Elsa+Olmar

choose from the following alternative:

Configure

configure -no-dash-O2
Leave out the -no-dash-O2 option if you want to compile the C++ code with -O2. You can use the environment variables CC and CXX to set the C and C++ compiler, respectively.

The whole thing consists of five packages/subdirectories: smbase, ast, elkhound, elsa, and asttools. The configure script and the makefile of the base directory simply start the appropriate action in each subdirectory. configure --help in the base directory will therefore give you the help text of all the configure scripts in the subdirectories.

Compile

make
this will create the C++ parser elsa/ccparse, the AST Graph utility asttools/ast_graph, and the Olmar example application asttool/count-ast.

Try it

New features in the elsa parser ccparse

In ccparse the option -oc <file> will activate Olmar and set the filename to write the abstract syntax tree to. Alternatively one can use the tracing option marshalToOcaml (add -tr marshalToOcaml). Then ccparse will derive the filename for the abstract syntax tree itself (input-file.oast).

Olmar's contribution: ast_graph, visualising C++ syntax trees

At the moment the asttools subdirectory in the distribution contains only one useful tool: Ast graph. Ast graph generates the abstract syntax tree in the dot language. One can then use the tools from the graphviz package to visualise the syntax tree.

Usage

Normal C++ files tend to have abstract syntax trees with 10.000 to 1.000.000 nodes. Including iostream alone gives almost 250.000 nodes. Most of the graphics software I tried fails on the sheer size of these graphs. To visualise the tree I have found the following possibilities:
zgrviewer
zgrviewer can display dot files directly (relying on a dot background job). It has nice zooming and scrolling functions. It's a pity that Java runs out of memory on graphs with 10.000 nodes already.
convert to postscript (dot -Tps) and use gv
Works for huge graphs. Scrolling in gv is ok, zooming relatively poor. gv seems to allocate a pixmap in the X server. For huge graphs one has therefore to limit the bounding box using ast_graph's -size option (or putting a suitable size attribute into the dot file).
convert to xfig (dot -Tfig) and use xfig
xfig is the fastest of the alternatives. Zooming is relatively good in xfig, scrolling a bit poor. The display is cluttered with all sorts of handles (because xfig assumes you want to change the graph). (xfig worked great for me until some debian etch update pleased me the folowing bugs #373005, #387296, #387298)

syntax tree of ocaml's
  minor garbage collector Technology

The goal of Olmar is to make the abstract syntax tree of a C or C++ program available as an Ocaml variant type, such that one can use pattern matching to process C and C++ programs.

Elsa can output its internal abstract syntax tree in XML, or (mainly for debugging purposes) in plain ASCII. In principle one could read the XML into Ocaml, for instance with PXP. PXP reads XML into an Ocaml object hierarchy. As far as I know, there is, however, no simple way to translate XML into an Ocaml variant type. With PXP one could either write a pull parser or a visitor on the Ocaml object tree. Both approaches are a kind of high-level XML parsing that require some form of type checking the XML and a lot of error code. I did not want to write this kind of XML type-checking code. Therefore Olmar uses a completely different approach.

Olmar simply adds a method toOcaml to each class in Elsa's abstract syntax tree. This method traverses the syntax tree, thereby reconstructing it in Ocaml. At the end the Ocaml value is marshaled into a file. Elsa is linked with some Ocaml code, the Ocaml runtime and some C++ glue code. (In reality the whole story is slightly more complicated, because Elsa's abstract syntax tree can be circular and because C++ pointers might be NULL. Anyway ...)

Elsa internal abstract syntax tree falls into two parts. About 35 different node types (about 150 classes) describe the C++ syntax. Elsa's type checker adds some more types of nodes to describe C++ types in a syntax independent way. A node type might be split into several subtypes (very similar to Ocaml variants). The node type for C++ expressions, for instance, is modelled with 36 classes, for each kind of expression one. In Ocaml such node types are of course modelled with a variant type. Elsa's abstract syntax tree contains also unstructured node types (i.e., without subtypes). In Ocaml those nodes are represented as a tuple or a record.

I wanted to keep Olmar mostly independent from the encoding of variant constructors in Ocaml. Therefore, I register an Ocaml call-back function for each variant constructor and each tuple type. The C++ code calls these call-backs in order to construct Ocaml values (instead of allocating memory itself and filling it). Only list and option values are created directly in C++. For now I prefer this hopefully less error prone variant over more efficient code.

The code for the 35 syntax node types is generated automatically from an ast description file. Therefore, to add the toOcaml method to these syntax classes one only needs to patch astgen. With Olmar astgen additionally generates an Ocaml type definition and Ocaml code for the above mentioned call-back functions. Finally astgen also generates the toOcaml method in C++.

The syntax tree nodes for Elsa's type checker are, unfortunately, not generated from ast descriptions. I had to write all the necessary Ocaml and C++ code myself. In the end this turned out to be much more work than improving astgen...

Using Olmar

You can use Olmar in two ways: In asttools/count-ast.ml you find a very simple Olmar example application.

Abstract syntax tree type definition

The type definition is in the following files
elsa/cc_ast_gen_type.ml
contains the type definition of all ast nodes
elsa/cc_ml_types.ml
flag types used in the syntax nodes
elsa/ml_ctype.ml
flag types used in type nodes
elsa/ast_annotation.mli
ast annotations (see below)
The whole abstract syntax tree that is marshaled from ccparse has type
annotated translationUnit_type = annotated * (annotated topForm_type list)

Ast annotations

The whole abstract syntax tree is polymorphic in one type parameter, which is a placeholder for user defined annotations. Every node of the syntax tree carries a (unique) slot of this annotation type. Annotations are meant for client use. One can easily define a new annotation type and use it to store client data in it.

ccparse generates the abstract syntax tree with annotations of type annotated (see elsa/ast_annotation.mli). Every node is guaranteed to contain a unique annotation value. The annotation carries an (positive) integer (accessible via id_annotation) that uniquely identifies the syntax tree node of this annotation. These integers are dense, that is all integers between 1 and the maximal one are used in the abstract syntax tree. In addition to this unique integer the annotation contains the address of the C++ object from which the Ocaml value was generated (shifted to the right to fit into an Ocaml int).

Complications

The abstract syntax tree is circular. A naive iteration over the tree will therefore in general not terminate. Currently there are three fields that might make the tree circular: All these fields have a (hopefully hinting) option ref type. An iteration over the abstract syntax tree will terminate, if you do not recurse into these three fields. However, there might be some tree nodes only reachable via one of these fields.

The Olmar example count-ast.ml shows how to use annotations and dense sets to traverse all nodes in a syntax tree.

Utilities

Dense sets of positive integers
The interface is a subset of the module Set.S of Ocamls standard library (of course with with type elt = int). Internally it uses an array of strings as bitmap.
Syntax tree utilities
Contains functions to access fields that are present in each variant of a given node type. For instance for annotations and source locations.

Problems/Questions/Suggestions

Feel free to contact me at tews@cs.ru.nl with anything that is Olmar or Elsa related.

Known problems

does not compile on 64-bit platforms
The elsa code contains some pointer <--> unsigned int casts and is thus not 64-bit clean. If you really want 64 bit try this patch (ignore the error on ChangeLog, insert the line from chunk #2 sm_config.pm manually, fix smbase/Makefile to pass -DARCH_64_BIT to smbase/mysig.cc and configure with -64bit).

inefficiency on large graphs
When generating the Ocaml tree every node in the C++ tree contains a pointer to the corresponding Ocaml node. These pointers are scanned for every minor collection, which results in a big performance hit for the default minor heap size of Ocaml (as distributed Olmar increases it from 32K to 8MB). See also the related discussion on the Ocaml list.

missing node types
The syntax node class BaseClassSubobj seems not to appear in all of Elsa's regression test programs. The toOcaml method of this classes currently just contains an assert(false). I am grateful for any example program that triggers these assertions.

last changed on 30 Oct 2006 by Hendrik