String coordinate implementation for RapidMiner

Download the extension + source code + example application: rapidminer_string_coordinate.0.1.2.zip (2.0 MB, date: September 24, 2015)

Feel free to contact me if you have any problems using this software: christian [dot] leitold [at] gmail.com. Be sure to use the latest version, as it will have fixes for bugs present in older versions.

1. Introduction

StringCoordinate is an extension (plugin) for RapidMiner, a machine learning and data mining framework. It implements the non-linear string reaction coordinate method so that it can be used from within RapidMiner. More information on the method can be found in W. Lechner, J. Rogal, J. Juraszek, B. Ensing, and P. G. Bolhuis, J. Chem. Phys. 133, 174110 (2010), and more information on the example application shown later is provided in C. Leitold, W. Lechner, and C. Dellago, J. Phys. Cond. Matt. 27, 194126 (2015).

2. Installation and compatibility

First, install the base version of RapidMiner from rapidminer.com. Note that you will have to create an account in order to download the software. While an extended commercial version is available for sale, you only need the free open-source version to install the StringCoordinate extension. RapidMiner is implemented in Java and should thus in principle run on any platform. The extension should work with RapidMiner 5.3 and newer versions. However, there is a bug when using 5.3 with newer versions of Mac OS X: when it occurs, text input fields in RapidMiner cannot be used. In that case, you should install RapidMiner version 6. Refer to the RapidMiner manual for further information on how to install and run the program on your platform.

Secondly, copy the extension file StringCoordinate.jar into the proper directory:

lib/plugins (relative to the RapidMiner root directory) for RapidMiner 5.3

USER_HOME/.RapidMiner/extensions for RapidMiner 6
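The copy step can also be scripted. The following is a minimal Python sketch, not part of the extension: the function names and the example paths are made up for illustration, and only the two target directories are taken from the list above.

```python
import shutil
from pathlib import Path


def plugin_directory(major_version: int, rapidminer_root: Path, user_home: Path) -> Path:
    """Return the extension directory listed above for a RapidMiner major version."""
    if major_version == 5:
        # RapidMiner 5.3: lib/plugins relative to the RapidMiner root directory
        return rapidminer_root / "lib" / "plugins"
    if major_version == 6:
        # RapidMiner 6: USER_HOME/.RapidMiner/extensions
        return user_home / ".RapidMiner" / "extensions"
    raise ValueError("only RapidMiner 5.3 and 6 are covered by these instructions")


def install_extension(jar: Path, target_dir: Path) -> Path:
    """Copy the jar into target_dir, creating the directory if it does not exist."""
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(jar, target_dir))


# Example with hypothetical locations:
# install_extension(Path("StringCoordinate.jar"),
#                   plugin_directory(6, Path("/opt/rapidminer"), Path.home()))
```

Restart RapidMiner after copying so that the new extension is picked up.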

Once installed, you should have an additional menu entry "String Coordinate Extension" and the corresponding operator in your operator selection menu.

3. Example: construct a string reaction coordinate for polymer folding

Here, we will show how to construct a string reaction coordinate for the folding of a polymer chain. The data are real simulation results and are stored in database_polymer.csv. Each line contains the result of a committor calculation for a configuration, as well as a list of collective variables describing that particular configuration. The system under investigation is a single, flexible chain of identical monomers, linked together by harmonic springs. Non-neighboring monomers interact via a short-ranged attractive interaction with a repulsive core. For a sufficiently narrow attractive well, the system undergoes a first-order-like freezing transition from an expanded disordered coil to a compact crystalline state. We use the system's total potential energy, denoted by U, to distinguish between the two stable states. Further information on the system, as well as all the details of the simulation, can be found in C. Leitold, W. Lechner, and C. Dellago, J. Phys. Cond. Matt. 27, 194126 (2015).

The following walk-through assumes you are running RapidMiner Studio 6.5. The user interface might look slightly different with other versions, but the keyboard commands should be the same in any case.

a) Start RapidMiner and create a new process (File -> New Process). Save it under some name if you like.

b) In the "Operators" tab on the left of the main screen, select Import -> Data -> Read CSV. Drag the operator into your process view in the center.

c) Single-click the operator to configure it. Click "Import Configuration Wizard". In the wizard, select the file database_polymer.csv and click Next.

Import CSV

You should now see a preview of your data. Note that the first line is used to name the values. The names are the same as the ones used in the paper cited above. NA and NB are the numbers of trajectories started from that configuration that end in A and B, respectively. Click Next twice. The last screen of the wizard is the most important one: for the purpose of this example, untick all boxes except the ones for pB, NA, NB, U, and Q6. Change the attribute type of "pB" to "label". This is important, because we need this value only for visualization: it is redundant via the relation pB = NB / (NA + NB). Click Finish to exit the wizard.

Confirm import CSV
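If you want to convince yourself that pB is indeed redundant, the relation pB = NB / (NA + NB) can be checked outside of RapidMiner. The following Python sketch assumes a comma-separated file with a header line containing the column names pB, NA, and NB. The two data rows are made up for illustration and are not taken from database_polymer.csv, whose delimiter and additional columns may differ.

```python
import csv
import io

# Made-up rows for illustration only (NOT real simulation data).
SAMPLE = """pB,NA,NB,U,Q6
0.25,15,5,-40.0,0.10
0.80,4,16,-55.0,0.35
"""


def committor_from_counts(n_a: int, n_b: int) -> float:
    """Estimate pB as the fraction of trajectories that end in state B."""
    return n_b / (n_a + n_b)


def max_committor_mismatch(handle) -> float:
    """Largest absolute difference between the stored pB and NB / (NA + NB)."""
    worst = 0.0
    for row in csv.DictReader(handle):
        estimate = committor_from_counts(int(row["NA"]), int(row["NB"]))
        worst = max(worst, abs(estimate - float(row["pB"])))
    return worst


mismatch = max_committor_mismatch(io.StringIO(SAMPLE))
print("largest mismatch:", mismatch)
```

To run the check on a real file, pass an open file handle instead of the in-memory sample. A small nonzero mismatch would only indicate rounding of the stored pB values.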

d) (Optional) For a quick look at the data, use your mouse to connect the "out" port of the Read CSV operator with the "res" (result) input port on the right border of the process view window. Click the blue "run" arrow or press F11 to run the process. The process will now read the input file and process it according to our settings. You are directed to the results view, where you can look at the data, view statistical information, and also create plots. You might want to create a scatter plot using U as x and pB as y. Press F8 or click "Design" in the top right corner to return to the process view window.

e) Now, in the "Operators" tab, select Data Transformation -> Value Modification -> Numerical Value Modification -> Normalize and drag the operator into the process window. Connect the "out" port of the Read CSV operator to the "exa" (example set) input port of the Normalize operator. Configure it the following way: attribute filter type = subset, click "Select Attributes", select Q6 and U, click Apply. Click "2 hidden expert parameters" or hit F4 to enable "expert" mode, change the normalization method to "range transformation", and keep 0 and 1 as the defaults for min and max. This step should be pretty self-explanatory: we have just used the Normalize operator to transform the values of the variables U and Q6 such that they fall within the interval [0, 1] by a simple linear transformation. These are the two variables that define the space our string will reside in.
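For reference, the linear transformation described in step e) can be sketched in a few lines of Python. This is only an illustration of the idea, not RapidMiner's implementation. The function name and the handling of a constant column are assumptions, and the example values are made up.

```python
def range_transform(values, new_min=0.0, new_max=1.0):
    """Linearly map values so that min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to new_min everywhere.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


# Made-up potential energies, for illustration only.
energies = [-60.0, -45.0, -30.0]
print(range_transform(energies))
```

After the transformation, the smallest value of each selected variable sits at 0 and the largest at 1, so U and Q6 enter the string optimization on a comparable scale.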

f) Finally, in the "Operators" tab, select "String Coordinate Extension" and drag the operator into the process window. Connect the "out" port of the Normalize operator to the "tra" (training data) input port of the String Coordinate operator.

g) Configuration options for the String Coordinate operator:
fixed attribute = U (the potential energy)
number of string images = 5 (default value)
tick "use committor data", otherwise, a sum of square differences is used as objective function for the optimization
attribute for NA = NA
attribute for NB = NB
tick "reverse string" as well (That way, HIGH values of U correspond to state A and LOW values of U to B, as it is defined in the paper. You could achieve the same result by not ticking "reverse string", but interchanging NA and NB)
number of iterations = 10000 (you can keep the default, but 10000 should be more than enough)
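To illustrate the difference between the two objective functions mentioned in step g), here is a minimal Python sketch. The exact expressions used inside the extension are not given in this text, so the code assumes the standard binomial likelihood for committor data (NB trajectories ending in B, NA ending in A, model prediction p for pB) and a plain sum of squared differences as the alternative. All function names and numbers are made up for illustration.

```python
import math


def negative_log_likelihood(p_model, n_a, n_b, eps=1e-12):
    """Assumed binomial form: -sum over configurations of NB*ln(p) + NA*ln(1 - p)."""
    total = 0.0
    for p, a, b in zip(p_model, n_a, n_b):
        p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
        total -= b * math.log(p) + a * math.log(1.0 - p)
    return total


def sum_of_squares(p_model, p_target):
    """Alternative objective: sum of squared differences between model and target."""
    return sum((m - t) ** 2 for m, t in zip(p_model, p_target))


# Made-up model predictions and shooting counts for two configurations.
p_model = [0.3, 0.8]
n_a = [7, 2]
n_b = [3, 8]
print(negative_log_likelihood(p_model, n_a, n_b))
print(sum_of_squares(p_model, [0.3, 0.8]))
```

In this assumed form, replacing p by 1 - p while interchanging NA and NB leaves the score unchanged, which is consistent with the remark above that "reverse string" and swapping NA and NB lead to the same result.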

h) Connect the "str" (string coordinate visualization) output port to the "res" input port on the right border. Also connect the "mod" (model) output.

Process view

i) Click "run" or hit F11. Depending on your system, the optimization might take some time. For reference, it takes about 15 seconds on my Intel Core 2 Quad Q9650. You are now presented with the results view. Click the "StringCoordinate" tab to see a human-readable representation of the string reaction coordinate just created. The meaning of all parameters is explained in the two papers cited above. Select the tab "ExampleSet", then "Charts", then "String Coordinate Scatter" as the type of plot. Use U as x, Q6 as y, and pB as color axis. The plot shows a representation of the string, including two helper points at the ends, together with all data points.

Results view

Now, it's your turn to play around with the method: try other variable combinations (but for the polymer data, KEEP U as fixed dimension), use your own data, use RapidMiner to find the best combination of variables in terms of final likelihood, ...
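When trying several variable combinations, it can help to list the candidates systematically before setting them up in RapidMiner. The following Python sketch enumerates all combinations that keep U as the fixed dimension and then picks the one with the lowest score. The variable names other than U and Q6, as well as the scores, are hypothetical placeholders. In practice, the scores would be the values you read from the "fin" output port for each run.

```python
from itertools import combinations


def candidate_variable_sets(fixed, others, n_extra):
    """All variable sets made of the fixed variable plus n_extra of the others."""
    return [(fixed,) + combo for combo in combinations(others, n_extra)]


def best_set(scores):
    """Return the variable set with the lowest score (lower is better when minimizing)."""
    return min(scores, key=scores.get)


# "Rg" and "N_contacts" are hypothetical column names used only for illustration.
candidates = candidate_variable_sets("U", ["Q6", "Rg", "N_contacts"], 1)
print(candidates)

# Hypothetical final scores, one per candidate set (NOT real results).
scores = {candidates[0]: 812.4, candidates[1]: 905.1, candidates[2]: 877.0}
print(best_set(scores))
```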

4. All output ports of the String Coordinate operator

mod: the "model", a complete representation of the reaction coordinate with all parameters. Can be used in an "Apply Model" operator to calculate reaction coordinates for new configurations.

exa: the initial example set used for the optimization

wei: weights for the variables, currently not used

str: same as exa, but with addition of string images for visualization

fin: final score of optimization. As the optimization searches for the MINIMUM, actually the NEGATIVE log likelihood is used here and everywhere else.

5. License

The extension is licensed under the AGPL version 3 and is thus fully compatible with any version of RapidMiner. See LICENSE for the full text.