Toukon Project

How does upamaana work

The inputs to the Upamaana module are two lists of sequences of code. This module detects the most frequent co-occurrences that appear in the fist list of sequences and that never appear in the second.

Thus, the output of this module is a set of sub-sequences that are the most often found in the sequences of the first list and that cannot be found in any of the sequences of the second list.

The meaning of the code needs not to be specified. It can be any ascii string. The only restrictions on the code are as follows:

The two lists must contain the same kind of code.

The order of the code in any two sequences must be the same. This means that the item at position "i" in a sequence must be comparable to and only to the item at position "i" of any other sequence.

The most frequent contrasted co-occurrences are derived using a special algorithm that compares the sequences in the first and second lists.

The upamaana module can be either embedded into a project (like the TOUKON project) or used as an independent UNIX command.

Upamaana algorithm description

The issue of the upamaana algorithm can be summarized as follows:
"Let A and B be two sets of sequences of code. What is the most frequent combination of code in the sequences of A that never appear in B".

Lets consider the following example:

The sequences are represented as columns. The code is made of 4 colors.

Set A					Set B
red	red	yellow	red	red	green	red
green	green	green	red	green	yellow	green
blue	green	blue	green	red	red	blue
yellow	yellow	yellow	yellow	yellow	yellow	red
blue	red	blue	blue	green	green	blue

A classical approach to this problem would be to compute for every two sequences of A the common combinations of code. Then do the same for every three sequences of A. Then for every four sequences and so forth. Once this tedious work is completed, the resulting combinations should be compared to the sequences of B and only those that do not appear in B should be kept. Finally, among these sequences, the most frequent in A are chosen.

Upamaana uses an algorithm which is much less time consuming. It is inspired by the way humans would naturally resolve the issue.

First, the sequences of A are reproduced without code that appear less than 2 times among the sequences. Then again, without the codes that appear less than 3 times and so forth until only the most frequent codes are left.
The result is a table. The lines of this table contain sequences derived from those of A by removing codes that are less or equaly frequent to the number of the line.

Example:
In this table the frequence in A of each sub-sequence is marked underneath the sub-sequence.

Removing codes that appear 1 time
red	red	X	red	red
green	green	green	X	green
blue	green	blue	green	X
yellow	yellow	yellow	yellow	yellow
blue	X	blue	blue	X
1	1	1	1	1

Removing codes that appear 2 times
red	X	red	red
green	green	X	green
X	X	X	X
yellow	yellow	yellow	yellow
blue	blue	blue	X
1	1	1	2

Removing codes that appear 3 times
X	red	red
green	X	green
X	X	X
yellow	yellow	yellow
X	X	X
1	1	3

Removing codes that appear 4 times
X
X
X
yellow
X
5

Let I be a line counter initially set to 1.

The next step is to compute the common combinations of code among the sequences of the last line of this table. As the number of sequences is smaller than in the initial set A, this computation should be much quicker. The common combinations are compared to the sequences in set B. Those that do not appear in B are retained. Among these, only the most frequent in A are kept and I is set to their frequence.

Now, the goal is to find a combination of code in A that is more frequent than I times, without appearing in B. This combination appears necessarily in the lines after line number I in the table. Indeed, if a combination is more frequent than I times then all of its codes taken individually are more frequent than I times.

Hence, the above step is repeated for each line of the table starting from the last until the I-1 line is reached. For each new line that is processed, the most frequent combinations that do not appear in B are computed. They are maintained if and only if their frequence is bigger than I and I is updated to this new frequence.

Take note that for each new line processed the value of I can increase. Although, the number of the current line processed decreases. Therefore, these two values converge towards one another.

Example:

For the last line of the table, the only common combination is:


X
X
X
yellow
X

Its frequence is 5 in A. Yet it appears in the first sequence of B. Therefore, this combination cannot be kept. The algorithm must go on to the next line.

The common combinations for line 3 are (their frequences are marked underneath):


red	X	red	X
green	X	X	green
X	X	X	X
yellow	yellow	yellow	yellow
X	X	X	X
3	5	4	4

The most frequent common combinations that do not appear in B are the two last ones. Their frequence is 4, thus I=4. The number of the current line is 3=4-1=I-1. Thus the concluding condition of the algorithm has been met and the results are:


red	X
X	green
X	X
yellow	yellow
X	X