HOME PAGE
How does upamaana work

The inputs to the Upamaana module are two lists of sequences of code. This module detects the most frequent co-occurrences that appear in the fist list of sequences and that never appear in the second.

Thus, the output of this module is a set of sub-sequences that are the most often found in the sequences of the first list and that cannot be found in any of the sequences of the second list.

The meaning of the code needs not to be specified. It can be any ascii string. The only restrictions on the code are as follows:


The most frequent contrasted co-occurrences are derived using a special algorithm that compares the sequences in the first and second lists.

The upamaana module can be either embedded into a project (like the TOUKON project) or used as an independent UNIX command.


Upamaana algorithm description

The issue of the upamaana algorithm can be summarized as follows:
"Let A and B be two sets of sequences of code. What is the most frequent combination of code in the sequences of A that never appear in B".

Lets consider the following example:

The sequences are represented as columns. The code is made of 4 colors.
 

Set A
 
Set B
red
red
yellow
red
red
 
green
red
green
green
green
red
green
 
yellow
green
blue
green
blue
green
red
 
red
blue
yellow
yellow
yellow
yellow
yellow
 
yellow
red
blue
red
blue
blue
green
 
green
blue

A classical approach to this problem would be to compute for every two sequences of A the common combinations of code. Then do the same for every three sequences of A. Then for every four sequences and so forth. Once this tedious work is completed, the resulting combinations should be compared to the sequences of B and only those that do not appear in B should be kept. Finally, among these sequences, the most frequent in A are chosen.

Upamaana uses an algorithm which is much less time consuming. It is inspired by the way humans would naturally resolve the issue.

First, the sequences of A are reproduced without code that appear less than 2 times among the sequences. Then again, without the codes that appear less than 3 times and so forth until only the most frequent codes are left.
The result is a table. The lines of this table contain sequences derived from those of A by removing codes that are less or equaly frequent to the number of the line.

Example:
In this table the frequence in A of each sub-sequence is marked underneath the sub-sequence.
 
 

Removing codes that appear 1 time  
red
red
 X
red
red
 
green
green
green
 X
green
 
blue
green
blue
green
 X
 
yellow
yellow
yellow
yellow
yellow
 
blue
 X
blue
blue
 X
 
1
1
1
1
1
Removing codes that appear 2 times
red
 X
red
red
 
green
green
 X
green
 
 X
 X
 X
 X
 
yellow
yellow
yellow
yellow
 
blue
blue
blue
 X
 
1
1
1
2
Removing codes that appear 3 times
 X
red
red
 
green
 X
green
 
 X
 X
 X
   
yellow
yellow
yellow
 
 X
 X
 X
   
1
1
3
 
Removing codes that appear 4 times
 X
 
 X
 
 X
   
yellow
 
 X
   
5

Let I be a line counter initially set to 1.

The next step is to compute the common combinations of code among the sequences of the last line of this table. As the number of sequences is smaller than in the initial set A, this computation should be much quicker. The common combinations are compared to the sequences in set B. Those that do not appear in B are retained. Among these, only the most frequent in A are kept and I is set to their frequence.

Now, the goal is to find a combination of code in A that is more frequent than I times, without appearing in B. This combination appears necessarily in the lines after line number I in the table. Indeed, if a combination is more frequent than I times then all of its codes taken individually are more frequent than I times.

Hence, the above step is repeated for each line of the table starting from the last until the I-1 line is reached. For each new line that is processed, the most frequent combinations that do not appear in B are computed. They are maintained if and only if their frequence is bigger than I and I is updated to this new frequence.

Take note that for each new line processed the value of I can increase. Although, the number of the current line processed decreases. Therefore, these two values converge towards one another.

Example:

For the last line of the table, the only common combination is:
 

 X
 
 X
 
 X
   
yellow
 
 X
   

Its frequence is 5 in A. Yet it appears in the first sequence of B.  Therefore, this combination cannot be kept. The algorithm must go on to the next line.

The common combinations for line 3 are (their frequences are marked underneath):

red
 X
red
X
 
green
X
 X
green
 
 X
 X
 X
 X
 
yellow
yellow
yellow
yellow
 
X
X
X
 X
 
3
5
4
4

The most frequent common combinations that do not appear in B are the two last ones. Their frequence is 4, thus I=4. The number of the current line is 3=4-1=I-1. Thus the concluding condition of the algorithm has been met and the results are:
 

red
X
 
 X
green
 
 X
 X
 
yellow
yellow
 
X
 X