The inputs to the Upamaana module are two lists of sequences of code. This module detects the most frequent co-occurrences that appear in the fist list of sequences and that never appear in the second.
Thus, the output of this module is a set of sub-sequences that are the most often found in the sequences of the first list and that cannot be found in any of the sequences of the second list.
The meaning of the code needs not to be specified. It can be any ascii string. The only restrictions on the code are as follows:
The most frequent contrasted co-occurrences
are derived using a special algorithm that compares the sequences in the
first and second lists.
The upamaana module can be either embedded into a project (like the TOUKON project) or used as an independent UNIX command.
Upamaana algorithm description
The issue of the upamaana algorithm can be
summarized as follows:
"Let A and B be two sets of sequences of
code. What is the most frequent combination of code in the sequences of
A that never appear in B".
Lets consider the following example:
The sequences are represented as columns.
The code is made of 4 colors.
|
|
||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
A classical approach to this problem would be to compute for every two sequences of A the common combinations of code. Then do the same for every three sequences of A. Then for every four sequences and so forth. Once this tedious work is completed, the resulting combinations should be compared to the sequences of B and only those that do not appear in B should be kept. Finally, among these sequences, the most frequent in A are chosen.
Upamaana uses an algorithm which is much less time consuming. It is inspired by the way humans would naturally resolve the issue.
First, the sequences of A are reproduced without
code that appear less than 2 times among the sequences. Then again, without
the codes that appear less than 3 times and so forth until only the most
frequent codes are left.
The result is a table. The lines of this
table contain sequences derived from those of A by removing codes that
are less or equaly frequent to the number of the line.
Example:
In this table the frequence in A of each
sub-sequence is marked underneath the sub-sequence.
Removing codes that appear 1 time | |||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Removing codes that appear 2 times | ||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Removing codes that appear 3 times | ||||
|
|
|
||
|
|
|
||
|
|
|
||
|
|
|
||
|
|
|
||
|
|
|
Removing codes that appear 4 times | ||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
Let I be a line counter initially set to 1.
The next step is to compute the common combinations of code among the sequences of the last line of this table. As the number of sequences is smaller than in the initial set A, this computation should be much quicker. The common combinations are compared to the sequences in set B. Those that do not appear in B are retained. Among these, only the most frequent in A are kept and I is set to their frequence.
Now, the goal is to find a combination of code in A that is more frequent than I times, without appearing in B. This combination appears necessarily in the lines after line number I in the table. Indeed, if a combination is more frequent than I times then all of its codes taken individually are more frequent than I times.
Hence, the above step is repeated for each line of the table starting from the last until the I-1 line is reached. For each new line that is processed, the most frequent combinations that do not appear in B are computed. They are maintained if and only if their frequence is bigger than I and I is updated to this new frequence.
Take note that for each new line processed the value of I can increase. Although, the number of the current line processed decreases. Therefore, these two values converge towards one another.
Example:
For the last line of the table, the only common
combination is:
|
||
|
||
|
||
|
||
|
Its frequence is 5 in A. Yet it appears in the first sequence of B. Therefore, this combination cannot be kept. The algorithm must go on to the next line.
The common combinations for line 3 are (their frequences are marked underneath):
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The most frequent common combinations that
do not appear in B are the two last ones. Their frequence is 4, thus I=4.
The number of the current line is 3=4-1=I-1. Thus the concluding condition
of the algorithm has been met and the results are:
|
|
|
|
|
|
|
|
|
|
|
|
|
|