Winner Francois Glineur (GFD6e)
In the recent week long MATLAB programming contest, about 800 entries went through the contest queue, resulting in three intermediate prize winners and one final winner.
In addition to looking at this analysis (which focuses on the techniques used in the winning entry), you might want to look at the discussion areas to see other interesting threads related to the MATLAB Contest.
score = cputime + totalscore;
Instead of displaying "totalscore" on our Rankings page, we displayed "result", a normalized measure how how short your sequence was. "result" was computed with:
result = totalscore/Our_Length*100;
(As an aside, our experience with previous contests shows that hiding details like this doesn't always prevent the entries from "evolving" to be suited to the specific trials presented. However, it's not clear if that happened here, since the best entries did fairly well across many different types of tests.)
This MAT file contains two cell arrays. The one_answer{i} contains the original sequence used to create the fragments stored in testsuite{i}. There were 29 tests, with original lengths ranging from 10 up to 23456 characters.
This typically meant comparing all the segments to each other to determine how to make the best match. While time wasn't weighted nearly as strongly as the scoring function, time was initially important because it was difficult to get a slow brute force algorithm to pass the testsuite without timing out.
Here were a few common ideas we saw in the top entries; we've pointed out the code related to these ideas further below where we've "dissected" the code.
To avoid unnecessary work, many entries first removed duplicate fragments. One way of doing this is to use SORTROWS followed by DIFF to remove duplicates. Since SORTROWS operates on all columns of the data set, this does more work than is necessary and takes too long. A few entries "inlined" the SORTROWS algorithm, but this still performed unnecessary work.
A clever idea involved removing the SORTROWS algorithm entirely. All that's needed is to remove identical segments; we don't need to sort all the rows. With this in mind, a simple hash was performed, which mapped the n-space SORTROWS and DIFF problem (where n is the length of a segment) into a 1-d SORT and DIFF problem. Because only one column of data is sorted, this saves time. This mapping was done using a vector of random numbers; while this wasn't 100% foolproof, the likelihood of random numbers not producing the desired behavior is small and this method worked well.
Here's that mapping as used in the winning entry:
lSegs = size(Segments, 2); [SortedHash Order] = sort(Segments * randn(lSegs, 1)); startSegs = Order(diff(cat(1, SortedHash, 0)) ~= 0)';
Matching the segments is where most of the algorithms spent most of their time, matching up every pair with every possible overlap; a natural way of matching the segments is to use STRCMP/STRNCMP and FIND.
As before, a simple hash was used to quickly perform the comparison, saving a significant amount of CPU time.
As the segments were combined, the running length of the final output grew, but the rate at which it grew couldn't be determined ahead of time. This presented a challenge in efficiently storing the result as it's formed.
A common approach to solving this problem used cell arrays. For problems with large segments, one successful method was to track the information needed to combine the segments without actually doing the combining. Only at the end of the process were the segments combined into the final result. This turned out to be especially efficient with larger segments. The final winning entry used this method for longer segments and would combine segments for shorter segments; an IF statement was used to compare the segment length and switch between methods as needed.
The code for the winning entry is below. In some places, we've reformatted the code with "..." and newlines to make it more readable. The code begins with:
function Sequence = GFD6(Segments) if size(Segments, 1) > 200This IF statement is the cutoff swich statement to determine which algorithm to use for accumulating the final result.
Next we had:
lSegs = size(Segments, 2); [SortedHash Order] = sort(Segments * randn(lSegs, 1)); startSegs = Order(diff(cat(1, SortedHash, 0)) ~= 0)';These lines remove the duplicate fragments, according to the hashing-type algorithm we discussed earlier. Notice that the sort is only performed on a 1-d vector, instead of on an lSegs-d array. The multiplication 'Segments * randn(lSegs,1)' maps lSegs dimensions into 1 dimension.
endSegs = startSegs; nSegs = length(startSegs);These lines set up some work vectors to be carried around for updating information.
Next we need to set up a cell array to carry information about the combinations needed to produce our final result:
Description = cell(nSegs, 1);
for index = startSegs
Description{index} = index;
end
While this code works well, a minor problem in this code is that this
pre-allocated size may not be large enough to hold the final
result.Below we begin the routine for finding which segments have the best (greatest) overlap:
matchLength = lSegs-1; leftSegs = nSegs;Since there are no identical fragments, the longest match possible is lSegs-1. Now we begin to loop over the possible shifts to find the best match:
while matchLength > 0
RandWeight = randn(matchLength, 1);
LeftHash = Segments(endSegs, lSegs-matchLength+1:end) * RandWeight;
RightHash = (Segments(startSegs, 1:matchLength) * RandWeight)';
The RandWeight vector is the vector which maps the lSegs matching
problem into a 1-D problem, similar to what was done to remove
duplicate entries.Next we'll loop over the uncombined segments and check to see if there are any matches:
indexLeft = 1;
while indexLeft <= leftSegs
Match = findstr(LeftHash(indexLeft), RightHash);
if Match
indexRight = Match(1);
After checking for matches, we check for a special case:
if (indexRight == indexLeft) & (length(Match) > 1)
% Rare case, happens e.g. with strvcat('AAA','AAG','AGC')
indexRight = Match(2);
end
Did you remember that we're still inside an IF statement here? In this
section of the IF, we don't actually combine the segments when a match
is found. Instead we store information about how to combine the
segments and do the combination at the end of the routine.The following code updates the Description array to record the combining pairs and the amount by which they're shifted; it also updates the related working vectors:
if indexRight ~= indexLeft
Description{startSegs(indexLeft)} = cat(2,...
Description{startSegs(indexLeft)}, lSegs-matchLength, ...
Description{startSegs(indexRight)});
startSegs(indexRight) = startSegs(indexLeft);
startSegs(indexLeft) = []; endSegs(indexLeft) = [];
leftSegs = leftSegs - 1;
RightHash(indexRight) = RightHash(indexLeft);
RightHash(indexLeft) = []; LeftHash(indexLeft) = [];
If there's nothing left to track, we break out of this routine:
if leftSegs == 1
matchLength = 1;
break;
end
else
Here's the rest of the "if indexRight ~= indexLeft" conditional:
indexLeft = indexLeft + 1;
end
else
indexLeft = indexLeft + 1;
And now we end the "if Match" conditional:
end
Now we can end the "while indexLeft <= leftSegs" loop:
end
One more step to reduce matchLength:
matchLength = matchLength - 1;
Now we can end the "while matchLength > 0" conditional.
endBut we're not finished yet! Now we need to combine the segments according to the Description cell array created. We'll start by attaching all nonmatching fragments to the end of the sequence:
if length(startSegs) > 1
% If disjoint segments in the end, we have to merge them !
for index = 2:length(startSegs)
Description{startSegs(1)} = [Description{startSegs(1)},...
lSegs, Description{startSegs(index)}];
end
startSegs = startSegs(1);
end
Now we actually combine the segments - check out the line used here:
Sequence(repmat(cumsum(cat(2, 0, ...
Description{startSegs}(2:2:2*nSegs-2)))',...
1, lSegs) + repmat(1:lSegs, nSegs, 1))...
= Segments(Description{startSegs}(1:2:2*nSegs-1), :);
Pretty impressive! This code snippet is a great example of
vectorizing. REPMAT, CUMSUM, and the colon operator are all used in
this line, making for a very efficient implementation.Keeping in mind we're still inside an IF conditional, the above section of code only executes for segments of length greater than 200. The following block of code beginning with the ELSE describes the algorithm for used for shorter segments; it combines segments as soon as a match is found:
else
[nSegs, lSegs] = size(Segments);
[SortedHash Order] = sort(Segments * randn(lSegs,1)) ;
Segments = Segments(Order(diff(cat(1,SortedHash,0)) ~= 0), :);
[nSegs, lSegs] = size(Segments);
theSegments = cell(nSegs, 1);
theSegs = 1:nSegs;
for index = theSegs,
theSegments{index} = Segments(index, :);
end
copySegments = theSegments;
matchLength = lSegs-1;
while matchLength > 0
Segments(:, 1) = [];
for leftSeg = theSegs
The following line is a very computationally expensive line. If you
profile this function, you'll see that a majority of the time spent in
this routine is spend on this line:
Match = find(strncmp(Segments(leftSeg, :), copySegments, matchLength));
if Match
Select = Match(1);
rightSeg = theSegs(Select);
if rightSeg == leftSeg
if length(Match) > 1 % Rare case
temp = Match(Match ~= Select);
Select = temp(1);
rightSeg = theSegs(Select);
Here's one of the code sections that does the actual combining:
theSegments{rightSeg} = cat(2,theSegments{leftSeg}, ...
theSegments{rightSeg}(1+matchLength:end));
i = find(theSegs == leftSeg);
theSegs(i) = [];
if length(theSegs) == 1
Sequence = theSegments{theSegs};
return;
end
copySegments(Select) = copySegments(i);
copySegments(i) = [];
end
This section of code is then followed by:
else
theSegments{rightSeg} = cat(2,theSegments{leftSeg}, ...
theSegments{rightSeg}(1+matchLength:end));
i = find(theSegs == leftSeg);
theSegs(i) = [];
if length(theSegs) == 1
Sequence = theSegments{theSegs};
return;
end
copySegments(Select) = copySegments(i);
copySegments(i) = [];
end
end
end
matchLength = matchLength - 1;
end
Sequence = [theSegments{theSegs}];
end
Take a look at these two sections of code. They have a lot in common;
their structure is very similar. It'd be interesting to see if it's
possible to remove any redundancies between the code to improve the
performance yet again.We hope you've enjoyed this analysis of the code. If you have more questions on it, you may want to contact Francois, his contact information is listed next to his winning entry.