Optimizing Circular Soft Mask, Krita:GSoC

New vectorized code was implemented using the Vc library to allow SIMD operations when generating the Circular Soft Mask. The implementation was straightforward using methods provided by Vc. However, the gains were not as dramatic as with the Gaussian masks, because one of the biggest bottlenecks is fetching from memory the predefined values rendered from the curve set by the user.

Making a plan

Phabricator task: Implement Circular Soft Mask Optim AVX
The code templates work the same way as in the Circular Gaussian Mask generator implementation, which I explained in my [previous post](blog, URL). With that in mind, the plan consisted of three simple steps.

  1. Understand how the scalar version generates the values for the mask
  2. Port all operations to the vectorized model
  3. Test and profile the implementation

Understanding the code

Soft curve setting example.

The value for every coordinate (x, y) is computed from its distance to the center, as the function value below shows. This function is only reached when the distance is within range and no antialiasing fade applies. Note that curveData.at(i) fetches the value at position i from the curveData QVector. The index is the integer part of the distance scaled by the curve resolution, and the fractional part is used to interpolate linearly between that sample and the next one.

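inline quint8 value(qreal dist) const
{
    // Scale the distance to curve-sample units.
    qreal distance = dist * curveResolution;

    // Integer part selects the sample, fractional part weights the blend.
    quint16 alphaValue = distance;
    qreal alphaValueF = distance - alphaValue;

    // Linear interpolation between two neighbouring curve samples.
    qreal alpha = (
        (1.0 - alphaValueF) * curveData.at(alphaValue) +
        alphaValueF * curveData.at(alphaValue + 1));

    return (1.0 - alpha) * 255;
}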
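To make the lookup easier to try outside Krita, here is a minimal standalone sketch of the same interpolation. It is not the Krita code: the `SoftCurve` struct is a hypothetical stand-in, `std::vector` replaces `QVector`, and it assumes the distance is non-negative and that the sample after the selected index exists.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the generator state (not a Krita type).
struct SoftCurve {
    std::vector<double> curveData;  // curve samples, assumed to be in [0, 1]
    double curveResolution;         // samples per unit of distance

    // Linear interpolation between the two samples that bracket `dist`.
    // Assumes dist >= 0 and that index + 1 is a valid sample.
    std::uint8_t value(double dist) const {
        const double distance = dist * curveResolution;
        const std::size_t index = static_cast<std::size_t>(distance);  // integer part
        const double frac = distance - static_cast<double>(index);     // fractional part
        const double alpha = (1.0 - frac) * curveData.at(index)
                           + frac * curveData.at(index + 1);
        return static_cast<std::uint8_t>((1.0 - alpha) * 255.0);
    }
};
```

With samples {1.0, 0.5, 0.0} and a resolution of 2, a distance of 0.25 lands halfway between the first two samples, so alpha is 0.75 and the returned mask value is 63.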

Before calling the value function, we pass the distance through a filter that applies a fade. If the filter returns false, the value is generated as is; otherwise the value is set according to the rules of the fadeMaker code below.

inline bool needFade(qreal dist, quint8 *value) {
    // Outside the radius: fully masked, no curve lookup needed.
    if (dist > m_radius) {
        *value = 255;
        return true;
    }

    if (!m_enableAntialiasing) {
        return false;
    }

    // Inside the antialiasing band: linear fade towards the edge.
    if (dist > m_antialiasingFadeStart) {
        *value = m_fadeStartValue + (dist - m_antialiasingFadeStart) * m_antialiasingFadeCoeff;
        return true;
    }

    return false;
}
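The way the two scalar pieces fit together can be sketched as follows. This is only an illustration under assumptions: `FadeMaker` here is a hypothetical plain struct holding the same five parameters, and `maskValue` and its `curveValue` callback are names invented for the example, standing in for the per-pixel call site and the value function.

```cpp
#include <cstdint>

// Hypothetical minimal holder for the fade parameters (not the Krita class).
struct FadeMaker {
    double radius;
    bool enableAntialiasing;
    double antialiasingFadeStart;
    double fadeStartValue;
    double antialiasingFadeCoeff;

    // Same rules as the scalar needFade: returns true when it wrote *value.
    bool needFade(double dist, std::uint8_t *value) const {
        if (dist > radius) {
            *value = 255;
            return true;
        }
        if (!enableAntialiasing) {
            return false;
        }
        if (dist > antialiasingFadeStart) {
            *value = static_cast<std::uint8_t>(
                fadeStartValue + (dist - antialiasingFadeStart) * antialiasingFadeCoeff);
            return true;
        }
        return false;
    }
};

// Per-pixel decision: use the fade value if the filter fired,
// otherwise fall back to the curve lookup (passed in as a callable).
template <typename CurveFn>
std::uint8_t maskValue(const FadeMaker &fader, double dist, CurveFn curveValue) {
    std::uint8_t v = 0;
    if (fader.needFade(dist, &v)) {
        return v;
    }
    return curveValue(dist);
}
```

The important point for the next section is that this is a per-pixel branch: each pixel takes exactly one of the two paths.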

Porting all operations to vectorized code

Mask result from the curve set by the user.

With this in mind we set out to implement the vectorized mask generator, starting by defining a template for the vectorized soft mask. Below is the relevant code that fetches the data from the generated curve in the same way as the scalar version, but now vectorized so that it processes several values at a time.

Vc::float_v::IndexType is used to truncate the float vector to its integer part, which is fast and safe. The gather method of the Vc vector fills a vector with the values found at the indexes given in the second argument.

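if (!excludeMask.isFull()) {
    // Scale all distances to curve-sample units at once.
    Vc::float_v valDist = dist * vCurveResolution;

    // Integer part of every lane, then back to float to get the fraction.
    Vc::float_v::IndexType vAlphaValue(valDist);
    Vc::float_v vFloatAlphaValue = vAlphaValue;

    Vc::float_v alphaValueF = valDist - vFloatAlphaValue;

    // Load curveData[i] and curveData[i + 1] for every lane.
    vCurvedData.gather(curveDataPointer, vAlphaValue);
    vCurvedData1.gather(curveDataPointer, vAlphaValue + 1);

    // more code truncated
}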
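For readers without Vc at hand, the truncate-then-gather pattern can be emulated lane by lane in plain C++. The sketch below does not use Vc at all: `kLanes`, `FloatLanes`, `IndexLanes` and the three helper functions are hypothetical names, the lane count of 4 is arbitrary, and the final interpolation step is my assumption that the truncated part mirrors the scalar formula.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 4;  // hypothetical lane count, for illustration only
using FloatLanes = std::array<float, kLanes>;
using IndexLanes = std::array<int, kLanes>;

// Per-lane float -> int conversion (truncation towards zero).
IndexLanes truncateToIndex(const FloatLanes &v) {
    IndexLanes idx{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        idx[i] = static_cast<int>(v[i]);
    }
    return idx;
}

// Emulated gather: every lane loads base[idx + offset] independently.
FloatLanes gather(const float *base, const IndexLanes &idx, int offset = 0) {
    FloatLanes out{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        out[i] = base[idx[i] + offset];
    }
    return out;
}

// Assumed continuation: the same linear interpolation as the scalar code.
// Requires non-negative inputs and a valid sample at idx + 1 for every lane.
FloatLanes interpolate(const float *curve, const FloatLanes &scaledDist) {
    const IndexLanes idx = truncateToIndex(scaledDist);
    const FloatLanes lo = gather(curve, idx);
    const FloatLanes hi = gather(curve, idx, 1);
    FloatLanes out{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        const float frac = scaledDist[i] - static_cast<float>(idx[i]);
        out[i] = (1.0f - frac) * lo[i] + frac * hi[i];
    }
    return out;
}
```

Each lane reads from an unrelated address, which is why the gathers dominate the cost: the arithmetic vectorizes well, the memory accesses much less so.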

The hard part of this implementation was defining the excludeMask. As you may remember, the scalar code decides case by case whether to apply the value function. That can't be done here, because we apply the same operation to all the data in parallel. If a value needs to be faded, the fader alters it; otherwise it does nothing. To achieve this, the vectorized fader function builds the exclusion mask for the values it altered and returns that mask. The final vectorized needFade function is reproduced below.

Vc::float_m needFade(Vc::float_v &dist) {
    const Vc::float_v vOne(Vc::One);
    const Vc::float_v vValMax(255.f);

    Vc::float_v vRadius(m_radius);
    Vc::float_v vFadeStartValue(m_fadeStartValue);
    Vc::float_v vAntialiasingFadeStart(m_antialiasingFadeStart);
    Vc::float_v vAntialiasingFadeCoeff(m_antialiasingFadeCoeff);

    // Lanes outside the radius are fully masked.
    Vc::float_m outsideMask = dist > vRadius;
    dist(outsideMask) = vOne;

    Vc::float_m fadeStartMask(false);

    if (m_enableAntialiasing) {
        fadeStartMask = dist > vAntialiasingFadeStart;
        // Fade only the lanes in the fade band that are not outside the radius.
        dist((outsideMask ^ fadeStartMask) & fadeStartMask) =
            (vFadeStartValue + (dist - vAntialiasingFadeStart)
             * vAntialiasingFadeCoeff) / vValMax;
    }
    // Every lane set here already holds its final value.
    return (outsideMask | fadeStartMask);
}

As with the Gaussian generator, the final steps include masking the extreme values (if any) to generate sane data for the mask.

Testing

To test correctness, a new case was added to the mask similarity tester. It helped find some masking issues and extreme values generated at some angles. Once all of that was identified and corrected, I ran the benchmarks and got the following results:

********* Start testing of FreehandStrokeBenchmark *********
## Before Vc optimization, 2.7 GHz Intel Core i5
testSoftTip() Cores: 1 Time: 20239 (ms)
testSoftTip() Cores: 2 Time: 11142 (ms)
testSoftTip() Cores: 3 Time: 10012 (ms)
testSoftTip() Cores: 4 Time: 9672 (ms)

## After Vc optimization, 2.7 GHz Intel Core i5
testSoftTip() Cores: 1 Time: 5523 (ms)
testSoftTip() Cores: 2 Time: 3345 (ms)
testSoftTip() Cores: 3 Time: 3063 (ms)
testSoftTip() Cores: 4 Time: 3112 (ms)

## After Vc running on Threadripper
testSoftTip() Cores: 1 Time: 5337 (ms)
testSoftTip() Cores: 2 Time: 3184 (ms)
testSoftTip() Cores: 3 Time: 2322 (ms)
testSoftTip() Cores: 4 Time: 1850 (ms)
testSoftTip() Cores: 5 Time: 1555 (ms)
testSoftTip() Cores: 6 Time: 1334 (ms)
testSoftTip() Cores: 7 Time: 1201 (ms)
testSoftTip() Cores: 8 Time: 1103 (ms)
# Cores 9~29 benchmark omitted
testSoftTip() Cores: 30 Time: 676 (ms)
testSoftTip() Cores: 31 Time: 695 (ms)
testSoftTip() Cores: 32 Time: 694 (ms)
PASS : FreehandStrokeBenchmark::testSoftTip()

********* Start testing of KisMaskGeneratorBenchmark *********
PASS : KisMaskGeneratorBenchmark::testCircularSoftScalarMask()
RESULT : KisMaskGeneratorBenchmark::testCircularSoftScalarMask():
23.99 msecs per iteration (total: 7,199, iterations: 300)
PASS : KisMaskGeneratorBenchmark::testCircularSoftVectorMask()
RESULT : KisMaskGeneratorBenchmark::testCircularSoftVectorMask():
4.339 msecs per iteration (total: 1,302, iterations: 300)

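The speed gain is good despite the frequent memory accesses needed to retrieve curveData values: in KisMaskGeneratorBenchmark the vectorized mask takes 4.339 ms per iteration against 23.99 ms for the scalar one (roughly 5.5x faster), and the single-core testSoftTip time on the Core i5 drops from 20239 ms to 5523 ms (roughly 3.7x). Also, because the Vc implementation runs multithreaded, the mask generator gets faster the more cores the CPU has, as the Threadripper benchmark demonstrates. (The same test on non-vectorized mask generators does not show any speed improvement with more cores.)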

