featured image for gauss optim

Optimizing Circular Gaussian Mask, Krita:GSoC


Previous implementation was based on a slow scalar model, calculating each mask value per coordinate. I implement a new vectorized code using Vc library to allow a robust SIMD usage, calculating the mask values in parallel. Not all operations are implemented on Vc data types, especially erf had to be implemented for Vc data types. The new implementation shows to be up to 10 times faster (on my system) on mask generation. Given that the mask generation requires the most computing on brush stroke generation, this speed improvement holds up even in the full brush stroke benchmarks. Given the way it is implemented the code can become faster as future SIMD registers grows on future CPUs.

Circular Gauss Optimization results.

Code study and implementation of Gauss Mask Mask generator.

Phabricator task: Implement Circular Gauss Mask Optim AVX
Vc creates code from templates tailored for each processor instruction set: AVX, AVX2, SSSE2, SSSE3, SSE2, and scalar. so first a template must be declared to manage the creation of each instructions set code. Using the vectorized Default Mask implementation as a guideline, studying how the code generates is constructed to provide the functionality allowed to extend it for the other MaskGenerators

KisBrushMaskApplicatorBase is an abstract class that defines the basic functionality our mask generator has to provide. This generator can fallback to a scalar implementation if the user host does not provide a system capable of SIMD parallelization (or it can be turned off if some bug causes the SIMD implementation to be unstable). The base class is defined like this, such that process is the function that processes the data to generate a mask inside the region defined by a QRect, savind all to the MaskProcessingData memory address.

class KisBrushMaskApplicatorBase
{
public:
    virtual ~KisBrushMaskApplicatorBase() {}
    virtual void process(const QRect &rect) = 0;

    inline void initializeData(const MaskProcessingData *data) {
        m_d = data;
    }

protected:
    const MaskProcessingData *m_d;
};


KisGaussCircleMaskGenerator is an subclass of KisMaskGenerator and on creation we have to generate a KisBrushMaskApplicator to use for placing the generated data that represents the mask. This generators are defined for as templates to adapt into code for each instructions set. So inside the constructor we call an applicator creator. This reset means that it will generate a new applicator based on the type given for the template.

d->applicator.reset(createOptimizedClass<MaskApplicatorFactory
     >(this));

MaskApplicatorFactory is defined in the following way

template<class MaskGenerator,
         template class Applicator>
struct MaskApplicatorFactory
{
    typedef MaskGenerator* ParamType;
    typedef KisBrushMaskApplicatorBase* ReturnType;

    template
    static ReturnType create(ParamType maskGenerator);
};

Vc _impl is substituted at compilation time to generate all the register version’s code. And the mask generator is generated for each of the MaskGenerator types declared. And depending on this we generate a different KisBrushMaskApplicator with different calls to a method call process which then calls either processScalar or processVector, depending on the MaskGenerator type (scalar or vectorized).

aplicator.reset for KisGaussCircleMaskGenerator ends up calling the template defined inside kis_brush_mask_applicator_factories

template<>
template<>
MaskApplicatorFactory<KisGaussCircleMaskGenerator, KisBrushMaskVectorApplicator>::ReturnType
MaskApplicatorFactory<KisGaussCircleMaskGenerator, KisBrushMaskVectorApplicator>::create<Vc::CurrentImplementation::current()>(ParamType maskGenerator)
{
    return new KisBrushMaskVectorApplicator<KisGaussCircleMaskGenerator,Vc::CurrentImplementation::current()>(maskGenerator);
}

The mask template generator has more elements than what I’ve shown, because it decides what generator to use depending on several conditions, however I hope this short exposition gave you the basic idea on how Krita decides on how a mask generator is made.

Implementing Vectorized ValueAt

ValueAt is the function name used to calculate the value of a certain coordinates (x,y) for the mask to be made. The first step to ensure the vectorization is a close as possible to the original result is to use valueAt as a foundation for the new vectorize implementation. However while the operations can be applied in the same fashion, if done so, the final result speed gain wont be as optimal. This was shown by the dummy implementation done for the proposal, not only did the gains were not as good as desired, also the result was, as shown in the previous entry, entirely different. The following points had to be taken into consideration in order to have the desired result.

  • Operations applied must be the same or equivalent.
  • All operations need to act on vector data types for maximum efficiency.
  • test, test, test!

A quick profiler shows that most of the time is spent in the calculation of the erf values


Scalar profile shows most of the time is spent in erf and exp operation.

Vectorizing ValueAt

The mask calculation core is done in scalar way like using this two functions. calculating the value using the error function to calculate the value to be returned for that particular coordinate. Because the shape is a circle, the value can be calculated using the distance from the center. The further away the mask lets less color to pass.

inline quint8 KisGaussCircleMaskGenerator::Private::value(qreal dist) const
{
    dist * = distfactor;
    quint8 ret = alphafactor * (erf(dist + center) - erf(dist - center));
    return (quint8) 255 - ret;
}
KisBrushMaskApplicatorBase\* KisGaussCircleMaskGenerator::applicator()
{
    return d->applicator.data();
}

quint8 KisGaussCircleMaskGenerator::valueAt(qreal x, qreal y) const
{
    if (isEmpty()) return 255;
    qreal xr = x;
    qreal yr = qAbs(y);
    fixRotation(xr, yr);

    qreal dist = sqrt(norme(xr, yr * d->ycoef));

    quint8 value;
    if (d->fadeMaker.needFade(dist, &value)) {
        return value;
    }

    return d->value(dist);
}

In the same way the distance is calculated in the vectorized code. However notice that vYCoeff, xr and yr are Vc::float_v vector types. And all operations on them, are parallel. The size of the vector is determined at runtime. In my case the vector size could process up to 8 values at the same time. Notice the value is recalculated based on a fadeMaker object, I’ll talk about it further on. The basic operation in Vc looks like this.

// d is our maskGenerator object
// vYCoeff is a Vc vector object
Vc::float_v vYCoeff(d->ycoef);

// The distance calculation
Vc::float_v dist = sqrt(pow2(xr) + pow2(yr * vYCoeff));

Vc::float_v valDist = dist * vDistfactor;
Vc::float_v fullFade = vAlphafactor * ( d->vErf(valDist + vCenter) - d->vErf(valDist - vCenter));

Reimplementing Error function

The Gauss mask relies on the error function to calculate the Gauss mask drop value, as shown by the initial profiling most the time is spent in this function so we need to parallelize it to obtain the most optimization. However Vc library did not have any implementation for that particular function and the standard math library would not work for a couple of reasons: first, it computes the values as doubles and while Vc can work with doubles because we need double the space the loose half the values in doing so. Second, the standard math does not work with Vc data types. So using it directly is almost not possible. It is possible to use the function directly is using Vc apply method, which call a function on every member in the vector and returns a new vector. While this works, as used in the dummy implementation, because the method itself does not work in parallel it could make the optimization not as good as theoretically possible.


Relative time use according to Instruments profiler. A huge amount of time is spent in this operator

For implementing our function I used an approximation to reduce the number of divisions and powers iterations. Also considering our input values were in a range between 0 and 100, it was possible to reduce complexity on filtering and still get a value within the expected error of 1e-04. The standard library math implementation also uses a case approach but it does so in a more detailed and mature way to work on a wide variety on situations, however it is not needed for this.

static inline Vc::float_v erf(Vc::float_v x) {
    Vc::float_v xa = abs(x);
    Vc::float_v sign(Vc::One);
    Vc::float_m invertMask = x < 0.f;
    sign(invertMask) = -1.f;

    // CONSTANTS
    float a1 =  0.254829592;
    float a2 = -0.284496736;
    float a3 =  1.421413741;
    float a4 = -1.453152027;
    float a5 =  1.061405429;
    float p  =  0.3275911;

    Vc::float_v t = 1.0f / (1.0f + p * xa);
    Vc::float_v y = 1.0f - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * exp(-xa * xa);
    y(precisionLimit) = 1.0f;
    return sign * y;
}

Fade values, the antialias parameter.


Antialiasing states: left is on. The fader only activates if soft value creates a hard edge.

Previously I talked about a fadeMaker used to process the values differently. This mask in particular makes use of a 1 dimensional fading factor. Used to generate the antialiasing values when the soft values are 1. Without the fader the border will not be antialiased on hard edge situations. Below I paste the original code.

static inline Vc::float_v erf(Vc::float_v x) {
        Vc::float_v xa = abs(x);
        Vc::float_v sign(Vc::One);
        Vc::float_m invertMask = x  m_radius) {
            \*value = 255;
            return true;
        }

        if (!m_enableAntialiasing) {
            return false;
        }

        if (dist > m_antialiasingFadeStart) {
            * value = m_fadeStartValue + (dist - m_antialiasingFadeStart)
                      * m_antialiasingFadeCoeff;
            return true;
        }

        return false;
    }

The code above can be vectorized as shown below, but a few things must be considered. Operations are applied to a particular coordinate up until a false is encountered, this is solved in the scalar version with a return. We cannot return for each false because SIMD operates on the entire vector on parallel, what its done instead is mask the values that do not need further processing.

    // BEGIN FadeMaker needFade vectorized
    // follow fademaker rules for outsideMask
    Vc::float_m outsideMask = dist > vFadeRadius;
    dist(outsideMask) = vOne;

    Vc::float_m fadeStartMask(false);
    // if antialias is off, do not process
    if(antialiasOn){
        fadeStartMask = dist > vFadeAFadeStart;
        dist((outsideMask ^ fadeStartMask) & fadeStartMask) = (vFadeStartValue + (dist - vFadeAFadeStart) * vFadeAFadeCoeff) / vValMax;
    }
    Vc::float_m excludeMask(outsideMask | fadeStartMask);

Final touches; masking values outside range

Once the fader is in place the mask returned is used to determine if there are still coordinates with no value. If excludeMask is all set to true, then we need no further processing and the values are copied to the buffer. On the other hand, the values are processed using the vErf function.

Masking is done for any value outside our permited range (0,1), it must be returned only in that range to ensure the behaviour is as expected. For it we generate a couple of mask for lowe and upper bounds value checks.

    if (!excludeMask.isFull()) {
        Vc::float_v valDist = dist * vDistfactor;
        Vc::float_v fullFade = vAlphafactor * ( d->vErf(valDist + vCenter) - d->vErf(valDist - vCenter));

        Vc::float_m mask;
        // Mask  undefined values, out of range are out of mask
        mask = Vc::isfinite(fullFade);
        fullFade.setZero(!mask);

        // Mask in the inner circe of the mask
        mask = fullFade  254.974f;
        fullFade(mask) = vValMax;

        // normalize values between 0 and 1
        Vc::float_v vFade = (vValMax - fullFade) / vValMax;

        // return original dist values before vFade transform
        vFade(excludeMask) = dist;
        vFade.store(bufferPointer, Vc::Aligned);

    } else {
      dist.store(bufferPointer, Vc::Aligned);
  }
  currentIndices = currentIndices + increment;

  bufferPointer += Vc::float_v::size();
}

Testing the implementation

Users expect a tool to behave in a certain way, because of this the new implementation has to render the same looking Mask, so artists can work as if nothing but the speed changed. For this we created a mask similarity test which compares the generated Masks between the old engine and the vectorized Vc one. So we generated and entry for this Mask Generator and setup the creation of the brush as the code below

KisGaussCircleMaskGenerator circVectr(500, 1.0, 0.5, 0.5, 2, true);

The test passed perfectly! And I proceeded to paint some to test the speed gains on a real world scenario. Just to find out not everything was sorted out.

Problems found! Improving unitTests

The new implementation was really fast, but also terribly wrong! It only behave ok on the test scenario values, anything outisde those values was was generating a a lot weird patterns! This post has reduced mutch of the struggle to the final code (because I do not have all the steps of all the struggles). but a short summary goes like this.

  1. Fader was not initially implemented! So first versions did not do any antialiasing producing ugly results.
  2. Weird patterns were appearing on particular conditions.
  3. Masks for values outside range appear to do nothing.


Some of the errors generated by masking issues and precission errors.

Almost all errors were fixed by looking closely to the operations and code and realizing some stuff was setup incorrectly. After adding the fader and fixing most wrongly assumed conditions I was only left with an error that generated pure white in the middle, or pure black on the borders. This error made me think the problem was in the value range masks, but nothing I did fixed that. A closer look revealed the true nature of the problem: On certain edge values the approximation for erf failed, giving extraordinary big values, or extraordinary low values.

static inline Vc::float_v erf(Vc::float_v x) {
        Vc::float_v xa = abs(x);

        // Precision is lost for any value beyond 9.3
        Vc::float_m precisionLimit(xa >= 9.3f);
        xa(precisionLimit) = 0;
        Vc::float_v sign(Vc::One);
        Vc::float_m invertMask = x < 0.f;
        sign(invertMask) = -1.f;

        // CONSTANTS
        float a1 =  0.254829592;
        float a2 = -0.284496736;
        float a3 =  1.421413741;
        float a4 = -1.453152027;
        float a5 =  1.061405429;
        float p  =  0.3275911;

        Vc::float_v t = 1.0f / (1.0f + p * xa);
        Vc::float_v y = 1.0f - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * exp(-xa * xa);
        y(precisionLimit) = 1.0f;
        return sign * y;
    }

This took care of any precision errors and made test pass on any combination of creation values (for this we made an iterative creator to try all possible values, just to be sure!). Precision wouldn't be a problem if the values were doubles but working in a double vector would reduce the working parallel values by half, and since it could solved easily masking the problematic values, this was the better approach.

Profiling code

No optimization can be fully finished without profiling the test, that is, knowing exactly how much resources a certain operation is taking. Because I'm workin on an OSX device, I set up an Instruments Time profiler to find out were the code is using the biggest amount of time. This helps me find what pieces of code need refactoring to avoid time loss. Because I was following a set of rules to strive for good optimization, the profiling did not show a particular operation slowing the operation down. They were some warnings that did not turn to be real problems.


Some operations appeared to be suspiciously taking time, however they probably expose some limitation of this profiling.

A special Benchmark for Mask generator was made, and using it to profile the Mask generator routines provided with enough repetitions to help find the slow bits of code. Below some screens showing the profiling general view for both process


Scalar run total time.


Vectorized run total time.

Final thoughts

For being the first mask generator to be implemented it presented a good amount of need processes to understand. And this made if slow to implement correctly. however an structured planning and well thought test environment allowed me to finish this implementation on time and ready for Krita 4.1!!

Now the next challenge is to implement a good SoftBrush Mask generator :]


Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s