Previous implementation was based on a slow scalar model, calculating each mask value per coordinate. I implement a new vectorized code using Vc library to allow a robust SIMD usage, calculating the mask values in parallel. Not all operations are implemented on Vc data types, especially *erf* had to be implemented for Vc data types. The new implementation shows to be up to 10 times faster (on my system) on mask generation. Given that the mask generation requires the most computing on brush stroke generation, this speed improvement holds up even in the full brush stroke benchmarks. Given the way it is implemented the code can become faster as future SIMD registers grows on future CPUs.

## Code study and implementation of Gauss Mask Mask generator.

**Phabricator task:** Implement Circular Gauss Mask Optim AVX

Vc creates code from templates tailored for each processor instruction set: AVX, AVX2, SSSE2, SSSE3, SSE2, and scalar. so first a template must be declared to manage the creation of each instructions set code. Using the vectorized Default Mask implementation as a guideline, studying how the code generates is constructed to provide the functionality allowed to extend it for the other **MaskGenerators**

*KisBrushMaskApplicatorBase* is an abstract class that defines the basic functionality our mask generator has to provide. This generator can fallback to a scalar implementation if the user host does not provide a system capable of SIMD parallelization (or it can be turned off if some bug causes the SIMD implementation to be unstable). The base class is defined like this, such that **process** is the function that processes the data to generate a mask inside the region defined by a **QRect**, savind all to the **MaskProcessingData** memory address.

class KisBrushMaskApplicatorBase { public: virtual ~KisBrushMaskApplicatorBase() {} virtual void process(const QRect &rect) = 0; inline void initializeData(const MaskProcessingData *data) { m_d = data; } protected: const MaskProcessingData *m_d; };

*KisGaussCircleMaskGenerator* is an subclass of *KisMaskGenerator* and on creation we have to generate a *KisBrushMaskApplicator* to use for placing the generated data that represents the mask. This generators are defined for as templates to adapt into code for each instructions set. So inside the constructor we call an applicator creator. This reset means that it will generate a new applicator based on the type given for the template.

d->applicator.reset(createOptimizedClass<MaskApplicatorFactory >(this));

**MaskApplicatorFactory** is defined in the following way

template<class MaskGenerator, template class Applicator> struct MaskApplicatorFactory { typedef MaskGenerator* ParamType; typedef KisBrushMaskApplicatorBase* ReturnType; template static ReturnType create(ParamType maskGenerator); };

Vc _impl is substituted at compilation time to generate all the register version’s code. And the mask generator is generated for each of the MaskGenerator types declared. And depending on this we generate a different *KisBrushMaskApplicator* with different calls to a method call *process* which then calls either *processScalar* or *processVector*, depending on the *MaskGenerator* type (scalar or vectorized).

*aplicator.reset* for *KisGaussCircleMaskGenerator* ends up calling the template defined inside **kis_brush_mask_applicator_factories**

template<> template<> MaskApplicatorFactory<KisGaussCircleMaskGenerator, KisBrushMaskVectorApplicator>::ReturnType MaskApplicatorFactory<KisGaussCircleMaskGenerator, KisBrushMaskVectorApplicator>::create<Vc::CurrentImplementation::current()>(ParamType maskGenerator) { return new KisBrushMaskVectorApplicator<KisGaussCircleMaskGenerator,Vc::CurrentImplementation::current()>(maskGenerator); }

The mask template generator has more elements than what I’ve shown, because it decides what generator to use depending on several conditions, however I hope this short exposition gave you the basic idea on how Krita decides on how a mask generator is made.

## Implementing Vectorized ValueAt

*ValueAt* is the function name used to calculate the value of a certain coordinates (**x**,**y**) for the mask to be made. The first step to ensure the vectorization is a close as possible to the original result is to use *valueAt* as a foundation for the new vectorize implementation. However while the operations can be applied in the same fashion, if done so, the final result speed gain wont be as optimal. This was shown by the dummy implementation done for the proposal, not only did the gains were not as good as desired, also the result was, as shown in the previous entry, entirely different. The following points had to be taken into consideration in order to have the desired result.

- Operations applied must be the same or equivalent.
- All operations need to act on vector data types for maximum efficiency.
- test, test, test!

A quick profiler shows that most of the time is spent in the calculation of the *erf* values

### Vectorizing ValueAt

The mask calculation core is done in scalar way like using this two functions. calculating the value using the error function to calculate the value to be returned for that particular coordinate. Because the shape is a circle, the value can be calculated using the distance from the center. The further away the mask lets less **color** to pass.

inline quint8 KisGaussCircleMaskGenerator::Private::value(qreal dist) const { dist * = distfactor; quint8 ret = alphafactor * (erf(dist + center) - erf(dist - center)); return (quint8) 255 - ret; } KisBrushMaskApplicatorBase\* KisGaussCircleMaskGenerator::applicator() { return d->applicator.data(); } quint8 KisGaussCircleMaskGenerator::valueAt(qreal x, qreal y) const { if (isEmpty()) return 255; qreal xr = x; qreal yr = qAbs(y); fixRotation(xr, yr); qreal dist = sqrt(norme(xr, yr * d->ycoef)); quint8 value; if (d->fadeMaker.needFade(dist, &value)) { return value; } return d->value(dist); }

In the same way the distance is calculated in the vectorized code. However notice that **vYCoeff**, **xr** and **yr** are *Vc::float_v* vector types. And all operations on them, are parallel. The size of the vector is determined at runtime. In my case the vector size could process up to 8 values at the same time. Notice the value is recalculated based on a fadeMaker *object*, I’ll talk about it further on. The basic operation in Vc looks like this.

// d is our maskGenerator object // vYCoeff is a Vc vector object Vc::float_v vYCoeff(d->ycoef); // The distance calculation Vc::float_v dist = sqrt(pow2(xr) + pow2(yr * vYCoeff)); Vc::float_v valDist = dist * vDistfactor; Vc::float_v fullFade = vAlphafactor * ( d->vErf(valDist + vCenter) - d->vErf(valDist - vCenter));

### Reimplementing Error function

The Gauss mask relies on the *error* function to calculate the Gauss mask drop value, as shown by the initial profiling most the time is spent in this function so we need to parallelize it to obtain the most optimization. However Vc library did not have any implementation for that particular function and the standard math library would not work for a couple of reasons: *first*, it computes the values as **doubles** and while Vc can work with doubles because we need double the space the loose half the values in doing so. *Second*, the standard math does not work with Vc data types. So using it directly is almost not possible. It is possible to use the function directly is using Vc *apply* method, which call a function on every member in the **vector** and returns a new vector. While this works, as used in the dummy implementation, because the *method* itself does not work in parallel it could make the optimization not as good as theoretically possible.

For implementing our function I used an approximation to reduce the number of divisions and powers iterations. Also considering our input values were in a range between 0 and 100, it was possible to reduce complexity on filtering and still get a value within the expected error of 1e-04. The standard library math implementation also uses a case approach but it does so in a more detailed and mature way to work on a wide variety on situations, however it is not needed for this.

static inline Vc::float_v erf(Vc::float_v x) { Vc::float_v xa = abs(x); Vc::float_v sign(Vc::One); Vc::float_m invertMask = x < 0.f; sign(invertMask) = -1.f; // CONSTANTS float a1 = 0.254829592; float a2 = -0.284496736; float a3 = 1.421413741; float a4 = -1.453152027; float a5 = 1.061405429; float p = 0.3275911; Vc::float_v t = 1.0f / (1.0f + p * xa); Vc::float_v y = 1.0f - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * exp(-xa * xa); y(precisionLimit) = 1.0f; return sign * y; }

### Fade values, the antialias parameter.

Previously I talked about a **fadeMaker** used to process the values differently. This mask in particular makes use of a 1 dimensional fading factor. Used to generate the antialiasing values when the soft values are 1. Without the fader the border will not be antialiased on hard edge situations. Below I paste the original code.

static inline Vc::float_v erf(Vc::float_v x) { Vc::float_v xa = abs(x); Vc::float_v sign(Vc::One); Vc::float_m invertMask = x m_radius) { \*value = 255; return true; } if (!m_enableAntialiasing) { return false; } if (dist > m_antialiasingFadeStart) { * value = m_fadeStartValue + (dist - m_antialiasingFadeStart) * m_antialiasingFadeCoeff; return true; } return false; }

The code above can be vectorized as shown below, but a few things must be considered. Operations are applied to a particular coordinate up until a *false* is encountered, this is solved in the scalar version with a **return**. We cannot return for each false because SIMD operates on the entire vector on parallel, what its done instead is mask the values that do not need further processing.

// BEGIN FadeMaker needFade vectorized // follow fademaker rules for outsideMask Vc::float_m outsideMask = dist > vFadeRadius; dist(outsideMask) = vOne; Vc::float_m fadeStartMask(false); // if antialias is off, do not process if(antialiasOn){ fadeStartMask = dist > vFadeAFadeStart; dist((outsideMask ^ fadeStartMask) & fadeStartMask) = (vFadeStartValue + (dist - vFadeAFadeStart) * vFadeAFadeCoeff) / vValMax; } Vc::float_m excludeMask(outsideMask | fadeStartMask);

## Final touches; masking values outside range

Once the fader is in place the mask returned is used to determine if there are still coordinates with no value. If *excludeMask* is all set to true, then we need no further processing and the values are copied to the buffer. On the other hand, the values are processed using the *vErf* function.

Masking is done for any value outside our permited range (0,1), it must be returned only in that range to ensure the behaviour is as expected. For it we generate a couple of mask for lowe and upper bounds value checks.

if (!excludeMask.isFull()) { Vc::float_v valDist = dist * vDistfactor; Vc::float_v fullFade = vAlphafactor * ( d->vErf(valDist + vCenter) - d->vErf(valDist - vCenter)); Vc::float_m mask; // Mask undefined values, out of range are out of mask mask = Vc::isfinite(fullFade); fullFade.setZero(!mask); // Mask in the inner circe of the mask mask = fullFade 254.974f; fullFade(mask) = vValMax; // normalize values between 0 and 1 Vc::float_v vFade = (vValMax - fullFade) / vValMax; // return original dist values before vFade transform vFade(excludeMask) = dist; vFade.store(bufferPointer, Vc::Aligned); } else { dist.store(bufferPointer, Vc::Aligned); } currentIndices = currentIndices + increment; bufferPointer += Vc::float_v::size(); }

## Testing the implementation

Users expect a tool to behave in a certain way, because of this the new implementation has to render the same looking Mask, so artists can work as if nothing but the speed changed. For this we created a *mask similarity test* which compares the generated Masks between the old engine and the vectorized Vc one. So we generated and entry for this Mask Generator and setup the creation of the brush as the code below

KisGaussCircleMaskGenerator circVectr(500, 1.0, 0.5, 0.5, 2, true);

The test passed perfectly! And I proceeded to paint some to test the speed gains on a real world scenario. Just to find out not everything was sorted out.

### Problems found! Improving unitTests

The new implementation was really fast, but also terribly wrong! It only behave ok on the test scenario values, anything outisde those values was was generating a a lot weird patterns! This post has reduced mutch of the struggle to the final code (because I do not have all the steps of all the struggles). but a short summary goes like this.

- Fader was not initially implemented! So first versions did not do any antialiasing producing ugly results.
- Weird patterns were appearing on particular conditions.
- Masks for values outside range appear to do nothing.

Almost all errors were fixed by looking closely to the operations and code and realizing some stuff was setup incorrectly. After adding the fader and fixing most wrongly assumed conditions I was only left with an error that generated pure white in the middle, or pure black on the borders. This error made me think the problem was in the value range masks, but nothing I did fixed that. A closer look revealed the true nature of the problem: On certain edge values the approximation for *erf* failed, giving extraordinary big values, or extraordinary low values.

static inline Vc::float_v erf(Vc::float_v x) { Vc::float_v xa = abs(x); // Precision is lost for any value beyond 9.3 Vc::float_m precisionLimit(xa >= 9.3f); xa(precisionLimit) = 0; Vc::float_v sign(Vc::One); Vc::float_m invertMask = x < 0.f; sign(invertMask) = -1.f; // CONSTANTS float a1 = 0.254829592; float a2 = -0.284496736; float a3 = 1.421413741; float a4 = -1.453152027; float a5 = 1.061405429; float p = 0.3275911; Vc::float_v t = 1.0f / (1.0f + p * xa); Vc::float_v y = 1.0f - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * exp(-xa * xa); y(precisionLimit) = 1.0f; return sign * y; }

This took care of any precision errors and made test pass on any combination of creation values (for this we made an iterative creator to try all possible values, just to be sure!). Precision wouldn't be a problem if the values were **doubles** but working in a double vector would reduce the working parallel values by half, and since it could solved easily masking the problematic values, this was the better approach.

## Profiling code

No optimization can be fully finished without profiling the test, that is, knowing exactly how much resources a certain operation is taking. Because I'm workin on an **OSX** device, I set up an *Instruments* **Time profiler** to find out were the code is using the biggest amount of time. This helps me find what pieces of code need refactoring to avoid time loss. Because I was following a set of rules to strive for good optimization, the profiling did not show a particular operation slowing the operation down. They were some warnings that did not turn to be real problems.

A special Benchmark for Mask generator was made, and using it to profile the Mask generator routines provided with enough repetitions to help find the slow bits of code. Below some screens showing the profiling general view for both *process*

## Final thoughts

For being the first mask generator to be implemented it presented a good amount of need processes to understand. And this made if slow to implement correctly. however an structured planning and well thought test environment allowed me to finish this implementation on time and ready for Krita 4.1!!

Now the next challenge is to implement a good SoftBrush Mask generator :]

Nice work. Thanks for working on it.

Thanks!