
ShapeConvs


Differentiable Shape Rendering

We want to introduce a set of differentiable operations called ShapeConvs that render different shapes from a compact parameterization.

A key insight is that every pixel has to be given all the information it needs to decide for itself whether it should be painted. We can then combine common operations from convolutional neural networks to render the shapes. This saves a lot of computational resources while keeping the gradient tractable.

Coordinates Encoding

Pixel Coordinates

For each pixel we inject its coordinates (xc,yc) as two additional channel maps, normalized to the range [0,1]. This is heavily inspired by Uber's CoordConv paper (blog and paper).

You really should check out their Blog. It's well written and also offers a wonderful video explaining the CoordConv layer.

Below we visualize the pixel coordinate layers as an image. Each pixel has a different coordinate and therefore a different color. Simply by looking at the color of a pixel it is possible to tell where in the image the pixel lies.
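
To make this concrete, here is a minimal sketch of how such coordinate channels can be built (we use TensorFlow 2 here; the original implementation may differ in its details):

    import tensorflow as tf

    def pixel_coordinate_channels(height, width):
        # Two channel maps holding the normalized (x, y) coordinate of every pixel.
        ys = tf.linspace(0.0, 1.0, height)            # normalized row coordinates, shape (H,)
        xs = tf.linspace(0.0, 1.0, width)             # normalized column coordinates, shape (W,)
        yc, xc = tf.meshgrid(ys, xs, indexing="ij")   # each of shape (H, W)
        return tf.stack([xc, yc], axis=-1)            # shape (H, W, 2)

    coords = pixel_coordinate_channels(64, 64)        # coords[i, j] == (x, y) of pixel (i, j)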

RectConv

All the information each pixel needs to decide whether it should be painted is given as several channels. The rectangle to be drawn is parameterized by the four values ((x1,y1),(x2,y2)), representing its upper-left and lower-right corners, normalized to [0,1].

We merge the two pixel coordinate channels and the four rectangle coordinate channels along the feature axis.

(I) coordinates_image = [xc,yc,x1,y1,x2,y2] channelwise
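
A small sketch of this merge (TensorFlow 2, reusing the pixel_coordinate_channels helper from above; names and details are illustrative):

    import tensorflow as tf

    def coordinates_image(coords, rect):
        # coords: (H, W, 2) normalized pixel coordinates (xc, yc)
        # rect:   (4,) tensor (x1, y1, x2, y2), normalized to [0, 1]
        h, w = coords.shape[0], coords.shape[1]
        # Broadcast each rectangle coordinate to a constant channel map of size (H, W).
        rect_maps = tf.ones([h, w, 4]) * tf.reshape(rect, [1, 1, 4])
        # Channel-wise concatenation: result has shape (H, W, 6) = [xc, yc, x1, y1, x2, y2].
        return tf.concat([coords, rect_maps], axis=-1)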

Transformation

Constraint Satisfaction Evaluation

Each pixel (xc,yc) lies within rectangle ((x1,y1),(x2,y2)) if

x1 ≤ xc ≤ x2 and y1 ≤ yc ≤ y2

which (assuming x1 ≤ x2 and y1 ≤ y2) is exactly true if

(xc - x1)(x2 - xc) ≥ 0 and (yc - y1)(y2 - yc) ≥ 0

We want to evaluate every pixel on our canvas and check whether or not it should be painted. By using existing neural network operations we keep the computation efficient while the deep learning framework maintains a gradient flow for later learning.

The output of our convolutional layer has two channels; each one measures how much the corresponding constraint is fulfilled or violated.

(II) differences = (x-x1,y-y1) * (x2-x,y2-y)

Now every value greater than zero indicates that the constraint is fulfilled, while negative values indicate a violation.

Inside/Out Decision

We apply a Rectified Linear Unit (ReLU) on the convolved feature maps.

Pixels outside of the rectangle, whose values were negative, are now all set to zero. The pixels inside the rectangle keep the same intensity as before the ReLU operation.

(III) filtered = relu(differences)

Constraint Collapse

The four-dimensional filtered tensor gets collapsed to a scalar map of the same resolution by multiplying along the feature axis.

If any constraint was violated, its channel has been set to 0 and therefore the corresponding pixel of the collapsed output is also 0. So every pixel in the collapsed tensor that is greater than 0 lies within the wanted rectangle.

(IV) collapsed = multiply_along_feature_axis(filtered)

Soft Binarize

Our desired output should be binary, as pixels can either be painted or not. Unfortunately this binarization cannot be implemented directly, because we can't propagate a gradient through hard boundaries. Therefore we use a hyperbolic tangent on a strongly scaled input as a "soft" binary operation.

tanh - hyperbolic tangent

After the inside/out ReLU we know all values to be greater than or equal to zero. Applying a tanh function squeezes them into the range [0,1). The desired output of our renderer should always be zero or one. To avoid grayish values at the edges and get crisp renderings, we scale the input up by several orders of magnitude and effectively 'sharpen up' our rendering.

Choosing the right number of orders of magnitude is a delicate business. Too few and our renderings look blurry, because the operation is not "binary enough". Too many and we face the vanishing gradient problem because of the saturating activation function.

(V) binary = tanh(1e5 * collapsed)
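
A quick numeric illustration of the effect (the values below are made up, not taken from the actual renderer):

    import tensorflow as tf

    collapsed = tf.constant([0.0, 1e-6, 1e-4, 4e-2])   # assumed pixel values after the collapse
    print(tf.tanh(collapsed).numpy())                  # ~[0.0, 0.000001, 0.0001, 0.04] -> grayish
    print(tf.tanh(1e5 * collapsed).numpy())            # ~[0.0, 0.0997, 1.0, 1.0]       -> crisp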

Colorize

The final output of our rendering operation has a certain color.

(VI) output = color * binary

Below we compare a rendered rectangle using plain numpy (left) vs our differentiable tensorflow implementation (right).

RectConv Operation

  1. coordinates_image = [xc,yc,x1,y1,x2,y2] channelwise
  2. differences = (x-x1,y-y1) * (x2-x,y2-y)
  3. filtered = relu(differences)
  4. collapsed = multiply_along_feature_axis(filtered)
  5. binary = tanh(1e5 * collapsed)
  6. output = color * binary
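
Putting steps (I) to (VI) together, a compact sketch of the whole operation could look like this (TensorFlow 2; for brevity we compute the differences directly instead of building the six-channel coordinate image, and all names are illustrative):

    import tensorflow as tf

    def rect_conv(rect, color, height=64, width=64):
        # rect:  (4,) tensor (x1, y1, x2, y2), normalized to [0, 1], with x1 <= x2 and y1 <= y2
        # color: scalar intensity for the painted pixels
        ys = tf.linspace(0.0, 1.0, height)
        xs = tf.linspace(0.0, 1.0, width)
        yc, xc = tf.meshgrid(ys, xs, indexing="ij")                 # (I) pixel coordinates
        x1, y1, x2, y2 = tf.unstack(rect)

        differences = tf.stack([(xc - x1) * (x2 - xc),              # (II) per-axis constraints,
                                (yc - y1) * (y2 - yc)], axis=-1)    #      positive only inside

        filtered = tf.nn.relu(differences)                          # (III) suppress violations
        collapsed = tf.reduce_prod(filtered, axis=-1)               # (IV) all constraints must hold
        binary = tf.tanh(1e5 * collapsed)                           # (V) soft binarization
        return color * binary                                       # (VI) colorize

    canvas = rect_conv(tf.constant([0.25, 0.25, 0.75, 0.5]), color=1.0)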

Supervised Rectangle Rendering

Dataset

Supervised Rendering

Liu et al. defined the Supervised Rendering task: given an (x,y) coordinate, render a 64x64 pixel image containing a white square of side length 9. A combination of CoordConv and convolutions solves this remarkably well, while a plain convolutional neural network gives a disappointing result.

Supervised Rectangle Rendering

Initially we wanted to extend the square rendering toy dataset, but we also found a rectangle rendering dataset interesting, because it encodes not only the position of the rendered rectangle but also its dimensions.

Each rendered canvas has dimensions 64x64 and contains a single rectangle at a random position with a random shape. Every side has a random length in the range [4,40] pixels.
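
Samples of this kind can be generated on the fly, for example along these lines (a NumPy sketch; the exact sampling scheme of the original dataset may differ):

    import numpy as np

    def random_rectangle(canvas=64, min_side=4, max_side=40):
        # Sample a rectangle (x1, y1, x2, y2) in pixel coordinates, then normalize to [0, 1].
        w = np.random.randint(min_side, max_side + 1)
        h = np.random.randint(min_side, max_side + 1)
        x1 = np.random.randint(0, canvas - w + 1)
        y1 = np.random.randint(0, canvas - h + 1)
        return np.array([x1, y1, x1 + w, y1 + h], dtype=np.float32) / canvas

    def render_numpy(rect, canvas=64):
        # Plain, non-differentiable reference rendering used as the training target.
        img = np.zeros((canvas, canvas), dtype=np.float32)
        x1, y1, x2, y2 = (rect * canvas).astype(int)
        img[y1:y2, x1:x2] = 1.0
        return img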

Given a rendered rectangle, a convolutional encoder regresses the rectangle's bounding box [x1,y1,x2,y2]. Then a RectConv operation renders the encoded rectangle. This compact autoencoding is guided by the Intersection over Union metric.

Model

Network

We use some vanilla convolutional layers to transform a given rendering into the four values required for the upper-left and lower-right corners of a rectangle. Each convolution has a stride of 2, iteratively downsampling the input until the feature maps have size 4x4. Then we apply three fully connected layers, the final one returning four scalar values representing the rectangle coordinates (x1,y1,x2,y2).

Neural Network Architecture

ReLU is applied after each convolutional and fully connected layer. All biases are initialized with zero except for the final fully connected layer. The final bias should be initialized in a space-spanning way, so that the initial rectangles cover a significant portion of the canvas (we use 25% of the total area). If all four values were very small, the resulting rectangle would also be very small: the probability of hitting a given target rectangle would be low, training would be unstable, and the small coverage would yield a very weak gradient signal. So (x1,y1) is initialized with 0.25 and (x2,y2) with 0.75. After the final ReLU an additional tanh function guarantees valid pixel coordinates.
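
The description above translates roughly into the following encoder sketch (tf.keras; the filter counts and hidden layer widths are assumptions, only the overall structure follows the text):

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_encoder(canvas=64):
        inputs = tf.keras.Input(shape=(canvas, canvas, 1))
        x = inputs
        # Stride-2 convolutions downsample 64 -> 32 -> 16 -> 8 -> 4.
        for filters in (16, 32, 64, 128):
            x = layers.Conv2D(filters, 3, strides=2, padding="same",
                              activation="relu", bias_initializer="zeros")(x)
        x = layers.Flatten()(x)
        # Three fully connected layers, ReLU after each one.
        x = layers.Dense(128, activation="relu", bias_initializer="zeros")(x)
        x = layers.Dense(64, activation="relu", bias_initializer="zeros")(x)
        # Final bias initialized so the initial rectangle covers a quarter of the canvas.
        x = layers.Dense(4, activation="relu",
                         bias_initializer=tf.constant_initializer([0.25, 0.25, 0.75, 0.75]))(x)
        # tanh on non-negative inputs maps into [0, 1), i.e. valid normalized coordinates.
        outputs = layers.Activation("tanh")(x)
        return tf.keras.Model(inputs, outputs)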

Loss

One difference to the original definition is the change in loss function. While a per-pixel sigmoid activation and cross entropy loss work well together, we want to think of this problem in a more geometric way. RectConv essentially produces binary shapes, so we argue that an Intersection over Union (IoU) loss better describes the coverage of geometric shapes.

To use IoU as a loss function, its calculation has to be differentiable. We follow the implementation of Atiqur et al. to approximate this metric in an elegant way.
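
The basic idea of a differentiable (soft) IoU is to replace the hard intersection and union by element-wise products and sums. A minimal sketch (our own simplification, not the exact formulation from the paper):

    import tensorflow as tf

    def soft_iou_loss(y_true, y_pred, eps=1e-6):
        # Flatten each sample to a vector of pixel values.
        t = tf.reshape(y_true, [tf.shape(y_true)[0], -1])
        p = tf.reshape(y_pred, [tf.shape(y_pred)[0], -1])
        intersection = tf.reduce_sum(t * p, axis=1)
        union = tf.reduce_sum(t + p - t * p, axis=1)
        return 1.0 - intersection / (union + eps)   # minimize 1 - IoU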

Train

Each training batch contains the rectangle coordinates and the rendered images, and is randomly created on the fly. Our network sees two million training samples in batches of size 8.

We use a vanilla Adam optimizer with a base learning rate of 0.0001 and decrease it exponentially by a factor of 0.9 every 3000 steps.

Learning Rate Schedule
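
In TensorFlow 2 such a schedule can be expressed like this (whether the decay is staircased or smooth is an assumption; the text only states the base rate, factor and step count):

    import tensorflow as tf

    schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=1e-4, decay_steps=3000, decay_rate=0.9, staircase=True)
    optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)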

The IoU measures how well the initial rectangle and the reconstructed rectangle match each other. An IoU of zero means no overlap at all, while an IoU of one signals perfect overlap.

As one might expect, randomly initialized parameters yield a terrible performance. But as training progresses and the parameters are optimized over several thousand steps, the IoU climbs to almost 1 (final value 0.9944).

Intersection over Union

You can download our trained model here. Have a look at the beginning of the training progress: initially the reconstructed rectangles are far off, but the network quickly learns to reconstruct the given rectangle remarkably well.

first 50k training steps

The training is quite fast; the whole process takes about 15 minutes on a GTX 1070. An IoU of nearly 1 on randomly generated data is exactly what we anticipated. RectConv can be used in combination with vanilla neural network layers. Nice!

CircleConv

Now that we can render rectangles, we want to render more shapes. Let's write a differentiable circle renderer. All we need is basic geometry expressed in vector arithmetic.

CircleConv Operation

  1. pixel_coords = [x,y] channelwise
  2. circle_coords = [xc,yc,r] channelwise
  3. diff_center = sum([xc-x,yc-y]^2) 
  4. constraint_satisfaction = r^2 - diff_center
  5. filtered = relu(constraint_satisfaction)
  6. binary = tanh(1e5 * filtered)
  7. output = color * binary
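
Analogous to RectConv, a sketch of the whole circle operation (TensorFlow 2, illustrative names):

    import tensorflow as tf

    def circle_conv(circle, color, height=64, width=64):
        # circle: (3,) tensor (xc, yc, r) -- center and radius, normalized to [0, 1]
        ys = tf.linspace(0.0, 1.0, height)
        xs = tf.linspace(0.0, 1.0, width)
        y, x = tf.meshgrid(ys, xs, indexing="ij")
        xc, yc, r = tf.unstack(circle)

        diff_center = (xc - x) ** 2 + (yc - y) ** 2     # squared distance to the center
        constraint_satisfaction = r ** 2 - diff_center  # >= 0 exactly inside the circle
        filtered = tf.nn.relu(constraint_satisfaction)
        binary = tf.tanh(1e5 * filtered)
        return color * binary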

TriConv

Rendering arbitrary triangles is a key ingredient of modern rendering engines: triangles are a very primitive shape, can be computed efficiently, and can approximate any other shape as closely as desired. Here we restrict ourselves to two-dimensional triangles.

To decide for every pixel whether it should be painted, we transform its Cartesian coordinates to barycentric coordinates and check the resulting lambdas.

The calculation of the lambdas is straightforward. We evaluate the constraints in two steps: first we check that each lambda is positive, and second we check that it is smaller than 1.

TriConv Operation

  1. pixel_coords = [x,y] channelwise
  2. triangle_coords = [x1,y1,x2,y2,x3,y3] channelwise
  3. denominator = (y2-y3)*(x1-x3) + (x3-x2)*(y1-y3)
  4. L1 = ( (y2-y3)*(x-x3)+(x3-x2)*(y-y3) ) / denominator 
  5. L2 = ( (y3-y1)*(x-x3)+(x1-x3)*(y-y3) ) / denominator
  6. L3 = 1 - L1 - L2
  7. L01 = relu( L1 )
  8. L02 = relu( L2 )
  9. L03 = relu( L3 )
  10. L11 = relu( -L1 + 1 )
  11. L12 = relu( -L2 + 1 )
  12. L13 = relu( -L3 + 1 )
  13. inside = L01 * L02 * L03 * L11 * L12 * L13 
  14. binary = tanh(1e5 * inside)
  15. output = color * binary
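
And a sketch of the triangle operation (TensorFlow 2, illustrative names):

    import tensorflow as tf

    def tri_conv(triangle, color, height=64, width=64):
        # triangle: (6,) tensor (x1, y1, x2, y2, x3, y3), normalized to [0, 1]
        ys = tf.linspace(0.0, 1.0, height)
        xs = tf.linspace(0.0, 1.0, width)
        y, x = tf.meshgrid(ys, xs, indexing="ij")
        x1, y1, x2, y2, x3, y3 = tf.unstack(triangle)

        # Barycentric coordinates of every pixel with respect to the triangle.
        denominator = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
        l1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / denominator
        l2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / denominator
        l3 = 1.0 - l1 - l2

        # Inside the triangle exactly if every lambda lies in [0, 1].
        lower = tf.nn.relu(tf.stack([l1, l2, l3], axis=-1))              # lambda >= 0
        upper = tf.nn.relu(tf.stack([1 - l1, 1 - l2, 1 - l3], axis=-1))  # lambda <= 1
        inside = tf.reduce_prod(lower, axis=-1) * tf.reduce_prod(upper, axis=-1)

        binary = tf.tanh(1e5 * inside)
        return color * binary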

Conclusion

In this article we propose a new group of efficient, precise and differentiable shape renderers called ShapeConvs. By broadcasting the representation of a shape onto each pixel we can express the rendering process as a combination of basic operations manipulating every pixel. We calculate how much each individual pixel violates or fulfills the constraints of the shape. Rectified Linear Units offer a switch to suppress all pixels outside the desired shape, followed by a soft binary operation that affects only the pixels on the inside, yielding a binary mask which can be colored afterwards.

Because we use only common basic operations, our code can be replicated effortlessly in a few lines of code in all modern deep learning frameworks and simply plugged into existing neural network architectures.

In the future we would like to render not only 2D shapes, but also 3D faces. If we could render arbitrary faces in a differentiable way, we could perhaps infer 3D models, in a compact mesh representation, from images of the same object taken from several viewpoints.
