Category: Code

Bit packing Pacman

Haven’t posted in a while, since I’ve been heads down building a lot of cool tooling at work (blog posts coming), but I had a chance to mess around a bit with something that came up in an interview question this week.

I frequently ask candidates a high level design question to build PacMan. Games like pacman are fun because on the surface they are very simple, but if you don’t structure your entities and their interactions correctly the design falls apart.

At some point during the interview we had scaled the question up such that the problem became knowing, at a particular point on the board, what is nearby. For example, if the board is 100,000 x 100,000 (10 billion cells), how efficiently can we determine if there is a nugget/wall next to us? One option is to store all of these entities in a 2d array and just access the neighbors. However, if the entity is any non trivial object, then each one is at minimum 16 bytes. That means we’re storing 160 gigs just to represent the board. Probably not something we can realistically do on commodity hardware.

Given we’re answering only an “is something there or not” question, one option is to bit pack the answer: each bit represents a coordinate in your grid. For example, in a 2D grid

0 1
2 3

These positions could be represented by the binary value at that bit:

0 = 0b0001
1 = 0b0010
2 = 0b0100
3 = 0b1000

If we do that, and we store an array of longs (64 bits, 8 bytes each), then to store 10 billion bits we need:

private val maxBits = maxX.toLong * maxY.toLong
private val requiredLongs = ((maxBits / 64) + 1).toInt

Which ends up being 156,250,001 longs, which in turn is about 1.25 GB (note the toLong above: 100,000 * 100,000 overflows an Int, which is an easy way to silently end up with a much smaller, and wrong, array). That’s… a big savings. Considering that in the trivial form we stored 10,000,000,000 16-byte objects (160 GB), this is roughly a 128x reduction.

Now, one thing the candidate brought up (which is a great point) is that this makes working with the values much more difficult. The answer here is to provide a higher level API that hides away the hard bits.

I figured today I’d sit down and do just that. We need to be able to do a few things:

  1. Find out how many longs to store
  2. Find out given a coordinate which long it belongs to
  3. In that long toggle the bit representing the coordinate if we want to set/unset it

class TwoDBinPacker(maxX: Int, maxY: Int) {
  private val maxBits = maxX.toLong * maxY.toLong
  private val requiredLongs = ((maxBits / 64) + 1).toInt
  private val longArray = new Array[Long](requiredLongs)

  def get(x: Int, y: Int): Boolean = {
    longAtPosition(x, y).value == 1
  }

  def set(x: Int, y: Int, value: Boolean): Unit = {
    val p = longAtPosition(x, y)

    longArray(p.index) = p.set(value)
  }

  private def longAtPosition(x: Int, y: Int): BitValue = {
    // flatten the 2d coordinate into a single bit index (as a Long to avoid overflow)
    val flattenedPosition = y.toLong * maxX + x

    // which long in the array holds that bit
    val longAtPosition = (flattenedPosition / 64).toInt

    // which bit within that long
    val bitAtPosition = (flattenedPosition % 64).toInt

    BitValue(longAtPosition, longArray(longAtPosition), bitAtPosition)
  }
}

With the helper class of a BitValue looking like:

case class BitValue(index: Int, container: Long, bitNumber: Int) {
  val value = (container >> bitNumber) & 1

  def set(boolean: Boolean): Long = {
    if (boolean) {
      // note the 1L: an Int shift would wrap around past bit 31
      val maskAt = 1L << bitNumber

      container | maskAt
    } else {
      val maskAt = ~(1L << bitNumber)

      container & maskAt
    }
  }
}
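
To make the index math concrete, here’s roughly what longAtPosition works out to on a 100,000-wide board (values worked out by hand, just for illustration):

// (x = 0,   y = 0)   -> flattened = 0 -> long 0, bit 0
// (x = 200, y = 400) -> flattened = 400L * 100000 + 200 = 40,000,200
//                       -> long 40,000,200 / 64 = 625,003, bit 40,000,200 % 64 = 8
val packer = new TwoDBinPacker(100000, 100000)
packer.set(200, 400, value = true) // flips bit 8 of longArray(625003)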

At this point we can drive a scalatest:

"Bit packer" should "pack large sets (10 billion!)" in {
  val packer = new TwoDBinPacker(100000, 100000)

  packer.set(0, 0, true)
  packer.set(200, 400, true)

  assert(packer.get(0, 0))
  assert(packer.get(200, 400))
  assert(!packer.get(99999, 88888))
}

And this test runs in 80ms.

Now, this is a pretty naive way of doing things, since we are potentially storing tons of unused longs. A smarter way would be to use a sparse set with skip lists, such that as you use a long you create it and mark it used, while the gaps before and after it (up to the next used long) are marker blocks that can span large ranges. I.e.

{EmptyBlock}[long, long, long]{EmptyBlock}[long]

This way you don’t have to store things you don’t actually set.
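
As a rough illustration of the sparse idea (this sketch uses a plain hash map of occupied 64-bit words rather than a real skip-list structure, so it’s not the layout described above, just the same principle):

import scala.collection.mutable

// Sketch only: stores just the words that contain at least one set bit
class SparseTwoDBinPacker(maxX: Int, maxY: Int) {
  private val words = mutable.Map.empty[Long, Long]

  def set(x: Int, y: Int, value: Boolean): Unit = {
    val flattened = y.toLong * maxX + x
    val (word, bit) = (flattened / 64, flattened % 64)
    val updated =
      if (value) words.getOrElse(word, 0L) | (1L << bit)
      else words.getOrElse(word, 0L) & ~(1L << bit)

    if (updated == 0L) words.remove(word) else words.update(word, updated)
  }

  def get(x: Int, y: Int): Boolean = {
    val flattened = y.toLong * maxX + x
    ((words.getOrElse(flattened / 64, 0L) >> (flattened % 64)) & 1L) == 1L
  }
}

Memory now scales with the number of set bits (plus hash map overhead) instead of with the size of the board.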

Anyways, a fun little set of code to write. Full source available on my github

Strongly typed http headers in finatra

When building service architectures one thing you need to solve is how to pass context between services. This is usually stuff like request IDs and other tracing information (maybe you use zipkin) that needs to flow between service calls. This means that if you set request id FooBar123 at the entrypoint to service A, and service A calls service B, then service B should know that the request id is still FooBar123. The bigger challenge is usually making sure that all thread locals keep this around (and across futures/execution contexts), but before you attempt that you need to get it into the system in the first place.

I’m working in finatra these days, and I love this framework. It’s got all the things I loved from dropwizard but in a scala first way. Today’s challenge was that I wanted to be able to pass request http headers around between services in a typesafe way that could be used in thread local request contexts. Basically I want to send

X-Magic-Header someValue

And be able to resolve that into a MagicHeader(value: T) class.

The first attempt is easy, just parse header values into case classes:

case class MagicHeader(value: String)

But the question I have is how do I enforce that the header string X-Magic-Header is directly correlated to the case class MagicHeader?

object MagicHeader { 
   val key = "X-Magic-Header"
}

case class MagicHeader(value: String)

Maybe, but still, when someone sends the value out, they can make a mistake:

setRequestHeader("X-mag1c-whatevzer" -> magicHeader.value)

That sucks, I don’t want that. I want it strictly paired. I’m looking for what is in essence a case class that has 2 fields: key, value, but where the key is fixed. How do I do that?

I like to start with how I want to use something, and then work backwards to how to make that happen. Given that, lets say we want an api kind of like:

object Experimental {
  val key = "Experimental"

  override type Value = String
}

And I’d like to be able to do something like

val experimentKey = Experimental("experiment abc")
(experimentKey.key -> experimentKey.value) shouldEqual
         ("Experimental" -> "experiment abc")

I know this means I need an apply method somewhere, and I know that I want a tuple of (key, value). I also know that because the type of the second value is path dependent, I can do something with that.

Maybe I can fake an apply method to be like

trait ContextKey {
  val key: String

  /**
   * The custom type of this key
   */
  type Value

  /**
   * A tuple of (String, Value)
   */
  type Key = Product2[String, Value]

  def apply(data: Value): Key = new Key {
    override def _1: String = key

    override def _2: Value = data

    // Product2 extends Equals, so an anonymous instance needs a canEqual
    override def canEqual(that: Any): Boolean = that.isInstanceOf[Product2[_, _]]
  }
}

And update my object to be

object Experimental extends ContextKey {
  val key = "Experimental"

  override type Value = String
}

Now my object has a mixin of an apply method that creates an anonymous tuple of type (String, Value). You can create instances of Experimental but you can’t ever set the key name itself! However, I can still access the pinned key because the anonymous tuple has it!

But in the case that I wanted, I wanted to use these as http header values. Which means I need to be able to parse a string into a type of ContextKey#Value which is path dependent on the object type.

We can do that by adding now a few extra methods on the ContextKey trait:

trait ContextKeyType[T] extends Product2[String, T] {
  def unparse: String
}

trait ContextKey {
  self =>
  val key: String

  /**
   * The custom type of this key
   */
  type Value

  /**
   * A tuple of (String, Value)
   */
  type Key = ContextKeyType[Value]

  /**
   * How to turn the raw wire string into the typed Value
   *
   * @param r
   * @return
   */
  def parse(r: String): Value

  /**
   * How to turn the typed Value back into its wire string
   */
  def unparse(v: Value): String

  def apply(data: Value): Key = new Key {
    override def _1: String = key

    override def _2: Value = data

    /**
     * Allow a mapping of Value => String
     *
     * @return
     */
    override def unparse: String = self.unparse(data)

    override def equals(obj: scala.Any): Boolean = {
      canEqual(obj)
    }

    override def canEqual(that: Any): Boolean = {
      that != null &&
      that.isInstanceOf[ContextKeyType[_]] &&
      that.asInstanceOf[ContextKeyType[_]]._1 == key &&
      that.asInstanceOf[ContextKeyType[_]]._2 == data
    }
  }
}

This introduces parse and unparse methods which convert things to and from strings. An http header object can now define how to convert its value:

object Experimental extends ContextKey {
  val key = "Experimental"
  override type Value = String

  override def parse(value: String): String = value

  override def unparse(value: String): String = value
}

So if we want to send JSON in a header, or a long/int/uuid, we can now parse and unparse that value on either side of the wire.
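
For example, a header carrying a numeric id could be typed as a Long. A quick sketch (the RequestId object and X-Request-Id header here are hypothetical, not from the real code):

object RequestId extends ContextKey {
  val key = "X-Request-Id"
  override type Value = Long

  // parse/unparse move between the wire format (String) and the typed value
  override def parse(value: String): Long = value.toLong

  override def unparse(value: Long): String = value.toString
}

// RequestId(123L)._1 == "X-Request-Id" and RequestId(123L).unparse == "123"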

Now let’s add a utility to convert a Map[String, String], which could represent an http header map, into a strongly typed context value:

object ContextValue {
  def find[T <: ContextKey](search: T, map: Map[String, String]): Option[T#Value] = {
    map.collectFirst {
      case (key, value) if search.key == key => search.parse(value)
    }
  }
}
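
As a quick sanity check, here’s how the lookup behaves against a raw header map (a small usage sketch with the Experimental object defined above):

val headers = Map("Experimental" -> "experiment abc", "Content-Type" -> "application/json")

// Typed as Option[Experimental.Value], i.e. Option[String]
val experiment = ContextValue.find(Experimental, headers)
// experiment == Some("experiment abc")

// A header that wasn't sent simply resolves to None
val absent = ContextValue.find(Experimental, Map.empty[String, String])
// absent == None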

Back in finatra land, let’s add an http filter:

case class CurrentRequestContext(
  experimentId: Option[Experimental.Value]
)

object RequestContext {
  private val requestType = Request.Schema.newField[CurrentRequestContext]

  implicit class RequestContextSyntax(request: Request) {
    def context: CurrentRequestContext = request.ctx(requestType)
  }

  private[filters] def set(request: Request): Unit = {
    val data = CurrentRequestContext(
      experimentId = ContextValue.find(Experimental, request.headerMap)
    )

    request.ctx.update(requestType, data)
  }
}

/**
 * Set the remote context from requests 
 */
class RemoteContextFilter extends SimpleFilter[Request, Response] {
  override def apply(request: Request, service: Service[Request, Response]): Future[Response] = {
    RequestContext.set(request)

    service(request)
  }
}

From here on out, we can provide a set of strongly typed values that are basically case classes with hidden keys.
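
To close the loop, a controller reading the context might look something like this (a hypothetical sketch assuming RemoteContextFilter is installed in the filter chain; the controller name and route are made up):

import com.twitter.finagle.http.Request
import com.twitter.finatra.http.Controller
import RequestContext._

class ExperimentController extends Controller {
  get("/experiment") { request: Request =>
    // the filter has already parsed the raw header into the typed context
    request.context.experimentId.getOrElse("no experiment header sent")
  }
}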

Deployment the paradoxical way

First and foremost, this is all Jake Swenson’s brainchild. But it’s just too cool not to share and write about. Thanks Jake for doing all the hard work :)

At paradoxical, we really like being able to crank out libraries and projects as fast as possible. We hate boilerplate and we hate repetition. Everything should be automated. For a long time we used maven archetypes to crank out services and libraries from templates, and that worked reasonably well. However, deployment was always kind of a manual process. We had scripts in each repo to use the maven release plugin, but our build system (Travis) wasn’t wired into it. This meant that deploys of libraries/services required a manual (but simple) step to run. We also had some kinks with our gpg keys, and we weren’t totally sure of a clean way to let Travis sign our artifacts securely without checking our keys into a bunch of different repos.

Jake and I had talked a while ago about how nice it would be if we could

  • Have all builds to master auto deployed as snapshots
  • PR’s built but not deployed
  • Creating a github release kicked off an actual release

The first two were reasonably easy with the travis scripts we already had, but it was the last one that was fun.

This article was posted not long ago about simplifying your maven release process by chucking the maven release plugin and instead using maven deploy directly. If you can parameterize your maven artifact version number and have your build pass that in from the git tag, then you can really easily achieve git tag driven development!

To that end, Jake created a git project that facilitated setting up all our repos for tag driven deployment. Each deployable of ours checks out this project as a submodule under .deployment, which contains the tooling to make git tag releases happen.

To onboard

First things first: we need a way to delegate deployment after our travis build is complete. So you’d add the following to your project’s travis file:

git:
  submodules: false
before_install:
  # https://git-scm.com/docs/git-submodule#_options:
  # --remote
  # Instead of using the superproject’s recorded SHA-1 to update the submodule,
  # use the status of the submodule’s remote-tracking (branch.<name>.remote) branch (submodule.<name>.branch).
  # --recursive
  # https://github.com/travis-ci/travis-ci/issues/4099
  - git submodule update --init --remote --recursive
after_success:
- ./.deployment/deploy.sh

This pulls the git deployment submodule and delegates the after_success step to its deploy script.

You also need to add the deployment project as a parent of your pom:

<parent>
    <groupId>io.paradoxical</groupId>
    <artifactId>deployment-base-pom</artifactId>
    <version>1.0</version>
</parent>

This sets up nice things for us like making sure we sign our GPG artifacts, include sources as part of our deployment, and attaches javadocs.

The last thing you need to do is parameterize your artifact version field:

<version>1.0${revision}</version>

The parent pom defines revision and will set it to be either the git tag or -SNAPSHOT depending on context.

But, for those of you with strong maven experience, an alarm may fire that you can’t parameterize the version field. To solve that problem, Jake wrote a wonderful maven parameter resolver which lets you white-list which fields should be resolved in the pom before it is deployed. This solves an issue where a deployed maven pom would otherwise still contain placeholders for values that are only set at build time; without the resolver, maven has issues resolving transitive dependencies.

Anyways, the base pom handles a lot of nice things :)

The deploy script

Now let’s break down the after-build deploy script. Its job is to take the travis-encrypted gpg keys (which are also password secured), decrypt them, and run the right maven deployment given the git tags.

if [ -n "$TRAVIS_TAG" ]; then
    echo "Deploying release version for tag '${TRAVIS_TAG}'"
    mvn clean deploy --settings "${SCRIPT_DIR}/settings.xml" -DskipTests -P release -Drevision='' $@
    exit $?
elif [ "$TRAVIS_BRANCH" = "master" ]; then
    echo "Deploying snapshot version on branch '${TRAVIS_BRANCH}'"
    mvn clean deploy --settings "${SCRIPT_DIR}/settings.xml" -DskipTests -P snapshot $@
    exit $?
else
    echo "No deployment running for current settings"
    exit 0
fi

It’s worth noting here a few magic things.

  1. The settings.xml file is provided by the submodule and contains the gpg username along with a parameterized password, so none of this has to be duplicated (or checked in) in every repo.
  2. Because the deploy script is invoked from the root of the project, even though the deploy script lives in the deployment submodule, it resolves paths from the point of execution (not from where the script lives). This is why the script captures its own path and stores it as the $SCRIPT_DIR variable.

Release time!

Now that it’s all set up we can safely merge to master whenever we want to publish a snapshot, and if we want to mark a release as public we just create a git tag for it.

Coproducts and polymorphic functions for safety

I was recently exploring shapeless and a coworker turned me onto the interesting features of coproducts and how they can be used with polymorphic functions.

Frequently when using pattern matching you want to make sure that all cases are exhaustively checked. A non exhaustive pattern match is a runtime exception waiting to happen. As a scala user, I’m all about compile time checking. For classes that I own I can enforce exhaustiveness by creating a sealed trait hierarchy:

sealed trait Base
case class Sub1() extends Base
case class Sub2() extends Base

And if I ever try and match on a Base type I’ll get a compiler warning (that I can fail on) if all the types aren’t matched. This is nice because if I ever add another type, I’ll get a (hopefully) failed build.

But what about the scenario where you don’t own the types?

case class Type1()
case class Type2()
case class Type3()

They’re all completely unrelated. Even worse, how do you create a generic function that accepts an instance of any of those 3 types but no others? You could always create overloaded methods:

def takesType(t1: Type1) = ???
def takesType(t2: Type2) = ???
def takesType(t3: Type3) = ???

Which works just fine, but what if that type needs to be passed through a few layers of function calls before it’s actually acted on?

def doStuff(t1: Type1) = ... takesType(t1)
def doStuff(t2: Type2) = ... takesType(t2)
def doStuff(t3: Type3) = ... takesType(t3)

Oh boy, this is a mess. We can’t get around this with just generics and type bounds since there is no unified type for these 3 types. And it gets even worse if we add another type. We could use an Either like Either[Type1, Either[Type2, Either[Type3, Nothing]]]

Which lets us write just one function, but then we have to match on the nested subsets. This is kind of gross too since it’s polluted with a bunch of eithers. Turns out, though, that a coproduct is exactly this… a souped up either!

Defining

type Items = Type1 :+: Type2 :+: Type3 :+: CNil

(where CNil is the terminator for a coproduct) we now have a unified type for our collection. We can write functions like:

def doStuff(item: Items) = {
  // whatever
  takesType(item)
}

At some point you need to lift an instance of Type1 etc. into the Items type, and this can be done by calling Coproduct[Items](instance). This call will fail to compile if the type of the instance is not a member of Items. You’re also probably going to want to actually do work with the thing, so you need to unbox this souped up either and do stuff with it.
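
A quick sketch of that lifting and per-case unboxing (assuming shapeless 2.x is on the classpath):

import shapeless._

type Items = Type1 :+: Type2 :+: Type3 :+: CNil

// Lifting: only compiles because Type1 is a member of Items
val item: Items = Coproduct[Items](Type1())

// Unboxing a specific case gives you back an Option
val maybeType1: Option[Type1] = item.select[Type1] // Some(Type1())
val maybeType2: Option[Type2] = item.select[Type2] // None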

This is where the shapeless PolyN methods come into play.

object Worker {
  type Items = Type1 :+: Type2 :+: Type3 :+: CNil

  object thisIsAMethod extends Poly1 {
    // corresponding def for the data type of the coproduct instance
    implicit def invokedOnType1 = at[Type1](data => data.toString)
    implicit def invokedOnType2 = at[Type2](data => data.toString)
    implicit def invokedOnType3 = at[Type3](data => data.toString)
  }

  def takesItem(item: Items): String = {
    // fold applies the poly to whichever case the coproduct actually holds
    item.fold(thisIsAMethod)
  }
}

class Provider {
  Worker.takesItem(Coproduct[Worker.Items](Type1())) // ok
  Worker.takesItem(Coproduct[Worker.Items](WrongType())) // fails to compile
}

The object thisIsAMethod creates a bunch of implicit type dependent functions that are defined at each of the elements in the coproduct. If we add another option to our coproduct list, we’ll get a compiler error when we try to use the coproduct against the polymorphic function. This accomplishes the same thing as the exhaustiveness check, but it’s an even stronger guarantee since the build will fail.

While it is a lot of hoops to jump through, and can be a little mind bending, I’ve found that coproducts and polymorphic functions are a really nice addition to my scala toolbox. Being able to strongly enforce these kinds of contracts is pretty neat!

Mocking nested objects with mockito

Yes, I know it’s a code smell. But I live in the real world, and sometimes you need to mock nested objects. This is a scenario like:

when(a.b.c.d).thenReturn(e)

The usual pattern here is to create a mock for each object and return the previous mock:

val a = mock[A]
val b = mock[B]
val c = mock[C]
val d = mock[D]

when(a.b).thenReturn(b)
when(b.c).thenReturn(c)
when(c.d).thenReturn(d)

But again, in the real world the signatures are longer, the types are nastier, and it’s never quite so clean. I figured I’d sit down and solve this for myself once and for all and came up with:

import org.junit.runner.RunWith
import org.mockito.Mockito
import org.scalatest.junit.JUnitRunner
import org.scalatest.{FlatSpec, Matchers}

@RunWith(classOf[JUnitRunner])
class Tests extends FlatSpec with Matchers {
  "Mockito" should "proxy nested objects" in {
    val parent = Mocks.mock[Parent]

    Mockito.when(
      parent.
        mock(_.getChild1).
        mock(_.getChild2).
        mock(_.getChild3).
        value.doWork()
    ).thenReturn(3)

    parent.value.getChild1.getChild2.getChild3.doWork() shouldEqual 3
  }
}

class Child3 {
  def doWork(): Int = 0
}

class Child2 {
  def getChild3: Child3 = new Child3
}

class Child1 {
  def getChild2: Child2 = new Child2
}

class Parent {
  def getChild1: Child1 = new Child1
}

As you can see in the full test we can create some mock objects and reference the call chain via extractor methods.

The actual mocker is really pretty simple; it just looks nasty because of all the lambdas/manifests. All that’s going on here is a way to pass the next object to a chain and extract it with a method. Then we can create a mock using the manifest and wire that mock up to the source object via the lambda.

import org.mockito.Mockito

object Mocks {
  implicit def mock[T](implicit manifest: Manifest[T]) = new RichMockRoot[T]

  class RichMockRoot[T](implicit manifest: Manifest[T]) {
    val value = Mockito.mock[T](manifest.runtimeClass.asInstanceOf[Class[T]])

    def mock[Y](extractor: T => Y)(implicit manifest: Manifest[Y]): RichMock[Y] = {
      new RichMock[T](value, List(value)).mock(extractor)
    }
  }

  class RichMock[T](c: T, prevMocks: List[_]) {
    def mock[Y](extractor: T => Y)(implicit manifest: Manifest[Y]): RichMock[Y] = {
      val m = Mockito.mock[Y](manifest.runtimeClass.asInstanceOf[Class[Y]])

      Mockito.when(extractor(c)).thenReturn(m)

      new RichMock(m, prevMocks ++ List(m))
    }

    def value: T = c

    def mockChain[Y](idx: Int) = prevMocks(idx).asInstanceOf[Y]

    def head[Y] = mockChain[Y](0)
  }
}

The main idea here is just to hide away the whole “make b and have it return c” for you. You can even capture all the intermediate mocks in a list (I called it a mock chain), and expose the first element of the list with head. With a little bit of scala manifest magic you can even get around needing to pass Class objects around and can leverage the generic parameter (boy, feels almost like .NET!).
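
As a hypothetical follow-on to the test above, the chain itself can be captured and its intermediate mocks pulled back out (just a usage sketch built on the classes already defined):

val chain = Mocks.mock[Parent].mock(_.getChild1).mock(_.getChild2).mock(_.getChild3)

Mockito.when(chain.value.doWork()).thenReturn(42)

// head is the root Parent mock; mockChain(i) is the i-th mock created along the way
val parentMock = chain.head[Parent]
val child1Mock = chain.mockChain[Child1](1)

parentMock.getChild1.getChild2.getChild3.doWork() // 42
assert(parentMock.getChild1 eq child1Mock)        // the intermediate mock is wired into the chain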

Extracting scala method names from objects with macros

I have a soft spot in me for ASTs ever since I went through the exercise of building my own language. Working in Java I missed the dynamic ability to get compile time information, though I knew it was available as part of the annotation processing pipeline during compilation (which is how lombok works). Scala has something similar in the concept of macros: a way to hook into the compiler, manipulate or inspect the syntax tree, and rewrite or inject whatever you want. It’s a wonderfully elegant system that reminds me of Lisp/Clojure macros.

I ran into a situation (as always) where I really wanted to get the name of a function dynamically. i.e.

class Foo {
   val field: String = "" 
   def method(): Unit = {}
}

val name: String = ??.field // = "field"

In .NET this is pretty easy since at runtime you can create an expression tree which gives you the AST. But I haven’t been in .NET in a while, so off to macros I went!

First off, I found the documentation regarding macros to be lackluster. It’s either rudimentary with trivial examples, or the learning curve is steep and I was too lazy to read through all of it. Usually when I encounter scenarios like this I turn to exploratory programming, where I have a unit test that sets up a basic example and I leverage the debugger and intellij live REPL to poke through what I can and can’t do. Time to get set up.

First, I needed to create a new submodule in my multi module maven project that would contain my macro. The reason is that you can’t use macros in the same compilation unit that they are defined in. You can, however, use macros in the macro module’s tests since the compiler compiles test sources separately from regular sources.

That said, debugging macros is harder than normal because you aren’t debugging your running program, you are debugging the actual compiler. I found this blog post which was a life saver, even though it was missing a few minor pieces.

1. Set the main class to scala.tools.nsc.Main
2. Set the VM args to -Dscala.usejavacp=true
3. Set the program arguments to first point to the file containing the macro, then the file to compile that uses the macro:

-cp types.Types macros/src/main/scala/com/devshorts/common/macros/MethodNames.scala config/src/test/scala/config/ConfigProxySpec.scala

Now you can actually debug your macro!

First let me show the test

case class MethodNameTest(field1: Object) {
  def getFoo(arg: Object): Unit = {}
  def getFoo2(arg: Object, arg2: Object): Unit = {}
}

class MethodNamesMacroSpec extends FlatSpec with Matchers {
  "Names macro" should "extract from an function" in {
    methodName[MethodNameTest](_.field1) shouldEqual MethodName("field1")
  }

  it should "extract when the function contains an argument" in {
    methodName[MethodNameTest](_.getFoo(null)) shouldEqual MethodName("getFoo")
  }

  it should "extract when the function contains multiple argument" in {
    methodName[MethodNameTest](_.getFoo2(null, null)) shouldEqual MethodName("getFoo2")
  }

  it should "extract when the method is curried" in {
    methodName[MethodNameTest](m => m.getFoo2 _) shouldEqual MethodName("getFoo2")
  }
}

macro

methodName here is a macro that extracts the method name from a lambda over the parameterized generic type. What’s nice about how scala set up macros is that you provide an alias for your macro, so you can re-use the macro but type it however you want.

object MethodNames {
  implicit def methodName[A](extractor: (A) => Any): MethodName = macro methodNamesMacro[A]

  def methodNamesMacro[A: c.WeakTypeTag](c: Context)(extractor: c.Expr[(A) => Any]): c.Expr[MethodName] = {
     ...
   }
}

I’ve made the methodName function take a generic and a function that uses that generic (even though no actual instance is ever passed in). The nice thing about this is I can re-use the macro typed as another function elsewhere. Imagine I want to pin [A] so people don’t have to type it. I can do exactly that!

case class Configuration (foo: String)

implicit def config(extractor: Configuration => Any): MethodName = macro MethodNames.methodNamesMacro[Configuration]

config(_.foo) == MethodName("foo")
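
The MethodName wrapper itself isn’t shown here; presumably it’s nothing more than a simple value holder along the lines of:

// Assumed shape of the case class the macro reifies into
case class MethodName(name: String)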

At this point it’s time to build the bulk of the macro. The idea is to inspect parts of the AST and potentially walk it to find the pieces we want. Here’s what I ended up with:

def methodNamesMacro[A: c.WeakTypeTag](c: Context)(extractor: c.Expr[(A) => Any]): c.Expr[MethodName] = {
  import c.universe._

  @tailrec
  def resolveFunctionName(f: Function): String = {
    f.body match {
      // the function name
      case t: Select => t.name.decoded

      case t: Function =>
        resolveFunctionName(t)

      // an application of a function and extracting the name
      case t: Apply if t.fun.isInstanceOf[Select] =>
        t.fun.asInstanceOf[Select].name.decoded

      // curried lambda
      case t: Block if t.expr.isInstanceOf[Function] =>
        val func = t.expr.asInstanceOf[Function]

        resolveFunctionName(func)

      case _ => {
        throw new RuntimeException("Unable to resolve function name for expression: " + f.body)
      }
    }
  }

  val name = resolveFunctionName(extractor.tree.asInstanceOf[Function])

  val literal = c.Expr[String](Literal(Constant(name)))

  reify {
    MethodName(literal.splice)
  }
}

For more details on parts of the AST here is a great resource

In the first case, when we pass in methodName[Config](_.method), it gets mangled into a function whose body is x$1.method. The Select indicates the x$1 instance and selects the method member on it. This is an easy case.

The Block case maps to when we call methodName[Config](c => c.thing _). In this case we have a function but it’s curried. Here the function body is a block whose inner expression is a function, and the body of that inner function is an Apply.

Apply takes two arguments — a Select or an Ident for the function and a list of arguments

So that makes sense.

The rest is just helper methods to recurse.

The last piece of the puzzle is to create an instance of a string literal and splice it into a new expression returning the MethodName case class that contains the string literal.

All in all a fun afternoon’s worth of code, and now I get semantic string safety. A great use case here can be to type configuration values or other string semantics with a trait. You can get compile time refactoring + type safety. Other use cases are things like database configurations and drivers (the .NET mongo driver uses expression trees to type an object to its underlying mongo collection).

Unit testing DNS failovers

Something that’s come up a few times in my career is the difficulty of validating if and when your code can handle actual DNS changes. A lot of times testing that you have the right JVM settings and that your 3rd party clients can handle it involves mucking with hosts files, nameservers, or stuff like Route53 and waiting around. That’s hard to automate and deterministically reproduce. However, you can hook into the DNS resolution in the JVM to control what gets resolved to what. This way you can tweak the resolution in a test and see what breaks! I found some info at this blog post and cleaned it up a bit for usage in scala.

The magic sauce to pull this off is to make sure you override the default sun.net.spi.nameservice.NameServiceDescriptor. Internally the InetAddress class tries to load an instance of the NameServiceDescriptor interface using the service loader mechanism. The service loader looks for a resource at META-INF/services/fully.qualified.classname.to.override and instantiates whatever fully qualified class name is listed in that file.

For example, if we have

cat META-INF/services/sun.net.spi.nameservice.NameServiceDescriptor
io.paradoxical.test.dns.LocalNameServerDescriptor

Then the io.paradoxical.test.dns.LocalNameServerDescriptor will get created. Nice.

What does that class actually look like?

class LocalNameServerDescriptor extends NameServiceDescriptor {
  override def getType: String = "dns"

  override def createNameService(): NameService = {
    new LocalNameServer()
  }

  override def getProviderName: String = LocalNameServer.dnsName
}

The type is dns and the name service implementation is our own class. The provider name is something custom we’ve defined as well, shown below:

object LocalNameServer {
  Security.setProperty("networkaddress.cache.ttl", "0")

  protected val cache = new ConcurrentHashMap[String, String]()

  val dnsName = "local-dns"

  def use(): Unit = {
    System.setProperty("sun.net.spi.nameservice.provider.1", s"dns,${dnsName}")
  }

  def put(hostName: String, ip: String) = {
    cache.put(hostName, ip)
  }

  def remove(hostName: String) = {
    cache.remove(hostName)
  }
}

class LocalNameServer extends NameService {

  import LocalNameServer._

  val default = new DNSNameService()

  override def lookupAllHostAddr(name: String): Array[InetAddress] = {
    val ip = cache.get(name)
    if (ip != null && !ip.isEmpty) {
      InetAddress.getAllByName(ip)
    } else {
      default.lookupAllHostAddr(name)
    }
  }

  override def getHostByAddr(bytes: Array[Byte]): String = {
    default.getHostByAddr(bytes)
  }
}

Pretty simple. We have a cache that is stored in a singleton companion object with some helper methods on it, and all we do is delegate looking into the cache. If we can resolve the data in the cache we return it, otherwise just proxy it to the default resolver.

The use method sets a system property that says to use the dns resolver named local-dns as the highest priority nameservice.provider.1 (lower numbers are higher priority).

Now we can write some tests and see if this works!

@RunWith(classOf[JUnitRunner])
class DnsTests extends FlatSpec with Matchers {
  LocalNameServer.use()

  "DNS" should "resolve" in {
    val google = resolve("www.google.com")

    google.getHostAddress shouldNot be("127.0.0.1")
  }

  it should "be overridable" in {
    LocalNameServer.put("www.google.com", "127.0.0.1")

    val google = resolve("www.google.com")

    google.getHostAddress should be("127.0.0.1")

    LocalNameServer.remove("www.google.com")
  }

  it should "be undoable" in {
    LocalNameServer.put("www.google.com", "127.0.0.1")

    val google = resolve("www.google.com")

    google.getHostAddress should be("127.0.0.1")

    LocalNameServer.remove("www.google.com")

    resolve("www.google.com").getHostAddress shouldNot be("127.0.0.1")
  }

  def resolve(name: String) = InetAddress.getByName(name)
}

Happy dns resolving!

Consistent hashing for fun

I think consistent hashing is pretty fascinating. It lets you define a ring of machines that shard out data by a hash value. Imagine that your hash space is 0 -> Int.Max, and you have 2 machines. Well one machine gets all values hashed from 0 -> Int.Max/2 and the other from Int.Max/2 -> Int.Max. Clever. This is one of the major algorithms of distributed systems like cassandra and dynamoDB.

For a good visualization, check out this blog post.

The fun stuff happens when you want to add replication and fault tolerance to your hashing. Now you need to have replicas and manage what happens when machines join and leave. When a machine joins, you need to re-partition the space evenly and re-distribute the values that were previously held.

Something similar happens when a node leaves: you need to make sure that whatever it was responsible for in its primary space AND the things it was responsible for as a secondary replica are re-distributed amongst the remaining nodes.

But the beauty of consistent hashing is that the replication basically happens for free! And so does redistribution!

Since my new feature is all in Scala, I figured I’d write something up to see how this might play out.

For the impatient, the full source is here.

First I started with some data types

case class HashValue(value: String) extends AnyRef

case class HashKey(key: Int) extends AnyRef with Ordered[HashKey] {
  override def compare(that: HashKey): Int = key.compare(that.key)
}

object HashKey {
  def safe(key: Int) = new HashKey(Math.abs(key))
}

case class HashRange(minHash: HashKey, maxHash: HashKey) extends Ordered[HashRange] {
  override def compare(that: HashRange): Int = minHash.compare(that.minHash)
}

I chose to wrap the key in a positive space since it made things slightly easier. In reality you want to use md5 or some actual hashing function, but I relied on the hash code here.

And then a machine to hold values:

import scala.collection.immutable.TreeMap

class Machine[TValue](val id: String) {
  private var map: TreeMap[HashKey, TValue] = new TreeMap[HashKey, TValue]()

  def add(key: HashKey, value: TValue): Unit = {
    map = map + (key -> value)
  }

  def get(hashKey: HashKey): Option[TValue] = {
    map.get(hashKey)
  }

  def getValuesInHashRange(hashRange: HashRange): Seq[(HashKey, TValue)] ={
    map.range(hashRange.minHash, hashRange.maxHash).toSeq
  }

  def keepOnly(hashRanges: Seq[HashRange]): Seq[(HashKey, TValue)] = {
    val keepOnly: TreeMap[HashKey, TValue] =
      hashRanges
      .map(range => map.range(range.minHash, range.maxHash))
      .fold(map.empty) { (tree1, tree2) => tree1 ++ tree2 }

    val dropped = map.filter { case (k, v) => !keepOnly.contains(k) }

    map = keepOnly
    
    dropped.toSeq
  }
}

A machine keeps a sorted tree map of hash values. This lets me really quickly get things within ranges. For example, when we re-partition a machine, it’s no longer responsible for the entire range set that it was before. But it may still be responsible for parts of it. So we want to be able to tell a machine hey, keep ranges 0-5, 12-20, but drop everything else. The tree map lets me do this really nicely.

Now for the fun part, the actual consistent hashing stuff.

Given a set of machines, we need to define how the circular partitions is defined

private def getPartitions(machines: Seq[Machine[TValue]]): Seq[(HashRange, Machine[TValue])] = {
  val replicatedRanges: Seq[HashRange] = Stream.continually(defineRanges(machines.size)).flatten

  val infiniteMachines: Stream[Machine[TValue]] =
    Stream.continually(machines.flatMap(List.fill(replicas)(_))).flatten

  replicatedRanges
  .zip(infiniteMachines)
  .take(machines.size * replicas)
  .toList
}

What we want to make sure of is that each node sits on multiple ranges; this gives us the replication factor. To do that I’ve duplicated the machines in the list by the replication factor, and made sure all the lists cycle around indefinitely, so while they are not evenly distributed around the ring (they are clustered) they do provide fault tolerance.
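
defineRanges isn’t shown here; a minimal version, which just splits the positive hash space into count equal contiguous ranges, might look like this (a sketch, not necessarily what the full source does):

private def defineRanges(count: Int): Seq[HashRange] = {
  val step = Int.MaxValue / count

  (0 until count).map { i =>
    val min = i * step
    // last range runs to the end of the space so nothing falls off the ring
    val max = if (i == count - 1) Int.MaxValue else (i + 1) * step

    HashRange(HashKey.safe(min), HashKey.safe(max))
  }
}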

Let’s look at what it takes to put a value into the ring:

private def put(hashkey: HashKey, value: TValue): Unit = {
  getReplicas(hashkey).foreach(_.add(hashkey, value))
}

private def getReplicas(hashKey: HashKey): Seq[Machine[TValue]] = {
  partitions
  .filter { case (range, machine) => hashKey >= range.minHash && hashKey < range.maxHash }
  .map(_._2)
}

We need to make sure that for each replica in the ring whose range covers the hash key, we insert the value into that machine. That’s pretty easy, though we could improve this later with better lookups.

Let’s look at a get:

def get(hashKey: TKey): Option[TValue] = {
  val key = HashKey.safe(hashKey.hashCode())

  getReplicas(key)
  .map(_.get(key))
  .collectFirst { case Some(x) => x }
}

Also similar. Go through all the replicas, and find the first one that returns a value.

Now let’s look at how to add a machine to the ring:

def addMachine(): Machine[TValue] = {
  id += 1

  val newMachine = new Machine[TValue]("machine-" + id)

  val oldMachines = partitions.map(_._2).distinct

  partitions = getPartitions(Seq(newMachine) ++ oldMachines)

  redistribute(partitions)

  newMachine
}

So we first create a new list of machines, and then ask how to re-partition the ring. Then the keys in the ring need to redistribute themselves so that only the nodes that are responsible for certain ranges contain those keys.

def redistribute(newPartitions: Seq[(HashRange, Machine[TValue])]) = {
  newPartitions.groupBy { case (range, machine) => machine }
  .flatMap { case (machine, ranges) => machine.keepOnly(ranges.map(_._1)) }
  .foreach { case (k, v) => put(k, v) }
}

Redistributing isn’t that complicated either. We group all the nodes in the ring by the machine they are on, then for each machine we tell it to only keep values that are in its assigned ranges. The machine’s keepOnly function takes a list of ranges and will remove and return anything not in those ranges. We can now aggregate all the things that are “emitted” by the machines and re-insert them into the right location.

Removing a machine is really similar:

def removeMachine(machine: Machine[TValue]): Unit = {
  val remainingMachines = partitions.filter { case (r, m) => !m.eq(machine) }.map(_._2)

  partitions = getPartitions(remainingMachines.distinct)

  redistribute(partitions)
}

And that’s all there is to it! Now we have a fast, simple consistent hasher.

A toy generational garbage collector

Had a little downtime today and figured I’d make a toy generational garbage collector, for funsies. A friend of mine was once asked this as an interview question so I thought it might make for some good weekend practice.

For those not familiar, a common way of doing garbage collection in managed languages is to have the concept of multiple generations. All newly created objects go in gen0. New objects are also the most likely to be destroyed, as there is a lot of transient stuff that goes on in an application. If an element survives a gc round it gets promoted to gen1. Gen1 doesn’t get GC’d as often. Same with gen2.

A GC cycle usually consists of iterating through application root nodes (so starting at main and traversing down) and checking which generation each reference lies in. If we’re doing a gen1 collection, we’ll also do gen0 and gen1. However, if you’re doing gen0 only and a node is already lying in gen1, you can just bail early and say “meh, this node and all its references are probably ok for now, we’ll try this later“.

For a really great visualization check out this msdn article on generational garbage collection.

And now on to the code! First let’s start with what an object is:

@Data
@EqualsAndHashCode(of = "id")
public class Node {
    private final String id;

    public Node(String id) {
        this.id = id;
    }

    private final List<Node> references = new ArrayList<>();

    public void addReference(Node node) {
        references.add(node);
    }

    public void removeReference(Node node) {
        references.removeIf(i -> i.getId().equals(node.getId()));
    }
}

For the purposes of the toy, it’s just some node with some unique id.

Let’s also define an enum of the different generations we’ll support and their ordinal values:

public enum Mode {
    Gen0,
    Gen1
}

Next, let’s make an allocator that can allocate new nodes. This would be like the new keyword behind the scenes:

public class Allocator {
    @Getter
    private Set<Node> gen0 = new HashSet<>();

    @Getter
    private Set<Node> gen1 = new HashSet<>();

    public Node newNode() {
        return newNode("");
    }

    public Node newNode(String tag) {
        final Node node = new Node(tag + UUID.randomUUID());

        getGen0().add(node);

        return node;
    }

    public Mode locateNode(Node tag) {
        if (gen1.contains(tag)) {
            return Mode.Gen1;
        }

        return Mode.Gen0;
    }
    ....

At this point we can allocate a new node, and assign nodes references.

final Allocator allocator = new Allocator();

final Node root = allocator.newNode();

root.addReference(allocator.newNode());
root.addReference(allocator.newNode());
root.addReference(allocator.newNode());

Still haven’t actually collected anything yet, though. So let’s write a garbage collector:

public class Gc {
    private final Allocator allocator;

    public Gc(Allocator allocator) {
        this.allocator = allocator;
    }

    public void collect(Node root, Mode mode) {
        final Allocator.Marker marker = allocator.markBuilder(mode);

        mark(root, marker, mode);

        marker.sweep();
    }

    private void mark(Node root, Allocator.Marker marker, Mode mode) {
        final Mode found = allocator.locateNode(root);

        if (found.ordinal() > mode.ordinal()) {
            return;
        }

        marker.mark(root);

        root.getReferences().forEach(ref -> mark(ref, marker, mode));
    }
}

The GC does a DFS on the root object reference and marks all visible nodes with some marker builder (yet to be shown). If the generational heap the node lives in is less than or equal to the mode we are collecting, we mark it, otherwise we just skip it. This works because later we’ll only prune from generation heaps according to the mode.

Now comes the fun part: the marker.

public static class Marker {

    private final Set<String> marks;
    private final Allocator allocator;

    private final Mode mode;

    public Marker(Allocator allocator, final Mode mode) {
        this.allocator = allocator;
        this.mode = mode;
        marks = new HashSet<>();
    }

    public void mark(Node node) {
        marks.add(node.getId());
    }

    public void sweep() {
        Predicate<Node> remove = node -> !marks.contains(node.getId());

        allocator.getGen0().removeIf(remove);

        allocator.promote(Mode.Gen0);

        switch (mode) {
            case Gen0:
                break;
            case Gen1:
                allocator.getGen1().removeIf(remove);
                allocator.promote(Mode.Gen1);
                break;
        }
    }
}

All we do here when we mark is tag the node in a set. When we sweep, we go through the generations less than or equal to the current mode, remove unmarked nodes, and promote the surviving nodes to the next heap!

Still missing are the two last functions in the allocator: promote and the marker builder.

public Marker markBuilder(final Mode mode) {
    return new Marker(this, mode);
}

private void promote(final Mode mode) {
    switch (mode) {
        case Gen0:
            gen1.addAll(gen0);
            gen0.clear();
            break;
        case Gen1:
            break;
    }
}

Now we can put it all together and write some tests:

Below you can see the promotion in action.

final Allocator allocator = new Allocator();

final Gc gc = new Gc(allocator);

final Node root = allocator.newNode();

root.addReference(allocator.newNode());
root.addReference(allocator.newNode());
root.addReference(allocator.newNode());

final Node removable = allocator.newNode("remove");

removable.addReference(allocator.newNode("dangle1"));
removable.addReference(allocator.newNode("dangle2"));

root.addReference(removable);

assertThat(allocator.getGen0().size()).isEqualTo(7);

gc.collect(root, Mode.Gen0);

assertThat(allocator.getGen0().size()).isEqualTo(0);
assertThat(allocator.getGen1().size()).isEqualTo(7);

Nothing can be collected since all nodes have references, but we’ve cleared gen0 and moved all nodes to gen1.

root.removeReference(removable);

gc.collect(root, Mode.Gen1);

assertThat(allocator.getGen0().size()).isEqualTo(0);
assertThat(allocator.getGen1().size()).isEqualTo(4);

Now we can actually remove the reference and do a gen1 collection. You can see that the gen1 heap size went down by 3 (the removable node plus its two children) since those nodes are no longer reachable.

And just for fun, let’s show that gen1 collection works as well:

final Node gen1Remove = allocator.newNode();

root.addReference(gen1Remove);

gc.collect(root, Mode.Gen1);

assertThat(allocator.getGen0().size()).isEqualTo(0);
assertThat(allocator.getGen1().size()).isEqualTo(5);

root.removeReference(gen1Remove);

gc.collect(root, Mode.Gen1);

assertThat(allocator.getGen0().size()).isEqualTo(0);
assertThat(allocator.getGen1().size()).isEqualTo(4);

And there you have it, a toy generational garbage collector :)

For the full code, check out this gist

Logging the easy way

This is a cross post from the original posting at godaddy’s engineering blog. This is a project I have spent considerable time working on and leverage a lot.

Logging is a funny thing. Everyone knows what logs are and everyone knows you should log, but there are no hard and fast rules on how to log or what to log. Your logs are your first line of defense against figuring out issues live. Sometimes logs are the only line of defense (especially in time sensitive systems).

That said, in any application good logging is critical. Debugging an issue can be made ten times easier with simple, consistent logging. Inconsistent or poor logging can actually make it impossible to figure out what went wrong in certain situations. Here at GoDaddy we want to make sure that we encourage logging that is consistent, informative, and easy to search.

Enter the GoDaddy Logger. This is a SLF4J wrapper library that encourages us to fall into the pit of success when dealing with our logging formats and styles in a few ways:

  • Frees you from having to think about what context fields need to be logged and removes any worries about forgetting to log a value,
  • Provides the ability to skip personal identifiable information from being logged,
  • Abstracts out the actual format of the logs from the production of them. By decoupling the output of the framework from the log statements themselves, you can easily swap out the formatter when you want to change the structure and all of your logging statements will be consistently logged using the new format.

A lot of teams at GoDaddy use ELK (Elasticsearch, Logstash, Kibana) to search logs in a distributed system. By combining consistent logging with ELK (or Splunk or some other solution), it becomes relatively straight forward for developers to correlate and locate related events in their distributed systems.

THE GODADDY LOGGER

In an effort to make doing the right thing the easy thing, our team set out to build an extra layer on top of SLF4J – The GoDaddy Logger. While SLF4J is meant to abstract logging libraries and gives you a basic logging interface, our goal was to extend that interface to provide for consistent logging formats. One of the most important things for us was that we wanted to provide an easy way to log objects rather than having to use string formatting everywhere.

CAPTURING THE CONTEXT

One of the first things we did was expose what we call the ‘with’ syntax. The ‘with’ syntax builds a formatted key value pair, which by default is “key=value;”, and allows logging statements to be more human readable. For example:

logger.with("first-name", "GoDaddy")
     .with("last-name", "Developers!")
     .info("Logging is fun");

Using the default logging formatter this log statement outputs:

Logging is fun; first-name="GoDaddy"; last-name="Developers!"

We can build on this to support deep object logging as well. A good example is to log the entire object from an incoming request. Instead of relying on the .toString() of the object to be its loggable representation, we can crawl the object using reflectasm and format it globally and consistently. Let’s look at an example of how a full object is logged.

Logger logger = LoggerFactory.getLogger(LoggerTest.class);
Car car = new Car("911", 2015, "Porsche", 70000.00, Country.GERMANY, new Engine("V12"));
logger.with(car).info("Logging Car");

Like the initial string ‘with’ example, the above log line produces:

14:31:03.943 [main] INFO com.godaddy.logger.LoggerTest – Logging Car; cost=70000.0; country=GERMANY; engine.name="V12"; make="Porsche"; model="911"; year=2015

All of the car object’s info is cleanly logged in a consistent way. We can easily search for the model property in our logs and we won’t be at the whim of spelling errors from forgetful developers. You can also see that our logger nests object properties in dot notation like engine.name="V12". To accomplish the same behavior using SLF4J, we would need to do something akin to the following:

Use the Car’s toString functionality:

Implement the Car object’s toString function:

String toString() {
     return "cost=" + cost + "; country=" + country + "; engine.name=" + (engine == null ? "null" : engine.getName()) ... etc.
}

Log the car via its toString() function:

logger.info(“Logging Car; {}”, car.toString());

Or use String formatting:

logger.info("Logging Car; cost={}; country={}; engine.name=\"{}\"; make=\"{}\"; model=\"{}\"; year={}", car.getCost(), car.getCountry(), car.getEngine() == null ? null : car.getEngine().getName(), car.getMake(), car.getModel(), car.getYear());

Our logger combats these unfortunate scenarios and many others by allowing you to set the recursive logging level, which defines the number of levels deep into a nested object you want logged, and takes object cycles into account so there isn’t infinite recursion.

SKIPPING SENSITIVE INFORMATION

The GoDaddy Logger provides annotation based logging scope support, giving you the ability to prevent fields/methods from being logged. If you don’t want to skip a field completely, but would rather log a hashed value, you can use an injectable hash processor to hash the values that are to be logged. Hashing can be useful when you want to correlate a piece of data consistently without logging the actual data value. For example:

import lombok.Data;

@Data
public class AnnotatedObject {
    private String notAnnotated;

    @LoggingScope(scope = Scope.SKIP)
    private String annotatedLogSkip;

    public String getNotAnnotatedMethod() {
        return "Not Annotated";
    }

    @LoggingScope(scope = Scope.SKIP)
    public String getAnnotatedLogSkipMethod() {
        return "Annotated";
    }

    @LoggingScope(scope = Scope.HASH)
    public String getCreditCardNumber() {
        return "1234-5678-9123-4567";
    }
}

If we were to log this object:

AnnotatedObject annotatedObject = new AnnotatedObject();
annotatedObject.setAnnotatedLogSkip("SKIP ME");
annotatedObject.setNotAnnotated("NOT ANNOTATED");

logger.with(annotatedObject).info("Annotation Logging");

The following would be output to the logs:

09:43:13.306 [main] INFO com.godaddy.logging.LoggerTest – Annotation Logging; creditCardNumber="5d4e923fe014cb34f4c7ed17b82d6c58"; notAnnotated="NOT ANNOTATED"; notAnnotatedMethod="Not Annotated"

Notice that the annotatedLogSkip value of “SKIP ME” is not logged. You can also see that the credit card number has been hashed. The GoDaddy Logger uses Guava’s MD5 hashing algorithm by default which is not cryptographically secure, but definitely fast. And you’re able to provide your own hashing algorithm when configuring the logger.

LOGGING CONTEXT

One of the more powerful features of the logger is that the ‘with’ syntax returns a new immutable captured logger. This means you can do something like this:

Logger contextLogger = logger.with("request-id", 123);
contextLogger.info("enter");

// .. Do Work

contextLogger.info("exit");

All logs generated off the captured logger will include the captured with statements. This lets you factor out common logging statements and cleans up your logs so you see what you really care about (and make fewer mistakes).

CONCLUSION

With consistent logging we can easily search through our logs and debug complicated issues with confidence. As an added bonus, since our log formatting is centralized and abstracted, we can also make team-wide or company-wide formatting shifts without impacting developers or existing code bases.

Logging is hard. There is a fine line between logging too much and too little. Logging is also best done while you write code vs. as an afterthought. We’ve really enjoyed using the GoDaddy Logger and it’s really made logging into a simple and unobtrusive task. We hope you take a look and if you find it useful for yourself or your team let us know!

For more information about the GoDaddy Logger, check out the GitHub project, or if you’re interested in working on these and other fun problems with us, check out our jobs page.