Category: Code

Functors in scala

A coworker of mine and I frequently talk about higher kinded types and category theory, and lament the lack of unified types in Scala: namely functors. A functor is a fancy name for a thing that can be mapped over. Wanting to abstract over something that is mappable comes up more often than you think. I don’t necessarily care that it’s an Option, or a List, or a whatever. I just care that it has a map.

We’re not the only ones who want this. Cats, Shapeless, Scalaz, all have implementations of functor. The downside there is that usually these definitions tend to leak throughout your ecosystem. I’ve written before about ecosystem and library management, and it’s an important thing to think about when working at a company of 50+ people. You need to think long and hard about putting dependencies on things. Sometimes you can, if those libraries have good versioning or back-compat stories, or if they expose lightweight API’s with heavyweight bindings that you can separate out.

Oftentimes these libraries aren’t really well suited for large scale use, so you’re forced to either replicate, denormalize, or otherwise hide away how those things come into play.

In either case, this post isn’t about that. I just wanted to know how the hell those libraries did the magic.

Let me lay out the final product first and we’ll break it down:

trait Functor[F[_]] {
  def map[A, B](f: F[A])(m: A => B): F[B]
}

object Functor {
  implicit class FunctorOps[F[_], A](f: F[A])(implicit functor: Functor[F]) {
    def map[B](m: A => B): F[B] = {
      functor.map(f)(m)
    }
  }

  implicit def iterableFunctor[T[X] <: Traversable[X]] = new Functor[T] {
    override def map[A, B](f: T[A])(m: A => B) = {
      f.map(m).asInstanceOf[T[B]]
    }
  }

  implicit def optionFunctor = new Functor[Option] {
    override def map[A, B](f: Option[A])(m: A => B) = {
      f.map(m)
    }
  }

  implicit def futureFunctor(implicit executionContext: ExecutionContext) = new Functor[Future] {
    override def map[A, B](f: Future[A])(m: A => B) = {
      f.map(m)
    }
  }
}

And no code is complete without a test…

class Tests extends FlatSpec with Matchers {

  import com.curalate.typelevel.Functor
  import com.curalate.typelevel.Functor._

  private def testMaps[T[_] : Functor](functor: T[Int]): T[Int] = {
    functor.map(x => x + 1)
  }

  "A test" should "run" in {
    testMaps(List(1)) shouldEqual List(2)

    testMaps(Some(1): Option[Int]) shouldEqual Some(2)

    testMaps(None: Option[Int]) shouldEqual None

    testMaps(Set(1)) shouldEqual Set(2)

    Await.result(testMaps(Future.successful(1)), Duration.Inf) shouldEqual 2
  }
}

How did we get here? First if you look at the definition of functor again

trait Functor[F[_]] {
  def map[A, B](f: F[A])(m: A => B): F[B]
}

We’re saying that

  1. Given a type F that contains some other unknown type (i.e. F is a box, like List, or Set)
  2. Define a map function from A to B and give me back a type of F of B

The nuanced part here is that the map takes an instance of F[A]. We need this to get all the types to be happy, since we have to specify somewhere that F[A] and A => B are paired together.

Let’s make a functor for list, since that one is pretty easy:

object Functor {
  implicit lazy val listFunctor = new Functor[List] {
    override def map[A, B](f: List[A])(m: A => B) = {
      f.map(m)
    }
  }
}

Now we can get an instance of functor for any List[T].

We could use it like this now:

def listMapper(f: Functor[List])(l: List[Int]) = {
  f.map(l)(_ + 1)
}

But that sort of sucks. I don’t want to know I have a list, that defeats the purpose of a functor!

What if we do

def intMapper[T[_]](f: Functor[T])(l: T[Int]) = {
  f.map(l)(_ + 1)
}

Kind of better. Now I have a higher kinded type that doesn’t care about what the box is. But I still need to somehow get an instance of a functor to do my mapping.

This is where the ops class comes in:

implicit class FunctorOps[F[_], A](f: F[A])(implicit functor: Functor[F]) {
  def map[B](m: A => B): F[B] = {
    functor.map(f)(m)
  }
}

This guy says: given a container, and a functor for that container, here is a helpful map function. It’s giving us an extension method on F[A] that adds map. You may wonder, well don’t all the things we’re mapping on already have a map function? And the answer is yes, but the compiler doesn’t know that since we’re dealing with only generics here!

Now, we can import our functor ops class and finally get that last bit to work:

class Tests extends FlatSpec with Matchers {

  import com.curalate.typelevel.Functor
  import com.curalate.typelevel.Functor._

  private def testMaps[T[_] : Functor](functor: T[Int]): T[Int] = {
    functor.map(x => x + 1)
  }

  "A test" should "run" in {
    testMaps(List(1)) shouldEqual List(2)

    testMaps(Some(1): Option[Int]) shouldEqual Some(2)

    testMaps(None: Option[Int]) shouldEqual None

    testMaps(Set(1)) shouldEqual Set(2)

    Await.result(testMaps(Future.successful(1)), Duration.Inf) shouldEqual 2
  }
}

Pulling it all together, we’re asking for a type of T that is a box of anything that has an implicit Functor[T] typeclass. We want to use the map method on the functor of T and that map method comes because we leverage the implicit FunctorOps.

It helps to think of functor not as an interface that a thing implements, but as a typeclass/extension of a thing. I.e. in order to get a map, you have to wrap something.
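To make the typeclass idea concrete, here is a minimal sketch using a hypothetical Box container that has no map method of its own. The Functor trait and FunctorOps class are repeated from above so the snippet stands alone; Box, boxFunctor, and increment are names made up for illustration and are not part of the original code.

```scala
import scala.language.higherKinds

trait Functor[F[_]] {
  def map[A, B](fa: F[A])(m: A => B): F[B]
}

object Functor {
  // Extension method: adds map to any F[A] that has a Functor[F] in scope
  implicit class FunctorOps[F[_], A](fa: F[A])(implicit functor: Functor[F]) {
    def map[B](m: A => B): F[B] = functor.map(fa)(m)
  }
}

// Hypothetical container (an assumption for this example): note it defines no map
final case class Box[A](value: A)

object Box {
  // The instance lives in Box's companion, so implicit search finds it without an import
  implicit val boxFunctor: Functor[Box] = new Functor[Box] {
    override def map[A, B](fa: Box[A])(m: A => B): Box[B] = Box(m(fa.value))
  }
}

object BoxExample {
  import Functor._

  // Works for any T with a Functor instance, not just Box
  def increment[T[_]: Functor](t: T[Int]): T[Int] = t.map(_ + 1)

  def main(args: Array[String]): Unit = {
    println(increment(Box(1)))
  }
}
```

Box never implements an interface here. The ability to map is bolted on from the outside by the instance in its companion, which is exactly the "wrap something to get a map" idea.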

Anyways, big thanks to Christian for helping me out.

Tracing High Volume Services

This post was originally posted at engineering.curalate.com

We like to think that building a service ecosystem is like stacking building blocks. You start with a function in your code. That function is hosted in a class. That class in a service. That service is hosted in a cluster. That cluster in a region. That region in a data center, etc. At each level there’s a myriad of challenges.

From the start, developers tend to use things like logging and metrics to debug their systems, but a certain class of problems crops up when you need to debug across services. From a debugging perspective, you’d like to have a higher projection of the view of the system: a linearized view of what requests are doing. I.e. You want to be able to see that service A called service B and service C called service D at the granularity of single requests.

Cross Service Logging

The simplest solution to this is to require that every call from service to service comes with some sort of trace identifier. Every incoming request into the system, whether from public API’s, client side requests, or even async daemon invoked timers/schedules/etc, generates a trace. This trace then gets propagated through the entire system. If you use this trace in all your log statements you can now correlate cross service calls.

How is this accomplished at Curalate? For the most part we use Finagle based services and the Twitter ecosystem has done a good job of providing the concept of a thread local TraceId and automatically propagating it to all other twitter-* components (yet another reason we like Finatra!).

All of our service clients automatically pull this thread local trace id out and populate a known HTTP header field that services then pick up and re-assume. For Finagle based clients this is auto-magick’d for you. For other clients that we use, like OkHttp, we had to add custom interceptors that pulled the trace from the thread local and set it on the request.
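As a rough illustration of what such an interceptor does, here is a library-free sketch. It deliberately does not use the Finagle or OkHttp APIs: CurrentTrace, OutgoingRequest, and TraceHeaderPropagation are hypothetical stand-ins, and a plain java ThreadLocal plays the role of the Twitter thread local. Only the X-B3-TraceId header name comes from our real setup.

```scala
// Hypothetical stand-in for the thread local trace id (an assumption for this sketch)
object CurrentTrace {
  private val traceId = new ThreadLocal[Option[String]] {
    override def initialValue(): Option[String] = None
  }

  def get: Option[String] = traceId.get()

  // Run body with the given trace id, restoring the previous value afterwards
  def let[T](id: String)(body: => T): T = {
    val previous = traceId.get()
    traceId.set(Some(id))
    try body finally traceId.set(previous)
  }
}

// Hypothetical minimal request model
final case class OutgoingRequest(url: String, headers: Map[String, String] = Map.empty)

object TraceHeaderPropagation {
  val TraceHeader = "X-B3-TraceId"

  // Client side: copy the thread local trace id onto the outgoing request
  def outgoing(request: OutgoingRequest): OutgoingRequest =
    CurrentTrace.get match {
      case Some(id) => request.copy(headers = request.headers + (TraceHeader -> id))
      case None     => request
    }

  // Server side: re-assume the trace id from the incoming headers while handling
  def incoming[T](headers: Map[String, String])(handler: => T): T =
    headers.get(TraceHeader) match {
      case Some(id) => CurrentTrace.let(id)(handler)
      case None     => handler
    }
}
```

The important property is that neither the caller nor the handler ever mentions the trace id explicitly; it rides along on the thread local on one side and the header on the other.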

Here is an example of the header being sent automatically as part of Zipkin based headers (which we re-use as our internal trace identifiers):

finagle_trace_id

Notice the X-B3-TraceId header. When a service receives this request it’ll re-assume the trace id and set its SLF4J MDC field of traceId to be that value. We can now reference the trace id in our logback.xml configuration, as in our STDOUT log configuration below:

<appender name="STDOUT-COLOR" class="ch.qos.logback.core.ConsoleAppender">
    <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
        <level>TRACE</level>
    </filter>
    <encoder>
        <pattern>%yellow(%d) [%magenta(%X{traceId})] [%thread] %highlight(%-5level) %cyan(%logger{36}) %marker - %msg%n</pattern>
    </encoder>
</appender>

And we can also send the trace id as a structured JSON field to Loggly.

Let’s look at an example from our own logs:

tid_example

What we’re seeing here is a system called media-api made a query to a system called networkinformationsvc. The underlying request carried a correlating trace id across the service boundaries and both systems logged to Loggly with the json.tid (transaction id) field populated. Now we can query our logs and get a linear time based view of what’s happening.

Thread local tracing

The trick here is to make sure that this implicit trace id that is pinned to the thread local of the initiating request properly moves from thread to thread as you make async calls. We don’t want anyone to have to ever remember to set the trace. It should just gracefully flow from thread to thread implicitly.

To make sure that traces hop properly between systems we had to enforce that everybody uses an ExecutionContext that safely captures the caller’s thread locals before executing. This is critical, otherwise you can make an async call and the trace id gets dropped. In that case, bye bye go the logs! It’s hyper important to always take an execution context and to never pin an execution context when it comes to async Scala code. Thankfully, we can make any execution context safe by wrapping it up in a delegate:

import com.twitter.util.Local
import scala.concurrent.ExecutionContext

/**
 * Wrapper around an existing ExecutionContext that makes it propagate MDC information.
 */
class PropagatingExecutionContextWrapper(wrapped: ExecutionContext)
  extends ProvidedExecutionContext { self =>

   override def prepare(): ExecutionContext = new ExecutionContext {
     // Save the call-site state
     private val context = Local.save()

     def execute(r: Runnable): Unit = self.execute(new Runnable {
       def run(): Unit = {
         // re-assume the captured call site thread locals
         Local.let(context) {
           r.run()
         }
       }
     })

     def reportFailure(t: Throwable): Unit = self.reportFailure(t)
   }

  override def execute(r: Runnable): Unit = wrapped.execute(r)

  override def reportFailure(t: Throwable): Unit = wrapped.reportFailure(t)
}

class TwitterExecutionContextProvider extends ExecutionContextProvider {
  /**
   * Safely wrap any execution context into one that properly passes context
   *
   * @param executionContext
   * @return
   */
  override def of(executionContext: ExecutionContext) = new PropagatingExecutionContextWrapper(executionContext)
}
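To see why the wrapper matters, here is a small self-contained demonstration that uses only the standard library. It is a simplified sketch, not our production code: TraceLocal and CapturingExecutionContext are hypothetical names, a java ThreadLocal stands in for the Twitter Local, and the capture happens in execute rather than prepare. That simplification is fine for directly submitted work, but callbacks registered on a still-pending future can be submitted from the completing thread, which is why the wrapper above captures in prepare instead.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Promise}
import scala.concurrent.duration._

// Hypothetical stand-in for the thread local that holds the trace id
object TraceLocal {
  private val current = new ThreadLocal[Option[String]] {
    override def initialValue(): Option[String] = None
  }
  def get: Option[String] = current.get()
  def set(id: Option[String]): Unit = current.set(id)
}

class CapturingExecutionContext(wrapped: ExecutionContext) extends ExecutionContext {
  override def execute(r: Runnable): Unit = {
    // Runs on the submitting thread: save the call-site state
    val captured = TraceLocal.get
    wrapped.execute(new Runnable {
      def run(): Unit = {
        // Runs on the worker thread: re-assume the captured state, then restore
        val previous = TraceLocal.get
        TraceLocal.set(captured)
        try r.run() finally TraceLocal.set(previous)
      }
    })
  }

  override def reportFailure(t: Throwable): Unit = wrapped.reportFailure(t)
}

object PropagationDemo {
  // Reports which trace id a task sees when run on the given context
  def traceSeenBy(ec: ExecutionContext): Option[String] = {
    val seen = Promise[Option[String]]()
    ec.execute(new Runnable { def run(): Unit = seen.success(TraceLocal.get) })
    Await.result(seen.future, 5.seconds)
  }

  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(1)
    val plain = ExecutionContext.fromExecutor(pool)
    val safe = new CapturingExecutionContext(plain)
    TraceLocal.set(Some("abc123"))
    try {
      println(traceSeenBy(plain)) // the worker thread has no trace id
      println(traceSeenBy(safe))  // the trace id followed the task
    } finally pool.shutdown()
  }
}
```

With the plain context the task sees None, and with the wrapped one it sees Some(abc123), which is the difference between losing and keeping your correlated logs.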

We’ve taken this trace wrapping concept and applied it to all kinds of executors like ExecutorService and ScheduledExecutorService. Technically we don’t really want to expose the internals of how we wrap traces, so we load an ExecutionContextProvider via the Java service loading mechanism and provide an API contract so that people can wrap executors without caring how they are wrapped:

/**
 * A provider that loads from the java service mechanism
 */
object ExecutionContextProvider {
  lazy val provider: ExecutionContextProvider = {
    Option(ServiceLoader.load(classOf[ExecutionContextProvider])).
      map(_.asScala).
      getOrElse(Nil).
      headOption.
      getOrElse(throw new MissingExecutionContextException)
  }
}

/**
 * Marker interfaces to provide contexts with custom logic. This
 * forces users to make sure to use the execution context providers that support request tracing
 * and maybe other tooling
 */
trait ProvidedExecutionContext extends ExecutionContext

/**
 * A context provider contract
 */
trait ExecutionContextProvider {
  def of(context: ExecutionContext): ProvidedExecutionContext

  ...
}

From a caller’s perspective they now do:

implicit val execContext = ExecutionContextProvider.provider.of(scala.concurrent.ExecutionContext.Implicits.global)

Which would wrap, in this example, the default scala context.

Service to Service dependency and performance tracing

Well that’s great! We have a way to safely and easily pass trace id’s, and we’ve tooled through our clients to all pass this trace id automatically, but this only gives us logging information. We’d really like to be able to leverage the trace information to get more interesting statistics such as service to service dependencies, performance across service hops, etc. Correlated logs are just the beginning of what we can do.

Zipkin is an open source tool that we’ve discussed here before so we won’t go too much into it, but suffice it to say that Zipkin hinges on us having proper trace identifiers. It samples incoming requests to determine if things should be traced or not (i.e. sent to Zipkin). By default, we have all our services send 0.1% of their requests to Zipkin to minimize impact on the service.

Let’s look at an example:

zipkin

In this Zipkin trace we can see that this batch call made a call to Dynamo. The whole call took 6 milliseconds and 4 of those milliseconds were spent calling Dynamo. We’ve tooled through all our external client dependencies with Zipkin trace information automatically using java dynamic proxies so that as we upgrade our external dep’s we get tracing on new functions as well.
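For readers who haven’t used dynamic proxies, here is a minimal sketch of the idea using java.lang.reflect.Proxy. It is not our actual Zipkin integration: TracingProxy, KeyValueClient, and the record callback are hypothetical, and the callback simply receives a method name and elapsed nanoseconds where real code would open and close a span.

```scala
import java.lang.reflect.{InvocationHandler, InvocationTargetException, Method, Proxy}
import scala.collection.mutable.ListBuffer
import scala.reflect.ClassTag

object TracingProxy {
  // Wraps target so every call made through interface T is timed and reported to record
  def wrap[T <: AnyRef](target: T)(record: (String, Long) => Unit)(implicit tag: ClassTag[T]): T = {
    val iface = tag.runtimeClass
    require(iface.isInterface, s"${iface.getName} must be an interface (a Scala trait works)")

    val handler = new InvocationHandler {
      override def invoke(proxy: Object, method: Method, args: Array[Object]): Object = {
        val start = System.nanoTime()
        try {
          // args is null for zero-argument methods
          if (args == null) method.invoke(target) else method.invoke(target, args: _*)
        } catch {
          // Unwrap so callers see the original exception
          case e: InvocationTargetException => throw e.getCause
        } finally {
          record(method.getName, System.nanoTime() - start)
        }
      }
    }

    Proxy.newProxyInstance(iface.getClassLoader, Array[Class[_]](iface), handler).asInstanceOf[T]
  }
}

// Hypothetical client interface standing in for an external dependency
trait KeyValueClient {
  def get(key: String): Option[String]
}

object TracingProxyExample {
  def main(args: Array[String]): Unit = {
    val real = new KeyValueClient {
      def get(key: String): Option[String] = if (key == "a") Some("1") else None
    }
    val calls = ListBuffer.empty[(String, Long)]
    // The interface must be given explicitly, otherwise T is inferred as the concrete class
    val traced = TracingProxy.wrap[KeyValueClient](real)((name, nanos) => calls += ((name, nanos)))
    println(traced.get("a"))
    println(calls.map(_._1))
  }
}
```

Because the handler intercepts by reflection rather than by hand-written delegation, any method added to the interface in a later version of the dependency is traced without further changes.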

If we dig further into the trace:

zipkin_w_trace

We can now see (highlighted) the trace ID and search in our logs for logs related to this trace.

Finding needles in the haystack

We have a way to correlate logs, and get sampled performance and dependency information between services via Zipkin. What we still can’t do yet is trace an individual piece of data flowing through high volume queues and streams.

Some of our services at Curalate process 5 to 10 thousand items a second. It’s just not fiscally prudent to log all that information to Loggly or emit unique metrics to our metrics system (DataDog). Still, we want to know at the event level where things are in the system, where they passed through, where they got dropped etc. We want to answer the question of

Where is identifier XYZ.123 in the system and where did it go and come from?

This is difficult to answer with the current tools we’ve discussed.

To solve this problem we have one more system in play. This is our high volume auditing system that lets us write and filter audit events at a large scale (100k req/s+). The basic architecture here is we have services write audit events via an Audit API which are funneled to Kinesis Firehose. The firehose stream buffers data for either 5 minutes or 128 MB (whichever comes first). When the buffer limit is reached, firehose dumps newline separated JSON in a flat file into S3. We have a lambda function that waits for S3 create events on the bucket, reads the JSON, then transforms the JSON events into Parquet, which is an efficient columnar storage format. The Parquet file is written back into S3 into a new folder with the naming scheme of

year=YYYY/month=MM/day=DD/hour=HH/minute=mm/<uuid>.parquet

Where the minutes are grouped in 5 minute intervals. This partition is then added to Athena, which is a managed map-reduce around PrestoDB, that lets you query large datasets in S3.
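As a sketch of how a key in that scheme could be built, the snippet below floors the minute to its 5 minute bucket. The names are hypothetical, and a few details are assumptions the scheme itself doesn’t spell out: timestamps are treated as UTC, fields are zero-padded, and each bucket is labeled by its starting minute.

```scala
import java.time.{Instant, ZoneOffset}
import java.util.UUID

object AuditPartition {
  // Partition folder for a timestamp, e.g. minutes 10 through 14 all land in minute=10
  def prefix(timestamp: Instant): String = {
    val t = timestamp.atZone(ZoneOffset.UTC)
    val minuteBucket = (t.getMinute / 5) * 5 // integer division floors to the bucket start
    f"year=${t.getYear}%04d/month=${t.getMonthValue}%02d/day=${t.getDayOfMonth}%02d/hour=${t.getHour}%02d/minute=$minuteBucket%02d"
  }

  // Full object key: partition folder plus a unique file name
  def key(timestamp: Instant, id: UUID = UUID.randomUUID()): String =
    s"${prefix(timestamp)}/$id.parquet"
}
```

For an arbitrary example timestamp of 2017-05-18T22:13:45Z, prefix yields year=2017/month=05/day=18/hour=22/minute=10. Keeping the partition values in the path is what lets a query restrict itself to a handful of folders instead of scanning everything.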

auditing_arch

What does this have to do with trace id’s? Each event emitted comes with a trace id that we can use to query back to logs or Zipkin or other correlating identifiers. This means that even if services aren’t logging to Loggly due to volume restrictions, we can still see how events trace through the system. Let’s look at an example where we find a specific network identifier from Instagram and see when it was data mined and when we added semantic image tags to it (via our vision APIs):

SELECT minute, app, message, timestamp, context
FROM curalateauditevents."audit_events"
WHERE context['network_id'] = '1584258444344170009_249075471' and context['network']='instagram'
and day=18 and hour=22
order by timestamp desc
limit 100

This is the Athena query. We’ve included the specific network ID and network we are looking for, as well as a limited partition scope.

athena_query

Notice the two highlights.

Starting at the second highlight there is a message that we augmented the piece of data. In our particular pipe we only augment data under specific circumstances (not every image is analyzed) and so it was important to see that some images were dropped and this one was augmented. Now we can definitely say “yes, item ABC was augmented but item DEF was not and here is why”. Awesome.

Moving upwards, the first highlight is how much data was scanned. This particular partition we looked through has 100MB of data, but we only searched through 2MB to find what we wanted (this is due to the optimization of Parquet). Athena is priced by how much data you scan at a cost of $5 per terabyte. So this query was pretty much free at a cost of roughly $0.00001. The total set of files across all the partitions for the past week is roughly 21GB spanning about 3.5B records. So even if we queried all the data, we’d only pay about $0.10. In fact, the biggest cost here isn’t in storage or query or lambda, it’s in firehose! Firehose charges you $0.029 per GB transferred. At this rate we pay 60 cents a week. The boss is going to be ok with that.

However, there are still some issues here. Remember the target scale is upwards of 100k req/s. At that scale we’re dealing with a LOT of data through Kinesis Firehose. That’s a lot of data into S3, a lot of IO reads to transform to Parquet, and a lot of opportunities to accidentally scan through tons of data in our Athena partitions with poorly written queries that loop over repeated data (even though we limit partitions to a 2 week TTL). We also now have issues of rate limiting with Kinesis Firehose.

On top of that, some services just pump so much repeated data that it’s not worth seeing it all the time. To that end we need some sort of way to do live filters on the streams. What we’ve done to solve this problem is leverage dynamically invoked Nashorn JavaScript filters. We load up filters from a known remote location at an interval of 30 seconds, and if a service is marked for filtering (i.e. it has a really high load and needs to be filtered) then it’ll run all of its audit events through the filter before they actually get sent to the downstream firehose. If an event fails the filter it’s discarded. If it passes, the event is annotated with which filter name it passed through and sent through the stream.

Filters are just YML files for us:

name: "Filter name"
expiration: <Optional DateTime. Epoch or string datetime of ISO formats parseable by JODA>
js: |
    function filter(event) {
        // javascript that returns a boolean
    }

And an example filter may look like

name: "anton_client_filter"
js: |
    function filter(event) {
      var client = event.context.get("client_id")

      return client != null && client == "3136"
    }

In this filter only events that are marked with the client id of my client will pass through. Some systems don’t need to be filtered so all their events pass through anyway.
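The filtering flow itself can be sketched without Nashorn. In the snippet below a plain Scala predicate stands in for the compiled javascript function, and AuditEvent, EventFilter, and FilterPipeline are hypothetical names rather than our real classes. Two behaviors are assumptions made for the sketch: an event passes if any loaded filter accepts it, and a filter that throws is treated as not matching.

```scala
import scala.util.Try

// Hypothetical event model; traceNames mirrors the trace_names column we query on
final case class AuditEvent(
  app: String,
  message: String,
  context: Map[String, String],
  traceNames: List[String] = Nil
)

// A named predicate standing in for a loaded javascript filter
final case class EventFilter(name: String, accepts: AuditEvent => Boolean)

object FilterPipeline {
  // None means the event is discarded; Some means it is sent on to the stream
  def run(event: AuditEvent, filters: Seq[EventFilter], filteringEnabled: Boolean): Option[AuditEvent] =
    if (!filteringEnabled) Some(event) // unfiltered services send everything
    else {
      // A throwing filter must never take down the caller, so it counts as a miss
      val matched = filters.filter(f => Try(f.accepts(event)).getOrElse(false)).map(_.name).toList
      if (matched.isEmpty) None
      else Some(event.copy(traceNames = event.traceNames ++ matched))
    }
}

object FilterPipelineExample {
  // Scala equivalent of the anton_client_filter example
  val antonClientFilter: EventFilter =
    EventFilter("anton_client_filter", e => e.context.get("client_id").contains("3136"))

  def main(args: Array[String]): Unit = {
    val mine = AuditEvent("media-api", "augmented", Map("client_id" -> "3136"))
    val other = AuditEvent("media-api", "augmented", Map("client_id" -> "42"))
    println(FilterPipeline.run(mine, Seq(antonClientFilter), filteringEnabled = true).map(_.traceNames))
    println(FilterPipeline.run(other, Seq(antonClientFilter), filteringEnabled = true))
  }
}
```

Annotating the surviving event with the filter names is what makes it possible to later ask for everything a particular filter let through.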

Now we can write queries like

SELECT minute, app, message, timestamp, context
FROM curalateauditevents."audit_events"
WHERE contains(trace_names, 'anton_client_filter')
and day=18 and hour=22
limit 100

To get events that were tagged with my filter in the current partition. From there, we now can do other exploratory queries to find related data (either by trace id or by other identifiers related to the data we care about).

Let’s look at some graphs that show how dramatic this filtering can be

filtering

Here the purple line is one of our data mining ingestion endpoints. It’s pumping a lot of data to firehose, most of which is repeated over time and so isn’t super useful to get all the input from. The moment the graph drops is when the yml file was uploaded with a filter to add filtering to the service. The blue line is a downstream service that gets data after debouncing and other processing. Given its load is a lot less we don’t care so much that it is sending all its data downstream. You can see the purple line slow to a trickle later on when the filter kicks in and data starts matching it.

Caveats with Nashorn

Building the system out, there were a few interesting caveats when using Nashorn in a high volume pipeline like this.

The first was that subtle differences in javascript can have massive performance impacts. Let’s look at some examples and benchmark them.

function filter(event) {
  var anton = {
    "136742": true,
    "153353": true
  }

  var mineable = event.context.get("mineable_id")

  return mineable != null && anton[mineable]
}

The JMH benchmarks of running this code are:

[info] FiltersBenchmark.testInvoke  thrpt   20     1027.409 ±      29.922  ops/s
[info] FiltersBenchmark.testInvoke   avgt   20  1484234.075 ± 1783689.007  ns/op

What?? Only about 1,000 ops/second (the ± column is the error, not the score).

Let’s make some adjustments to the filter, given our internal system loads the javascript into an isolated scope per filter and then re-invokes just the function filter each time (letting us safely create global objects and pay heavy prices for things once):

var anton = {
  "136742": true,
  "153353": true
}

function filter(event) {
  var mineable = event.context.get("mineable_id")

  return mineable != null && anton[mineable]
}

[info] FiltersBenchmark.testInvoke  thrpt   20  7391161.402 ± 206020.703  ops/s
[info] FiltersBenchmark.testInvoke   avgt   20    14879.890 ±   8087.179  ns/op

Ah, much better! Roughly 7.4 million ops/sec.

If we use java constructs:

function filter(event) {
  var anton = new java.util.HashSet();
  anton.add("136742")
  anton.add("153353")

  var mineable = event.context.get("mineable_id")

  return mineable != null && anton.contains(mineable)
}

[info] FiltersBenchmark.testInvoke  thrpt   20  5662799.317 ± 301113.837  ops/s
[info] FiltersBenchmark.testInvoke   avgt   20    41963.710 ±  11349.277  ns/op

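Still fast at roughly 5.7 million ops/sec, though by these numbers a bit slower than the version with the object literal hoisted out of the function.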

Something is clearly up with the anonymous object creation in Nashorn. Needless to say, benchmarking is important, especially when these filters are going to be dynamically injected into every single service we have. We need them to be performant, sandboxed, and safe to fail.

For that we make sure everything runs in its own engine scope in a separate execution context isolated from the main running code and is fired off asynchronously so as not to block the main calling thread. This is also where we have monitoring and alerting for when someone uploads a non-performant filter so we can investigate and mitigate quickly.

For example, the discovery of the poorly performing json object came from this alert:

high_cpu

Conclusion

Tracing is hard and it’s incredibly difficult to tool through after the fact if you start to build service architectures without this in mind from the get go. Tooling trace identifiers through the system from the beginning sets you up for success in building more interesting debugging infrastructure that isn’t always possible without that. When building larger service ecosystems it’s important to keep in mind how to inspect things at varying granularity levels. Sometimes building custom tools to help inspect the systems is worth the effort, especially if they help debug complicated escalations or data inconsistencies.

From Thrift to Finatra

Originally posted on the curalate engineering blog

There are a million and one ways to do (micro-)services, each with a million and one pitfalls. At Curalate, we’ve been on a long journey of splitting out our monolith into composable and simple services. It’s never easy, as there are a lot of advantages to having a monolith. Things like refactoring, code-reuse, deployment, versioning, rollbacks, are all atomic in a monolith. But there are a lot of disadvantages as well. Monoliths encourage poor factoring, bugs in one part of the codebase force rollbacks/changes of the entire application, reasoning about the application in general becomes difficult, build times are slow, transient build errors increase, etc.

To that end our first foray into services was built on top of Twitter’s Finagle stack. If you go to the page and can’t figure out what exactly finagle does, I don’t blame you. The documentation is lackluster and in and of itself is quite low-level. Finagle defines a service as a function that transforms a request into a response, and composes services with filters that manipulate requests/responses themselves. It’s a clean abstraction, given that this is basically what all web service frameworks do.
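The service-as-a-function idea is easy to sketch in plain Scala. The snippet below is a simplified model, not Finagle’s actual API: Finagle has its own Service and Filter classes (with more type parameters) built on Twitter futures, whereas this sketch uses a type alias and scala.concurrent.Future. All names here are made up for illustration.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ServiceSketch {
  // A service is just an asynchronous function from request to response
  type Service[Req, Rep] = Req => Future[Rep]

  // A filter sees the request, decides how to call the next service,
  // and may transform the response on the way back
  trait Filter[Req, Rep] { self =>
    def apply(request: Req, next: Service[Req, Rep]): Future[Rep]

    // Composing a filter with a service yields another service
    def andThen(service: Service[Req, Rep]): Service[Req, Rep] =
      request => self(request, service)
  }

  val echo: Service[String, String] =
    request => Future.successful(s"echo: $request")

  // Manipulates the request before it reaches the service
  val trimRequest: Filter[String, String] = new Filter[String, String] {
    def apply(request: String, next: Service[String, String]): Future[String] =
      next(request.trim)
  }

  // Manipulates the response after the service has produced it
  def upperCaseResponse(implicit ec: ExecutionContext): Filter[String, String] =
    new Filter[String, String] {
      def apply(request: String, next: Service[String, String]): Future[String] =
        next(request).map(_.toUpperCase)
    }

  def main(args: Array[String]): Unit = {
    import ExecutionContext.Implicits.global
    val service = trimRequest.andThen(upperCaseResponse.andThen(echo))
    println(Await.result(service("  hello  "), 1.second)) // prints "ECHO: HELLO"
  }
}
```

Because a filter composed with a service is itself a service, cross-cutting concerns like retries, metrics, and tracing can be stacked without the underlying service knowing about them.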

Thrift

Finagle by itself isn’t super opinionated. It gives you building blocks to build services (service discovery, circuit breaking, monitoring/metrics, varying protocols, etc) but doesn’t give you much else. Our first set of services built on finagle used Thrift over HTTP. Thrift, similar to protobuf, is an intermediate declarative language that creates RPC style services. For example:

namespace java tutorial
namespace py tutorial

typedef i32 int // We can use typedef to get pretty names for the types we are using
service MultiplicationService
{
        int multiply(1:int n1, 2:int n2),
}

This will create an RPC service called MultiplicationService whose multiply method takes 2 parameters. Our implementation at Curalate hosted Thrift over HTTP (serializing Thrift as JSON) since all our services are web based behind ELB’s in AWS.

We have a lot of services at Curalate that use Thrift, but we’ve found a few shortcomings:

Model Reuse

Thrift forces you to use primitives when defining service contracts, which makes it difficult to share lightweight models (with potentially useful utilities) to consumers. We’ve ended up doing a lot of mapping between generated Thrift types and shared model types. Curalate’s backend services are all written in Scala, so we don’t have the same issues that a company like Facebook (who invented Thrift) may have with varying languages needing easy access to RPC.

Requiring a client

Many times you want to be able to interact with a service without needing access to a client. Needing a client has gotten developers used to cloning service repositories, building the entire service, then entering a Scala REPL in order to interact with a service. As our service surface area expands, it’s not always feasible to expect one developer to build another developer’s service (conflicting java versions, missing SBT/Maven dependencies or settings, etc). The client requirement has led to services taking heavyweight dependencies on other services and leaking dependencies. While Thrift doesn’t force you to do this, it has been a side effect of the extra love and care it takes to generate a Thrift client properly, either by distributing Thrift files in a jar or otherwise.

Over the wire inspection

With Thrift-over-HTTP, inspecting requests is difficult. This is due to the fact that these services use Thrift serialization, which unlike JSON, isn’t human-readable.

Because Thrift over HTTP is all POSTs to /, tracing access and investigating ELB logs becomes a jumbled mess of trying to correlate times and IP’s to other parts of our logging infrastructure. The POST issue is frustrating, because it’s impossible for us to do any semantic smart caching, such as being able to insert caches at the serving layer for retrieval calls. In a pure HTTP world, we could insert a cache for heavily used GETs given a GET is idempotent.

RPC API design

Regardless of Thrift, RPC encourages poorly unified API’s with lots of specific endpoints that don’t always jibe. We have many services that have method topologies that are poorly composable. A well designed API, and cluster of API’s, should gently guide you to getting the data you need. In an ideal world if you get an ID in a payload response for a data object, there should be an endpoint to get more information about that ID. However, in the RPC world we end up with a batch call here, a specific RPC call there, sometimes requiring stitching several calls to get data that should have been a simple domain level call.

Internal vs External service writing

We have a lot of public REST API’s and they are written using the Lift framework (some of our oldest code). Developers moving from internal to external services have to shift paradigms and move from writing REST with JSON to RPC with Thrift.

Overall Thrift is a great piece of technology, but after using it for a year we found that it’s not necessarily for us. All of these things have prompted a shift to writing REST style services.

Finatra

Finatra is an HTTP API framework built on top of Finagle. Because it’s still Finagle, we haven’t lost any of our operational knowledge of the underlying framework, but instead we can now write lightweight HTTP API’s with JSON.

With Finatra, all our new services have Swagger automatically enabled so API exploration is simple. And since it’s just plain JSON, we can now use Postman to debug and inspect APIs (as well as view requests in Charles or other proxies).

With REST we can still distribute lightweight clients, or more importantly, if there are dependency conflicts a service consumer can very quickly roll an HTTP client to a service. Our ELB logs now make sense and our new API’s are unified in their verbiage (GET vs POST vs PUT vs DELETE) and if we want to write RPC for a particular service we still can.

There are a few other things we like about Finatra. For those developers coming from a background of writing HTTP services, Finatra feels familiar with the concept of controllers, filters, unified test-bed for spinning up build verification tests (local in memory servers), dependency injection (via Guice) baked in, sane serialization using Jackson, etc. It’s hard to do the wrong thing given that it builds strong production level opinions onto Finagle. And thankfully those opinions are ones we share at Curalate!

We’re not in bad company — Twitter, Duolingo, and others are using Finatra in production.

The HTTP driver pattern

Yet another SOA blog post, this time about calling services. I’ve seen a lot of posts, articles, even books, on how to write services but not much on a good way to call them. It may seem trivial: isn’t calling a service a matter of making a web request to one? Yes, it is, but in a larger organization it’s not always so trivial.

Distributing fat clients

The problem I ran into was that the service stack in use at my organization provided a feature rich client as an artifact of a service’s build. It had retries, metrics, tracing with zipkin, etc. But it also pulled in things like finagle, netty, jackson, and each service may be distributing slightly different versions of all of these dependencies. When you start to consume 3, 4, 5 or more clients in your own service, suddenly you’ve gotten into an intractable mess of dependencies. Sometimes there’s no actual way to resolve them all without forcing upgrades in other services! That… sucks. It violates the idea of services in that my service is now coupled to your service.

You don’t want to force service owners to have to write clients for each service they want to call. That’d be a big waste of time and duplicated effort. If your organization is mono-lingual (i.e. all java/scala/whatever) then it’s still worth providing a feature rich client that has the sane things built in: retries, metrics, tracing, fast fail, serialization, etc. But you don’t want services leaking all the nuts and bolts to each other.

One solution is to auto generate clients server side. This is akin to what WCF does, or projects like swagger, thrift for RPC, etc. The downside here is that the generated code is usually pretty nasty and sometimes it’s hard to plug in and augment the clients with custom tracing, correlation tracking, etc. Other times the API itself might need a few nicety helper methods that you don’t want to expose in the raw API itself. But in the auto generated world, you can’t do this.

There are other projects like Retrofit that look like they solve the problem since your client is just an interface and its only dependency is OkHttp. But retrofit isn’t scala friendly (None’s need custom support, default arguments in methods are not properly intercepted, etc). You’re also bound to the back-compat story of retrofit/okhttp, assuming that they can do things like make sure older versions live side by side together.

In practice, I found that retrofit (even with scala’s issues) didn’t work well in a distributed services environment where everyone was at wildly different versions of things.

Abstracting HTTP

However, taking the idea from retrofit we can abstract away http calls with an http driver. Http really isn’t that complicated, especially for how it’s used in conjunction with service to service calls:

import scala.concurrent.{ExecutionContext, Future}

case class ApiRequest(
  path: String,
  queryParams: Seq[(String, Option[String])] = Nil,
  headers: Seq[(String, Option[String])] = Nil,
  options: Option[RequestOptions] = None
) 

case class RequestOptions(
  contentType: Option[String],
  characterSet: String = "utf-8"
)

/**
 * A response with a body
 *
 * @param data     The deserialized data
 * @param response The raw http response
 * @tparam T The type to deserialize
 */
case class BodyResponse[T](data: T, response: RawResponse)

/**
 * A raw response that contains code, the body and headers
 *
 * @param code
 * @param body
 * @param headers
 */
case class RawResponse(code: Int, body: String, headers: Map[String, List[String]])

/**
 * An http error that all drivers should throw on non 2xx
 *
 * @param code  The code
 * @param body  An optional body
 * @param error The inner exception (may be driver specific)
 */
case class HttpError(code: Int, body: Option[String], error: Exception)
  extends Exception(s"Error ${code}, body: ${body}", error)

/**
 * Marker trait indicating an http client
 */
trait HttpClient

/**
 * The simplest HTTP Driver. This is used to abstract libraries that call out over the wire.
 *
 * Anyone can create a driver as long as it implements this interface
 */
trait HttpDriver {
  val serializer: HttpSerializer

  def get[TRes: Manifest](
    request: ApiRequest
  )(implicit executionContext: ExecutionContext): Future[BodyResponse[TRes]]

  def post[TReq: Manifest, TRes: Manifest](
    request: ApiRequest,
    body: Option[TReq]
  )(implicit executionContext: ExecutionContext): Future[BodyResponse[TRes]]

  def put[TReq: Manifest, TRes: Manifest](
    request: ApiRequest,
    body: Option[TReq]
  )(implicit executionContext: ExecutionContext): Future[BodyResponse[TRes]]

  def patch[TReq: Manifest, TRes: Manifest](
    request: ApiRequest,
    body: Option[TReq]
  )(implicit executionContext: ExecutionContext): Future[BodyResponse[TRes]]

  def custom[TReq: Manifest, TRes: Manifest](
    method: Methods,
    request: ApiRequest,
    body: Option[TReq]
  )(implicit executionContext: ExecutionContext): Future[BodyResponse[TRes]]

  def delete[TRes: Manifest](
    request: ApiRequest
  )(implicit executionContext: ExecutionContext): Future[BodyResponse[TRes]]

  def bytesRaw[TRes: Manifest](
    method: Methods,
    request: ApiRequest,
    body: Option[Array[Byte]]
  )(implicit executionContext: ExecutionContext): Future[BodyResponse[TRes]]
}

Service owners who want to distribute a client can create clients that have no dependencies (other than the driver definition). Platform maintainers, like myself, can be diligent about making sure the driver interface never breaks, or if it does, that it is broken in a new namespace such that different versions can peacefully co-exist in the same process.

An example client can now look like

class ServiceClient(driver: HttpDriver) {
  def ping()(implicit executionContext: ExecutionContext): Future[Unit] = {
    driver.get[Unit](ApiRequest("/health")).map(_.data)
  }
}

But we still need to provide an implementation of a driver. This is where we can decouple things and provide drivers that are properly tooled with all the fatness we want (netty/finagle/zipkin tracing/monitoring/etc) and service owners can bind their clients to whatever driver they want. Those provided implementations can live in their own shared library that only services bind to (not service clients! i.e. only the terminal endpoints in the dependency graph).
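To make the split between client and driver concrete, here is a minimal sketch of a dependency-free driver. Everything in it is a trimmed-down, hypothetical stand-in for the types above (MiniDriver, InMemoryDriver and the single-field ApiRequest are my names, only get is modelled, and there is no serializer), so treat it as an illustration rather than the real driver.

```scala
import scala.concurrent.{ExecutionContext, Future}

object DriverSketch {
  // Trimmed-down, hypothetical stand-ins for the request/response types above
  final case class ApiRequest(path: String)
  final case class RawResponse(code: Int, body: String)
  final case class HttpError(code: Int, body: Option[String])
    extends Exception(s"Error $code, body: $body")

  // Only `get` is modelled, and it returns the raw response (no serializer)
  trait MiniDriver {
    def get(request: ApiRequest)(implicit ec: ExecutionContext): Future[RawResponse]
  }

  // A driver with no network stack at all: it answers from a map of canned responses
  class InMemoryDriver(routes: Map[String, RawResponse]) extends MiniDriver {
    override def get(request: ApiRequest)(implicit ec: ExecutionContext): Future[RawResponse] =
      routes.get(request.path) match {
        case Some(ok) if ok.code >= 200 && ok.code < 300 => Future.successful(ok)
        case Some(bad) => Future.failed(HttpError(bad.code, Some(bad.body)))
        case None => Future.failed(HttpError(404, None))
      }
  }

  // The client only ever sees the trait, never the implementation behind it
  class ServiceClient(driver: MiniDriver) {
    def ping()(implicit ec: ExecutionContext): Future[Int] =
      driver.get(ApiRequest("/health")).map(_.code)
  }
}
```

A service would hand the client a fully tooled driver, while a unit test can hand it something like InMemoryDriver; the client code is the same in both cases.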

There are a few advantages here:

  • Clients can be distributed at multiple scala versions without dependency conflicts
  • It’s much simpler to version manage and back-compat an interface/trait than it is an entire lib
  • Default drivers that do the right thing can be provided by the service framework, and back compat doesn’t need to be taken into account there since the only consumer is the service (it never leaks).
  • Drivers are simple to use, so if someone needs to roll their own client it’s really simple to do it

Custom errors

We can do some other cool stuff now too, given we’ve abstracted away how to call http code. Another common issue with clients is dealing with meaningful errors that aren’t just the basic http 5xx/4xx codes. For example, if you throw a 409 conflict you may want the client to actually receive a WidgetInIncorrectState exception for some calls, and in other calls maybe a FooBarInUse error that contains more semantic information. Basically overloading what a 409 means for a particular call/query. One way of doing this is with a discriminator in the error body:

HTTP 409 response:
{
   "code": "WidgetInIncorrectState",
   "widgetName": "foo",
   "widgetSize": 1234
}

Given we don’t want client code pulling in a json library to do json parsing, the driver needs to support context aware deserialization.

To do that, I’ve exposed a MultiType object that defines

  • Given a path into the json object, which field defines the discriminator
  • Given a discriminator, which type to deserialize to
  • Which http error code to apply all this to

And it looks like:

/**
 * A type representing deserialization of multiple types.
 *
 * @param discriminatorField The field that represents the textual "key" of what the subtype is. Nested fields can be located using
 *                           json path format of / delimited. I.e /foo/bar
 * @param pathTypes          The lookup of the result of the discriminatorField to the subtype mapper
 * @param defaultType        An optional fallback subtype, used by the serializer when the discriminator value has no entry in pathTypes
 * @tparam T The supertype of all the subtypes
 */
case class MultiType[T](
  discriminatorField: String,
  pathTypes: Map[String, SubType[_ <: T]],
  defaultType: Option[SubType[_ <: T]] = None
)

/**
 * Represents a subtype as part of a multitype mapping
 *
 * @param path The optional json sub path (slash delimited) to deserialize the type as.
 * @tparam T The type to deserialize
 */
case class SubType[T: Manifest](path: Option[String] = None) {
  val clazz = manifest[T].runtimeClass.asInstanceOf[Class[T]]
}

Using this in a client looks like:

class ServiceClient(driver: HttpDriver) {
  val errorMappers = MultiType[ApiException](discriminatorField = "code", Map(
    "invalidData" -> SubType[InvalidDataException]()
  ))

  def ping()(implicit executionContext: ExecutionContext): Future[Unit] = {
    driver.get[Unit](ApiRequest("/health")).map(_.data).failWithOnCode(500, errorMappers)
  }
}

This is saying that when I get the value invalidData in the json response of field code on an http 500 error, to actually throw an InvalidDataException in the client.

How does this work? Well, just like the http driver, we’ve abstracted the serializer, and that’s all plugged in by the service consumer:

case class DiscriminatorDoesntExistException(msg: String) extends Exception(msg)

object JacksonHttpSerializer {
  implicit def jacksonToHttpSerializer(jacksonSerializer: JacksonSerializer): HttpSerializer = {
    new JacksonHttpSerializer(jacksonSerializer)
  }
}

class JacksonHttpSerializer(jackson: JacksonSerializer = new JacksonSerializer()) extends HttpSerializer {
  override def fromDiscriminator[SuperType](multiType: MultiType[SuperType])(str: String): SuperType = {
    val tree = jackson.objectMapper.readTree(str)

    val node = tree.at(addPrefix(multiType.discriminatorField, "/"))

    val subType = multiType.pathTypes.get(node.textValue()).orElse(multiType.defaultType).getOrElse {
      throw DiscriminatorDoesntExistException(s"Discriminator ${multiType.discriminatorField} does not exist")
    }

    val treeToDeserialize = subType.path.map(m => tree.at(addPrefix(m, "/"))).getOrElse(tree)

    jackson.objectMapper.treeToValue(treeToDeserialize, subType.clazz)
  }

  override def toString[T](data: T): String = {
    jackson.toJson(data)
  }

  override def fromString[T: Manifest](str: String): T = {
    jackson.fromJson(str)
  }

  private def addPrefix(s: String, p: String) = {
    p + s.stripPrefix(p)
  }
}
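Stripped of Jackson, the discriminator lookup is just two map lookups followed by a constructor call. The sketch below shows that shape under some loud assumptions: the error body is taken to be already parsed into a flat Map[String, String], the builders map stands in for SubType, and FooBarInUse with its name field is made up for illustration. It returns an Either instead of throwing so the failure cases are visible.

```scala
object DiscriminatorSketch {
  sealed trait ApiException
  final case class WidgetInIncorrectState(widgetName: String) extends ApiException
  final case class FooBarInUse(name: String) extends ApiException

  // Assumption: the error body has already been parsed into a flat field map
  type Fields = Map[String, String]

  // Hypothetical simplification of MultiType: each discriminator maps to a builder function
  final case class MultiType[T](discriminatorField: String, builders: Map[String, Fields => T])

  val errorMappers: MultiType[ApiException] = MultiType[ApiException](
    discriminatorField = "code",
    builders = Map[String, Fields => ApiException](
      "WidgetInIncorrectState" -> ((f: Fields) => WidgetInIncorrectState(f.getOrElse("widgetName", ""))),
      "FooBarInUse" -> ((f: Fields) => FooBarInUse(f.getOrElse("name", "")))
    )
  )

  // Read the discriminator, pick the builder for it, then build the typed error
  def fromDiscriminator[T](multiType: MultiType[T])(fields: Fields): Either[String, T] =
    for {
      key <- fields.get(multiType.discriminatorField)
        .toRight(s"Discriminator ${multiType.discriminatorField} does not exist")
      build <- multiType.builders.get(key).toRight(s"No subtype registered for $key")
    } yield build(fields)
}
```

The real serializer does the same thing against a JSON tree, which is why the client never needs a json library of its own.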

Inherent issues

While there are a lot of goodies in abstracting serialization and http calling into a library API provided with implementations (drivers), it does handicap the clients a little bit. Things like doing custom manipulation of the raw response, any sort of business logic, adding other libraries, etc is really frowned upon. I’d argue this is a good thing and that this should all be handled at the service level since a client is always a nice to have and not a requirement.

Conclusion

The ultimate goal in SOA is separation. But 100% separation should not mean copy-pasting things, reinventing the wheel, or not sharing any code. It just means you need to build the proper lightweight abstractions to help keep strong barriers between services without creating a distributed monolith.

With the http driver abstraction pattern it’s now easy to provide drivers that use finagle-http under the hood, or okhttp, or apache http, etc. Client writers can share their model and client code with helpful utilities without leaking dependencies. And most importantly, service owners can update dependencies and move to new scala versions without fearing that their dependencies are going to cause runtime or compile time issues against pulled in clients, all while still iterating quickly and safely.

Bit packing Pacman

Haven’t posted in a while, since I’ve been heads down in building a lot of cool tooling at work (blog posts coming), but had a chance to mess around a bit with something that came up in an interview question this week.

I frequently ask candidates a high level design question to build PacMan. Games like pacman are fun because on the surface they are very simple, but if you don’t structure your entities and their interactions correctly the design falls apart.

At some point during the interview we had scaled the question up such that there was now a problem of knowing, at a particular point in the game, what was nearby. For example, if the board is 100000 x 100000 (10 billion elements) how efficiently can we determine if there is a nugget/wall next to us? One option is to store all of these entities in a 2d array and just access the neighbors. However, if the entity is any non trivial object, then we now have at minimum 16 bytes per element. That means we’re storing 160 gigs to access the board. Probably not something we can realistically do on commodity hardware.

Given we’re answering only a “is something there or not” question, one option is to bit pack the answer. In this sense you can leverage that each bit represents a coordinate in your grid. For example in a 2D grid

0 1
2 3

These positions could be represented by the binary value at that bit:

0 = 0b0001
1 = 0b0010
2 = 0b0100
3 = 0b1000

If we do that, and we store a list of longs (64 bits, 8 bytes) then to store 10 billion elements we need:

private val maxBits = maxX.toLong * maxY
private val requiredLongs = ((maxBits / 64) + 1).toInt

Which ends up being 156,250,001 longs, which in turn is about 1.25 GB. That’s… a big savings. Considering that in the trivial form we stored 10,000,000,000 objects at 16 bytes or more each, this is roughly a 128x reduction. (The multiplication has to be widened to a Long first: as plain Ints, 100000 * 100000 overflows and silently undercounts the longs needed.)
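The sizing arithmetic is easy to check directly, and it is worth doing because of an Int overflow trap. In the sketch below (the object and value names are mine) the plain Int product of 100000 * 100000 wraps around to 1,410,065,408, which is where a figure of 22,032,273 longs comes from; widening to Long first gives the real requirement.

```scala
object BoardSizing {
  val maxX = 100000
  val maxY = 100000

  // Int multiplication wraps modulo 2^32, so this is NOT 10 billion
  val overflowedBits: Int = maxX * maxY
  val undercountedLongs: Int = (overflowedBits / 64) + 1

  // Widening one operand to Long keeps the full product
  val maxBits: Long = maxX.toLong * maxY
  val requiredLongs: Long = (maxBits / 64) + 1

  // 8 bytes per long
  val requiredBytes: Long = requiredLongs * 8
}
```

requiredLongs still fits comfortably in an Int, so a single Array[Long] can hold the whole board as long as the JVM has a bit more than 1.25 GB of heap available.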

Now, one thing the candidate brought up (which is a great point) is that this makes working with the values much more difficult. The answer here is to provide a higher level API that hides away the hard bits.

I figured today I’d sit down and do just that. We need to be able to do a few things:

  1. Find out how many longs to store
  2. Find out given a coordinate which long it belongs to
  3. In that long toggle the bit representing the coordinate if we want to set/unset it

class TwoDBinPacker(maxX: Int, maxY: Int) {
  // Widen to Long before multiplying: 100000 * 100000 overflows an Int
  private val maxBits = maxX.toLong * maxY
  private val requiredLongs = ((maxBits / 64) + 1).toInt
  private val longArray = new Array[Long](requiredLongs)

  def get(x: Int, y: Int): Boolean = {
    longAtPosition(x, y).value == 1
  }

  def set(x: Int, y: Int, value: Boolean): Unit = {
    val p = longAtPosition(x, y)

    longArray(p.index) = p.set(value)
  }

  private def longAtPosition(x: Int, y: Int): BitValue = {
    // Row-major position of the coordinate, computed as a Long to avoid overflow
    val flattenedPosition = y.toLong * maxX + x

    val longAtPosition = (flattenedPosition / 64).toInt

    val bitAtPosition = (flattenedPosition % 64).toInt

    BitValue(longAtPosition, longArray(longAtPosition), bitAtPosition)
  }
}

With the helper class of a BitValue looking like:

case class BitValue(index: Int, container: Long, bitNumber: Int) {
  val value = (container >> bitNumber) & 1

  def set(boolean: Boolean): Long = {
    // The mask must be a Long: an Int shift (1 << n) wraps around for n >= 32
    if (boolean) {
      val maskAt = 1L << bitNumber

      container | maskAt
    } else {
      val maskAt = ~(1L << bitNumber)

      container & maskAt
    }
  }
}
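One detail that is easy to get wrong in the masks is the type of the literal being shifted. On the JVM an Int shift only looks at the low 5 bits of the shift count, so shifting the Int 1 by 40 behaves like shifting it by 8. A tiny sketch (names are mine) showing the difference, and a set/clear round trip on bit 40:

```scala
object ShiftPitfall {
  // Int shift: only the low 5 bits of the count are used, so 40 behaves like 8
  val intMask: Long = 1 << 40

  // Long shift: the low 6 bits of the count are used, so bit 40 really is bit 40
  val longMask: Long = 1L << 40

  // Setting and then clearing bit 40 of an empty word with the Long mask
  val withBitSet: Long = 0L | longMask
  val cleared: Long = withBitSet & ~longMask
}
```

Any coordinate that lands in the upper half of a long needs the Long form of the mask, otherwise the wrong bit gets toggled.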

At this point we can drive a scalatest:

"Bit packer" should "pack large sets (10 billion!)" in {
  val packer = new TwoDBinPacker(100000, 100000)

  packer.set(0, 0, true)
  packer.set(200, 400, true)

  assert(packer.get(0, 0))
  assert(packer.get(200, 400))
  assert(!packer.get(99999, 88888))
}

And this test runs in 80ms.

Now, this is a pretty naive way of doing things, since we are potentially storing tons of unused longs. A smarter way would be to use a sparse set with skip lists, such that as you use a long you create it and mark it used, but things before it and after it (up to the next long) are marker blocks that can span many ranges. I.e.

{EmptyBlock}[long, long, long]{EmptyBlock}[long]

This way you don’t have to store things you don’t actually set.
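As a rough illustration of the sparse idea, here is a sketch that swaps the skip list for a plain hash map keyed by long index. That is a simpler technique than the one described above (no range blocks, and each entry pays hash map overhead), but it shows the same property: longs only exist once a bit inside them has been set. SparseBitGrid and allocatedWords are hypothetical names.

```scala
import scala.collection.mutable

object SparseSketch {
  class SparseBitGrid(maxX: Int, maxY: Int) {
    // Only longs that contain at least one set bit are stored
    private val words = mutable.HashMap.empty[Long, Long]

    // Map a coordinate to (index of its long, bit inside that long)
    private def locate(x: Int, y: Int): (Long, Int) = {
      val flat = y.toLong * maxX + x
      (flat / 64, (flat % 64).toInt)
    }

    def get(x: Int, y: Int): Boolean = {
      val (word, bit) = locate(x, y)
      ((words.getOrElse(word, 0L) >>> bit) & 1L) == 1L
    }

    def set(x: Int, y: Int, value: Boolean): Unit = {
      val (word, bit) = locate(x, y)
      val current = words.getOrElse(word, 0L)
      val updated = if (value) current | (1L << bit) else current & ~(1L << bit)

      // Drop longs that become empty again so they stop taking up space
      if (updated == 0L) {
        words.remove(word)
      } else {
        words.update(word, updated)
      }
    }

    def allocatedWords: Int = words.size
  }
}
```

For a mostly empty board this stores a handful of entries instead of the full array; for a dense board the flat array wins.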

Anyways, a fun little set of code to write. Full source available on my github

Strongly typed http headers in finatra

When building service architectures one thing you need to solve is how to pass context between services. This is usually stuff like request id’s and other tracing information (maybe you use zipkin) between service calls. This means that if you set request id FooBar123 on an entrypoint to service A, if service A calls service B it should know that the request id is still FooBar123. The bigger challenge is usually making sure that all thread locals keep this around (and across futures/execution contexts), but before you attempt that you need to get it into the system in the first place.

I’m working in finatra these days, and I love this framework. It’s got all the things I loved from dropwizard but in a scala first way. Today’s challenge was that I wanted to be able to pass request http headers around between services in a typesafe way that could be used in thread local request contexts. Basically I want to send

X-Magic-Header someValue

And be able to resolve that into a MagicHeader(value: T) class.

The first attempt is easy, just parse header values into case classes:

case class MagicHeader(value: String)

But the question I have is how do I enforce that the header string X-Magic-Header is directly correlated to the case class MagicHeader?

object MagicHeader { 
   val key = "X-Magic-Header"
}

case class MagicHeader(value: String)

Maybe, but still, when someone sends the value out, they can make a mistake:

setRequestHeader("X-mag1c-whatevzer" -> magicHeader.value)

That sucks, I don’t want that. I want it strictly paired. I’m looking for what is in essence a case class that has 2 fields: key, value, but where the key is fixed. How do I do that?

I like to start with how I want to use something, and then work backwards to how to make that happen. Given that, let’s say we want an api kind of like:

object Experimental {
  val key = "Experimental"

  override type Value = String
}

And I’d like to be able to do something like

val experimentKey = Experimental("experiment abc")
(experimentKey.key -> experimentKey.value) shouldEqual
         ("Experimental" -> "experiment abc")

I know this means I need an apply method somewhere, and I know that I want a tuple of (key, value). I also know that because the second value has a path dependent type, I can do something with that.

Maybe I can fake an apply method to be like

trait ContextKey {
  val key: String

  /**
   * The custom type of this key
   */
  type Value

  /**
   * A tuple of (String, Value)
   */
  type Key = Product2[String, Value]

  def apply(data: Value): Key = new Key {
    override def _1: String = key

    override def _2: Value = data
  }
}

And update my object to be

object Experimental extends ContextKey {
  val key = "Experimental"

  override type Value = String
}

Now my object has a mixin of an apply method that creates an anonymous tuple of type String, Value. You can create instances of Experimental but you can’t ever set the key name itself! However, I can still access the pinned key because the anonymous tuple has it!

But in the case that I wanted, I wanted to use these as http header values. Which means I need to be able to parse a string into a type of ContextKey#Value which is path dependent on the object type.

We can do that by adding now a few extra methods on the ContextKey trait:

trait ContextKeyType[T] extends Product2[String, T] {
  def unparse: String
}

trait ContextKey {
  self =>
  val key: String

  /**
   * The custom type of this key
   */
  type Value

  /**
   * A tuple of (String, Value)
   */
  type Key = ContextKeyType[Value]

  /**
   * Utility to allow the container to provide a mapping from String => Value
   *
   * @param r
   * @return
   */
  def parse(r: String): Value

  def unparse(v: Value): String

  def apply(data: Value): Key = new Key {
    override def _1: String = key

    override def _2: Value = data

    /**
     * Allow a mapping of Value => String
     *
     * @return
     */
    override def unparse: String = self.unparse(data)

    override def equals(obj: scala.Any): Boolean = {
      canEqual(obj)
    }

    override def canEqual(that: Any): Boolean = {
      that != null &&
      that.isInstanceOf[ContextKeyType[_]] &&
      that.asInstanceOf[ContextKeyType[_]]._1 == key &&
      that.asInstanceOf[ContextKeyType[_]]._2 == data
    }
  }
}

This introduces a parse and unparse method which convert things to and from strings. An http header object can now define how to convert it:

object Experimental extends ContextKey {
  val key = "Experimental"
  override type Value = String

  override def parse(value: String): String = value

  override def unparse(value: String): String = value
}

So, if we want to maybe send JSON in a header, or a long/int/uuid we can now parse and unparse that value pre and post wire.

Now let’s add a utility to convert a Map[String, String], which could represent an http header map, into a set of strongly typed context values:

object ContextValue {
  def find[T <: ContextKey](search: T, map: Map[String, String]): Option[T#Value] = {
    map.collectFirst {
      case (key, value) if search.key == key => search.parse(value)
    }
  }
}
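To see parse, unparse and find working together with something other than a String, here is a self-contained sketch. It uses a trimmed-down ContextKey (apply just returns the wire pair rather than the Product2 above), a find that relies on a dependent method type instead of T#Value, and a made-up RetryBudget header, so all of those are assumptions for illustration only.

```scala
object TypedHeaderSketch {
  // Trimmed-down version of the trait above: apply returns the pair that goes on the wire
  trait ContextKey {
    val key: String

    type Value

    def parse(raw: String): Value

    def unparse(value: Value): String

    def apply(value: Value): (String, String) = key -> unparse(value)
  }

  // Hypothetical header carrying an Int; the key is pinned inside the object
  object RetryBudget extends ContextKey {
    val key = "X-Retry-Budget"

    type Value = Int

    // Note: this throws on non-numeric input; a real key may want to handle that
    def parse(raw: String): Int = raw.trim.toInt

    def unparse(value: Int): String = value.toString
  }

  // The result type depends on which key object is passed in
  def find(search: ContextKey, headers: Map[String, String]): Option[search.Value] =
    headers.get(search.key).map(raw => search.parse(raw))
}
```

Calling find with RetryBudget gives back an Option[Int], so the caller never touches the raw header string or the header name.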

Back in finatra land, let’s add an http filter:

case class CurrentRequestContext(
  experimentId: Option[Experimental.Value],
)

object RequestContext {
  private val requestType = Request.Schema.newField[CurrentRequestContext]

  implicit class RequestContextSyntax(request: Request) {
    def context: CurrentRequestContext = request.ctx(requestType)
  }

  private[filters] def set(request: Request): Unit = {
    val data = CurrentRequestContext(
      // headerMap is a mutable map, so copy it into the immutable Map that find expects
      experimentId = ContextValue.find(Experimental, request.headerMap.toMap)
    )

    request.ctx.update(requestType, data)
  }
}

/**
 * Set the remote context from requests 
 */
class RemoteContextFilter extends SimpleFilter[Request, Response] {
  override def apply(request: Request, service: Service[Request, Response]): Future[Response] = {
    RequestContext.set(request)

    service(request)
  }
}

From here on out, we can provide a set of strongly typed values that are basically case classes with hidden keys.

Deployment the paradoxical way

First and foremost, this is all Jake Swenson’s brainchild. But it’s just too cool to not share and write about. Thanks Jake for doing all the hard work :)

At paradoxical, we really like being able to crank out libraries and projects as fast as possible. We hate boilerplate and we hate repetition. Everything should be automated. For a long time we used maven archetypes to crank out services from a template and libraries from a template, and that worked reasonably well. However, deployment was always kind of a manual process. We had scripts in each repo to use the maven release plugin but our build system (Travis) wasn’t wired into it. This meant that deploys of libraries/services required a manual (but simple) step to run. We also had some kinks with our gpg keys, and we weren’t totally sure of a clean way to have Travis sign our artifacts securely without our keys being checked into a bunch of different repos.

Jake and I had talked a while ago about how nice it would be if we could

  • Have all builds to master auto deployed as snapshots
  • PR’s built but not deployed
  • Creating a github release kicked off an actual release

The first two were reasonably easy with the travis scripts we already had, but it was the last one that was fun.

This article was posted not long ago about simplifying your maven release process by chucking the maven release plugin and instead using the maven deploy directly. If you could parameterize your maven artifact version number and have your build pass that in from the git tag, then we could really easily achieve git tag driven development!

To that end, Jake created a git project that facilitated setting up all our repos for tag driven deployment. Each deployable of ours would check out this project as a submodule under .deployment which contains the tooling to make git tag releases happen.

To onboard

First things first, we need a way to delegate deployment after our travis build is complete. So you’d add the following to your project’s travis file:

git:
  submodules: false
before_install:
  # https://git-scm.com/docs/git-submodule#_options:
  # --remote
  # Instead of using the superproject’s recorded SHA-1 to update the submodule,
  # use the status of the submodule’s remote-tracking (branch.<name>.remote) branch (submodule.<name>.branch).
  # --recursive
  # https://github.com/travis-ci/travis-ci/issues/4099
  - git submodule update --init --remote --recursive
after_success:
- ./.deployment/deploy.sh

Which would pull the git deployment submodule, and delegate the after step to its deploy script.

You also need to add the deployment project as a parent of your pom:

<parent>
    <groupId>io.paradoxical</groupId>
    <artifactId>deployment-base-pom</artifactId>
    <version>1.0</version>
</parent>

This sets up nice things for us like making sure we sign our artifacts with GPG, include sources as part of our deployment, and attach javadocs.

The last thing you need to do is parameterize your artifact version field:

<version>1.0${revision}</version>

The parent pom defines revision and will set it to be either the git tag or -SNAPSHOT depending on context.

But, for those of you with strong maven experience, an alarm may fire that you can’t parameterize the version field. To solve that problem, Jake wrote a wonderful maven parameter resolver which lets you white-list which fields need to be resolved before the pom is processed. This solves an issue where a deployed maven pom still contains parameterized values that were only set at build time. Without that, maven has issues resolving transitive dependencies.

Anyways, the base pom handles a lot of nice things :)

The deploy script

Now let’s break down the after build deploy script. Its job is to take the travis encrypted gpg keys (which are also password secured) and decrypt them, and run the right maven release given the git tags.

if [ -n "$TRAVIS_TAG" ]; then
    echo "Deploying release version for tag '${TRAVIS_TAG}'"
    mvn clean deploy --settings "${SCRIPT_DIR}/settings.xml" -DskipTests -P release -Drevision='' "$@"
    exit $?
elif [ "$TRAVIS_BRANCH" = "master" ]; then
    echo "Deploying snapshot version on branch '${TRAVIS_BRANCH}'"
    mvn clean deploy --settings "${SCRIPT_DIR}/settings.xml" -DskipTests -P snapshot "$@"
    exit $?
else
    echo "No deployment running for current settings"
    exit 0
fi

It’s worth noting here a few magic things.

  1. The settings.xml file is provided by the submodule, so it is the same in every repo, and it contains the gpg user along with a parameterized password.
  2. Because the deploy script is invoked from the root of the project, even though the deploy script is in the deployment submodule, it resolves paths from the script execution point (not where the script lives). This is why the script captures its own path and stores it as the $SCRIPT_DIR variable.

Release time!

Now that it’s all set up we can safely merge to master whenever we want to publish a snapshot, and if we want to mark a release as public we just create a git tag for it.

Coproducts and polymorphic functions for safety

I was recently exploring shapeless and a coworker turned me onto the interesting features of coproducts and how they can be used with polymorphic functions.

Frequently when using pattern matching you want to make sure that all cases are exhaustively checked. A non exhaustive pattern match is a runtime exception waiting to happen. As a scala user, I’m all about compile time checking. For classes that I own I can enforce exhaustiveness by creating a sealed trait hierarchy:

sealed trait Base
case class Sub1() extends Base
case class Sub2() extends Base

And if I ever try and match on a Base type I’ll get a compiler warning (that I can fail on) if all the types aren’t matched. This is nice because if I ever add another type, I’ll get a (hopefully) failed build.

But what about the scenario where you don’t own the types?

case class Type1()
case class Type2()
case class Type3()

They’re all completely unrelated. Even worse is how do you create a generic function that accepts an instance of those 3 types but no others? You could always create overloaded methods:

def takesType(item: Type1) = ???
def takesType(item: Type2) = ???
def takesType(item: Type3) = ???

Which works just fine, but what if that type needs to be passed through a few layers of function calls before its actually acted on?

def doStuff(item: Type1) = ... takesType(item)
def doStuff(item: Type2) = ... takesType(item)
def doStuff(item: Type3) = ... takesType(item)

Oh boy, this is a mess. We can’t get around with just using generics with type bounds since there is no unified type for these 3 types. And even worse is if we add another type. We could use an either like Either[Type1, Either[Type2, Either[Type3, Nothing]]]

Which lets us write just one function and then we have to match on the subsets. This is kind of gross too since it’s polluted with a bunch of eithers. Turns out though, that a coproduct is exactly this… a souped up either!
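For comparison, here is what the plain Either encoding looks like using only the standard library. This is a sketch: the helper names (fromType1 and friends, describe) are mine, and I end the nesting at Type3 rather than adding a trailing Nothing. It works, and the match is still checked for exhaustiveness, but the Left/Right wrapping is exactly the noise a coproduct hides.

```scala
object EitherCoproductSketch {
  final case class Type1()
  final case class Type2()
  final case class Type3()

  // One unified type for three unrelated types, built from nested Eithers
  type Items = Either[Type1, Either[Type2, Type3]]

  // Lifting a value means knowing how deep in the nesting its type sits
  def fromType1(t: Type1): Items = Left(t)
  def fromType2(t: Type2): Items = Right(Left(t))
  def fromType3(t: Type3): Items = Right(Right(t))

  // A single function handles all three, at the cost of matching on the nesting
  def describe(item: Items): String = item match {
    case Left(_) => "type1"
    case Right(Left(_)) => "type2"
    case Right(Right(_)) => "type3"
  }
}
```

Adding a fourth type means changing the alias, adding a lifting helper, and re-nesting every pattern, which is the bookkeeping the coproduct syntax takes over.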

Defining

type Items = Type1 :+: Type2 :+: Type3 :+: CNil

(where CNil is the terminator for a coproduct) we now have a unified type for our collection. We can write functions like:

def doStuff(item: Items) = {
  // whatever
  takesType(item)
}

At some point, you need to lift an instance of Type1 etc. into a type of Items, and this can be done by calling Coproduct[Items](instance). This call will fail to compile if the type of the instance is not one of the types in Items. You also are probably going to want to actually do work with the thing, so you need to unbox this souped up either and do stuff with it.

This is where the shapeless PolyN methods come into play.

import shapeless.{:+:, CNil, Coproduct, Poly1}

object Worker {
  type Items = Type1 :+: Type2 :+: Type3 :+: CNil

  object thisIsAMethod extends Poly1 {
    // corresponding def for the data type of the coproduct instance
    implicit def invokedOnType1 = at[Type1](data => data.toString)
    implicit def invokedOnType2 = at[Type2](data => data.toString)
    implicit def invokedOnType3 = at[Type3](data => data.toString)
  }

  def takesItem(item: Items): String = {
    // fold applies whichever case matches the value held by the coproduct
    item.fold(thisIsAMethod)
  }
}

class Provider {
  import Worker.Items

  Worker.takesItem(Coproduct[Items](Type1())) // ok
  Worker.takesItem(Coproduct[Items](WrongType())) // fails to compile
}

The object thisIsAMethod creates a bunch of implicit type dependent functions that are defined at all the elements in the coproduct. If we add another option to our coproduct list, we’ll get a compiler error when we try and use the coproduct against the polymorphic function. This accomplishes the same thing as giving us the exhaustiveness check but it’s an even stronger guarantee as the build will fail.

While it is a lot of hoops to jump through, and can be a little mind bending, I’ve found that coproducts and polymorphic functions are a really nice addition to my scala toolbox. Being able to strongly enforce these kinds of contracts is pretty neat!

Mocking nested objects with mockito

Yes, I know it’s a code smell. But I live in the real world, and sometimes you need to mock nested objects. This is a scenario like:

when(a.b.c.d).thenReturn(e)

The usual pattern here is to create a mock for each object and return the previous mock:

val a = mock[A]
val b = mock[B]
val c = mock[C]
val d = mock[D]

when(a.b).thenReturn(b)
when(b.c).thenReturn(c)
when(c.d).thenReturn(d)

But again, in the real world the signatures are longer, the types are nastier, and it’s never quite so clean. I figured I’d sit down and solve this for myself once and for all and came up with:

import org.junit.runner.RunWith
import org.mockito.Mockito
import org.scalatest.junit.JUnitRunner
import org.scalatest.{FlatSpec, Matchers}

@RunWith(classOf[JUnitRunner])
class Tests extends FlatSpec with Matchers {
  "Mockito" should "proxy nested objects" in {
    val parent = Mocks.mock[Parent]

    Mockito.when(
      parent.
        mock(_.getChild1).
        mock(_.getChild2).
        mock(_.getChild3).
        value.doWork()
    ).thenReturn(3)

    parent.value.getChild1.getChild2.getChild3.doWork() shouldEqual 3
  }
}

class Child3 {
  def doWork(): Int = 0
}

class Child2 {
  def getChild3: Child3 = new Child3
}

class Child1 {
  def getChild2: Child2 = new Child2
}

class Parent {
  def getChild1: Child1 = new Child1
}

As you can see in the full test we can create some mock objects, and reference the call chain via extractor methods.

The actual mocker is really pretty simple, it just looks nasty cause of all the lambdas/manifests. All that’s going on here is a way to pass the next object to a chain and extract it with a method. Then we can create a mock using the manifest and assign that mock to the source object via the lambda.

import org.mockito.Mockito

object Mocks {
  implicit def mock[T](implicit manifest: Manifest[T]) = new RichMockRoot[T]

  class RichMockRoot[T](implicit manifest: Manifest[T]) {
    val value = Mockito.mock[T](manifest.runtimeClass.asInstanceOf[Class[T]])

    def mock[Y](extractor: T => Y)(implicit manifest: Manifest[Y]): RichMock[Y] = {
      new RichMock[T](value, List(value)).mock(extractor)
    }
  }

  class RichMock[T](c: T, prevMocks: List[_]) {
    def mock[Y](extractor: T => Y)(implicit manifest: Manifest[Y]): RichMock[Y] = {
      val m = Mockito.mock[Y](manifest.runtimeClass.asInstanceOf[Class[Y]])

      Mockito.when(extractor(c)).thenReturn(m)

      new RichMock(m, prevMocks ++ List(m))
    }

    def value: T = c

    def mockChain[Y](idx: Int) = prevMocks(idx).asInstanceOf[Y]

    def head[Y] = mockChain[Y](0)
  }
}

The main idea here is just to hide away the whole “make b and have it return c” dance for you. You can even capture all the intermediate mocks in a list (I called it a mock chain) and expose the first element of the list with head. With a little bit of scala manifest magic you can even get around needing to pass Class objects around and can leverage the generic parameter instead (boy, feels almost like .NET!).
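To make the manifest trick concrete, here is a minimal sketch that is separate from the mocker above. It uses ClassTag (a lighter-weight relative of Manifest) and a hypothetical helper named classFor, which is not part of Mockito or of the code above. It only shows how the runtime class can be recovered from the type parameter alone.

```scala
import scala.reflect.ClassTag

object TypeParams {
  // Hypothetical helper: the compiler supplies the ClassTag implicitly,
  // so callers write classFor[Foo] instead of passing classOf[Foo].
  def classFor[T](implicit tag: ClassTag[T]): Class[T] =
    tag.runtimeClass.asInstanceOf[Class[T]]
}
```

TypeParams.classFor[String] hands back the same object as classOf[String], which is the kind of Class value that Mockito.mock wants as its argument.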

Extracting scala method names from objects with macros

I’ve had a soft spot for ASTs ever since I went through the exercise of building my own language. Working in Java I missed the dynamic ability to get compile time information, though I knew it was available as part of the annotation processing pipeline during compilation (which is how lombok works). Scala has something similar in the concept of macros: a way to hook into the compiler, manipulate or inspect the syntax tree, and rewrite or inject whatever you want. It’s a wonderfully elegant system that reminds me of Lisp/Clojure macros.

I ran into a situation (as always) where I really wanted to get the name of a function dynamically. i.e.

class Foo {
   val field: String = "" 
   def method(): Unit = {}
}

val name: String = ??.field // = "field"

In .NET this is pretty easy since at runtime you can create an expression tree which gives you the AST. But I haven’t been in .NET in a while, so off to macros I went!

First off, I found the documentation regarding macros to be lackluster. It was either rudimentary with trivial examples, or the learning curve was steep and I was too lazy to read through all of it. Usually when I encounter scenarios like this I turn to exploratory programming, where I have a unit test that sets up a basic example and I leverage the debugger and intellij live REPL to poke through what I can and can’t do. Time to get set up.

First, I needed to create a new submodule in my multi-module maven project that would contain my macro. The reason is that you can’t use macros in the same compilation unit that they are defined in. You can, however, use macros in the macro module’s tests, since test sources are compiled in a separate pass after the main sources.

That said, debugging macros is harder than normal because you aren’t debugging your running program, you are debugging the actual compiler. I found this blog post which was a life saver, even though it was missing a few minor pieces.

1. Set the main class to scala.tools.nsc.Main
2. Set the VM args to -Dscala.usejavacp=true
3. Set the program arguments to first point to the file containing the macro, then the file to compile that uses the macro:

-cp types.Types macros/src/main/scala/com/devshorts/common/macros/MethodNames.scala config/src/test/scala/config/ConfigProxySpec.scala

Now you can actually debug your macro!

First let me show the test

case class MethodNameTest(field1: Object) {
  def getFoo(arg: Object): Unit = {}
  def getFoo2(arg: Object, arg2: Object): Unit = {}
}

class MethodNamesMacroSpec extends FlatSpec with Matchers {
  "Names macro" should "extract from an function" in {
    methodName[MethodNameTest](_.field1) shouldEqual MethodName("field1")
  }

  it should "extract when the function contains an argument" in {
    methodName[MethodNameTest](_.getFoo(null)) shouldEqual MethodName("getFoo")
  }

  it should "extract when the function contains multiple argument" in {
    methodName[MethodNameTest](_.getFoo2(null, null)) shouldEqual MethodName("getFoo2")
  }

  it should "extract when the method is curried" in {
    methodName[MethodNameTest](m => m.getFoo2 _) shouldEqual MethodName("getFoo2")
  }
}

macro

methodName here is a macro that extracts the method name from a lambda over the given type parameter. What’s nice about how scala set up macros is that the public method is just an alias for the macro implementation, so you can re-use the implementation but type it however you want.

import scala.language.experimental.macros

object MethodNames {
  implicit def methodName[A](extractor: (A) => Any): MethodName = macro methodNamesMacro[A]

  def methodNamesMacro[A: c.WeakTypeTag](c: Context)(extractor: c.Expr[(A) => Any]): c.Expr[MethodName] = {
     ...
   }
}

I’ve made the methodName function take a generic and a function that uses that generic (even though no actual instance is ever passed in). The nice thing about this is I can re-use the macro typed as another function elsewhere. Imagine I want to pin [A] so people don’t have to type it. I can do exactly that!

case class Configuration (foo: String)

implicit def config(extractor: Configuration => Any): MethodName = macro MethodNames.methodNamesMacro[Configuration]

config(_.foo) == MethodName("foo")

At this point it’s time to build the bulk of the macro. The idea is to inspect parts of the AST and potentially walk it to find the pieces we want. Here’s what I ended up with:

def methodNamesMacro[A: c.WeakTypeTag](c: Context)(extractor: c.Expr[(A) => Any]): c.Expr[MethodName] = {
  import c.universe._

  @tailrec
  def resolveFunctionName(f: Function): String = {
    f.body match {
      // the function name
      case t: Select => t.name.decoded

      case t: Function =>
        resolveFunctionName(t)

      // an application of a function and extracting the name
      case t: Apply if t.fun.isInstanceOf[Select] =>
        t.fun.asInstanceOf[Select].name.decoded

      // curried lambda
      case t: Block if t.expr.isInstanceOf[Function] =>
        val func = t.expr.asInstanceOf[Function]

        resolveFunctionName(func)

      case _ => {
        throw new RuntimeException("Unable to resolve function name for expression: " + f.body)
      }
    }
  }

  val name = resolveFunctionName(extractor.tree.asInstanceOf[Function])

  val literal = c.Expr[String](Literal(Constant(name)))

  reify {
    MethodName(literal.splice)
  }
}

For more details on parts of the AST here is a great resource

In the first case, when we pass in methodName[Config](_.method), it gets desugared into a function whose body is x$1.method. The Select node takes the x$1 instance and selects method from it. This is the easy case.

The Block case maps to calls like methodName[Config](c => c.thing _). Here we have a function, but the method is referenced without being applied (the “curried” case in the test), so the compiler eta-expands it: the function body is a block whose inner expression is another function, and the body of that inner function is an Apply.

Apply takes two arguments — a Select or an Ident for the function and a list of arguments

So that makes sense.

The rest is just helper methods to recurse.
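To see the walk without a compiler attached, here is a toy model of it. The case classes below are hypothetical, heavily simplified stand-ins for the real tree nodes in c.universe (they only borrow the names), and resolve mirrors the shape of resolveFunctionName rather than reproducing the macro itself.

```scala
import scala.annotation.tailrec

object ToyAst {
  // Simplified stand-ins for compiler tree nodes; not the real scala.reflect API
  sealed trait Tree
  case class Ident(name: String) extends Tree
  case class Select(qualifier: Tree, name: String) extends Tree
  case class Apply(fun: Tree, args: List[Tree]) extends Tree
  case class Function(params: List[String], body: Tree) extends Tree
  case class Block(stats: List[Tree], expr: Tree) extends Tree

  @tailrec
  def resolve(f: Function): String = f.body match {
    // x$1.field: the name is right there on the Select
    case Select(_, name) => name
    // a function directly inside a function: keep walking
    case inner: Function => resolve(inner)
    // x$1.getFoo(args): the name lives on the Select being applied
    case Apply(Select(_, name), _) => name
    // eta-expanded reference: a Block wrapping another Function
    case Block(_, inner: Function) => resolve(inner)
    case other =>
      throw new IllegalArgumentException(s"Unable to resolve function name for: $other")
  }
}
```

Feeding it Function(List("m"), Block(Nil, Function(List("a", "b"), Apply(Select(Ident("m"), "getFoo2"), List(Ident("a"), Ident("b")))))) takes the Block branch first and then the Apply branch, which is the same two-step path the curried test case follows in the real macro.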

The last piece of the puzzle is to create an instance of a string literal and splice it into a new expression returning the MethodName case class that contains the string literal.

All in all a fun afternoon’s worth of code, and now I get semantic string safety. A great use case here can be to type configuration values or other string semantics with a trait. You can get compile time refactoring + type safety. Other use cases are things like database configurations and drivers (the .NET mongo driver uses expression trees to type an object to its underlying mongo collection).
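As a rough sketch of the configuration use case: everything below is hypothetical (the Settings trait, the ConfigReader class, and the assumption that MethodName wraps a single string field called name). The key is built by hand so the sketch compiles without the macro module on the classpath.

```scala
// Assumed shape of the MethodName case class; the field name is a guess
case class MethodName(name: String)

// Hypothetical settings trait whose members name the configuration keys
trait Settings {
  def timeoutMs: String
}

// Hypothetical reader: keys can only be supplied as MethodName values
class ConfigReader(values: Map[String, String]) {
  def get(key: MethodName): Option[String] = values.get(key.name)
}

object ConfigReaderExample {
  val reader = new ConfigReader(Map("timeoutMs" -> "500"))

  // With the macro available this would read
  // reader.get(methodName[Settings](_.timeoutMs)); the key is built by
  // hand here so the sketch stands alone.
  val timeout: Option[String] = reader.get(MethodName("timeoutMs"))
}
```

The point of routing keys through MethodName is that renaming timeoutMs on the trait would break the macro call at compile time, rather than leaving a stale string behind.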