Strongly typed powershell csv parser

Somehow I missed the powershell boat. I’ve been a .NET developer for years and I trudged through using the boring old cmd terminal, frequently mumbling about how much I missed zsh. But something snapped and I decided to really dive into powershell and learn why those who use it really love the hell out of it. After realizing that the reason everyone loves it is because everything is strongly typed and you can use .NET in your shell I was totally sold.

My first forays into powershell included customizing the shell environment. First I got conemu and made it look nice and pretty. Next was to get an ls highlighting module, since I love that about unix shells.

I set up a few fun aliases in my profile and felt ready to conquer the world! My next experiment was to try and create an actual binary cmdlet. I figured, what better way than to create a csv reader. Now, I realize there is already an Import-Csv cmdlet that types your code, but I figured I’d write one from scratch, since apparently that’s what I tend to do (instead of inventing anything new).

My hope was to make it so that it would emit strongly typed objects (which it does), but forwarning, you don’t get intellisense on it in the shell. This is due to the fact that types are generated at runtime and not compile time.

For the lazy, here is a link to the github.

The Plan

At first I thought I’d just wrap the F# csv type provider, but I realized that the type provider needs a template to generate its internal data classes. That won’t do here because the cmdlet needs to accept any arbitrary csv file and strongly type at runtime.

To solve that, I figured I could leverage the F# data csv library which would do the actual csv parsing, and then emit runtime bytecode to create data classes representing the header values.

As emitting bytecode is a pain in the ass, I wanted to keep my data classes simple. If I had a csv like:

Name,Age,Title
Anton,30,Sr Engineer
Faisal,30,Sr Engineer

Then I wanted to emit a class like

public class Whatever{
    public String Name;
    public String Age;
    public String Title;

    public Whatever(String name, String age, String title){ 
       Name = name;
       Age = age;
       Title = title;
    }
}

Since that would be the bare minimum that powershell would need to display the type.

Emitting bytecode

First, lets look at the final result of what we need. The best way to do this is to create a sample type in an assembly and then to use Ildasm (an IL disassembler) to view the bytecode. For example, the following class

using System;

namespace Sample
{
    public class Class1
    {
        public String foo;
        public String bar;

        public Class1(String f, String b)
        {
            foo = f;
            bar = b;
        }
    }
}

Decompiles into this:

.method public hidebysig specialname rtspecialname 
        instance void  .ctor(string f,
                             string b) cil managed
{
  // Code size       24 (0x18)
  .maxstack  8
  IL_0000:  ldarg.0
  IL_0001:  call       instance void [mscorlib]System.Object::.ctor()
  IL_0006:  nop
  IL_0007:  nop
  IL_0008:  ldarg.0
  IL_0009:  ldarg.1
  IL_000a:  stfld      string Sample.Class1::foo
  IL_000f:  ldarg.0
  IL_0010:  ldarg.2
  IL_0011:  stfld      string Sample.Class1::bar
  IL_0016:  nop
  IL_0017:  ret
} // end of method Class1::.ctor

While I didn’t just divine how to write bytecode by looking at the IL (I followed some other blog posts), when I got an “invalid bytecode” CLR runtime error, it was nice to be able to compare what I was emitting which what I expected to emit. This way simple errors (like forgetting to load something on the stack) became pretty apparent.

To emit the proper bytecode, we need a few boilerplate items: an assembly, a type builder, an assembly builder, a module builder, and a field builder. These are responsible for the metadata you need to finally emit your built type.

let private assemblyName = new AssemblyName("Dynamics")

let private assemblyBuilder = AppDomain.CurrentDomain.DefineDynamicAssembly(assemblyName, AssemblyBuilderAccess.RunAndSave)

let private moduleBuilder = assemblyBuilder.DefineDynamicModule(assemblyName.Name, assemblyName.Name + ".dll")

let private typeBuilder typeName = moduleBuilder.DefineType(typeName, TypeAttributes.Public)

let private fieldBuilder (typeBuilder:TypeBuilder) name fieldType : FieldBuilder = 
    typeBuilder.DefineField(name, fieldType, FieldAttributes.Public)

let private createConstructor (typeBuilder:TypeBuilder) typeList =
    typeBuilder.DefineConstructor(MethodAttributes.Public, CallingConventions.Standard, typeList |> List.toArray)

None of this is really all that interesting and hopefully is self explanatory.

The fieldBuilder is important since that will let us declare our local fields. In fact, once we’ve declared our local fields using the builder, the only bytecode we have to emit is the constructor (which accepts arguments and instantiates fields in them).

Here is the necessary code to build such a constructor.

let private callDefaultConstructor (gen: ILGenerator) = 
    let objType = typeof<obj>
    gen.Emit(OpCodes.Call, objType.GetConstructor(Type.EmptyTypes))
    gen.Emit(OpCodes.Ldarg_0)

let private loadThis (gen: ILGenerator) = 
    gen.Emit(OpCodes.Ldarg_0)
    gen

let private emitNewInstanceRef (gen : ILGenerator) =
    gen |> loadThis |> callDefaultConstructor

let private assignField (argIndex : int) (field : FieldBuilder) (gen : ILGenerator) =     
    gen.Emit(OpCodes.Ldarg, argIndex)
    gen.Emit(OpCodes.Stfld, field)
    gen

let private loadConstructorArg (gen : ILGenerator) ((num, field) : int * FieldBuilder) = 
    gen |> loadThis |> assignField num field

let private completeConsructor (gen : ILGenerator) = gen.Emit(OpCodes.Ret)

let private build (fields : FieldBuilder list) (cons : ConstructorBuilder) = 
    let generator = cons.GetILGenerator()

    generator |> emitNewInstanceRef

    let fieldsWithIndexes = fields |> List.zip [1..(List.length fields)]

    fieldsWithIndexes
        |> List.map (loadConstructorArg generator)
        |> ignore

    generator |> completeConsructor

A few points of interest.

  • Calls that make reference to OpCodes.Ldarp_0 are loading the “this” object to work on.
  • OpCodes.Stdfld sets the passed in field to the value previously pushed on the stack.
  • Opcodes.Ldarg with the index passed to it is a dynamic way of saying “load argument X onto the stack”

The final piece of the block is to tie it all together. Create field instances, take the target types and create a constructor, then return the type.

type FieldName = string
type TypeName = string

let make (name : TypeName) (types : (FieldName * Type) list)= 
    let typeBuilder = typeBuilder name
    let fieldBuilder = fieldBuilder typeBuilder
    let createConstructor = createConstructor typeBuilder        
    let fields = types |> List.map (fun (name, ``type``) -> fieldBuilder name ``type``)
    let definedConstructor = types |> List.map snd |> createConstructor

    
    definedConstructor |> build fields

    typeBuilder.CreateType()

Instantiating your type

Lets say we have a record that describes a field, its type, and a target value

type DynamicField = {
    Name : String;
    Type : Type;
    Value: obj;
}

Then we can easily instantiate a target type with

let instantiate (typeName : TypeName) (objInfo : DynamicField list) =
    let values = objInfo |> List.map (fun i -> i.Value) |> List.toArray
    let types  = objInfo |> List.map (fun i -> (i.Name, i.Type))

    let t = make typeName types

    Activator.CreateInstance(t, values)

It’s important to note that values is an obj []. Because its an object array we can pass it to the activates overloaded function that wants a params obj[] and so it’ll treat each object in the object array as another argument to the constructor.

Dynamic static typing of CSV’s

Since there is a way to dynamically create classes at runtime, it should be easy for us to leverage this to do the csv strong typing. In fact, the entire reader is this and emits to you a list of strongly typed entries:

open System
open System.Reflection
open System.IO
open DataEmitter
open FSharp.Data.Csv

module CsvReader = 
    let rand = System.Random()

    let randomName() = rand.Next (0, 999999) |> string

    let defaultHeaders size = [0..size] |> List.map (fun i -> "Unknown Header " + (string i))

    let load (stream : Stream) = 
        let csv = CsvFile.Load(stream).Cache()

        let headers = match csv.Headers with 
                        | Some(h) -> h |> Array.toList
                        | None -> csv.NumberOfColumns |> defaultHeaders

        let fields = headers |> List.map (fun fieldName -> (fieldName, typeof<string>))

        let typeData = make (randomName()) fields

        [
            for item in csv.Data do       
                let paramsArr = item.Columns |> Array.map (fun i -> i :> obj)
                yield Activator.CreateInstance(typeData, paramsArr)         
        ]

The randomName() is a silly workaround to make sure I don’t create the same Type in an assembly. Each time you run the csv reader it’ll create a new random type representing that csv’s data. I could maybe have optimized this that if someone calls in for a type with the same list of headers that another type had then to re-use that type instead of creating a duplicate, oh well.

Using the reader from the cmdlet

Like I mentioned in the beginning, there is a major flaw here. The issue is that since my types are generated at runtime (which was really fun to do), it doesn’t help me at all. Cmdlet’s need to expose their output types via an OutputType attribute, and since its an attribute I can’t expose the type dynamically.

Either way, here is the entire csv cmdlet

namespace CsvHandler

open DataEmitter
open System.Management.Automation
open System.Reflection
open System
open System.IO

[<Cmdlet("Read", "Csv")>]
type CsvParser() =
    inherit PSCmdlet()
    
    [<Parameter(Position = 0)>]
    member val File : string = null with get, set

    override this.ProcessRecord() = 
        let (fileNames, _) = this.GetResolvedProviderPathFromPSPath this.File
             
        for file in fileNames do   
            use fileStream = File.OpenRead file

            fileStream 
                |> CsvReader.load 
                |> List.toArray 
                |> this.WriteObject

This reads an implicit file name (or file with wildcards) and leverages the inherited PsCmdlet class to resolve the path from the passed in file (or expand any splat’d files like some*). All we do now is pass each file stream to the reader, convert to an array, and pass it to the next item in the powershell pipe.

See it in action

Maybe this whole exercise was overkill, but let’s finish it out anyways. Let’s say we have a csv like this:

Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00

We can do the following

output1

And filter on items

filter

Cleanup

After getting draft one done, I thought about the handling of the IL generator in the Data Emitter. There are two things I wanted to accomplish:

1. Clean up having to seed the generator reference to all the functions
2. Clean up passing an auto incremented index to the field initializer

After some mulling I realized that implementing a computation expression to handle the seeded state would be perfect for both scenarios. We can create an IlBuilder computation expression that will hold onto the reference of the generator and pass it to any function that uses do! syntax. We can do the same for the auto incremented index with a different builder. Let me show you the final result and then the builders:

let private build (fields : FieldBuilder list) (cons : ConstructorBuilder) =
    let generator = cons.GetILGenerator()

    let ilBuilder = new ILGenBuilder(generator)

    let forNextIndex = new IncrementingCounterBuilder()

    ilBuilder {   
        do! loadThis
        do! callDefaultConstructor
        do! loadThis

        for field in fields do
            do! loadThis
            do! forNextIndex { return loadArgToStack }
            do! field |> setFieldFromStack  
            
        do! emitReturn
    }

And both builders:

(* encapsulates an incrementable index *)
type IncrementingCounterBuilder () = 
    let mutable start = 0
    member this.Return(expr) = 
        start <- start + 1            
        expr start 
          
(* Handles automatically passing the il generator through the requested calls *)
type ILGenBuilder (gen: ILGenerator) = 
    member this.Bind(expr, func)= 
        expr gen
        func () |> ignore

    member this.Return(v) = ()
    member this.Zero () = ()
    member this.For(col, func) = for item in col do func item
    member this.Combine expr1 expr2 = ()
    member this.Delay expr = expr()

Now all mutability and state is contained in the expression. I think this is a much cleaner implementation and the functions I used in the builder workflow didn’t have to have their function signatures changed!

Conclusion

Sometimes you just jump in and don’t realize the end goal won’t work, but I did learn a whole lot figuring this out so the time wasn’t wasted.

Check out full source at my github.

One comment

  1. Pingback: F# Weekly #12, 2014 | Sergey Tihon's Blog

Post a comment

You may use the following HTML:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>