Matt's Software Blog

15Nov/110

Linq to HPC is dead – being replaced with Windows Azure HPC Scheduler

I recently found out that Linq to HPC has been killed. This was surprising since the Linq to HPC beta was only just released in October (Build conference).

Read about it here HPC team Blog

I find this unfortunate. Azure is not set-up for handling Big Data. Jobs will likely take 100x + longer to run than what could be implement on a purpose built private cluster. Add to the fact that Azure is rather expensive for what it is.

Tagged as: , No Comments
8Nov/110

Silverlight 5 Pivot with Facebook Graph API

Not long ago I decided to test out Silverlight 5 Beta and the new Pivot control. I figured the easiest way to get a bunch of images was to hook up to the Graph API. The result looks pretty cool.

Unfortunately given Microsoft’s recent developments with Windows 8; Silverlight 5 and the Pivot control will most likely be stillborn. As Microsoft is throwing its weight behind tablet applications Silverlight will probably end up being relegated to eking out an existence on enterprise intranets. There are not many enterprises willing to go the extra mile to bring their employees the Silverlight experience. Few have progressed beyond Web 1.0.

By slaughtering WPF and quietly killing off Silverlight, Microsoft is severely limiting the options for building cutting edge enterprise grade UIs. The Metro toolkit is looking woefully inadequate.
For the first time in a long time I am going to bet against Microsoft and focus on Linux. The Big Data / HPC crowd mostly use Linux anyway so it is a more natural fit. I’ll be trading in F#, WPF, and Silverlight for Haskell, C++, and OpenGL. All the while lamenting every ounce of extra work that this entails because Microsoft forgot it was about developers and tried to be Apple.

- Matt

Filed under: Uncategorized No Comments
12Apr/110

K-means clustering in C# LINQ

The standard K-means algorithm is actually quite simple, and surprisingly powerful.

The use of LINQ makes the code more expressive, and in my view, easier to understand than the English and formal mathematical definitions.

The two phases of K-means, assignment and update are performed by the LINQ functions GroupBy and Select respectively.

The iterations are performed with the Scan0 extension function from Rx .net framework.

For this example I’m using the Iris data set. (http://en.wikipedia.org/wiki/Iris_flower_data_set)

It is known ahead of time that there are only three species in the dataset, so only three centers were picked.

The 3 species:

IRIS pair plot

Pair Plot for the Iris dataset

The 3 K-means clusters:

8Apr/110

Comparing distributed processing languages

In the paper “No Silver Bullet – Essence and Accidents of Software Engineering” (Fred Brooks 1986), Brooks argues that most of what software engineers do now is targeted to essential complexity so there is little scope for another order of magnitude developer productivity improvements. i.e. No Silver Bullet

However the use of General Programing Languages for domain specific tasks and the manual mixing of concerns means that accidental complexity still dominates software development. By taking a simple program (word count) and mixing it with domain of distributed computing I intend on demonstrating where accidental complexity sneaks back in and techniques that can be used to reduce it.

7Apr/110

Hadoop and the MapReduce paradigm

Big Data means that it is no longer possible to scale up so it becomes necessary to scale out. MapReduce restricts the way you can process data to a manner which is easy to scale out, and Hadoop is an open source implementation of MapReduce. This post is about optimizing Big Data processing beyond vanilla MapReduce.

There are a few optimizations to the MapReduce paradigm which Hadoop has implemented;

  • Combiners: for when the operation on the data is cumulative and associative. The naive way is to send all output from the Map phase to the Reduce phase. Combiners aggregate the output on from the Mappers to greatly reduce network traffic. This can easily give multiple orders of magnitude performance improvements.
  • Multiple Output: for when splitting data into multiple partitions. The naive approach runs over and applies different filters in the Map phase for each output partition. Hadoops multiple output allows this process to be done in a single phase.
  • Map side joins: when one side of the join is small it is much more efficient to send just this small amount over the network than it is to send both sides over the network into a reduce phase.

And some that Hadoop has not;

  • Reducer count selection: the most obvious heuristic improvement is the selecting of number of Reducers required by a Map Reduce job. Too large of a number and there is wastage, too small of a number and the job may fail. Picking the best number of reducers is a bit of a dark art and a huge time sink for the developers.
  • Merge Sorting Map output: sorting the output of the Mappers and merge sorting in the Reducers will allow for Reducer processing to start sooner, and would require less space.
  • Heterogeneous clusters: tasks which can benefit from GPUs need a way to be sent to computers with GPUs. Hadoop does not differentiate between types of resources and does not resource level accordingly.

The essential problem with Hadoop and the MapReduce paradigm is that it does not keep enough metadata or metrics about job execution to be able to optimize effectively.

For example, most tasks on Hadoop are actually a series of MapReduce jobs which form a Directed Acyclic Graph. The structure of this DAG is codified into the JAR and is not available to the job tracker for optimization. Many hand written jobs are not properly optimized, e.g. combiners, multiple outputs, map side joins, and reducer count selection. If Hadoop had knowledge of the DAG it would be able to automatically look at runtime data and re-write the DAG to make appropriate use of these optimizations.

The difference between naïve implementations and optimized implementations can be many orders of magnitude. Hadoops unfriendly API, and general fragility makes writing such optimized jobs very difficult.

This is what makes Dryad so interesting.

Filed under: Big Data, Hadoop No Comments
25Feb/110

Building a simple Directed WebCrawler in Rx and F#

There are two important concepts portrayed here;

  • The advantage of the Async monad; and
  • The use of the Rx Merge instead of Fork Join

The advantage of the Async monad is that it provides an elegant way of expressing and composing asynchronous operations such that the thread is automatically freed up while the asynchronous request is in progress. Crawlers maintain a large number of open connections as they wait for responses, it would be incredibly inefficient to block the threads as they waited.

In parallel programming, fork join operations take a list of functions, perform them in parallel, and then join the results. The results are only returned when all functions have finished. For a crawler, a single request can take a very long period of time, blocking the join operation. In Rx, the merge operator merges the results into a single stream as they arrive – no blocking involved.

These concepts used together form a very elegant and efficient solution (see code).

Code...

25Feb/111

Plotting the running mean and standard deviation in Rx, F#, and WPF

In this project I examine the use of Rx and F# for creating interactive charts. Such charting would be useful for all kinds of monitoring operations; from the status of a compute clusters to the movement of the stock markets.

F#, Rx offers the following advantages;

  • Compositionality: The compositionality of Rx offers a very intuitive and expressive way for specifying charts (see code).
  • Runtime modifications: The ability to modify behavioral code at runtime allows the users to arbitrarily create new charts to explore the data in real time as new data is coming in.

New functionality;

  • TrailingWithTime(Timespan): Rx already had BufferWithTime and WindowWithTime but I needed a new behavior that I will call ‘trailing with time’. For each value sent in I wanted all the values previously seen in the last time span to be sent back out. For these charts I plotted the values for the last 60 seconds every time a new value was seen.
  • Non-blocking aggregators: Unfortunately Rx mixes blocking functions with non-blocking functions e.g. Sum and Average. This means they cannot be used for this style of interactive charting. So instead I created several non-blocking functions for Mean and Variance using the running Mean and Variance formulas (see code).

Random walk over time

This chart displays a random walk (and associated stats). 10 ticks a second over a period of 60 seconds.

  • Blue line: The random walk
  • Yellow line: The mean for the last 60 seconds
  • Purple lines: The mean (+-) standard deviation for the last 60 seconds
  • Green line: The running mean
  • Red lines: The running mean (+-) the running standard deviations

Code....

26Jan/110

Introducing Fingerprint Comparison Software V3

Some time ago I wrote the 3rd version in a series of Fingerprint Markup programs and made it available commercially at my website at www.forensic-software.com.

The software is used by forensics experts to perform manual comparisons and to create charts for presentation at court. It is in use throughout Australia, and several other countries. Aside from the obvious benefit in productivity users experience reduced OHS issues from no longer leaning over a magnifying glass.

FCS

Traditional vs New

Benefits:

  • By providing a replacement to the manual process of printing out and examining digital images with a magnifying glass.
  • Reduces eye strain associated with magnifying glasses due to the ability to use both eyes. The image is displayed on a high definition monitor at the magnification that suits the operator.
  • The upright position is an ergonomically safe method of examining fingerprints, helping to reduce neck and back injury.
  • Displays images for comparison purposes side by side and is easily adjustable to suit the needs of the operator.
  • Provides the means to mark features and record the observations of the operator for later retrieval. It also features drawing and note tools for reporting purposes.
  • Provides quick and easy means of producing charts for court purposes.

In action:

3rd level detail:

Filed under: Uncategorized No Comments
18Dec/100

WPF Charts as a REST service

WPF is awesome for building high fidelity UI’s and beautiful charts. Traditionally, the only ways to show off WPF is through Silverlight or Client Apps. Client apps are a huge barrier to entry and Silverlight is not yet ubiquitous enough to rely on. So an alternative is needed.

Fortunately WCF now supports REST, and given that WPF can be rendered into a buffer it is possible to expose WPF Charts through REST. By using REST we are playing nice with the rest of the internet. Including a chart becomes as simple as setting the source url on an image.

Caching charts also becomes a lot simpler. Generating charts on the fly for every request can be both expensive and wasteful. News or a broad email can cause a large group of people to check the same chart at the same time. Regenerating the same chart for each request can be enough to cause serious lag right when everyone is looking at it. Exposing the charts through REST means you can use the existing caching mechanisms for free. Careful management of caching will greatly improve the responsiveness of the site.

Here is an Infragistics chart I hosted;

Heat Map by Infragistics

And one I built from scratch;

POC - simple WPF chart

Code:

Filed under: C#, Learning, WPF Continue reading
15Aug/100

Undo / Redo using the Memento Pattern and Command Pattern in F#

This project tackles the problem off managing undo functionality.  It leverages the F# Actor model to implement the Memento pattern, and uses first class functions to implement the Command pattern. This code takes functionality that is ostensibly quite difficult and makes it much easier.

Undo Redo in F#

Undo Redo in F#

[Update]

Don Syme posted a really nice write up on this project: A variation on Matt Moloney's Undo/Redo Memento pattern. Readers of this post must check his out, for both better code and nicer descriptions.

Code...