Saturday, 7 November 2009

OpenCL access to both GPUs on MBP

I've been tinkering with OpenCL for a few weeks now, primarily on my custom Java wrappers. One of the first tests I ran was to dump out the OpenCL platform info for my 15" MacBook Pro.

After being disappointed that Snow Leopard wasn't going to give hybrid-SLI access to the NVIDIA GeForce 9400M and 9600M GT, I gave up on the prospect of using both devices simultaneously.

But...

When running the platform query tests I got a pleasant surprise. Take a look at the following output:

Platform[0]
Name:       Apple
Profile:    FULL_PROFILE
Vendor:     Apple
Version:    OpenCL 1.0 (Jul 15 2009 23:07:32)
Extensions:
Device[0]:
  VendorID:           0x1022600
  Type:               CL_GPU
  MaxComputeUnits:    32
  MaxWorkItemDims:    3
  MaxWorkItemSize[0]: 512
  MaxWorkItemSize[1]: 512
  MaxWorkItemSize[2]: 64
  MaxWorkGroupSize:   512
Device[1]:
  VendorID:           0x2022600
  Type:               CL_GPU
  MaxComputeUnits:    16
  MaxWorkItemDims:    3
  MaxWorkItemSize[0]: 512
  MaxWorkItemSize[1]: 512
  MaxWorkItemSize[2]: 64
  MaxWorkGroupSize:   512
Device[2]:
  VendorID:           0x1020400
  Type:               CL_CPU
  MaxComputeUnits:    2
  MaxWorkItemDims:    3
  MaxWorkItemSize[0]: 1
  MaxWorkItemSize[1]: 1
  MaxWorkItemSize[2]: 1
  MaxWorkGroupSize:   1
This shows there are two GPU devices: one with 32 compute units (GeForce 9600M GT) and one with 16 compute units (GeForce 9400M), giving a total of 48 compute units. That's not bad for a laptop.

Initially I thought that I would need to activate the "Performance Graphics" setting to use the 9600M for OpenCL. But after running the above test I switched the graphics to the 9400M. This means that the 32 cores, 512MB VRAM, plus the full PCIe (x16) bus of the 9600M are totally dedicated to OpenCL processing... very cool!

Monday, 31 August 2009

IntelliJ and Snow Leopard

Just wondering if anyone else on the interweb is experiencing regular crashes with IntelliJ IDEA 8.x? I'm sure it has to do with Apple dropping all 1.5 JDKs in favour of 1.6 in Snow Leopard. I call it pretty brave (others may think of less complimentary words) to just replace all 32-bit JDK 1.5 versions with symlinks to the 64-bit 1.6 JDK. Clearly they don't do much Java development at Apple :-P W.O.R.A. uurrmm.. yeah doesn't quite work like that...

Update: Looks like it was because I was using the Nimbus L&F, which in hindsight was probably not a good idea as the native OS X L&F is much better.

Sunday, 19 July 2009

Keeping your MacBook Pro cool

A few weeks ago the London news were reporting that we were in a heatwave with temperatures around 30'C (86'F). Well as an Australian living in London, there was at least one person actually enjoying the "heatwave". However, the same can't be said for my 15" uni-body MacBook Pro, which was overheating with CPU and GPU temps over 100'C (212'F) under heavy compute loads.

The MBP gets hottest in the top-right corner, around the power adapter, and is coolest in the opposite corner. On one occasion it was too hot to touch for longer than a couple of seconds, after playing Left 4 Dead under Boot Camp for an hour using the hotter NVIDIA 9600M GPU.

It seems that using the 9600M GPU for everyday computing also kept the MBP hot, on occasion hitting 90'C (194'F) for non-graphics-intensive tasks. So switching to the cooler 9400M for everyday usage seemed like a good idea. You can do this by changing System Prefs -> Energy Saver -> Graphics from High Performance to Better Battery Life.

There's more you can do to keep your MBP cool. After much googling for laptop coolers, I bought a Cooler Master Notepal W1. This is a simple, cleanly styled laptop cooler that does a great job of passively cooling my laptop, and is backed up by two near-silent USB-powered fans should you need them. Even with the fans off, the heat is dissipated over the entire stand as it's made of solid aluminium, and it seems to keep temperatures a little more consistent.

The Notepal fits a 15" laptop fairly well with about an inch of overhang on either side, and its styling near enough matches the MBP. And, although I didn't really do any scientific analysis, the Notepal passes the "it feels cooler" test.

Sunday, 5 July 2009

Perf Notes - Conflation Concurrency Perf Analysis

When I first heard the word conflation, my first thought was, errrm is that word made up? But apparently it's legit, and in the context I heard it the term related to maintaining an up-to-date and coherent view of data for financial securities (such as stocks or indexes).

The following is a short and direct definition from Wiktionary:
"A blend or fusion, esp. a composite reading or text formed by combining the material of two or more texts into a single text."
One way to think about conflation is, given a continuous feed of pricing information from an upstream source, it maintains up-to-date and consistent stock price data. This includes numbers such as: bid, ask, last (traded) and (traded) volume.

VERSION  BID   ASK  LAST  VOLUME
1        20    25   24.5  100000
2        22.5  25   24.5  100000
3        22.5  26   25    200000
4        24    26   25    200000


In the above table the bold numbers indicate an updated value, and the green rows indicate a single get from the client application. Conflating fuses all the previous updates into a single consistent set for each requested version, while discarding the historic versions. This kind of thing is also called baselining, or snapshotting.
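
As a rough illustration of the idea (a minimal sketch only, not code from the test application), the four versions in the table can be replayed through a tiny conflator that keeps just the latest value of each field. The names PriceSnapshot and Conflator are hypothetical, and passing null to mean "field unchanged" is an assumption made for brevity.

```java
// Hypothetical names for illustration; these are not the classes from the perf app.
final class PriceSnapshot {
    final int version;
    final double bid;
    final double ask;
    final double last;
    final long volume;

    PriceSnapshot(int version, double bid, double ask, double last, long volume) {
        this.version = version;
        this.bid = bid;
        this.ask = ask;
        this.last = last;
        this.volume = volume;
    }
}

final class Conflator {
    private int version;
    private double bid;
    private double ask;
    private double last;
    private long volume;

    // Apply one upstream update; a null argument means "field unchanged" (an assumption).
    synchronized void update(Double newBid, Double newAsk, Double newLast, Long newVolume) {
        if (newBid != null) bid = newBid;
        if (newAsk != null) ask = newAsk;
        if (newLast != null) last = newLast;
        if (newVolume != null) volume = newVolume;
        version++;
    }

    // One client get: a consistent snapshot of the latest values; history is discarded.
    synchronized PriceSnapshot get() {
        return new PriceSnapshot(version, bid, ask, last, volume);
    }
}

class ConflationDemo {
    public static void main(String[] args) {
        Conflator c = new Conflator();
        c.update(20.0, 25.0, 24.5, 100000L);   // version 1
        c.update(22.5, null, null, null);      // version 2: bid only
        c.update(null, 26.0, 25.0, 200000L);   // version 3: ask, last, volume
        c.update(24.0, null, null, null);      // version 4: bid only
        PriceSnapshot s = c.get();
        System.out.println(s.version + " " + s.bid + " " + s.ask + " " + s.last + " " + s.volume);
    }
}
```

A get after the fourth update returns the values of version 4 in the table, regardless of how many intermediate versions the client never asked for.
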

Conflation has a number of performance considerations:
  • How to model updatable price information?
  • How to efficiently store all pricing information?
  • How to manage concurrent reads and writes of the pricing store?
To help answer the above questions I've produced a simple test application. If you're really keen you can check out the source from SVN here, or browse it here.

The perf app has the following objects:
  • Price Object: This needs to store the price information, and should be immutable when accessed by the client app.
  • Price Store: Maintains an internal collection of the Price objects, and supports updates and gets.
To test a wide range of scenarios the following combinations were implemented and tested:
  • PriceStore internal collection
    • Plain Java HashMap
    • Trove Primitive (Long) Map
  • PriceStore concurrency
    • Unsynchronized Get/Update (Used only as a control test, not a valid scenario)
    • Synchronized Get/Update
    • Read-Write-Lock Get/Update
  • PriceStore Price object management
    • Immutable price object store with New-On-Update.
    • Mutable price store with Copy-On-Read (get).
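
To make one of these combinations concrete, here is a minimal sketch of the "Read-Write-Lock Get/Update" plus "Copy-On-Read" variant on a plain HashMap. It is an assumption about how such a store could look, not the source of the perf app; MutablePrice and RwLockCopyOnReadStore are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical mutable price held inside the store and never handed out directly.
final class MutablePrice {
    double bid;
    double ask;
    double last;
    long volume;

    MutablePrice copy() {
        MutablePrice c = new MutablePrice();
        c.bid = bid;
        c.ask = ask;
        c.last = last;
        c.volume = volume;
        return c;
    }
}

final class RwLockCopyOnReadStore {
    private final Map<Long, MutablePrice> prices = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Update mutates the stored object in place under the exclusive write lock.
    void update(long id, double bid, double ask, double last, long volume) {
        lock.writeLock().lock();
        try {
            MutablePrice p = prices.get(id);
            if (p == null) {
                p = new MutablePrice();
                prices.put(id, p);
            }
            p.bid = bid;
            p.ask = ask;
            p.last = last;
            p.volume = volume;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Get takes the shared read lock and returns a private copy (Copy-On-Read),
    // so the caller never observes a later update through the returned object.
    MutablePrice get(long id) {
        lock.readLock().lock();
        try {
            MutablePrice p = prices.get(id);
            return p == null ? null : p.copy();
        } finally {
            lock.readLock().unlock();
        }
    }
}

class StoreDemo {
    public static void main(String[] args) {
        RwLockCopyOnReadStore store = new RwLockCopyOnReadStore();
        store.update(1L, 20.0, 25.0, 24.5, 100000L);
        MutablePrice seen = store.get(1L);
        store.update(1L, 22.5, 25.0, 24.5, 100000L);
        System.out.println(seen.bid + " then " + store.get(1L).bid);
    }
}
```

The New-On-Update variant would instead store an immutable price, replace it with a fresh instance on every update, and return the stored reference directly from get. The Synchronized variant would swap the two lock calls for synchronized methods.
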
The steps implemented in the performance app:
  1. Pre-populate the PriceStore with all Prices.
    • Out in the wild this type of application is always on, and always full.
    • Not included in timings.
  2. Create Writer threads.
    • Write all prices to Price Store in bursts with small lag.
    • Timed
  3. Create Read threads.
    • Read all prices from Price Store in bursts with small lag.
    • Timed
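
The steps above can be sketched roughly as follows. This is an illustrative harness only: BurstHarness and its stand-in SyncStore are hypothetical names, the store holds a single double per price, all threads are timed together rather than separately, and it assumes each writer and each reader walks every price once per loop in sequential order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.IntConsumer;

final class BurstHarness {
    // Stand-in store (an assumption): synchronized access to one value per price id.
    static final class SyncStore {
        private final double[] last;
        SyncStore(int n) { last = new double[n]; }
        synchronized void update(int id, double v) { last[id] = v; }
        synchronized double get(int id) { return last[id]; }
    }

    final AtomicLong updates = new AtomicLong();
    final AtomicLong gets = new AtomicLong();

    // Returns the elapsed wall-clock time in nanoseconds for the timed part.
    long run(int priceCount, int burstCount,
             int writers, int writerLoops, long writeLagMs,
             int readers, int readerLoops, long readLagMs) {
        final SyncStore store = new SyncStore(priceCount);
        for (int i = 0; i < priceCount; i++) store.update(i, i);   // step 1, not timed
        final int burstSize = Math.max(1, priceCount / burstCount);

        List<Thread> threads = new ArrayList<>();
        for (int w = 0; w < writers; w++) {                        // step 2: writers
            threads.add(new Thread(() -> bursts(priceCount, burstSize, writerLoops, writeLagMs,
                    id -> { store.update(id, id + 0.5); updates.incrementAndGet(); })));
        }
        for (int r = 0; r < readers; r++) {                        // step 3: readers
            threads.add(new Thread(() -> bursts(priceCount, burstSize, readerLoops, readLagMs,
                    id -> { store.get(id); gets.incrementAndGet(); })));
        }

        long start = System.nanoTime();
        for (Thread t : threads) t.start();
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return System.nanoTime() - start;
    }

    // Touch every price once per loop, sleeping for lagMs after each burst.
    private static void bursts(int priceCount, int burstSize, int loops, long lagMs, IntConsumer op) {
        try {
            for (int loop = 0; loop < loops; loop++) {
                for (int id = 0; id < priceCount; id++) {
                    op.accept(id);
                    if ((id + 1) % burstSize == 0) Thread.sleep(lagMs);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        BurstHarness h = new BurstHarness();
        long nanos = h.run(1000, 4, 2, 3, 1L, 4, 5, 1L);           // scaled-down run
        System.out.println(h.updates.get() + " updates, " + h.gets.get() + " gets in " + nanos + " ns");
    }
}
```

Under these assumptions, the parameters listed below would correspond to a call like run(1_000_000, 4, 10, 3, 10L, 100, 5, 5L), where the burst size of 250,000 is simply the price count divided by the burst count.
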
The following parameters were used to setup the app runs:
Parameter                 Value        Description
Price Count               1,000,000    Number of distinct prices stored.
Burst Count (Burst Size)  4 (250,000)  Used to simulate bursts of price updates/gets.
Writer Count              10           Number of writer threads.
Write Burst Lag           10 millis    The delay between each update burst.
Per Writer Loops          3            Number of times each writer updates all prices.
Reader Count              100          Number of reader threads.
Read Burst Lag            5 millis     The delay between each get burst.
Per Reader Loops          5            Number of times all readers get all prices.
The results were produced on my MacBook Pro Unibody 15", with an Intel Core 2 Duo 2.8GHz, with 4GB RAM, running OS X 10.5.7. The Java VM used was the latest Apple JDK 1.6.0_13 64-bit. The perf app was run with the following JVM args: -Xms512M -Xmx512M. Note also the Apple JDK6 only supports the -server JVM, and no further GC tweaks were made.

The test application was run in two passes: the ORDERED run, where all prices are accessed in sequential order; and the SHUFFLED run, where all prices are accessed in random order.

This chart shows the CPU and Heap usage for the consecutive runs of the different configurations, for the ORDERED access of price objects.
Ordered Access Results
Name                Duration (Seconds)  Description
PLAIN               47.13               Unsynchronized, HashMap, New-On-Update, Control Test Only
SYNC_MAP            152.71              Synchronized, HashMap, New-On-Update
RWLOCK_MAP          74.60               Read/Write Lock, HashMap, New-On-Update
SYNC_MAP_ALT        149.24              Synchronized, HashMap, Copy-On-Read
RWLOCK_MAP_ALT      64.36               Read/Write Lock, HashMap, Copy-On-Read
SYNC_LONGMAP        157.25              Synchronized, Trove Long Map, New-On-Update
RWLOCK_LONGMAP      87.20               Read/Write Lock, Trove Long Map, New-On-Update
SYNC_LONGMAP_ALT    179.10              Synchronized, Trove Long Map, Copy-On-Read
RWLOCK_LONGMAP_ALT  91.83               Read/Write Lock, Trove Long Map, Copy-On-Read


This chart shows the CPU and Heap usage for the consecutive runs of the different configurations, for the SHUFFLED access of price objects.

Shuffled Access Results
Name                Duration (Seconds)  Description
PLAIN               138.95              Unsynchronized, HashMap, New-On-Update, Control Test Only
SYNC_MAP            489.97              Synchronized, HashMap, New-On-Update
RWLOCK_MAP          174.79              Read/Write Lock, HashMap, New-On-Update
SYNC_MAP_ALT        441.71              Synchronized, HashMap, Copy-On-Read
RWLOCK_MAP_ALT      159.02              Read/Write Lock, HashMap, Copy-On-Read
SYNC_LONGMAP        327.84              Synchronized, Trove Long Map, New-On-Update
RWLOCK_LONGMAP      122.76              Read/Write Lock, Trove Long Map, New-On-Update
SYNC_LONGMAP_ALT    333.35              Synchronized, Trove Long Map, Copy-On-Read
RWLOCK_LONGMAP_ALT  126.38              Read/Write Lock, Trove Long Map, Copy-On-Read


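From the results above, we can see that the RWLOCK_MAP_ALT configuration gives the best result for the ORDERED access of elements (64.36 seconds), and the two Read/Write Lock Trove configs give the best results for SHUFFLED access, with RWLOCK_LONGMAP (122.76 seconds) narrowly ahead of RWLOCK_LONGMAP_ALT (126.38 seconds).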

The clear winner is the Read/Write Lock in favour of synchronized methods. The Copy-On-Read pattern also beats the New-On-Update immutable object with the HashMap, although with the Trove Long Map New-On-Update comes out ahead. I expected the Trove Long Map to perform best in both cases, but given the ORDERED access is only an academic exercise, the fact that it performs best on SHUFFLED (most likely real-world usage) is a good thing.
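
To put a number on the lock comparison, the sketch below divides each Synchronized duration by its Read/Write Lock counterpart, using only the durations from the two results tables. LockSpeedup is a hypothetical helper name.

```java
final class LockSpeedup {
    static final String[] NAMES = {"MAP", "MAP_ALT", "LONGMAP", "LONGMAP_ALT"};

    // Durations in seconds copied from the results tables: {SYNC_x, RWLOCK_x}.
    static final double[][] ORDERED = {
        {152.71, 74.60}, {149.24, 64.36}, {157.25, 87.20}, {179.10, 91.83}
    };
    static final double[][] SHUFFLED = {
        {489.97, 174.79}, {441.71, 159.02}, {327.84, 122.76}, {333.35, 126.38}
    };

    // How many times faster the Read/Write Lock run was than the Synchronized run.
    static double speedup(double[] pair) {
        return pair[0] / pair[1];
    }

    public static void main(String[] args) {
        for (int i = 0; i < NAMES.length; i++) {
            System.out.printf("%-12s ordered %.2fx  shuffled %.2fx%n",
                    NAMES[i], speedup(ORDERED[i]), speedup(SHUFFLED[i]));
        }
    }
}
```

On these figures the Read/Write Lock runs come out roughly 1.8x to 2.3x faster for ORDERED access and roughly 2.6x to 2.8x faster for SHUFFLED access.
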

There are some curious patterns in the Heap Usage for both runs. In particular I expected the SYNC_MAP and RWLOCK_MAP heaps to have the same saw-tooth pattern, but RWLOCK_MAP back-loads the Heap/GC activity. Usage of the Trove Long Map offers not only better random access time, but also lower heap utilisation across the board.

Future investigation includes running the test on JDK 1.6.0_14 with the new G1 GC to see what impact this has, as well as trying other GC configurations, concurrency patterns, price object designs and distributed data access (Oracle Coherence/Terracotta/Other...).

I hope you found this analysis interesting.

Tuesday, 30 June 2009

64bit Cocoa Galileo Keyboard Problem Fixed (kinda)

I refused to give in to my keyboard problems I posted about yesterday and it looks like a bit of perseverance has paid off.

My problem was to do with the Microsoft Keyboard Mac driver (no surprises there). Specifically the keyboard is a PC keyboard, and it also has a British layout. This is significant because the Mac British keyboard layout is fairly different, and although it worked with Galileo the keys were all mapped incorrectly.

In the image below you can see the Microsoft British layout. With this selected, neither the USB keyboard nor my MBP keyboard worked in Galileo. So I searched around and came across a helpful fellow who produces a home-brew British PC Keyboard layout (also seen below). With this installed I have British keys, and 64-bit Galileo, yay!