A few years ago I held a session about MemoryBlocks at the Xojo Developer Conference where I discussed how, generally, MemoryBlocks (and Ptrs) should be avoided except for cases where you must use them, e.g., Declare, or when speed is absolutely critical. I offered this advice because a MemoryBlock can be tedious to work with, and can lead to hard-to-trace bugs.
But when you do need that extra boost, it’s an option to consider, and I recently came across a scenario where it made a huge difference.
The (One Billion Row) Challenge
This came about from a discussion of the “One Billion Row Challenge” on the Xojo Forum, where a programmer is tasked with reading a billion rows of temperature data and consolidating it into statistics for each given city. Our friend Mike D started a project to demonstrate different techniques, and I eventually started my own project. Using MemoryBlock, Ptr, and preemptive threading, I was able to process a billion rows in roughly 8 seconds.
But that’s not the point of this post.
See, in order to process the data, you must first create it, which isn’t as straightforward as it seems.
Creating The Data
Each row of the data file takes the form “City;temp”, where “temp” is a single-place decimal between -99.9 and 99.9. To keep the data set bounded, I arbitrarily used 413 random cities out of a list of all cities. The original code looked something like this, where “bs” represents a BinaryStream:
Var r As New Random
For i As Integer = 1 To rowCount
  Var city As String = cities(r.InRange(0, cities.LastIndex))
  Var temp As Double = r.InRange(-999, 999) / 10.0
  bs.Write city + ";" + temp.ToString("#0.0") + EndOfLine
Next
This was simple, easy, and slow. To generate the full billion rows took about 2.5 hours.
Memory-Unblocking
Upon investigation, I rediscovered what I already knew. Dealing with strings, both in conversion and concatenation, can be a bottleneck. I won’t go through all the iterations here, but nothing I tried made a significant difference. The only solution was to ditch strings entirely.
I started with creating a MemoryBlock “buffer” (“outMB”) of 1 MB with an associated Ptr (“outPtr”). (You can access the contents of a MemoryBlock through its methods, but those are function calls, which have an overhead. Ptr methods are operators that work with the bytes directly so they are faster.) The plan was to fill the buffer as much as I could, write it to the file, then start again at the top of the buffer.
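To make the relationship between the two concrete, here is a minimal sketch (the variable names are illustrative, not from my project) showing that a MemoryBlock and a Ptr assigned from it address the same bytes, so a value written through one can be read back through the other:

```xojo
Var mb As New MemoryBlock(4)
Var p As Ptr = mb // p now points at the first byte of mb

mb.Byte(0) = 72  // written through the MemoryBlock method
p.Byte(1) = 105  // written directly through the Ptr

// Each write is visible through the other accessor
Var viaPtr As Integer = p.Byte(0)     // 72
Var viaMethod As Integer = mb.Byte(1) // 105
```

Keep in mind that a Ptr does no bounds checking, so writing past the end of the buffer can corrupt memory or crash the app. That is part of why the position index has to be tracked so carefully.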
Keeping a position index, I started with writing the city using outMB.StringValue since there is no equivalent Ptr method for this. Next, I plugged in the value of a semicolon with outPtr.Byte(outMBIndex) = 59.
Working with integers is faster than doubles, so I used a little math to plug in the temperature values directly using If statements and outPtr.Byte.
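As a worked example of that math, here is a small sketch that takes the integer 375 (standing in for 37.5), splits it into digits, and stores each digit as its ASCII code by adding 48, the code for “0”. The scratch buffer and the fixed value are only for illustration:

```xojo
Const kDot As Integer = 46
Const kZero As Integer = 48

Var temp As Integer = 375 // stands for 37.5

Var t1 As Integer = temp \ 100          // 3, the tens digit
Var t2 As Integer = (temp \ 10) Mod 10  // 7, the ones digit
Var t3 As Integer = temp Mod 10         // 5, the tenths digit

Var mb As New MemoryBlock(4)
Var p As Ptr = mb
p.Byte(0) = t1 + kZero // 51, "3"
p.Byte(1) = t2 + kZero // 55, "7"
p.Byte(2) = kDot       // 46, "."
p.Byte(3) = t3 + kZero // 53, "5"
// mb now holds the four bytes of "37.5"
```

When the tens digit is zero it is simply skipped, so a value like 5 comes out as “0.5” rather than “00.5”.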
Finally, I used outPtr.Byte(outMBIndex) = 10 to plug in the linefeed (ASCII 10).
The final code looked something like this:
Const kEOL As Integer = 10
Const kHyphen As Integer = 45
Const kDot As Integer = 46
Const kZero As Integer = 48
Const kSemicolon As Integer = 59
Var r As New Random
Var outMB As New MemoryBlock(1000000)
Var outPtr As Ptr = outMB
Var outMBIndex As Integer = 0
For row As Integer = 1 To rows
  Var cityIndex As Integer = r.InRange(0, cities.LastIndex)
  Var city As String = cities(cityIndex)
  Var cityBytes As Integer = city.Bytes
  // Flush the buffer to the file when the next row might not fit
  If (outMBIndex + cityBytes + 10) >= outMB.Size Then
    bs.Write outMB.StringValue(0, outMBIndex)
    outMBIndex = 0
  End If
  // City name, then the semicolon
  outMB.StringValue(outMBIndex, cityBytes) = city
  outMBIndex = outMBIndex + cityBytes
  outPtr.Byte(outMBIndex) = kSemicolon
  outMBIndex = outMBIndex + 1
  // Make roughly one in five temperatures negative
  If r.InRange(0, 4) = 0 Then
    outPtr.Byte(outMBIndex) = kHyphen
    outMBIndex = outMBIndex + 1
  End If
  // Temperature in tenths of a degree, split into digits
  Var temp As Integer = r.InRange(0, 999)
  Var t1 As Integer = temp \ 100
  Var t2 As Integer = (temp \ 10) Mod 10
  Var t3 As Integer = temp Mod 10
  // Tens digit, written only when it is non-zero
  If t1 <> 0 Then
    outPtr.Byte(outMBIndex) = t1 + kZero
    outMBIndex = outMBIndex + 1
  End If
  // Ones digit, decimal point, tenths digit
  outPtr.Byte(outMBIndex) = t2 + kZero
  outMBIndex = outMBIndex + 1
  outPtr.Byte(outMBIndex) = kDot
  outMBIndex = outMBIndex + 1
  outPtr.Byte(outMBIndex) = t3 + kZero
  outMBIndex = outMBIndex + 1
  // Line ending
  outPtr.Byte(outMBIndex) = kEOL
  outMBIndex = outMBIndex + 1
Next
// Write whatever is left in the buffer
If outMBIndex <> 0 Then
  bs.Write outMB.StringValue(0, outMBIndex)
End If
This code is far longer, harder to follow, and difficult to maintain, which goes back to my original point of why MemoryBlock should be avoided.
It also generates one billion rows of data in about a minute (as opposed to 2.5 hours).
It’s nice to have the option.
Kem Tekinay is a Mac consultant and programmer who has been using Xojo since its first release to create custom solutions for clients. He is the author of the popular utilities TFTP Client and RegExRX (both written with Xojo) and lives in Connecticut with his wife Lisa, and their cat.