I Second That Emotion
So Tim Bray finds out that Erlang IO is slow. I can attest to this: my recent work on reading large files in Erlang has shown that IO and string manipulation are much slower than I would have wanted.
Yes, like Bray, my file reading is single-threaded (although what I do with each line is very multi-threaded), so I suppose using a single thread for Erlang isn’t very Erlang-like in the first place.
In the meantime, I’m porting my OLAP cube generator to Scala. The assumption (and shortly, I hope, the proof) is that the JVM can do file IO much better than Erlang, while Scala’s Actors let me retain my concurrency.
Update: OK, some numbers and code. This is a benchmark for Erlang and Scala to read in a file line by line.
First, the Erlang code:
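    %% Open the file and read it line by line, discarding each line.
    process_file2(Filename) ->
        {ok, File} = file:open(Filename, read),
        process_lines2(File).

    %% Tail-recursive loop: close the file once eof is reached.
    process_lines2(File) ->
        case io:get_line(File, '') of
            eof -> file:close(File);
            _ -> process_lines2(File)
        end.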
Now the Scala code:
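    import java.io.{BufferedReader, FileReader}

    object LineReader {
      // Apply f to every line until readLine() returns null (end of file).
      def foreachline(in: BufferedReader, f: String => Unit): Unit = {
        val line = in.readLine()
        if (line == null) return
        else f(line)
        foreachline(in, f)
      }

      // Open the file with a default-sized BufferedReader and process each line.
      def forLines(filename: String, f: String => Unit) = {
        val in = new BufferedReader(new FileReader(filename))
        foreachline(in, f)
        in.close()
      }
    }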
OK, so these aren’t exactly the same. The Scala example dispatches each line to a function, so if anything Scala is at a disadvantage.
Here are the timings, three runs each, on my MacBook Pro with a 2.2 GHz Intel Core 2 Duo. Erlang is the BEAM emulator 5.5.5 and Scala is 2.6 running on JDK 1.5 on Mac OS X. The Erlang code was compiled with HiPE.
I am reading in a 1,028,071,833-byte file with 10,037,355 lines.
| Language | Run 1 | Run 2 | Run 3 |
|---|---|---|---|
| Erlang | 205.830 sec | 208.999 sec | 207.454 sec |
| Scala (JVM) | 36.094 sec | 39.917 sec | 34.337 sec |
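To put those numbers in perspective, here is a quick back-of-the-envelope calculation using only the file size and the six timings above. It works out to a mean of roughly 207 seconds for Erlang against roughly 37 seconds on the JVM, a factor of about 5.6, or about 5 MB/s versus about 28 MB/s. The class and method names are just illustrative, and "MB" here means 1,000,000 bytes.

```java
public class ThroughputEstimate {
    // File size in bytes, as reported in the post.
    static final long FILE_BYTES = 1_028_071_833L;

    // Arithmetic mean of the given timings (seconds).
    static double mean(double... xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Throughput in megabytes (10^6 bytes) per second.
    static double megabytesPerSecond(long bytes, double seconds) {
        return bytes / seconds / 1_000_000.0;
    }

    public static void main(String[] args) {
        double erlang = mean(205.830, 208.999, 207.454);
        double jvm = mean(36.094, 39.917, 34.337);
        System.out.printf("Erlang: %.1f s mean, %.2f MB/s%n",
                erlang, megabytesPerSecond(FILE_BYTES, erlang));
        System.out.printf("JVM:    %.1f s mean, %.2f MB/s%n",
                jvm, megabytesPerSecond(FILE_BYTES, jvm));
        System.out.printf("Ratio:  %.1fx%n", erlang / jvm);
    }
}
```

Three runs each is a small sample, so treat these as rough figures rather than a careful measurement.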
September 25th, 2007 at 4:11 am
Reading the comments on Tim Bray’s post, it seems the difference is in the buffering.
Igwan advocates buffering the whole file before processing it and claims faster times. Also, Java’s buffered reader implements (who’d have thought it) a reasonably sized buffer by default (ISTR it’s 8 KB). This means each disk read for Erlang is 1 or 2 bytes, and it repeats the process to identify lines, whereas Java hoovers up a few KB each time and works out where the lines are afterwards.
With this in mind, the stats you posted are no great surprise: Erlang will be working the IO subsystem pretty hard, and that will slow down the whole thing.
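The buffering effect on the Java side can be made visible with a small sketch. It wraps an in-memory source in a hypothetical `CountingReader` (not part of any library) and counts how many bulk read requests actually reach it while `BufferedReader.readLine()` is called once per line. It uses a `StringReader` as a stand-in for a file, so it says nothing about real disk timings or about what Erlang does internally.

```java
import java.io.BufferedReader;
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class BufferingDemo {
    /** Hypothetical helper: counts bulk read requests that reach the wrapped Reader. */
    static final class CountingReader extends FilterReader {
        int calls = 0;

        CountingReader(Reader in) {
            super(in);
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            calls++;
            return super.read(buf, off, len);
        }
    }

    /** Returns {lines read, read requests that reached the underlying source}. */
    static int[] countLinesAndReads(String text) throws IOException {
        CountingReader source = new CountingReader(new StringReader(text));
        int lines = 0;
        // Default-sized BufferedReader, as in the Scala benchmark above.
        try (BufferedReader in = new BufferedReader(source)) {
            while (in.readLine() != null) {
                lines++;
            }
        }
        return new int[] { lines, source.calls };
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sb.append("line ").append(i).append('\n');
        }
        int[] result = countLinesAndReads(sb.toString());
        System.out.println(result[0] + " lines, " + result[1] + " underlying reads");
    }
}
```

With ten thousand short lines, the number of underlying reads should come out far smaller than the number of lines, which is the point being made about the buffer.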
September 26th, 2007 at 6:07 am
The real question seems to be: why is there no buffered implementation of io:get_line(File)? Did Tim not find it, does the standard Erlang library not provide one, or should one use a third-party lib?
Igwan’s comment is not really a solution. Sometimes reading files into memory works, sometimes it doesn’t. Sometimes your files are just too big to read into memory.
When comparing the in-memory solution to a Java/Scala solution, one should implement the Scala solution with NIO, which uses memory-mapped files and OS memory management. This is much faster than reading files into a buffer by hand (I have used NIO to parse large log files for analysis).
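For reference, a minimal sketch of the memory-mapped approach might look like the following. It counts newline bytes by mapping the file in fixed-size windows, since a single `MappedByteBuffer` cannot cover more than `Integer.MAX_VALUE` bytes. The class name and the 256 MB window size are arbitrary choices for illustration, it assumes `\n` line endings, and a final line without a trailing newline is not counted. Whether this actually beats a `BufferedReader` depends on the workload and would need measuring.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedLineCounter {
    // Arbitrary window size; one mapping must stay below Integer.MAX_VALUE bytes.
    static final long WINDOW = 256L * 1024 * 1024;

    /** Counts '\n' bytes in the file using read-only memory mappings. */
    static long countNewlines(Path file) throws IOException {
        long count = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long pos = 0; pos < size; pos += WINDOW) {
                long len = Math.min(WINDOW, size - pos);
                MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (buf.hasRemaining()) {
                    if (buf.get() == '\n') {
                        count++;
                    }
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countNewlines(Paths.get(args[0])) + " lines");
    }
}
```

Because newlines are counted byte by byte, a line that straddles two windows needs no special handling here; code that extracts the line contents would have to deal with that case.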
And if one is supposed to develop a buffered implementation or a memory-mapped one, then my main concern with all languages besides Java comes into play: with Java you usually do not develop applications but assemble them. There are so many great open source libraries around (Lucene, SVNKit, Spring, Seam, Camel, …) that most of the time you only write plumbing code.
Peace
-stephan
–
Stephan Schmidt :: stephan@reposita.org
Reposita Open Source - Monitor your software development
http://www.reposita.org
Blog at http://stephan.reposita.org - No signal. No noise.