
4 total messages · Started by jimw@math.umass.edu · Wed, 20 Apr 1994 15:26
Benchmarking APL interpreters
#3971
Author: jimw@math.umass.edu
Date: Wed, 20 Apr 1994 15:26
47 lines
2564 bytes
Timing "+/1e6 {reshape} 1.0" is not a reasonable way to benchmark an APL
system.  It can make an APL interpreter seem much faster than it really is.

1.  Many APL systems (including APL*PLUS/386) don't care how you format
numbers on input, and "1.0" is Boolean.  1e6{reshape}1.0 is a 125K byte
bit vector instead of an 8M byte vector of reals.  Summing a Boolean
vector can be done very fast using a 256-byte translate table containing
the population count for each possible byte.  When C programmers write
"1.0", they mean floating point.  APLers have to be careful not to
compare grapes with watermelons.
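
For the C programmers following along, the translate-table trick looks
roughly like this (my sketch, with my own names -- not actual vendor
code):

#include <stdint.h>
#include <stddef.h>

static unsigned char popcount[256];   /* population count per byte */

static void build_table(void)
{
    for (int b = 0; b < 256; b++)
        for (int v = b; v; v >>= 1)
            popcount[b] += v & 1;
}

/* Sum a packed Boolean vector: one table lookup per 8 elements.
   Any pad bits past the end of the vector must be zero. */
static long boolean_sum(const uint8_t *bits, size_t nbytes)
{
    long total = 0;
    for (size_t i = 0; i < nbytes; i++)
        total += popcount[bits[i]];
    return total;
}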

2.  Assume you change 1.0 to 1.1 so you get floating point numbers.
Most of the execution time will be spent in +/, so mainly you are
evaluating the quality of the plus-reduction loop.  This is a common
operation and is usually carefully crafted by the interpreter authors.
How carefully?  Well, the plus-reduction loop for the IBM mainframe-
based VSAPL consists of TWO machine instructions!  An Add and a "BXH"
loop instruction.  Pretty hard to beat (although it can be done by
unrolling the loop).  This benchmark should make VSAPL look about as
fast as any compiled language.
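
(By "unrolling" I mean something like this C sketch -- several
accumulators per pass so the loop-control overhead is amortized; my
illustration, not VSAPL's actual inner loop:)

#include <stddef.h>

double plus_reduce(const double *v, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {   /* four adds per loop test */
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < n; i++)             /* leftover 0-3 elements */
        s += v[i];
    return s;
}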

3.  You gotta be realistic about the problems with the APL statement.
For example, did anyone mention Workspace Full to the C programmers?
APL has to be able to materialize that 8 MEGAbyte array or it can't
execute the statement.

   Even admitting the memory requirements, the benchmark gives an
unrepresentative picture of APL.  It is rare that a real-life APL
application can achieve this degree of "parallelism" (pass arrays this
large to primitives without there being a lot of other operations on
MUCH smaller arrays).  And it is rare that an application requires only
summation within the loop.  There are plenty of simple calculations that
force an APL program to loop at a scalar or short vector level.

   My own rule of thumb, based on benchmarks that unfortunately I have
not cataloged, is that well-written (i.e., no unnecessary loops)
real-life APL code typically runs about 4 times slower than compiled
code, give or take a factor of two.  (Assuming the problem can be
vectorized to some extent.)  Note that this is within the 10:1 ratio
that John Nagle was looking for.  If the APL code has to loop at a
scalar level, it typically runs about 150 times slower (+/- 2x) than
compiled code.  Not great, but compare that with the figure of >1,000
times slower that Nagle quotes for Smalltalk and Python.

                                                Jim


Re: Benchmarking APL interpreters
#3975
Author: wchang@phage.cshl.org
Date: Wed, 20 Apr 1994 15:39
148 lines
5869 bytes
In article <2p3hj2$lmc@nic.umass.edu> jimw@math.umass.edu (Jim Weigang) writes
many useful warnings.  Perhaps it should be cross-posted to comp.programming.

I can only wish my APL (APL.68000 Level II with FPU) did some of these
optimizations.  From what I can tell, the only numerical type is float.
Booleans are slow.  From past experience I know APL*PLUS booleans are
"packed" and quite fast.  (Alas, they lost me as a customer, perhaps
permanently, when they orphaned APL*PLUS for Macintosh instead of making
it run on the MacII.  I had just spent several hundred dollars, on a
grad student salary.)  So, how fast is APL*PLUS at +/1000000 rho 1.1?
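
(For calibration, the C version of that statement is about as simple as
a benchmark gets -- my sketch; compile with optimization and compare:)

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)                      /* C baseline for +/1000000 rho 1.1 */
{
    size_t n = 1000000;
    double *v = malloc(n * sizeof *v);
    if (v == NULL)
        return 1;                   /* the C version of WS FULL */
    for (size_t i = 0; i < n; i++)
        v[i] = 1.1;                 /* genuinely floating point */

    clock_t t0 = clock();
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    clock_t t1 = clock();

    printf("sum = %g  time = %g sec\n",
           sum, (double)(t1 - t0) / CLOCKS_PER_SEC);
    free(v);
    return 0;
}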

>... the plus-reduction loop for the IBM mainframe-
>based VSAPL consists of TWO machine instructions!  An Add and a "BXH"
>loop instruction.  Pretty hard to beat (although it can be done by
>unrolling the loop).  This benchmark should make VSAPL look about as
>fast as any compiled language.

Why didn't they unroll the loop in the interpreter? :-)
Or did the optimizing compiler do it for them?

>3.  You gotta be realistic about the problems with the APL statement.
>For example, did anyone mention Workspace Full to the C programmers?
>APL has to be able to materialize that 8 MEGAbyte array or it can't
>execute the statement.

Have 14 Meg on my PowerBook :-)  Of course you are right.  But then,
this "benchmark" was only a test of speed; in practice an array to be
summed already exists.

>   My own rule of thumb, based on benchmarks that unfortunately I have
>not cataloged, is that well-written (i.e., no unnecessary loops)
>real-life APL code typically runs about 4 times slower than compiled
>code, give or take a factor of two.  (Assuming the problem can be
>vectorized to some extent.)

I'm not sure that currently available APLs are as fast as you describe,
even on unrealistically favorable examples.  At APL'93, I compared APL2,
APL*PLUS, and Dyalog APL (making allowance for hardware, different RS/6000s)
on a piece of non-loopy integer-intensive code (see below).  APL*PLUS had
just been sped up, and beat out Dyalog.  The brand new APL2 (implemented in C,
but at the time compiled _without_ optimization according to IBM reps) was
slightly faster still, 2-3 times faster than the old APL2.

(Summary: new APL2 > new APL*PLUS > Dyalog > old APL2, on IBM RS/6000.)

But compared to C it was no contest.  More than an order of magnitude
difference.  Of course the optimizing compiler on the RS/6000 was pretty nice.
One prevailing conjecture on comp.programming is that modern compilers
gain more from modern (RISC) workstations than interpreters do.

Can people chip in with more benchmarks for common operations or idioms?
Personally I find this subject interesting (and not too intellectually taxing).

-- Bill Chang (wchang@cshl.org)

p.s. code

From: wchang@phage.cshl.org (William Chang in Marr Lab - CSHL)
Subject: Re: Quality of Implementation
Date: Tue, 2 Feb 1993 22:37:56 GMT

Here's a simple dynamic programming algorithm for
"approximate string matching":

P[1..m] = pattern
T[1..n] = text
D = (n+1)x(m+1) table whose entry
D[i,j] = minimum number of edit operations needed to
         convert P[1..j] to _some_ T[i'..i]
where edit operations are substitutions, insertions,
or deletions of single letters,

so D[0,j] = j; D[i,0] = 0; /* boundary conditions */
D[i,j] = min(D[i-1,j] + 1, /* insert T[i] */
             D[i,j-1] + 1, /* delete P[j] */
             D[i-1,j-1] + (0 if T[i]=P[j],
                           1 o.w. /* substitute */) );
and D[i,m] = "best match of pattern P ending at T[i]".
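
(Here is a direct C rendering of the recurrence, rolling two rows of
the table since D[i,.] depends only on D[i-1,.] -- my sketch, to check
the APL against:)

#include <stdlib.h>

/* best[i-1] = D[i,m] = best match of P ending at T[i], i = 1..n */
void asm_naive(const char *P, size_t m,
               const char *T, size_t n, int *best)
{
    int *prev = malloc((m + 1) * sizeof *prev);
    int *cur  = malloc((m + 1) * sizeof *cur);
    for (size_t j = 0; j <= m; j++)
        prev[j] = (int)j;                   /* D[0,j] = j */

    for (size_t i = 1; i <= n; i++) {
        cur[0] = 0;                         /* D[i,0] = 0 */
        for (size_t j = 1; j <= m; j++) {
            int ins = prev[j] + 1;          /* insert T[i] */
            int del = cur[j - 1] + 1;       /* delete P[j] */
            int sub = prev[j - 1]
                    + (T[i - 1] == P[j - 1] ? 0 : 1);
            int d = ins < del ? ins : del;
            cur[j] = sub < d ? sub : d;
        }
        best[i - 1] = cur[m];
        int *t = prev; prev = cur; cur = t; /* roll the rows */
    }
    free(prev);
    free(cur);
}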

The naive double nested loop APL code is very slow;
however, with a simple transformation
  D'[i,j] = D[i,j] - i
the inner loop on i can be replaced by SCAN (\), because the
insertion candidate D[i-1,j]+1 turns into plain D'[i-1,j] and each
column of D' becomes a running minimum over the other two candidates:

(ASCII transliteration key:  ! iota   ? rho   @ quad   ~ drop   [APL?!])

    Z <- P ASM T ; @IO
[1] @IO <- 0         o} origin 0
[2] Z <- -!1+?T      o} transformation
[3] J <- 1
[4] LOOPJ: Z <- min\ J, (1~Z+1) min ((_1~Z)-P[J-1]=T)
[5] ->LOOPJ -! (?P)>=J<-J+1   o} dyadic -! is "if"
[6] Z <- 1~Z+!1+?T   o} undo transformation

which is not too slow (but still slower than C).
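
(In C, one pass of LOOPJ is a running minimum over the column -- again
my sketch, mirroring line [4]:)

#include <stddef.h>

/* Zp[0..n] holds column J-1 of the transformed table D';
   overwritten in place with column J.  pj is P[J-1]. */
void column_update(int *Zp, size_t n, int J, char pj, const char *T)
{
    int run  = J;        /* candidate for D'[0,J]; the J in line [4] */
    int diag = Zp[0];    /* old Z[i-1], the (_1~Z) diagonal term */
    Zp[0] = run;
    for (size_t i = 1; i <= n; i++) {
        int del = Zp[i] + 1;                        /* (1~Z+1) */
        int sub = diag - (T[i - 1] == pj ? 1 : 0);  /* (_1~Z)-P[J-1]=T */
        int cand = del < sub ? del : sub;
        diag = Zp[i];    /* save old Z[i] before overwriting */
        if (cand < run)  /* min\ : the scan */
            run = cand;
        Zp[i] = run;
    }
}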

Give this a try!

-- Bill Chang (wchang@cshl.org)

***************************************************************************
From: richard.levine@canrem.com (Richard Levine)
Date: Tue, 22 Jun 93 02:39:00 -0500

[announcement of transliteration functions]

--- Function <asm> for "approximate string matching" recently
    shared over e-mail by William Chang

      shortlist
apple betty cat dog jimmy jam
      'dig' asm shortlist
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 2 3 2 2 3 3 3 3 3 3
      'dog' asm shortlist
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 1 0 1 2 3 3 3 3 3 3 3 3

z .is p asm t;@io;j
 .comment approximate string matching
 .comment .v 1.0 / 02feb93
 .comment .n william chang (internet communication)
 .comment origin 0
@io .is 0
 .comment transformation
z .is - .iota 1+ .rho t
j .is 1
loopj:
z .is  .downstile \j,(1 .drop z+1) .downstile ((#1 .drop z)-p[j-1]=t)
j .is j+1
 .comment <if> is common APL function:  condition/label
 .goto loopj if( .rho p) .gte j
 .comment undo transformation
z .is 1 .drop z+ .iota 1+ .rho t

Uses of the Translation Functions:
(1) Include APL statements in messages on ASCII-based e-mail
(2) Provide method of sending APL code over the Internet
    that is comparable to sending J code using a J script
(3) Encourage people to experiment with ASCII transliteration
    and reach consensus for keyword names for APL names
(4) Pre-processing and post-processing phase for WSIS0 workspace
    interchange file so transfer file uses only ASCII
(5) Easy method to pre- and post- process APL statements for user
    groups who prefer words (cf. Homer Hortung's
    Letter to the Editor in APL News, Vol 19, No 4, 1987)
(6) Pre- and post-process APL statements so ASCII editors can
    be used for APL editing


Re: Benchmarking APL interpreters
#3988
Author: Doug White
Date: Thu, 21 Apr 1994 12:07
49 lines
2547 bytes
In article <2p3hj2$lmc@nic.umass.edu>, <jimw@math.umass.edu> writes:
(snip)
>    My own rule of thumb, based on benchmarks that unfortunately I have
> not cataloged, is that well-written (i.e., no unnecessary loops)
> real-life APL code typically runs about 4 times slower than compiled
> code, give or take a factor of two.  (Assuming the problem can be
> vectorized to some extent.)  Note that this is within the 10:1 ratio
> that John Nagle was looking for.  If the APL code has to loop at a
> scalar level, it typically runs about 150 times slower (+/- 2x) than
> compiled code.  Not great, but compare that with the figure of >1,000
> times slower that Nagle quotes for Smalltalk and Python.
>
>                                                 Jim

Several years ago, I had to rescue a large amount of microwave circuit
analysis code from our Mainframe.  Most of it was in APL, but there was
a Fortran program that we called from APL that also had to be salvaged.
Being allergic to Fortran, I decided to re-write the program in APL.

The program does a 'full-field' solution of the characteristics of
couple microstrip transmission lines.  It basically involves a LOT of
numerical integration (i.e. lots of loops).  I have a 33 MHz '486,
running APL*PLUS II/386.  When originally translated directly from the
Fortran (with all the loops intact), it took almost a full minute to run.
Eliminating all of the looping, and taking advantage of standard APL
programming techniques, the time was reduced to 1.48 seconds!  This was
with the integrals being chopped into relatively coarse chunks.  In order
to enhance the accuracy, I enlarged some of the arrays, and finally
settled on a version that takes 4.34 seconds.  This is still about 5 times
faster than our Mainframe used to run the less accurate version.

In a parallel effort, one of my co-workers (who is also allergic to Fortran,
and gets a mild rash when exposed to APL) rewrote the same program in 'C'.
Using the same integration parameters and running on the same machine, his
program takes 1.81 seconds to execute.

All in all, I was quite pleased with the performance of APL*PLUS on this
task.  The various floating point primitives have clearly been optimized
to a high level.  The program uses various hyperbolic and regular trig
functions, as well as matrix inversion {domino}.  Anytime anyone sticks
their nose in the air about how slow APL is, I show them this program.
They usually get very quiet after that ...

(Your Mileage May Vary)

Doug White
MIT Lincoln Laboratory

Re: Benchmarking APL interpreters
#3983
Author: bruce@news.srv.u
Date: Thu, 21 Apr 1994 13:06
33 lines
1544 bytes
In article <2p40dqINNj7q@phage.cshl.org> wchang@phage.cshl.org (William Chang in Marr Lab) writes:

> In article <2p3hj2$lmc@nic.umass.edu> jimw@math.umass.edu (Jim Weigang) writes
>
> So, how fast is APL*PLUS at +/1000000 rho 1.1?
> (Summary: new APL2 > new APL*PLUS > Dyalog > old APL2, on IBM RS/6000.)

I can confirm that. I'm running both APL2 for OS/2 and APL*PLUS II/386 v5.1
side by side on a 486/66 PC in OS/2 2.1. Because I'm short of swap file disk
space, I'm using a 1.5 MB workspace.  I did 60 iterations of +/167000 rho
1.1 and divided the time by 10 (60 x 167,000 is just over 10 million
elements, i.e. ten one-million-element sums).  I'm also running 18 other
applications, but none of them are CPU intensive.  Here are my results:

APL2 for OS/2                          1.4 seconds
APL*PLUS (running in OS/2)             1.8 seconds
VS APL on a very old Amdahl mainframe  0.6 seconds

For comparison, the Fortran LINPACK benchmark on my PC is 2.1 MFLOPS (double
precision) and on the mainframe is 2.8 MFLOPS. To get 2.8 MFLOPS on this
mainframe, you have to run the benchmark when no one else is using it.  But
since it's about to be scrapped, probably no one else is using it right now.

Since my PC is 75% of the raw speed of our obsolete mainframe, and mainframe
VS APL did the calculation in 0.6 seconds, a PC version of VS APL should do it
in 0.8 seconds. Is VS APL normally about twice as fast as APL2?

--
Bruce Clarke, Dept of Chemistry, University of Alberta,
	      Edmonton AB, Canada, T6G 2G2
Internet:     Bruce_Clarke@mts.ucs.ualberta.ca	|  voice     (403) 492-5739
Compuserve:   707840,3135			|  fax	     (403) 492-8231

