ARPA speech recognition benchmarks are very bogus!
11 total messages, started by bovik@eecs.nwu.edu on Thu, 06 Apr 1995 00:00
#3808
Author: bovik@eecs.nwu.edu
Date: Thu, 06 Apr 1995 00:00
42 lines
1872 bytes
Today at ASAT'95, I learned something that completely disgusts me:

The ARPA Automatic Speech Recognition (ASR) benchmarks, which have
been the final arbiter of speech recognition research funding for well
over a decade, do not take into account the time spent by the ASR
system to perform the recognition task (the "times real-time" statistic.)
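
(For reference, the two statistics at issue, word error rate and "times
real-time", are conventionally computed as follows; the sketch below uses
illustrative Python names, not anything defined by the ARPA benchmarks
themselves:)

    # Word error rate: substitutions + deletions + insertions needed to
    # turn the system's output into the reference, per reference word.
    def word_error_rate(subs, dels, ins, ref_words):
        return (subs + dels + ins) / ref_words

    # "Times real-time": seconds of computation per second of audio.
    # 1.0 is real time; 100.0 means one hundred times real-time.
    def times_real_time(processing_seconds, audio_seconds):
        return processing_seconds / audio_seconds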

To understand the implications of this, consider this illustration:

System A, from a commercial automatic dictation system vendor, scores
an 11% word error rate running IN REAL TIME on a 486/66 PC with 16MB
of RAM and an off-the-shelf sound card.

System B, from a government-funded research laboratory, scores an
8% word error rate on the same task.  However system B is running
at ONE HUNDRED times real-time and requires a 120 MIPS workstation
with 100MB of RAM and a dedicated DSP.

So, system B scores clearly superior in the eyes of ARPA, and wins a
lucrative research contract renewal for its developers.  What do the
developers of system A get?  Nothing, except maybe a shareholder
lawsuit stemming from a stock price drop when news hits the press that
their latest technology performed considerably worse than other
participants at the ARPA Spoken Language Systems Technology Workshop.

This is not an effective use of taxpayer dollars.

I understand that the NIST, part of the Department of Commerce, is
designing new ASR benchmarks that may not have this flaw.  But nearly
the entire Republican leadership is on record as opposing the
continued existence of the Department of Commerce as well as
supporting the expansion of the Department of Defense.  So....

Sincerely,
:James Salsman

P.S.  ASAT is the Advanced Speech Applications and Technologies
conference and exhibition.  ARPA is the Department of Defense's
Advanced Research Projects Agency.  NIST is the National Institute of
Standards and Technology.


Re: ARPA speech recognition benchmarks are very bogus!
#3809
Author: entropy@virek.wo
Date: Thu, 06 Apr 1995 00:00
25 lines
1208 bytes
In article <3m07kt$oul@news.eecs.nwu.edu>,
James Salsman <bovik@eecs.nwu.edu> wrote:
>System B, from a government-funded research laboratory, scores an
>8% word error rate on the same task.  However system B is running
>at ONE HUNDRED times real-time and requires a 120 MIPS workstation
>with 100MB of RAM and a dedicated DSP.

This is exactly why government should keep its hands out of commercial
research. In the few cases where they do end up discovering something
useful they give the patent away as a form of welfare for big companies.
The least they could do after squandering our money is make the results
available to everyone!

The market for good speech recognition is potentially massive; there is no
reason the government needs to become involved. The only place government
research money belongs is in science that will not/cannot be funded by
private industry (astronomy, etc.).

--
------ Call the skeptic hotline 1-900-666-5555 talk to your own personal .
\    / skeptic 24 hours/day.                                            . .
 \  / Exonize- 1. To censor. 2. To crap on civil rights as a lame duck . . .
  \/   3. To trade liberty for security from bad words.               . . . .


Re: ARPA speech recognition benchmarks are very bogus!
#3815
Author: mmalcolm Crawford
Date: Sat, 08 Apr 1995 00:00
61 lines
2571 bytes
Today, in comp.speech, I read something that completely disgusts me:

> So, system B scores clearly superior in the eyes of ARPA, and wins
> a lucrative research contract renewal for its developers.  What do
> the developers of system A get?  Nothing, except maybe a shareholder
> lawsuit stemming from a stock price drop when news hits the press
> that their latest technology performed considerably worse than
> other participants at the ARPA Spoken Language Systems Technology
> Workshop.
>
> This is not an effective use of taxpayer dollars.
>
Oh heavens, what drivel.

(1) Consider Phil Woodland's excellent summary (on comp.speech) -- there are
currently no commercial products in the same league as the best research
systems *in terms of raw error rates*.


(2) What evidence do we have for the validity of your conclusions re the
effects on the company developing system A, or on the effectiveness of the
use of taxes?

What alternative scenarios can we envisage...?

The company selling system A has a marketing department worth its salt which
runs a campaign comparing its product favourably with the academic system,
which is state-of-the-art but which performs only marginally better,
requires a major investment in hardware, and isn't even practically useful.

Furthermore, the company's hard-working engineering team looks carefully at
the papers published by the academic developers, gains a number of insights,
and reduces the error rate to 9% for product version 1.1.

Both of these lead to a considerable sales increase, win a vote of
confidence from the shareholders, and massive perks and bonuses for the
already overpaid staff.


(3) There seems to be an implication in what you write that the company
selling system A should benefit from tax dollars (presumably if speed were a
factor they would win the benchmark tests, and so also the fictitious
lucrative research contract from ARPA).  It's not clear why.  Why should my
taxes support a commercial organisation which should be making its own way
in the marketplace, rather than an academic institution the fruits of whose
labours are freely available to all?


(4) Why should the time spent by the ASR system to perform the recognition
task be such a significant factor in the first place?  Does it necessarily
matter how quickly you get the wrong answer?  There are a number of
application domains where getting better recognition slowly may be
preferable to getting a sloppy result immediately -- offline transcription
of interviews, broadcasts etc. for example.


Malcolm

=personal opinions

Re: ARPA speech recognition benchmarks are very bogus!
#3816
Author: bovik@eecs.nwu.edu
Date: Sat, 08 Apr 1995 00:00
72 lines
3147 bytes
In article <3m570q$je1@news.eecs.nwu.edu>,
 James Salsman <bovik@eecs.nwu.edu> wrote:
> In article <3m3l0f$o6i@lyra.csx.cam.ac.uk>,
>  Phil Woodland <pcw@eng.cam.ac.uk> wrote:
>>
>> You are correct that the ARPA CSR (continuous speech recognition) tests
>> do not require reporting of the amount of time spent performing recognition.
>
> Nor do they require reporting of the memory usage, processor speed, or
> auxiliary hardware used (e.g., DSP banks or multiprocessing.)
>...
> The purpose of my illustration was to highlight the the extremely poor
> experimental design of the benchmarks.  I would be very interested to
> read any defense of the utility of the benchmarks as they are now.

At this point, I have had no followups or replies that suggest the
ARPA benchmark numbers have any utility or independent meaning at all.

> In ignoring so many factors, the benchmarks are virtually meaningless.
> What's to keep someone from letting their system run all night just to
> search a few more plies, or adding an extra 500MB of RAM so they can
> keep bigger lookup tables?

To make this as clear as possible, here is another hypothetical illustration:

System X, running in ten times real-time, scores a 10% word error rate.

System Y, identical in every respect to X, with the sole exception
that Y runs in one-hundred times real-time, scores a 5% word error rate.

The resulting ARPA benchmark numbers would show system Y far superior
to system X, even though they are identical in hardware and software.

>>8. It's an interesting research area to explore the algorithms for making
>>   systems real-time without too much impact on performance.
>
> Indeed it is!  And on affordable hardware.
>
> When I was coding for benchmark evaluation under RSTS/E many years
> ago, algorithm performance was described in "kilo-core ticks," a
> measure of the amount of RAM in use for each machine instruction
> (regardless of processor speed.)  It's a simple concept, really.

The obvious way to correct the flaw in the ARPA benchmarks is to
report word error rates for a system running within a specific
established number of, say, "giga-core ticks."  For example, a 50MHz
system with 16MB of RAM would be allowed four times as long as a
100MHz system with 32MB of RAM to perform the same recognition task.
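
A rough sketch of this accounting, with an illustrative Python function
(the linear unit conversion here is an assumption, not an established
standard):

    # "Core ticks": instructions executed times RAM in use.  Under a fixed
    # budget, a 50MHz/16MB system is allowed (100*32)/(50*16) = 4 times as
    # long as a 100MHz/32MB system on the same task.
    def allowed_seconds(budget_giga_core_ticks, mhz, ram_mb):
        ticks_per_second = mhz * 1e6 * ram_mb  # instructions/sec times MB held
        return budget_giga_core_ticks * 1e9 / ticks_per_second

    print(allowed_seconds(1000.0, 50, 16) / allowed_seconds(1000.0, 100, 32))  # 4.0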

However, I am not suggesting that the trade off between time and space
should be judged in a directly proportional (or even linear) fashion.
For example, the 100MHz system above might be allowed only 3.8 times
as long as the 50MHz system with half the memory.

Most important is to consider the computational power brought to bear,
not just the unqualified error rate.

> ARPA should have adopted such realistic measures long ago.

With so many independent factors not held constant, the ARPA benchmark
numbers clearly have no discernible meaning.  This mistake has cost
the American taxpayer (and the financial supporters of foreign labs
submitting to the flawed ARPA benchmarks) untold millions in wasted
research funding.

Those responsible for this oversight should be held accountable
for their negligence.

Sincerely,
:James Salsman


Re: ARPA speech recognition benchmarks are very bogus!
#3817
Author: bovik@eecs.nwu.edu
Date: Sat, 08 Apr 1995 00:00
33 lines
1241 bytes
In article <950408212009.213AACUW.malc@daneel>,
 mmalcolm Crawford  <m.crawford@dcs.shef.ac.uk> wrote:
>
> ... There are currently no commercial products in the same league as
> the best research systems *in terms of raw error rates*.

There are currently no commercial products in the same league as the
best research systems *in terms of raw computational power*.

> Why should the time spent by the ASR system to perform the
> recognition task be such a significant factor in the first place?

Because given more time, an identical algorithm can do more work.

> There are a number of application domains where getting better
> recognition slowly may be preferable to getting a sloppy result
> immediately -- offline transcription of interviews, broadcasts
> etc. for example.

Agreed; however, this is absolutely irrelevant.  If you are
transcribing 10 hours of interviews and broadcasts each day, do you
really want to use a research system which will take 1,000 hours on
your most expensive workstations?  You would need 42 of them just to
keep up!
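
(The arithmetic behind that figure, as a quick sketch:)

    # 10 hours of audio per day at 100x real-time is 1,000 machine-hours
    # of computation per day; one workstation supplies only 24 of those.
    machines_needed = 10 * 100 / 24   # ~41.7, so 42 workstations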

A benchmark which limits the number of machine cycles/RAM would be a
better arbiter of off-line transcription systems, as well as real-time
dictation systems.

Sincerely,
:James Salsman


Re: ARPA speech recognition benchmarks are very bogus!
#3818
Author: nickm@netaxs.com
Date: Sun, 09 Apr 1995 00:00
13 lines
444 bytes
ARPA is correct in ignoring response time. Continuous speech recognition
is so bad right now that every effort should be made to improve its
accuracy.

My main complaint about the benchmarks is that they don't really
present real-world scenarios. If exactly the same text used in the
WSJ task were recorded using a different microphone, or in a room with
an echo, or heavy background noise, ALL of the participants would
fail miserably.



Re: ARPA speech recognition benchmarks are very bogus!
#3819
Author: mmalcolm Crawford
Date: Sun, 09 Apr 1995 00:00
101 lines
4338 bytes
James Salsman <bovik@eecs.nwu.edu> wrote:
> The purpose of my illustration was to highlight the extremely
> poor experimental design of the benchmarks.  I would be very
> interested to read any defense of the utility of the benchmarks as
> they are now.
>
It is usually argued that the benefit of the benchmarks is to channel tax
dollars into those research establishments which hold the most hope of
success (I do not necessarily believe in this philosophy).

> I am not interested in attacks on my hypothetical illustration.
>
You don't seem to be interested in anything other than your own blinkered
and unsubstantiated opinion.  The very fact that your illustration was
hypothetical is in itself telling: Phil Woodland's summary explained quite
clearly that any "commercial" systems attempting to perform the task had
hardware requirements well in excess of those found on most desktops (and
comparable to those of academic research systems).

Furthermore, your hypothetical illustration is the basis of much of your
grievance.  If instead we consider the following:

System A, from a commercial automatic dictation system vendor, scores
a 50% word error rate running IN REAL TIME on a 486/66 PC with 16MB
of RAM and an off-the-shelf sound card.

System B, from a government-funded research laboratory (e.g. the Hochberg,
Renals & Robinson ABBOT system), scores a 10% word error rate on the same
task.  However system B is running at FIVE times real-time and requires a
100MHz Pentium workstation with 64MB of RAM and an off-the-shelf sound card.

as this would be closer to the truth, then your argument looks very weak
indeed.
You've put up a straw man, and it deserves to be burned down.

---

> However, I am not suggesting that the trade off between time and
> space should be judged in a directly proportional (or even linear)
> fashion.  For example, the 100MHz system above might be allowed
> only 3.8 times as long as the 50MHz system with half the memory.
>
So, come up with a trade-off which will be entirely fair, and agreeable to
all participants... should be easy enough...
Not.

---

> In ignoring so many factors, the benchmarks are virtually meaningless.
> What's to keep someone from letting their system run all night just
> to search a few more plies, or adding an extra 500MB of RAM so they
> can keep bigger lookup tables?
>
You might be interested to know that one development system used by IBM
(they're a commercial organisation, you know) uses more than 1GB RAM.  Maybe
you should write to Dr. Jellinek and tell him to cut his memory usage...

---

>>8. It's an interesting research area to explore the algorithms for making
>>   systems real-time without too much impact on performance.
>
>Indeed it is!  And on affordable hardware.
>
If academic institutions were developing systems which could be sold in the
marketplace, you would be complaining of unfair competition.  It is people
such as yourself, however, who are ensuring that this will be happening
real-soon-now.

---

>With so many independent factors not held constant, the ARPA benchmark
>numbers clearly have no discernible meaning.  This mistake has cost
>the American taxpayer (and the financial supporters of foreign labs
>submitting to the flawed ARPA benchmarks) untold millions in wasted
>research funding.
>
You still haven't explained why this is the case.  What evidence do we have
that this is so?

>Those responsible for this oversight should be held accountable
>for their negligence.
>
Those responsible for the development of the current suite should be
commended for their foresight in not restricting science to the development
of what would be of immediate commercial interest, but rather expanding its
horizons to what may be of benefit a few (or many) years from now.  Would it
have made sense five years ago to have constrained all research systems to
running on 80286 machines in 1MB RAM?  Does it not make more sense to show
what will be possible as computing power on the desktop increases, and leave
it to commercial enterprises to (try to) engineer marketable solutions?

Directors of businesses who do not ensure full exploitation of the advances
made possible through world-class, ground-breaking research carried out in
academic institutions should be held accountable by their shareholders.

Malcolm

=personal opinions

Re: ARPA speech recognition benchmarks are very bogus!
#3820
Author: spp@bob.eecs.ber
Date: Sun, 09 Apr 1995 00:00
13 lines
310 bytes
nickm@netaxs.com (nickm) writes:

> ARPA is correct in ignoring response time.

I will ask again: could somebody please document this
claim that ARPA funds projects without considering
response time / compute power etc. , before we go
further down the path of debating the ethics of such
a position ?

Steve


Re: ARPA speech recognition benchmarks are very bogus!
#3822
Author: bovik@eecs.nwu.edu
Date: Mon, 10 Apr 1995 00:00
53 lines
2086 bytes
In article <950409181430.213AACUa.malc@daneel>,
 mmalcolm Crawford  <m.crawford@dcs.shef.ac.uk> wrote:
>
> You've put up a straw man, it deserves to be burned down.

I agree that my first illustration was badly flawed.  Don't you agree
that my second illustration (two identical systems running for
different durations achieving vastly different benchmark scores) fully
highlights the flaws in the ARPA benchmark scheme?

>> However, I am not suggesting that the trade off between time and
>> space should be judged in a directly proportional (or even linear)
>> fashion.  For example, the 100MHz system above might be allowed
>> only 3.8 times as long as the 50MHz system with half the memory.
>
> So, come up with a trade-off which will be entirely fair, and agreeable to
> all participants... should be easy enough...

Agreed.  Here is a proposal:  use a memory-to-cycles trade-off
coefficient equal to the median of votes cast by all evaluation
participants.
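
A minimal sketch of that voting rule, with illustrative numbers:

    import statistics

    def agreed_coefficient(votes):
        # The median ignores extreme votes, so no single site can pull
        # the agreed coefficient far by voting strategically.
        return statistics.median(votes)

    print(agreed_coefficient([0.5, 1.0, 1.2, 3.8, 100.0]))  # -> 1.2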

> If academic institutions were developing systems which could be sold
> in the marketplace, you would be complaining of unfair competition.

Oh?  Perhaps you do not fully understand my motivations.

> It is people such as yourself, however, who are ensuring that this
> will be happening real-soon-now.

Believe me, I can live with that.

> Would it have made sense five years ago to have constrained all
> research systems to running on 80286 machines in 1MB RAM?

Of course not; talk about straw men!  It always makes sense to employ
principles of sound experimental design, holding independent factors
constant when drawing conclusions from a single dependent variable
(the word error rate, in this case.)

> Directors of businesses who do not ensure full exploitation of the
> advances made possible through world-class, ground-breaking research
> carried out in academic institutions should be held accountable by
> their shareholders.

Any director basing a business decision only on ARPA benchmark
numbers and not on the computational situation behind them would
clearly be negligent.

Sincerely,
:James Salsman


Re: ARPA speech recognition benchmarks are very bogus!
#3823
Author: ajr@compute.demon.co.uk
Date: Mon, 10 Apr 1995 00:00
27 lines
1125 bytes
In article <3m8s34$1dp@netaxs.com> nickm@netaxs.com (nickm) writes:
> ARPA is correct in ignoring response time. Continuous speech recognition
> is so bad right now that every effort should be made to improve its
> accuracy.
>
> My main complaint about the benchmarks is that they don't really
> present real-world scenarios. If exactly the same text used in the
> WSJ task were recorded using a different microphone, or in a room with
> an echo, or heavy background noise, ALL of the participants would
> fail miserably.

For the last few years ARPA has adopted a "Hub and Spoke" paradigm.

The "Hub" last year was transcription of unlimited vocabulary read text
from North American business news.

The "Spokes" included speaker adaptation, domain adaptation, background
noise, microphone independence, telephone speech, etc.

The evaluations were carefully thought out to be both challenging and
informative on the "clean" case and also extend this towards
"real-world" conditions.  This is a useful framework as it factors the
problem into roughly independent tasks and so allows more rapid progress.
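
(Sketched as data, purely for illustration; the spoke list paraphrases the
post above, not an official ARPA specification:)

    # Hub-and-spoke evaluation: the hub is the core task, and each spoke
    # varies one real-world condition while the rest are held constant.
    evaluation = {
        "hub": "unlimited-vocabulary read North American business news",
        "spokes": [
            "speaker adaptation",
            "domain adaptation",
            "background noise",
            "microphone independence",
            "telephone speech",
        ],
    }
    # Sites report word error rate on the hub plus any spokes attempted,
    # so each condition is compared against the same clean baseline.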

Tony Robinson


Re: ARPA speech recognition benchmarks are very bogus! Not.
#3877
Author: James Salsman
Date: Fri, 05 May 1995 00:00
46 lines
1955 bytes
Regarding the fact that ARPA automatic speech recognition benchmarks
have been strongly unscientific in that many of the contributing
factors to the benchmark results were left uncontrolled,

In article <AJR.95Apr11062211@compute.demon.co.uk>,
 Tony Robinson <ajr@compute.demon.co.uk> suggests a solution:
>
> The correct way to implement this within the current framework is to
> create a new sub-task (a new spoke).  Let's call this a 'resource
> limited' sub-task.  A reasonable set up would be to make the speech
> recognition problem the same as the central task (the hub) - that is
> unlimited vocabulary recognition.  This way, every site that completes
> both tasks contributes by:
>
>   * Improving basic speech recognition disregarding resource limits
>
>   * Developing techniques for porting systems to resource bound platforms
>
>   * Providing data points for relative loss in accuracy to help estimate
>       the loss in performance of other systems were they also ported
>
> This approach lies within the ARPA CSR framework used in Nov 1993 and Nov
> 1994.  It is conceivable that something like this will be implemented in
> 1995.

After almost a month of careful consideration, I report that I fully
approve of phasing in scientific experimental design as a "spoke" off
the "hub" if that is what it will take to get good experimental design
instituted.

Such a plan of action will considerably reduce the legal liability of
many researchers, not just in the United States.

It is clear to me that there are currently grounds for at least a few
such negligence and unscientific conduct lawsuits.  However, I am less
familiar with the statutes of limitations in the International Court
of Justice, so current non-U.S. liability for participation in
improperly controlled ARPA benchmarks is less certain.  If the
uncontrolled experiments continue, there will be no question.

Sincerely,
:James Salsman

May Peace Prevail on Earth!

