Thread View: comp.dsp
11 messages
Started by bovik@eecs.nwu.edu
Thu, 06 Apr 1995 00:00
ARPA speech recognition benchmarks are very bogus!
Author: bovik@eecs.nwu.edu
Date: Thu, 06 Apr 1995 00:00
42 lines
1872 bytes
Today at ASAT'95, I learned something that completely disgusts me: the ARPA Automatic Speech Recognition (ASR) benchmarks, which have been the final arbiter of speech recognition research funding for well over a decade, do not take into account the time spent by the ASR system to perform the recognition task (the "times real-time" statistic).

To understand the implications of this, consider this illustration: System A, from a commercial automatic dictation system vendor, scores an 11% word error rate running IN REAL TIME on a 486/66 PC with 16MB of RAM and an off-the-shelf sound card. System B, from a government-funded research laboratory, scores an 8% word error rate on the same task. However, system B is running at ONE HUNDRED times real-time and requires a 120 MIPS workstation with 100MB of RAM and a dedicated DSP.

So, system B scores clearly superior in the eyes of ARPA, and wins a lucrative research contract renewal for its developers. What do the developers of system A get? Nothing, except maybe a shareholder lawsuit stemming from a stock price drop when news hits the press that their latest technology performed considerably worse than other participants at the ARPA Spoken Language Systems Technology Workshop.

This is not an effective use of taxpayer dollars.

I understand that NIST, part of the Department of Commerce, is designing new ASR benchmarks that may not have this flaw. But nearly the entire Republican leadership is on record as opposing the continued existence of the Department of Commerce as well as supporting the expansion of the Department of Defense. So....

Sincerely,
:James Salsman

P.S. ASAT is the Advanced Speech Applications and Technologies conference and exhibition. ARPA is the Department of Defense's Advanced Research Projects Agency. NIST is the National Institute of Standards and Technology.
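The complaint, in short, is that the benchmark ranks systems on word error rate alone, with no charge for compute consumed. A minimal Python sketch of how the ranking flips once the real-time factor is penalised, using only the hypothetical figures from the illustration above (the logarithmic penalty is an arbitrary illustrative choice, not any ARPA formula):

    import math

    # Hypothetical figures from the System A / System B illustration.
    systems = {
        "A (commercial, real-time)":    {"wer": 0.11, "x_realtime": 1},
        "B (research, 100x real-time)": {"wer": 0.08, "x_realtime": 100},
    }

    # ARPA-style ranking: word error rate alone.
    by_wer = sorted(systems, key=lambda s: systems[s]["wer"])

    # A compute-aware ranking: scale WER by the log of the real-time factor.
    by_adjusted = sorted(
        systems,
        key=lambda s: systems[s]["wer"]
        * (1 + math.log10(systems[s]["x_realtime"])),
    )

    print("WER only:     ", by_wer)       # B ranks first
    print("Compute-aware:", by_adjusted)  # A ranks first under this penalty

How steep such a penalty should be is exactly what the rest of the thread argues about.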
Re: ARPA speech recognition benchmarks are very bogus!
Author: entropy@virek.wo
Date: Thu, 06 Apr 1995 00:00
25 lines
1208 bytes
In article <3m07kt$oul@news.eecs.nwu.edu>,
James Salsman <bovik@eecs.nwu.edu> wrote:
> System B, from a government-funded research laboratory, scores an
> 8% word error rate on the same task. However system B is running
> at ONE HUNDRED times real-time and requires a 120 MIPS workstation
> with 100MB of RAM and a dedicated DSP.

This is exactly why government should keep its hands out of commercial research. In the few cases where they do end up discovering something useful, they give the patent away as a form of welfare for big companies. The least they could do after squandering our money is make the results available to everyone!

The market for good speech recognition is potentially massive; there is no reason the government needs to become involved. The only place government research money belongs is in science that will not/cannot be funded by private industry (astronomy, etc.).

--
Call the skeptic hotline 1-900-666-5555: talk to your own personal skeptic 24 hours/day.
Exonize- 1. To censor. 2. To crap on civil rights as a lame duck. 3. To trade liberty for security from bad words.
Re: ARPA speech recognition benchmarks are very bogus!
Author: mmalcolm Crawford
Date: Sat, 08 Apr 1995 00:00
61 lines
2571 bytes
Today, in comp.speech, I read something that completely disgusts me:

> So, system B scores clearly superior in the eyes of ARPA, and wins
> a lucrative research contract renewal for its developers. What do
> the developers of system A get? Nothing, except maybe a shareholder
> lawsuit stemming from a stock price drop when news hits the press
> that their latest technology performed considerably worse than
> other participants at the ARPA Spoken Language Systems Technology
> Workshop.
>
> This is not an effective use of taxpayer dollars.

Oh heavens, what drivel.

(1) Consider Phil Woodland's excellent summary (on comp.speech) -- there are currently no commercial products in the same league as the best research systems *in terms of raw error rates*.

(2) What evidence do we have for the validity of your conclusions re the effects on the company developing system A, or on the effectiveness of the use of taxes? What alternative scenarios can we envisage...? The company selling system A has a marketing department worth its salt, which runs a campaign comparing its product favourably with the academic system -- which is state-of-the-art, but performs only marginally better, requires a major investment in hardware, and isn't even practically useful. Furthermore, the company's hard-working engineering team looks carefully at the papers published by the academic developers, gains a number of insights, and reduces the error rate to 9% for product version 1.1. Both of these lead to a considerable sales increase, a vote of confidence from the shareholders, and massive perks and bonuses for the already overpaid staff.

(3) There seems to be an implication in what you write that the company selling system A should benefit from tax dollars (presumably if speed were a factor they would win the benchmark tests, and so also the fictitious lucrative research contract from ARPA). It's not clear why. Why should my taxes support a commercial organisation which should be making its own way in the marketplace, rather than an academic institution the fruits of whose labours are freely available to all?

(4) Why should the time spent by the ASR system to perform the recognition task be such a significant factor in the first place? Does it necessarily matter how quickly you get the wrong answer? There are a number of application domains where getting better recognition slowly may be preferable to getting a sloppy result immediately -- offline transcription of interviews, broadcasts, etc., for example.

Malcolm
=personal opinions
Re: ARPA speech recognition benchmarks are very bogus!
Author: bovik@eecs.nwu.edu
Date: Sat, 08 Apr 1995 00:00
72 lines
3147 bytes
In article <3m570q$je1@news.eecs.nwu.edu>,
James Salsman <bovik@eecs.nwu.edu> wrote:
> In article <3m3l0f$o6i@lyra.csx.cam.ac.uk>,
> Phil Woodland <pcw@eng.cam.ac.uk> wrote:
>>
>> You are correct that the ARPA CSR (continuous speech recognition) tests
>> do not require reporting of the amount of time spent performing recognition.
>
> Nor do they require reporting of the memory usage, processor speed, or
> auxiliary hardware used (e.g., DSP banks or multiprocessing.)
> ...
> The purpose of my illustration was to highlight the extremely poor
> experimental design of the benchmarks. I would be very interested to
> read any defense of the utility of the benchmarks as they are now.

At this point, I have had no followups or replies that suggest the ARPA benchmark numbers have any utility or independent meaning at all.

> In ignoring so many factors, the benchmarks are virtually meaningless.
> What's to keep someone from letting their system run all night just to
> search a few more plies, or adding an extra 500MB of RAM so they can
> keep bigger lookup tables?

To make this as clear as possible, here is another hypothetical illustration: System X, running in ten times real-time, scores a 10% word error rate. System Y, identical in every respect to X, with the sole exception that Y runs in one hundred times real-time, scores a 5% word error rate. The resulting ARPA benchmark numbers would show system Y far superior to system X, even though they are identical in hardware and software.

>> 8. It's an interesting research area to explore the algorithms for making
>> systems real-time without too much impact on performance.
>
> Indeed it is! And on affordable hardware.
>
> When I was coding for benchmark evaluation under RSTS/E many years
> ago, algorithm performance was described in "kilo-core ticks," a
> measure of the amount of RAM in use for each machine instruction
> (regardless of processor speed.) It's a simple concept, really.

The obvious way to correct the flaw in the ARPA benchmarks is to report word error rates for a system running within a specific established number of, say, "giga-core ticks." For example, a 50MHz system with 16MB of RAM would be allowed four times as long as a 100MHz system with 32MB of RAM to perform the same recognition task.

However, I am not suggesting that the trade-off between time and space should be judged in a directly proportional (or even linear) fashion. For example, the 100MHz system above might be allowed only 3.8 times as long as the 50MHz system with half the memory. Most important is to consider the computational power brought to bear, not just the unqualified error rate.

> ARPA should have adopted such realistic measures long ago.

With so many independent factors not held constant, the ARPA benchmark numbers clearly have no discernible meaning. This mistake has cost the American taxpayer (and the financial supporters of foreign labs submitting to the flawed ARPA benchmarks) untold millions in wasted research funding. Those responsible for this oversight should be held accountable for their negligence.

Sincerely,
:James Salsman
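A minimal sketch of the "giga-core ticks" accounting proposed above, taking the simplest linear reading -- the budget is RAM held multiplied by instructions executed, with clock rate standing in for instruction rate -- which reproduces the factor-of-four example (the budget value itself is arbitrary):

    def allowed_seconds(budget_gct, ram_mb, clock_mhz):
        """Wall-clock time a system may spend under a fixed core-tick budget.

        One core tick = 1MB of RAM held for one machine instruction, so a
        fixed budget buys fewer instructions the more RAM is in use, and
        fewer seconds the faster the clock burns through instructions.
        """
        instructions = budget_gct * 1e9 / ram_mb  # instructions the budget buys
        return instructions / (clock_mhz * 1e6)   # seconds to execute them

    budget = 100.0  # arbitrary illustrative budget, in giga-core ticks
    slow = allowed_seconds(budget, ram_mb=16, clock_mhz=50)
    fast = allowed_seconds(budget, ram_mb=32, clock_mhz=100)
    print(slow / fast)  # 4.0 -- the 50MHz/16MB system gets four times as long

The post immediately qualifies this: a strictly linear RAM-times-cycles trade-off is only a first cut, and the choice of coefficient is taken up again later in the thread.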
Re: ARPA speech recognition benchmarks are very bogus!
Author: bovik@eecs.nwu.edu
Date: Sat, 08 Apr 1995 00:00
33 lines
1241 bytes
In article <950408212009.213AACUW.malc@daneel>,
mmalcolm Crawford <m.crawford@dcs.shef.ac.uk> wrote:
>
> ... There are currently no commercial products in the same league as
> the best research systems *in terms of raw error rates*.

There are currently no commercial products in the same league as the best research systems *in terms of raw computational power*.

> Why should the time spent by the ASR system to perform the
> recognition task be such a significant factor in the first place?

Because in more time an identical algorithm can do more work.

> There are a number of application domains where getting better
> recognition slowly may be preferable to getting a sloppy result
> immediately -- offline transcription of interviews, broadcasts
> etc. for example.

Agreed; however, this is absolutely irrelevant. If you are transcribing 10 hours of interviews and broadcasts each day, do you really want to use a research system which will take 1,000 hours on your most expensive workstations? You would need 42 of them just to keep up!

A benchmark which limits the number of machine cycles/RAM would be a better arbiter of off-line transcription systems, as well as real-time dictation systems.

Sincerely,
:James Salsman
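The 42-workstation figure is plain throughput arithmetic; a minimal sketch using the numbers from the post:

    import math

    audio_hours_per_day = 10   # interviews and broadcasts to transcribe daily
    realtime_factor = 100      # the research system runs at 100x real-time

    machine_hours_needed = audio_hours_per_day * realtime_factor  # 1,000
    workstations = math.ceil(machine_hours_needed / 24)  # each runs 24h/day
    print(workstations)  # 42 -- just to keep pace with each day's audio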
Re: ARPA speech recognition benchmarks are very bogus!
Author: nickm@netaxs.com
Date: Sun, 09 Apr 1995 00:00
13 lines
444 bytes
ARPA is correct in ignoring response time. Continuous speech recognition is so bad right now that every effort should be made to improve its accuracy.

My main complaint about the benchmarks is that they don't really present real-world scenarios. If exactly the same text used in the WSJ task were recorded using a different microphone, or in a room with an echo, or heavy background noise, ALL of the participants would fail miserably.
Re: ARPA speech recognition benchmarks are very bogus!
Author: mmalcolm Crawford
Date: Sun, 09 Apr 1995 00:00
101 lines
4338 bytes
James Salsman <bovik@eecs.nwu.edu> wrote:
> The purpose of my illustration was to highlight the extremely
> poor experimental design of the benchmarks. I would be very
> interested to read any defense of the utility of the benchmarks as
> they are now.

It is usually argued that the benefit of the benchmarks is to focus tax dollars into those research establishments which hold most hope of success (I do not necessarily believe in this philosophy).

> I am not interested in attacks on my hypothetical illustration.

You don't seem to be interested in anything other than your own blinkered and unsubstantiated opinion. The very fact that your illustration was hypothetical is in itself telling: Phil Woodland's summary explained quite clearly that any "commercial" systems attempting to perform the task had hardware requirements well in excess of those found on most desktops (and comparable to those of academic research systems).

Furthermore, your hypothetical illustration is the basis of much of your grievance. If instead we consider the following:

  System A, from a commercial automatic dictation system vendor, scores
  a 50% word error rate running IN REAL TIME on a 486/66 PC with 16MB
  of RAM and an off-the-shelf sound card.

  System B, from a government-funded research laboratory (e.g. the
  Hochberg, Renals & Robinson ABBOT system), scores a 10% word error
  rate on the same task. However, system B is running at FIVE times
  real-time and requires a 100MHz Pentium workstation with 64MB of RAM
  and an off-the-shelf sound card.

as this would be closer to the truth, then your argument looks very weak indeed. You've put up a straw man; it deserves to be burned down.

---

> However, I am not suggesting that the trade-off between time and
> space should be judged in a directly proportional (or even linear)
> fashion. For example, the 100MHz system above might be allowed
> only 3.8 times as long as the 50MHz system with half the memory.

So, come up with a trade-off which will be entirely fair, and agreeable to all participants... should be easy enough... Not.

---

> In ignoring so many factors, the benchmarks are virtually meaningless.
> What's to keep someone from letting their system run all night just
> to search a few more plies, or adding an extra 500MB of RAM so they
> can keep bigger lookup tables?

You might be interested to know that one development system used by IBM (they're a commercial organisation, you know) uses more than 1GB RAM. Maybe you should write to Dr. Jelinek and tell him to cut his memory usage...

---

>> 8. It's an interesting research area to explore the algorithms for making
>> systems real-time without too much impact on performance.
>
> Indeed it is! And on affordable hardware.

If academic institutions were developing systems which could be sold in the marketplace, you would be complaining of unfair competition. It is people such as yourself, however, who are ensuring that this will be happening real-soon-now.

---

> With so many independent factors not held constant, the ARPA benchmark
> numbers clearly have no discernible meaning. This mistake has cost
> the American taxpayer (and the financial supporters of foreign labs
> submitting to the flawed ARPA benchmarks) untold millions in wasted
> research funding.

You still haven't explained why this is the case. What evidence do we have that this is so?

> Those responsible for this oversight should be held accountable
> for their negligence.

Those responsible for the development of the current suite should be commended for their foresight in not restricting science to the development of what would be of immediate commercial interest, but rather expanding its horizons to what may be of benefit a few (or many) years from now. Would it have made sense five years ago to have constrained all research systems to running on 80286 machines in 1MB RAM? Does it not make more sense to show what will be possible as computing power on the desktop increases, and leave it to commercial enterprises to (try to) engineer marketable solutions?

Directors of businesses who do not ensure full exploitation of the advances made possible through world-class, ground-breaking research carried out in academic institutions should be held accountable by their shareholders.

Malcolm
=personal opinions
Re: ARPA speech recognition benchmarks are very bogus!
Author: spp@bob.eecs.ber
Date: Sun, 09 Apr 1995 00:00
13 lines
310 bytes
nickm@netaxs.com (nickm) writes:
> ARPA is correct in ignoring response time.

I will ask again: could somebody please document this claim that ARPA funds projects without considering response time / compute power, etc., before we go further down the path of debating the ethics of such a position?

Steve
Re: ARPA speech recognition benchmarks are very bogus!
Author: bovik@eecs.nwu.e
Date: Mon, 10 Apr 1995 00:00
53 lines
2086 bytes
In article <950409181430.213AACUa.malc@daneel>,
mmalcolm Crawford <m.crawford@dcs.shef.ac.uk> wrote:
>
> You've put up a straw man; it deserves to be burned down.

I agree that my first illustration was badly flawed. Don't you agree that my second illustration (two identical systems running for different durations achieving vastly different benchmark scores) fully highlights the flaws in the ARPA benchmark scheme?

>> However, I am not suggesting that the trade-off between time and
>> space should be judged in a directly proportional (or even linear)
>> fashion. For example, the 100MHz system above might be allowed
>> only 3.8 times as long as the 50MHz system with half the memory.
>
> So, come up with a trade-off which will be entirely fair, and agreeable to
> all participants... should be easy enough...

Agreed. Here is a proposal: use a memory-to-cycles trade-off coefficient equal to the median of votes cast by all evaluation participants.

> If academic institutions were developing systems which could be sold
> in the marketplace, you would be complaining of unfair competition.

Oh? Perhaps you do not fully understand my motivations.

> It is people such as yourself, however, who are ensuring that this
> will be happening real-soon-now.

Believe me, I can live with that.

> Would it have made sense five years ago to have constrained all
> research systems to running on 80286 machines in 1MB RAM?

Of course not; talk about straw men! It always makes sense to employ principles of sound experimental design, holding independent factors constant when drawing conclusions from a single dependent variable (the word error rate, in this case).

> Directors of businesses who do not ensure full exploitation of the
> advances made possible through world-class, ground-breaking research
> carried out in academic institutions should be held accountable by
> their shareholders.

Any director basing a business decision only on ARPA benchmark numbers, and not on the computational situation behind them, would clearly be negligent.

Sincerely,
:James Salsman
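A minimal sketch of the voting proposal, assuming each evaluation site submits a single number for the memory-to-cycles coefficient (the post leaves the mechanics unspecified, and the votes below are invented for illustration):

    import statistics

    # Hypothetical coefficient votes from participating sites, e.g.
    # "1MB of RAM is worth this many megacycles per second".
    votes = [0.5, 0.8, 1.0, 1.0, 1.3, 2.0, 5.0]

    coefficient = statistics.median(votes)
    print(coefficient)  # 1.0

Taking the median rather than the mean keeps any single site from dragging the coefficient toward a value that happens to favour its own hardware.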
Re: ARPA speech recognition benchmarks are very bogus!
Author: ajr@compute.demon.co.uk
Date: Mon, 10 Apr 1995 00:00
27 lines
1125 bytes
In article <3m8s34$1dp@netaxs.com> nickm@netaxs.com (nickm) writes:
> ARPA is correct in ignoring response time. Continuous speech recognition
> is so bad right now that every effort should be made to improve its
> accuracy.
>
> My main complaint about the benchmarks is that they don't really
> present real-world scenarios. If exactly the same text used in the
> WSJ task were recorded using a different microphone, or in a room with
> an echo, or heavy background noise, ALL of the participants would
> fail miserably.

For the last few years ARPA has adopted a "Hub and Spoke" paradigm. The "Hub" last year was transcription of unlimited-vocabulary read text from North American business news. The "Spokes" included speaker adaptation, domain adaptation, background noise, microphone independence, telephone speech, etc. The evaluations were carefully thought out to be both challenging and informative on the "clean" case and also extend this towards "real-world" conditions. This is a useful framework as it factors the problem into roughly independent tasks and so allows more rapid progress.

Tony Robinson
Re: ARPA speech recognition benchmarks are very bogus! Not.
Author: James Salsman
Date: Fri, 05 May 1995 00:00
46 lines
1955 bytes
Regarding the fact that the ARPA automatic speech recognition benchmarks have been strongly unscientific, in that many of the contributing factors to the benchmark results were left uncontrolled, in article <AJR.95Apr11062211@compute.demon.co.uk>, Tony Robinson <ajr@compute.demon.co.uk> suggests a solution:

> The correct way to implement this within the current framework is to
> create a new sub-task (a new spoke). Let's call this a 'resource
> limited' sub-task. A reasonable set up would be to make the speech
> recognition problem the same as the central task (the hub) -- that is,
> unlimited vocabulary recognition. This way, every site that completes
> both tasks contributes by:
>
> * Improving basic speech recognition disregarding resource limits
>
> * Developing techniques for porting systems to resource-bound platforms
>
> * Providing data points for relative loss in accuracy to help estimate
>   the loss in performance of other systems were they also ported
>
> This approach lies within the ARPA CSR framework used in Nov 1993 and
> Nov 1994. It is conceivable that something like this will be implemented
> in 1995.

After almost a month of careful consideration, I report that I fully approve of phasing in scientific experimental design as a "spoke" off the "hub," if that is what it will take to get good experimental design instituted.

Such a plan of action will considerably reduce the legal liability of many researchers, not just in the United States. It is clear to me that there are currently grounds for at least a few such negligence and unscientific conduct lawsuits. However, I am less familiar with the statutes of limitations in the International Court of Justice, so current non-U.S. liability for participation in improperly controlled ARPA benchmarks is less certain. However, if the uncontrolled experiments continue, there will be no question.

Sincerely,
:James Salsman

May Peace Prevail on Earth!