Saturday, 18 May 2019

BLIND TEST Results Part 3: "Do digital audio players sound different playing 16/44.1 music?" - Listener Results.


Thanks for the patience everyone. We are now into Part 3 of the report on the Internet Blind Test on the audibility of 16/44.1 digital playback using various devices. In Part 1 we talked about the test procedure itself and unblinded the devices (ASRock Z77 Extreme4 motherboard, Apple iPhone 6, Oppo UDP-205 as ethernet DAC, and Sony SCD-CE775 playing a burned CD-R).

Last week in Part 2, we reviewed the objective measurements of the 4 devices. I hope the readership recognizes the importance of doing this to set the context for what we're looking at this time as we dive into the results from the blind test respondents. As with many things in life, it is only by having the facts at our disposal first that we can make comparisons and develop ideas based on this foundation of knowledge.

Part I: Demographics of the Respondents

As I mentioned in Part 1, in total, I received 101 unique responses to this survey (this is the total number after removing 4 erroneous or duplicate submissions). Of the 101, it was almost universally men who responded with a ratio of 100 men : 1 woman. This is not uncommon; in fact when I looked back at the "24-Bit vs. 16-Bit Audio Test" in 2014, of 140 responses, there were only 2 women.

How old were those who responded?


I suspect nobody's surprised by that "normal" looking curve of audiophile ages. The majority of respondents for these blind tests over the years are from 41-60. For the most part, audiophiles tend to be older, which I believe is normal, especially those interested in "high end" products. I suspect only those who are comfortable with playing hi-res files would bother to participate, so this is perhaps a more representative curve of "computer audiophiles" specifically.

Where were these respondents from?


Wow, that's great participation from Europe with more than 60%! About 30% from North America, and 5% each from Asia and Australia/Oceania. Nice to see the international effort :-).

What did the respondents use? Headphones or loudspeakers or both?


That's quite close between the "head-fi" users and those using loudspeakers alone to evaluate. Hats off to the 21 who listened with both headphones and speakers; I think it's a nice reflection of the effort put in!

Approximately how expensive were the systems used by the respondents?

Playback systems in the US$1000-2000 range were the most common. A good sized proportion - 41% - of the respondents indicated that they spent >US$5000 on the system used.

Looking into the data set, the "high end" at >US$100,000 consisted of a system based on the Benchmark DAC3 (~US$2000), Meridian 818v3 "Audio Core" (~US$16,000) driving Meridian DSP8000SE (>US$60k) active flagship speakers. Furthermore, this tester used a Roon Nucleus server and microRendu streamer. He also listened with the Benchmark's HPA4 (~US$3,000) headphone amp with Audeze LCD-4z headphones (~US$4,000). Nice!

An example of the "low end" of the price range is a respondent who used a FiiO X1 (capable of up to 24/192) and Sennheiser PX 100-II, and another respondent with what looks like a similar type of source (not specified) with Audio-Technica headphones.

In between we have all kinds of gear. Here's most of what the respondents used (somewhat in the order of the submission, apologies if I missed a few devices here and there):

Speakers:  ATC SCM40A, Joseph Audio Pulsar, Dynaudio 62, Linkwitz Orion active, Piega Coax 30.2, DIY transmission line, Tannoy DC10ti, Omega Alnico, PMC IB2, Infinity Renaissance 90, Elac Debut B4, Definitive Studio SM45, Magnepan MMG, Impulse Model 24, Totem 100, PSB Alpha PS1 + SubSeries 100, Elac FS407, PMC Fact.8, Harbeth M40, Vivid Giya G3, ProAc D40/R, Amphion Krypton3, Dali Zensor 1, Linkwitz LX-Mini, Quad ESL 63, JBL LSR305, Focal Chorus 726, Rogers LS3/5A, Linkwitz LXmini+2, Harbeth P3ESR, KEF LS50, Magico S3 MkII, KEF Blade, Ino Audio piP, Dynaudio Special Forty, System Audio Pandion 20, JK Acoustics Optima IV, Zaph ZRT 2.5, Wavecor Facette, Goldenear Triton 5, Genelec 7070A sub, Amphion One18, Martin Logan Vista + SVS PC-Ultra subs, Kreisel Sound Quattro Cinema, Focal Electra 1028 Be, AVI DM12, Martin Logan ESL, Usher Dancer Mini 2, Magico V3, Verity Sarastro II, Paradigm Persona 5F

Headphones: 1More Quad IEM, AKG K701, Sennheiser HD650, Stax SR202, Ultimate Ears Reference IEM, Beyerdynamic DT1350, Oppo PM3, Audio-Technica ATH-M50x, B&W P3, Klipsch Heritage HP-3, Bose Quiet Comfort 2, Denon AH-D600, AKG K271, Sennheiser HD280Pro, MEE Pinnacle P1, AKG K518LE, FiiO F5, Sony MDR-Z900, Sennheiser HD800, AKG K702, NAD Viso HP50, MrSpeakers Aeon, Etymotic HF5, Sennheiser HD380, HifiMan HE-400i, Sennheiser Momentum 2, LZ Audio A4, Beyerdynamic DT 1350 Pro, FiiO F9 Pro, Beyerdynamic DT 770 Pro, Mad Dog modified Fostex T50rp, Beyerdynamic DT 1990 Pro, Sonus Faber Pryma, Shure SRH440, Sennheiser HD700

DACs: RME ADI-2 DAC, Oppo BDP-105D, Monarchy M24, Allo DigiOne with Pi, Oppo UDP-205, Aune X1S, Denafrips Terminator, TEAC UD-503, Schiit Yggdrasil, BlueSound Node2, Audio-GD NFB 11.28, iFi nano iOne, Mytek Liberty, Yamaha WXC-50, Linn Akurate DSM, Benchmark DAC2 HCG, Objective DAC, Lynx Hilo, Topping D50, Holo Audio Spring DAC L2, RME ADI-2 Pro FS, Light Harmonic Vi DAC, Slimdevices/Logitech Transporter, Cambridge Azur 851D, Tascam US-2x2, Berkeley Alpha DAC 2, T+A DAC 8

Amps: Devialet 120 Expert, ATI 6012, Pass Aleph 3, TACT, NAD M32 with Bluesound MDC module, Bryston 4B SST2, Bow Technologies Walrus, Parasound A23, Yamaha RX-V2092, "vintage" Classe, Cambridge Audio CXR120, Cyrus 8 XPd, Marantz PM6006, Pioneer SC-LX79, Devialet 250 Pro, Devialet D440 Expert Pro, Benchmark AHB-2, Cambridge A1, B&W AV5000, Quad 606, Wadia 151 PowerDAC, Pioneer Elite VSX-30, Hypex UcD180 modules, Peachtree nova150, Gainclone amp, Hypex NC252MP module, XTZ A2-300, Devialet Expert 200, Devialet Expert 1000Pro, Simaudio Moon 240i, NAD C390DD2, JK Acoustics Active 65, NAD M22 V2, Denon X7200, Q-Watt DIY, Red Dragon S500, DA&T A38, Ayre MX-R Twenty, Coincident Dragon 211P

Headphone Amps/DAPs: Oppo HA-2, Stax SRM212, Geek Out 1000, Topping NX4 DSD, Naim DAC-V1, iFi nano DSD, Dragonfly V1.2, Behringer U-Phoria UMC204HD, Schiit Fulla 2, SMSL iDEA, iFi xDSD, Chord Hugo, Pono Player, FiiO Q1MkII, iFi iDAC, Focusrite Scarlett, FiiO X1, Schiit Jotunheim, Chord Mojo, Sennheiser HDV 820

Preamps/Others: MiniDSP DDRC-22D, Lyngdorf DPA-1 preamp, Bow Technologies Warlock, miniDSP 4x10HD, Quad 34, Hypex DLCP, Daphile music server, Pink Faun 2.16 streamer, Chromecast Audio digital out, RME DigiFace, iFi iSilencer 3.0, JK Acoustics Reference PreAmp, Marantz AV 8003, iFi iTube2, Ayre KX-R

Whew! What a list... As you can see, some of the devices are DACs/pre-amps/amps so I just listed them in the most relevant category based on the system description. Notice that the list includes quite a variety of gear ranging from vintage to modern HiFi, commercial products and DIY projects, well known and esoteric brands... Most importantly, I think this is a nice cross-section of the devices "real people" in the audiophile world use, at least the guys (and gal) who are interested in audiophile technical discussions and participate in testing on a blog like this!

Going through each response and reviewing the equipment list allowed me to check that the entries were complete and that the gear looked reasonable for the hi-res (24/96 playback) demands of this blind test. This also gave me a sense of the lengths many went through in performing the listening test as well as the caliber of the audio systems used! Some of you are clearly adept in DIY audio, constructing devices from Hypex amp modules, Linkwitz speakers, and I see custom speakers with Scan Speak drivers and such. Some described their treated and custom sound rooms. Some tried listening with and without DSP room correction. Some of you used ABX testing and other blind-test software. The impression I get is that overwhelmingly this is a group of audiophiles who know what they're doing regardless of the price tag listed. Thank you for doing the "work"!

Part II: Was it easy to hear a difference? What device(s) did the listeners prefer?

This section gets to the heart of the questions being asked. Let me do this in a bit of a narrative fashion to walk you through the main analysis and sub-samples examined. First, let's just talk about all 101 respondents...

A. ALL Respondents (n=101)
Looking at the full sample, we can start answering the question:

Was it easy to hear a difference between the devices?



I think it's clear from the graphs above that for most listeners, the audible differences between the devices were small; it was not an "easy" test. Keep this in mind as we go through all the subgroups! The first graph asked if respondents could subjectively quantify the difference between what was thought to be the "best" device and the "worst". As you can see, about 20% thought there was either a "huge" or "big" difference, while 80% thought at best the difference was "small". Within that 80%, the majority (58% of all respondents) felt it either wasn't worth spending money on an upgrade or felt they heard no difference at all.

An even harder question was whether the "best" and "second best" devices sounded different as shown in the second graph. The reason I did this was because the price difference, use of XLR cabling, and the lack of post-digitization volume correction with the Oppo UDP-205 test samples all in theory might have separated the Oppo from the rest of the devices, thus potentially creating a significant gap between "best" and "second best" sounding devices. Hypothetically, if this happened, then this second question's results could have been similar to the first graph; those who thought the difference was "huge" or "big" might have detected that most of the difference was between the Oppo and everything else. As you can see, there did not appear to be any special ability to differentiate between "best" and "second best". Only 3% thought there was a "big" difference. Only 14% thought the difference was small but worth spending money to achieve an upgrade. And a large majority of 84% of respondents thought that either there was no noticeable difference or even if present, "not worth money to upgrade".

So what devices did respondents prefer?

Without any filtering of responses or looking at sub-samples in the 101 respondents, this was what the result looked like:


This graph is the average score if we were to assign the number 1 as "best" and 4 as "worst" for each device "voted" on by the respondents. Therefore a lower score means that on average more listeners ranked the device as sounding "better". Notice that Device B (iPhone 6) on the whole scored best followed by Device A (ASRock motherboard)!
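For anyone who wants to reproduce this kind of scoring, here is a minimal Python sketch of the averaging described above. The response strings are made up purely for illustration (they are not the survey data), and the function name `average_ranks` is my own invention for this example.

```python
from statistics import mean

DEVICES = ["A", "B", "C", "D"]

def average_ranks(responses):
    """Return {device: mean rank}; each response lists devices from best to worst."""
    ranks = {d: [] for d in DEVICES}
    for order in responses:
        # Position 1 is "best", position 4 is "worst".
        for position, device in enumerate(order, start=1):
            ranks[device].append(position)
    return {d: mean(r) for d, r in ranks.items()}

# Hypothetical responses, NOT the real survey submissions.
example = ["ABCD", "DCBA", "CDBA", "BDCA"]
scores = average_ranks(example)
for device in DEVICES:
    print(device, scores[device])
```

A lower number means the device was ranked "better" on average, and the four averages always sum to 10 because every respondent hands out the ranks 1 through 4 exactly once.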

Let's not get too excited about this and crown Apple the winner just yet :-). Remember that often we do need to look deeper into the numbers to discern what's actually going on... Instead of just averaging things out, let's actually count the number of votes and look at the preference pattern for each device:



Isn't that interesting? For each device, the largest number of "votes" was in the order of presentation! Many respondents, specifically the ones who thought there was "no noticeable difference" simply voted "A-B-C-D" to create this pattern. In fact, for those who thought there was "no noticeable difference", the "A-B-C-D" pattern of response from best to worst accounted for almost 80% of those votes. This is basically "noise" that needs to be filtered out if we are to hopefully understand the true preferences of those who felt they could hear a difference.
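The vote counting and the filtering of "no noticeable difference" responses can be sketched along these lines. This is only an illustration under assumptions: the dictionary keys (`order`, `difference`) and the sample records are hypothetical and do not reflect how the actual survey data was stored.

```python
from collections import Counter

def tally_votes(responses):
    """Count how often each device was placed at each rank (1 = best, 4 = worst)."""
    counts = {d: Counter() for d in "ABCD"}
    for r in responses:
        for position, device in enumerate(r["order"], start=1):
            counts[device][position] += 1
    return counts

def heard_a_difference(responses):
    """Drop respondents who reported no noticeable difference."""
    return [r for r in responses if r["difference"] != "none"]

# Hypothetical records, NOT the real survey submissions.
example = [
    {"order": "ABCD", "difference": "none"},
    {"order": "ABCD", "difference": "none"},
    {"order": "DCBA", "difference": "small"},
    {"order": "CDBA", "difference": "big"},
]

filtered = heard_a_difference(example)
print(len(filtered), "respondents remain after filtering")
print("Device A votes by rank:", dict(tally_votes(filtered)["A"]))
```

In this toy data the two "A-B-C-D" votes put Device A in first place before filtering, and once they are removed Device A only has "worst" votes left, which is the kind of shift the filtering is meant to expose.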

B. Respondents who reported hearing a difference (n=73)

If we filter out the "no noticeable difference" group (almost 80% of whom simply voted "A-B-C-D" as mentioned above), the total number of respondents to analyze goes down to 73, and here are the average scores:

Aha! As expected, this is a substantial change. We now see a shift towards Devices D and C (Sony SACD player and Oppo UDP-205) as preferred over B and A (iPhone 6 and ASRock motherboard).

The average scores can now be expanded in the same way as above to show preference patterns for each device. To keep it simple, if we assume that this is all random, with 73 raters, distributed 4 ways (best to worst) for each device, we would predict an average of 18.25 "votes" for each level of preference. Statistically, we can run a simple chi-square (χ²) goodness-of-fit test with 3 degrees of freedom (4 ranks for each device, minus 1) and compare the respondent preferences versus the "null hypothesis" of a purely random distribution.


Based on the usual threshold of p<0.05 for significance, the pattern of preference shown for Device A (the ASRock motherboard) is indeed significant! What it suggests is that the blind test respondents ranked this device as "worst" to a significant degree.

In comparison, none of the other devices had a pattern that significantly deviated from the random "null hypothesis". However if you examine the distributions, we see that both the Oppo and Sony SACD players had fewer people ranking them as the "worst" sounding devices which is why on average they scored better than the iPhone 6. In this group of 73, there did appear to be a trend with preference for the sound of the old Sony SCD-CE775 player but the pattern was not statistically significant.
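For those curious about the mechanics, the chi-square comparison against a purely random distribution can be sketched as follows. The vote counts here are hypothetical numbers that merely sum to 73; they are not the counts from the graphs. To avoid extra libraries, the p-value uses the closed-form upper-tail probability that exists for exactly 3 degrees of freedom (a general-purpose statistics package would be the usual choice).

```python
import math

def chi_square_uniform(observed):
    """Chi-square statistic versus an even split of the votes across all ranks."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

def p_value_df3(stat):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(stat / 2))
            + math.sqrt(2 * stat / math.pi) * math.exp(-stat / 2))

# Hypothetical votes for one device at ranks 1 (best) to 4 (worst); NOT the survey data.
votes = [12, 15, 18, 28]

stat = chi_square_uniform(votes)
p = p_value_df3(stat)
print(f"expected per rank = {sum(votes) / len(votes):.2f}")
print(f"chi-square = {stat:.2f}, p = {p:.3f}")
print("significant at p<0.05" if p < 0.05 else "not significant at p<0.05")
```

With 3 degrees of freedom, the statistic has to exceed roughly 7.81 for the pattern to pass the p<0.05 threshold.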

C. What can we say about other subgroups?

Did the older groups (41+ years) compared to younger age groups (<41 years) have different preferences?

If we look again at the complete data set of 101 respondents and pick out just the "younger" folks <41 years old, here's what their demographics look like (there were only 24 respondents in this "younger" age group):


We can compare this to the "older" group of 41+ who were more numerous (n=77):


Although the sample size for the "younger" testers is smaller, the results do support the impressions we might have as audiophiles that the younger folks are more inclined to be listening with headphones, generally have less expensive systems (many in this age group used systems in the US$200-500 range), and interestingly felt the Joe Satriani "Crowd Chant" (more "modern" production, lower DR sound from 2006) was more resolving of differences if they had to choose one of the tracks.

For the "older" age groups 41+ (I would be included in this category), "we" tended to listen to our music through speakers, had more pricey sound systems (largest number of systems in the US$1000-2000 range), and more respondents thought the Maxi Priest "Wild World" track from 1987 provided better sonic differentiation. The "older" age group thought the least of the Joe Satriani track as audibly different between devices.

As for the magnitude of audible difference heard, both the younger and older subgroups had ~60% of respondents saying the difference was either unnoticeable or too small of a difference to spend money for an upgrade between the "best" and "worst" sounding devices.

As above, if we now filter out the listeners who were unable to hear a difference (and their tendency to vote "A-B-C-D"), how did the younger and older subgroups rank the devices?


While both the younger and older subgroups were able to rank Device A (ASRock computer motherboard) as lowest quality, the older subgroup as a whole ranked Device C (Oppo UDP-205) as being "best" followed by the Sony SACD player and then the iPhone 6!

What's interesting is that the younger group ranked the iPhone 6 as being tied as "best" with the Sony player. In particular, it's the "30 somethings" who selected the iPhone as "best". Perhaps it's tempting to think that younger folks are more "used" to the sound of ubiquitous devices like our cell phones? With only 18 "younger" respondents who thought there was a difference in sound, the number is too small for statistical significance. A finding to keep in mind though for future consideration and further testing perhaps.

As above, we can look deeper at the 41+ "older" subgroup with 55 respondents and examine their preferences in greater detail:


Despite a reduction in the total number of respondents from 73 to 55, the motherboard's pattern of being rated as "worst" came very close to the p<0.05 threshold (p=0.055), though strictly speaking it did not reach it. With this subsample, the Oppo UDP-205 did quite well, with most respondents ranking it consistently as "best" or "second best" and few thinking it sounded "worst".

How did the musicians, audio engineers, and audio reviewers/writers rank the devices?

Remember that I had asked the respondents whether they have musical performance experience, had formal training in audio engineering / sound evaluation, and if they published audio equipment reviews. It's complicated in that there is some overlap between these groups as you might imagine; about half of the audio engineers and audio equipment reviewers also had musical performance background. Let's keep this simple and examine each subgroup separately despite the understood overlap.

Once we filtered out the "no noticeable difference" group, here's how they ranked the devices:


Not bad. All 3 subgroups were able to rank Device A (ASRock motherboard) as the "worst", again consistent with objective testing expectations. I am impressed that the "musicians" as a subgroup had a particularly strong tendency to vote the motherboard as sounding "worst"! Here's their breakdown of preferences:


Despite the smaller number, it's obvious that the musicians had a strong distaste for the ASRock motherboard, to the point where it's clearly statistically significant.

For all 3 subgroups consisting of listeners with extra expertise in audio and music, both the Oppo UDP-205 and Sony SACD player again beat out the iPhone 6 and took turns at being "best" sounding.

Remember though that this is after excluding those who could not hear a difference. Of all the engineers/trained listeners, 85% felt they could hear a difference of any magnitude. Of the musicians, 66% thought there was an audible difference. And finally, of the equipment reviewers, 64% felt there was a difference.

I wondered which music track each subgroup thought was best to hear a difference with (when they reported hearing a difference of course):


Whereas the audio equipment reviewers and audio engineers were more agnostic as to which track was best, the musicians focused on the ones with more dynamic range, especially the Stephen Layton & Britten Sinfonia "Handel Messiah" track (DR15). Again, an interesting contrast with the "younger" <41 y.o. group above who used the Joe Satriani track (DR9) more and picked the iPhone 6 as equivalent to the Sony as the "best" sounding device.

Let's examine one more subgroup...

How did those with more expensive audio systems fare? Let's focus on the groups using >US$10,000 worth of gear...

Remember that we have multiple overlapping variables here. Keep in mind the age correlation for example:


As expected, those with "higher end", more expensive systems are generally older and in this test, there was a large cohort in the 51-60 age group using US$10k+ of equipment for this test. This group was 100% male. While not shown here, 84% of the respondents in this group used speakers, while 16% used both speakers and headphones. It would be rather unusual to have just a headphone system cost over $10k although given today's prices, far from impossible!

Within this group, 28% could not hear a difference and another 24% thought any difference was so slight that there was no point paying to upgrade. So even with rather expensive sound systems, 52% were not reporting significant differences in the sound. Also, notice that none of those using >US$10,000 worth of gear for this test thought the difference was "huge".

If we now take out the 28% who reported not hearing a difference, this is the device preference ranking (n=18):

Remember: lower average number means ranked as "better".
So the guys with more expensive systems listed from best to worst: Oppo UDP-205 > Sony SACD player > iPhone 6 > ASRock motherboard. That certainly correlates nicely with the objective test results last week (which raises the question, why are some purely subjective audiophiles so afraid of blind tests?).

This ranking is similar to the 41+ "older" group (again, remember the overlap between age and price of sound system). Though subtle, this "more expensive system" subgroup's average score for the Oppo was slightly lower ("better") and the ASRock motherboard slightly higher ("worse") than those in the 41+ age group; in other words, the "spread" widened. While I would not be able to make a case for statistical significance, this is at least consistent with the idea that better equipment might provide better resolution to make differences more noticeable. Alternatively, maybe the "serious" audiophiles who buy expensive equipment were more attentive or more capable of teasing out the sonic differences.

Part III: Discussions

There are more subgroups I can look into but I think the above captures the core of the data set reasonably completely. With 101 total respondents, once we get into these subgroups and filter out the individuals who indicated that they could not hear a difference, numbers do drop quickly which makes it very difficult to interpret significance.

What are the take-home messages from this blind test?

1. I think these 101 respondents provided an interesting demographic glimpse into what I suspect are a rather "serious" bunch of audiophiles ;-).

I don't think it's too bold to suggest that audiophiles/music lovers reading this blog are likely better educated about computer technology for handling the downloaded files and digital playback than the average audiophile. So while some demographic factors might be skewed, I think we're seeing the results from a discerning group of respondents interested in high-quality audio reproduction. Other than a small number of submissions that had to be removed or duplicate submissions corrected, the respondents answered all questions completely and appropriately.

As a publicly distributed test, there are limitations due to the lack of controls that one might have in a formal "lab" situation. We can't be sure the hardware is set up properly or is adequate for 24/96 playback, can't confirm that drivers are bit-perfect or that the files transferred without error, can't check the auditory acuity of the listeners, and can't rule out that the software inadvertently altered the sound... However, respondents listening in the comforts of their own homes and using their own equipment allowed for a level of familiarity and intimacy one would not be able to replicate in an artificial test site. Despite the limitations, I think there's value in this kind of "naturalistic" distributed blind test which can give us an idea of how things sound "in the wild" and with "real audiophiles".

2. On the whole, despite the expected disparity in sound quality between a computer motherboard, Apple iPhone 6 headphone output, Oppo UDP-205 (with current "flagship" ES9038Pro DAC) XLR output, and a Sony SACD player with RCA out, the results suggest that audible differences from 16/44.1 playback are not easily heard.

About 60% of audiophiles in this sample of 101 either did not report hearing a difference or did not think the difference was worth spending money on an upgrade. Only 20% thought there was a large difference.

I referred to the Steve Hoffman Forum poll in Part 1 which started me thinking about doing this blind test. Clearly from the results I've presented here, whether "CD players" or more generally 16/44.1 digital players have (audibly) significant "sonic signatures" is not a question that can be answered in a binary "yes" or "no" fashion. Objectively, from last week's post we can easily see that there indeed is a different "signature" to each device with various noise levels, distortion amounts, crosstalk differences, jitter, etc... But by the time we ask people to listen, it's no longer so obvious, with factors like nonuniform hearing acuity, age, experience, musical preference and equipment used influencing the final result.

3. Despite small differences reported by many, the data did suggest a significant ability for those who heard a difference to identify the computer motherboard as sounding "worst"; this is consistent with the objective measurements.

However, the Apple iPhone 6, Oppo UDP-205, and Sony SCD-CE775 SACD players were essentially equivalent with no statistical advantage between them although there were some trends depending on the subgroups. This suggests that there is such a thing as a point of diminishing returns, a "threshold" in quality (and probably price) beyond which it's unlikely that listeners would be able to differentiate between reasonably competent devices.

Where exactly that "threshold" lies is of course up for debate and would depend on the listener him/herself, quality of equipment, and perhaps listening experience as suggested by the subgroups. Looking at the objective results from last week, my feeling is that the most likely audible difference is the relatively poor noise floor of the ASRock Z77 Extreme4 motherboard with that CPU / GPU / power supply combination, in particular the 60Hz hum with various harmonics and other low frequency noise. If indeed it is the hum, this listening test suggests that the threshold of audibility as a group effect for those who reported hearing a difference is somewhere below -104dBFS and perhaps above the -114dBFS from the Sony SACD player (since many ranked the Sony highly in sound quality). Noise level must also be referenced to the music used. Remember that with more dynamic music, typically with lower average amplitude, one will need to increase the playback volume when listening. This could be why subgroups that focused on more dynamic music (eg. the musicians who preferred the Handel Messiah track with average RMS volume of -20.5dB and DR15) were able to tease out the imperfections of the ASRock motherboard remarkably well. While ideally noise floor should be as low as possible and hum should be absent, it's good to have these numbers in mind as reference when doing objective testing.
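To put those decibel figures in perspective, here is a small Python sketch of the arithmetic. It uses only the three numbers quoted above and assumes the -20.5dB RMS figure is referenced to digital full scale like the other two; the variable and function names are mine, chosen just for this example.

```python
def db_to_amplitude_ratio(db):
    """Convert a level difference in dB to a linear amplitude (voltage) ratio."""
    return 10 ** (db / 20)

# Figures quoted in the text above.
hum_dbfs = -104.0        # approximate hum level attributed to the ASRock motherboard
sony_dbfs = -114.0       # level quoted for the Sony SACD player
messiah_rms_db = -20.5   # average RMS of the Handel Messiah track (assumed to be dBFS)

gap_db = hum_dbfs - sony_dbfs
print(f"Gap between the two levels: {gap_db:.0f} dB "
      f"(about {db_to_amplitude_ratio(gap_db):.2f}x in amplitude)")

hum_below_music = hum_dbfs - messiah_rms_db
print(f"Hum relative to the track's average level: {hum_below_music:.1f} dB")
```

The 10dB window works out to a little over 3x in amplitude, and the hum sits about 83.5dB below the average level of the Messiah track. A louder, more compressed track would have a higher average level, pushing the hum even further below the music at the same listening loudness.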

Remember that the Realtek ALC898 DAC on the ASRock motherboard is one of Realtek's better audio chips so this suggests that less expensive sound chips in budget motherboards like the ALC892 might perform worse if put into a blind test like this. As you can see in last week's comments, it is quite possible that my nVidia GTX 1080 GPU card is a major source of the poor noise floor. On paper at least, newer solutions like the ALC1150 should sound better with lower noise assuming that its other technical qualities are good.

As a reminder, note that "statistical significance" just means that with a large enough number of trials, we can detect that the "roll of the dice" appears to be weighted unevenly; specifically in this blind test, against the ASRock motherboard to a degree that approaches or surpasses a typical 95% confidence (p value of 0.05). Remember this does not imply that any one person necessarily should or even could hear the difference in the same pattern. Also, it doesn't imply that the difference is strong! As sample size increases, the "power" of the study improves at picking out subtle differences as a group effect.

Always remember this issue of actual perceptible magnitude when you look at studies reporting statistical significance and especially when conclusions are drawn using large sample sets such as meta-analyses (consider this one from 2016 combining all kinds of studies looking at hi-res audio).

4. It's interesting that the "younger" audiophiles (<41 y.o.), who were also more likely to be using headphones, ranked the Apple iPhone 6 higher. Although the sample is small, this was the only subgroup that ranked the iPhone quite highly (tied for first place with the Sony SACD/CD player). It is tempting to wonder if there might be some subtle familiarity with the sound itself given the ubiquity of Apple devices these days. Perhaps it's simply the use of headphones. While headphones remove room effects and can improve clarity, the presentation of the sound (eg. the impression of a "soundstage") is different. This difference could affect how listeners evaluated the sound and what qualities listeners paid particular attention to.

Another interesting difference with the "younger" listeners is the music sample they thought was most useful to detect differences in sound between devices. My kids are starting to enjoy pop and rock and the production quality is much different these days (louder, more bass-heavy, "harder", "crunchier", more "synthetic" sounding) than when I was growing up. Maybe this also affected device preferences. The <41y.o. respondents thought a modern mastering like the Joe Satriani "Crowd Chant" was better to hear differences compared to those 41+ preferring to use more dynamic older pop (Maxi Priest) and the multilayered vocals of "Chorus" from Handel Messiah. Again, could this preference affect what was listened for when comparing?

For completeness, here are graphs highlighting this difference in track preference:


Who knows, maybe there's an interdisciplinary dissertation here on studying the effects of modern audio production, audio hardware, human perception, and sociological trends. :-)

In any event, regardless of age, use of headphones or speakers, and musical experience (eg. musicians, audio engineers), each subgroup agreed that the sound from the motherboard was the "worst" as per item (3) above.

5. Unlike previous blind tests where I could identify "golden ear" individuals who scored 100%, this is not that kind of test.

However if we believe that the Oppo UDP-205 "should" be the best device in this field of 4 as shown objectively, then I'd have to give the "Golden Ear Award" to the "older" listeners 41+ years of age as a subgroup.

With 55 in that subgroup reporting hearing a difference, they were able to rank the ASRock motherboard as the "worst" sounding essentially to a statistically significant level, while on the whole suggesting that the Oppo sounded the "best", with the Sony CD player and iPhone in the middle of the pack. This is also the group who used loudspeakers more than headphones, overall spent more money on the sound systems, and felt that more dynamic music helped differentiate the sound.

Also, I must congratulate the musicians in this blind test who reported hearing a difference. With only 19 respondents, you "nailed it" by selecting out the lower quality motherboard DAC. Phenomenal job!

Part IV: Conclusions

Remember that 16-bit PCM and the 44.1kHz sampling rate were not parameters selected out of thin air. Within the limits of technology available at the time, significant research and listening tests helped Sony and Philips develop the Compact Disc and their claim of "Pure, Perfect Sound Forever". These days, after many generations of products with refinements in low-level linearity, lower noise floor, filter optimizations, and jitter reduction, the logical expectation is that 16/44.1 DACs have matured to the point where one has to assume that it should not be easy to detect differences between devices. The idea of "transparency to the digital source" should result in a "common" kind of sound among many devices (especially those with "high fidelity" aspirations) once we equalize listening levels.

Considering the multitude of DACs and players out there, of course some devices can sound vastly different. However, it's also quite likely that such a different-sounding device is either of very low quality (like this) or a company purposely "colored" the sound to differentiate their product (perhaps like this). Needless to say, do not assume that an expensive "high end" DAC/player is necessarily also of "high fidelity" because it sounds different or is subjectively preferred by some!

While opinions are plentiful, unfortunately documented blind tests are precious few. Thanks to vwestlife on the Steve Hoffman Forums, we see this little article from January 1997 in Stereo Review:

BTW - if anyone's looking for a huge catalog of back issues of Stereo Review to remember what audio journalism looked like and what audiophiles discussed back in the day, check it out here!

I think this Internet Blind Test agrees quite nicely with the article above, suggesting that the audible differences between devices are not heard by everyone and, even if they are heard, are more likely than not rather subtle. In 2019, thanks to the progress made in providing digital playback options, we've clearly gone well beyond just using CD players for testing.

Let's extend the discussion here a bit to another area where I think the results of this test are relevant. The audio Industry has been wanting to sell us "hi-res" music. In the process, for years, 16/44.1 has been re-branded as "standard resolution". The inconvenient truth is that the majority of albums (especially in the pop and rock genres) in the "hi-res" catalogue really have no business being anything but 16/44.1. As I expressed previously, most of the time, "hi-res" is more of a marketing tool than a serious attempt at providing better sound quality to albums that could benefit (see here, here and here). Over the years, listening tests have generally not demonstrated good audibility of hi-res over CD quality. This is no surprise in the context of our findings here. In fact, other than a difference in ultrasonic content (which is of course ultrasonic!), it would be easy to show objectively that taking a 24-bit recording and dithering it down to 16 bits would have a smaller effect than the differences recorded between the devices in this blind test as measured last time! If differentiating between an Oppo UDP-205, the Sony SCD-CE775, and the iPhone is already difficult while playing back 16/44.1 music, why should anyone think that even more subtle differences based on resolution beyond 16 bits should be audible to a significant degree to consumers? (Remember, there are good reasons to use 24 bits in the studio during production.)
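As a rough illustration of the bit-depth point, here is a minimal sketch of the textbook signal-to-noise figure for an ideal quantizer driven by a full-scale sine wave. These are idealized numbers, not measurements from this test, and the function name `ideal_sine_snr_db` is just a label chosen for this example.

```python
import math

def ideal_sine_snr_db(bits):
    """Theoretical SNR (dB) of an ideal N-bit quantizer for a full-scale sine, no dither."""
    # 20*log10(2^N) is about 6.02 dB per bit; 10*log10(1.5) adds about 1.76 dB.
    return 20 * math.log10(2 ** bits) + 10 * math.log10(1.5)

for bits in (16, 24):
    print(f"{bits}-bit: about {ideal_sine_snr_db(bits):.1f} dB")
```

On paper this works out to roughly 98 dB for 16 bits and roughly 146 dB for 24 bits. Real devices fall short of these ideals, and adding dither raises the 16-bit noise floor slightly.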

Despite the fact that a number of respondents could not hear a difference, for those who felt they could, it was certainly interesting to show that the data wasn't all just random. Even if the objective superiority of the Oppo UDP-205 and its ESS ES9038Pro "flagship" DAC could not be demonstrated to a significant degree in the listening test, the obvious objective and in turn subjective limitations of the computer motherboard were heard by a significant number.

This is overall good news I think.

For those who don't want to spend too much money on audio gear, this is good news because it means that reasonably low noise, low distortion digital audio devices sound great already (the Oppo, Sony SACD player, and iPhone 6 all did well). Therefore, one should prioritize upgrading other parts of the system like the room acoustics, speakers, and amplifiers (probably in that order) which will yield far greater improvements in the sound.

For those who want to spend more money on the digital source - by all means! There's nothing wrong with seeking out essentially ideal objective performance when we compare the Oppo UDP-205 with something like the Sony SACD player. There are even hints in our results that blinded listeners using more expensive (and presumably better) systems were able to show a preference towards the Oppo with its superior objective performance. The important thing is to be mindful of diminishing returns, linked with value. In the big picture, a device like the Oppo UDP-205, which I bought at around US$1300 (before it was discontinued), is still "cheap" considering its performance compared to so much of the "high end" these days!

I think this blind test reminds us that ultimately it is important to be realistic and not boast about remarkable audible differences. I've generally felt that flowery descriptions and dramatic backstories in equipment reviews of DACs and various players simply come across as hard to believe and lacking in credibility. I suspect anyone who has tried blinded listening tests like this, or taken the time to do volume-controlled, blinded "shoot-outs" of various hardware, has also found that even when differences are there, they are typically subtle between competent devices these days. (Yeah, I know some companies, certain press people, and their ad departments don't like to hear this...)

Remember that science is based on empirical observations to confirm or reject results. Consider these test results as "data points". Nothing here is dogma. In time, perhaps the conclusions may change with further systematic testing.

Feel free to do your own blind tests and document what you found. I believe there is no better way to keep oneself honest! Let me know how it goes.

---------------------------

As usual with my blind test results, I'll post a Part 4 to publish some of the subjective comments the respondents submitted. I've looked at many already and am finding them quite fascinating! They reveal the thoughts and feelings of those who participated. Sometimes the impressions are "right on the money", other times, it's remarkable how subjective opinions can be very different from expectations once the blind is removed! Such is the nature of subjectivity...

A word of thanks again for all the participants/respondents. You guys (and gal) are to be commended. I hope the blind test exercise provided an interesting experience, with the objective results last week helping to "calibrate" what you heard. Regardless of whether you heard a difference, I know it wasn't a simple exercise - if it were obvious, what gain or challenge would there be? I certainly respect those who take up challenges that promote reality-based perspectives, especially in our little corner of the universe called the audiophile hobby where sometimes reality, Industry-sponsored hype, and pure fantasy can be difficult to tease out. Even worse, at times reality-testing appears to be discouraged by some.

Have a great week ahead. Enjoy the Victoria Day long weekend fellow Canucks. Time for me to spend time with the family, relax, and enjoy the music after this long write-up :-). Cheers!

** Part IV: Listener Subjective Responses posted **

25 comments:

  1. Awesome, awesome, awesome test and write-up. I do wonder whether, if an ALC1220 or ASUS S1220A variant had been used instead of the ALC898, there would still be an "obvious worst" among the group :) In a way, using the ALC8 series in this test, which isn't exactly state of the art, did help the testers, I think.

    Replies
    1. In any case, Archimago, congratulations on seeing this through and I do think the results are highly relevant!

    2. Thanks for the comments verifonix.

      Yeah, maybe the ALC898 "helped". But I think that was a good thing because the results gave us a sense of what "minimum quality" would look/sound like and with this "lowish" motherboard audio bar, how many respondents could/could not hear a difference. Note that it's also possible that the ALC898 was fine but by nature of this being in an older i7 computer with a powerful nVidia GTX 1080, it was the computer components that primarily created the distortions.

      It took me a bit of time to plan, produce the test samples, construct the survey... And certainly if I expected folks to listen to up to 16 samples and fill out a pretty detailed questionnaire, and ~100 of them responded, I would certainly see this through!

      We see plenty of subjective opinions. Some objective results. But way too few tests that try to bring the two together in audiophile writings - which is what ultimately matters in the real world! I certainly was not going to let the opportunity pass without making sure to get the job done on this one. :-)

    3. Including a device that really is audibly colored is a *good thing* in such tests: a positive control.

      A negative control (like, including the same device twice, anonymously) is good, too ...as well as potentially amusing.

    4. Yup, good point Steven.

  2. In other words, the relatively high out-of-band noise and ugly 30kHz idle tone of the Sony SACD player and the visually abysmal stopband attenuation of the iPhone have little (if any) impact in your experiment. These artifacts can be identified in the IMD sweep test of RMAA as well.

    A good lesson for all of us to determine what things are psychoacoustically important in audio measurements.

    Replies
    1. Hi Bennet,
      Yes, I would put ultrasonic distortions like that 30kHz tone low on the list of audibility... Sure, it could worsen IMD measurements and perhaps we can hear it with test tones designed to show the anomaly, but during complex music playback, I just would not be concerned even though I certainly would not want these imperfections (especially inherent in the hardware). For years, we've seen ugly tones above 20kHz from tape bias for example in our music. The irony is that a lot of this stuff shows up in our "hi-res" music and vinyl LPs which would have been filtered out in 44.1/48kHz.

      Likewise the stopband attenuation in the high frequencies... The lack of audible sensitivity is the saving grace for NOS DACs with their -3dB high-frequency roll-off and imaging distortions; it allows companies like Audio Note to still tout NOS designs, and has blessed the old Philips TDA154X DAC chips with an unnaturally long commercial life :-).

  3. How "correct" were the answers given by those who said they heard a HUUUGE or big difference? I mean, were they as good as they said? :)

    Replies
    1. Hi Tell,
      I looked into this with the 2 who thought the difference was "HUUUUGE"... Despite the strong confidence shown, their preferences were:

      Person #1: ASRock motherboard > Sony SACD/CD player > iPhone > Oppo

      Person #2: Oppo > Sony SACD/CD player > iPhone > ASRock motherboard

      So it looks like the second individual did very well at least. "It's a wash!"

    2. What do the data look like when the "big difference" results are pooled together with the two? There’s about 20 data points that’ll be interesting to see. Thx

    3. Hi Gnu,
      Yeah, if I put together the "HUUUUGE" and "Big difference" groups there are 20 total and the average scores from "best" to "worst" look like:

      Sony SACD/CD - 2.05
      Oppo UDP-205 - 2.15
      iPhone 6 - 2.65
      ASRock Mobo - 3.15

      Looking at the subgroups, I can make a case for statistical significance again with the Device A motherboard clearly being voted as "worst" (votes distributed best to worst: 2, 4, 3, *11*). Not statistically significant for the other devices although the Sony SACD/CD player came close.

      So it looks like the folks who felt "surest" about hearing a difference were able to pick out the motherboard.
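The reply above does not say which statistical test was used. As one possible way to check a vote distribution like 2, 4, 3, 11, here is a minimal sketch of a chi-square goodness-of-fit test against the "no preference" assumption that each rank gets an equal share of votes. The function names are made up for this example, and the closed-form p-value only applies to four categories (3 degrees of freedom).

```python
import math

def chi_square_uniform(votes):
    """Chi-square statistic for observed votes vs. an equal split across categories."""
    expected = sum(votes) / len(votes)
    return sum((v - expected) ** 2 / expected for v in votes)

def p_value_df3(chi2):
    """Upper-tail probability of a chi-square distribution with 3 degrees of freedom."""
    # Closed form valid for 3 degrees of freedom only (i.e. four categories).
    half = chi2 / 2
    return math.erfc(math.sqrt(half)) + math.sqrt(2 * chi2 / math.pi) * math.exp(-half)

# Motherboard votes from best to worst, as quoted in the reply above.
motherboard_votes = [2, 4, 3, 11]
stat = chi_square_uniform(motherboard_votes)
print(f"chi-square = {stat:.2f}, p = {p_value_df3(stat):.4f}")
```

Under these assumptions the statistic is 10 and the p-value comes out a little under 0.02, which is below the conventional 0.05 threshold. A different test, or a correction for looking at several devices at once, could change the picture.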

    4. Woops... Forgot to log in to post that comment above.

  4. Hi Archimago! Detailed and thorough analysis as usual, and findings mostly as I expected... I'll reserve more comments from my own results for Part 4, but I would like to thank you now for the Stereo Review catalog you mentioned in passing. I was an avid reader of that magazine when I was young, and looking at some of those covers from 1968 onward it seems like I read them yesterday! Curious how an image from 50 years back is still fresh in the brain, as useless as it would seem to be to recall it perfectly! I never cease to be amazed at the diversity of what really forms the fabric of our personality...

    Replies
    1. Greetings Gilles!
      Hope you're keeping well...

      Yeah, it's amazing how the brain's neural networks keep stuff over a lifetime ;-).

      I was impressed by that catalogue of scanned magazines as well. I noticed that a few were missing some pages but overall, a great effort. It is amazing how back in the day the writers and reviewers IMO just felt more genuine in their description of products, less prone to hyperbole, and how many articles were seeking to educate the public in a passionate manner instead of coming across as selling products...

      Perhaps at some point this will be again. Though not holding my breath! :-(

  5. Hello,

    Long time lurker, first time poster.

    Thank you for conducting this extensive test and posting the results. May I tell you a little anecdote before I comment on your results and my interpretation of them?

    Some 20 years ago I did a DBT with a close friend of mine. We were both music lovers, musically trained, low-income university students spending money on CDs, concerts, beer etc. But we were also very much into IT.

    We decided to try a DBT for the differences between a 320kbps LAME-encoded MP3 and an original CDDA track. First we'd rip the track using an ordinary desktop computer, encode it with LAME, then decode the track to .WAV, then burn it to a fresh CDR, along with the original track. We did this for about 10 test tracks. So in all we had 20 tracks on a CDR, and randomly allocated whether the even-numbered track was the original ripped version or the LAME-processed one.

    We used an excellent (for the era) Panasonic portable CD player, model number SL-CT470.
    (This was 2 years prior to Apple releasing the iPod and obliterating the portable CD/DAP market.)
    I never actually saw measurements for this SL-CT470, but it was considered a neutral player because of its relatively good headphone output. It was one of the recommended models from www.headphone.com (when Tyll was still running it).

    With this methodology, we were unable to reliably tell the difference between the 320kbps encoded MP3 and the original CDDA. For most tracks it was a 50/50 split. But for those tracks where we thought we COULD hear a difference, and confidently so, we tended to PREFER the 320kbps MP3!!!

    Obviously there are limitations to this DBT. eg. small sample size of 2, but what did I take away from it?

    We interpreted the whole process as LISTENING FOR DIFFERENCES. So there's either no discernible difference for a 320kbps MP3 vs. the original (audibly transparent), or a small difference (?statistically significant).

    But from the science of perceptual coding, we do know that lossy compression can introduce artefacts.
    The artefacts may not be audible "distortion", per se.
    We actually thought them to be "more details", "more sparkle", or "more spacious", for lack of better terms.
    Whereas the original CDDA felt a bit bland.
    So the effects of lossy compression create 'special effects' that can be interpreted in a variety of ways, even though it was IMPOSSIBLE for them to be in the original recording. And these special effects may even be perceived as BETTER!

    In 2011 I'm no longer a university student. This time I do a single-blind trial with a few DACs I have lying around, including a Logitech Squeezebox 3, Centrance Dacmini, Apogee Duet. And to be honest, they all sounded the same when matched for output level!!!

    Fast forward to 2019. No longer a poor university student but very time poor.
    Also much older. 40+. $10-100K+ sound system. Main listening room 6 x 9 x 3.8 metres.
    Multiple listening rooms, multiple speakers, DACs, amps.
    You name it, I've listened to it, or read about it. Visited studios, international hi-fi shows, chatted to manufacturers eg. Bruno Putzeys. Yep, caught the audiophile bug big time.

    Your study design may have some limitations, as detractors will say, but worrying or fussing over DACs is nonsense. Because we don't connect DACs directly to our ears. We listen to them via transducers ie. headphones or speakers. And when even the best transducers have about 0.1% THD, this is a couple of orders of magnitude worse than the most basic 24/96kHz commodity DAC.

    So there is just no good reason to be spending big bucks on a 32-bit/768kHz player when playing ordinary 16-bit/44.1kHz material (99.9% of the music out there) in terms of AUDIBLE DIFFERENCES.
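To put the "orders of magnitude" remark in the comment above into numbers, here is a minimal sketch converting between THD in percent and in decibels. The 0.1% transducer figure comes from the comment; the -100 dB DAC figure is a hypothetical value picked for illustration, not a measurement of any specific device.

```python
import math

def thd_percent_to_db(percent):
    """Convert a distortion ratio given in percent to decibels."""
    return 20 * math.log10(percent / 100)

def db_to_thd_percent(db):
    """Convert a distortion level in decibels back to percent."""
    return 100 * 10 ** (db / 20)

transducer_thd_percent = 0.1   # figure quoted in the comment above
dac_thd_db = -100.0            # hypothetical DAC figure, for illustration only

print(f"Transducer: {thd_percent_to_db(transducer_thd_percent):.0f} dB")
print(f"DAC: {db_to_thd_percent(dac_thd_db):.4f} %")
```

With these assumed values, 0.1% corresponds to -60 dB and -100 dB corresponds to 0.001%, a factor of 100 or two orders of magnitude.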

    Replies
    1. Hi KillerT,
      Thanks for the comment!

      Wow, nice story man! Good discussion about the 320kbps MP3 vs. lossless test you did back in the day. In fact, you might not be aware that the blind test that was done here that kickstarted my blog in 2013 between high-bitrate MP3 vs. lossless also had similar conclusions to yours!

      http://archimago.blogspot.com/2013/02/high-bitrate-mp3-internet-blind-test_2.html

      Among 151 respondents, there was an actual preference for the high bitrate MP3 as well!

      I agree, in the big picture, the DAC is only a small part of the equation to make the sound of a system what it is. Nothing wrong with getting a high quality, "high end" DAC/digital streamer/CD/SACD player of course. But we just have to be mindful that unlike the days of analogue when turntables and cartridges could make quite a bit of difference (which are of course measurable), digital playback systems are much more uniform!

      All the best enjoying the audiophile bug :-).

  6. I agree! The weak point is and always will be where an electric signal is transformed into sound waves, so speakers (and room) or headphones. Electronics have improved to the point where they exceed our hearing capacity while transducers are still imperfect; that's why people spend a gazillion dollars on huge arrays of cones and domes and other air-moving apparatus. In my opinion, high sampling rate DACs are useful if you want to adapt the music to your taste by playing with filters. Standard DAC chips are now good enough for just listening to music without fear of missing anything.

  7. Great analysis Arch. Again, this was a great exercise, which I feel was incredibly eye-opening.

    To me the biggest hot-take is "84% of listeners could not hear a cost-justifiable difference (or reach a statistically significant consensus) between best and second best."

    My biggest surprise is the generational-divide in preference. Seeing it, I suppose it makes sense, but I'd have never expected it going in. Credit due for picking tracks and devices that sussed that out.

    As for one thing I learned about myself... well, it seems I easily spent the most time with the least helpful track for this type of thing: Cecile McLoren. Max Priest was my close 2nd. (This is another thing where upon reflection it makes sense. I generally dislike loud music. Indeed, the aircraft take-off sound at the start of the Max Priest track drove me nuts. Many times I fast-forwarded past it. Also part of the reason I didn't listen to the choir piece as much as the other two was it got relatively too loud in the middle for me. Um, in case anyone is wondering... no, I'm only in my 40's. :))

    My other personal take-away: I was able to challenge my personal hypothesis that above a very modest cost level, DACs are a technically solved problem with one's money being better spent elsewhere, and it withstood the challenge. That's pretty nice to know.

    Thank you for organizing and writing up the results in such glorious detail.

    Replies
    1. Hey Allan,
      A pleasure and thank you for the result submission.

      I think it's a good thing to challenge ourselves once in a while with tests like this and see if our "worldview" is backed up when we take a "bird's eye view" and examine the choices and perspectives of something like 100 others. Although we don't talk about it as much, you're right that the music we choose to evaluate with can make a big difference! Not all genres are equal when it comes to challenging hardware resolution.

      With the cost of housing high, and "high end" devices like speakers reaching stratospheric levels, many, especially the younger audiophiles/music lovers, have no option but to focus their energies on headphones. The "mobile" lifestyle also clearly will have an impact on what devices we use (ie. smartphones) and audio mastering with loudness and compression has followed this trend as well whether many of us like it or not. As an aside, I wrote about some of these themes a while back in 2016:

      http://archimago.blogspot.com/2016/08/musings-convenience-lossy-audio.html

      I think it is good that high quality headphones are plentiful and hope that in that market, competition can drive prices down instead of crazy "high end" pricing of people chasing "luxury". I still hope the day comes when a "loudness"/"Mobile Boost" button can be standard so the mobile folks can always have their compressed/loud sound when listening on subways, while allowing recordings to remain of high dynamic quality.

      All the best down south man...

  8. Excellent article with results as I expected. I do have a question. I have some hi-res files that I bought due to better mastering ("not brickwalled"). I usually use foobar2000 to resample them to 16/44. Have you done any tests to see whether the dithering done by foobar changes the audible quality?

    Replies
    1. Hi,
      No I haven't looked at foobar's downsample to 16/44. As far as I can tell, a simple triangular dither like what was done in the 16-bit vs. 24-bit Blind Test did not result in a significant issue and I suspect foobar will be fine.
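For readers unfamiliar with the term, here is a minimal sketch of what "simple triangular dither" generally means when reducing audio to 16 bits. This is a generic textbook approach and is not taken from foobar2000 or from the files used in the earlier blind test; the function name `to_16bit_tpdf` and the scaling constant are choices made for this example.

```python
import random

FULL_SCALE = 32767  # largest positive 16-bit sample value

def to_16bit_tpdf(samples, seed=None):
    """Quantize floats in [-1.0, 1.0] to 16-bit integers with triangular (TPDF) dither."""
    rng = random.Random(seed)
    out = []
    for x in samples:
        # The difference of two uniform values has a triangular distribution
        # spanning roughly +/-1 LSB.
        dither = rng.random() - rng.random()
        q = round(x * FULL_SCALE + dither)
        # Clamp to the legal 16-bit range in case dither pushes a peak over.
        out.append(max(-32768, min(32767, q)))
    return out

# A constant level of a quarter LSB is lost by plain rounding, but survives
# on average once dither is added.
level = 0.25 / FULL_SCALE
dithered = to_16bit_tpdf([level] * 20000, seed=1)
print(sum(dithered) / len(dithered))
```

The point of the dither is that detail smaller than one 16-bit step is preserved as an average level buried in a small amount of noise, instead of being rounded away or turned into distortion.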

  9. This comment has been removed by the author.

    Replies
    1. Hi Blumlein,
      Thanks, I did send a note to congratulate the author for an excellent program! Certainly a much better UI, more stable results, and fewer crashes compared to Audio Diffmaker.

      I've run some of these through for a quick look. Yup, there are differences; and every little change like the filter setting will show up.

      Will need some time to see how best to organize and show the findings though.

  10. Hi Archimago,
    I sent you a PM over on ASR. Might be worth reading.
    Thanks.

  11. Interesting findings. Especially with the group of people that have nice / expensive / resolving systems. Although these findings contradict what Ethan Winer proved via his famous world renowned Null Tester, that all DACs sound the same and a simple computer DAC chip works the same as the super high end DAC that costs many thousands of dollars.
